Dealing with missing values – part 2
Applied Multivariate Statistics – Spring 2012
Overview
More on Single Imputation: Shortcomings
Multiple Imputation: Accounting for uncertainty
Single Imputation
Four approaches, ranging from easy but inaccurate to hard but accurate:
- Unconditional Mean
- Unconditional Distribution
- Conditional Mean
- Conditional Distribution
Example: Blood Pressure - Revisited
30 participants in January (X) and February (Y)
MCAR: Delete 23 Y values randomly
MAR: Keep Y only where X > 140 (only participants with a high January value get a follow-up)
MNAR: Record Y only where Y > 140 (everybody is tested again, but only the values of critical participants are kept)
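A small R sketch of how the three missingness patterns could be generated; the distribution of X and the model for Y are assumptions for illustration, not the original data:

set.seed(1)
n <- 30
X <- rnorm(n, mean = 130, sd = 15)      # January blood pressure (assumed distribution)
Y <- 84 + 0.3 * X + rnorm(n, sd = 23)   # February values (model from the slides below)

y.mcar <- Y
y.mcar[sample(n, 23)] <- NA             # MCAR: 23 values deleted at random

y.mar  <- ifelse(X > 140, Y, NA)        # MAR: missingness depends on observed X only

y.mnar <- ifelse(Y > 140, Y, NA)        # MNAR: missingness depends on Y itself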
Example: Blood Pressure
[Scatterplot of X vs. Y; the black points are the missing Y values (MAR)]
Unconditional Mean
+ Mean of Y ok
- Variance of Y wrong
Unconditional Distribution
+ Mean of Y ok, variance better
- Correlation between X and Y wrong
Conditional Mean
+ Conditional Mean of Y ok
+ Correlation ok
- (Conditional) Variance wrong
Y = 84 + 0.3*X
Conditional Distribution
+ Conditional Mean of Y ok
+ Correlation ok
+ Conditional Variance of Y ok
Y = 84 + 0.3*X + e,   e ~ N(0, 23^2)
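A minimal sketch of this scheme in R, reusing X and y.mar from the earlier sketch: fit the regression on the complete cases, then impute each missing Y by its conditional mean plus a random residual.

fit <- lm(y.mar ~ X)                     # complete cases only (lm drops NAs)
miss <- is.na(y.mar)
sigma.hat <- summary(fit)$sigma          # estimated residual sd (about 23 here)

y.imp <- y.mar
y.imp[miss] <- predict(fit, newdata = data.frame(X = X[miss])) +
               rnorm(sum(miss), mean = 0, sd = sigma.hat)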
Conditional Distribution
Fitted imputation model: Y = 84 + 0.3*X + e,   e ~ N(0, 23^2)
95%-CI for the intercept: [-234; 402];   95%-CI for the slope: [-1.7; 2.4]
Problem: We ignore this estimation uncertainty
Problem of Single Imputation
Too optimistic: the imputation model (e.g. Y = a + bX) is only estimated; it is not the true model
Thus, the imputed values themselves carry uncertainty
Single Imputation ignores this uncertainty
Coverage probability of confidence intervals is wrong (intervals are too narrow)
Solution: Multiple Imputation incorporates both
- residual error
- model uncertainty (excluding model mis-specification)
Multiple Imputation: Idea
1. Impute several times
2. Do the standard analysis on each imputed data set; get an estimate and its standard error
3. Aggregate the results
(in R: the mice / with / pool workflow shown later)
Multiple Imputation: Idea
Need special imputation schemes that include both
- uncertainty of the residuals
- uncertainty of the model (e.g. the values of intercept a and slope b)
Rough idea:
- Fill in random starting values
- Iteratively predict values for each variable until some convergence is reached (as in missForest)
- Sample values for the residuals AND for (a, b)
A Gibbs sampler is used (sketched below)
The following sequence of pictures (by one of the big names in the field) is excellent for intuition:
Multiple Imputation: Intuition
Predict missing values accounting for
- Uncertainty of residuals
- Uncertainty of parameter estimates
[Sequence of plots: each imputation draws a new plausible regression line and new residual noise]
Multiple Imputation: Gibbs sampler (Not for exam)
Iteration t; repeat until convergence, for each variable i:
theta_i*(t) ~ P(theta_i | Y_i_obs, Y_-i(t))          <- sample the model parameters, e.g. (a, b)
Y_i*(t) ~ P(Y_i | Y_i_obs, Y_-i(t), theta_i*(t))     <- predict the missings using y = a + bx + e
where Y_i(t) = (Y_i_obs, Y_i*(t)) is variable i with its missings filled by the current draws
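A sketch of one such parameter-plus-residual draw for a single numeric variable, in the spirit of the Bayesian linear regression method that mice calls "norm" (names and data are from the blood pressure sketches above; this is an illustration, not the package code):

impute.norm.draw <- function(y, x, miss) {
  Xo <- cbind(1, x[!miss])                 # design matrix of the observed part
  yo <- y[!miss]
  XtX.inv  <- solve(t(Xo) %*% Xo)
  beta.hat <- XtX.inv %*% t(Xo) %*% yo     # OLS estimate of (a, b)
  res <- yo - Xo %*% beta.hat
  df  <- length(yo) - ncol(Xo)

  # 1. draw the residual variance from its posterior (scaled inverse chi-square)
  sigma2.star <- sum(res^2) / rchisq(1, df)

  # 2. draw (a, b) from their posterior given sigma2.star
  beta.star <- beta.hat + sqrt(sigma2.star) * t(chol(XtX.inv)) %*% rnorm(ncol(Xo))

  # 3. predict the missings: y = a + b*x + e with a fresh residual draw
  Xm <- cbind(1, x[miss])
  as.vector(Xm %*% beta.star) + rnorm(sum(miss), sd = sqrt(sigma2.star))
}

miss <- is.na(y.mar)
y.draw <- y.mar
y.draw[miss] <- impute.norm.draw(y.mar, X, miss)   # one of m imputations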
R package: mice
Multivariate Imputation by Chained Equations
mice has good default settings; you usually don't have to worry about the data types.
Defaults by column data type:
- numeric: Predictive Mean Matching (pmm) (like a fancy linear regression; faster alternative: linear regression)
- factor, 2 levels: Logistic regression (logreg)
- factor, >2 levels: Multinomial logit model (polyreg)
- ordered, >2 levels: Ordered logit model (polr)
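A minimal sketch of the full workflow; nhanes2 is a small example data set shipped with mice that mixes numeric and factor columns:

library(mice)

imp <- mice(nhanes2, m = 5, seed = 1, printFlag = FALSE)   # step 1: impute m times
imp$method   # methods chosen per column (pmm, logreg, ...; "" = column has no missings)

fit <- with(imp, lm(chl ~ bmi))   # step 2: analyze each imputed data set
summary(pool(fit))                # step 3: aggregate with the rules below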
Aggregation of estimates
Q̂_j: estimate from imputation j
Û_j: variance of that estimate (= square of its std. error)
Assume (approximately): (Q̂_j − Q) / √Û_j ~ N(0, 1)
Average estimate:             Q̄ = (1/m) · Σ_{j=1..m} Q̂_j
Within-imputation variance:   Ū = (1/m) · Σ_{j=1..m} Û_j
Between-imputation variance:  B = 1/(m−1) · Σ_{j=1..m} (Q̂_j − Q̄)²
Total variance:               T = Ū + (1 + 1/m) · B
Approximately: (Q̄ − Q) / √T ~ t_ν  with  ν = (m−1) · (1 + m·Ū / ((1+m)·B))²
95%-CI: Q̄ ± t_{ν, 0.975} · √T
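A hypothetical helper that implements these formulas directly (Q is the vector of the m estimates, U the vector of their squared standard errors):

pool.rubin <- function(Q, U) {
  m    <- length(Q)
  Qbar <- mean(Q)                    # average estimate
  Ubar <- mean(U)                    # within-imputation variance
  B    <- var(Q)                     # between-imputation variance
  Tvar <- Ubar + (1 + 1/m) * B       # total variance
  nu   <- (m - 1) * (1 + m * Ubar / ((1 + m) * B))^2   # degrees of freedom
  ci   <- Qbar + c(-1, 1) * qt(0.975, nu) * sqrt(Tvar) # 95%-CI
  list(estimate = Qbar, se = sqrt(Tvar), df = nu, ci95 = ci)
}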
Multiple Imputation with MICE
pool() applies these formulas for standard analyses; do the aggregation manually if you have a non-standard analysis
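A sketch of the manual route, using complete() to extract each imputed data set and the pool.rubin helper above (an ordinary lm stands in here for the non-standard analysis):

est <- se <- numeric(imp$m)
for (i in seq_len(imp$m)) {
  d <- complete(imp, i)                     # the i-th completed data set
  f <- lm(chl ~ bmi, data = d)              # your (possibly non-standard) analysis
  est[i] <- coef(f)["bmi"]
  se[i]  <- coef(summary(f))["bmi", "Std. Error"]
}
pool.rubin(est, se^2)                       # aggregate with Rubin's rules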
How much uncertainty is due to missings?
Relative increase in variance due to nonresponse:
r = (1 + 1/m) · B / Ū
Fraction (or rate) of missing information fmi
(!! NOT the same as the fraction of missing OBSERVATIONS):
fmi = (r + 2/(ν + 3)) / (r + 1)
Proportion of the total variance that is attributed to the missing data:
λ = (1 + 1/m) · B / T
These quantities are returned by mice.
How many imputations?
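In code, these diagnostics are three more lines that could be added inside the pool.rubin sketch above, before the final list():

r      <- (1 + 1/m) * B / Ubar          # relative increase in variance
lambda <- (1 + 1/m) * B / Tvar          # proportion of total variance due to missings
fmi    <- (r + 2 / (nu + 3)) / (r + 1)  # fraction of missing information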
Surprisingly few!
Efficiency compared to m = ∞ depends on fmi:
eff = (1 + fmi/m)^(-1)
Examples (eff in %):
m    fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
3       97       91       86       81       77
5       98       94       91       88       85
10      99       97       95       93       92
Even m = 5 is oftentimes OK; m = 10 is nearly perfect.
Rule of thumb:
- Preliminary analysis: m = 5
- Paper: m = 20 or even m = 50
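The table follows directly from the efficiency formula:

eff <- function(m, fmi) 100 / (1 + fmi / m)     # efficiency in percent
round(outer(c(3, 5, 10), c(0.1, 0.3, 0.5, 0.7, 0.9), eff))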
Concepts to know
Idea of mice
How to aggregate results from imputed data sets?
How many imputations?
R functions to know
mice, with, pool
Next time
Multidimensional Scaling
Distance metrics