Dealing with missing values – part 2

Academic year: 2022

Applied Multivariate Statistics – Spring 2012

(2)

Overview

• More on Single Imputation: Shortcomings

• Multiple Imputation: Accounting for uncertainty

(3)

Single Imputation

• Unconditional Mean

• Unconditional Distribution

• Conditional Mean

• Conditional Distribution

The methods are ordered from easy but inaccurate (top) to hard but accurate (bottom).

(4)

Example: Blood Pressure - Revisited

• 30 participants in January (X) and February (Y)

• MCAR: Delete 23 Y values randomly

• MAR: Keep Y only where X > 140 (follow-up)

• MNAR: Record Y only where Y > 140 (test everybody again but only keep values of critical participants)
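The three mechanisms can be sketched in a few lines of Python. This is an illustration only: the function names, the seeds and the distribution of X are assumptions, and the line Y = 84 + 0.3*X with residual standard deviation 23 is borrowed from the fitted model shown later in the slides, not from the original data.

```python
import random

def simulate_blood_pressure(n=30, seed=1):
    """Simulate paired readings. All distribution parameters are assumptions."""
    rng = random.Random(seed)
    x = [rng.gauss(125, 25) for _ in range(n)]            # January (X)
    y = [84 + 0.3 * xi + rng.gauss(0, 23) for xi in x]     # February (Y)
    return x, y

def make_mcar(y, n_delete, seed=2):
    """MCAR: delete n_delete values chosen completely at random."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(y)), n_delete))
    return [None if i in drop else yi for i, yi in enumerate(y)]

def make_mar(x, y, cutoff=140):
    """MAR: keep Y only where the fully observed X exceeds the cutoff."""
    return [yi if xi > cutoff else None for xi, yi in zip(x, y)]

def make_mnar(y, cutoff=140):
    """MNAR: keep Y only where Y itself exceeds the cutoff."""
    return [yi if yi > cutoff else None for yi in y]

x, y = simulate_blood_pressure()
y_mcar = make_mcar(y, n_delete=23)
y_mar = make_mar(x, y)
y_mnar = make_mnar(y)
```

Missing values are encoded as `None`. Under MAR the missingness depends only on the observed X, whereas under MNAR it depends on the value that is itself missing.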

(5)

Example: Blood Pressure

Black points are missing (MAR)

(6)

Unconditional Mean

+ Mean of Y ok

- Variance of Y wrong

(7)

Unconditional Distribution

+ Mean of Y ok, Variance better

- Correlation between X and Y wrong
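A small simulation can make the plus and minus points of the two unconditional methods visible. The sketch below is an assumption-laden illustration: it uses a much larger sample than the 30 participants, MCAR missingness, made-up distribution parameters, and it interprets "unconditional distribution" as drawing at random from the observed Y values.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

def impute_unconditional_mean(y):
    """Replace every missing value (None) by the mean of the observed values."""
    mean_obs = statistics.mean(v for v in y if v is not None)
    return [mean_obs if v is None else v for v in y]

def impute_unconditional_distribution(y, rng):
    """Replace every missing value by a random draw from the observed values."""
    observed = [v for v in y if v is not None]
    return [rng.choice(observed) if v is None else v for v in y]

rng = random.Random(0)
n = 2000                                                   # assumed sample size
x = [rng.gauss(125, 25) for _ in range(n)]
y_full = [84 + 0.3 * xi + rng.gauss(0, 10) for xi in x]    # assumed model
y = [None if rng.random() < 0.5 else v for v in y_full]    # MCAR, about half missing

x_obs = [xi for xi, yi in zip(x, y) if yi is not None]
y_obs = [yi for yi in y if yi is not None]
y_mean = impute_unconditional_mean(y)
y_dist = impute_unconditional_distribution(y, rng)

var_obs = statistics.variance(y_obs)
var_mean = statistics.variance(y_mean)    # shrinks: imputed points add no spread
var_dist = statistics.variance(y_dist)    # close to the observed variance
cor_obs = pearson(x_obs, y_obs)
cor_dist = pearson(x, y_dist)             # attenuated: imputed Y ignores X
```

Mean imputation leaves the mean untouched but shrinks the variance. Drawing from the observed values restores the variance, but the imputed Y values are unrelated to X, so the correlation is pulled towards zero.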

(8)

Conditional Mean

+ Conditional Mean of Y ok

+ Correlation ok

- (Conditional) Variance wrong

Y = 84 + 0.3*X

(9)

Conditional Distribution

+ Conditional Mean of Y ok

+ Correlation ok

+ Conditional Variance of Y ok

Y = 84 + 0.3*X + e, e ~ N(0, 23^2)
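The difference between conditional mean and conditional distribution imputation can be sketched as follows. The data are simulated under assumed parameters with MAR missingness (Y observed only where X > 140); `fit_line` and `impute_conditional` are hypothetical helper names, not functions from any package.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept a, slope b and residual standard deviation s."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s = (rss / (len(x) - 2)) ** 0.5
    return a, b, s

def impute_conditional(x, y, a, b, s=0.0, rng=None):
    """Fill missing Y with a + b*X; if rng is given, add N(0, s^2) noise."""
    out = []
    for xi, yi in zip(x, y):
        if yi is None:
            noise = rng.gauss(0, s) if rng is not None else 0.0
            yi = a + b * xi + noise
        out.append(yi)
    return out

rng = random.Random(3)
n = 200                                                    # assumed sample size
x = [rng.gauss(125, 25) for _ in range(n)]
y_full = [84 + 0.3 * xi + rng.gauss(0, 23) for xi in x]    # assumed model
y = [yi if xi > 140 else None for xi, yi in zip(x, y_full)]  # MAR

# Fit the imputation model on the complete cases only.
a, b, s = fit_line([xi for xi, yi in zip(x, y) if yi is not None],
                   [yi for yi in y if yi is not None])

y_cond_mean = impute_conditional(x, y, a, b)               # conditional mean
y_cond_dist = impute_conditional(x, y, a, b, s, rng)       # conditional distribution
```

With the conditional mean, all imputed points lie exactly on the fitted line, so the conditional variance is zero for them. Adding a residual draw restores the scatter around the line.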

(10)

Conditional Distribution

Y = 84 + 0.3*X + e, e ~ N(0, 23^2)

95%-CI for the intercept (84): [-234; 402]

95%-CI for the slope (0.3): [-1.7; 2.4]

Problem: We ignore this uncertainty

(11)

Problem of Single Imputation

• Too optimistic: the imputation model (e.g. Y = a + bX) is only estimated; it is not the true model

• Thus, imputed values have some uncertainty

• Single Imputation ignores this uncertainty

• Coverage probability of confidence intervals is wrong

• Solution: Multiple Imputation incorporates both

- residual error

- model uncertainty (excluding model mis-specification)

(12)

Multiple Imputation: Idea

• Impute several times

• Do the standard analysis for each imputed data set; get estimate and std. error

• Aggregate results

(13)

Multiple Imputation: Idea

• Need special imputation schemes that include both

- uncertainty of residuals

- uncertainty of model (e.g. values of intercept a and slope b)

• Rough idea:

- Fill in random values

- Iteratively predict values for each variable until some convergence is reached (as in missForest)

- Sample values for residuals AND for (a,b)

• Gibbs sampler is used

• Excellent for intuition (by one of the big guys in the field):

(14)

Multiple Imputation: Intuition

Predict missing values accounting for

- Uncertainty of residuals

- Uncertainty of parameter estimates
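For a single incomplete variable Y and a complete predictor X, the two sources of uncertainty can be sketched as follows. This is a minimal illustration, assuming a normal linear model with a flat prior, and is not the code used by any package: each imputation first draws the residual standard deviation and (a, b) from their posterior, then adds a residual draw to each prediction. The helper names and the simulated data are assumptions.

```python
import math
import random
import statistics

def draw_parameters(x_obs, y_obs, rng):
    """Draw (a, b, sigma) from the posterior of y = a + b*x + e (flat prior)."""
    n = len(x_obs)
    mx, my = statistics.mean(x_obs), statistics.mean(y_obs)
    sxx = sum((xi - mx) ** 2 for xi in x_obs)
    b_hat = sum((xi - mx) * (yi - my) for xi, yi in zip(x_obs, y_obs)) / sxx
    rss = sum((yi - my - b_hat * (xi - mx)) ** 2 for xi, yi in zip(x_obs, y_obs))
    # sigma^2 | data ~ RSS / chi^2_{n-2}; a chi^2_k draw is Gamma(k/2, scale 2).
    sigma = math.sqrt(rss / rng.gammavariate((n - 2) / 2, 2.0))
    # With X centred, intercept and slope are independent given sigma.
    alpha = my + sigma * rng.gauss(0, 1) / math.sqrt(n)
    b = b_hat + sigma * rng.gauss(0, 1) / math.sqrt(sxx)
    return alpha - b * mx, b, sigma

def impute_once(x, y, rng):
    """One imputed copy of y: parameter draw plus residual draw."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, sigma = draw_parameters([p[0] for p in pairs], [p[1] for p in pairs], rng)
    return [a + b * xi + rng.gauss(0, sigma) if yi is None else yi
            for xi, yi in zip(x, y)]

def multiply_impute(x, y, m=5, seed=0):
    """Return m imputed copies of y."""
    rng = random.Random(seed)
    return [impute_once(x, y, rng) for _ in range(m)]

# Assumed toy data: 30 participants, 23 Y values deleted at random (MCAR).
data_rng = random.Random(1)
x = [data_rng.gauss(125, 25) for _ in range(30)]
y = [84 + 0.3 * xi + data_rng.gauss(0, 23) for xi in x]
for i in data_rng.sample(range(30), 23):
    y[i] = None

imputed = multiply_impute(x, y, m=5)
```

Each of the m copies uses a different regression line and different residuals, so the spread between the copies reflects how uncertain the imputation model is.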

(20)

Multiple Imputation: Gibbs sampler (Not for exam)

• Iteration t; repeat until convergence:

For each variable i:

$\theta_i^{*(t)} \sim P(\theta_i \mid Y_i^{obs}, Y_{-i}^{(t)})$

$Y_i^{*(t)} \sim P(Y_i \mid Y_i^{obs}, Y_{-i}^{(t)}, \theta_i^{*(t)})$

where $Y_i^{(t)} = (Y_i^{obs}, Y_i^{*(t)})$

Intuition:

- First line: sample (a,b)

- Second line: predict missings using y = a + bx + e

(21)

R package: MICE

Multiple Imputation with Chained Equations

• MICE has good default settings; don't worry about the data type

• Defaults for data types of columns:

- numeric: Predictive Mean Matching (pmm) (like fancy linear regression; faster alternative: linear regression)

- factor, 2 lev: Logistic Regression (logreg)

- factor, >2 lev: Multinomial logit model (polyreg)

- ordered, >2 lev: Ordered logit model (polr)

(22)

Aggregation of estimates

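• $\hat{Q}_i$: Estimate of imputation i

• $U_i$: Variance of estimate (= square of std. error)

• Assume: $\frac{\hat{Q} - Q}{\sqrt{U}} \approx N(0, 1)$

• Average estimate: $\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}_j$

• Within-imputation variance: $\bar{U} = \frac{1}{m} \sum_{j=1}^{m} U_j$

• Between-imputation variance: $B = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$

• Total variance: $T = \bar{U} + \left(1 + \frac{1}{m}\right) B$

• Approximately: $\frac{\bar{Q} - Q}{\sqrt{T}} \sim t_{\nu}$ with $\nu = (m-1) \left(1 + \frac{m \bar{U}}{(1+m) B}\right)^2$

• 95%-CI: $\bar{Q} \pm t_{\nu; 0.975} \sqrt{T}$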
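The aggregation rules translate directly into code. The following is a sketch in plain Python; `pool_rubin` is a hypothetical name (it is not the `pool` function of mice), and the input numbers are made up for illustration.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool m estimates and their variances (squared std. errors)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # average estimate
    u_bar = statistics.mean(variances)      # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (divides by m - 1)
    t = u_bar + (1 + 1 / m) * b             # total variance
    if b == 0:
        nu = math.inf                       # no between-imputation variation
    else:
        nu = (m - 1) * (1 + m * u_bar / ((1 + m) * b)) ** 2
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "nu": nu}

# Made-up results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 13.0, 9.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled = pool_rubin(estimates, variances)
```

The 95%-CI additionally needs the 0.975 quantile of the t distribution with `nu` degrees of freedom, which is not in the Python standard library and has to come from a statistics package.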

(23)

Multiple Imputation with MICE

Aggregate the results manually if your analysis is non-standard

(24)

How much uncertainty due to missings?

• Relative increase in variance due to nonresponse: $r = \frac{\left(1 + \frac{1}{m}\right) B}{\bar{U}}$

• Fraction (or rate) of missing information fmi: $fmi = \frac{r + \frac{2}{\nu + 3}}{r + 1}$

(!! Not the same as fraction of missing OBSERVATIONS)

• Proportion of the total variance that is attributed to the missing data: $\lambda = \frac{\left(1 + \frac{1}{m}\right) B}{T}$

The pooled output of mice reports these quantities.
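As a worked example, the three quantities can be computed from the within-imputation variance, the between-imputation variance and m. The function name and the input numbers below are assumptions chosen for illustration.

```python
def missing_information(u_bar, b, m):
    """Return (r, lambda, fmi) from within variance u_bar, between variance b and m."""
    t = u_bar + (1 + 1 / m) * b             # total variance
    r = (1 + 1 / m) * b / u_bar             # relative increase in variance
    lam = (1 + 1 / m) * b / t               # share of total variance due to missingness
    nu = (m - 1) * (1 + 1 / r) ** 2         # same as (m-1)(1 + m*u_bar/((1+m)*b))^2
    fmi = (r + 2 / (nu + 3)) / (r + 1)      # fraction of missing information
    return r, lam, fmi

# Made-up inputs: u_bar = 4, b = 2.5, m = 5.
r, lam, fmi = missing_information(u_bar=4.0, b=2.5, m=5)
```

For these inputs r = 0.75 and lambda = 3/7, and fmi is slightly larger than lambda because of the degrees-of-freedom correction 2/(nu + 3).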

(25)

How many imputations?

• Surprisingly few!

• Efficiency compared to m = ∞ depends on fmi: $eff = \left(1 + \frac{fmi}{m}\right)^{-1}$

• Examples (eff in %):

m    fmi=0.1  fmi=0.3  fmi=0.5  fmi=0.7  fmi=0.9
3    97       91       86       81       77
5    98       94       91       88       85
10   99       97       95       93       92

Oftentimes OK

Perfect!

• Rule of thumb:

- Preliminary analysis: m = 5

- Paper: m = 20 or even m = 50
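The efficiency table can be reproduced with a few lines of Python. The function and variable names are illustrative.

```python
def efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many imputations."""
    return 1 / (1 + fmi / m)

fmi_values = (0.1, 0.3, 0.5, 0.7, 0.9)
table = {m: [round(100 * efficiency(f, m)) for f in fmi_values]
         for m in (3, 5, 10)}

for m, row in table.items():
    print(m, row)
```

Increasing m beyond 10 gains little efficiency. With m = 20 and fmi = 0.5, for instance, `efficiency(0.5, 20)` is about 0.976.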

(26)

Concepts to know

• Idea of mice

• How to aggregate results from imputed data sets?

• How many imputations?

(27)

R functions to know

• mice, with, pool

(28)

Next time

• Multidimensional Scaling

• Distance metrics
