
3.5 Multivariate Missing Data

Real-world data sets with missing values often have more than one incompletely observed variable. So far, this chapter has focused on the justification and inferential aspects of the MI estimator without considering how to select and specify the imputation model. The following sections define the two main approaches available:

Joint Modeling (JM) and Fully Conditional Specification (FCS).

3.5.1 Joint Modeling

Joint Modeling supposes that the data can be described by a multivariate distribution and, assuming ignorability, imputations are created by drawing from the fitted distribution. Common imputation models are based on the multivariate normal distribution (Schafer, 1997). For simplicity, assume that

$Y \sim N(\mu, \Sigma)$, (3.32)

where $\mu = (\mu_1, \ldots, \mu_p)$ and $\Sigma$ is a $p \times p$ covariance matrix. Taking a flat prior distribution for $\mu$ and a $W_p(\nu, S_p)$ prior for $\Sigma^{-1}$, if $Y$ were fully observed, the posterior distribution of $(\mu, \Sigma)$ given $Y$ could be written as the product of

$\mu \mid Y, \Sigma \sim N(\bar{Y}, n^{-1}\Sigma)$ (3.33)

and

$\Sigma^{-1} \mid Y \sim W_p\!\left(n+\nu, (S_p^{-1} + S)^{-1}\right)$, (3.34)

where $\bar{Y}$ and $(n-1)^{-1}S$ are the sample mean and covariance matrix, respectively (Carpenter and Kenward, 2012, Appendix B).
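For concreteness, the following Python sketch performs a single posterior draw of $(\mu, \Sigma)$ according to equations (3.33) and (3.34) when $Y$ is fully observed. The function name `draw_mu_sigma` and the use of `scipy.stats.invwishart` are illustrative choices, not part of any particular package; the sketch takes $S$ to be the scatter matrix, so that $(n-1)^{-1}S$ is the sample covariance matrix as defined above.

```python
import numpy as np
from scipy.stats import invwishart

def draw_mu_sigma(Y, nu, S_p, rng=None):
    """One draw of (mu, Sigma) from the posterior given fully observed Y,
    following equations (3.33) and (3.34).

    Y   : (n, p) data matrix, assumed completely observed
    nu  : degrees of freedom of the Wishart prior on Sigma^{-1}
    S_p : (p, p) prior scale matrix
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    # Scatter matrix S, so that (n-1)^{-1} S is the sample covariance matrix
    resid = Y - ybar
    S = resid.T @ resid

    # Eq. (3.34): Sigma^{-1} | Y ~ W_p(n + nu, (S_p^{-1} + S)^{-1}),
    # equivalently Sigma | Y ~ Inverse-Wishart(n + nu, S_p^{-1} + S)
    Sigma = invwishart.rvs(df=n + nu, scale=np.linalg.inv(S_p) + S,
                           random_state=rng)

    # Eq. (3.33): mu | Y, Sigma ~ N(ybar, Sigma / n)
    mu = rng.multivariate_normal(ybar, Sigma / n)
    return mu, Sigma
```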

If $Y$ is incompletely observed, the estimation of equations (3.33) and (3.34) can be carried out with a Gibbs sampler, as described in Algorithm 1. The procedure draws each set of parameters in an alternating fashion, conditional on all the others and on the data.

In the first step, the missing data are commonly initialized with a bootstrap sample of the observed data. After the sampler has reached its stationary distribution, multiple imputations can be generated by taking draws $Y^\star_{\text{mis}}$ sufficiently spaced from each other. The "$\star$" symbol denotes that the variable or parameter is a random draw from a posterior conditional distribution.

Algorithm 1: Joint Modeling Gibbs Sampler

1: Fill in the missing data $Y_{\text{mis}}$ by bootstrapping the observed data $Y_{\text{obs}}$

2: Estimate $\bar{Y}$ and $S$

3: Draw $\Sigma^{\star -1}$ and $\mu^\star$ using equations (3.34) and (3.33)

4: Draw $Y^\star_{\text{mis}} \sim N(\mu^\star, \Sigma^\star)$

5: Update the estimates of $\bar{Y}$ and $S$

6: Repeat steps 3 to 5 a large number of times to allow the sampler to reach its stationary distribution.

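The sketch below is a minimal Python rendering of Algorithm 1, assuming the `draw_mu_sigma` helper from the previous sketch and a boolean missingness mask. In step 4 the missing entries of each row are drawn from their normal distribution conditional on that row's observed entries, which is one way the draw $Y^\star_{\text{mis}} \sim N(\mu^\star, \Sigma^\star)$ can be carried out; the function name and arguments are illustrative and not those of any existing package.

```python
import numpy as np

def jm_gibbs_impute(Y, mask, nu, S_p, n_iter=1000, rng=None):
    """Illustrative Joint Modeling Gibbs sampler following Algorithm 1.

    Y    : (n, p) data matrix with np.nan at the missing positions
    mask : (n, p) boolean array, True where Y is missing
    Returns the completed data matrix after n_iter iterations.
    """
    rng = np.random.default_rng() if rng is None else rng
    Y = Y.copy()
    n, p = Y.shape

    # Step 1: initialize Y_mis by bootstrapping the observed values per column
    for j in range(p):
        obs = Y[~mask[:, j], j]
        Y[mask[:, j], j] = rng.choice(obs, size=int(mask[:, j].sum()),
                                      replace=True)

    for _ in range(n_iter):
        # Steps 2-3: Ybar and S are estimated inside draw_mu_sigma, which
        # then draws Sigma* and mu* from equations (3.34) and (3.33)
        mu, Sigma = draw_mu_sigma(Y, nu, S_p, rng=rng)

        # Step 4: draw Y*_mis row by row, conditioning on observed entries
        for i in range(n):
            m = mask[i]
            if not m.any():
                continue
            o = ~m
            S_oo = Sigma[np.ix_(o, o)]
            S_mo = Sigma[np.ix_(m, o)]
            S_mm = Sigma[np.ix_(m, m)]
            coef = S_mo @ np.linalg.inv(S_oo)
            cond_mean = mu[m] + coef @ (Y[i, o] - mu[o])
            cond_cov = S_mm - coef @ S_mo.T
            Y[i, m] = rng.multivariate_normal(cond_mean, cond_cov)
        # Step 5 is implicit: the next call to draw_mu_sigma re-estimates
        # Ybar and S from the updated, completed data matrix
    return Y
```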

This methodology is attractive if the multivariate distribution is a good model for the data, but it may lack the flexibility needed to represent the complex data sets encountered in real applications. In such cases, the joint modeling approach is difficult to implement because the typical specifications of multivariate distributions are not sufficiently flexible to accommodate such features (He and Raghunathan, 2009). Also, most of the existing model-based methods and software implementations assume that the data originate from a multivariate normal distribution (e.g., Honaker, King, and Blackwell, 2011; Templ, Kowarik, and Filzmoser, 2011; van Buuren, 2007).

Demirtas, Freels, and Yucel (2008) showed in a simulation study that imputations generated with the multivariate normal model can yield correct estimates, even in the presence of non-normal data. Nevertheless, the assumption of normality is inappropriate as soon as there are outliers in the data, or in the case of skewed, heavy-tailed or multimodal distributions, potentially leading to deficient results (He and Raghunathan, 2009; van Buuren, 2012). To generate imputations when variables in the data set are binary or categorical, the latent normal model (Albert and Chib, 1993) or the general location model (Little and Rubin, 2002) are available alternatives.

3.5.2 Fully Conditional Specification

Sometimes the assumption of a joint distribution for the data cannot be justified, especially with a complex data set consisting of a mix of several different continuous and categorical variables. An alternative multivariate approach is given by the Fully Conditional Specification. The method requires the specification of an imputation model for each incompletely observed variable and imputes values iteratively, one variable at a time. This is one of the great advantages of the method, since it decomposes a high-dimensional imputation model into one-dimensional problems, making it a generalization of univariate imputation (van Buuren, 2012).

This method is most commonly applied through the Multivariate Imputation by Chained Equations (MICE) algorithm (van Buuren and Groothuis-Oudshoorn, 2011).

This method is summarized in Algorithm 2. For each variable with missing values, a density $f_j(Y_j \mid Y_{-j}, \Theta_j)$, conditional on all other variables $Y_{-j}$, is specified, where $\Theta_j$ are the imputation model parameters. MICE, essentially an MCMC method, sequentially visits each variable with missing values and alternately draws the imputation parameters and the imputed values.

Algorithm 2: MICE (FCS)

1: Fill in the missing data $Y_{\text{mis}}$ by bootstrapping the observed data $Y_{\text{obs}}$

2: For $j = 1, \ldots, p$:

a. Draw $\Theta^\star_j$ from the posterior distribution of the imputation parameters.

b. Impute $Y^\star_j$ from the conditional model $f_j(Y_j \mid Y_{-j}, \Theta^\star_j)$

3: Repeat step 2 $K$ times to allow the Markov chain to reach its stationary distribution.
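The following Python sketch illustrates Algorithm 2 with Bayesian linear regression as the univariate imputation model for every variable. It is a deliberately simplified illustration under the assumption of continuous variables; the function name `mice_norm` and its arguments are hypothetical, and the sketch is not the MICE implementation of van Buuren and Groothuis-Oudshoorn (2011). In applications, steps 2a and 2b would use the imputation model appropriate for each variable, e.g. a logistic model for binary variables.

```python
import numpy as np

def mice_norm(Y, mask, K=10, rng=None):
    """Minimal FCS/MICE sketch (Algorithm 2) with Bayesian linear regression
    as the imputation model for every variable.

    Y    : (n, p) data matrix with np.nan at the missing positions
    mask : (n, p) boolean array, True where Y is missing
    K    : number of iterations of step 2
    """
    rng = np.random.default_rng() if rng is None else rng
    Y = Y.copy()
    n, p = Y.shape

    # Step 1: initialize Y_mis by bootstrapping the observed values per column
    for j in range(p):
        obs = Y[~mask[:, j], j]
        Y[mask[:, j], j] = rng.choice(obs, size=int(mask[:, j].sum()),
                                      replace=True)

    for _ in range(K):                      # step 3: repeat K times
        for j in range(p):                  # step 2: visit each variable
            mis = mask[:, j]
            if not mis.any():
                continue
            # Design matrix: intercept plus all other (currently completed) variables
            X = np.column_stack([np.ones(n), np.delete(Y, j, axis=1)])
            X_obs, y_obs = X[~mis], Y[~mis, j]

            # Step 2a: draw Theta*_j = (beta*, sigma*) from its posterior
            XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
            beta_hat = XtX_inv @ X_obs.T @ y_obs
            df = len(y_obs) - X_obs.shape[1]
            rss = np.sum((y_obs - X_obs @ beta_hat) ** 2)
            sigma2_star = rss / rng.chisquare(df)
            beta_star = rng.multivariate_normal(beta_hat, sigma2_star * XtX_inv)

            # Step 2b: impute Y*_j from the conditional model
            Y[mis, j] = (X[mis] @ beta_star
                         + rng.normal(0.0, np.sqrt(sigma2_star), mis.sum()))
    return Y
```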

The FCS approach splits high-dimensional imputation models into multiple one-dimensional problems and is appealing as an alternative to joint modeling in cases where a proper multivariate distribution cannot be found or does not exist.

The choice of imputation models in this setting is wide; for example, parametric models such as Bayesian linear regression, logistic regression, logit or multilevel models can be used. Liu et al. (2013) studied the asymptotic properties of this iterative imputation procedure and provided sufficient conditions under which the imputation distribution converges to the posterior distribution of a joint model.

van Buuren (2012) claims that, in practice, $K$ in step 3 of Algorithm 2 can be set to a value between 5 and 20. This is a strong claim, since typical applications of MCMC methods require a large number of iterations. The justification is that the random variability introduced by using imputed data in step 2 reduces the autocorrelation between iterations of the Markov chain, speeding up convergence.

3.5.3 Compatibility

To discuss the validity of the FCS approach it is necessary to first define the term "compatibility". A set of density functions $\{f_1, \ldots, f_p\}$ is said to be compatible if there is a joint distribution $f$ that generates this set.

The same flexibility of MICE that allows for very particular conditional distributions and imputation models has the drawback that the joint distribution is not explicitly known, and it may not even exist. This is the case if the specified conditional distributions are incompatible.

Incompatibility in MICE can be the result of imputing deterministic functions of variables in the data along with those same original variables. For example, there could be interaction terms or nonlinear functions of the data in the imputation models, introducing feedback loops and impossible combinations into the algorithm, which would lead to invalid imputations (van Buuren and Groothuis-Oudshoorn, 2011). For that reason, the discussion about the congeniality of the imputation and substantive models is replaced by an analysis of their compatibility.
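As an illustration of one way to avoid such feedback loops, the sketch below recomputes a derived interaction term from its imputed components after each cycle instead of imputing it as a free variable (the idea behind passive imputation in van Buuren and Groothuis-Oudshoorn, 2011). It reuses the hypothetical `mice_norm` helper sketched above and is only schematic.

```python
import numpy as np

def impute_with_derived(Y, mask, K=10, rng=None):
    """Schematic example: Y has columns [x1, x2, x1*x2]. Only x1 and x2 are
    imputed directly; the interaction is recomputed after every cycle so that
    it always stays consistent with its components."""
    Y = Y.copy()
    for _ in range(K):
        # one FCS cycle over the original variables only
        Y[:, :2] = mice_norm(Y[:, :2], mask[:, :2], K=1, rng=rng)
        # recompute the deterministic function instead of imputing it
        Y[:, 2] = Y[:, 0] * Y[:, 1]
    return Y
```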

Although FCS is only justified to work if the conditional models are compatible, van Buuren et al. (2006) report a simulation study with models with strong incompatibilities in which the estimates after performing multiple imputation were still acceptable.

Chapter 4

Imputation Methods

van Buuren (2012, Appendix A) contains an overview of available MI libraries for programming languages and statistical software like R, SPSS, SAS, S-Plus and Stata.

Salfran and Spiess (2015) described some of the most common imputation methods included in these software packages. This chapter provides more details about the imputation algorithms, also covering the methods that will be used later in the simulation experiment of Chapter 6.

Section 4.1 illustrates Bayesian linear regression, one of the oldest and most popular methods. Section 4.2 describes Amelia, a method published in 2010. Section 4.3 outlines algorithms in the family of Hot Deck imputation methods, such as the PMM approach. Section 4.4 describes a more recent method based on the software IVEware.

Section 4.5 presents a class of imputation methods based on recursive partitioning.