Missing Data - Data cleaning - Biomathematical exploration of the MARK-AGE database

1.3 Data cleaning

1.3.4 Missing Data

Missing data are a frequent problem in clinical trials (Altman and Bland, 2007; Rubin, 1976; Sterne et al., 2009). Clinical trials are designed to keep the level of missing data as low as possible, because high percentages of missing data can lead to misleading interpretations of the study results (Banks et al., 2004). The causes of missing data are varied and include incorrect patient reports, personnel error and faulty measurements.

1.3.4.1 Detection of missing data

The choice of an appropriate method of statistical analysis depends on the reasons why data are missing (Sterne et al., 2009). To find the right method, missing data and their causes should be monitored during the ongoing trial. Two important concepts that characterize and visualize the state of the data should be considered here: the missing data pattern and the missing data mechanism (Toutenburg et al., 2002).

1.3.4.1.1 Missing data pattern

The missing data pattern describes the structure that missing values produce in the data set (Fig. 1.8).

Inspecting the pattern addresses questions such as whether values are missing for a single subject or for a complete group, and whether they are missing at the same time or completely at random. This inspection can hint at complex dependencies between observed and incomplete variables (Toutenburg et al., 2002).

Figure 1.8 Overview of missing data patterns (adapted from Toutenburg et al., 2002)

The four panels represent different missing data patterns: (1) Univariate Missing Data Pattern, (2) Monotone Missing Data Pattern, (3) Special Missing Data Pattern, (4) General Missing Data Pattern.
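The inspection of a missing data pattern can be illustrated with a minimal sketch. The example below is not taken from the MARK-AGE analysis; the toy records and the function names `missing_patterns` and `is_monotone` are hypothetical. It assumes that a missing value is encoded as `None`, counts how often each combination of missing variables occurs, and checks whether the pattern is monotone in the given column order.

```python
from collections import Counter

def missing_patterns(rows):
    """Count how often each missingness pattern occurs.

    rows: list of equal-length records; None marks a missing value.
    Returns a Counter that maps a tuple of flags (True = missing) to a count.
    """
    return Counter(tuple(value is None for value in row) for row in rows)

def is_monotone(rows):
    """Return True if, in the given column order, every value that follows
    a missing value in the same record is missing as well."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True

# Hypothetical toy data: three variables measured on five subjects.
data = [
    (1.2, 3.4, 5.0),
    (0.9, 2.8, None),
    (1.1, None, None),
    (1.4, 3.1, 4.7),
    (1.0, 2.9, None),
]

patterns = missing_patterns(data)
for flags, count in sorted(patterns.items()):
    print(flags, count)
print("monotone:", is_monotone(data))
```

A table of this kind shows whether gaps are confined to one variable or spread over several, which is the distinction drawn between the panels of Fig. 1.8.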

1.3.4.1.2 Missing data mechanism

The missing data mechanism describes the process responsible for generating the missing data and is generally divided into three categories (Ibrahim et al., 2012; Toutenburg et al., 2002).

Missing completely at random (MCAR)

The reason why a value is missing does not depend on the data themselves. Such missing values arise from lost data, accidental omission of an answer on a questionnaire, accidental breakage of samples or laboratory instruments, and unnoticed personnel error. Under MCAR the observed data are simply a random sample of the complete data (Ibrahim et al., 2012; Little and Rubin, 2002; Toutenburg et al., 2002).

Missing at random (MAR)

The probability that a value is missing is related to another, observed variable, but not to the values of the variable that contains the missing data (Little and Rubin, 2002).
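The difference between the two mechanisms described above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the variables `age` and `marker`, the threshold of 55 and all deletion rates are hypothetical and chosen for illustration. Under MCAR every value is deleted with the same probability; under MAR the deletion probability depends on the fully observed variable `age`, but not on the deleted value itself.

```python
import random

def make_mcar(values, rate, rng):
    """Delete each value with the same probability, independent of the data."""
    return [None if rng.random() < rate else v for v in values]

def make_mar(values, observed, threshold, rate_high, rate_low, rng):
    """Delete values with a probability that depends only on a second,
    fully observed variable, not on the value that is deleted."""
    result = []
    for v, x in zip(values, observed):
        rate = rate_high if x > threshold else rate_low
        result.append(None if rng.random() < rate else v)
    return result

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

rng = random.Random(42)                                   # fixed seed
age = [rng.uniform(35, 75) for _ in range(10_000)]        # fully observed
marker = [0.1 * a + rng.gauss(0, 1) for a in age]         # will receive gaps

mcar = make_mcar(marker, 0.2, rng)
mar = make_mar(marker, age, 55, 0.4, 0.05, rng)

# Compare the missing rate of the marker in older and younger subjects.
for name, incomplete in (("MCAR", mcar), ("MAR", mar)):
    older = [m for m, a in zip(incomplete, age) if a > 55]
    younger = [m for m, a in zip(incomplete, age) if a <= 55]
    print(name, round(missing_rate(older), 2), round(missing_rate(younger), 2))
```

Under MCAR the two groups should show roughly the same missing rate, whereas under MAR the rate differs between the groups defined by the observed variable.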

1.3.4.2 Handling of missing data

1.3.4.2.1 Complete Case Analysis (CCA)

Many methods have been published on how to address missing data once they have been identified (Enders, 2012; Rubin, 1987; Sterne et al., 2009). An available or complete case analysis (CCA) is the most frequently used method for handling missing data and excludes all subjects with missing values from the analysis (Altman and Bland, 2007; Donders et al., 2006; Sterne et al., 2009; White and Thompson, 2005). It is also the default technique in most statistical software, such as SAS, Stata and R. A CCA is only usable if the missing data occur with equally distributed covariates and in an unbiased fashion. A covariate is a variable that is possibly predictive of the study outcome and may itself be of direct interest. Guidelines for randomized trials indicate that adjustment for covariates can be considered to reduce bias and increase precision, and that it should be pre-specified in the trial protocol (Lewis, 1942; Products, 2004). Bias is defined here as the average difference between model parameter estimates and their true values (Ibrahim et al., 2012). For data that are MAR, where missing values depend only on the observed covariates and not on the response, a CCA will lead to unbiased estimates (Little and Rubin, 2002). For data that are missing not at random (MNAR), in contrast, a CCA will lead to biased and inefficient parameter estimates because the subjects with missing values are excluded (Ibrahim et al., 2012; Toutenburg et al., 2002).
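The definition of bias given above, the average difference between estimates and the true value, can be sketched in a small simulation of a complete case analysis. The following example rests on illustrative assumptions only: the function names, sample sizes and deletion probabilities are hypothetical, the true mean of the response `y` is 0 by construction, and missing not at random (MNAR) is simulated by letting the deletion probability depend on the value of `y` itself.

```python
import random
from statistics import mean

def complete_cases(rows):
    """Complete case analysis: keep only records without any missing value."""
    return [row for row in rows if all(v is not None for v in row)]

def simulate_bias(mechanism, n_subjects=500, n_trials=200, seed=1):
    """Average difference between the complete-case mean of y and its true
    mean (0 by construction) over repeated simulated trials."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        rows = []
        for _ in range(n_subjects):
            x = rng.gauss(0, 1)              # fully observed covariate
            y = x + rng.gauss(0, 1)          # response with true mean 0
            if mechanism == "MCAR":
                p_missing = 0.3              # unrelated to the data
            else:                            # "MNAR": depends on y itself
                p_missing = 0.6 if y > 0 else 0.1
            rows.append((x, None) if rng.random() < p_missing else (x, y))
        kept = complete_cases(rows)
        errors.append(mean(row[1] for row in kept) - 0.0)
    return mean(errors)

print("MCAR bias:", round(simulate_bias("MCAR"), 3))
print("MNAR bias:", round(simulate_bias("MNAR"), 3))
```

In this setting the complete-case mean should stay close to the true value under MCAR, while under the simulated MNAR mechanism it is shifted downwards, because high values of `y` are removed more often.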

1.3.4.2.2 Imputation methods

The exclusion of subjects with missing data can have a large impact on the analysis (Little and Rubin, 2002; Molenberghs and Verbeke, 2005; Verbeke and Molenberghs, 2009). As a consequence, methods to fill the gaps in the data are necessary. One common method to deal with missing data is imputation (Rubin, 1987; Schafer, 1997; Sterne et al., 2009; Vach, 1994).

Some methods use only a single estimate for each missing value (single imputation), whereas others use several (multiple imputation). Single imputation is simple and easy to perform, but a poor choice of the inserted value can lead to incorrect conclusions. Common choices are the mean or the statistically more robust median (Enders, 2012). Multiple imputation methods were introduced in 1987 and are an efficient way to handle missing data in general (Rubin, 1987), as they also account for variability (McCleary, 2002; Patrician, 2002). Rubin described how to estimate missing values with a set of plausible data. The method assumes that the data are MAR and follow a multivariate normal distribution.
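Single imputation with the mean or the median can be sketched in a few lines. The function name `impute_single` and the toy values are hypothetical; missing values are assumed to be encoded as `None`.

```python
from statistics import mean, median

def impute_single(values, strategy="mean"):
    """Replace every missing value (None) by the mean or the median
    of the observed values of the same variable."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable without observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# Hypothetical measurements with one outlier (45.0) and two gaps.
values = [4.0, 5.0, None, 6.0, 45.0, None]

print(impute_single(values, "mean"))
print(impute_single(values, "median"))
```

In this toy example the outlier pulls the mean far away from the bulk of the observations, while the median stays close to them, which illustrates why the median is regarded as the more robust choice.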

The imputation is carried out in three steps. First, the missing values are replaced five times, generating five different data sets. Second, the mean and the standard deviation (sd) are calculated for each set. Third, these results are combined into an overall mean and sd for each missing value. In this process the missing values are estimated from a specific list of characteristics that are used as predictors (Rubin, 1976). Regression analysis, a main tool in the statistical analysis of dependencies, is often affected by missing values (Toutenburg et al., 2002). Whereas parametric regression has been investigated extensively (Rubin, 1987; Schafer, 1997; Vach, 1994), nonparametric methods (Chena and Tang, 2011; González-Manteiga and Pérez-González, 2011) have received little attention in this context so far. Multiple imputation, however, results in less biased estimates than not addressing missing values at all (Moons et al., 2006).
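The three steps of multiple imputation (impute, analyse, combine) can be sketched as follows. This is a deliberately simplified illustration and not the procedure used for the MARK-AGE data: the imputation model draws the missing values from a normal distribution fitted to the observed values, without predictors and without accounting for the uncertainty of the fitted parameters. The combination step uses Rubin's rules, in which the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. All function names and the toy values are hypothetical.

```python
import random
from statistics import mean, stdev, variance

def impute_once(values, rng):
    """Fill gaps with random draws from a normal distribution fitted to the
    observed values (simplified: no predictors, no parameter uncertainty)."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

def pool(estimates, variances):
    """Rubin's rules: combine m estimates and their sampling variances."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)                 # average sampling variance
    between = variance(estimates)            # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total

def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate the mean of a variable with gaps from m imputed data sets."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(values, rng)                     # step 1: impute
        estimates.append(mean(completed))                        # step 2: analyse
        variances.append(variance(completed) / len(completed))   # variance of the mean
    return pool(estimates, variances)                            # step 3: combine

# Hypothetical measurements with three gaps.
values = [5.1, None, 4.8, 5.6, None, 5.0, 4.9, None, 5.3, 5.2]

estimate, total_variance = multiple_imputation_mean(values)
print("pooled mean:", round(estimate, 2))
print("pooled standard error:", round(total_variance ** 0.5, 2))
```

The between-imputation term is what distinguishes this approach from single imputation: it carries the uncertainty about the imputed values into the final standard error.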
