Missing Data
a Missing data are frequent in
• Survey sampling
• Data collections / registers that are not collected for the purpose of statistical analysis
b Distinguish from
• Censored data (survival, detection limit)
• Truncated data
Subjects that earn more than 200 000.– do not answer the question about income. Censored, truncated, missing?
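
To make the three notions concrete, here is a minimal Python sketch. The income values are made up; only the 200 000 threshold comes from the example above. It shows what the recorded data would look like in each case and does not settle the question posed on the slide.

```python
# Hypothetical incomes; the 200000 threshold is taken from the slide's example.
incomes = [45000, 82000, 120000, 250000, 310000]
LIMIT = 200000

# Censored: the subject is in the data and we know the value is above LIMIT,
# but only the limit itself is recorded (value, was_censored).
censored = [(min(x, LIMIT), x > LIMIT) for x in incomes]

# Truncated: subjects above LIMIT never enter the data set at all.
truncated = [x for x in incomes if x <= LIMIT]

# Missing: the subject is in the data, but the entry is NA (None here)
# and carries no information about why it is absent.
missing = [x if x <= LIMIT else None for x in incomes]

print(censored)
print(truncated)
print(missing)
```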
c Informative missingness → bias = systematic error
Non-informative missingness → increased standard errors; can be treated by statistical methods.
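d Model: Complete dataset → selection mechanism → observed data

$\tilde Z = [\tilde Z_i^{(j)}] = [\tilde X, \tilde Y]$ and $M = [M_i^{(j)}]$ $\longrightarrow$ $Z$

$$Z_i^{(j)} = \begin{cases} \tilde Z_i^{(j)} & \text{if } M_i^{(j)} = 0 \\ \text{NA} & \text{if } M_i^{(j)} = 1 \end{cases}$$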
• Missing Completely at Random (MCAR)
  $M_i^{(j)}$ independent of $\tilde Z_i$.
• Missing at Random (MAR)
  $M_i^{(j)}$ depends only on observed $Z_i^{(j)}$.
• Informative Missing (IM)
  Usually hopeless. But see Survey Sampling for tricks.
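
The selection model can be illustrated with a small simulation. This is a sketch under stated assumptions: the variable names, the sample size, and the missingness rates 0.3 and 0.6 are invented for illustration. An indicator M turns complete values into NA (None in Python), once independently of the data (MCAR) and once depending only on an always-observed variable x (MAR).

```python
import random
from statistics import mean

random.seed(1)
n = 5000

# Complete data: x is always observed, y depends on x.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]


def apply_mask(values, mask):
    """Observed value = complete value where M == 0, None (NA) where M == 1."""
    return [None if m == 1 else v for v, m in zip(values, mask)]


# MCAR: M is independent of the complete data.
m_mcar = [1 if random.random() < 0.3 else 0 for _ in range(n)]
# MAR: M depends only on x, which is always observed.
m_mar = [1 if (xi > 0 and random.random() < 0.6) else 0 for xi in x]

y_mcar = apply_mask(y, m_mcar)
y_mar = apply_mask(y, m_mar)


def observed_mean(values):
    """Mean of the non-missing entries."""
    return mean(v for v in values if v is not None)


print("complete mean of y:", round(mean(y), 3))
print("MCAR observed mean:", round(observed_mean(y_mcar), 3))
print("MAR observed mean: ", round(observed_mean(y_mar), 3))
```

In this setup the raw mean of the observed y stays close to the complete-data mean under MCAR but shifts under MAR, because cases with large x are dropped more often. The shift can be corrected only by a method that uses x.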
e Procedures for dealing with non-informative missings:
• Drop observations with ≥ 1 missings (na.omit). OK if the remaining dataset is big enough.
Regression:
• Always drop observations for which the response variable is missing.
• Indicator variables:
  – Factor: add a level 'NA'
  – Continuous variable: add an indicator variable $M^{(j)}$.
    Let $X_i^{(j)}$ = some constant $c_j$ (= 0) instead of 'NA'.
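
The two simplest procedures can be sketched in a few lines of Python. The data rows and names are hypothetical. The first step mimics what na.omit does in R; the second shows the indicator-variable coding for a continuous input with $c_j = 0$.

```python
# Hypothetical rows: (x1, x2, y); None marks a missing value.
rows = [
    (1.0, 2.0, 3.1),
    (2.0, None, 4.9),
    (3.0, 1.5, None),
    (4.0, 0.5, 8.2),
]

# Listwise deletion: keep only rows without any missing value.
complete_cases = [r for r in rows if all(v is not None for v in r)]

# For regression, rows with a missing response are dropped in any case.
with_response = [r for r in rows if r[2] is not None]

# Indicator-variable coding for the continuous input x2:
# replace None by the constant C = 0 and add the indicator M (1 = was missing).
C = 0.0
coded = [(x1, C if x2 is None else x2, 1 if x2 is None else 0, y)
         for x1, x2, y in with_response]

print(complete_cases)
print(coded)
```

The coded rows have the layout (x1, x2 with constant filled in, indicator M, y), so a regression on them keeps the case with missing x2 instead of discarding it.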
General:
• Imputation: Replace each missing by a plausible value
  – Use regression: forecast $\tilde Z_i^{(j)}$ from observed $\tilde Z_i^{(k)}$.
  – Nearest neighbor(s).
• Multiple imputation: Replace each missing by 5 plausible values to mimic random variation.
• Maximum likelihood: Determine ML estimates based on all observed data.
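
Regression imputation and a simplified form of multiple imputation can be sketched as follows. The data and names are hypothetical. The multiple-imputation step only adds residual noise to the forecast; a full procedure would also account for the uncertainty of the fitted coefficients, which this sketch leaves out.

```python
import random
from statistics import mean

random.seed(2)

# Hypothetical data: x1 fully observed, x2 has missing values (None).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, None, 8.2, None, 11.8]

obs = [(a, b) for a, b in zip(x1, x2) if b is not None]
ax = mean(a for a, _ in obs)
bx = mean(b for _, b in obs)

# Least-squares line x2 = intercept + slope * x1, fitted on the observed pairs.
slope = (sum((a - ax) * (b - bx) for a, b in obs)
         / sum((a - ax) ** 2 for a, _ in obs))
intercept = bx - slope * ax

# Residual standard deviation, used as noise level for multiple imputation.
resid = [b - (intercept + slope * a) for a, b in obs]
sigma = (sum(r * r for r in resid) / (len(obs) - 2)) ** 0.5


def impute(noise_sd=0.0):
    """Fill each missing x2 with the regression forecast plus optional noise."""
    return [b if b is not None
            else intercept + slope * a + random.gauss(0, noise_sd)
            for a, b in zip(x1, x2)]


single = impute()                              # single imputation
multiple = [impute(sigma) for _ in range(5)]   # 5 completed data sets

print([round(v, 2) for v in single])
for completed in multiple:
    print([round(v, 2) for v in completed])
```

Each of the 5 completed data sets would then be analyzed separately and the results combined, so that the spread between them reflects the uncertainty about the imputed values.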
f Comments:
• na.omit is simple and useful up to 5-10% missings
  – causes problems with add1 of lm and therefore with step(..., direction='forward')
• Maximum likelihood needs a stochastic model for all variables.
• Regression: "kind of" model.
• Observations may be weighted according to the number of imputed values.
• Remember "attenuation" for "errors-in-variables" models. → Single imputation of input variables leads to bias.
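
As a reminder of what attenuation means in the classical errors-in-variables setting (a standard result, not derived on these slides; the symbols $W$, $U$, $\sigma_X^2$, $\sigma_U^2$ are notation introduced here): if $Y = \alpha + \beta X + \varepsilon$, but the regression uses $W = X + U$ in place of $X$, with the error $U$ independent of $X$ and $\varepsilon$, the least-squares slope tends to

```latex
\beta^{*} \;=\; \beta \, \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2},
\qquad |\beta^{*}| \le |\beta| .
```

The slope is shrunk towards zero by the factor $\sigma_X^2 / (\sigma_X^2 + \sigma_U^2)$. The last bullet warns that an imputed input variable is likewise not the true value, so the fitted coefficients can be biased.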
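g Strategy for regression problems (to be tested ...)
• Delete cases with missing $Y_i$.
• Choose the input variable $X^{(k_1)}$ with the smallest number of missings.
  Let $I_1$ = set of missings $= \{\, i \mid M_i^{(k_1)} = 1 \,\}$.
  Determine all $X^{(j)}$ which have non-missing values for all $i \in I_1$. (Call these $X^{(K_0)}$.)
  Predict $\tilde X_i^{(k_1)}$ from regressing $X^{(k_1)}$ on $X^{(K_0)}$ → $\hat X_i^{(k_1)}$.
• $X^{(k_2)}$: second smallest number of missings.
  Use only $X^{(K_0)}$, $X^{(k_1)}$ (with imputed values) to predict $\tilde X^{(k_2)}$. ...
• $X^{(k_\ell)}$: $\ell$th smallest number of missings.
  Use only $X^{(K_0)}, X^{(k_1)}, \ldots, X^{(k_{\ell-1})}$ to predict $\tilde X^{(k_\ell)}$.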
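
The strategy above can be sketched in plain Python. Several points are assumptions of this sketch rather than part of the slides: all function and variable names, the use of ordinary least squares via the normal equations, and the choice to raise an error when a predictor is itself missing in a case that needs imputation (the slides do not say how to handle that). Cases with missing $Y_i$ are assumed to have been deleted beforehand.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    D = [[1.0] + list(row) for row in X]
    p = len(D[0])
    XtX = [[sum(d[i] * d[j] for d in D) for j in range(p)] for i in range(p)]
    Xty = [sum(d[i] * yi for d, yi in zip(D, y)) for i in range(p)]
    return solve(XtX, Xty)


def sequential_impute(cols):
    """cols maps a variable name to a list of values, None marking a missing.

    Returns a filled copy; the input is left unchanged.
    """
    cols = {k: list(v) for k, v in cols.items()}
    n_miss = {k: sum(v is None for v in col) for k, col in cols.items()}
    # Variables with missings, ordered by increasing number of missings.
    order = sorted((k for k in cols if n_miss[k] > 0), key=n_miss.get)
    if not order:
        return cols
    # I1: cases in which the variable with the fewest missings is missing.
    I1 = [i for i, v in enumerate(cols[order[0]]) if v is None]
    # K0: all other variables that are observed for every case in I1.
    preds = [k for k in cols
             if k != order[0] and all(cols[k][i] is not None for i in I1)]
    for target in order:
        use = [k for k in preds if k != target]
        n = len(cols[target])
        # Fit on the cases where the target and all predictors are observed.
        fit = [i for i in range(n)
               if cols[target][i] is not None
               and all(cols[k][i] is not None for k in use)]
        beta = ols([[cols[k][i] for k in use] for i in fit],
                   [cols[target][i] for i in fit])
        for i in range(n):
            if cols[target][i] is None:
                xs = [cols[k][i] for k in use]
                if any(v is None for v in xs):
                    raise ValueError("predictor missing where target is missing")
                cols[target][i] = beta[0] + sum(b * v for b, v in zip(beta[1:], xs))
        if target not in preds:
            preds.append(target)  # the imputed variable may now act as predictor
    return cols


# Made-up inputs with an exact relation x3 = 1 + 2*x1 - x2, so that the
# imputed values can be checked against the values that were blanked out.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, None, 3, 6, 5, 8, 7],
    "x3": [1, 4, 3, 6, 5, None, None, 10],
}
filled = sequential_impute(data)
print([round(v, 3) for v in filled["x2"]])
print([round(v, 3) for v in filled["x3"]])
```

In this toy example x2 has the fewest missings and is imputed first from x1 and x3; x3 is then imputed from x1 and the completed x2. Because the made-up relation is exact, the blanked-out values are recovered; with real data the forecasts would of course carry error.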
L Literature:
• F. Harrell (2002). Regression Modeling Strategies, Ch. 3
• J. L. Schafer (1999). Analysis of Incomplete Multivariate Data
h R packages
• norm: Analysis of multivariate normal datasets with missing values
• mix: Estimation / multiple imputation for mixed categorical and continuous data
• mitools: Tools for multiple imputation of missing data
Functions:
• Hmisc::transcan: Transformations / imputations using canonical variates (Harrell)
• e1071::impute: Replace missing values
• fts::fill: Fill missing values
• mlmmm::mlmmm.em: ML estimation via EM algorithm under multivariate linear mixed models with missing values
• mvnmle::getclf: Create likelihood function for multivariate data with missing values
• scrime::knncatimpute: Missing value imputation with kNN
• scrime::knncatimputeLarge: Missing value imputation with kNN for high-dimensional data
Messages: Missing Values
• Missing values are often a severe problem in observational data and survey sampling.
• There are several ways to deal with them. "Forecasting" missing values by regressing input variables on other input variables appears to be most promising.
• There are many R functions and a few specialized packages.