Mechanisms of Ageing and Development 151 (2015) 38–44
Contents lists available at ScienceDirect
Mechanisms of Ageing and Development
journal homepage: www.elsevier.com/locate/mechagedev
MARK-AGE data management: Cleaning, exploration and visualization of data
Jennifer Baur a, Maria Moreno-Villanueva a, Tobias Kötter b, Thilo Sindlinger a, Alexander Bürkle a,∗, Michael R. Berthold b, Michael Junk c
a Chair for Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany
b Chair for Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany
c Department for Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany
Article info
Article history:
Received 30 January 2015
Received in revised form 13 May 2015
Accepted 18 May 2015
Available online 21 May 2015
Keywords:
Data cleaning, Missing data, Batch effects, Outliers, Data visualization
Abstract
Databases are organized collections of data and are necessary to investigate a wide spectrum of research questions. For data evaluation, analysts should be aware of possible data quality problems that can compromise the validity of results. Therefore data cleaning is an essential part of the data management process, which deals with the identification and correction of errors in order to improve data quality.
In our cross-sectional study, biomarkers of ageing, analytical, anthropometric and demographic data from about 3000 volunteers have been collected in the MARK-AGE database. Although several preventive strategies were applied before data entry, errors like miscoding, missing values, batch problems etc. could not be avoided completely. Such errors can result in misleading information and affect the validity of the performed data analysis.
Here we present an overview of the methods we applied for dealing with errors in the MARK-AGE database. We especially describe our strategies for the detection of missing values, outliers and batch effects and explain how they can be handled to improve data quality. Finally we report on the tools used for data exploration and data sharing between MARK-AGE collaborators.
© 2015 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Today's high throughput techniques enable the generation of huge data volumes in relatively short time periods. With increasing amounts of aggregated information, the issue of databases becomes more and more important. A database is a collection of data in an organized form, and databases can be categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables. This type is widely used in the fields of genomics and proteomics, where large amounts of data must be stored for a single subject (Pearson, 2004; Yu and Salomon, 2010). There are software programs available that enable the user to store, modify and extract information from a database, the so-called database management systems (DBMS). They are especially designed to provide an interaction between the user, other applications, and the database itself. Typical software programs that allow the creation and administration of relational databases are Microsoft SQL Server, IBM DB2 or Oracle.
∗ Corresponding author. Tel.: +497531884035; fax: +4907531884033.
E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).
To ensure that data is well organized, entered in the correct format and annotated, a data management plan should be prepared before the beginning of a study. Accurate planning covers data handling not only during data collection but also after the project is completed. However, even the best efforts in a project's design to avoid errors during data collection cannot prevent incorrect or incomplete data. Errors in real-world data are common and are to be expected (Orr, 1998; Redman, 1998). Misleading or missing information in databases prevents the confirmation of results and conclusions after data interpretation and analysis. Therefore data cleaning is an essential step in the Information Management Chain before storing and analyzing data (Chapman, 2005). Error sources in many cases are not clearly detectable and occur in a variety of fashions. Obvious examples are data entry errors, measurement errors or data integration errors (Hellerstein, 2008).
Manual cleaning of data is laborious and time consuming, and in itself prone to errors (Maletic and Marcus, 2000). Data cleaning strategies including the use of machine learning for guided database repair (Yakout et al., 2010), inferring and imputing of missing values (Mayfield et al., 2010) and resolving of inconsistencies using functional dependencies (Fan et al., 2008) have been
http://dx.doi.org/10.1016/j.mad.2015.05.007
0047-6374/© 2015 Elsevier Ireland Ltd. All rights reserved.
Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-295717
Published in: Mechanisms of Ageing and Development; 151 (2015), pp. 38-44. https://dx.doi.org/10.1016/j.mad.2015.05.007
described before. Furthermore Batini and collaborators provide several general methodologies for the improvement of data quality (Batini et al., 2009). With regard to the purpose of cleaning, data are of high quality if "they are fit for their intended uses in operations, decision making and planning" (Juran et al., 1974). Controlled high quality data do not compromise the accuracy and efficiency of data analysis and can be further processed by users.
The process of transforming data into sensory stimuli and visual images is called data visualisation (Schroeder et al., 2003). Powerful charts, diagrams or maps provide solutions to explore, analyse, and present data. Furthermore data visualisation is an important tool for effective data communication in large research consortia. However data can take different forms such as numbers, graphs, images, or texts, and their visualisation entails some challenges. Finally, stored and analyzed data might be requested by other researchers. As a result, beneficial strategies for data sharing are also necessary. In the field of ageing research data from databases are typically presented over the internet. One well-known example is the Digital Ageing Atlas, where age-related data of different biological levels were collected from the literature, stored and provided on a web page (Craig et al., 2015). Each individual parameter can be requested with a definition according to age and a connection to the raw data. As a broad collection of data increases the chance to obtain powerful results, the Human Ageing Genomic Resources combine three databases linking aspects of age-related genetic and evolutionary studies on their web pages (Tacutu et al., 2013). Data between all parts is linked, and users can simply request the information of interest on the user-friendly interface.
This work describes MARK-AGE data management efforts focusing on strategies such as data retrieval, identification of errors and assurance of data quality. Furthermore we created automatic reports on a web portal using Konstanz Information Miner (KNIME) as interface for data sharing and visualization.
2. Material
2.1. MARK-AGE database
The MARK-AGE database is a relational database and was established using Structured Query Language (SQL) (see Kötter and Moreno-Villanueva et al., this issue) and prepared for usage as described in Baur et al. (this issue). SQL is a commonly used database language and allows data to be retrieved from a database fast and efficiently. The language can be used not only to create databases but also for updating, retrieving and sharing data with other users. The MARK-AGE database itself consists of analytical, anthropometric and demographic data collected from about 3300 subjects recruited across Europe (see Bürkle et al., this issue).
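To illustrate the kind of subgroup retrieval that SQL makes possible, the following minimal sketch builds a small in-memory relational database and queries one parameter for a subgroup. It uses the sqlite3 module only because it ships with Python; the table names (subjects, measurements), the column names, the parameter name parameter_x and all values are hypothetical and do not reflect the actual MARK-AGE schema or database system.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables and values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subjects ("
        "subject_id INTEGER PRIMARY KEY, age INTEGER, gender TEXT)"
    )
    conn.execute(
        "CREATE TABLE measurements ("
        "subject_id INTEGER REFERENCES subjects(subject_id), "
        "parameter TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO subjects VALUES (?, ?, ?)",
        [(1, 35, "F"), (2, 52, "M"), (3, 68, "F"), (4, 71, "M")],
    )
    # A NULL value stands for a missing measurement.
    conn.executemany(
        "INSERT INTO measurements VALUES (?, ?, ?)",
        [
            (1, "parameter_x", 4.2),
            (2, "parameter_x", 5.1),
            (3, "parameter_x", None),
            (4, "parameter_x", 6.3),
        ],
    )
    return conn


def query_subgroup(conn, parameter, min_age, gender):
    """Return (subject_id, age, value) rows for one parameter and subgroup."""
    sql = (
        "SELECT s.subject_id, s.age, m.value "
        "FROM subjects AS s "
        "JOIN measurements AS m ON m.subject_id = s.subject_id "
        "WHERE m.parameter = ? AND s.age >= ? AND s.gender = ? "
        "ORDER BY s.subject_id"
    )
    return conn.execute(sql, (parameter, min_age, gender)).fetchall()


conn = build_demo_db()
rows = query_subgroup(conn, "parameter_x", 50, "F")
```

Storing subject data and measurements in separate tables and joining them on a subject identifier is the general relational pattern; missing measurements appear as None in the returned rows and have to be handled in the downstream analysis.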
2.2. Konstanz Information Miner (KNIME)
The Konstanz Information Miner is a modular environment, which enables visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes (Berthold et al., 2007).
2.3. KNIME team space
KNIME can be used in a team to share work with other researchers, by keeping data files and data analysis workflows in one central shared place. Also meta nodes including pre-programmed workflows can be used as a reference to the centrally stored version by all team members in their local workflows (https://www.knime.org/knime-teamspace).
2.4. KNIME server and web portal
KNIME Server allows storing workflows and accessing them from anywhere via the internet. User access rights control how data is grouped for projects, workgroups or departments. The web portal is a convenient way to distribute preconfigured workflows, created by administrative users, to all end users (https://www.knime.org/knime-server).
2.5. Programming language R
R is a freely available scripting language for statistical computing and the generation of high quality graphs (Ihaka and Gentleman, 1996). A wide choice of pre-programmed R packages can easily be implemented and used for data analysis. KNIME incorporates an R plugin, enabling the use of the R language and its packages in workflows.
3. Results
3.1. Data quality: strategies for cleaning data
Data quality reflects the goodness of the evaluated data. Unfortunately, in spite of efforts put into data entry (see Bürkle et al., this issue), errors still occurred and therefore data cleaning was necessary. In order to visualize and detect errors, automatic standard analyses (Table 1) were performed on entered data. They consisted of histograms, scatterplots and boxplots generally used in descriptive statistics. After identification, the MARK-AGE database cleaning strategy includes (1) clearing of missing values, (2) removal of outliers and (3) detection of batch effects. All three types of errors could, if untreated, compromise the conclusions that are drawn from the data.
3.2. Dealing with missing values
Two main scenarios are responsible for missing data in the MARK-AGE database. On the one hand a missing value can occur completely at random because the sample tube was broken, defrosted, lost etc. On the other hand missing data were introduced because specific parameters were only measured in low throughput analysis. In these cases, instead of all recruited subjects, only 300 individuals were measured. A precise definition of the occurrence of missing values is therefore essential to determine the handling strategies. In addition missing data analysis depends on the extent of missing values and their influence on covariates. A large percentage of missing information in a covariate, e.g., females, can lead to group under-representation and requires further investigation. As a consequence, selected parts of the database show different states of missing values and require adjusted missing value handling. Therefore a general replacement of missing data in the original database is not possible. The handling of missing values was performed in the individual downstream analysis, using either complete case analysis or a substitution method as described in the following.
If covariates are equally distributed and values are missing completely at random, a substitution is not obligatory. In this case a complete case analysis was performed, including only subjects for which data were available for all parameters.
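A complete case analysis, together with a check for under-representation of a covariate group, could be sketched as follows. This is an illustration under stated assumptions rather than the workflow used in the project: the record layout (a list of dictionaries), the field names id, gender, x and y, and all values are hypothetical, and a missing value is represented by None.

```python
def complete_cases(records, parameters):
    """Keep only records that have a value for every listed parameter."""
    return [
        record
        for record in records
        if all(record.get(name) is not None for name in parameters)
    ]


def missing_fraction_by_group(records, parameter, covariate):
    """Fraction of missing values of one parameter per covariate group."""
    counts = {}
    for record in records:
        total, missing = counts.get(record[covariate], (0, 0))
        is_missing = record.get(parameter) is None
        counts[record[covariate]] = (total + 1, missing + int(is_missing))
    return {group: missing / total for group, (total, missing) in counts.items()}


# Hypothetical example data: None marks a missing measurement.
records = [
    {"id": 1, "gender": "F", "x": 1.0, "y": 2.0},
    {"id": 2, "gender": "F", "x": None, "y": 2.5},
    {"id": 3, "gender": "M", "x": 1.4, "y": None},
    {"id": 4, "gender": "M", "x": 1.1, "y": 2.2},
]

complete = complete_cases(records, ["x", "y"])
fractions = missing_fraction_by_group(records, "x", "gender")
```

Comparing the missing fractions across groups before dropping incomplete records gives a quick indication of whether a complete case analysis would under-represent one group.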
Table 1
Table of automated analyses that were available for the quality checks of parameters.
Report name | Outcome
Scatterplot with age | Scatterplot for a parameter against age with a linear regression line and the Pearson and Spearman correlation coefficients
Boxplot for selected subgroups | Boxplot for a parameter grouped for a pre-defined subgroup (e.g., age, recruitment center, gender)
Histogram | Simple histogram for one parameter
Correlation of two parameters | Scatterplot for two selected parameters and the Spearman correlation coefficient
Batch check recruitment time | Scatterplot of a parameter against the recruitment time
Batch check internal id | Scatterplot of a parameter against the time of biochemical analysis
Empirical distribution function for selected subgroups | Empirical distribution function for selected subgroups of a parameter with the information about the significant differences calculated with bootstrap analysis

To compensate for under-representation of a covariate, the missing values can be replaced with statistical estimates. One popular method is mean substitution, i.e., replacing the missing values by the average of available values. However, since mean substitution in general cannot be recommended due to biased results, more robust statistical methods like median substitution or multiple imputation should be chosen. Whenever replacement of missing values is necessary, it should be checked if the substitution strategy affects the intended analysis. To accomplish this task, one can start from a representative complete data set in which missing values are introduced randomly. Since the number of missing values can be controlled in this scenario, the effect of the replacement strategy depending on the number of missing values can be analyzed statistically. The resulting information is quite useful for the decision whether the replacement strategy should be applied to a specific data set. The substitution method was adjusted for each individual analysis.
As a representative example, we selected a data set without missing values from the MARK-AGE database. In Fig. 1 the Spearman correlation of two parameters was determined. The replacement steps were performed randomly and repeated a hundred times. After a replacement of 5% of all data values the correlation differs significantly from the original outcome. This means that replacing more than 5% of missing values would affect the conclusions derived from the initial data. In this case a method other than median imputation must be found.
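The check described above can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the study: the data are synthetic, values are removed from only one of the two parameters, the median is computed from the values that remain after removal, and the function names are hypothetical. The significance test used in the study (one way ANOVA) is not reproduced here; the sketch only reports the mean Spearman correlation over the repetitions.

```python
import random
import statistics


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_after_imputation(x, y, fraction, repeats=100, seed=0):
    """Mean Spearman correlation after median-imputing a fraction of x."""
    n_replace = round(fraction * len(x))
    if n_replace >= len(x):
        raise ValueError("fraction must leave at least one observed value")
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        removed = set(rng.sample(range(len(x)), n_replace))
        # Assumption: the median is taken from the remaining values only.
        kept = [v for i, v in enumerate(x) if i not in removed]
        median_x = statistics.median(kept)
        x_imputed = [median_x if i in removed else v for i, v in enumerate(x)]
        results.append(spearman(x_imputed, y))
    return statistics.fmean(results)


# Synthetic, correlated example data (not MARK-AGE data).
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [a + data_rng.gauss(0, 0.5) for a in x]

baseline = spearman(x, y)
by_fraction = {f: correlation_after_imputation(x, y, f) for f in (0.0, 0.05, 0.5)}
```

Plotting the entries of by_fraction against the replaced fraction gives a curve comparable in spirit to Fig. 1; at which fraction the deviation becomes relevant depends on the data set and on the statistical test applied.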
Fig. 1. Overview of the change in correlation between two randomly selected parameters after median imputation. Increasing numbers of data points were randomly removed from the original data set and replaced by the parameter's median. With increasing percentages of replaced data the calculated correlations deviate from the original value. The first significant deviation occurs for this example at 5% (one way ANOVA p < 0.05).

3.3. Dealing with outliers
Outlier detection is an important task to accomplish prior to data analysis. Labelling methods for detecting outliers can be used if the distribution of data sets is difficult to identify or the data cannot be transformed into a proper distribution. We favoured the method based on the interquartile range over other classical approaches, such as standard deviation or z-score, because quartiles are more resistant to extreme values. Therefore an outlier was defined as any data point outside the range between the specified lower and upper quantiles (Fig. 2).
A general first check for extreme outliers is the inspection of values exceeding biological limits, for example a systolic blood pressure of 400. Those suspicious values were clearly included by mistake and were either corrected or, if that was not possible, excluded from data analysis. In case suspect values were either a legitimate part of the data or the cause was unclear, the data points were only excluded after being identified as outliers by the interquartile range approach. For standard analysis of inter-parameter correlations and linear modelling analysis, 3% quantiles were defined for outlier removal. Because both the upper and lower quantiles were considered, a total of 6% of all data values were excluded for each parameter during analysis. Excluding this amount of the data removes the most prominent outliers from the analysis, and the results reflect reliable outcomes. If by excluding data points the total amount of values is decreased too much, the analysis can again get distorted. Therefore, the minimal amount of data required for applying this outlier removal strategy was set to 300 data points.
In specific cases when only smaller subgroups of data sets are analyzed, it can happen that the selected values show a very wide distribution. The quantile limits must then be adapted until the outcomes stay statistically stable. If parameters show too wide a distribution or fall below the minimal amount of required values, they must be excluded from the specific analysis.
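The two-step procedure described in this section, a plausibility check against biological limits followed by quantile-based trimming, could look roughly as follows. The sketch rests on assumptions: quantiles are computed by linear interpolation, the minimum of 300 data points is checked before trimming, and the function names and the example limits are hypothetical.

```python
def within_limits(values, lower, upper):
    """Drop values outside biologically plausible limits."""
    return [v for v in values if lower <= v <= upper]


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def remove_outliers(values, tail=0.03, min_points=300):
    """Keep values between the lower and upper `tail` quantiles.

    Assumption: the minimum number of data points is checked before trimming.
    """
    if len(values) < min_points:
        raise ValueError("too few data points for quantile-based outlier removal")
    ordered = sorted(values)
    low = quantile(ordered, tail)
    high = quantile(ordered, 1 - tail)
    return [v for v in values if low <= v <= high]


# Hypothetical example: 1000 evenly spaced values plus one implausible entry.
raw = list(range(1000)) + [40000]
plausible = within_limits(raw, lower=0, upper=5000)
trimmed = remove_outliers(plausible, tail=0.03)
```

With tail=0.03 roughly 3% of the values are removed at each end, which corresponds to the total of 6% mentioned above. For small subgroups the tail argument would have to be adapted as described, and the ValueError signals that the parameter should be excluded from that specific analysis.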
3.4. Dealing with batch effects
A batch effect is the result of systematic error introduced artificially during sample processing. A simple method for detecting batch effects is to perform tests of association between the analyzed values and various experimental variables (e.g., reagent lots, laboratory conditions, personnel differences, processing day or changes in protocols) to see if there are a large number of artificial associations across the data.
In the MARK-AGE study about 3000 subjects were recruited (see Bürkle et al., this issue). In order to obtain the necessary aliquots, body fluids from each subject were processed according to the sampling procedures (see Capri and Moreno-Villanueva et al., this issue) and shipped to the analytic centers. Depending on the assays used, the samples were processed at once, in batches or continuously.
To detect batch effects all measured parameters were examined, either against the date of the sampling at the recruitment center or against the processing order or upload day at the analytic center. A straight line that best represents the data on a scatter plot (line of best fit) was used to visualize relationships among variables. In case of batches the maximal rate of change (slope) of the line differs considerably from zero (Fig. 3A), indicating a batch problem. In order to determine not only continuous trends but also discontinuous changes, the points on the line at which changes take place were calculated. The data located between two such points were considered a batch.
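As a rough illustration of these two checks, the sketch below computes the slope of the least-squares line of a parameter against its processing order and locates a single abrupt shift in the mean. The change-point search is a deliberately simple stand-in chosen for this example, since the text does not specify how the change points were calculated; the function names, the min_segment parameter and the data are hypothetical.

```python
import statistics


def best_fit_slope(order, values):
    """Slope of the least-squares line of values against processing order."""
    mean_x = statistics.fmean(order)
    mean_y = statistics.fmean(values)
    sxx = sum((x - mean_x) ** 2 for x in order)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(order, values))
    return sxy / sxx


def largest_mean_shift(values, min_segment=5):
    """Return (index, shift) of the split with the largest jump in the mean.

    The shift is the mean after the split minus the mean before it. This is
    a simplified single change-point search, not the method of the study.
    """
    best_index, best_shift = None, 0.0
    for i in range(min_segment, len(values) - min_segment + 1):
        shift = statistics.fmean(values[i:]) - statistics.fmean(values[:i])
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


# Hypothetical parameter with a step after the 20th processed sample.
values = [1.0] * 20 + [3.0] * 20
order = list(range(len(values)))

slope = best_fit_slope(order, values)
split_index, shift = largest_mean_shift(values)
```

A slope clearly different from zero flags the parameter for inspection, and the split index suggests where one batch ends and the next begins. Whether a flagged parameter can be corrected still depends on knowing the cause of the shift.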
Fig. 2. Scatterplots showing the distribution of a representative parameter and the removal of outliers with the interquartile approach. In the graphs the representative parameter is plotted against the age of the subjects expressed as AgeDaysAsYears. The graphs represent the original distribution (A) of the parameter and after the exclusion of 1% quantiles (B) and 3% quantiles (C). It is apparent that outliers were removed in a stepwise fashion by using this technique.
Since origins of batch effects can be many and varied, a general repair mechanism is not advisable. In fact, any such mechanism is based on a mathematical model of the error, and as long as the model does not reflect the nature of the mistake, modifying the data will not improve the quality. If the error model was known, the batch effect was corrected in the database (Fig. 3B). When no error model was available, we excluded suspicious parameters displaying huge batch problems with unexplainable noise from further analysis.
4. Data visualization and sharing
Monitoring and communication of data is an essential step for the successful completion of big data projects. A strategy to get fast and reliable results, which can be distributed under high safety conditions, is required. In the MARK-AGE project KNIME was used as standard tool for (1) data query, (2) data visualization and (3) data distribution.
4.1. Data query
For the multifunctional analysis performed in the project, different sets of parameters and covariates must be requested regularly from the database. Pre-programmed KNIME nodes were used to generate a clearly structured tool to filter desired conditions. Fig. 4 shows the standard view of the programmed selection menu for covariates. The user can choose the required group conditions from the assortment. In a second window a list of all available parameters appears. They can be selected by partner number, by work package number, or manually. If more selection options were necessary for special analyses, they could easily be added by the administrator. With this, a powerful tool was generated and adjusted to the specific requirements of the MARK-AGE database for a fast and easy data query.

Fig. 3. A representative example for the successful correction of a batch effect. The graphs show scatterplots for a parameter x against the measuring order. The line of best fit (black) shows a slope, which is recognized by the first deviation. As in such cases the algorithm automatically adds the term ATTENTION to the graph, affected parameters can easily be distinguished from clean parameters. The batch in the example occurs for a defined time because the normalization of the variables was forgotten (A). This was documented by the staff in the respective laboratory. Therefore the normalization could be performed subsequently. After the calculation the parameter is free from any batch effect (B).

Fig. 4. Screenshot of the standard selection menu to query subgroups from the database. The desired conditions can easily be selected in one step.
4.2. Data visualization
Visualization of data in the MARK-AGE project is important for two reasons: (1) uploaded data must be monitored and controlled for an early recognition and prevention of problems, and (2) researchers need to extract the biological information from the data.
4.2.1. Data exploration
For data exploration general plotting tools from descriptive statistics were used (Table 1). KNIME workflows were designed to automatically generate the graphs for all parameters available in the database. With those analyses experts in the field could test the data for plausibility and correctness. Also evidence for the underlying distribution and, as described above, for parameter quality could be provided.
4.2.2. Extract biological information
For the extraction of inter-parameter dependencies in the MARK-AGE database, a tool to visualize correlations was necessary. A typically used numerical correlation matrix, calculated over a whole database, would have been too large for the extraction of significant information. Therefore a KNIME workflow was established that automatically generates a network with a pre-selected parameter in the middle and the top ten correlating parameters arranged around it (Fig. 5). To clearly visualize the dependency between the parameters, information is represented by the length and color of the connection lines, indicating the correlation strength and direction. The available sample number is indicated by the size of the parameter circles. Through this arrangement the user is able to perform plausibility checks for known dependencies at a glance. Thus efficient detection of unknown or rather unexpected dependencies in the data is possible. As an upstream selection menu allows the separation of different subgroups, an additional stratification on the data background is possible. Changes in the correlations between different subgroups give a hint for the interaction of specific body systems or the influence of environmental factors. The identified dependencies can lead to the development of new hypotheses and further detailed analysis.

Fig. 5. Network analysis to check for correlations. The automated analysis presents the 10 best correlating parameters (A–J) for the parameter X, selected by the user. The color of the lines reflects the direction of the correlation (dark grey positive correlation, light grey negative correlation). The length of the lines indicates the strength of the correlation (the longer the lines the weaker is the correlation). The size of the circles stands for the amount of data considered for the calculation (the larger the diameter the more data is available). Which subgroups should be considered in the analysis is defined in the selection menu (see Fig. 4).
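The mapping from correlation results to the network layout described in Section 4.2.2 can be sketched as follows. The sketch assumes that the correlations of the selected parameter with all other parameters have already been computed; the input format, the function name and the choice of 1 - |rho| as line length are assumptions made for illustration and are not taken from the KNIME workflow.

```python
def top_correlations(correlations, n_top=10):
    """Select the strongest correlations and derive drawing attributes.

    correlations maps a parameter name to (rho, n_samples) for the selected
    centre parameter. Line length shrinks with correlation strength, line
    color encodes the sign, and node size reflects the available sample size.
    """
    ranked = sorted(
        correlations.items(), key=lambda item: abs(item[1][0]), reverse=True
    )
    edges = []
    for name, (rho, n_samples) in ranked[:n_top]:
        edges.append(
            {
                "parameter": name,
                "length": 1.0 - abs(rho),  # assumption: longer line = weaker
                "color": "dark grey" if rho >= 0 else "light grey",
                "node_size": n_samples,
            }
        )
    return edges


# Hypothetical correlations of a centre parameter X with three parameters.
correlations = {"A": (0.8, 2900), "B": (-0.9, 300), "C": (0.1, 3000)}
edges = top_correlations(correlations, n_top=2)
```

Ranking by the absolute value keeps strong negative correlations in the selection, while the sign is preserved separately in the line color, in line with the encoding of Fig. 5.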
4.2.3. Data sharing
In order to provide MARK-AGE partners with graphical results based on the ongoing analysis, the KNIME team space and the KNIME WebPortal were established.
4.2.3.1. KNIME team space. To share the database, meta nodes and analysis workflows, the KNIME team space was used. Workflows were placed on a dedicated MARK-AGE server. Access to the flows was restricted to team members responsible for the coordination of the MARK-AGE analysis. These members could not only access the uploaded flows but also provide other colleagues with self-generated workflows. The workflows provided could be downloaded as a local copy. Original versions, however, could only be modified by their authors. Thus, the KNIME team space provided a collaborative information exchange in a documented manner and under high-level security conditions.
4.2.3.2. KNIME WebPortal. The KNIME team space was restricted to members responsible for the coordination of MARK-AGE analysis. Therefore the KNIME WebPortal (Fig. 6) was used, enabling all project partners to receive analysis feedback on their own measurements. The KNIME WebPortal was connected to the MARK-AGE server and workflows were organized in specified folders. Users could select the desired analysis conditions from a menu (Fig. 4), and pre-programmed, automatic algorithms ran in the background. Afterwards users could download graphical results either as pdf, xls, pptx, or word document (Fig. 7). This strategy ensures the blinding of the MARK-AGE project, since the subject codes linking the subject with the analysis results remain protected.

Fig. 6. Screenshot of the KNIME WebPortal user interface with the list of available analyses that could be selected. Only activated users can log in and receive analyses from the automatic reports.
5. Discussion
In this paper we describe how the data cleaning and visualization processes were performed during the MARK-AGE project. Problems were identified especially with regard to outliers, missing values and batch effects. Some of our observations represent already known problems that can occur in databases, but published handling strategies cannot be used directly on MARK-AGE data. Therefore adjustments to the specific project's requirements are reported. In addition, new developments were described that could work for prevention and as prototype examples in upcoming ageing studies.
As mentioned under Section 2, the MARK-AGE database contains several types of information, i.e., (1) values from demographic data, (2) values from anthropometric measurements and (3) values from analytic data (see Bürkle et al., this issue). During data analysis a main challenge was to deal with and prepare the different kinds of data for reliable analysis. While information on subjects was collected in questionnaires during interviews, bioanalytical data were obtained from biological material analysed in the corresponding laboratories. Both questionnaires and analytic data are error-prone, and even our best efforts in the project design could not prevent such errors. Several circumstances can lead to missing values, outliers or batch effects, which in turn can significantly impact statistical analysis. Therefore strategies for cleaning data are necessary. However, there is no standard strategy available, and the procedures to be used mainly depend on the type of data and type of analysis. In order to check the effectiveness of data cleaning strategies, Dasu and Loh introduced the concept of statistical distortion. They argued that data cleaning strategies could have an impact on results, so that the cleaned data no longer represent the real process that generates the data. Therefore, 'cleaner' data does not necessarily mean more useful or usable data (Dasu and Loh, 2012).
In case of missing values, most statistical procedures automatically exclude the subjects concerned. This leads to a reduction of data available for performing statistical analysis. As a result, outcomes may not be statistically significant due to lack of statistical power. Furthermore missing values can also cause misleading results by introducing bias. Due to the nature of the MARK-AGE project in some cases the reason for missing values was difficult to identify. Often it is not obvious if missing data will cause a problem. In some cases results might be affected, while others stay unchanged.
As for missing values, variations that arise from outliers can influence the statistical outcome and the reliability of data. Most criteria identifying possible outliers are effective if data possess a normal distribution. Although these methods are powerful, it may be problematic to apply them to non-normally distributed data or small sample sizes without information about their characteristics. Outliers can occur due to biological variance or technical reasons. In the MARK-AGE project the inter-quartile approach was used in order to eliminate the most severe cases. However, we might exclude the most interesting subjects that represent a real biological outlier. Unfortunately in some cases it is not possible to determine the background of such an outlier. By contrast, batch effects are a source of non-biological variation. Although there are methods available to fit different slopes or steps in a distribution, this should not be applied blindly to each situation. Furthermore, during the analysis of batch effects it is essential to check whether outliers and batch effects present overlapping problems.

Fig. 7. A representative example for the outcome of an analysis on the KNIME WebPortal. The workflow runs in the background and provides the user with graphs and tables. Obtained results can be saved in one of the listed file formats (bottom left corner).
Due to the huge amount and different types of data generated during the MARK-AGE project, a manual database exploration would have been time consuming. However automated data exploration tools cannot be applied in each case. We concluded that there is a need for useful and powerful tools that automate the data cleaning process, or at least assist manual procedures. However, automated methods can only be part of the procedure. There is an ongoing need for the development of new tools to assist this process, especially for the use in best practice routines.
Last but not least, data sharing is indispensable in large research consortia such as MARK-AGE. Budin-Ljøsne and colleagues describe the challenges of sharing data based on their experiences in the European Network for Genetic and Genomic Epidemiology, ENGAGE (Budin-Ljosne et al., 2014). The science community expects authors to share research results, therefore there is a need for data archiving especially when the research deals with health issues or public policy formation. Data archiving means the storing of large amounts of data to be accessed from central locations. During the MARK-AGE project the KNIME WebPortal was used to share results within the Consortium. Access to the WebPortal is restricted to MARK-AGE members but this could be opened to the scientific community in the future. If this happens it is necessary for each user to understand the underlying facts about the database development (see Baur et al., Kötter and Moreno-Villanueva et al., this issue) and possible quality issues described in this work. Especially for biogerontologists not only the opening of the database but also upcoming publications on the MARK-AGE data will be of great interest. The interpretation of those results as well as the development of new hypotheses will mainly rely on the quality and reliability of the work on the database itself. Information on the age-dependent parameters identified in the MARK-AGE project might also be linked with other databases like the HAGR (Tacutu et al., 2013). But even prior to that, the conclusions reached during the data cleaning and visualization steps in MARK-AGE can be used to prevent errors in upcoming studies investigating the human ageing process.
Acknowledgements
We wish to thank the European Commission for financial support through the FP7 large-scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880) and all MARK-AGE Consortium partners for the excellent collaboration.
References
Batini, C., Cappiello, C., Francalanci, C., 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, http://dx.doi.org/10.1145/1541880.1541883
Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge. Springer-Verlag, Heidelberg-Berlin.
Budin-Ljosne, I., Isaeva, J., Knoppers, B.M., Tasse, A.M., Shen, H.Y., McCarthy, M.I., Harris, J.R., 2014. Data sharing in large research consortia: experiences and recommendations from ENGAGE. Eur. J. Hum. Genet. 22 (3), 317–321.
Chapman, A.D., 2005. Principles and methods of data cleaning – primary species and species-occurrence data. In: Version 1.0, Report for the Global Biodiversity Information Facility. Global Biodiversity Information Facility, Copenhagen.
Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., de Magalhaes, J.P., 2015. The Digital Ageing Atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.
Dasu, T., Loh, J.M., 2012. Statistical Distortion: Consequences of Data Cleaning. Proceedings of the VLDB Endowment. Z.M. Özsoyoğlu. Istanbul, Turkey, VLDB 2012. 5, 1674–1683.
Fan, W., Geerts, F., Xibei, J., Kementsietsides, A., 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33 (2), 1–48.
Hellerstein, J.M., 2008. Quantitative Data Cleaning for Large Databases. Survey for the United Nations Economic Commission for Europe. http://db.cs.berkeley.edu/jmh
Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5 (3), 299–314.
Juran, J.M., Gryna, F.M., Bingham, R.S., 1974. Quality Control Handbook, 3rd edition. McGraw-Hill, New York, ISBN 0070331758.
Maletic, J.I., Marcus, A., 2000. Data Cleansing: Beyond Integrity Analysis. Massachusetts Institute of Technology, Boston, pp. 200–209.
Mayfield, C., Neville, J., Prabhaker, S., 2010. ERACER: a database approach for statistical inference and data cleaning. SIGMOD, 75–86.
Orr, K., 1998. Data Quality and Systems Theory. CACM 41 (2), 66–71.
Pearson, A.A., 2004. Relational Databases. Current Protocols in Bioinformatics. Supplement 7, 9.4.1–9.4.25.
Redman, T., 1998. The impact of poor data quality on the typical enterprise. CACM 41 (2), 79–82.
Schroeder, W., Martin, K., Lorensen, B., 2003. The visualisation toolkit. In: A System for Guided Data Repair. SIGMOD. Prentice Hall PTR, New Jersey, pp. 1223–1226.
Tacutu, R., Craig, T., Budovsky, A., Wuttke, D., Lehmann, G., Taranukha, D., Costa, J., Fraifeld, V.E., De Magalhaes, J.P., 2013. Human Ageing Genomic Resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033.
Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., 2010. GDR: a system for guided data repair. SIGMOD 10. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, New York, pp. 1223–1226.
Yu, K., Salomon, R., 2010. Peptide depot: flexible relational database for visual analysis of quantitative proteomic data and integration of existing protein information. Proteomics 9 (23), 5350–5358.