MARK-AGE data management: Cleaning, exploration and visualization of data


Mechanisms of Ageing and Development 151 (2015) 38–44


Jennifer Baur^a, Maria Moreno-Villanueva^a, Tobias Kötter^b, Thilo Sindlinger^a, Alexander Bürkle^a,∗, Michael R. Berthold^b, Michael Junk^c

^a Chair for Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany

^b Chair for Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany

^c Department for Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany

Article info

Article history:

Received 30 January 2015

Received in revised form 13 May 2015

Accepted 18 May 2015

Available online 21 May 2015

Keywords:

Data cleaning; Missing data; Batch effects; Outliers; Data visualization

Abstract

Databases are organized collections of data and are necessary to investigate a wide spectrum of research questions. For data evaluation, analysts should be aware of possible data quality problems that can compromise the validity of results. Data cleaning is therefore an essential part of the data management process; it deals with the identification and correction of errors in order to improve data quality.

In our cross-sectional study, biomarkers of ageing as well as analytical, anthropometric and demographic data from about 3000 volunteers have been collected in the MARK-AGE database. Although several preventive strategies were applied before data entry, errors such as miscoding, missing values and batch problems could not be avoided completely. Such errors can result in misleading information and affect the validity of the performed data analysis.

Here we present an overview of the methods we applied for dealing with errors in the MARK-AGE database. We especially describe our strategies for the detection of missing values, outliers and batch effects and explain how they can be handled to improve data quality. Finally, we report on the tools used for data exploration and data sharing between MARK-AGE collaborators.

1. Introduction

Today's high throughput techniques enable the generation of huge data volumes in relatively short time periods. With increasing amounts of aggregated information, the issue of databases becomes more and more important. A database is a collection of data in an organized form, and databases can be categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables. This type is widely used in the fields of genomics and proteomics, where large amounts of data must be stored for a single subject (Pearson, 2004; Yu and Salomon, 2010). Software programs that enable the user to store, modify and extract information from a database are available, the so-called database management systems (DBMS). They are especially designed to provide an interaction between the user, other applications, and the database itself. Typical software programs that allow the creation and administration of relational databases are Microsoft SQL Server, IBM DB2 or Oracle.

∗ Corresponding author. Tel.: +497531884035; fax: +4907531884033.

E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).

To ensure that data are well organized, entered in the correct format and annotated, a data management plan should be prepared before the beginning of a study. Accurate planning covers data handling not only during data collection but also after the project is completed. However, even the best efforts built into a project's design to avoid errors during data collection cannot prevent incorrect or incomplete data. Errors in real-world data are common and are to be expected (Orr, 1998; Redman, 1998). Misleading or missing information in databases prevents the confirmation of results and conclusions after data interpretation and analysis. Therefore data cleaning is an essential step in the Information Management Chain before storing and analyzing data (Chapman, 2005). In many cases error sources are not clearly detectable and occur in a variety of fashions. Obvious examples are data entry errors, measurement errors or data integration errors (Hellerstein, 2008).

Manual cleaning of data is laborious and time consuming, and in itself prone to errors (Maletic and Marcus, 2000). Data cleaning strategies including the use of machine learning for guided database repair (Yakout et al., 2010), inferring and imputing of missing values (Mayfield et al., 2010) and resolving of inconsistencies using functional dependencies (Fan et al., 2008) have been described before. Furthermore, Batini and collaborators provide several general methodologies for the improvement of data quality (Batini et al., 2009). With regard to the purpose of cleaning, data are of high quality if they are "fit for their intended uses in operations, decision making and planning" (Juran et al., 1974). Controlled, high quality data do not compromise the accuracy and efficiency of data analysis and can be further processed by users.

http://dx.doi.org/10.1016/j.mad.2015.05.007

The process of transforming data into sensory stimuli and visual images is called data visualisation (Schroeder et al., 2003). Powerful charts, diagrams or maps provide solutions to explore, analyse, and present data. Furthermore, data visualisation is an important tool for effective data communication in large research consortia. However, data can take different forms such as numbers, graphs, images, or texts, and their visualisation entails some challenges. Finally, stored and analyzed data might be requested by other researchers. As a result, effective strategies for data sharing are also necessary. In the field of ageing research, data from databases are typically presented over the internet. One well-known example is the Digital Ageing Atlas, where age-related data of different biological levels were collected from the literature, stored and provided on a web page (Craig et al., 2015). Each individual parameter can be requested with a definition according to age and a connection to the raw data. As a broad collection of data increases the chance to obtain powerful results, the Human Ageing Genomic Resources (HAGR) combine three databases linking aspects of age-related genetic and evolutionary studies on their web pages (Tacutu et al., 2013). Data between all parts are linked, and users can simply request the information of interest on the user-friendly interface.

This work describes MARK-AGE data management efforts, focusing on strategies such as data retrieval, identification of errors and assurance of data quality. Furthermore, we created automatic reports on a web portal using the Konstanz Information Miner (KNIME) as an interface for data sharing and visualization.

2. Material

2.1. MARK-AGE database

The MARK-AGE database is a relational database and was established using Structured Query Language (SQL) (see Kötter and Moreno-Villanueva et al., this issue) and prepared for usage as described in Baur et al. (this issue). SQL is a commonly used database language that allows data to be retrieved from a database quickly and efficiently. The language can be used not only to create databases but also for updating, retrieving and sharing data with other users. The MARK-AGE database itself consists of analytical, anthropometric and demographic data collected from about 3300 subjects recruited across Europe (see Bürkle et al., this issue).

2.2. Konstanz Information Miner (KNIME)

The Konstanz Information Miner is a modular environment which enables visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes (Berthold et al., 2007).

2.3. KNIME team space

KNIME can be used in a team to share work with other researchers by keeping data files and data analysis workflows in one central shared place. In addition, meta nodes containing pre-programmed workflows can be used by all team members in their local workflows as a reference to the centrally stored version (https://www.knime.org/knime-teamspace).

2.4. KNIME server and web portal

KNIME Server allows storing workflows and accessing them from anywhere via the internet. User access rights control how data is grouped for projects, workgroups or departments. The web portal is a convenient way to distribute preconfigured workflows, created by administrative users, to all end users (https://www.knime.org/knime-server).

2.5. Programming language R

R is a freely available scripting language for statistical computing and the generation of high quality graphs (Ihaka and Gentleman, 1996). A wide choice of pre-programmed R packages can easily be installed and used for data analysis. KNIME incorporates an R plugin, enabling the use of the R language and its packages in workflows.

3. Results

3.1. Data quality: strategies for cleaning data

Data quality reflects the goodness of the evaluated data. Unfortunately, in spite of the efforts put into data entry (see Bürkle et al., this issue), errors still occurred and therefore data cleaning was necessary. In order to visualize and detect errors, automatic standard analyses (Table 1) were performed on entered data. They consisted of histograms, scatter plots and box plots generally used in descriptive statistics. After identification, the MARK-AGE database cleaning strategy includes (1) clearing of missing values, (2) removal of outliers and (3) detection of batch effects. All three types of errors could, if untreated, compromise the conclusions that are drawn from the data.

3.2. Dealing with missing values

Two main scenarios are responsible for missing data in the MARK-AGE database. On the one hand, a missing value can occur completely at random because the sample tube was broken, defrosted, lost, etc. On the other hand, missing data were introduced because specific parameters were only measured in low throughput analyses. In these cases, only 300 individuals were measured instead of all recruited subjects. A precise definition of how missing values occurred is therefore essential to determine the handling strategy. In addition, missing data analysis depends on the extent of missing values and their influence on covariates. A large percentage of missing information in a covariate, e.g., females, can lead to group under-representation and requires further investigation. As a consequence, selected parts of the database show different patterns of missing values and require adjusted missing value handling. Therefore a general replacement of missing data in the original database is not possible. The handling of missing values was performed in the individual downstream analysis, using either complete case analysis or a substitution method as described in the following.

If covariates are equally distributed and values are missing completely at random, a substitution is not obligatory. In this case a complete case analysis was performed, including only cases with data available for all parameters.

Table 1

Table of automated analyses that were available for the quality checks of parameters.

Report name | Outcome

Scatterplot with age | Scatterplot for a parameter against age with a linear regression line and the Pearson and Spearman correlation coefficients

Boxplot for selected subgroups | Boxplot for a parameter grouped by a pre-defined subgroup (e.g., age, recruitment center, gender)

Histogram | Simple histogram for one parameter

Correlation of two parameters | Scatterplot for two selected parameters and the Spearman correlation coefficient

Batch check recruitment time | Scatterplot of a parameter against the recruitment time

Batch check internal id | Scatterplot of a parameter against the time of biochemical analysis

Empirical distribution function for selected subgroups | Empirical distribution function for selected subgroups of a parameter with the information about the significant differences calculated with bootstrap analysis

To compensate for under-representation of a covariate, the missing values can be replaced with statistical estimates. One popular method is mean substitution, i.e., replacing the missing values by the average of available values. However, since mean substitution in general cannot be recommended due to biased results, more robust statistical methods like median substitution or multiple imputation should be chosen. Whenever replacement of missing values is necessary, it should be checked whether the substitution strategy affects the intended analysis. To accomplish this task, one can start from a representative complete data set in which missing values are introduced randomly. Since the number of missing values can be controlled in this scenario, the effect of the replacement strategy as a function of the number of missing values can be analyzed statistically. The resulting information is useful for deciding whether the replacement strategy should be applied to a specific data set. The substitution method was adjusted for each individual analysis.

As a representative example, we selected a data set without missing values from the MARK-AGE database and determined the Spearman correlation of two parameters (Fig. 1). The replacement steps were performed randomly and repeated a hundred times. After a replacement of 5% of all data values, the correlation differs significantly from the original outcome. This means that replacing more than 5% of missing values would affect the conclusions derived from the initial data. In this case a method other than median imputation must be found.
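The check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow used in the project: the data are synthetic, the function names (`average_ranks`, `spearman`, `imputation_effect`) are hypothetical, and the significance test reported in Fig. 1 (one-way ANOVA) is replaced here by simply reporting the mean shift of the correlation.

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)        # highest rank inside each group of ties
    first = last - counts + 1       # lowest rank inside each group of ties
    return ((first + last) / 2.0)[inverse]


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def imputation_effect(x, y, fraction, repeats=100, seed=0):
    """Correlations obtained after replacing `fraction` of x by the median."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n_replace = int(round(fraction * x.size))
    correlations = []
    for _ in range(repeats):
        x_sim = x.copy()
        idx = rng.choice(x.size, size=n_replace, replace=False)
        x_sim[idx] = np.nan                 # introduce missing values at random
        x_sim[idx] = np.nanmedian(x_sim)    # median of the remaining values
        correlations.append(spearman(x_sim, y))
    return np.array(correlations)


if __name__ == "__main__":
    # Synthetic stand-in for two correlated parameters (not MARK-AGE data).
    rng = np.random.default_rng(42)
    x = rng.normal(size=500)
    y = x + rng.normal(scale=0.5, size=500)
    original = spearman(x, y)
    for fraction in (0.01, 0.05, 0.10, 0.20):
        sims = imputation_effect(x, y, fraction)
        print(f"{fraction:.0%} replaced: mean shift {sims.mean() - original:+.4f}")
```

Running the loop for increasing fractions gives a rough picture of how quickly the correlation drifts away from its original value, which is the information needed to decide whether median imputation is acceptable for a given analysis.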

Fig. 1. Overview of the change in correlation between two randomly selected parameters after median imputation. Increasing numbers of data points were randomly removed from the original data set and replaced by the parameter's median. With increasing percentages of replaced data the calculated correlations deviate from the original value. The first significant deviation occurs for this example at 5% (one-way ANOVA, p < 0.05).

3.3. Dealing with outliers

Outlier detection is an important task to carry out prior to data analysis. Labelling methods for detecting outliers can be used if the distribution of a data set is difficult to identify or the data cannot be transformed into a proper distribution. We favoured the method based on the interquartile range over other classical approaches, such as standard deviation or z-score, because quartiles are more resistant to extreme values. Therefore an outlier was defined as any data point outside the range between the specified lower and upper quantiles (Fig. 2).

A general first check for extreme outliers is the inspection of values exceeding biological limits, for example a systolic blood pressure of 400. Such suspicious values were clearly included by mistake and were either corrected or, if that was not possible, excluded from data analysis. In cases where suspect values were either a legitimate part of the data or the cause was unclear, the data points were only excluded after having been identified as outliers by the interquartile range approach.

For standard analyses of inter-parameter correlations and linear modelling, 3% quantiles were defined for outlier removal. Because both the upper and lower quantiles were considered, a total of 6% of all data values were excluded for each parameter during analysis. Excluding this amount of data removes the most prominent outliers from the analysis, so that the results reflect reliable outcomes. If excluding data points decreases the total number of values too much, the analysis can again become distorted. Therefore, the minimal amount of data required for applying this outlier removal strategy was set to 300 data points.

In specific cases when only smaller subgroups of data sets are analyzed, it can happen that the selected values show very wide distributions. The quantile limits must then be adapted until the outcomes stay statistically stable. If parameters show too wide a distribution or fall below the minimum number of required values, they must be excluded from the specific analysis.
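The quantile rule above can be illustrated with a short Python sketch. The function name `remove_quantile_outliers` and its parameters are hypothetical, and the sketch assumes that the 300-point minimum refers to the number of values available before removal, which the text does not state explicitly.

```python
import numpy as np

MIN_POINTS = 300  # minimal amount of data stated in the text


def remove_quantile_outliers(values, q=0.03, min_points=MIN_POINTS):
    """Drop values below the q quantile or above the (1 - q) quantile.

    Returns None when fewer than `min_points` non-missing values are
    available (assumption: the limit applies before outlier removal).
    """
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing values
    if values.size < min_points:
        return None
    lower, upper = np.quantile(values, [q, 1.0 - q])
    return values[(values >= lower) & (values <= upper)]


if __name__ == "__main__":
    # Synthetic parameter with one implausible value (not MARK-AGE data).
    rng = np.random.default_rng(0)
    parameter = np.append(rng.normal(loc=120, scale=15, size=999), 400.0)
    cleaned = remove_quantile_outliers(parameter)
    print(parameter.size, "values before,", cleaned.size, "values after")
```

With `q=0.03` roughly 6% of the values are removed in total, matching the 3% lower and 3% upper quantiles described above; passing a smaller `q` such as 0.01 gives the intermediate step shown in Fig. 2B.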

3.4. Dealing with batch effects

A batch effect is the result of systematic error introduced artificially during sample processing. A simple method for detecting batch effects is to perform tests of association between the analyzed values and various experimental variables (e.g., reagent lots, laboratory conditions, personnel differences, processing day or changes in protocols) to see if there are a large number of artificial associations across the data.

In the MARK-AGE study about 3000 subjects were recruited (see Bürkle et al., this issue). In order to obtain the necessary aliquots, body fluids from each subject were processed according to the sampling procedures (see Capri and Moreno-Villanueva et al., this issue) and shipped to the analytic centers. Depending on the assays used, the samples were processed at once, in batches or continuously.

To detect batch effects all measured parameters were examined, either against the date of sampling at the recruitment center or against the processing order or upload day at the analytic center. A straight line that best represents the data on a scatter plot (line of best fit) was used to visualize relationships among variables. In case of batches the maximal rate of change (slope) of the line differs considerably from zero (Fig. 3A), indicating a batch problem. In order to determine not only continuous trends but also discontinuous changes, the points on the line at which changes take place were calculated. The data located between two such points were considered a batch.
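As a rough illustration of the two checks described above, the following Python sketch computes the slope of the line of best fit against the processing order and searches for a single step-like change. It is a simplified stand-in rather than the algorithm used in the project: the function names are hypothetical, only one change point is located, and the least-squares split criterion is an assumption.

```python
import numpy as np


def trend_slope(order, values):
    """Slope of the least-squares line of best fit through (order, values)."""
    slope, _intercept = np.polyfit(order, values, deg=1)
    return float(slope)


def best_step(values, min_size=10):
    """Index that best splits `values` into two segments with different means.

    Simple least-squares change-point search; returns (index, mean shift).
    """
    values = np.asarray(values, dtype=float)
    if values.size < 2 * min_size:
        raise ValueError("not enough data points for a change-point search")
    best_idx, best_cost = None, np.inf
    for i in range(min_size, values.size - min_size + 1):
        left, right = values[:i], values[i:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_idx, best_cost = i, cost
    shift = values[best_idx:].mean() - values[:best_idx].mean()
    return best_idx, float(shift)


if __name__ == "__main__":
    # Synthetic measurements with an artificial shift after sample 200.
    rng = np.random.default_rng(0)
    measured = rng.normal(loc=10.0, scale=1.0, size=400)
    measured[200:] += 3.0
    order = np.arange(measured.size)
    print("slope per processed sample:", trend_slope(order, measured))
    print("change point and shift:", best_step(measured))
```

A slope clearly different from zero flags the parameter for inspection, and the change point marks the boundary between two candidate batches. Applying the search again within each segment would be one way to locate several batches.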

Fig. 2. Scatter plots showing the distribution of a representative parameter and the removal of outliers with the interquartile approach. In the graphs the representative parameter is plotted against the age of the subjects expressed as AgeDaysAsYears. The graphs represent the original distribution (A) of the parameter and the distribution after the exclusion of 1% quantiles (B) and 3% quantiles (C). It is apparent that outliers were removed in a stepwise fashion by using this technique.

Since the origins of batch effects can be many and varied, a general repair mechanism is not advisable. In fact, any such mechanism is based on a mathematical model of the error, and as long as the model does not reflect the nature of the mistake, modifying the data will not improve the quality. If the error model was known, the batch effect was corrected in the database (Fig. 3B). When no error model was available, we excluded suspicious parameters displaying large batch problems with unexplainable noise from further analysis.

4. Data visualization and sharing

Monitoring and communication of data is an essential step for the successful completion of big data projects. A strategy is required that yields fast and reliable results which can be distributed under high safety conditions. In the MARK-AGE project KNIME was used as the standard tool for (1) data query, (2) data visualization and (3) data distribution.

4.1. Data query

For the multifunctional analyses performed in the project, different sets of parameters and covariates must be requested regularly from the database. Pre-programmed KNIME nodes were used to generate a clearly structured tool to filter for desired conditions. Fig. 4 shows the standard view of the programmed selection menu for covariates. The user can choose the required group conditions from the assortment. In a second window a list of all available parameters appears. They can be selected by partner number, by work package number, or manually. If more selection options were necessary for special analyses, they could easily be added by the administrator. In this way a powerful tool was generated and adjusted to the specific requirements of the MARK-AGE database for fast and easy data query.

Fig. 3. A representative example for the successful correction of a batch effect. The graphs show scatter plots for a parameter x against the measuring order. The line of best fit (black) has a slope which is recognized by the first deviation. As in such cases the algorithm automatically adds the term ATTENTION to the graph, affected parameters can easily be distinguished from clean parameters. The batch in the example occurs for a defined time because the normalization of the variables was forgotten (A). This was documented by the staff in the respective laboratory. Therefore the normalization could be performed subsequently. After the calculation the parameter is free from any batch effect (B).

Fig. 4. Screenshot of the standard selection menu to query subgroups from the database. The desired conditions can easily be selected in one step.

4.2. Data visualization

Visualization of data in the MARK-AGE project is important for two reasons: (1) uploaded data must be monitored and controlled for an early recognition and prevention of problems, and (2) researchers need to extract the biological information from the data.

4.2.1. Data exploration

For data exploration, general plotting tools from descriptive statistics were used (Table 1). KNIME workflows were designed to automatically generate the graphs for all parameters available in the database. With those analyses, experts in the field could test the data for plausibility and correctness. They also provided evidence for the underlying distribution and, as described above, for the quality of the parameters.

4.2.2. Extract biological information

For the extraction of inter-parameter dependencies in the MARK-AGE database, a tool to visualize correlations was necessary. A typically used numerical correlation matrix, calculated over a whole database, would have been too large for the extraction of significant information. Therefore a KNIME workflow was established that automatically generates a network with a pre-selected parameter in the middle and the top ten correlating parameters arranged around it (Fig. 5). To clearly visualize the dependency between the parameters, information is represented by the length and color of the connection lines, indicating the correlation strength and direction. The available sample number is indicated by the size of the parameter circles. Through this arrangement the user is able to perform plausibility checks for known dependencies at a glance. Thus efficient detection of unknown or rather unexpected dependencies in the data is possible. As an upstream selection menu allows the separation of different subgroups, an additional stratification of the data is possible. Changes in the correlations between different subgroups give a hint for the interaction of specific body systems or the influence of environmental factors. The identified dependencies can lead to the development of new hypotheses and further detailed analysis.

Fig. 5. Network analysis to check for correlations. The automated analysis presents the 10 best correlating parameters (A–J) for the parameter X, selected by the user. The color of the lines reflects the direction of the correlation (dark grey: positive correlation; light grey: negative correlation). The length of the lines indicates the strength of the correlation (the longer the line, the weaker the correlation). The size of the circles stands for the amount of data considered for the calculation (the larger the diameter, the more data are available). Which subgroups should be considered in the analysis is defined in the selection menu (see Fig. 4).
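The ranking that underlies such a network view can be sketched as follows. This is an illustrative Python example, not the KNIME workflow itself: the function names and the dictionary-based data layout are assumptions, the use of the Spearman coefficient is an assumption borrowed from Table 1, and the drawing of the network is omitted. For each partner parameter the sketch returns the quantities that the figure encodes, namely the correlation (sign for color, magnitude for line length) and the number of usable samples (circle size).

```python
import numpy as np


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float).ravel()
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)
    first = last - counts + 1
    return ((first + last) / 2.0)[inverse]


def top_correlations(data, target, k=10):
    """Rank all other parameters by |Spearman correlation| with `target`.

    `data` maps parameter names to 1-D arrays of equal length (NaN = missing).
    Returns a list of (name, correlation, number of complete pairs).
    """
    x = np.asarray(data[target], dtype=float)
    results = []
    for name, column in data.items():
        if name == target:
            continue
        y = np.asarray(column, dtype=float)
        mask = ~np.isnan(x) & ~np.isnan(y)      # pairwise complete cases
        n = int(mask.sum())
        if n < 3:
            continue
        rho = np.corrcoef(average_ranks(x[mask]), average_ranks(y[mask]))[0, 1]
        if np.isnan(rho):                       # constant column, no correlation
            continue
        results.append((name, float(rho), n))
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results[:k]


if __name__ == "__main__":
    # Synthetic parameters (not MARK-AGE data).
    rng = np.random.default_rng(0)
    base = rng.normal(size=300)
    data = {"X": base}
    for i, noise in enumerate((0.2, 0.5, 1.0, 2.0, 5.0)):
        data[f"P{i}"] = base * (-1) ** i + rng.normal(scale=noise, size=300)
    for name, rho, n in top_correlations(data, "X", k=3):
        print(name, round(rho, 3), n)
```

Restricting `data` to a subgroup before calling `top_correlations` corresponds to the upstream selection menu, so that rankings for different subgroups can be compared.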

4.2.3. Data sharing

In order to provide MARK-AGE partners with graphical results based on the ongoing analysis, the KNIME team space and the KNIME WebPortal were established.

4.2.3.1. KNIME team space. To share the database, meta nodes and analysis workflows, the KNIME team space was used. Workflows were placed on a dedicated MARK-AGE server. Access to the workflows was restricted to team members responsible for the coordination of the MARK-AGE analysis. These members could not only access the uploaded workflows but also provide other colleagues with self-generated workflows. The workflows provided could be downloaded as a local copy. Original versions, however, could only be modified by their authors. Thus, the KNIME team space provided a collaborative information exchange in a documented manner and under high-level security conditions.

4.2.3.2. KNIME WebPortal. The KNIME team space was restricted to members responsible for the coordination of the MARK-AGE analysis. Therefore the KNIME WebPortal (Fig. 6) was used, enabling all project partners to receive analysis feedback on their own measurements. The KNIME WebPortal was connected to the MARK-AGE server and workflows were organized in specified folders. Users could select the desired analysis conditions from a menu (Fig. 4), and pre-programmed, automatic algorithms ran in the background. Afterwards users could download graphical results as a pdf, xls, pptx, or Word document (Fig. 7). This strategy ensures the blinding of the MARK-AGE project, since the subject codes linking the subject with the analysis results remain protected.

Fig. 6. Screenshot of the KNIME WebPortal user interface with the list of available analyses that could be selected. Only activated users can log in and receive analyses from the automatic reports.

5. Discussion

In this paper we describe how the data cleaning and visualization processes were performed during the MARK-AGE project. Problems were identified especially with regard to outliers, missing values and batch effects. Some of our observations represent already known problems that can occur in databases, but published handling strategies cannot be used directly on MARK-AGE data. Therefore adjustments to the specific project's requirements are reported. In addition, new developments are described that could serve for prevention and as prototype examples in upcoming ageing studies.

As mentioned under Section 2, the MARK-AGE database contains several types of information, i.e., (1) values from demographic data, (2) values from anthropometric measurements and (3) values from analytic data (see Bürkle et al., this issue). During data analysis a main challenge was to deal with and prepare the different kinds of data for reliable analysis. While information on subjects was collected in questionnaires during interviews, bioanalytical data were obtained from biological material analysed in the corresponding laboratories. Both questionnaires and analytic data are error-prone, and even our best efforts in the project design could not prevent such errors. Several circumstances can lead to missing values, outliers or batch effects, which in turn can significantly impact statistical analysis. Therefore strategies for cleaning data are necessary. However, there is no standard strategy available, and the procedures to be used mainly depend on the type of data and type of analysis. In order to check the effectiveness of data cleaning strategies, Dasu and Loh introduced the concept of statistical distortion. They argued that data cleaning strategies could have an impact on results, so that the cleaned data no longer represent the real process that generated the data. Therefore, 'cleaner' data does not necessarily mean more useful or usable data (Dasu and Loh, 2012).

In case of missing values, most statistical procedures automatically exclude the subjects concerned. This leads to a reduction of the data available for performing statistical analysis. As a result, outcomes may not be statistically significant due to a lack of statistical power. Furthermore, missing values can also cause misleading results by introducing bias. Due to the nature of the MARK-AGE project, in some cases the reason for missing values was difficult to identify. Often it is not obvious whether missing data will cause a problem. In some cases results might be affected, while others stay unchanged.

As for missing values, variations that arise from outliers can influence the statistical outcome and the reliability of data. Most criteria for identifying possible outliers are effective if data possess a normal distribution. Although these methods are powerful, it may be problematic to apply them to non-normally distributed data or small sample sizes without information about their characteristics. Outliers can occur due to biological variance or technical reasons. In the MARK-AGE project the inter-quartile approach was used in order to eliminate the most severe cases. However, in doing so we might exclude the most interesting subjects, namely those representing a real biological outlier. Unfortunately, in some cases it is not possible to determine the background of such an outlier. By contrast, batch effects are a source of non-biological variation. Although there are methods available to fit different slopes or steps in a distribution, these should not be applied blindly to each situation. Furthermore, during the analysis of batch effects it is essential to check whether outliers and batch effects overlap.

Fig. 7. A representative example for the outcome of an analysis on the KNIME WebPortal. The workflow runs in the background and provides the user with graphs and tables. Obtained results can be saved in one of the listed file formats (bottom left corner).

Due to the huge amount and different types of data generated during the MARK-AGE project, a manual database exploration would have been time consuming. However, automated data exploration tools cannot be applied in each case. We concluded that there is a need for useful and powerful tools that automate the data cleaning process, or at least assist manual procedures. However, automated methods can only be part of the procedure. There is an ongoing need for the development of new tools to assist this process, especially for use in best practice routines.

Last but not least, data sharing is indispensable in large research consortia such as MARK-AGE. Budin-Ljøsne and colleagues describe the challenges of sharing data based on their experiences in the European Network for Genetic and Genomic Epidemiology, ENGAGE (Budin-Ljosne et al., 2014). The science community expects authors to share research results; therefore there is a need for data archiving, especially when the research deals with health issues or public policy formation. Data archiving means the storing of large amounts of data to be accessed from central locations. During the MARK-AGE project the KNIME WebPortal was used to share results within the Consortium. Access to the WebPortal is restricted to MARK-AGE members, but this could be opened to the scientific community in the future. If this happens, it is necessary for each user to understand the underlying facts about the database development (see Baur et al., Kötter and Moreno-Villanueva et al., this issue) and possible quality issues described in this work. Especially for biogerontologists, not only the opening of the database but also upcoming publications on the MARK-AGE data will be of great interest. The interpretation of those results as well as the development of new hypotheses will mainly rely on the quality of and the careful work on the database itself. Information on the age-dependent parameters identified in the MARK-AGE project might also be linked with other databases like the HAGR (Tacutu et al., 2013). But even prior to that, the conclusions reached during the data cleaning and visualization steps in MARK-AGE can be used to prevent errors in upcoming studies investigating the human ageing process.

Acknowledgements

We wish to thank the European Commission for financial support through the FP7 large-scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880) and all MARK-AGE Consortium partners for the excellent collaboration.

References

Batini, C., Cappiello, C., Francalanci, C., 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, http://dx.doi.org/10.1145/1541880.1541883

Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge. Springer-Verlag, Heidelberg-Berlin.

Budin-Ljosne, I., Isaeva, J., Knoppers, B.M., Tasse, A.M., Shen, H.Y., McCarthy, M.I., Harris, J.R., 2014. Data sharing in large research consortia: experiences and recommendations from ENGAGE. Eur. J. Hum. Genet. 22 (3), 317–321.

Chapman, A.D., 2005. Principles and methods of data cleaning – primary species and species-occurrence data. In: Version 1.0, Report for the Global Biodiversity Information Facility. Global Biodiversity Information Facility, Copenhagen.

Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., de Magalhaes, J.P., 2015. The Digital Ageing Atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.

Dasu, T., Loh, J.M., 2012. Statistical Distortion: Consequences of Data Cleaning. Proceedings of the VLDB Endowment. Z.M. Özsoyoğlu. Istanbul, Turkey, VLDB 2012. 5, 1674–1683.

Fan, W., Geerts, F., Xibei, J., Kementsietsides, A., 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33 (2), 1–48.

Hellerstein, J.M., 2008. Quantitative Data Cleaning for Large Databases. Survey for the United Nations Economic Commission for Europe. http://db.cs.berkeley.edu/jmh

Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5 (3), 299–314.

Juran, J.M., Gryna, F.M., Bingham, R.S., 1974. Quality Control Handbook, 3rd edition. McGraw-Hill, New York, ISBN 0070331758.

Maletic, J.I., Marcus, A., 2000. Data Cleansing: Beyond Integrity Analysis. Massachusetts Institute of Technology, Boston, pp. 200–209.

Mayfield, C., Neville, J., Prabhaker, S., 2010. ERACER: a database approach for statistical inference and data cleaning. SIGMOD, 75–86.

Orr, K., 1998. Data Quality and Systems Theory. CACM 41 (2), 66–71.

Pearson, A.A., 2004. Relational Databases. Current Protocols in Bioinformatics, supplement 7, 9.4.1–9.4.25.

Redman, T., 1998. The impact of poor data quality on the typical enterprise. CACM 41 (2), 79–82.

Schroeder, W., Martin, K., Lorensen, B., 2003. The visualisation toolkit. In: A System for Guided Data Repair. SIGMOD. Prentice Hall PTR, New Jersey, pp. 1223–1226.

Tacutu, R., Craig, T., Budovsky, A., Wuttke, D., Lehmann, G., Taranukha, D., Costa, J., Fraifeld, V.E., De Magalhaes, J.P., 2013. Human Ageing Genomic Resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033.

Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., 2010. GDR: a system for guided data repair. SIGMOD 10. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, New York, pp. 1223–1226.

Yu, K., Salomon, R., 2010. Peptidedepot: flexible relational database for visual analysis of quantitative proteomic data and integration of existing protein information. Proteomics 9 (23), 5350–5358.