
MARK-AGE data management : Cleaning, exploration and visualization of data


Mechanisms of Ageing and Development 151 (2015) 38–44

Contents lists available at ScienceDirect

Mechanisms of Ageing and Development

journal homepage: www.elsevier.com/locate/mechagedev

MARK-AGE data management: Cleaning, exploration and visualization of data

Jennifer Baur a, Maria Moreno-Villanueva a, Tobias Kötter b, Thilo Sindlinger a, Alexander Bürkle a,∗, Michael R. Berthold b, Michael Junk c

a Chair for Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany

b Chair for Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany

c Department for Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany

Article info

Article history:
Received 30 January 2015
Received in revised form 13 May 2015
Accepted 18 May 2015
Available online 21 May 2015

Keywords: Data cleaning; Missing data; Batch effects; Outliers; Data visualization

Abstract

Databases are organized collections of data and are necessary to investigate a wide spectrum of research questions. When evaluating data, analysts should be aware of possible data quality problems that can compromise the validity of results. Data cleaning, which deals with the identification and correction of errors in order to improve data quality, is therefore an essential part of the data management process.

In our cross-sectional study, biomarkers of ageing as well as analytical, anthropometric and demographic data from about 3000 volunteers have been collected in the MARK-AGE database. Although several preventive strategies were applied before data entry, errors such as miscoding, missing values and batch problems could not be avoided completely. Such errors can result in misleading information and affect the validity of the data analysis performed.

Here we present an overview of the methods we applied for dealing with errors in the MARK-AGE database. In particular, we describe our strategies for the detection of missing values, outliers and batch effects and explain how they can be handled to improve data quality. Finally, we report on the tools used for data exploration and data sharing between MARK-AGE collaborators.

© 2015 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Today's high-throughput techniques enable the generation of huge data volumes in relatively short time periods. With increasing amounts of aggregated information, the issue of databases becomes more and more important. A database is a collection of data in an organized form, and databases can be categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables. This type is widely used in the fields of genomics and proteomics, where large amounts of data must be stored for a single subject (Pearson, 2004; Yu and Salomon, 2010). Software programs that enable the user to store, modify and extract information from a database are available, the so-called database management systems (DBMS). They are especially designed to provide an interaction between the user, other applications, and the database itself. Typical software programs that allow the creation and administration of relational databases are Microsoft SQL Server, IBM DB2 or Oracle.

∗ Corresponding author. Tel.: +497531884035; fax: +4907531884033.

E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).

To ensure that data are well organized, entered in the correct format and annotated, a data management plan should be prepared before the beginning of a study. Accurate planning covers data handling not only during data collection but also after the project is completed. However, even the best efforts made in a project's design to avoid errors during data collection cannot prevent incorrect or incomplete data. Errors in real-world data are common and are to be expected (Orr, 1998; Redman, 1998). Misleading or missing information in databases prevents the confirmation of results and conclusions after data interpretation and analysis. Data cleaning is therefore an essential step of the Information Management Chain before data are stored and analyzed (Chapman, 2005). In many cases error sources are not clearly detectable and occur in a variety of fashions. Obvious examples are data entry errors, measurement errors or data integration errors (Hellerstein, 2008).

Manual cleaning of data is laborious and time consuming, and in itself prone to errors (Maletic and Marcus, 2000). Data cleaning strategies including the use of machine learning for guided database repair (Yakout et al., 2010), inferring and imputing of missing values (Mayfield et al., 2010) and resolving of inconsistencies using functional dependencies (Fan et al., 2008) have been described before. Furthermore, Batini and collaborators provide several general methodologies for the improvement of data quality (Batini et al., 2009). After cleaning, data are of high quality if they are fit for their intended uses in operations, "decision making and planning" (Juran et al., 1974).

http://dx.doi.org/10.1016/j.mad.2015.05.007

0047-6374/© 2015 Elsevier Ireland Ltd. All rights reserved.

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-295717

Published in: Mechanisms of Ageing and Development; 151 (2015), pp. 38-44. https://dx.doi.org/10.1016/j.mad.2015.05.007

Controlled, high-quality data do not impair the accuracy and efficiency of data analysis and can be further processed by users.

The process of transforming data into sensory stimuli and visual images is called data visualisation (Schroeder et al., 2003). Powerful charts, diagrams or maps provide solutions to explore, analyse, and present data. Furthermore, data visualisation is an important tool for effective data communication in large research consortia. However, data can take different forms such as numbers, graphs, images, or texts, and their visualisation entails some challenges. Finally, stored and analyzed data might be requested by other researchers. As a result, beneficial strategies for data sharing are also necessary. In the field of ageing research, data from databases are typically presented over the internet. One well-known example is the Digital Ageing Atlas, where age-related data of different biological levels were collected from the literature, stored and provided on a web page (Craig et al., 2015). Each individual parameter can be requested with a definition according to age and a connection to the raw data. As a broad collection of data increases the chance to obtain powerful results, the Human Ageing Genomic Resources combine three databases linking aspects of age-related genetic and evolutionary studies on their web pages (Tacutu et al., 2013). Data between all parts are linked, and users can simply request the information of interest on the user-friendly interface.

This work describes the MARK-AGE data management efforts, focusing on strategies such as data retrieval, identification of errors and assurance of data quality. Furthermore, we created automatic reports on a web portal using the Konstanz Information Miner (KNIME) as an interface for data sharing and visualization.

2. Material

2.1. MARK-AGE database

The MARK-AGE database is a relational database that was established using Structured Query Language (SQL) (see Kötter and Moreno-Villanueva et al., this issue) and prepared for usage as described in Baur et al. (this issue). SQL is a commonly used database language that allows data to be retrieved from a database quickly and efficiently. The language can be used not only to create databases but also for updating, retrieving and sharing data with other users. The MARK-AGE database itself consists of analytical, anthropometric and demographic data collected from about 3300 subjects recruited across Europe (see Bürkle et al., this issue).

2.2. Konstanz Information Miner (KNIME)

The Konstanz Information Miner is a modular environment which enables visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes (Berthold et al., 2007).

2.3. KNIME team space

KNIME can be used in a team to share work with other researchers by keeping data files and data analysis workflows in one central shared place. Meta nodes containing pre-programmed workflows can also be used by all team members in their local workflows as a reference to the centrally stored version (https://www.knime.org/knime-teamspace).

2.4. KNIME server and web portal

KNIME Server allows storing workflows and accessing them from anywhere via the internet. User access rights control how data is grouped for projects, workgroups or departments. The web portal is a convenient way to distribute preconfigured workflows, created by administrative users, to all end users (https://www.knime.org/knime-server).

2.5. Programming language R

R is a freely available scripting language for statistical computing and the generation of high-quality graphs (Ihaka and Gentleman, 1996). A wide choice of pre-programmed R packages can easily be installed and used for data analysis. KNIME incorporates an R plugin, enabling the use of the R language and its packages in workflows.

3. Results

3.1. Data quality: strategies for cleaning data

Data quality reflects the goodness of the evaluated data. Unfortunately, in spite of the efforts put into data entry (see Bürkle et al., this issue), errors still occurred and therefore data cleaning was necessary. In order to visualize and detect errors, automatic standard analyses (Table 1) were performed on the entered data. They consisted of histograms, scatter plots and boxplots as generally used in descriptive statistics. After error identification, the MARK-AGE database cleaning strategy included (1) handling of missing values, (2) removal of outliers and (3) detection of batch effects. All three types of errors could, if untreated, compromise the conclusions that are drawn from the data.

3.2. Dealing with missing values

Two main scenarios are responsible for missing data in the MARK-AGE database. On the one hand, a value can be missing completely at random because the sample tube was broken, defrosted, lost, etc. On the other hand, missing data were introduced because specific parameters were only measured in low-throughput analyses. In these cases only 300 individuals were measured instead of all recruited subjects. A precise definition of how missing values occurred is therefore essential to determine the handling strategy. In addition, missing data analysis depends on the extent of missing values and their influence on covariates. A large percentage of missing information in a covariate group, e.g., females, can lead to under-representation of that group and requires further investigation. As a consequence, selected parts of the database show different patterns of missing values and require adjusted missing value handling. Therefore a general replacement of missing data in the original database is not possible. The handling of missing values was performed in the individual downstream analysis, using either complete case analysis or a substitution method as described in the following.

If covariates are equally distributed and values are missing completely at random, a substitution is not obligatory. In this case a complete case analysis was performed, including only data available for all parameters.

To compensate for under-representation of a covariate, the missing values can be replaced with statistical estimates. One popular method is mean substitution, i.e., replacing the missing values by the average of the available values. However, since mean substitution in general cannot be recommended due to biased results, more robust statistical methods like median substitution or multiple imputation should be chosen. Whenever replacement of missing values is necessary, it should be checked whether the substitution strategy affects the intended analysis. To accomplish this task, one can start from a representative complete data set in which missing values are introduced randomly. Since the number of missing values can be controlled in this scenario, the effect of the replacement strategy as a function of the number of missing values can be analyzed statistically. The resulting information is useful for deciding whether the replacement strategy should be applied to a specific data set. The substitution method was adjusted for each individual analysis.

Table 1
Automated analyses that were available for the quality checks of parameters.

| Report name | Outcome |
| --- | --- |
| Scatter plot with age | Scatter plot for a parameter against age with a linear regression line and the Pearson and Spearman correlation coefficients |
| Boxplot for selected subgroups | Boxplot for a parameter grouped for a pre-defined subgroup (e.g., age, recruitment center, gender) |
| Histogram | Simple histogram for one parameter |
| Correlation of two parameters | Scatter plot for two selected parameters and the Spearman correlation coefficient |
| Batch check recruitment time | Scatter plot of a parameter against the recruitment time |
| Batch check internal id | Scatter plot of a parameter against the time of biochemical analysis |
| Empirical distribution function for selected subgroups | Empirical distribution function for selected subgroups of a parameter with the information about the significant differences calculated with bootstrap analysis |

As a representative example, we selected a data set without missing values from the MARK-AGE database and determined the Spearman correlation of two parameters (Fig. 1). The replacement steps were performed randomly and repeated a hundred times. After a replacement of 5% of all data values, the correlation differs significantly from the original outcome. This means that replacing more than 5% of the values would affect the conclusions derived from the initial data. In this case a method other than median imputation must be found.
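The check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow used in the project: the data are synthetic, the function names (`ranks`, `spearman`, `imputed_correlations`) are hypothetical, and the one-way ANOVA used for Fig. 1 is replaced here by simply reporting the mean and spread of the correlation over 100 repeats.

```python
import random
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def imputed_correlations(x, y, fraction, repeats=100, seed=0):
    """Spearman correlations after a random fraction of x is replaced by the median."""
    rng = random.Random(seed)
    n_replace = round(fraction * len(x))
    results = []
    for _ in range(repeats):
        removed = set(rng.sample(range(len(x)), n_replace))
        # The median is taken from the values that are still "observed".
        median = statistics.median(v for i, v in enumerate(x) if i not in removed)
        x_imputed = [median if i in removed else v for i, v in enumerate(x)]
        results.append(spearman(x_imputed, y))
    return results


# Synthetic stand-in for a complete data set (assumption, not MARK-AGE data).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [a + rng.gauss(0, 1) for a in x]
original = spearman(x, y)
print(f"original correlation: {original:.3f}")
for fraction in (0.01, 0.05, 0.10, 0.20):
    rhos = imputed_correlations(x, y, fraction)
    print(f"{fraction:4.0%} replaced: mean {statistics.fmean(rhos):.3f}, "
          f"sd {statistics.stdev(rhos):.3f}")
```

Comparing the distribution of the repeated correlations at each replacement level with the original value gives a rough indication of how much imputation a given analysis tolerates; a formal test would be needed to call a deviation significant.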

3.3. Dealing with outliers

Outlier detection is an important task to complete prior to data analysis. Labelling methods for detecting outliers can be used if the distribution of a data set is difficult to identify or the data cannot be transformed into a suitable distribution. We favoured the method based on the interquartile range over other classical approaches, such as standard deviation or z-score, because quartiles are more resistant to extreme values. Therefore an outlier was defined as any data point outside the range between the specified lower and upper quantiles (Fig. 2).

Fig. 1. Overview of the change in correlation between two randomly selected parameters after median imputation. Increasing numbers of data points were randomly removed from the original data set and replaced by the parameter's median. With increasing percentages of replaced data the calculated correlations deviate from the original value. The first significant deviation occurs for this example at 5% (one-way ANOVA, p < 0.05).

A general first check for extreme outliers is the inspection of values exceeding biological limits, for example a systolic blood pressure of 400. Such suspicious values were clearly included by mistake and were either corrected or, if this was not possible, excluded from data analysis. If suspect values were either a legitimate part of the data or their cause was unclear, the data points were only excluded after being identified as outliers by the interquartile range approach. For standard analyses of inter-parameter correlations and linear modelling, 3% quantiles were defined for outlier removal. Because both the upper and lower quantiles were considered, a total of 6% of all data values were excluded for each parameter during analysis. Excluding this amount of data removes the most prominent outliers from the analysis, so that the results reflect reliable outcomes. If the total number of values is reduced too much by excluding data points, the analysis can again become distorted. Therefore, the minimal amount of data required for applying this outlier removal strategy was set to 300 data points.

In specific cases, when only smaller subgroups of data sets are analyzed, the selected values can show very broad distributions. The quantile limits must then be adapted until the outcomes remain statistically stable. If parameters show distributions that are too broad or fall below the minimal number of required values, they must be excluded from the specific analysis.
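The two-step procedure described in this section (plausibility check, then quantile-based trimming with a minimum of 300 data points) might look as follows in Python. This is only a sketch: the function names, the linear-interpolation quantile, and the convention of returning `None` when too few values are available are assumptions made for illustration, not details taken from the MARK-AGE workflows.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def remove_outliers(values, tail=0.03, min_points=300, plausible=None):
    """Drop implausible values, then values outside the [tail, 1 - tail] quantiles.

    plausible is an optional (low, high) pair of biological limits.
    Returns None when fewer than min_points values are available, which
    signals that the parameter should be excluded from the analysis.
    """
    if plausible is not None:
        low, high = plausible
        values = [v for v in values if low <= v <= high]
    if len(values) < min_points:
        return None
    ordered = sorted(values)
    lower_cut = quantile(ordered, tail)
    upper_cut = quantile(ordered, 1 - tail)
    return [v for v in values if lower_cut <= v <= upper_cut]


# Hypothetical example: 1000 evenly spread values plus one impossible entry.
readings = list(range(1000)) + [4000]
cleaned = remove_outliers(readings, tail=0.03, plausible=(0, 2000))
print(len(cleaned), min(cleaned), max(cleaned))
```

With `tail=0.03` roughly 3% of the values are removed at each end, which corresponds to the 6% total exclusion mentioned above; for small subgroups the `tail` argument would be adapted as described.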

3.4. Dealing with batch effects

A batch effect is the result of a systematic error introduced artificially during sample processing. A simple method for detecting batch effects is to perform tests of association between the analyzed values and various experimental variables (e.g., reagent lots, laboratory conditions, personnel differences, processing day or changes in protocols) to see whether there are a large number of artificial associations across the data.

In the MARK-AGE study about 3000 subjects were recruited (see Bürkle et al., this issue). In order to obtain the necessary aliquots, body fluids from each subject were processed according to the sampling procedures (see Capri and Moreno-Villanueva et al., this issue) and shipped to the analytic centers. Depending on the assays used, the samples were processed at once, in batches or continuously.

To detect batch effects, all measured parameters were examined against either the date of sampling at the recruitment center or the processing order or upload day at the analytic center. A straight line that best represents the data on a scatter plot (line of best fit) was used to visualize relationships among variables. In the case of batches the maximal rate of change (slope) of the line differs considerably from zero (Fig. 3A), indicating a batch problem. In order to determine not only continuous trends but also discontinuous changes, the points on the line at which changes take place were calculated. The data located between two such points were considered a batch.
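A slope-based batch check of this kind can be sketched as follows. The sketch fits an ordinary least-squares line of the measured values against the processing order and flags the parameter when the slope is clearly different from zero. The t-statistic threshold of 3.0 and the function names are assumptions chosen for illustration; the label ATTENTION mirrors the flag mentioned in the caption of Fig. 3. Detection of discontinuous change points is not covered here.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit; returns slope, intercept and the slope's t statistic."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(residual / (n - 2) / sxx)
    if std_err > 0:
        t_stat = slope / std_err
    else:
        # Perfect fit: a non-zero slope is then as significant as it can be.
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    return slope, intercept, t_stat


def batch_check(order, values, t_threshold=3.0):
    """Label a parameter ATTENTION when its trend over processing order is strong."""
    slope, _, t_stat = fit_line(order, values)
    label = "ATTENTION" if abs(t_stat) > t_threshold else "ok"
    return label, slope


# Hypothetical measurements: one stable parameter and one with a steady drift.
order = list(range(300))
clean = [5.0 + (-1) ** i for i in order]
drifting = [value + 0.01 * i for i, value in zip(order, clean)]
print(batch_check(order, clean))
print(batch_check(order, drifting))
```

In practice the same check would be run once against the recruitment date and once against the processing order, since a batch can originate at either stage.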

Fig. 2. Scatter plots showing the distribution of a representative parameter and the removal of outliers with the interquartile approach. In the graphs the representative parameter is plotted against the age of the subjects expressed as AgeDaysAsYears. The graphs represent the original distribution (A) of the parameter and the distribution after the exclusion of 1% quantiles (B) and 3% quantiles (C). It is apparent that outliers were removed in a stepwise fashion by using this technique.

Since the origins of batch effects can be many and varied, a general repair mechanism is not advisable. In fact, any such mechanism is based on a mathematical model of the error, and as long as the model does not reflect the nature of the mistake, modifying the data will not improve the quality. If the error model was known, the batch effect was corrected in the database (Fig. 3B). When no error model was available, we excluded suspicious parameters displaying huge batch problems with unexplainable noise from further analysis.

4. Data visualization and sharing

Monitoring and communication of data is an essential step for the successful completion of big data projects. A strategy is required to obtain fast and reliable results that can be distributed under high safety conditions. In the MARK-AGE project KNIME was used as the standard tool for (1) data query, (2) data visualization and (3) data distribution.

4.1. Data query

For the multifunctional analyses performed in the project, different sets of parameters and covariates must be requested regularly from the database. Pre-programmed KNIME nodes were used to generate a clearly structured tool to filter for desired conditions. Fig. 4 shows the standard view of the programmed selection menu for covariates. The user can choose the required group conditions from the assortment. In a second window a list of all available parameters appears. They can be selected by partner number, by work package number, or manually. If more selection options were necessary for a special analysis, they could easily be added by the administrator. In this way a powerful tool was generated and adjusted to the specific requirements of the MARK-AGE database for fast and easy data query.

Fig. 3. A representative example for the successful correction of a batch effect. The graphs show scatter plots for a parameter x against the measuring order. The line of best fit (black) offers a slope which is recognized by the first deviation. As the algorithm automatically adds the term ATTENTION to the graph in such cases, affected parameters can easily be distinguished from clean parameters. The batch in the example occurs for a defined time because the normalization of the variables was forgotten (A). This was documented by the staff in the respective laboratory. Therefore the normalization could be performed subsequently. After the calculation the parameter is free from any batch effect (B).

Fig. 4. Screenshot of the standard selection menu to query subgroups from the database. The desired conditions can easily be selected in one step.
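The covariate-based selection described in Section 4.1 ultimately translates into SQL queries against the relational database. The following sketch shows the idea with Python's built-in `sqlite3` module. The table and column names (`subjects`, `measurements`, `gender`, `age`, `parameter`, `value`) and the sample rows are purely hypothetical; the actual MARK-AGE schema and the KNIME nodes used to query it are not described in this paper.

```python
import sqlite3


def query_parameter(conn, parameter, gender=None, min_age=None, max_age=None):
    """Return (subject_id, age, value) rows for one parameter and optional covariate filters."""
    sql = (
        "SELECT s.subject_id, s.age, m.value "
        "FROM subjects AS s "
        "JOIN measurements AS m ON m.subject_id = s.subject_id "
        "WHERE m.parameter = ?"
    )
    args = [parameter]
    # Filters are appended as placeholders so that values are never pasted into the SQL text.
    if gender is not None:
        sql += " AND s.gender = ?"
        args.append(gender)
    if min_age is not None:
        sql += " AND s.age >= ?"
        args.append(min_age)
    if max_age is not None:
        sql += " AND s.age <= ?"
        args.append(max_age)
    sql += " ORDER BY s.subject_id"
    return conn.execute(sql, args).fetchall()


# In-memory toy database with a hypothetical two-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
    CREATE TABLE measurements (subject_id INTEGER, parameter TEXT, value REAL);
    """
)
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?, ?)",
    [(1, "f", 40), (2, "m", 55), (3, "f", 70)],
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, "glucose", 5.1), (2, "glucose", 5.6), (3, "glucose", 6.0), (1, "albumin", 44.0)],
)
print(query_parameter(conn, "glucose", gender="f"))
print(query_parameter(conn, "glucose", gender="f", min_age=50))
```

Building the query from optional filters mirrors the selection menu: each condition the user ticks adds one clause, and an empty selection returns the whole cohort for that parameter.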

4.2. Data visualization

Visualization of data in the MARK-AGE project is important for two reasons: (1) uploaded data must be monitored and controlled for early recognition and prevention of problems, and (2) researchers need to extract the biological information from the data.

4.2.1. Data exploration

For data exploration, general plotting tools from descriptive statistics were used (Table 1). KNIME workflows were designed to automatically generate the graphs for all parameters available in the database. With those analyses, experts in the field could test the data for plausibility and correctness. Evidence for the underlying distribution and, as described above, for the quality of parameters could also be provided.

4.2.2. Extract biological information

For the extraction of inter-parameter dependencies in the MARK-AGE database, a tool to visualize correlations was necessary. A typically used numerical correlation matrix, calculated over the whole database, would have been too large for the extraction of significant information. Therefore a KNIME workflow was established that automatically generates a network with a pre-selected parameter in the middle and the top ten correlating parameters arranged around it (Fig. 5). To clearly visualize the dependency between the parameters, information is represented by the length and color of the connection lines, indicating the correlation strength and direction. The available sample number is indicated by the size of the parameter circles. Through this arrangement the user is able to perform plausibility checks for known dependencies at a glance. Efficient detection of unknown or rather unexpected dependencies in the data is thus possible. An upstream selection menu allows the separation of different subgroups, so that additional stratification of the data is possible. Changes in the correlations between different subgroups give a hint of the interaction of specific body systems or the influence of environmental factors. The identified dependencies can lead to the development of new hypotheses and further detailed analysis.

Fig. 5. Network analysis to check for correlations. The automated analysis presents the 10 best correlating parameters (A–J) for the parameter X, selected by the user. The color of the lines reflects the direction of the correlation (dark grey: positive correlation, light grey: negative correlation). The length of the lines indicates the strength of the correlation (the longer the line, the weaker the correlation). The size of the circles stands for the amount of data considered for the calculation (the larger the diameter, the more data are available). Which subgroups are considered in the analysis is defined in the selection menu (see Fig. 4).
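The selection step behind such a network view, i.e., finding the parameters that correlate most strongly with a chosen one, can be sketched as follows. The sketch assumes that the Spearman coefficient is used, that missing values are encoded as `None` and handled pairwise, and that at least three complete pairs are required; all names are hypothetical and the drawing of the network itself is left out.

```python
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, or None if one variable is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def top_correlations(data, target, k=10, min_pairs=3):
    """Return up to k (name, rho, n) tuples, strongest absolute correlation first.

    data maps parameter names to equally long lists; None marks a missing value.
    n is the number of complete pairs and could drive the circle size in a plot.
    """
    results = []
    for name, values in data.items():
        if name == target:
            continue
        pairs = [(a, b) for a, b in zip(data[target], values)
                 if a is not None and b is not None]
        if len(pairs) < min_pairs:
            continue
        xs, ys = zip(*pairs)
        rho = spearman(xs, ys)
        if rho is not None:
            results.append((name, rho, len(pairs)))
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results[:k]


# Hypothetical toy data set with one sparsely measured parameter (D).
data = {
    "X": [1, 2, 3, 4, 5],
    "A": [2, 4, 6, 8, 10],
    "B": [5, 4, 3, 2, 1],
    "C": [1, 3, 2, 5, 4],
    "D": [None, 1, None, 2, None],
}
for name, rho, n in top_correlations(data, "X"):
    print(f"{name}: rho = {rho:+.2f} (n = {n})")
```

Sorting by the absolute value keeps strong negative correlations in the list; the sign is retained so that a plot can encode the direction separately, as done with the line colors in Fig. 5.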

4.2.3. Data sharing

In order to provide MARK-AGE partners with graphical results based on the ongoing analysis, the KNIME team space and the KNIME WebPortal were established.

4.2.3.1. KNIME team space. To share the database, meta nodes and analysis workflows, the KNIME team space was used. Workflows were placed on a dedicated MARK-AGE server. Access to the workflows was restricted to team members responsible for the coordination of the MARK-AGE analysis. These members could not only access the uploaded workflows but also provide other colleagues with self-generated workflows. The workflows provided could be downloaded as a local copy. Original versions, however, could only be modified by their authors. Thus, the KNIME team space provided a collaborative information exchange in a documented manner and under high-level security conditions.

4.2.3.2. KNIME WebPortal. The KNIME team space was restricted to members responsible for the coordination of the MARK-AGE analysis. Therefore the KNIME WebPortal (Fig. 6) was used, enabling all project partners to receive analysis feedback on their own measurements. The KNIME WebPortal was connected to the MARK-AGE server and workflows were organized in specified folders. Users could select the desired analysis conditions from a menu (Fig. 4), and pre-programmed, automatic algorithms ran in the background. Afterwards users could download graphical results as a pdf, xls, pptx, or Word document (Fig. 7). This strategy ensures the blinding of the MARK-AGE project, since the subject codes linking the subjects with the analysis results remain protected.

Fig. 6. Screenshot of the KNIME WebPortal user interface with the list of available analyses that could be selected. Only activated users can log in and receive analyses from the automatic reports.

5. Discussion

In this paper we describe how the data cleaning and visualization processes were performed during the MARK-AGE project. Problems were identified especially with regard to outliers, missing values and batch effects. Some of our observations represent already known problems that can occur in databases, but published handling strategies cannot be applied directly to MARK-AGE data. Therefore adjustments to the specific project's requirements are reported. In addition, new developments were described that could serve for prevention and as prototype examples in upcoming ageing studies.

As mentioned in Section 2, the MARK-AGE database contains several types of information, i.e., (1) values from demographic data, (2) values from anthropometric measurements and (3) values from analytic data (see Bürkle et al., this issue). During data analysis a main challenge was to deal with the different kinds of data and prepare them for reliable analysis. While information on subjects was collected in questionnaires during interviews, bioanalytical data were obtained from biological material analysed in the corresponding laboratories. Both questionnaires and analytic data are error-prone, and even our best efforts in the project design could not prevent such errors. Several circumstances can lead to missing values, outliers or batch effects, which in turn can significantly impact statistical analysis. Therefore strategies for cleaning data are necessary. However, there is no standard strategy available, and the procedures to be used depend mainly on the type of data and type of analysis. In order to check the effectiveness of data cleaning strategies, Dasu and Loh introduced the concept of statistical distortion. They argued that data cleaning strategies can have an impact on results, such that the cleaned data no longer represent the real process that generated the data. Therefore, 'cleaner' data does not necessarily mean more useful or usable data (Dasu and Loh, 2012).

In the case of missing values, most statistical procedures automatically exclude the subjects concerned. This leads to a reduction of the data available for performing statistical analysis. As a result, outcomes may not be statistically significant due to a lack of statistical power. Furthermore, missing values can also cause misleading results by introducing bias. Due to the nature of the MARK-AGE project, in some cases the reason for missing values was difficult to identify. Often it is not obvious whether missing data will cause a problem. In some cases results might be affected, while others remain unchanged.

As with missing values, variations that arise from outliers can influence the statistical outcome and the reliability of data. Most criteria for identifying possible outliers are effective if the data follow a normal distribution. Although these methods are powerful, it may be problematic to apply them to non-normally distributed data or to small sample sizes without information about their characteristics. Outliers can occur due to biological variance or technical reasons. In the MARK-AGE project the inter-quartile approach was used in order to eliminate the most severe cases. However, in doing so we might exclude the most interesting subjects, namely those that represent a real biological outlier. Unfortunately, in some cases it is not possible to determine the background of such an outlier. By contrast, batch effects are a source of non-biological variation. Although there are methods available to fit different slopes or steps in a distribution, they should not be applied blindly to each situation. Furthermore, during the analysis of batch effects it is essential to check whether outliers and batch effects present overlapping problems.

Fig. 7. A representative example of the outcome of an analysis on the KNIME WebPortal. The workflow runs in the background and provides the user with graphs and tables. Obtained results can be saved in one of the listed file formats (bottom left corner).

Due to the huge amount and different types of data generated during the MARK-AGE project, a manual database exploration would have been time consuming. However, automated data exploration tools cannot be applied in each case. We concluded that there is a need for useful and powerful tools that automate the data cleaning process, or at least assist manual procedures. However, automated methods can only be part of the procedure. There is an ongoing need for the development of new tools to assist this process, especially for use in best practice routines.

Last but not least, data sharing is indispensable in large research consortia such as MARK-AGE. Budin-Ljøsne and colleagues describe the challenges of sharing data based on their experiences in the European Network for Genetic and Genomic Epidemiology, ENGAGE (Budin-Ljosne et al., 2014). The science community expects authors to share research results; there is therefore a need for data archiving, especially when the research deals with health issues or public policy formation. Data archiving means the storing of large amounts of data to be accessed from central locations. During the MARK-AGE project the KNIME WebPortal was used to share results within the Consortium. Access to the WebPortal is restricted to MARK-AGE members, but this could be opened to the scientific community in the future. If this happens, it is necessary for each user to understand the underlying facts about the database development (see Baur et al., Kötter and Moreno-Villanueva et al., this issue) and the possible quality issues described in this work. Especially for biogerontologists, not only the opening of the database but also upcoming publications on the MARK-AGE data will be of great interest. The interpretation of those results as well as the development of new hypotheses will mainly rely on the quality of the database and the reliability of the work done on it. Information on the age-dependent parameters identified in the MARK-AGE project might also be linked with other databases like the HAGR (Tacutu et al., 2013). But even prior to that, the conclusions reached during the data cleaning and visualization steps in MARK-AGE can be used to prevent errors in upcoming studies investigating the human ageing process.

Acknowledgements

We wish to thank the European Commission for financial support through the FP7 large-scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880) and all MARK-AGE Consortium partners for the excellent collaboration.

References

Batini, C., Cappiello, C., Francalanci, C., 2009. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41, http://dx.doi.org/10.1145/1541880.1541883

Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: The Konstanz Information Miner. Studies in Classification, Data Analysis, and Knowledge. Springer-Verlag, Heidelberg-Berlin.

Budin-Ljosne, I., Isaeva, J., Knoppers, B.M., Tasse, A.M., Shen, H.Y., McCarthy, M.I., Harris, J.R., 2014. Data sharing in large research consortia: experiences and recommendations from ENGAGE. Eur. J. Hum. Genet. 22 (3), 317–321.

Chapman, A.D., 2005. Principles and methods of data cleaning primary species and species-occurrence data. In: Version 1.0, Report for the Global Biodiversity Information Facility. Global Biodiversity Information Facility, Copenhagen.

Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., de Magalhaes, J.P., 2015. The Digital Ageing Atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.

Dasu, T., Loh, J.M., 2012. Statistical Distortion: Consequences of Data Cleaning. Proceedings of the VLDB Endowment. Z.M. Özsoyoğlu. Istanbul, Turkey, VLDB 2012. 5, 1674–1683.

Fan, W., Geerts, F., Jia, X., Kementsietsidis, A., 2008. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33 (2), 1–48.

Hellerstein, J.M., 2008. Quantitative Data Cleaning for Large Databases. Survey for the United Nations Economic Commission for Europe. http://db.cs.berkeley.edu/jmh

Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5 (3), 299–314.

Juran, J.M., Gryna, F.M., Bingham, R.S., 1974. Quality Control Handbook, 3rd edition. McGraw-Hill, New York, ISBN 0070331758.

Maletic, J.I., Marcus, A., 2000. Data Cleansing: Beyond Integrity Analysis. Massachusetts Institute of Technology, Boston, pp. 200–209.

Mayfield, C., Neville, J., Prabhakar, S., 2010. ERACER: a database approach for statistical inference and data cleaning. SIGMOD, 75–86.

Orr, K., 1998. Data Quality and Systems Theory. CACM 41 (2), 66–71.

Pearson, A.A., 2004. Relational Databases. Current Protocols in Bioinformatics, supplement 7: 9.4.1–9.4.25.

Redman, T., 1998. The impact of poor data quality on the typical enterprise. CACM 41 (2), 79–82.

Schroeder, W., Martin, K., Lorensen, B., 2003. The visualisation toolkit. In: A System for Guided Data Repair. SIGMOD. Prentice Hall PTR, New Jersey, pp. 1223–1226.

Tacutu, R., Craig, T., Budovsky, A., Wuttke, D., Lehmann, G., Taranukha, D., Costa, J., Fraifeld, V.E., De Magalhaes, J.P., 2013. Human Ageing Genomic Resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Res. 41, D1027–D1033.

Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., 2010. GDR: a system for guided data repair. SIGMOD 10. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, New York, pp. 1223–1226.

Yu, K., Salomon, R., 2010. Peptide depot: flexible relational database for visual analysis of quantitative proteomic data and integration of existing protein information. Proteomics 9 (23), 5350–5358.
