journal homepage: www.elsevier.com/locate/mechagedev
The MARK-AGE extended database: data integration and pre-processing
J. Baur a, T. Kötter b, M. Moreno-Villanueva a, T. Sindlinger a, M.R. Berthold b, A. Bürkle a,∗, M. Junk c
a Chair of Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany
b Chair of Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany
c Department of Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany
Article info
Article history:
Received 31 January 2015
Received in revised form 13 May 2015
Accepted 18 May 2015
Available online 21 May 2015
Keywords: Database, Data entry, Data integration, Data processing, Data extraction, KNIME
Abstract
MARK-AGE is a recently completed European population study, where bioanalytical and anthropometric data were collected from human subjects at a large scale. To facilitate data analysis and mathematical modelling, an extended database had to be constructed, integrating the data sources that were part of the project. This step involved checking, transformation and documentation of data. The success of downstream analysis mainly depends on the preparation and quality of the integrated data. Here, we present the pre-processing steps applied to the MARK-AGE data to ensure high quality and reliability in the MARK-AGE Extended Database. Various kinds of obstacles that arose during the project are highlighted and solutions are presented.
© 2015 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
A database is a structured collection of information that can be accessed by using specific software tools. To extract the information and understand the fundamental structure of collected data, these have to be presented in such a way as to enable efficient knowledge extraction. Various forms of databases have been developed and they are typically categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables (Codd, 1970). It is commonly used in genomics, proteomics and clinical research, where large amounts of data have to be stored from each subject or patient (Mackey and Pearson, 2004; Yu and Salomon, 2009).
To enter, organize and select data from a database, a database management system (DBMS) is necessary. These programs are specifically designed to enable interaction between the user, other applications, and the database itself, and cover the following issues:
1. Defining, removing and modifying the data structure of new or existing database tables.
2. Inserting, modifying and deleting data.
3. Querying data for reports and making them accessible to end-users.
4. Data security and recovery, registering and monitoring of users.
∗ Corresponding author. Tel.: +49 7531 884035; fax: +49 07531 884033.
E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).
There are many different types of DBMSs, ranging from small systems that run on personal computers to very complex systems that run on mainframes. Structured Query Language (SQL) was developed in the early 1970s at International Business Machines (IBM) (Chamberlin and Boyce, 1974). It is a standard language to interact with relational databases and is currently the most widely used database language. The usage of SQL includes data insert, query, update and delete, schema modification and data access control.
In large research projects the sharing of data within a consortium, between different consortia, or with the scientific community at large is essential in order to boost progress during complex data analysis and modelling. An absolute requirement for sharing of a database is that the communicated data are well organized, have been entered in the correct format, and have been carefully checked and validated. Different organizational approaches to such processes have already been described for several databases in ageing research (Craig et al., 2015; De Magalhaes et al., 2005).
When designing a project, strategies need to be implemented in order to prevent errors during data entry. Error prevention strategies, however, cannot guarantee the absence of incorrect or incomplete data entry during the project (Van den Broeck et al., 2005). Data that have not been screened and checked for misleading information may produce false results and conclusions. ‘Data cleaning’, i.e. the identification and correction of errors in order to improve data quality before storing and analysing data, is therefore an indispensable part of the data management process (Chapman, 2005; Rahm and Do, 2000; Van den Broeck et al., 2005). There are, however, many different error types, and furthermore error sources are not always easy to identify. Typical examples are errors during measurement, data entry or data integration (Hellerstein, 2008). For data evaluation, analysts must know about possible data quality problems that can compromise the validity of results. This topic is also addressed in the Digital Ageing Atlas, where each entry is linked to its source of raw data, offering the possibility to check original contents (Craig et al., 2015).
http://dx.doi.org/10.1016/j.mad.2015.05.006
0047-6374/© 2015 Elsevier Ireland Ltd. All rights reserved.
Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-295749
Fig. 1. Scheme of the SQL and KNIME data source fusion.
Data were either uploaded to the SQL database via the internet, or directly implemented in KNIME. To generate a complete set of data, both sources were integrated within KNIME at a central place.
Below, we explain how the MARK-AGE database was prepared and extended within the KNIME data integration platform. Hidden problems in various data sets, as well as handling strategies for misleading data, are presented. As the cases mentioned did occur despite careful study design, they may provide some guidance to avoid similar problems in future studies.
2. Materials and methods
2.1. SQL MARK-AGE database
The MARK-AGE database was established using Structured Query Language (SQL) (Kötter and Moreno-Villanueva et al., this issue). SQL is a commonly used language for managing relational databases. The data tables contain the raw data uploaded by MARK-AGE Consortium members. SQL is a rather complex tool for non-experts in informatics. Therefore, KNIME was chosen for querying the SQL database and for further processing of the MARK-AGE data (Fig. 1).
2.2. Konstanz Information Miner (KNIME)
The Konstanz Information Miner (KNIME) is a user-friendly data integration platform, which enables visual assembly and interactive execution of data pipelines. KNIME enables easy integration of new algorithms or visualization methods as modules or nodes. For a clear structuring of big workflows, KNIME offers the opportunity to collapse a group of selected nodes into a so-called metanode. In addition, KNIME provides plugins for common programming languages and can be used as a database management system (Berthold et al., 2007). To avoid the risk of losing information, the already generated parts of the SQL database tables were opened in KNIME after connecting to the MARK-AGE database server (Fig. 1). KNIME was not given direct read access to the original data tables, but to mirrored or joined so-called ‘views’ that were generated by the SQL DBMS. Thus, this system guarantees a maximal level of safety for the already established SQL-based data. Each step performed with KNIME is executed and saved in a dedicated node, which thereby works as a documentation platform. As a result, a workflow of nodes was generated performing the complete integration of the MARK-AGE database (Fig. 2).
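The idea of exposing only views to the integration tool can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module in place of the MARK-AGE database server, and the table, view and column names (`raw_measurements`, `v_measurements`, `ssc`, `parameter`, `value`) are hypothetical.

```python
import sqlite3

# In-memory database standing in for the project server (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_measurements (ssc TEXT, parameter TEXT, value REAL)")
con.executemany(
    "INSERT INTO raw_measurements VALUES (?, ?, ?)",
    [("12345", "HGB", 14.1), ("12346", "HGB", 13.2)],
)

# The view mirrors the raw table; downstream tools query only the view.
con.execute(
    "CREATE VIEW v_measurements AS "
    "SELECT ssc, parameter, value FROM raw_measurements"
)
rows = con.execute("SELECT ssc, value FROM v_measurements ORDER BY ssc").fetchall()

# SQLite views are read-only, so a write attempt through the view fails
# and the raw table stays untouched.
try:
    con.execute("DELETE FROM v_measurements")
    view_is_writable = True
except sqlite3.OperationalError:
    view_is_writable = False

raw_count = con.execute("SELECT COUNT(*) FROM raw_measurements").fetchone()[0]
```

Other database systems handle writes through views differently, so in practice the protection would also rely on user access rights.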
2.3. KNIME server
The KNIME server allows storage of and access to workflows via the internet. User access rights control how data are grouped for projects, workgroups or departments. The server was used to store and document KNIME workflows of the MARK-AGE database extension process (https://www.knime.org/knime-server).
2.4. Ethical clearance
The MARK-AGE study has been approved by the appropriate ethics committees and has been performed in accordance with the ethical standards laid down in the Declaration of Helsinki. All study subjects gave their informed consent prior to their inclusion in the study. This is described in great detail in Capri et al., this issue.
3. Results
3.1. Data entry
To guarantee the blinding of the study, questionnaire data (comprising descriptors of the respective subject such as age and gender) and bioanalytical data were stored separately. Questionnaires were uploaded to the database under a primary subject code (PSC), and data from biochemical analyses of samples were recoded and entered into the database under the secondary subject code (SSC) (see Bürkle et al., this issue). To join both types of information, a translation table (termed ‘PSC-to-SSC’ table) was established during the project (see Moreno-Villanueva and Kötter et al., this issue).
Fig. 2. Overview of the KNIME preparation workflow for the extended database.
Information was separately implemented for bioanalytical and questionnaire data (A). Subsequently, specific calculations (B) and the renaming (C) of the parameters were performed. In the last step, subgroups from several recruitment phases were separated (D).
3.1.1. Entry of xls and csv data tables
For logistic reasons, not all of the collected data from biochemical analysis were uploaded to the SQL database via the website interfaces. Therefore, xls or csv tables containing the secondary subject code with the respective data were integrated with the already established database using KNIME, thus leading to the formation of an ‘Extended Database’ (Fig. 1). The tables, manually generated by the analyzers, posed problems with the subject codes, as some of those were invalid either due to typos or due to multiple usage. To solve this problem, a KNIME workflow was established that automatically identifies and rejects all rows of a table containing invalid or multiple subject codes. Each incoming xls or csv file was checked separately, first for the uniqueness of SSCs by counting and second for validity by comparing all available codes with the PSC-to-SSC table from the SQL database, which contains all valid subject code information (Fig. 4). Invalid entries were separated and documented for communication to the partners. Applying this procedure, a total of 19 data tables were added to the Extended Database, containing on average 1.3% ± 1.2 invalid and 0.4% ± 0.8 multiply entered subject codes. The high standard deviations indicate that some tables displayed more coding errors than others. Additionally, the number of mis-entered subject codes indirectly reflects the data quality. Another scenario occurring in the data tables was the mix-up of codes and respective bioanalytical values in the laboratories. Those cases could only be identified if they led to outliers in some downstream analysis, for example if a female subject was attributed values that apply to males only.
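The two checks described above (uniqueness by counting, validity by comparison with the PSC-to-SSC table) can be sketched in a few lines of Python. This is only an illustration of the logic, not the KNIME workflow itself; the function name, the row layout and the example codes are hypothetical.

```python
from collections import Counter


def check_subject_codes(rows, valid_sscs):
    """Split rows into accepted, multiply used and invalid by their SSC."""
    counts = Counter(row["ssc"] for row in rows)
    accepted, multiple, invalid = [], [], []
    for row in rows:
        ssc = row["ssc"]
        if counts[ssc] > 1:
            multiple.append(row)   # code occurs more than once in this file
        elif ssc not in valid_sscs:
            invalid.append(row)    # code unknown to the PSC-to-SSC table
        else:
            accepted.append(row)
    return accepted, multiple, invalid


# Hypothetical example data.
valid_sscs = {"12345", "12346", "12347"}
incoming = [
    {"ssc": "12345", "value": 1.2},
    {"ssc": "12346", "value": 0.8},
    {"ssc": "12346", "value": 0.9},  # same code entered twice
    {"ssc": "99999", "value": 1.1},  # typo, not a known code
]
accepted, multiple, invalid = check_subject_codes(incoming, valid_sscs)
```

All rows sharing a multiply used code are rejected, since it cannot be decided from the file alone which of them is correct.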
Table 1
List of general modifications performed during construction of the Extended Database.
Recalculation of non-SI units to SI units
Normalizations on parameters
Normalizations on parameters with values measured by another partner
Removal of subjects with values beyond the boundaries of measuring
Calculation of ratios on parameters measured
Calculation of the means of duplicate measures
Normalizations for batch effects
Exclusion of values from ineligible samples (e.g., frozen samples inadvertently thawed during shipment or storage)
Calculation of established scores like BMI and HOMA index
Calculation of new scores developed during the MARK-AGE study
3.1.2. Addition of data columns
After data entry, some bioanalytical researchers (‘analyzers’) requested calculations like normalizations and corrections on their specific parameters. Therefore, a KNIME workflow was generated performing the steps requested, separately for each requesting analyzer (Fig. 3). If normalizations were performed on parameters, new columns were appended to the extended database in order to maintain the original data. The workflow not only documents the performed modifications but also automatically renews the calculations each time new data are uploaded.
Based on existing parameters, compound parameters such as ratios between individual parameters were calculated and appended to the Extended Database, like body mass index (BMI) (Must and Anderson, 2006) or Homeostasis Model Assessment (HOMA) index (Matthews et al., 1985). Such compound parameters have either been published already or were newly designed by the MARK-AGE Consortium members, like the ‘Nutrition Score’. Table 1 shows a list of general calculation methods performed. The established KNIME workflow is used as documentation base and organized in a way that new columns can easily be added by the user.
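As an illustration of appending compound parameters without touching the original columns, the following sketch computes BMI as weight divided by squared height and the commonly quoted HOMA approximation for insulin resistance (fasting glucose in mmol/l times fasting insulin in µU/ml, divided by 22.5). Whether MARK-AGE used exactly this HOMA variant is an assumption, and all column names are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def homa_ir(glucose_mmol_l, insulin_uU_ml):
    """HOMA insulin resistance approximation (assumed variant)."""
    return glucose_mmol_l * insulin_uU_ml / 22.5


def append_scores(row):
    """Return a copy of the row with the scores added as new columns."""
    out = dict(row)  # the original values are kept unchanged
    out["bmi"] = bmi(row["weight_kg"], row["height_m"])
    out["homa_ir"] = homa_ir(row["glucose_mmol_l"], row["insulin_uU_ml"])
    return out


# Hypothetical subject record.
subject = {"ssc": "12345", "weight_kg": 70.0, "height_m": 1.75,
           "glucose_mmol_l": 5.0, "insulin_uU_ml": 9.0}
extended = append_scores(subject)
```

Because the scores are written to new columns, the calculation can be repeated whenever new data arrive without losing the raw values.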
Fig. 3. Schematic overview of the calculation workflow.
Metanodes are used to perform the calculation steps separately for each project partner. The clear structure simplifies the usage and visualizes each step performed.
Fig. 4. Representative schematic overview of the KNIME file append flow.
Three steps were performed to check the incoming files from the MARK-AGE partners. SSCs were counted (middle box) and checked for validity (upper box). In addition, the file format was checked (lower box) and converted if necessary. Invalid or multiple codes were excluded and documented in xls files via the red xls writer nodes.
3.2. Data pre-processing
3.2.1. Re-naming of parameters
The column names of the SQL database tables had been designed using short terms and the underscore sign. Parameter names, however, must be usable for headings in graphs etc. The re-naming of the parameters was performed with a standard KNIME node that automatically translates the column names according to a reference table containing original column names as well as new names. As a central place to store both kinds of information was necessary, they were implemented in the meta table established during the project (see Kötter and Moreno-Villanueva et al., this issue). Corrected names contain the more readable full-length name or, if too long, the standard abbreviation, as well as the unit in which the parameter was measured.
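The renaming via a reference table amounts to a simple lookup. The sketch below shows the principle in Python; the original and new column names are invented for illustration and are not taken from the MARK-AGE meta table.

```python
def rename_columns(row, name_map):
    """Translate column names via a reference table; unknown names are kept."""
    return {name_map.get(column, column): value for column, value in row.items()}


# Hypothetical reference table: original column name -> readable name with unit.
name_map = {
    "hgb_val": "Haemoglobin [g/dl]",
    "bmi": "BMI [kg/m^2]",
}

row = {"ssc": "12345", "hgb_val": 14.1, "bmi": 22.9}
renamed = rename_columns(row, name_map)
```

Keeping the mapping in one table means that a corrected name only has to be changed in a single place.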
3.2.2. Joining of analytic and questionnaire data
A necessary step for working with the extended database was the joining of separately stored bioanalytical and questionnaire data. As questionnaires were stored under the PSC and bioanalytical measurements under the SSC, a simple joining procedure was performed using the ‘PSC-to-SSC’ translation table. Upon checking of the data, however, a discrepancy in the numbers of subjects between bioanalytical and questionnaire data was noted.
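A join of this kind, together with a list of the questionnaire entries that find no bioanalytical counterpart, might look as follows. The sketch is not the KNIME joiner node; the function and field names are hypothetical, and the example codes are borrowed from the illustration in Table 2.

```python
def join_on_translation(questionnaire, bioanalytical, psc_to_ssc):
    """Join questionnaire rows (keyed by PSC) with bioanalytical rows (keyed by SSC)."""
    bio_by_ssc = {row["ssc"]: row for row in bioanalytical}
    joined, unmatched_psc = [], []
    for q in questionnaire:
        ssc = psc_to_ssc.get(q["psc"])
        bio = bio_by_ssc.get(ssc)
        if bio is None:
            unmatched_psc.append(q["psc"])  # no measurements under this subject
            continue
        joined.append({**q, **bio})
    return joined, unmatched_psc


psc_to_ssc = {"0200123": "12345", "0200128": "12346"}
questionnaire = [
    {"psc": "0200123", "age": 52},
    {"psc": "0200128", "age": 52},  # mistyped PSC that created a second entity
]
bioanalytical = [{"ssc": "12345", "hgb": 14.1}]

joined, unmatched_psc = join_on_translation(questionnaire, bioanalytical, psc_to_ssc)
```

Counting the unmatched codes is one way in which a discrepancy between the two data sources becomes visible.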
As the questionnaires were divided into six electronic parts, the PSC had to be entered up to six times separately. Therefore, in some cases different PSC codes with typos or reversed digits appeared for a single subject. The database recognized each newly entered, differing PSC as a new subject and generated a new entity. As a result, for one subject more than one SSC could be available, and the complete questionnaire information from one subject could be dispersed as well (Table 2). Consequences of this problem may well have affected any phase of recruitment.
Table 2
Chart illustrating the PSC entering problem.
PSC SSC quest1 quest2 quest3 quest4 quest5 analysis
1 0200123 12345 x x x x x
2 0200128 12346 x x x
3 02rt123 12347 x x x
4 02RT123 12348 x x
Rows 1 and 2 reflect entries for the same subject in the first recruitment round. When questionnaire two was entered, the interviewer mistakenly entered an 8 instead of the 3 as the last digit. Because the second code is unknown to the system, it generates a new row, mistakenly reflecting a new person. The same problem occurs if indicators from further recruitment rounds were transposed (rt -> tr) or written inconsistently in lower or upper case for one subject (rows 3 and 4).
3.2.2.1. First recruitment. For the main recruitment phase, erroneous multiple PSC insertions were not fixed. As there is no valid possibility to assemble the different entries belonging to one subject, all cases where only parts of questionnaires had been entered were excluded from further analysis.
3.2.2.2. Subsequent recruitment phases. Additional recruitment rounds, termed ‘re-sampling’ or ‘re-testing’, were performed during the project. In these phases only a subset of subjects was examined and data were entered again. For identification purposes, an altered PSC code with a separate identifier consisting of two letters was used. If those letters were mixed up during various entries, again different PSCs were generated for one subject. Fortunately, those mis-entries could be retraced according to the remaining parts of the code. As a result, a KNIME workflow was established removing the identifier and writing all separately stored information into one valid PSC with a correct indicator.
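A possible way to retrace such mis-entries is to bring every PSC into a canonical form before merging. The sketch below assumes a code layout of two digits, a two-letter identifier and further digits (as in the example ‘02rt123’ of Table 2); the list of known identifiers is hypothetical.

```python
import re

KNOWN_IDENTIFIERS = ("RT", "RS")  # hypothetical two-letter round identifiers
PSC_PATTERN = re.compile(r"^(\d{2})([A-Za-z]{2})(\d+)$")  # assumed code layout


def canonical_psc(psc):
    """Normalize case and transposed letters of the round identifier."""
    match = PSC_PATTERN.match(psc)
    if match is None:
        return psc  # codes without an identifier are left unchanged
    prefix, letters, rest = match.groups()
    letters = letters.upper()
    for identifier in KNOWN_IDENTIFIERS:
        if sorted(letters) == sorted(identifier):
            letters = identifier  # e.g. "TR" is mapped back to "RT"
            break
    return prefix + letters + rest


def merge_entries(entries):
    """Collect all questionnaire parts under one canonical PSC."""
    merged = {}
    for psc, data in entries:
        merged.setdefault(canonical_psc(psc), {}).update(data)
    return merged


entries = [("02rt123", {"quest1": "x"}), ("02TR123", {"quest2": "x"})]
merged = merge_entries(entries)
```

This only works because the remaining digits of the code still identify the subject; typos in the digits themselves cannot be repaired this way.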
3.2.3. Filtering data
According to the design of MARK-AGE, criteria for the enrolment of subjects had been established. Subjects had to fall in the age range 35–74.9 years; furthermore, positivity for hepatitis B or C was an exclusion criterion. For logistic reasons, some subjects were entered into the database that did not satisfy these criteria. General filters were established in a KNIME node that exclude and document those subjects.
Two exceptions had to be considered during the setup of the algorithm. The ‘spouses of GEHA Offspring’ [SGO] group was allowed to exceed the defined age range as their number was rather low. Furthermore, subjects recruited during the re-testing phase obviously had different age requirements: if a person was 73 at the time of the first recruitment, 3 years later the age exceeded the above-mentioned limit, which, however, was accepted in this case.
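The filter logic with its two exceptions can be written down compactly. In this sketch the age criterion is simply waived for SGO and re-testing subjects, which is a simplification of the rules described above; the field names are hypothetical.

```python
AGE_MIN, AGE_MAX = 35.0, 74.9  # enrolment age range in years


def is_eligible(subject):
    """Apply the enrolment criteria, including the two exceptions."""
    if subject["hepatitis_positive"]:
        return False  # exclusion criterion for all groups
    if subject["group"] == "SGO" or subject["retest"]:
        return True   # simplification: age criterion waived for these subjects
    return AGE_MIN <= subject["age"] <= AGE_MAX


def split_subjects(subjects):
    """Separate eligible subjects from excluded ones, which are kept for documentation."""
    eligible = [s for s in subjects if is_eligible(s)]
    excluded = [s for s in subjects if not is_eligible(s)]
    return eligible, excluded


subjects = [
    {"id": 1, "age": 50, "hepatitis_positive": False, "group": "main", "retest": False},
    {"id": 2, "age": 80, "hepatitis_positive": False, "group": "main", "retest": False},
    {"id": 3, "age": 80, "hepatitis_positive": False, "group": "SGO", "retest": False},
    {"id": 4, "age": 76, "hepatitis_positive": False, "group": "main", "retest": True},
    {"id": 5, "age": 50, "hepatitis_positive": True, "group": "main", "retest": False},
]
eligible, excluded = split_subjects(subjects)
```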
After the data cleaning steps mentioned above, it was possible to determine the number of valid entries of subjects that participated in the study. Therefore, we defined requirements necessary for a subject to be considered in standard analysis. A ‘valid subject’ required at least one bioanalytical parameter measured successfully (besides hepatitis analysis, where positivity was an exclusion criterion) and the fully entered part of the questionnaires containing age and gender information, while a ‘recruited subject’ required one part of the questionnaires or blood sampling collections entered into the database. Table 3 shows the number of recruited and valid subjects established with the explained cleaning steps and definitions.
3.3. Corrections of coupled parameters
Besides corrections on single parameters, corrections of coupled parameters were also set up.
3.3.1. ATC codes
As drug names vary from country to country, the standardized Anatomical Therapeutic Chemical (ATC) classification system was used to clearly indicate the drug intake of a subject. An ATC code consists of 5 levels defined by a specific order of letters and numbers (WHO, 2013). In the MARK-AGE database, typos in these codes occurred: for example, the number 0 and the letter O were used interchangeably, or the coding system was not maintained. An algorithm was set up correcting the typos and extracting the invalid codes, which should subsequently be corrected by the responsible recruitment centres. To extract information about the disease, level 3 ATC codes of all MARK-AGE subjects were grouped and a disease translation table was generated (Table 4).
A  06     A           Gastrointestinal
A  07     A           Infectious disease
A  07     E           Anti-inflammatories/antirheumatic compounds
A  07     F           Gastrointestinal
A  08     A           Diabetes
A  09     A           Gastrointestinal
A  10     A,B         Diabetes
A  11     A,C,D,G,H   Vitamins
B  01–02  A           Thrombosis/coagulation disorders
B  03     B           Vitamins
C  01     A-E         Cardiac disease
C  02     A,C,D       Hypertension cluster
C  03     A,B,C,E     Hypertension cluster
C  04     A           Thrombosis/coagulation disorders
C  07     A,B,C       Hypertension cluster
C  08     C           Hypertension cluster
C  08     D           Cardiac disease
C  09     A,B,C,D,X   Hypertension cluster
C  10     A,B         Lipid metabolism
D  05     A           Skin diseases
D  07     A           Anti-inflammatories/antirheumatic compounds
D  07     C           Infectious disease
D  11     A           Skin diseases
H  02     A           Anti-inflammatories/antirheumatic compounds
H  03     A,B,C       Thyroid disorders
J  01     A,C,F,X     Infectious disease
J  05     A           Infectious disease
L  01     A,B,X       Cancer therapy
M  01     A,C         Anti-inflammatories/antirheumatic compounds
M  02     A           Pain
M  04     A           Gout
M  05     B           Bone disease
M  09     A           Anti-inflammatories/antirheumatic compounds
N  02     A,B         Pain
N  03     A           CNS disorders other than depression
N  04     B           CNS disorders other than depression
N  05     A           CNS disorders other than depression
N  05     B           Depression/anxiolytics
N  06     A           Depression/anxiolytics
N  06     B,D         CNS disorders other than depression
N  07     X           CNS disorders other than depression
P  01     B           Infectious disease
R  03     A,B,D       Lung/bronchial disorders
R  05     C,D         Lung/bronchial disorders
S  01     A           Infectious disease
S  01     B           Anti-inflammatories/antirheumatic compounds
S  01     E,F,X       Eye disease
V  01     A           Immune system disorders
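The correction of 0/O confusions in ATC codes and the subsequent lookup of the disease group can be sketched as follows. The sketch assumes full seven-character codes (letter, two digits, two letters, two digits, e.g. A10BA02) and uses only a small excerpt of the translation table above; it is not the algorithm actually used in the project.

```python
import re

# Assumed layout of a full ATC code: one of the 14 main-group letters,
# two digits, two letters, two digits.
ATC_PATTERN = re.compile(r"^[ABCDGHJLMNPRSV]\d{2}[A-Z]{2}\d{2}$")
DIGIT_POSITIONS = (1, 2, 5, 6)
LETTER_POSITIONS = (3, 4)

# Excerpt of the level 3 translation table (see the rows above).
LEVEL3_TO_DISEASE = {
    "A10A": "Diabetes",
    "A10B": "Diabetes",
    "C10A": "Lipid metabolism",
    "C10B": "Lipid metabolism",
    "M04A": "Gout",
}


def correct_atc(code):
    """Fix 0/O confusions by position; return None if the code stays invalid."""
    chars = list(code.strip().upper().replace(" ", ""))
    if len(chars) != 7:
        return None
    for i in DIGIT_POSITIONS:
        if chars[i] == "O":
            chars[i] = "0"  # letter O where a digit is expected
    for i in LETTER_POSITIONS:
        if chars[i] == "0":
            chars[i] = "O"  # digit 0 where a letter is expected
    fixed = "".join(chars)
    return fixed if ATC_PATTERN.match(fixed) else None


def disease_group(code):
    """Map a (possibly mistyped) ATC code to its level 3 disease group."""
    fixed = correct_atc(code)
    if fixed is None:
        return None  # to be reported back to the recruitment centre
    return LEVEL3_TO_DISEASE.get(fixed[:4])
```

Codes that cannot be repaired return None, so they can be collected and reported instead of being silently dropped.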
3.3.2. Blood parameters in standard units
For each subject a general blood count was performed during the study. As standard blood analysis devices and their reporting formats vary between countries, the upload tables were generated such that the number and the unit for a single parameter were entered in two different columns. In this process, the unit had to be selected from a drop-down menu. In order to standardize the parameter units, an algorithm had to be established re-calculating all values according to the International System of Units (SI units) (Table 5). Large unstained cells (LUC) and platelet distribution width (PDW) were only measured in one laboratory and were therefore excluded from the analysis.
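Since value and unit were entered in separate columns, the standardization reduces to a lookup of a conversion factor per parameter and entered unit. The following sketch shows this for two parameters; the target units are taken from Table 5, whereas the set of entered units and the factors are illustrative assumptions.

```python
# Per parameter: target unit and factors from each accepted entry unit.
# Only plain decimal conversions are shown; the entries are assumptions.
CONVERSIONS = {
    "HGB": {"target": "g/dl", "factors": {"g/dl": 1.0, "g/l": 0.1}},
    "MCV": {"target": "femtoliters", "factors": {"femtoliters": 1.0}},
}


def to_standard_unit(parameter, value, unit):
    """Convert a (value, unit) pair to the standard unit of the parameter."""
    spec = CONVERSIONS[parameter]
    factor = spec["factors"].get(unit)
    if factor is None:
        raise ValueError(f"no conversion from {unit!r} for {parameter}")
    return value * factor, spec["target"]


value, unit = to_standard_unit("HGB", 140.0, "g/l")
```

Raising an error for unknown units makes sure that unexpected entries are noticed rather than passed on unconverted.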
Table 5
List of blood count parameters analyzed and units used.
Parameter     Long name                                    Unit
MCH           mean corpuscular haemoglobin                 picogram
MCHC          mean corpuscular haemoglobin concentration   g/dl
MCV           mean cell volume                             femtoliters
HCT           haematocrit                                  %
RDW           red cell distribution width                  %
HGB           haemoglobin                                  g/dl
HDW           haemoglobin distribution width               g/dl
RBC           red blood cells                              million/l
WBC           white blood cells                            thousand/l
Neutrophils   neutrophils                                  thousand/l
Eosinophils   eosinophils                                  number/l
Basophils     basophils                                    number/l
Monocytes     monocytes                                    number/l
Lymphocytes   lymphocytes                                  number/l
Platelets     platelets                                    thousand/l
MPV           mean platelet volume                         femtoliters
3.3.3. ZUNG scale scoring
The self-rating depression scale (Zung, 1965) was used to monitor whether the subjects suffered from depressive disorders (see Bürkle et al., this issue). Twenty questions had to be answered by using the following options: a little of the time, some of the time, good part of the time, most of the time. Each question was scored afterwards to calculate the depression status value. Gaps were not allowed in this system, but the subjects had the choice to select ‘na’ (not applicable) if they did not wish to answer a question. Therefore, it occurred that several questions were not usable for the rating system. As already published (Shrive et al., 2006), these gaps were filled with the mean of the points from the individual subject.
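Filling the gaps of the ZUNG scale with the subject's own mean can be sketched as follows. The sketch assumes that the twenty items have already been converted to points from 1 to 4 (including any reverse-scored items) and that ‘na’ answers are represented as None.

```python
def zung_total(answers):
    """Sum the item scores, replacing 'na' (None) by the subject's own mean."""
    answered = [a for a in answers if a is not None]
    if not answered:
        return None  # nothing to impute from
    subject_mean = sum(answered) / len(answered)
    return sum(a if a is not None else subject_mean for a in answers)


# Hypothetical subject: 18 items answered with 2 points, 2 items 'na'.
answers = [2] * 18 + [None, None]
total = zung_total(answers)
```

For this subject the two gaps are each filled with 2.0, so the total corresponds to twenty items of 2 points.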
3.3.4. Calculating uniform time intervals
Information requested on food and beverage intake was entered with a tabular system, in which the subject has to specify the amount consumed either per day, week or month. To work with these data in a standardized fashion, all intakes were recalculated to a weekly and monthly indication with an established algorithm.
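The recalculation to uniform intervals is a simple rescaling, sketched below. The length of a month is not stated in the text; 30 days is used here as an assumption.

```python
# Days per reporting period; the 30-day month is an assumption.
DAYS_IN_PERIOD = {"day": 1, "week": 7, "month": 30}


def per_week_and_month(amount, period):
    """Convert an intake reported per day, week or month to weekly and monthly values."""
    if period not in DAYS_IN_PERIOD:
        raise ValueError(f"unknown period: {period!r}")
    days = DAYS_IN_PERIOD[period]
    weekly = amount * 7 / days
    monthly = amount * 30 / days
    return weekly, monthly


weekly, monthly = per_week_and_month(2, "day")  # two cups of coffee per day
```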
4. Conclusion
In this paper, we describe the necessary steps that were performed on the MARK-AGE raw data to generate a database for the final user. The tools used present methods to detect and handle problems hidden in the data structure of collected raw data. The problems we reported should be used to implement preventive strategies in new ageing research projects. Additionally, KNIME is introduced as a web-based tool for the development of an easy-to-handle data communication platform.
Even our best efforts invested in the design of the project could not guarantee complete prevention of errors or problems related to data entry into the database. During the project, error sources in the growing database were identified, and thus data quality improved continuously. Some of the problematic effects were identified shortly after the launch of the project, whereas others were hidden and only detected after analysis. The fact that an international project like MARK-AGE involves several countries with divergent standards and guidelines made it even more difficult to establish a standardized working system. Since the creation of large European databases for the analysis of biological systemic effects will continue to be a relevant task, strategies to generate reliable databases of highest quality are necessary. So far, publications covering this aspect have been rare. With our above description of essential steps on the MARK-AGE database, we provide relevant information from our hands-on experience on frequent error sources and ideas for preventive solution strategies.
Investing sufficient time and manpower in the construction phase of a project and its database is essential and avoids high costs, delays and quality loss. In particular, a realistic estimate of the required funds for accomplishing the crucial programming work in the early phase of the project is of utmost importance. These programming tasks include the implementation of a database structure and the establishment of an appropriate backup system, the design of web interfaces for data entry with suitable consistency checks like value restrictions (avoiding free entry frames whenever possible), the selection and implementation of error-correcting subject codes, and the implementation of a framework for subsequent analysis. KNIME was chosen for data integration and retrieval because it is easier to handle for non-IT-experts and directly provides analysis tools and elaborate reporting features. Even non-experts can efficiently work with this system after a training period of approximately one week, based on the user-friendly interface and intuitive node structure.
Without sufficient knowledge about the data background and the ways data were prepared, analysis can cause misleading results. The documentation of the extended MARK-AGE database construction is completed and a detailed description is available for users. Therefore, the work presented is also necessary for upcoming publications presenting results on the MARK-AGE data. If parts of published data were to be integrated in other ageing research databases like the Digital Ageing Atlas (Craig et al., 2015), it would be necessary to know about data source and data preparation strategies. Lastly, once the MARK-AGE database is made publicly available, each user will need to be familiar with the data background.
Acknowledgements
We wish to thank the European Commission for financial support through the FP7 large-scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880). Furthermore, we wish to thank all MARK-AGE Consortium members for their efforts to make this work possible. Our special thanks go to Lothar Gasteiger, Thorsten Meinl and Peter Burger for the support regarding hardware and software tools.
References
Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: the Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization. Heidelberg-Berlin, Springer-Verlag.
Chamberlin, D.D., Boyce, R.F., 1974. SEQUEL: a structured English query language. SIGFIDET '74: Proceedings of the 1974 ACM SIGFIDET workshop on Data description, access and control, 249–264.
Chapman, A.D., 2005. Principles and Methods of Data Cleaning Occurrence Data. Version 1.0. Report for the Global Biodiversity Information Facility. Copenhagen, 1–72.
Codd, E.F., 1970. A relational model of data for large shared data banks. Communications of the ACM 13 (6).
Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., De Magalhaes, J.P., 2015. The digital ageing atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.
De Magalhaes, J.P., Costa, J., Toussaint, O., 2005. HAGR: the human ageing genomic resources. Nucleic Acids Res. 33, D537–D543.
Hellerstein, J.M., 2008. Quantitative data cleaning for large databases. Survey for the United Nations Economic Commission for Europe (UNECE), http://db.cs.berkeley.edu/jmh
Mackey, J., Pearson, W.R., 2004. Using relational databases for improved sequence similarity searching and large-scale genomic analyses. In: Current Protocols in Bioinformatics. John Wiley & Sons, Inc, Chapter 9, Unit 9.4.
Matthews, D.R., Hosker, J.P., Rudenski, A.S., Naylor, B.A., Treacher, D.F., Turner, R.C., 1985. Homeostasis model assessment: insulin resistance and beta-cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia 28 (7), 412–419.
Must, A., Anderson, S.E., 2006. Body mass index in children and adolescents: considerations for population-based applications. Int. J. Obesity 30 (4), 590–594.