The MARK-AGE extended database: data integration and pre-processing


Journal homepage: www.elsevier.com/locate/mechagedev


J. Baur^a, T. Kötter^b, M. Moreno-Villanueva^a, T. Sindlinger^a, M.R. Berthold^b, A. Bürkle^a,∗, M. Junk^c

^a Chair of Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany

^b Chair of Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany

^c Department of Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany

Article info

Article history:
Received 31 January 2015
Received in revised form 13 May 2015
Accepted 18 May 2015
Available online 21 May 2015

Keywords: Database; Data entry; Data integration; Data processing; Data extraction; KNIME

Abstract

MARK-AGE is a recently completed European population study, where bioanalytical and anthropometric data were collected from human subjects at a large scale. To facilitate data analysis and mathematical modelling, an extended database had to be constructed, integrating the data sources that were part of the project. This step involved checking, transformation and documentation of data. The success of downstream analysis mainly depends on the preparation and quality of the integrated data. Here, we present the pre-processing steps applied to the MARK-AGE data to ensure high quality and reliability in the MARK-AGE Extended Database. Various kinds of obstacles that arose during the project are highlighted and solutions are presented.

1. Introduction

A database is a structured collection of information that can be accessed by using specific software tools. To extract the information and understand the fundamental structure of collected data, these have to be presented in such a way as to enable efficient knowledge extraction. Various forms of databases have been developed and they are typically categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables (Codd, 1970). It is commonly used in genomics, proteomics and clinical research, where large amounts of data have to be stored for each subject or patient (Mackey and Pearson, 2004; Yu and Salomon, 2009).

To enter, organize and select data from a database, a database management system (DBMS) is necessary. These programs are specifically designed to enable interaction between the user, other applications, and the database itself, and cover the following tasks:

1. Defining, removing and modifying the data structure of new or existing database tables.

2. Inserting, modifying, and deleting data.

3. Querying data for reports and making them accessible to end-users.

4. Data security and recovery, registering and monitoring of users.

∗ Corresponding author. Tel.: +497531884035; fax: +4907531884033.

E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).

There are many different types of DBMSs, ranging from small systems that run on personal computers to very complex systems that run on mainframes. Structured Query Language (SQL) was developed in the early 1970s at International Business Machines (IBM) (Chamberlin and Boyce, 1974). It is a standard language to interact with relational databases and is currently the most widely used database language. The usage of SQL includes data insert, query, update and delete, modification and data access control.
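The basic SQL operations named above can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module and an in-memory database rather than the MARK-AGE server; the table name `measurement` and its columns are hypothetical and chosen only for illustration. Data access control (e.g. `GRANT`) is not shown, as SQLite does not implement it.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Define the structure of a data table.
cur.execute("CREATE TABLE measurement (ssc TEXT PRIMARY KEY, glucose REAL)")

# Insert data.
cur.executemany(
    "INSERT INTO measurement (ssc, glucose) VALUES (?, ?)",
    [("12345", 5.1), ("12346", 6.4), ("12347", 4.8)],
)

# Update and delete data.
cur.execute("UPDATE measurement SET glucose = ? WHERE ssc = ?", (5.3, "12345"))
cur.execute("DELETE FROM measurement WHERE ssc = ?", ("12347",))
conn.commit()

# Query data.
rows = cur.execute(
    "SELECT ssc, glucose FROM measurement ORDER BY ssc"
).fetchall()
print(rows)
```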

In large research projects the sharing of data within a consortium, between different consortia, or with the scientific community at large is essential in order to boost progress during complex data analysis and modelling. An absolute requirement for sharing a database is that the communicated data are well organized, have been entered in the correct format, and have been carefully checked and validated. Different organizational approaches for such processes have already been described for several databases in ageing research (Craig et al., 2015; De Magalhaes et al., 2005).

When designing a project, strategies need to be implemented in order to prevent errors during data entry. Error prevention strategies, however, cannot guarantee the absence of incorrect or incomplete data entry during the project (Van den Broeck et al., 2005). Data that have not been screened and checked for misleading information may produce false results and conclusions. ‘Data cleaning’, i.e. the identification and correction of errors in order to improve data quality before storing and analysing data, is therefore an indispensable part of the data management process (Chapman, 2005; Rahm and Do, 2000; Van den Broeck et al., 2005). There are, however, many different error types, and furthermore error sources are not always easy to identify. Typical examples are errors during measurement, data entry or data integration (Hellerstein, 2008). For data evaluation, analyzers must know about possible data quality problems that can compromise the validity of results. This topic is also addressed in the Digital Ageing Atlas, where each entry is connected to its source of raw data, offering the possibility to check original contents (Craig et al., 2015).

Fig. 1. Scheme of the SQL and KNIME data source fusion.

Data were either uploaded to the SQL database via the internet, or directly implemented in KNIME. To generate a complete set of data, both sources were integrated within KNIME in a central place.

http://dx.doi.org/10.1016/j.mad.2015.05.006

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-295749

Below, we explain how the MARK-AGE database was prepared and extended within the KNIME data integration platform. Hidden problems in various datasets, as well as handling strategies for misleading data, are presented. As the cases mentioned did occur despite careful study design, they may provide some guidance to avoid similar problems in future studies.

2. Materials and methods

2.1. SQL MARK-AGE database

The MARK-AGE database was established using Structured Query Language (SQL) (Kötter and Moreno-Villanueva et al., this issue). SQL is a commonly used language for managing relational databases. The data tables contain the raw data uploaded by MARK-AGE Consortium members. SQL is a rather complex tool for non-experts in informatics. Therefore, KNIME was chosen for querying the SQL database and for further processing of the MARK-AGE data (Fig. 1).

2.2. Konstanz Information Miner (KNIME)

The Konstanz Information Miner (KNIME) is a user-friendly data integration platform, which enables visual assembly and interactive execution of data pipelines. KNIME enables easy integration of new algorithms or visualization methods as modules or nodes. For a clear structuring of big workflows, KNIME offers the opportunity to collapse a group of selected nodes into a so-called metanode. In addition, KNIME provides plugins for common programming languages and can be used as a database management system (Berthold et al., 2007). To avoid the risk of losing information, the already generated parts of the SQL database tables were opened in KNIME after connecting to the MARK-AGE database server (Fig. 1). KNIME did not receive direct read access to the original data tables, but only to mirrored or joined so-called ‘views’ that were generated by the SQL DBMS. Thus, this system guarantees a maximal level of safety for the already established SQL-based data. Each step performed with KNIME is executed and saved in a dedicated node, which thereby works as a documentation platform. As a result, a workflow of nodes was generated performing the complete integration of the MARK-AGE database (Fig. 2).
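The protective role of database views described above can be sketched as follows. This is an illustration with Python's `sqlite3` module, not the MARK-AGE server configuration; the table and view names are hypothetical. In SQLite, views are read-only, so a client that only queries the view cannot alter the underlying tables through it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical base tables holding the original data.
    CREATE TABLE questionnaire (psc TEXT, age REAL);
    CREATE TABLE psc_to_ssc (psc TEXT, ssc TEXT);
    INSERT INTO questionnaire VALUES ('0200123', 52.0);
    INSERT INTO psc_to_ssc VALUES ('0200123', '12345');

    -- A joined view: the only object the analysis client reads from.
    CREATE VIEW questionnaire_view AS
        SELECT t.ssc, q.age
        FROM questionnaire AS q
        JOIN psc_to_ssc AS t ON q.psc = t.psc;
    """
)

view_rows = conn.execute("SELECT ssc, age FROM questionnaire_view").fetchall()

# Writing through the view is rejected by SQLite.
try:
    conn.execute("INSERT INTO questionnaire_view VALUES ('99999', 40.0)")
    write_rejected = False
except sqlite3.Error:
    write_rejected = True
```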

2.3. KNIME server

The KNIME server allows storage of and access to workflows via the internet. User access rights control how data are grouped for projects, workgroups or departments. The server was used to store and document KNIME workflows of the MARK-AGE database extension process (https://www.knime.org/knime-server).

2.4. Ethical clearance

The MARK-AGE study has been approved by the appropriate ethics committees and has been performed in accordance with the ethical standards laid down in the Declaration of Helsinki. All study subjects gave their informed consent prior to their inclusion in the study. This is described in great detail in Capri et al., this issue.

3. Results

3.1. Data entry

To guarantee the blinding of the study, questionnaire data (comprising descriptors of the respective subject such as age and gender) and bioanalytical data were stored separately. Questionnaires were uploaded to the database under a primary subject code (PSC), and data from biochemical analyses of samples were recoded and entered into the database under the secondary subject code (SSC) (see Bürkle et al., this issue). To join both types of information, a translation table (termed ‘PSC-to-SSC’ table) was established during the project (see Moreno-Villanueva and Kötter et al., this issue).

Fig. 2. Overview of the KNIME preparation workflow for the extended database.

Information was separately implemented for bioanalytical and questionnaire data (A). Subsequently, specific calculations (B) and the renaming (C) of the parameters were performed. In the last step, subgroups from several recruitment phases were separated (D).

3.1.1. Entry of xls and csv data tables

For logistic reasons, not all of the collected data from biochemical analysis were uploaded to the SQL database via the website interfaces. Therefore, xls or csv tables containing the secondary subject code with the respective data were integrated with the already established database using KNIME, thus leading to the formation of an ‘Extended Database’ (Fig. 1). The tables, manually generated by the analyzers, posed problems with the subject codes, as some of those were invalid due to either typos or multiple usage. To solve this problem, a KNIME workflow was established that automatically identifies and rejects all rows of a table containing invalid or multiple subject codes. Each incoming xls or csv file was checked separately, first for the uniqueness of SSCs by counting and second for validity by comparing all available codes with the PSC-to-SSC table from the SQL database, which contains all valid subject code information (Fig. 4). Invalid entries were separated and documented for communication to the partners. Applying this procedure, a total of 19 data tables were added to the Extended Database, containing on average 1.3% ± 1.2 invalid and 0.4% ± 0.8 multiply entered subject codes. The high standard deviations indicate that some tables displayed more coding errors than others. Additionally, the number of mis-entered subject codes indirectly reflects the data quality. Another scenario occurring in the data tables was the mix-up of codes and respective bioanalytical values in the laboratories. Those cases could only be identified if they led to outliers in some downstream analysis, for example if a female subject was attributed values that applied to males only.
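The two checks described above (uniqueness by counting, validity by comparison with the translation table) can be sketched in plain Python. This is not the KNIME workflow itself; the function name `check_ssc_rows`, the row layout and the example codes are assumptions made for illustration.

```python
from collections import Counter


def check_ssc_rows(rows, valid_sscs):
    """Split rows into accepted, multiply used and invalid ones by their SSC."""
    counts = Counter(row["ssc"] for row in rows)
    accepted, multiple, invalid = [], [], []
    for row in rows:
        ssc = row["ssc"]
        if counts[ssc] > 1:
            multiple.append(row)   # the same code occurs more than once
        elif ssc not in valid_sscs:
            invalid.append(row)    # code unknown to the translation table
        else:
            accepted.append(row)
    return accepted, multiple, invalid


# Hypothetical set of valid codes taken from a PSC-to-SSC table.
valid_sscs = {"12345", "12346", "12347"}
rows = [
    {"ssc": "12345", "value": 1.2},
    {"ssc": "12346", "value": 0.9},
    {"ssc": "12346", "value": 1.1},  # multiple usage
    {"ssc": "1234S", "value": 1.0},  # typo
]
accepted, multiple, invalid = check_ssc_rows(rows, valid_sscs)
```

Rejected rows are returned rather than silently dropped, so that they can be documented and reported back to the partner who supplied the file.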

Table 1

List of general modifications performed during construction of the Extended Database.

Recalculation of non-SI units to SI units
Normalizations on parameters
Normalizations on parameters with values measured by another partner
Removal of subjects with values beyond the boundaries of measuring
Calculation of ratios on parameters measured
Calculation of the means of duplicate measures
Normalizations for batch effects
Exclusion of values from ineligible samples (e.g., frozen samples inadvertently thawed during shipment or storage)
Calculation of established scores like BMI and HOMA index
Calculation of new scores developed during the MARK-AGE study

3.1.2. Addition of data columns

After data entry, some bioanalytical researchers (‘analyzers’) requested calculations like normalizations and corrections on their specific parameters. Therefore, a KNIME workflow was generated performing the steps requested, separately for each requesting analyzer (Fig. 3). If normalizations were performed on parameters, new columns were appended to the extended database in order to maintain the original data. The workflow not only documents the performed modifications but also automatically renews the calculations each time new data are uploaded.

Based on existing parameters, compound parameters such as ratios between individual parameters were calculated and appended to the Extended Database, like body mass index (BMI) (Must and Anderson, 2006) or Homeostasis Model Assessment (HOMA) index (Matthews et al., 1985). Such compound parameters have either been published already or were newly designed by the MARK-AGE Consortium members, like the ‘Nutrition Score’.

Table 1 shows a list of general calculation methods performed. The established KNIME workflow is used as a documentation base and is organized in such a way that new columns can easily be added by the user.
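As an illustration of appending compound parameters without touching the original columns, the following sketch computes BMI (weight in kg divided by the squared height in m) and the HOMA insulin resistance index in its commonly used form (fasting glucose in mmol/l times fasting insulin in µU/ml, divided by 22.5). Which HOMA variant and which input units were used in MARK-AGE is not stated here, so both the formula choice and the column names are assumptions.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def homa_ir(glucose_mmol_l, insulin_uU_ml):
    """HOMA insulin resistance index (common simplified form)."""
    return glucose_mmol_l * insulin_uU_ml / 22.5


def append_compound_columns(record):
    """Return a copy of the record with derived columns appended."""
    extended = dict(record)  # original values stay unchanged
    extended["bmi"] = bmi(record["weight_kg"], record["height_m"])
    extended["homa_ir"] = homa_ir(
        record["glucose_mmol_l"], record["insulin_uU_ml"]
    )
    return extended


# Hypothetical subject record.
record = {
    "weight_kg": 81.0,
    "height_m": 1.80,
    "glucose_mmol_l": 5.0,
    "insulin_uU_ml": 9.0,
}
extended = append_compound_columns(record)
```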


Fig. 3. Schematic overview of the calculation workflow.

Metanodes are used to perform the calculation steps separately for each project partner. The clear structure simplifies the usage and visualizes each step performed.

Fig. 4. Representative schematic overview of the KNIME file append flow.

Three steps were performed to check the incoming files from the MARK-AGE partners. SSCs were counted (middle box) and checked for validity (upper box). In addition, the file format was checked (lower box) and converted if necessary. Invalid or multiple codes were excluded and documented in xls files via the red xls writer nodes.

3.2. Data pre-processing

3.2.1. Re-naming of parameters

The column names of the SQL database tables had been designed using short terms and the underscore sign. Parameter names, however, must be usable for headings in graphs etc. The re-naming of the parameters was performed with a standard KNIME node that automatically translates the column names according to a reference table containing original column names as well as new names. As a central place to store both kinds of information was necessary, they were implemented in the meta table established during the project (see Kötter and Moreno-Villanueva et al., this issue). Corrected names contain the more readable full-length name or, if too long, the standard abbreviation, as well as the unit in which the parameter was measured.
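The renaming step can be pictured as a lookup in a two-column reference table. The sketch below is plain Python rather than the KNIME node; the original column names (`hgb_val`, `mcv_val`) and the display names are invented for illustration.

```python
def rename_columns(record, reference):
    """Translate column names via a reference table; unknown names are kept."""
    return {reference.get(name, name): value for name, value in record.items()}


# Hypothetical reference table: original column name -> display name with unit.
reference = {
    "hgb_val": "Haemoglobin [g/dl]",
    "mcv_val": "MCV [fl]",
}
record = {"ssc": "12345", "hgb_val": 14.2, "mcv_val": 88.0}
renamed = rename_columns(record, reference)
```

Keeping the mapping in one table means that a corrected display name only has to be changed in a single place.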

3.2.2. Joining of analytic and questionnaire data

A necessary step to work with the extended database was the joining of separately stored bioanalytical and questionnaire data. As questionnaires were stored under the PSC and bioanalytical measurements under the SSC, a simple joining procedure was performed using the ‘PSC-to-SSC’ translation table. Upon checking of the data, however, a discrepancy in the numbers of subjects between bioanalytical and questionnaire data was noted.

As the questionnaires were divided into six electronic parts, the PSC had to be entered up to six times separately. Therefore, in some cases different PSC codes with typos or reversed digits appeared for a single subject. The database recognized each newly entered, differing PSC as a new subject and generated a new entity. As a result, for one subject more than one SSC could be available, and the complete questionnaire information from one subject could be dispersed as well (Table 2). Consequences of this problem may well have affected any phase of recruitment.

Table 2

Chart illustrating the PSC entering problem.

Row PSC SSC quest1 quest2 quest3 quest4 quest5 analysis

1 0200123 12345 x x x x x

2 0200128 12346 x x x

3 02rt123 12347 x x x

4 02RT123 12348 x x

Rows 1 and 2 reflect entries for the same subject in the first recruitment round. When questionnaire two was entered, the interviewer mistakenly entered an 8 instead of the 3 as the last digit. Because the second code is unknown to the system, it erroneously generates a new row representing a new person. The same problem occurs if indicators from further recruitment rounds were transposed (rt -> tr) or written inconsistently in lower or upper case for one subject (rows 3 and 4).
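The joining step and the resulting discrepancy can be sketched as follows. The function `join_on_translation` and the data layout are assumptions; the codes follow the pattern of Table 2. A mistyped PSC receives its own SSC, under which no bioanalytical data exist, so the corresponding questionnaire part cannot be joined.

```python
def join_on_translation(questionnaires, analyses, psc_to_ssc):
    """Join questionnaire data (keyed by PSC) with analyses (keyed by SSC)."""
    joined, unmatched = [], []
    for psc, answers in questionnaires.items():
        ssc = psc_to_ssc.get(psc)
        if ssc is not None and ssc in analyses:
            joined.append({"psc": psc, "ssc": ssc, **answers, **analyses[ssc]})
        else:
            unmatched.append(psc)  # no bioanalytical data under this code
    return joined, unmatched


# Hypothetical data: the second PSC is a typo of the first one.
questionnaires = {
    "0200123": {"age": 52},
    "0200128": {"smoker": "no"},
}
psc_to_ssc = {"0200123": "12345", "0200128": "12346"}
analyses = {"12345": {"glucose": 5.1}}

joined, unmatched = join_on_translation(questionnaires, analyses, psc_to_ssc)
```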

3.2.2.1. First recruitment. For the main recruitment phase, erroneous multiple PSC insertions were not fixed. As there is no valid way to assemble the different entries belonging to one subject, all cases where only parts of questionnaires had been entered were excluded from further analysis.

3.2.2.2. Subsequent recruitment phases. Additional recruitment rounds, termed ‘re-sampling’ or ‘re-testing’, were performed during the project. In these phases only a subset of subjects was examined and data were entered again. For identification purposes, an altered PSC code with a separate identifier consisting of two letters was used. If those letters were mixed up during various entries, again different PSCs were generated for one subject. Fortunately, those mis-entries could be retraced according to the remaining parts of the code. As a result, a KNIME workflow was established removing the identifier and writing all separately stored information into one valid PSC with a correct indicator.
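A possible way to retrace such mis-entries is sketched below. The code layout (two digits, a two-letter identifier, three digits) is inferred from the examples in Table 2, and the identifier list `("rt", "rs")` is purely hypothetical; the actual MARK-AGE coding scheme may differ. Case differences and transposed letters are mapped back to the known identifier, and all entries of one subject are merged under the corrected code.

```python
import re

# Hypothetical two-letter identifiers for later recruitment rounds.
KNOWN_IDENTIFIERS = ("rt", "rs")
# Assumed layout: two digits, two letters, three digits (cf. Table 2).
CODE_PATTERN = re.compile(r"^(\d{2})([A-Za-z]{2})(\d{3})$")


def normalize_retest_psc(code):
    """Return the code with a corrected identifier, or None if not repairable."""
    match = CODE_PATTERN.match(code)
    if match is None:
        return None
    prefix, letters, suffix = match.groups()
    letters = letters.lower()
    for identifier in KNOWN_IDENTIFIERS:
        # Same letters in any order: treat as a transposition of the identifier.
        if sorted(letters) == sorted(identifier):
            return prefix + identifier + suffix
    return None


def merge_by_psc(entries):
    """Collect all data entered under variants of one PSC into a single record."""
    merged, rejected = {}, []
    for code, data in entries:
        psc = normalize_retest_psc(code)
        if psc is None:
            rejected.append(code)
        else:
            merged.setdefault(psc, {}).update(data)
    return merged, rejected


entries = [
    ("02rt123", {"quest1": "x"}),
    ("02RT123", {"quest2": "x"}),
    ("02tr123", {"quest3": "x"}),
    ("02xx123", {"quest4": "x"}),
]
merged, rejected = merge_by_psc(entries)
```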

3.2.3. Filtering data

According to the design of MARK-AGE, criteria for the enrolment of subjects had been established. Subjects had to fall in the age range 35–74.9 years; furthermore, positivity for hepatitis B or C was an exclusion criterion. For logistic reasons, some subjects were entered into the database that did not satisfy these criteria. General filters were established in a KNIME node that exclude and document those subjects.

Two exceptions had to be considered during the setup of the algorithm. The ‘spouses of GEHA Offspring’ [SGO] group was allowed to exceed the defined age range, as their number was rather low. Furthermore, subjects recruited during the re-testing phase obviously had different age requirements: if a person was 73 at the time of the first recruitment, 3 years later the age exceeded the above-mentioned limit, which, however, was accepted in this case.
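The filter logic with its two exceptions can be summarized in a short sketch. The field names are hypothetical, and two points are assumptions not fixed by the text: SGO subjects are exempted from the age check entirely, and re-tested subjects are exempted from the upper limit only.

```python
MIN_AGE, MAX_AGE = 35.0, 74.9


def is_eligible(subject):
    """Apply the enrolment criteria, including the two exceptions."""
    if subject["hepatitis_positive"]:
        return False
    if subject["group"] == "SGO":
        return True  # assumption: SGO subjects are exempt from the age range
    if subject["age"] < MIN_AGE:
        return False
    if subject["age"] > MAX_AGE and not subject["retest"]:
        return False  # upper limit is waived for re-tested subjects only
    return True


def filter_subjects(subjects):
    """Return kept and excluded subjects so that exclusions can be documented."""
    kept = [s for s in subjects if is_eligible(s)]
    excluded = [s for s in subjects if not is_eligible(s)]
    return kept, excluded


subjects = [
    {"id": "a", "age": 50.0, "group": "other", "retest": False, "hepatitis_positive": False},
    {"id": "b", "age": 80.0, "group": "SGO", "retest": False, "hepatitis_positive": False},
    {"id": "c", "age": 76.0, "group": "other", "retest": True, "hepatitis_positive": False},
    {"id": "d", "age": 76.0, "group": "other", "retest": False, "hepatitis_positive": False},
    {"id": "e", "age": 50.0, "group": "other", "retest": False, "hepatitis_positive": True},
    {"id": "f", "age": 30.0, "group": "other", "retest": False, "hepatitis_positive": False},
]
kept, excluded = filter_subjects(subjects)
```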

After the data cleaning steps mentioned above, it was possible to determine the number of valid entries of subjects that participated in the study. To this end, we defined the requirements necessary for a subject to be considered in standard analysis. A ‘valid subject’ required at least one bioanalytical parameter measured successfully (besides hepatitis analysis, where positivity was an exclusion criterion) and the fully entered part of the questionnaires containing age and gender information, while a ‘recruited subject’ required one part of the questionnaires or blood sampling collections entered into the database. Table 3 shows the number of recruited and valid subjects established with the explained cleaning steps and definitions.

3.3. Corrections of coupled parameters

Besides corrections on single parameters, corrections of coupled parameters were also set up.

3.3.1. ATC codes

As drug names vary from country to country, the standardized Anatomical Therapeutic Chemical (ATC) classification system was used to clearly indicate the drug intake of a subject. An ATC code consists of 5 levels defined by a specific order of letters and numbers (WHO, 2013). In the MARK-AGE database, typos in these codes occurred: for example, the number 0 and the letter O were used interchangeably, or the coding system was not maintained. An algorithm was set up correcting the typos and extracting the invalid codes, which should subsequently be corrected by the responsible recruitment centres. To extract information about the disease, level 3 ATC codes of all MARK-AGE subjects were grouped and a disease translation table was generated (Table 4).

Table 4

A  06     A          Gastrointestinal
A  07     A          Infectious disease
A  07     E          Antiinflammatories/antirheumatic compounds
A  07     F          Gastrointestinal
A  08     A          Diabetes
A  09     A          Gastrointestinal
A  10     A,B        Diabetes
A  11     A,C,D,G,H  Vitamins
B  01–02  A          Thrombosis/coagulation disorders
B  03     B          Vitamins
C  01     A-E        Cardiac disease
C  02     A,C,D      Hypertension cluster
C  03     A,B,C,E    Hypertension cluster
C  04     A          Thrombosis/coagulation disorders
C  07     A,B,C      Hypertension cluster
C  08     C          Hypertension cluster
C  08     D          Cardiac disease
C  09     A,B,C,D,X  Hypertension cluster
C  10     A,B        Lipid metabolism
D  05     A          Skin diseases
D  07     A          Antiinflammatories/antirheumatic compounds
D  07     C          Infectious disease
D  11     A          Skin diseases
H  02     A          Antiinflammatories/antirheumatic compounds
H  03     A,B,C      Thyroid disorders
J  01     A,C,F,X    Infectious disease
J  05     A          Infectious disease
L  01     A,B,X      Cancer therapy
M  01     A,C        Antiinflammatories/antirheumatic compounds
M  02     A          Pain
M  04     A          Gout
M  05     B          Bone disease
M  09     A          Antiinflammatories/antirheumatic compounds
N  02     A,B        Pain
N  03     A          CNS disorders other than depression
N  04     B          CNS disorders other than depression
N  05     A          CNS disorders other than depression
N  05     B          Depression/anxiolytics
N  06     A          Depression/anxiolytics
N  06     B,D        CNS disorders other than depression
N  07     X          CNS disorders other than depression
P  01     B          Infectious disease
R  03     A,B,D      Lung/bronchial disorders
R  05     C,D        Lung/bronchial disorders
S  01     A          Infectious disease
S  01     B          Antiinflammatories/antirheumatic compounds
S  01     E,F,X      Eye disease
V  01     A          Immune system disorders
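The correction of 0/O confusions in ATC codes and the lookup at level 3 can be sketched as follows. A full (level 5) ATC code has seven characters in the order letter, two digits, two letters, two digits, so the expected character class at each position is known. The sketch only handles full seven-character codes, which is an assumption; the function names and the three-entry excerpt of the translation table are chosen for illustration.

```python
import re

# Full ATC code: letter, two digits, two letters, two digits (e.g. A10BA02).
ATC_PATTERN = re.compile(r"^[A-Z]\d{2}[A-Z]{2}\d{2}$")
DIGIT_POSITIONS = (1, 2, 5, 6)  # zero-based positions that must hold digits


def correct_atc(code):
    """Repair 0/O confusions by position; return None if still invalid."""
    chars = list(code.replace(" ", "").upper())
    if len(chars) != 7:
        return None
    for i, ch in enumerate(chars):
        if i in DIGIT_POSITIONS and ch == "O":
            chars[i] = "0"  # letter O entered where a digit is required
        elif i not in DIGIT_POSITIONS and ch == "0":
            chars[i] = "O"  # digit 0 entered where a letter is required
    fixed = "".join(chars)
    return fixed if ATC_PATTERN.match(fixed) else None


def level3(code):
    """Level 3 of an ATC code is given by its first four characters."""
    return code[:4]


# Excerpt of a level 3 to disease translation table (cf. Table 4).
DISEASE_BY_LEVEL3 = {
    "A10A": "Diabetes",
    "A10B": "Diabetes",
    "C10A": "Lipid metabolism",
}

fixed = correct_atc("A1OBA02")
disease = DISEASE_BY_LEVEL3[level3(fixed)]
```

Codes that cannot be repaired return `None` and can be collected for correction by the responsible recruitment centre.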

3.3.2. Blood parameters in standard units

For each subject a general blood count was performed during the study. As standard blood analysis devices and their reporting formats vary between countries, the upload tables were generated such that the number and the unit for a single parameter were entered in two different columns. In this process, the unit had to be selected from a drop-down menu. In order to standardize the parameter units, an algorithm had to be established re-calculating all values according to the International System of Units (SI units) (Table 5). Large unstained cells (LUC) and platelet distribution width (PDW) were only measured in one laboratory and were therefore excluded from the analysis.
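Because value and unit are stored in separate columns, the recalculation reduces to a lookup of a conversion factor per unit pair. The sketch below shows this for two typical cases; the target units (`g/l` and `10^9/l`) are assumptions, as the target units actually chosen for each MARK-AGE parameter are not listed in the text.

```python
# Conversion factors keyed by (entered unit, target unit).
# 1 cell/µl = 10^6 cells/l, hence thousand/µl equals 10^9/l.
FACTORS = {
    ("g/dl", "g/l"): 10.0,
    ("g/l", "g/l"): 1.0,
    ("number/µl", "10^9/l"): 0.001,
    ("thousand/µl", "10^9/l"): 1.0,
    ("million/µl", "10^9/l"): 1000.0,
}


def to_target_unit(value, unit, target):
    """Recalculate a value entered with its unit into the target unit."""
    try:
        factor = FACTORS[(unit, target)]
    except KeyError:
        raise ValueError(f"no conversion from {unit!r} to {target!r}")
    return value * factor


haemoglobin = to_target_unit(14.0, "g/dl", "g/l")
lymphocytes = to_target_unit(2500.0, "number/µl", "10^9/l")
```

Raising an error for unknown unit pairs, instead of passing the value through, makes unexpected drop-down entries visible during integration.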

Table 5

List of blood count parameters analyzed and units used.

Parameter     Long name                                    Unit
MCH           mean corpuscular haemoglobin                 picogram
MCHC          mean corpuscular haemoglobin concentration   g/dl
MCV           mean cell volume                             femtoliters
HCT           haematocrit                                  %
RDW           red cell distribution width                  %
HGB           haemoglobin                                  g/dl
HDW           haemoglobin distribution width               g/dl
RBC           red blood cells                              million/µl
WBC           white blood cells                            thousand/µl
Neutrophils   neutrophils                                  thousand/µl
Eosinophils   eosinophils                                  number/µl
Basophils     basophils                                    number/µl
Monocytes     monocytes                                    number/µl
Lymphocytes   lymphocytes                                  number/µl
Platelets     platelets                                    thousand/µl
MPV           mean platelet volume                         femtoliters

3.3.3. ZUNG scale scoring

The self-rating depression scale (Zung, 1965) was used to monitor whether the subjects suffered from depressive disorders (see Bürkle et al., this issue). Twenty questions had to be answered by using the following options: a little of the time, some of the time, good part of the time, most of the time. Each question was scored afterwards to calculate the depression status value. Gaps were not allowed in this system, but the subjects had the choice to select ‘na’ (not applicable) if they did not wish to answer the question. Therefore, it occurred that several questions were not usable for the rating system. As already published (Shrive et al., 2006), these gaps were filled with the mean of the points from the individual subject.
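The gap-filling rule can be sketched as follows, assuming that each item has already been converted to its score (including any reverse-keyed items) and that ‘na’ answers are represented as `None`. The function name and the handling of a questionnaire without any answer are assumptions.

```python
def zung_total(item_scores):
    """Sum item scores; 'na' items (None) are replaced by the subject's mean."""
    answered = [score for score in item_scores if score is not None]
    if not answered:
        return None  # no answered item, so no mean is available
    subject_mean = sum(answered) / len(answered)
    return sum(
        score if score is not None else subject_mean for score in item_scores
    )


# Hypothetical subject: 18 answered items, 2 items marked 'na'.
scores = [1, 2, 3, 4] * 4 + [3, 3] + [None, None]
total = zung_total(scores)
```

With this rule the total equals the mean of the answered items multiplied by the number of items.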

3.3.4. Calculating uniform time intervals

Information requested on food and beverage intake was entered with a tabular system, in which the subject had to specify the amount consumed either per day, week or month. To work with these data in a standardized fashion, all intakes were recalculated to a weekly and monthly indication with an established algorithm.
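A minimal sketch of such a recalculation is given below. The length of a month is set to 30 days here, which is an assumption; the convention used in the MARK-AGE algorithm is not stated in the text.

```python
# Length of each reporting period in days; 30 days per month is an assumption.
PERIOD_DAYS = {"day": 1.0, "week": 7.0, "month": 30.0}


def intake_per(amount, period, target):
    """Convert an amount consumed per 'period' into an amount per 'target'."""
    return amount * PERIOD_DAYS[target] / PERIOD_DAYS[period]


weekly = intake_per(2.0, "day", "week")    # two portions per day
monthly = intake_per(2.0, "day", "month")
```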

4. Conclusion

In this paper, we describe the necessary steps that were performed on the MARK-AGE raw data to generate a database for the final user. The tools used provide methods to detect and handle problems hidden in the data structure of collected raw data. The problems we reported should be used to implement preventive strategies in new ageing research projects. Additionally, KNIME is introduced as a web-based tool for the development of an easy-to-handle data communication platform.

Even our best efforts invested in the design of the project could not guarantee complete prevention of errors or problems related to data entry into the database. During the project, error sources in the growing database were identified, and thus data quality improved continuously. Some of the problematic effects were identified shortly after the launch of the project, whereas others were hidden and only detected after analysis. The fact that an international project like MARK-AGE involves several countries with divergent standards and guidelines made it even more difficult to establish a standardized working system. Since the creation of large European databases for the analysis of biological systemic effects will continue to be a relevant task, strategies to generate reliable databases of highest quality are necessary. So far, publications covering this aspect have been rare. With our above description of essential steps on the MARK-AGE database, we provide relevant information from our hands-on experience on frequent error sources and ideas for preventive solution strategies.

Investing sufficient time and manpower in the construction phase of a project and its database is essential and avoids high costs, delays and quality loss. In particular, a realistic estimate of the required funds for accomplishing the crucial programming work in the early phase of the project is of utmost importance. These programming tasks include the implementation of a database structure and the establishment of an appropriate backup system, the design of web interfaces for data entry with suitable consistency checks like value restrictions (avoiding free entry frames whenever possible), the selection and implementation of error-correcting subject codes, and the implementation of a framework for subsequent analysis. KNIME was chosen for data integration and retrieval because it is easier to handle for non-IT-experts and directly provides analysis tools and elaborate reporting features. Even non-experts can efficiently work with this system after a training period of approximately one week, based on the user-friendly interface and intuitive node structure.

Without sufficient knowledge about the data background and the ways data were prepared, analysis can yield misleading results. The documentation of the extended MARK-AGE database construction is completed and a detailed description is available for users. Therefore, the work presented is also necessary for upcoming publications presenting results on the MARK-AGE data. If parts of published data were to be integrated in other ageing research databases like the Digital Ageing Atlas (Craig et al., 2015), it would be necessary to know about data source and data preparation strategies. Lastly, once the MARK-AGE database is made publicly available, it will be necessary that each user is familiar with the data background.

Acknowledgements

We wish to thank the European Commission for financial support through the FP7 large-scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880). Furthermore, we wish to thank all MARK-AGE Consortium members for their efforts to make this work possible. Our special thanks go to Lothar Gasteiger, Thorsten Meinl and Peter Burger for the support regarding hardware and software tools.

References

Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: the Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, Heidelberg-Berlin.

Chamberlin, D.D., Boyce, R.F., 1974. SEQUEL: a structured English query language. In: SIGFIDET '74: Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control, 249–264.

Chapman, A.D., 2005. Principles and Methods of Data Cleaning Occurrence Data. Version 1.0. Report for the Global Biodiversity Information Facility. Copenhagen, 1–72.

Codd, E.F., 1970. A relational model of data for large shared data banks. Communications of the ACM 13 (6).

Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., De Magalhaes, J.P., 2015. The Digital Ageing Atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.

De Magalhaes, J.P., Costa, J., Toussaint, O., 2005. HAGR: the human ageing genomic resources. Nucleic Acids Res. 33, D537–D543.

Hellerstein, J.M., 2008. Quantitative data cleaning for large databases. Survey for the United Nations Economic Commission for Europe (UNECE). http://db.cs.berkeley.edu/jmh

Mackey, J., Pearson, W.R., 2004. Using relational databases for improved sequence similarity searching and large-scale genomic analyses. In: Current Protocols in Bioinformatics. John Wiley & Sons, Inc., Chapter 9, Unit 9.4.

Matthews, D.R., Hosker, J.P., Rudenski, A.S., Naylor, B.A., Treacher, D.F., Turner, R.C., 1985. Homeostasis model assessment: insulin resistance and beta-cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia 28 (7), 412–419.

Must, A., Anderson, S.E., 2006. Body mass index in children and adolescents: considerations for population-based applications. Int. J. Obesity 30 (4), 590–594.
