
journal homepage: www.elsevier.com/locate/mechagedev

The MARK-AGE extended database: data integration and pre-processing

J. Baur a, T. Kötter b, M. Moreno-Villanueva a, T. Sindlinger a, M.R. Berthold b, A. Bürkle a,∗, M. Junk c

a Chair of Molecular Toxicology, University of Konstanz, 78457 Konstanz, Germany

b Chair of Bioinformatics and Information Mining, University of Konstanz, 78457 Konstanz, Germany

c Department of Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany

Article info

Article history:
Received 31 January 2015
Received in revised form 13 May 2015
Accepted 18 May 2015
Available online 21 May 2015

Keywords: Database, Data entry, Data integration, Data processing, Data extraction, KNIME

Abstract

MARK-AGE is a recently completed European population study, where bioanalytical and anthropometric data were collected from human subjects at a large scale. To facilitate data analysis and mathematical modelling, an extended database had to be constructed, integrating the data sources that were part of the project. This step involved checking, transformation and documentation of data. The success of downstream analysis mainly depends on the preparation and quality of the integrated data. Here, we present the pre-processing steps applied to the MARK-AGE data to ensure high quality and reliability in the MARK-AGE Extended Database. Various kinds of obstacles that arose during the project are highlighted and solutions are presented.

© 2015 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

A database is a structured collection of information that can be accessed by using specific software tools. To extract the information and understand the fundamental structure of collected data, these have to be presented in such a way as to enable efficient knowledge extraction. Various forms of databases have been developed and they are typically categorized on the basis of their function. The most common type is the relational database, where the information is stored in various data tables (Codd, 1970). It is commonly used in genomics, proteomics and clinical research, where large amounts of data have to be stored from each subject or patient (Mackey and Pearson, 2004; Yu and Salomon, 2009).

To enter, organize and select data from a database, a database management system (DBMS) is necessary. These programs are specifically designed to enable interaction between user, other applications, and the database itself and cover the following issues:

1. Defining, removing and modifying the data structure of new or existing database tables.
2. Inserting, modifying, and deleting data.
3. Querying data for reports and making them accessible to end-users.
4. Data security and recovery, registering and monitoring of users.

∗ Corresponding author. Tel.: +49 7531 884035; fax: +49 07531 884033.
E-mail address: alexander.buerkle@uni-konstanz.de (A. Bürkle).

There are many different types of DBMSs, ranging from small systems that run on personal computers to very complex systems that run on mainframes. Structured query language (SQL) was developed in the early 1970s at International Business Machines (IBM) (Boyce and Chamberlin, 1974). It is a standard language to interact with relational databases and is currently the most widely used database language. The usage of SQL includes data insert, query, update and delete, modification and data access control.
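The SQL statement types mentioned above can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The table and column names (subjects, ssc, age) are hypothetical and are not taken from the MARK-AGE schema. SQLite has no user management, so data access control is not shown.

```python
import sqlite3

def demo_sql_operations():
    """Run basic SQL definition, insert, update, delete and query statements."""
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    cur = con.cursor()
    # Data definition: create a table (hypothetical schema)
    cur.execute("CREATE TABLE subjects (ssc TEXT PRIMARY KEY, age REAL)")
    # Data manipulation: insert, update and delete rows
    cur.executemany(
        "INSERT INTO subjects VALUES (?, ?)",
        [("12345", 52.0), ("12346", 61.5), ("12347", 80.0)],
    )
    cur.execute("UPDATE subjects SET age = ? WHERE ssc = ?", (62.0, "12346"))
    cur.execute("DELETE FROM subjects WHERE age > ?", (74.9,))
    con.commit()
    # Query: select the remaining rows in a defined order
    rows = cur.execute("SELECT ssc, age FROM subjects ORDER BY ssc").fetchall()
    con.close()
    return rows

print(demo_sql_operations())
```

Parameter placeholders (`?`) are used instead of string formatting so that values are passed to the database safely.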

In large research projects the sharing of data within a consortium, between different consortia, or with the scientific community at large is essential in order to boost progress during complex data analysis and modelling. An absolute requirement for sharing of a database is that the communicated data are well organized, have been entered in the correct format, and have been carefully checked and validated. Different organizational approaches to such processes have already been described for several databases in ageing research (Craig et al., 2015; De Magalhaes et al., 2005).

When designing a project, strategies need to be implemented in order to prevent errors during data entry. Error prevention strategies, however, cannot guarantee the absence of incorrect or incomplete data entry during the project (Van den Broeck et al., 2005). Data that have not been screened and checked for misleading information may produce false results and conclusions. 'Data cleaning', i.e. the identification and correction of errors in order to improve data quality before storing and analysing data, is therefore an indispensable part of the data management process (Chapman, 2005; Rahm and Do, 2000; Van den Broeck et al., 2005). There are, however, many different error types, and furthermore error sources are not always easy to identify. Typical examples are errors during measurement, data entry or data integration (Hellerstein, 2008). For data evaluation, analyzers must know about possible data quality problems that can compromise the validity of results. This topic is also addressed in the Digital Ageing Atlas, where each entry is connected to its source of raw data, offering the possibility to check the original contents (Craig et al., 2015).

Below, we explain how the MARK-AGE database was prepared and extended within the KNIME data integration platform. Hidden problems in various data sets, as well as handling strategies for misleading data, are presented. As the cases mentioned did occur despite careful study design, they may provide some guidance to avoid similar problems in future studies.

Fig. 1. Scheme of the SQL and KNIME data source fusion. Data were either uploaded to the SQL database via the internet, or directly implemented in KNIME. To generate a complete set of data, both sources were integrated within KNIME at a central place.

http://dx.doi.org/10.1016/j.mad.2015.05.006
0047-6374/© 2015 Elsevier Ireland Ltd. All rights reserved.
Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-295749

2. Materials and methods

2.1. SQL MARK-AGE database

The MARK-AGE database was established using Structured Query Language (SQL) (Kötter and Moreno-Villanueva et al., this issue). SQL is a commonly used language for managing relational databases. The data tables contain the raw data uploaded by MARK-AGE Consortium members. SQL is a rather complex tool for non-experts in informatics. Therefore, KNIME was chosen for querying the SQL database and for further processing of the MARK-AGE data (Fig. 1).

2.2. Konstanz Information Miner (KNIME)

The Konstanz Information Miner (KNIME) is a user-friendly data integration platform, which enables visual assembly and interactive execution of data pipelines. KNIME enables easy integration of new algorithms or visualization methods as modules or nodes. For a clear structuring of big workflows, KNIME offers the opportunity to collapse a group of selected nodes into a so-called metanode. In addition, KNIME provides plugins for common programming languages and can be used as a database management system (Berthold et al., 2007). To avoid the risk of losing information, the already generated parts of the SQL database tables were opened in KNIME after connecting to the MARK-AGE database server (Fig. 1). KNIME did not provide direct read access to the original data tables, but to mirrored or joined so-called 'views' that were generated by the SQL DBMS. Thus, this system guarantees a maximal level of safety for the already established SQL-based data. Each step performed with KNIME is executed and saved in a dedicated node, which thereby works as a documentation platform. As a result, a workflow of nodes was generated performing the complete integration of the MARK-AGE database (Fig. 2).

2.3. KNIME server

The KNIME server allows storage of and access to workflows via the internet. User access rights control how data are grouped for projects, workgroups or departments. The server was used to store and document KNIME workflows of the MARK-AGE database extension process (https://www.knime.org/knime-server).

2.4. Ethical clearance

The MARK-AGE study has been approved by the appropriate ethics committees and has been performed in accordance with the ethical standards laid down in the Declaration of Helsinki. All study subjects gave their informed consent prior to their inclusion in the study. This is described in great detail in Capri et al., this issue.

3. Results

3.1. Data entry

To guarantee the blinding of the study, questionnaire data (comprising descriptors of the respective subject such as age and gender) and bioanalytical data were stored separately. Questionnaires were uploaded to the database under a primary subject code (PSC) and data from biochemical analyses of samples were recoded and entered into the database under the secondary subject code (SSC) (see Bürkle et al., this issue). To join both types of information, a translation table (termed 'PSC-to-SSC' table) was established during the project (see Moreno-Villanueva and Kötter et al., this issue).

Fig. 2. Overview of the KNIME preparation workflow for the extended database. Information was separately implemented for bioanalytical and questionnaire data (A). Subsequently, specific calculations (B) and the renaming (C) of the parameters were performed. In the last step, subgroups from several recruitment phases were separated (D).

3.1.1. Entry of xls and csv data tables

For logistic reasons not all of the collected data from biochemical analysis were uploaded to the SQL database via the website interfaces. Therefore, xls or csv tables containing the secondary subject code with the respective data were integrated with the already established database using KNIME, thus leading to the formation of an 'Extended Database' (Fig. 1). The tables, manually generated by the analyzers, posed problems with the subject codes, as some of those were either invalid due to typos or used multiple times. To solve this problem, a KNIME workflow was established that automatically identifies and rejects all rows of a table containing invalid or multiple subject codes. Each incoming xls or csv file was checked separately, first for the uniqueness of SSCs by counting and second for validity by comparing all available codes with the PSC-to-SSC table from the SQL database, which contains all valid subject code information (Fig. 4). Invalid entries were separated and documented for communication to the partners. Applying this procedure, a total of 19 data tables were added to the Extended Database, containing on average 1.3% ± 1.2 invalid and 0.4% ± 0.8 multiply entered subject codes. The high standard deviations indicate that some tables displayed more coding errors than others. Additionally, the number of mis-entered subject codes indirectly reflects the data quality. Another scenario occurring in the data tables was the mix-up of codes and the respective bioanalytical values in the laboratories. Those cases could only be identified if they led to outliers in some downstream analysis, for example if a female subject was attributed values that apply to males only.
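The two checks described above (uniqueness by counting, validity by comparison with the PSC-to-SSC table) can be sketched in a few lines of Python. This is not the KNIME workflow itself, only an illustration of the logic. The row layout and the key name `ssc` are assumptions made for this example.

```python
from collections import Counter

def check_subject_codes(rows, valid_sscs):
    """Split rows into accepted rows, rows with repeated codes and rows with invalid codes.

    rows: list of dicts, each with an 'ssc' key (hypothetical column name).
    valid_sscs: set of all codes present in the PSC-to-SSC table.
    """
    counts = Counter(row["ssc"] for row in rows)
    accepted, multiple, invalid = [], [], []
    for row in rows:
        code = row["ssc"]
        if counts[code] > 1:
            multiple.append(row)   # code used more than once: reject all copies
        elif code not in valid_sscs:
            invalid.append(row)    # code unknown to the translation table
        else:
            accepted.append(row)
    return accepted, multiple, invalid

# Example with made-up codes and values
rows = [
    {"ssc": "12345", "value": 1.2},
    {"ssc": "12346", "value": 0.8},
    {"ssc": "12346", "value": 0.9},
    {"ssc": "99999", "value": 1.1},
]
accepted, multiple, invalid = check_subject_codes(rows, {"12345", "12346", "12347"})
print(len(accepted), len(multiple), len(invalid))
```

Rejected rows are returned rather than dropped so that they can be documented and reported back to the partner who delivered the file.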

Table 1
List of general modifications performed during construction of the Extended Database.

Recalculation of non-SI units to SI units
Normalizations on parameters
Normalizations on parameters with values measured by another partner
Removal of subjects with values beyond the boundaries of measuring
Calculation of ratios on parameters measured
Calculation of the means of duplicate measures
Normalizations for batch effects
Exclusion of values from ineligible samples (e.g., frozen samples inadvertently thawed during shipment or storage)
Calculation of established scores like BMI and HOMA index
Calculation of new scores developed during the MARK-AGE study

3.1.2. Addition of data columns

After data entry, some bioanalytical researchers ('analyzers') requested calculations like normalizations and corrections on their specific parameters. Therefore, a KNIME workflow was generated performing the steps requested, separately for each requesting analyzer (Fig. 3). If normalizations were performed on parameters, new columns were appended to the extended database in order to maintain the original data. The workflow not only documents the performed modifications but also automatically renews the calculations each time new data are uploaded.

Based on existing parameters, compound parameters such as ratios between individual parameters were calculated and appended to the Extended Database, like body mass index (BMI) (Must and Anderson, 2006) or Homeostasis Model Assessment (HOMA) index (Matthews et al., 1985). Such compound parameters have either been published already or were newly designed by the MARK-AGE Consortium members, like the 'Nutrition Score'. Table 1 shows a list of general calculation methods performed. The established KNIME workflow is used as documentation base and organized in a way that new columns can easily be added by the user.
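As an illustration of appending compound parameters without touching the original values, the following sketch computes BMI and a HOMA index for one record. The key names and the required input units are assumptions made for this example. The HOMA formula shown is the common insulin resistance approximation (fasting glucose in mmol/l times fasting insulin in µU/ml, divided by 22.5), and the paper does not state which variant was used.

```python
def add_compound_parameters(record):
    """Return a copy of the record with BMI and HOMA index appended as new keys.

    Assumed (hypothetical) keys and units: weight_kg, height_m,
    glucose_mmol_l (fasting glucose), insulin_uU_ml (fasting insulin).
    """
    extended = dict(record)  # copy, so the original values stay untouched
    # Body mass index: weight [kg] / height [m] squared
    extended["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    # HOMA insulin resistance index: glucose [mmol/l] * insulin [uU/ml] / 22.5
    extended["homa"] = record["glucose_mmol_l"] * record["insulin_uU_ml"] / 22.5
    return extended

record = {"weight_kg": 81.0, "height_m": 1.8,
          "glucose_mmol_l": 4.5, "insulin_uU_ml": 10.0}
extended = add_compound_parameters(record)
print(round(extended["bmi"], 1), round(extended["homa"], 2))
```

Returning a new dictionary mirrors the approach described above, where new columns are appended and the original data are maintained.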

Fig. 3. Schematic overview of the calculation workflow. Metanodes are used to perform the calculation steps separately for each project partner. The clear structure simplifies the usage and visualizes each step performed.

Fig. 4. Representative schematic overview of the KNIME file append flow. Three steps were performed to check the incoming files from the MARK-AGE partners. SSCs were counted (middle box) and checked for validity (upper box). In addition, the file format was checked (lower box) and converted if necessary. Invalid or multiple codes were excluded and documented in xls files via the red xls writer nodes.

3.2. Data pre-processing

3.2.1. Re-naming of parameters

The column names of the SQL database tables had been designed using short terms and the underscore sign. Parameter names, however, must be usable for headings in graphs etc. The re-naming of the parameters was performed with a standard KNIME node that automatically translates the column names according to a reference table containing the original column names as well as the new names. As a central place to store both kinds of information was necessary, they were implemented in the meta table established during the project (see Kötter and Moreno-Villanueva et al., this issue). Corrected names contain the more readable full-length name or, if too long, the standard abbreviation, as well as the unit in which the parameter was measured.
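Renaming by reference table amounts to a simple lookup. The sketch below shows the idea with a Python dictionary. The column names and readable names are made up for illustration and do not come from the MARK-AGE meta table.

```python
def rename_columns(row, reference):
    """Rename the keys of one data row according to a reference table.

    reference maps original column names to readable names;
    columns without an entry keep their original name.
    """
    return {reference.get(name, name): value for name, value in row.items()}

# Hypothetical reference table: original name -> readable name with unit
reference = {"hgb": "Haemoglobin [g/dl]", "mcv": "MCV [fl]"}
row = {"ssc": "12345", "hgb": 14.1, "mcv": 90.0}
print(rename_columns(row, reference))
```

Keeping the mapping in one table, separate from the renaming logic, means that a name can be corrected in a single place.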

3.2.2. Joining of analytic and questionnaire data

A necessary step to work with the extended database was the joining of separately stored bioanalytical and questionnaire data. As questionnaires were stored under the PSC and bioanalytical measurements under the SSC, a simple joining procedure was performed using the 'PSC-to-SSC' translation table. Upon checking of the data, however, a discrepancy in the numbers of subjects between bioanalytical and questionnaire data was noted.
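A join through a translation table, including a report of the unmatched codes that reveal such a discrepancy, could look like the following sketch. The data structures (dictionaries keyed by PSC or SSC) and all values are assumptions for illustration only.

```python
def join_by_translation(questionnaires, analyses, psc_to_ssc):
    """Join questionnaire rows (keyed by PSC) with bioanalytical rows (keyed by SSC).

    Returns the joined rows plus the PSCs that have no bioanalytical data
    and the SSCs that have no questionnaire data.
    """
    joined, psc_only = {}, []
    used_sscs = set()
    for psc, quest in questionnaires.items():
        ssc = psc_to_ssc.get(psc)
        if ssc in analyses:
            joined[psc] = {**quest, **analyses[ssc]}
            used_sscs.add(ssc)
        else:
            psc_only.append(psc)
    ssc_only = sorted(set(analyses) - used_sscs)
    return joined, psc_only, ssc_only

# Made-up example data
questionnaires = {"0200123": {"age": 52}, "0200128": {"age": 52}}
analyses = {"12345": {"hgb": 14.1}, "12349": {"hgb": 13.0}}
psc_to_ssc = {"0200123": "12345", "0200128": "12346"}
joined, psc_only, ssc_only = join_by_translation(questionnaires, analyses, psc_to_ssc)
print(joined, psc_only, ssc_only)
```

Returning the unmatched codes on both sides makes differences in subject numbers visible instead of silently dropping rows.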

As the questionnaires were divided into six electronic parts, the PSC had to be entered up to six times separately. Therefore, in some cases different PSC codes with typos or reversed digits appeared for a single subject. The database recognized each newly entered, differing PSC as a new subject and generated a new entity. As a result, more than one SSC could be available for one subject, and the complete questionnaire information from one subject could be dispersed as well (Table 2). Consequences of this problem may well have affected any phase of recruitment.

Table 2
Chart illustrating the PSC entering problem.

Row PSC SSC quest1 quest2 quest3 quest4 quest5 analysis
1 0200123 12345 x x x x x
2 0200128 12346 x x x
3 02rt123 12347 x x x
4 02RT123 12348 x x

Rows 1 and 2 reflect entries for the same subject in the first recruitment round. When questionnaire two was entered, the interviewer mistakenly typed an 8 instead of the 3 at the last digit. Because the second code is unknown to the system, it generates a new row, mistakenly representing a new person. The same problem occurs if indicators from further recruitment rounds were transposed (rt → tr) or written inconsistently in lower or upper case for one subject (rows 3 and 4).

3.2.2.1. First recruitment. For the main recruitment phase, erroneous multiple PSC insertions were not fixed. As there is no valid possibility to assemble the different entries belonging to one subject, all cases where only parts of questionnaires had been entered were excluded from further analysis.

3.2.2.2. Subsequent recruitment phases. Additional recruitment rounds, termed 're-sampling' or 're-testing', were performed during the project. In these phases only a subset of subjects was examined and data were entered again. For identification purposes, an altered PSC code with a separate identifier consisting of two letters was used. If those letters were mixed up during various entries, different PSCs were again generated for one subject. Fortunately, those mis-entries could be retraced according to the remaining parts of the code. As a result, a KNIME workflow was established that removes the identifier and writes all separately stored information into one valid PSC with a correct indicator.
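The retracing step can be pictured as mapping every variant of a code to one canonical form and merging the records stored under the variants. The sketch below assumes a code layout inferred from the examples in Table 2 (two digits, a two-letter indicator, three digits) and a canonical indicator `RT`. Both are assumptions made for illustration, not the documented MARK-AGE format.

```python
import re

# Hypothetical code layout inferred from Table 2: two digits, a two-letter
# indicator, then three digits (e.g. "02rt123").
PSC_PATTERN = re.compile(r"^(\d{2})([A-Za-z]{2})(\d{3})$")

def canonical_psc(psc, indicator="RT"):
    """Map variants such as '02rt123', '02RT123' or '02tr123' to one code."""
    match = PSC_PATTERN.match(psc.strip())
    if match is None:
        return None  # not a code with a two-letter indicator
    prefix, letters, number = match.groups()
    # Accept the indicator in any case and with its two letters swapped
    if sorted(letters.upper()) != sorted(indicator):
        return None
    return prefix + indicator + number

def merge_entries(entries):
    """Collect questionnaire parts stored under variant PSCs into one record."""
    merged = {}
    for psc, parts in entries:
        key = canonical_psc(psc)
        if key is not None:
            merged.setdefault(key, {}).update(parts)
    return merged

entries = [("02rt123", {"quest1": "x"}), ("02RT123", {"quest2": "x"}),
           ("02tr123", {"quest3": "x"})]
print(merge_entries(entries))
```

Codes that do not fit the assumed layout are skipped here. In practice they would be documented for manual review.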

3.2.3. Filtering data

According to the design of MARK-AGE, criteria for the enrolment of subjects had been established. Subjects had to fall in the age range 35–74.9 years; furthermore, positivity for hepatitis B or C was an exclusion criterion. For logistic reasons, some subjects were entered into the database that did not satisfy these criteria. General filters were established in a KNIME node that exclude and document those subjects.

Two exceptions had to be considered during the setup of the algorithm. The 'spouses of GEHA Offspring' [SGO] group was allowed to exceed the defined age range, as their number was rather low. Furthermore, subjects recruited during the re-testing phase obviously had different age requirements: if a person was 73 at the time of the first recruitment, 3 years later the age exceeded the above-mentioned limit, which, however, was accepted in this case.

After the data cleaning steps mentioned above it was possible to determine the number of valid entries of subjects that participated in the study. Therefore, we defined requirements necessary for a subject to be considered in standard analyses. A 'valid subject' required at least one bioanalytical parameter measured successfully (besides hepatitis analysis, where positivity was an exclusion criterion) and the fully entered part of the questionnaires containing age and gender information, while a 'recruited subject' required one part of the questionnaires or blood sampling collections entered into the database. Table 3 shows the number of recruited and valid subjects established with the explained cleaning steps and definitions.
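The enrolment filter with its two exceptions and the 'valid subject' definition can be written as two small predicates. This is a sketch under assumptions: the key names and group labels are hypothetical, and "exceeding the age range" is interpreted here as being above the upper limit only.

```python
MIN_AGE, MAX_AGE = 35.0, 74.9  # enrolment age range in years

def is_eligible(subject):
    """Apply the enrolment criteria with the two exceptions described above.

    subject: dict with hypothetical keys 'age', 'hepatitis_positive',
    'group' (e.g. 'SGO') and 'phase' (e.g. 'first' or 're-testing').
    """
    if subject["hepatitis_positive"]:
        return False
    if subject["age"] < MIN_AGE:
        return False
    if subject["age"] > MAX_AGE:
        # Exceptions: SGO subjects and re-tested subjects may exceed the limit
        return subject["group"] == "SGO" or subject["phase"] == "re-testing"
    return True

def is_valid_subject(subject):
    """A 'valid subject' has age and gender from the questionnaire and at least
    one successfully measured bioanalytical parameter other than hepatitis."""
    has_core = subject.get("age") is not None and subject.get("gender") is not None
    measured = [name for name, value in subject.get("measurements", {}).items()
                if value is not None and name != "hepatitis"]
    return has_core and len(measured) > 0

base = {"age": 50, "hepatitis_positive": False, "group": "main", "phase": "first"}
print(is_eligible(base), is_eligible({**base, "age": 80}))
```

Keeping the exceptions inside one function makes the rule set easy to document, which matters because excluded subjects have to be reported as well.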

3.3. Corrections of coupled parameters

Besides corrections on single parameters, corrections of coupled parameters were also set up.

3.3.1. ATC codes

As drug names vary from country to country, the standardized Anatomical Therapeutic Chemical (ATC) classification system was used to clearly indicate the drug intake of a subject. An ATC code consists of 5 levels defined by a specific order of letters and numbers (WHO, 2013). In the MARK-AGE database, typos in these codes occurred: for example, the number 0 and the letter O were used interchangeably or the coding system was not maintained. An algorithm was set up that corrects the typos and extracts the invalid codes, which should subsequently be corrected by the responsible recruitment centres. To extract information about the disease, level 3 ATC codes of all MARK-AGE subjects were grouped and a disease translation table was generated (Table 4).

Table 4

A 06 A Gastrointestinal
A 07 A Infectious disease
A 07 E Antiinflammatories/antirheumatic compounds
A 07 F Gastrointestinal
A 08 A Diabetes
A 09 A Gastrointestinal
A 10 A,B Diabetes
A 11 A,C,D,G,H Vitamins
B 01–02 A Thrombosis/coagulation disorders
B 03 B Vitamins
C 01 A-E Cardiac disease
C 02 A,C,D Hypertension cluster
C 03 A,B,C,E Hypertension cluster
C 04 A Thrombosis/coagulation disorders
C 07 A,B,C Hypertension cluster
C 08 C Hypertension cluster
C 08 D Cardiac disease
C 09 A,B,C,D,X Hypertension cluster
C 10 A,B Lipid metabolism
D 05 A Skin diseases
D 07 A Antiinflammatories/antirheumatic compounds
D 07 C Infectious disease
D 11 A Skin diseases
H 02 A Antiinflammatories/antirheumatic compounds
H 03 A,B,C Thyroid disorders
J 01 A,C,F,X Infectious disease
J 05 A Infectious disease
L 01 A,B,X Cancer therapy
M 01 A,C Antiinflammatories/antirheumatic compounds
M 02 A Pain
M 04 A Gout
M 05 B Bone disease
M 09 A Antiinflammatories/antirheumatic compounds
N 02 A,B Pain
N 03 A CNS disorders other than depression
N 04 B CNS disorders other than depression
N 05 A CNS disorders other than depression
N 05 B Depression/anxiolytics
N 06 A Depression/anxiolytics
N 06 B,D CNS disorders other than depression
N 07 X CNS disorders other than depression
P 01 B Infectious disease
R 03 A,B,D Lung/bronchial disorders
R 05 C,D Lung/bronchial disorders
S 01 A Infectious disease
S 01 B Antiinflammatories/antirheumatic compounds
S 01 E,F,X Eye disease
V 01 A Immune system disorders
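The correction of 0/O typos and the grouping by level 3 code described in this section can be sketched as follows. The sketch assumes that full seven-character ATC codes were entered (one letter, two digits, two letters, two digits). The paper does not specify the exact rules of the algorithm used, and the lookup dictionary holds only a three-entry excerpt of the translation table for illustration.

```python
import re

# Full ATC code: main group letter, 2 digits, 2 letters, 2 digits (e.g. A10BA02)
ATC_PATTERN = re.compile(r"^[ABCDGHJLMNPRSV]\d{2}[A-Z]{2}\d{2}$")
DIGIT_POSITIONS = (1, 2, 5, 6)   # positions that must hold digits
LETTER_POSITIONS = (3, 4)        # subgroup positions that must hold letters

# Small excerpt of the level 3 to disease translation table (illustration only)
LEVEL3_TO_DISEASE = {"A10A": "Diabetes", "A10B": "Diabetes",
                     "C10A": "Lipid metabolism"}

def correct_atc(code):
    """Fix O/0 confusion in a full ATC code; return None if still invalid."""
    chars = list(code.strip().upper())
    if len(chars) != 7:
        return None
    for i in DIGIT_POSITIONS:
        if chars[i] == "O":
            chars[i] = "0"   # letter O typed where a digit belongs
    for i in LETTER_POSITIONS:
        if chars[i] == "0":
            chars[i] = "O"   # digit 0 typed where a letter belongs
    fixed = "".join(chars)
    return fixed if ATC_PATTERN.match(fixed) else None

def disease_group(code):
    """Look up the disease group via the level 3 code (first four characters)."""
    fixed = correct_atc(code)
    if fixed is None:
        return None
    return LEVEL3_TO_DISEASE.get(fixed[:4])

print(correct_atc("A1OBA02"), disease_group("A1OBA02"))
```

Codes for which `correct_atc` returns `None` correspond to the invalid codes that were extracted and passed back to the recruitment centres.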

3.3.2. Blood parameters in standard units

For each subject a general blood count was performed during the study. As standard blood analysis devices and their reporting format vary between countries, the upload tables were generated such that the number and the unit for a single parameter were entered in two different columns. In this process, the unit had to be selected from a drop-down menu. In order to standardize the parameter units, an algorithm had to be established that re-calculates all values according to the International System of Units (SI units) (Table 5). Large unstained cells (LUC) and platelet distribution width (PDW) were only measured in one laboratory and were therefore excluded from the analysis.
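Because value and unit were stored in separate columns, the re-calculation reduces to a lookup of conversion factors. The following sketch shows the principle. The target units and the set of conversions are assumptions chosen for illustration, since the paper does not list the factors that were applied.

```python
# Conversion factors from a reported unit to an assumed target unit.
# "ul" stands for microlitre; the target units are an assumption of this sketch.
TO_TARGET = {
    ("g/dl", "g/l"): 10.0,           # 1 dl = 0.1 l
    ("thousand/ul", "10^9/l"): 1.0,  # 10^3 per 10^-6 l = 10^9 per l
    ("million/ul", "10^12/l"): 1.0,  # 10^6 per 10^-6 l = 10^12 per l
    ("number/ul", "10^9/l"): 0.001,  # 1 per 10^-6 l = 10^6 per l
}

def convert(value, unit, target):
    """Convert a value to the target unit; raise KeyError for unknown pairs."""
    if unit == target:
        return value
    return value * TO_TARGET[(unit, target)]

print(convert(14.0, "g/dl", "g/l"))
```

Raising an error for unknown unit pairs, instead of passing the value through, prevents unconverted values from entering the database unnoticed.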

Table 5
List of blood count parameters analyzed and units used.

Parameter     Long name                                     Unit
MCH           mean corpuscular haemoglobin                  picogram
MCHC          mean corpuscular haemoglobin concentration    g/dl
MCV           mean cell volume                              femtoliters
HCT           haematocrit                                   %
RDW           red cell distribution width                   %
HGB           haemoglobin                                   g/dl
HDW           haemoglobin distribution width                g/dl
RBC           red blood cells                               million/µl
WBC           white blood cells                             thousand/µl
Neutrophils   neutrophils                                   thousand/µl
Eosinophils   eosinophils                                   number/µl
Basophils     basophils                                     number/µl
Monocytes     monocytes                                     number/µl
Lymphocytes   lymphocytes                                   number/µl
Platelets     platelets                                     thousand/µl
MPV           mean platelet volume                          femtoliters

3.3.3. ZUNG scale scoring

The self-rating depression scale (Zung, 1965) was used to monitor whether the subjects suffered from depressive disorders (see Bürkle et al., this issue). Twenty questions had to be answered by using the following options: a little of the time, some of the time, good part of the time, most of the time. Each question was scored afterwards to calculate the depression status value. Gaps were not allowed in this system, but the subjects had the choice to select 'na' (not applicable) if they did not wish to answer a question. Therefore, several questions could not be used for the rating system. As already published (Shrive et al., 2006), these gaps were filled with the mean of the points from the individual subject.
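Filling gaps with the subject's own mean can be expressed in a few lines. The sketch assumes that each of the 20 answers has already been converted to its item score (1 to 4) and that 'na' answers are represented as `None`. How a subject with no answered items should be handled is not stated in the text, so the function simply returns `None` in that case.

```python
def depression_score(item_scores):
    """Sum 20 item scores (1-4), filling 'na' answers with the subject's mean.

    item_scores: list of entries, each an int from 1 to 4 or None for 'na'.
    Returns None if no item was answered.
    """
    answered = [s for s in item_scores if s is not None]
    if not answered:
        return None
    mean = sum(answered) / len(answered)  # person mean of the answered items
    return sum(mean if s is None else s for s in item_scores)

print(depression_score([2] * 18 + [None, None]))
```

Person-mean imputation keeps the total on the same 20-item scale, so scores of subjects with and without gaps remain comparable.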

3.3.4. Calculating uniform time intervals

Information requested on food and beverage intake was entered with a tabular system, in which the subject had to specify the amount consumed either per day, week or month. To work with these data in a standardized fashion, all intakes were recalculated to a weekly and monthly indication with an established algorithm.
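One way to recalculate intakes is to convert each amount to a daily rate first and then scale it to a week and a month. The sketch below does this under the assumption of an average month length of 365.25/12 days. The conversion constants that were actually used in the project are not given in the text.

```python
DAYS_PER_WEEK = 7.0
DAYS_PER_MONTH = 365.25 / 12  # assumed average month length in days

def weekly_and_monthly(amount, period):
    """Convert an amount given per 'day', 'week' or 'month' to per-week and per-month."""
    days = {"day": 1.0, "week": DAYS_PER_WEEK, "month": DAYS_PER_MONTH}[period]
    per_day = amount / days
    return per_day * DAYS_PER_WEEK, per_day * DAYS_PER_MONTH

print(weekly_and_monthly(2, "day"))
```

Going through a daily rate keeps the three input periods consistent with each other, whatever month length is chosen.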

4. Conclusion

In this paper, we describe the necessary steps that were performed on the MARK-AGE raw data to generate a database for the final user. The tools used provide methods to detect and handle problems hidden in the data structure of collected raw data. The problems we reported should be used to implement preventive strategies in new aging research projects. Additionally, KNIME is introduced as a web-based tool for the development of an easy-to-handle data communication platform.

Even our best efforts invested in the design of the project could not guarantee complete prevention of errors or problems related to data entry into the database. During the project, error sources in the growing database were identified, and thus data quality improved continuously. Some of the problematic effects were identified shortly after the launch of the project whereas others were hidden and only detected after analysis. The fact that an international project like MARK-AGE involves several countries with divergent standards and guidelines made it even more difficult to establish a standardized working system. Since the creation of large European databases for the analysis of biological systemic effects will continue to be a relevant task, strategies to generate reliable databases of highest quality are necessary. So far publications covering this aspect have been rare. With our above description of essential steps on the MARK-AGE database, we provide relevant information from our hands-on experience on frequent error sources and ideas for preventive solution strategies.

Investing sufficient time and manpower in the construction phase of a project and its database is essential and avoids high costs, delays and quality loss. In particular, a realistic estimate of the required funds for accomplishing the crucial programming work in the early phase of the project is of utmost importance. These programming tasks include the implementation of a database structure and the establishment of an appropriate backup system, the design of web interfaces for data entry with suitable consistency checks like value restrictions (avoiding free entry frames whenever possible), the selection and implementation of error-correcting subject codes, and the implementation of a framework for subsequent analysis. KNIME was chosen for data integration and retrieval because it is easier to handle for non-IT-experts and directly provides analysis tools and elaborate reporting features. Even non-experts can efficiently work with this system after a training period of approximately one week, based on the user-friendly interface and intuitive node structure.

Without sufficient knowledge about the data background and the ways data were prepared, analysis can cause misleading results. The documentation of the extended MARK-AGE database construction is completed and a detailed description is available for users. Therefore the work presented is also necessary for upcoming publications presenting results on the MARK-AGE data. If parts of published data were to be integrated in other aging research databases like the Digital Ageing Atlas (Craig et al., 2015), it would be necessary to know about data source and data preparation strategies. Lastly, if and when the MARK-AGE database is made publicly available, it would be necessary that each user is familiar with the data background.

Acknowledgements

We wish to thank the European Commission for financial support through the FP7 large scale integrating project European Study to Establish Biomarkers of Human Ageing (MARK-AGE; grant agreement no.: 200880). Furthermore we wish to thank all MARK-AGE Consortium members for their efforts to make this work possible. Our special thanks go to Lothar Gasteiger, Thorsten Meinl and Peter Burger for the support regarding hardware and software tools.

References

Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Sieb, C., Thiel, K., Wiswedel, B., 2007. KNIME: the Konstanz Information Miner. In: Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag, Heidelberg-Berlin.

Chamberlin, D.D., Boyce, R.F., 1974. SEQUEL: a structured English query language. In: SIGFIDET '74: Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control, 249–264.

Chapman, A.D., 2005. Principles and Methods of Data Cleaning Occurrence Data. Version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen, 1–72.

Codd, E.F., 1970. A relational model of data for large shared data banks. Communications of the ACM 13 (6).

Craig, T., Smelick, C., Tacutu, R., Wuttke, D., Wood, S.H., Stanley, H., Janssens, G., Savitskaya, E., Moskalev, A., Arking, R., De Magalhaes, J.P., 2015. The digital ageing atlas: integrating the diversity of age-related changes into a unified resource. Nucleic Acids Res. 43, D873–D878.

De Magalhaes, J.P., Costa, J., Toussaint, O., 2005. HAGR: the human ageing genomic resources. Nucleic Acids Res. 33, D537–D543.

Hellerstein, J.M., 2008. Quantitative data cleaning for large databases. Survey for the United Nations Economic Commission for Europe (UNECE). http://db.cs.berkeley.edu/jmh

Mackey, J., Pearson, W.R., 2004. Using relational databases for improved sequence similarity searching and large-scale genomic analyses. In: Current Protocols in Bioinformatics. John Wiley & Sons, Inc., Chapter 9, Unit 9.4.

Matthews, D.R., Hosker, J.P., Rudenski, A.S., Naylor, B.A., Treacher, D.F., Turner, R.C., 1985. Homeostasis model assessment: insulin resistance and beta-cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia 28 (7), 412–419.

Must, A., Anderson, S.E., 2006. Body mass index in children and adolescents: considerations for population-based applications. Int. J. Obesity 30 (4), 590–594.
