
6.4 Data pre-processing in the e:KID study

Often data cannot be employed in the state in which they are contained in e:KID-DB-Basic: Some raw variables are difficult to use and require pre-processing. This pre-processing differs from data cleaning in that it is context-dependent: It does not involve the correction of errors, as the values are (supposedly) correct per se, but rather the adaptation of the values to a concrete data analysis procedure. Therefore, the results of data pre-processing do not necessarily lead to the generation of a new version of e:KID-DB-Basic, but rather to changes in the database e:KID-DB-Active used for data analysis. In this section, I will address some examples of pre-processing of the e:KID data employed in data analysis, including the management of missing data.

6.4.1 Generation of new variables for data analysis

The need to pre-process variables before their use in analysis can have different causes. One common case is that of variables that were not explicitly collected because they can easily be calculated from other data. This is the case, for example, for body mass index and EBV mismatch (a function of donor and recipient serostatus). These newly calculated variables should not be included in the main database e:KID-DB-Basic, as they would constitute a redundancy, unnecessarily increasing the size of the database and the probability of introducing new errors. Nevertheless, since they were needed for data analysis, they are automatically calculated in the scripts generating e:KID-DB-Active, as sketched below.
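A minimal sketch of how such derived variables can be computed on the fly when building the analysis database; the column names (weight_kg, height_m, donor_ebv, recipient_ebv) are hypothetical and not taken from the actual e:KID scripts:

```python
import pandas as pd

def add_derived_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Derive variables for the analysis database instead of storing them
    in the basic database (avoids redundancy and inconsistent updates)."""
    df = df.copy()
    # Body mass index: weight (kg) divided by squared height (m).
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
    # EBV mismatch: seropositive donor with seronegative recipient (D+/R-).
    df["ebv_mismatch"] = (df["donor_ebv"] == "positive") & (
        df["recipient_ebv"] == "negative"
    )
    return df
```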

Often more than one measurement of the same variable for the same patient and time point can be found in the database. These are replicates, which can be used to monitor the precision of an experiment.223 In e:KID-DB-Basic 1.7, for example, two replicates (measured at two different time points) for cytomegalovirus (CMV) viral load were included. A total of 728 measurements (23.4%) were replicated, of which the majority (96.7%) showed only a small difference between the replicates (<500 copies·mL⁻¹). As the precision of these measurements is comparable, the average was used for data analysis and was included as a new variable in e:KID-DB-Active. Discrepant replicates – i.e. those with a large difference between the two viral loads – were re-measured. The average of all three replicates is inappropriate for the analysis of discrepant data, as it takes into account values that differ extremely from the other two: Therefore, from that point on, the median was employed as the reference variable for data analysis of CMV reactivations (chapters 7 and 8).211,213
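To illustrate why the median was preferred once three replicates were available: with two concordant values and one extreme outlier, the mean is pulled towards the outlier, while the median stays with the concordant pair. A small, self-contained example (the viral load values are invented):

```python
from statistics import mean, median

# Hypothetical CMV viral loads (copies/mL): two concordant replicates
# and one strongly discrepant re-measurement.
replicates = [450.0, 520.0, 48_000.0]

print(mean(replicates))    # 16323.3... -> dominated by the outlier
print(median(replicates))  # 520.0      -> robust reference value
```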

6.4.2 Statistical transformation and normalization of variables

In the e:KID study, a mixed approach was taken for the statistical transformation of variables: Part of the data had to be transformed before their integration into e:KID-DB-Basic, e.g. metabolomics and gene expression data (see sub-section 6.2.3). However, most data are included in their raw state, leaving open the possibility of applying different transformations as appropriate for each analysis approach.

An example of the importance of variable transformation can be found in our manuscript on predictive antibody profiles for acute cellular rejection, which is part of this doctoral thesis (chapter 9).130 Binding interactions of serum antibodies with HLA antigens were measured as a list of quantitative variables denoting the intensity of binding for each antigen.130 Conventional methods employed to predict acute rejection, however, require a binary value (presence or absence of binding) for each interaction.224–230 It is moreover controversial how the binarization of the data should be calculated – some authors favour a fixed threshold, while others prefer an individual threshold for each patient.130,231 In addition, a binarization of the data necessarily entails a loss of information on the strength of the interactions. Therefore, in this manuscript we investigated the influence of the data transformation method on the quality of the prediction of acute rejection, comparing the performance of raw data, normalized quantitative data and binarized data.130 Our results showed that quantitative, normalized data performed best with the P-SVM machine learning algorithm.62,130 This demonstrates the central importance of data transformation for obtaining satisfactory results.
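The two binarization strategies discussed above can be contrasted in a few lines. This is only an illustrative sketch, not the transformation pipeline of the manuscript; the threshold value, the per-patient quantile and the normalization scheme are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical binding intensities: rows = patients, columns = HLA antigens.
intensities = rng.lognormal(mean=5.0, sigma=1.0, size=(4, 6))

# Strategy 1: one fixed threshold shared by all patients.
binary_fixed = intensities > 500.0

# Strategy 2: an individual threshold per patient, e.g. that patient's
# 75th percentile, so each row is binarized relative to its own background.
per_patient = np.quantile(intensities, 0.75, axis=1, keepdims=True)
binary_individual = intensities > per_patient

# Alternative retaining the binding strength: keep the quantitative values
# but normalize them (here simply z-scored per antigen).
normalized = (intensities - intensities.mean(axis=0)) / intensities.std(axis=0)
```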

6.4.3 Dealing with strong centre effects: The case of GFR

A particularly relevant case of a variable with a strong centre effect was the glomerular filtration rate (GFR). GFR is an estimation of renal function and is widely employed both in the clinic and in research.232 It is calculated from the serum creatinine concentration and the demographic characteristics of the patient, employing different formulae.233–235 After examining the data distribution of GFR, I observed very strong differences between centres, with centre O having significantly higher GFR values. A large part of the variation stemmed from the fact that two different formulae were employed in Harmony: MDRD-IV and Cockcroft-Gault.234,235 The reason for this is that GFR is an important decision tool in the clinic – for a good patient outcome, it is advisable that physicians work with the standard they are experienced with. To avoid confusion, the formula employed for each calculation was recorded in the Harmony database. However, the results of MDRD-IV and Cockcroft-Gault are not comparable: They employ different units and have slightly different biological meanings (Cockcroft-Gault is considered rather an estimation of creatinine clearance).234–236 Furthermore, there were additional differences in the way each centre calculated the GFR, as some centres cap all values below or above a certain threshold, based on their clinical experience.

Because of all this, we decided to avoid centre effects in the GFR calculation by recalculating it centrally for all samples, based on the available data collected in the Harmony study. In addition to MDRD-IV and Cockcroft-Gault, the newer CKD-EPI formula for the calculation of GFR was employed.232,234,235 The GFR calculated through all three formulae was incorporated into e:KID-DB-Basic 1.14; we employed the CKD-EPI formula as the reference for data analysis in two of the manuscripts presented in this doctoral thesis (chapters 7 and 8), as it has been shown to be the most accurate formula.232,236
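A minimal sketch of such a central recalculation, using the published forms of the three formulae (assuming serum creatinine in mg/dL, age in years and weight in kg; the 2009 CKD-EPI equation is shown with its coefficients as published, and the exact variant used in the actual e:KID scripts is not reproduced here):

```python
def cockcroft_gault(scr, age, weight_kg, female):
    """Creatinine clearance in mL/min (Cockcroft-Gault)."""
    ccl = (140 - age) * weight_kg / (72 * scr)
    return ccl * 0.85 if female else ccl

def mdrd_iv(scr, age, female, black=False):
    """GFR in mL/min/1.73 m^2 (4-variable MDRD, IDMS-traceable)."""
    gfr = 175 * scr**-1.154 * age**-0.203
    if female:
        gfr *= 0.742
    if black:
        gfr *= 1.212
    return gfr

def ckd_epi_2009(scr, age, female, black=False):
    """GFR in mL/min/1.73 m^2 (CKD-EPI 2009)."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    gfr = (141 * min(scr / kappa, 1) ** alpha
           * max(scr / kappa, 1) ** -1.209
           * 0.993**age)
    if female:
        gfr *= 1.018
    if black:
        gfr *= 1.159
    return gfr
```

Note that Cockcroft-Gault additionally requires the body weight, which is one reason why its results are not directly comparable to the body-surface-normalized MDRD-IV and CKD-EPI estimates.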

Figure 9. Comparison of the newly calculated GFR values with the values in the clinical database for all samples. The colour and point type denote the (pseudonymized) centre where the sample was collected. No units are given for the original GFR, as it is a mixture of values calculated in mL·min⁻¹·1.73 m⁻² (MDRD-IV) and mL·min⁻¹ (Cockcroft-Gault). As can be observed, most values correlate linearly with the newly calculated GFR; the clearest exception was centre O, which was the only one employing exclusively Cockcroft-Gault. Note the capping of the GFR values in centres F (GFR ≥ 60), H (GFR ≤ 20) and J (GFR ≥ 80).

However, it should not be assumed that the newly calculated GFR is free from centre effects: The variable still relies on the serum creatinine concentrations measured at each transplantation centre, so that protocol differences between centres may introduce a bias. Moreover, the variable can be influenced by differences in patient demographics and management between the centres. Because of this – especially for studies seeking to identify a causal relationship between a factor and GFR – the variation introduced by centre effects has to be taken into account in the study. This was the case in our manuscript on the differences in outcome between two different therapeutic strategies (see chapter 8).213
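One common way to take such centre effects into account – shown here as a generic sketch, not necessarily the approach of the manuscript – is to model the centre as a random effect, so that systematic level differences between centres do not masquerade as effects of the factor of interest. A minimal example with the statsmodels library and invented data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Hypothetical data: GFR measurements from three centres with different
# baseline levels and two therapeutic strategies.
df = pd.DataFrame({
    "centre": np.repeat(["A", "B", "C"], 40),
    "therapy": np.tile(["standard", "experimental"], 60),
})
centre_shift = df["centre"].map({"A": 0.0, "B": 8.0, "C": -5.0})
df["gfr"] = (55 + centre_shift + (df["therapy"] == "experimental") * 4
             + rng.normal(0, 10, size=len(df)))

# Mixed model: therapy as fixed effect, centre as random intercept.
model = smf.mixedlm("gfr ~ therapy", data=df, groups=df["centre"])
print(model.fit().summary())
```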

6.4.4 Working with variables with missing values

The appropriate strategy for handling missing values depends both on the goals and methods of the analysis as well as on data completeness. Here, I will briefly describe how missing values were taken into account in the design of our work.

In our two manuscripts employing biostatistical methods (chapters 7 and 8), the effects of missing data had to be considered in the analysis.211,213 In these manuscripts, the glomerular filtration rate (GFR) was the main outcome variable, and its distribution was compared between patients with different viral reactivation histories or different therapeutic strategies.211,213 As the loss of patients to follow-up increases the number of missing GFR values in the later visits of the study, the statistical power of the GFR comparison one year after transplantation was reduced.211,213 However, the statistical power was still high enough to observe significant effects for conditions associated with a 20% reduction of the median GFR.211,213 Moreover, the risk of bias introduced by missing values was reduced, as all analyses were run taking into account the entire GFR time course from the second visit onwards (in which missing values are very few).211,213
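As an illustration of such a complete-case, per-visit group comparison (a sketch only; the actual statistical procedures of the manuscripts are not reproduced, and the column names are hypothetical):

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def compare_gfr_per_visit(df: pd.DataFrame) -> pd.Series:
    """Nonparametric comparison of GFR between two groups at every visit.

    df: long-format table with columns "visit", "group" and "gfr";
    missing GFR values are simply dropped (complete-case analysis).
    """
    pvalues = {}
    for visit, sub in df.dropna(subset=["gfr"]).groupby("visit"):
        a = sub.loc[sub["group"] == "A", "gfr"]
        b = sub.loc[sub["group"] == "B", "gfr"]
        pvalues[visit] = mannwhitneyu(a, b).pvalue
    return pd.Series(pvalues, name="p_value")
```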

While imputation of missing values is often performed in machine learning methods, it was not necessary in the case of our manuscript (chapter 9): Only half of the patients of the sub-cohort had available measurements, but these measurements had no missing values.130 Lastly, missing values were not considered in our manuscript on mathematical modelling of the immune system (chapter 10) due to the analysis method.237 Each patient time course was modelled individually at different time points, therefore the modelling approach explicitly estimates all values between the available measurements.237 For this estimation it is indifferent whether there is an available measurement x days after transplantation for all patients, the critical factor is rather the frequency of the available measurements.
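The point about irregular sampling can be made with plain interpolation (the models of chapter 10 are mechanistic, so this is only an analogy): once each patient's time course is fitted individually, values at any day can be read off, regardless of whether other patients were measured on that day.

```python
from scipy.interpolate import interp1d

# Hypothetical measurement days and values for two patients; the days
# do not need to coincide between patients.
patient_a = interp1d([0, 14, 30, 90], [1.0, 3.5, 2.2, 1.4])
patient_b = interp1d([0, 10, 45, 90], [0.8, 2.9, 2.5, 1.1])

day = 60  # no patient was measured exactly on day 60
print(patient_a(day), patient_b(day))  # both curves can still be evaluated
```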

These examples from our work demonstrate that there is no “one-size-fits-all” strategy for the handling of missing data.55,57,58 We employed only the available data for the analyses, and while imputation strategies were considered, they were not deemed adequate due to their intrinsic uncertainty. However, for future work employing a large number of markers and patients (such as the search for predictive markers of viral reactivation), imputation might prove an adequate and useful strategy for the handling of missing values. In summary, the strategy has to take into account the goals and methods of the analysis, as well as the number of available measurements, the evidence of possible bias due to data missing not at random and the practical feasibility of data imputation.