6. Data management in the e:KID study

6.3 Data cleaning in the e:KID study

6.3.3 Evaluating the plausibility of data with biostatistical methods

As explained in the introduction (sub-section 5.3.2), the plausibility of data points can be assessed employing a large array of methods, including frequency and distribution analysis.46,47

An illustrative example of the use of frequency analysis within the e:KID study is the cleaning of the Epstein-Barr virus (EBV) load data. These EBV load data were included for the first time in version 1.8 of e:KID-DB-Basic. However, I observed that these data were not plausible: while over three thousand samples had been measured for the other viruses, there were values for only 309 samples in the case of EBV. Moreover, the prevalence of viral loads over the detection limit (250 copies·mL⁻¹) was surprisingly high (63% of the samples) compared to BK virus (14%) and cytomegalovirus (5%), which are known to be about as prevalent as EBV in the human population.214–216 I found the reason for these irregularities in conversation with the experimentalist who had generated the data: the code employed for the calculation of the viral load generated an error message for loads below the detection limit, as it attempted to perform a division by zero, whereas a value of zero was used to denote missing samples. This convention was not immediately evident, and the database manager had interpreted the error messages as missing values and the zero values as loads below the detection limit. The data were therefore corrected in a new version of e:KID-DB-Basic. The corrected EBV data showed that 3163 samples had been measured, with a prevalence of viral load over the detection limit of 5%, very similar to that of cytomegalovirus. These corrected data were employed in our analyses of EBV reactivations performed in two of the manuscripts presented here (chapters 7 and 8).211,213
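The effect of the two readings of the coding convention can be illustrated with a minimal sketch. It is not the code used in the study: the error token `"ERR"`, the function names and the toy values are hypothetical, and only the detection limit of 250 copies·mL⁻¹ is taken from the text.

```python
from typing import List, Optional, Sequence, Union

DETECTION_LIMIT = 250.0  # copies/mL, as stated in the text
ERROR_TOKEN = "ERR"      # hypothetical stand-in for the division-by-zero error message

Raw = Union[str, float, int]


def decode(raw: Sequence[Raw], error_means_missing: bool) -> List[Optional[float]]:
    """Translate raw export values into loads (None = no measurement).

    error_means_missing=True  : error -> missing, 0 -> below detection limit
    error_means_missing=False : error -> below detection limit, 0 -> missing
    """
    decoded: List[Optional[float]] = []
    for value in raw:
        if value == ERROR_TOKEN:
            decoded.append(None if error_means_missing else 0.0)
        elif value == 0:
            decoded.append(0.0 if error_means_missing else None)
        else:
            decoded.append(float(value))
    return decoded


def summarise(loads: Sequence[Optional[float]]) -> dict:
    """Frequency analysis: number of measured samples and share above the limit."""
    measured = [v for v in loads if v is not None]
    above = sum(1 for v in measured if v > DETECTION_LIMIT)
    prevalence = above / len(measured) if measured else float("nan")
    return {"n_measured": len(measured), "prevalence": prevalence}


# Toy data: 8 error messages, 2 zeros, 2 positive loads.
raw_values = [ERROR_TOKEN] * 8 + [0, 0, 1200.0, 5000.0]

misread = summarise(decode(raw_values, error_means_missing=True))
corrected = summarise(decode(raw_values, error_means_missing=False))
print(misread)    # few measured samples, high prevalence
print(corrected)  # many measured samples, low prevalence
```

In this toy data set the misread convention yields 4 measured samples with a prevalence of 0.5, whereas the corrected convention yields 10 measured samples with a prevalence of 0.2, the same qualitative pattern as described above.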

Figure 7. Histograms of the distribution of five clinical variables at visit 8 from e:KID-DB-Basic before cleaning. These variables were collected in the Harmony database and imported into version 1.2 of e:KID-DB-Basic; they correspond to the final study visit, one year after transplantation.

The shaded interval corresponds to extreme outliers, as defined by Tukey’s fences with k=3. Note that the y-axis is truncated at a frequency of 50 to make the outliers more easily identifiable.

Analysis of the statistical distribution of variables was also employed to identify implausible values. Figure 7 shows the distribution of some clinical variables one year after transplantation, as recorded in e:KID-DB-Basic 1.2. These clinical variables were originally measured within the Harmony study, cleaned and recorded in the table “Laboratory Analysis (Visit 8)”.
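The definition of extreme outliers used in Figure 7 (Tukey’s fences with k=3) can be sketched as follows. This is an illustrative sketch rather than the analysis code of the study: the quartile method (linear interpolation, `method="inclusive"`) is an assumption, as the text does not state which one was used, and the values are invented.

```python
import statistics
from typing import List, Sequence, Tuple


def tukey_fences(values: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: quartiles by linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def extreme_outliers(values: Sequence[float], k: float = 3.0) -> List[float]:
    """Return the values lying outside Tukey's fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Invented example values with one implausibly high entry.
counts = [4.1, 4.5, 4.7, 4.8, 5.0, 5.2, 5.3, 48.0]
print(tukey_fences(counts))
print(extreme_outliers(counts))  # [48.0]
```

Whether a flagged value is removed, corrected or kept remains a clinical decision; the fences only identify candidates.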

As can be observed in the figure, there were extreme outliers for white blood cells, red blood cells and

blood cell count are not biologically possible and were most probably due to an experimental error.217–219 Therefore, I proposed removing the outliers in the red blood cell count from the database, while the outliers for white blood cells and C-reactive protein were kept.

The case of creatinine, a marker of kidney function, is especially interesting. There was a very large number of outliers and a bimodal distribution: the large majority of the patients (N=387, 87.4%) had a blood concentration below 10 mg·dL⁻¹, while for the rest of the patients (N=56, 12.6%) the concentration was over 30 mg·dL⁻¹. Such a large difference strongly suggested that the data had been recorded in different units; the fact that 96.4% of the patients with creatinine over 30 had been transplanted in one centre (centre B) further supported this suspicion. Moreover, creatinine levels of 30 mg·dL⁻¹ are not possible, as a serum creatinine ≥4 mg·dL⁻¹ already indicates the need for dialysis.220 My analyses showed that assuming these outlier concentrations had actually been measured in μmol·L⁻¹ (a widely used unit for creatinine levels) led to a normalization of the distribution (see Figure 8). The resulting distribution therefore supports the suspicion of different units. These corrected creatinine data were essential for the analyses of renal function performed in two of the manuscripts presented here (chapters 7 and 8).211,213
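The unit harmonisation described for creatinine can be sketched as below. The sketch rests on stated assumptions: the cut-off of 30 is taken from the gap in the distribution described above, the function name is hypothetical, and the factor of 88.4 μmol·L⁻¹ per mg·dL⁻¹ follows from the molar mass of creatinine (about 113.12 g·mol⁻¹). It is not the procedure documented for e:KID-DB-Basic.

```python
from typing import List, Sequence

UMOL_L_PER_MG_DL = 88.4  # 1 mg/dL creatinine corresponds to about 88.4 umol/L


def harmonise_creatinine(values: Sequence[float], suspect_above: float = 30.0) -> List[float]:
    """Convert values assumed to be in umol/L to mg/dL.

    Assumption: any value above `suspect_above` was recorded in umol/L;
    values at or below it are taken to be in mg/dL already.
    """
    harmonised = []
    for value in values:
        if value > suspect_above:
            harmonised.append(value / UMOL_L_PER_MG_DL)
        else:
            harmonised.append(value)
    return harmonised


# Invented values: two plausible mg/dL entries, two suspected umol/L entries.
recorded = [0.9, 1.4, 106.0, 150.3]
converted = harmonise_creatinine(recorded)
print([round(v, 2) for v in converted])
```

After conversion all toy values lie below 10 mg·dL⁻¹, which mirrors the check made in the text: the conversion is only credible if the converted values fall into the range occupied by the rest of the cohort.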

In the case of haemoglobin, a solution was not as evident, as all individual values can be considered possible in spite of the bimodality of the distribution.221,222 However, there were strong centre effects, again for centre B: in this centre, 100% (N=55) of the patients had haemoglobin levels ≤12 g·dL⁻¹ (considered diagnostic for anaemia)221, while in the other centres the prevalence was 37.2%. Therefore, I concluded that haemoglobin levels in centre B had been measured in mmol·L⁻¹, another standard unit. The results of the conversion can be observed in Figure 8; the patients from centre B are now distributed similarly to the rest of the population. The newly calculated data for haemoglobin concentration were therefore incorporated into e:KID-DB-Basic.
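The centre comparison that motivated the haemoglobin correction can be sketched as follows. The records and function names are hypothetical; the threshold of 12 g·dL⁻¹ comes from the text, and the factor of 1.611 g·dL⁻¹ per mmol·L⁻¹ assumes the commonly used monomer-based molar mass of haemoglobin (about 16,114 g·mol⁻¹).

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

ANAEMIA_THRESHOLD = 12.0  # g/dL, as used in the text
MMOL_L_TO_G_DL = 1.611    # assumption: monomer-based conversion factor

Record = Tuple[str, float]  # (centre, haemoglobin value)


def share_at_or_below(records: Sequence[Record],
                      threshold: float = ANAEMIA_THRESHOLD) -> Dict[str, float]:
    """Per centre, the fraction of values at or below the threshold."""
    totals: Dict[str, int] = defaultdict(int)
    low: Dict[str, int] = defaultdict(int)
    for centre, value in records:
        totals[centre] += 1
        if value <= threshold:
            low[centre] += 1
    return {centre: low[centre] / totals[centre] for centre in totals}


def convert_centre(records: Sequence[Record], centre: str,
                   factor: float = MMOL_L_TO_G_DL) -> List[Record]:
    """Multiply the values of one centre by a unit conversion factor."""
    return [(c, v * factor if c == centre else v) for c, v in records]


# Invented records: centre B reported in mmol/L, centre A in g/dL.
records = [("A", 11.0), ("A", 13.5), ("A", 14.2), ("A", 12.8),
           ("B", 7.5), ("B", 8.4), ("B", 9.0)]

before = share_at_or_below(records)
after = share_at_or_below(convert_centre(records, "B"))
print(before)  # centre B: every value at or below the threshold
print(after)   # centre B: no value at or below the threshold
```

A share of 100% in a single centre is the kind of systematic difference that individual range checks cannot detect, since every single value is plausible on its own.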

Figure 8. Histograms of the distribution of creatinine and haemoglobin concentrations at visit 8 after cleaning. The distribution in patients from centre B is depicted in green, that of the rest of the cohort in blue. The shaded interval corresponds to extreme outliers, as defined by Tukey’s fences with k=3. As can be observed, the patients from centre B are now distributed similarly to the rest of the cohort.

These examples show the opportunities and limitations of methods based on the examination of statistical distributions. While they can be performed very quickly and without special clinical knowledge, decisions cannot be taken solely on the basis of the statistical distribution: the same phenomenon, i.e. extreme outliers, has a different interpretation for red blood cell counts than for C-reactive protein or creatinine concentration. The examples also highlight that many mistakes in clinical databases are not typographic, but rather the result of employing different conventions (units, coding, etc.). For multi-centre studies it is therefore useful to explore systematic centre-associated differences, even if all values are within the possible range, as in the case of haemoglobin. In summary, these examples demonstrate that a basic understanding of the clinical values, their expected distribution, the different standards employed and their information flow is paramount to any data cleaning based on the examination of statistical distributions.