

2.1.6 Detection of outlier samples and array quality assessment


without the effect of imputation, can be evaluated. Outlier detection based solely on the median results in a clear cutoff for outliers. Detection based on the z-score also takes the SD of transcript expression into account, and therefore a clear line cannot be observed.

Compared to detection via the median, a different set of values is detected as outliers. The third method introduced, outlier detection using median and MAD, identifies a higher number of outliers. The MAD is a more robust estimator of the standard deviation; here it is smaller than the SD used for the previous two methods, and therefore more values lie outside the interval [median_ij ± 3 MAD] than outside the interval [mean_ij ± 3 SD]. The MAD is in most cases smaller than 1, and therefore the interval [median_ij ± 3 MAD] is also narrower than the interval [median_ij ± 3]. This allows us to increase the constant in this approach to reduce the probability of falsely detecting outliers; a sketch of the three rules is given below.
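A minimal sketch in R of the three rules compared above, assuming x holds the log-expression values of one transcript across the arrays of a group (function and object names are illustrative):

```r
## Flag outlier values of one transcript by the three rules discussed above.
flag_outliers <- function(x, k = 3) {
  list(
    median_offset = abs(x - median(x, na.rm = TRUE)) > k,                      # [median +/- 3]
    z_score       = abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE),  # [mean +/- 3*SD]
    median_mad    = abs(x - median(x, na.rm = TRUE)) >
                      k * mad(x, constant = 1, na.rm = TRUE)                   # [median +/- 3*MAD]
  )
}
## Note: constant = 1 gives the raw MAD described in the text; R's default
## constant of 1.4826 would rescale it to a consistent estimator of the SD.

## Example: one extreme value among otherwise similar replicate measurements
flag_outliers(c(7.9, 8.1, 8.0, 8.2, 12.5))
```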

The visual inspection of outliers allows us to find arrays with a high number of outliers and thereby to identify interesting subjects deviating from the group, or arrays that may be corrupted and need to be excluded from the analysis.


by gene set enrichment analysis. The use of outlier removal procedures leads to an improvement in the biological relevance of the analysis. This description has focused on relative outlier detection, in which outlier arrays are discovered in comparison to other arrays.

Absolute quality metrics use internal controls, spike-ins, or the variability of replicate probe sets on the array (Kauffmann and Huber, 2010). A careful examination of the quality metrics provided by the manufacturer gives an impression of the absolute quality of an array.

Detection of outlier arrays in comparison to other microarrays can be based on robust principal component analysis (Shieh and Hung, 2009) or on the correlation coefficient between arrays and the number of outlier values per array (Yang et al., 2007). In general, the distance between arrays, and in particular the distance between technical replicates, with a subsequent cluster analysis can be used to detect compromised or mislabeled arrays. For the examples of outlier detection introduced in this chapter, we used hierarchical clustering with Euclidean distance and complete linkage clustering (see Figure 2-8) or Ward’s linkage clustering (see Figure 2-9), as sketched below.
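A minimal sketch of this clustering in R, assuming expr is a transcripts-by-arrays matrix of normalized log intensities after imputation (the object name is illustrative):

```r
## Pairwise Euclidean distances between arrays (arrays are the columns of 'expr')
d <- dist(t(expr), method = "euclidean")

hc_complete <- hclust(d, method = "complete")  # complete linkage, as in Figure 2-8
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's linkage, as in Figure 2-9

par(mfrow = c(1, 2))
plot(hc_complete, main = "Complete linkage")
plot(hc_ward, main = "Ward's linkage")
```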

Visualization tools facilitate the rapid identification of possible outlier arrays. For an overview of all distances, a cluster analysis and a dendrogram are helpful. A dendrogram shows not only the distance between single arrays, but also the distances between groups of arrays (see Figure 2-8), as further discussed in chapter 2.1.6.1. MA-plots were again used to identify outliers. To identify whole outlier arrays, the relative number of identified outlier values in a technical replicate compared to the other technical replicates can be assessed (see Figure 2-7). Lastly, heatmaps are suitable to highlight pairwise distances or gene expression profiles, as demonstrated in Figure 2-9. They give a visual impression of the similarity or distance between the examined expression profiles.

The following examples demonstrate how outlier detection was used in the presented data set to detect outliers in technical replicates, by examining outlier values and by cluster analysis at the end of the quality control workflow.

2.1.6.1 Detecting outliers in technical replicates

For our investigation 63 CodeLink Human 10k Bioarrays of 22 preterm infants were examined in an unsupervised analysis approach with no prior definition of groups as described in chapter 4. For each sample two to four technical replicates were prepared.

Raw data were background corrected using the subtract method and intra-slide normalized using median normalization. Negative values were removed, and transcripts were filtered out when more than 50% of their values were missing or below the negative control detection threshold.

Outlier values were defined as values of a probe i outside the interval [median_i ± 3]. To visually assess whether outlier arrays were present, MA-plots were used (see Figure 2-7), as sketched below.
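A sketch of such an MA-style inspection in R, assuming expr is a transcripts-by-arrays matrix of log intensities for one set of technical replicates (object names are illustrative):

```r
m      <- apply(expr, 1, median, na.rm = TRUE)   # per-transcript median across arrays
diffs  <- expr - m                               # difference of each array to that median
is_out <- abs(diffs) > 3                         # outlier rule [median_i +/- 3]

## One MA-style panel per array, outlier values highlighted in red
par(mfrow = c(1, ncol(expr)))
for (j in seq_len(ncol(expr))) {
  plot(m, diffs[, j], pch = 20,
       col = ifelse(is_out[, j] %in% TRUE, "red", "grey40"),
       xlab = "median over arrays", ylab = "difference to median",
       main = colnames(expr)[j])
  abline(h = c(-3, 3), lty = 2)
}

colSums(is_out, na.rm = TRUE)  # number of outlier values per array
```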

Technical replicates with a high number of outlier values in comparison to the other replicates can be considered corrupted and should be removed from the analysis. In this data set the arrays “unsup.896.00.3” and “unsup.1073.00.3” show considerably more outlier values than their respective replicates.

Figure 2-7 Detection of outliers in technical replicates via MA-plots.

Each row shows the set of technical replicates of one of the two samples with an outlier array. In each row the third plot shows a higher number of outlier values (indicated in red), and the corresponding array can be considered an outlier array. The median of a transcript across all arrays was plotted against the difference between the expression value of that transcript on each array and this median. Values with an absolute difference to the median greater than 3 were considered outlier values.

For demonstration purposes these arrays were left in the analysis; after filtering probes with high rates of missing values due to outlier values, all missing values were imputed and technical replicates were averaged. Figure 2-8 demonstrates how outlier arrays can be detected through the distance of a technical replicate to its corresponding replicates and through the distance of a single array to all other arrays.


Figure 2-8 Detection of outliers in technical replicates through clustering.

Dendrograms of hierarchical clustering of arrays before (left) and after (right) averaging of technical replicates indicate that two arrays can be classified as outliers, as they show a high distance to the other replicates. Contrary to the expectation of a high similarity of technical replicates, and thus an early clustering of these arrays, the outlier arrays are added late to the forming clusters. The two sets of replicates are indicated in lighter and darker grey boxes.

Two dendrograms are shown; the left shows the hierarchical clustering (Euclidean distance, complete linkage) of the arrays before technical replicates are averaged, while the right shows the clustering result after averaging. All genes that remained after quality control are taken into account. Between technical replicates a short Euclidean distance is expected. In complete linkage clustering this translates into an early clustering of technical replicates. In the example, however, it can be seen that array “unsup.896.00.3”, indicated in darker grey in the upper left corner of the left dendrogram, has a greater distance not only to its replicates but also to the rest of the arrays, strongly suggesting that it is an outlier array and should be removed from the analysis. Replicate “unsup.1073.00.3” also shows a great distance to its replicates, all marked in lighter grey. The same phenomenon can be seen for the supervised microarray analysis approach (data not shown); the consideration of groups has no apparent effect on the distance between outlier arrays and their replicates.

2.1.6.2 Detecting mislabeled arrays through microarray analysis

Microarray analysis may also be able to detect grouping errors or to generate hypotheses concerning single outlying microarrays. In the analysis, described later, of cord blood of preterm infants who later developed BPD, it became apparent that the arrays of one preterm infant displayed a gene expression profile similar to that of the group it was not assigned to.

Figure 2-9 Cluster analysis of preterm infants with and without BPD can also be used to detect mislabeled arrays.

In the cluster (on the left-hand side) of eight preterm infants without BPD, two arrays of BPD preterm infants are clustered; of these two arrays, one (BPD.1149, on the far left) in fact belongs to the group of preterm infants without BPD; it displays a profile more similar to the preterm infants without BPD than to the BPD infants.


To detect this, an unsupervised quality control, i.e., without the prior assignment of groups, was conducted. Afterwards, transcripts were selected using a linear model of known risk factors to describe the variation in gene expression (see the detailed description in the material and methods of the study itself in chapter 4.2). If at least one regression coefficient of the model displayed an adjusted significance level of ≥ 90%, the transcript was selected, and all selected transcripts were used for a hierarchical cluster analysis, as sketched below.
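One possible implementation of such a selection, sketched here with the Bioconductor package limma as an illustration (the study’s own model is described in chapter 4.2); the risk factor column names in pheno and the matrix expr are assumptions:

```r
library(limma)

## Linear model of known risk factors (column names in 'pheno' are assumptions)
design <- model.matrix(~ gestational_age + ventilation_days + oxygen_days, data = pheno)
fit    <- eBayes(lmFit(expr, design))

## "Adjusted significance level >= 90%" is read here as adjusted p-value <= 0.1
padj     <- apply(fit$p.value[, -1, drop = FALSE], 2, p.adjust, method = "BH")
selected <- rowSums(padj <= 0.1) >= 1   # at least one non-intercept coefficient
expr_sel <- expr[selected, ]
```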

Then a closer look was taken at the cluster formation with respect to the gene expression patterns that fit the pattern of risk factors for BPD established later, i.e., days of mechanical ventilation and oxygen dependence of the preterm infants, as well as a priori risk factors, i.e., the gestational age or maturity of the preterm infants.

In Figure 2-9, a heatmap of the expression profiles of 20 preterm infants at the time of birth is displayed. Hierarchical cluster analysis (Euclidean distance, Ward’s linkage clustering) was performed on samples and on transcripts (sketched below). In the transcript clustering, roughly two main clusters can be distinguished: a cluster of up-regulated and a cluster of down-regulated transcripts between the two clusters identified on the sample level. One of the sample clusters mainly, i.e., in 8 of 10 cases, contains preterm infants without BPD (see Figure 2-9, top left cluster), and the other cluster mainly contains BPD preterm infants. Examining cluster 1, which contains only two preterm infants assigned to the BPD group, it becomes apparent that these two expression profiles look more like profiles of preterm infants without BPD than with BPD. On closer examination of this phenomenon it becomes apparent that one of these preterm infants (“BPD.1149”) had only a short need for oxygen support and therefore should not have been grouped in the BPD group. After double-checking the clinical data of this preterm infant, it becomes clear that this infant in fact did not develop BPD.
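A two-way clustered heatmap like the one in Figure 2-9 can be produced in outline with base R, assuming expr_sel holds the selected transcripts (the colour palette and row scaling are illustrative choices):

```r
heatmap(expr_sel,
        distfun   = function(x) dist(x, method = "euclidean"),
        hclustfun = function(d) hclust(d, method = "ward.D2"),
        scale     = "row",                       # per-transcript scaling, for display only
        col       = hcl.colors(50, "RdBu", rev = TRUE))
```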


2.2 Preprocessing workflow

The preprocessing workflow (Figure 2-10) described in this chapter summarizes the methodology and steps used in the BPD study presented in chapter 4, which are necessary to make data obtained from different microarrays comparable through noise reduction by background correction and normalization, as well as the examination and preparation of signals for statistical analysis. The presented workflow thereby allows us to easily evaluate the order of those steps and to allocate their respective places within the workflow.

Figure 2-10 Preprocessing workflow as established for this investigation.

The workflow begins with the raw data. Microarrays are constantly checked for abnormalities affecting whole arrays. Outlier samples are then removed from the data set. The background is subtracted and the data are intra-slide normalized using the median normalization provided by the manufacturer’s software. Data preparation in R begins with the normalized data. Transcripts are filtered out if more than 50% of their values in at least one group are missing, or more than 50% in each group are low expressed. Then outlier values are identified via the group median and set to missing. Values are again filtered for missing values. Missing values are then imputed by Bayesian Principal Component Analysis (BPCA). Technical replicates are averaged using the arithmetic mean per transcript.

In short, the workflow aims at (1) the removal of uninterpretable signals due to high rates of missing values or of values below a detection threshold, (2) the reduction of variances of gene expression across microarrays of a treatment group due to high rates of outlier values in a treatment group, and (3) the imputation of missing values to meet requirements and improve results of statistical analyses. The following workflow is implemented in R (R Core Team, 2014) or prepared to be incorporated into an R routine.

Outlier samples. Abnormalities in the data with regard to the overall distributions of microarrays, e.g., in the number of missing values or the number of outlier values, are constantly monitored. In addition, clustering techniques (see chapter 3.1) are applied and correlation measures between technical replicates are used (see the sketch below) to detect microarrays which deviate strongly from their technical replicates or from the treatment group. Detected outliers are then examined separately for possible defects or handling errors, e.g., mislabeling, and are finally discarded.
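A minimal sketch of such a correlation check, assuming expr_reps is a transcripts-by-replicates matrix for one sample (the object name is illustrative); replicates with noticeably lower correlations to their partners warrant closer inspection:

```r
## Pairwise Pearson correlations between the technical replicates of one sample
cor(expr_reps, use = "pairwise.complete.obs")
```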

Background correction. As stated in section 2.1.1, it is necessary in microarray experiments to improve the signal of gene expression and to reduce the noise in the measured intensities due to technical issues, e.g., different amounts of hybridized mRNA.

To ensure comparability of the present study with previous studies based on this array type, we followed the recommendation of the manufacturer of CodeLink Bioarrays and used background subtraction. However, it has since been shown that background subtraction bears some difficulties and that a variance-stabilizing method for background correction, i.e., “normexp” in combination with an offset, leads to a higher accuracy of the returned signals (Ritchie et al., 2007). In future studies, switching to another background correction method should be considered. Resulting negative values were blanked. Both options are sketched below.
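A sketch of both options in R with limma, assuming fg and bg are matrices of foreground and local background intensities with one column per array (object names and the offset value are assumptions):

```r
library(limma)

## Subtract method as used in this work; negative values are blanked (set to missing)
subtracted <- fg - bg
subtracted[subtracted < 0] <- NA

## Variance-stabilizing alternative discussed above: normexp plus an offset
corrected <- backgroundCorrect.matrix(fg, bg, method = "normexp", offset = 50)
```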

Normalization. For the investigation presented in this work, we normalized the data twice. After preparation, the microarray data were first background corrected using the subtract method and then normalized using Median normalization as an intra-microarray normalization step. As previously stated in section 2.1.2, Median normalization has a great advantage in microarray analysis, because it is applicable when microarrays are processed at different times and, like background correction, it is independent of the expression values of other microarrays while still achieving comparable distributions of signal intensities.

After Median normalization the data are filtered as discussed below, which once more affects the distributions. Therefore we added a second normalization step. Here, Quantile normalization is used as an inter-microarray method, which aligns the distributions of the microarrays actually used in the study. For Quantile normalization the Bioconductor package limma (Ritchie et al., 2015) was used; both normalization steps are sketched below.
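A sketch of both normalization steps, assuming corrected is the background-corrected intensity matrix (the intra-array median normalization is performed in the manufacturer’s software in this work; the scaling shown here is only an approximation of that step):

```r
## 1) Intra-array median normalization: scale each array so that its median
##    matches the mean of all array medians (approximation of the vendor routine)
med   <- apply(corrected, 2, median, na.rm = TRUE)
norm1 <- sweep(corrected, 2, med / mean(med), FUN = "/")

## 2) Inter-array Quantile normalization, applied after filtering, via limma
library(limma)
norm2 <- normalizeBetweenArrays(norm1, method = "quantile")
```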


Missing value filtering. Filtering for missing values was done twice. First, the data were filtered for missing values arising from preparation steps and background correction. If at least 50% of the values of a transcript in at least one group were missing, the transcript was excluded, as sketched below. A second step of missing value filtering is conducted after outliers are identified and removed from the data set, again rejecting transcripts with missing rates > 50%.
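A sketch of the group-wise missing value filter, assuming expr is the transcripts-by-samples matrix and group a factor of treatment groups (object names are assumptions):

```r
## Fraction of missing values per transcript and group
miss_rate <- sapply(levels(group), function(g)
  rowMeans(is.na(expr[, group == g, drop = FALSE])))

## Exclude transcripts with >= 50% missing values in at least one group
keep <- apply(miss_rate, 1, max) < 0.5
expr <- expr[keep, ]
```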

Low expressed values. In contrast to filtering for missing values, filtering for low expressed values (missing values are counted as low expressed values) is only done if a certain amount of low expressed values is found in each group. Filtering for low expressed transcripts was conducted using the negative control threshold, as quality flag data were not available for our data set. In the case that flags are available, they should be considered, as they bear even more information than the negative control threshold. Transcripts were excluded if in each group at least 50% of the values were below the detection threshold. The order of missing value and low expressed value filtering does not play an important role.

We decided to place this step second because of computation time constraints: the removal of transcripts with high rates of missing values in a treatment group may speed up the process of low expression filtering, which is sketched below.
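A sketch of this filter, assuming neg_thresh is a per-array vector of negative control detection thresholds (the threshold object and its per-array form are assumptions):

```r
## A value counts as low expressed if it is missing or below the array's threshold
low <- is.na(expr) | sweep(expr, 2, neg_thresh, FUN = "<")

## Fraction of low expressed values per transcript and group
low_rate <- sapply(levels(group), function(g)
  rowMeans(low[, group == g, drop = FALSE]))

## Exclude transcripts with >= 50% low expressed values in every group
drop <- apply(low_rate, 1, min) >= 0.5
expr <- expr[!drop, ]
```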

Outlier values. Outlier detection is conducted in order to avoid bias due to sporadic extreme values within a treatment group and transcript. The data were analyzed for outlier values, defining log-transformed values as outliers using a median-based method. We used a method based on the median plus an offset for outlier detection, as it is more robust than a z-transformation based method. In the future, implementing a MAD-based method would be more conservative, as it takes a robust estimate of the standard deviation of a transcript over all arrays into account. The procedure was performed twice to adjust for the new medians after the first step of outlier detection, as sketched below.
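A sketch of this step, assuming expr_log is the transcripts-by-samples matrix of log2 intensities and group the treatment group factor (object names are assumptions):

```r
## Set values outside [group median +/- k] to missing, per transcript and group
blank_outliers <- function(expr_log, group, k = 3) {
  for (g in levels(group)) {
    idx <- which(group == g)
    sub <- expr_log[, idx, drop = FALSE]
    med <- apply(sub, 1, median, na.rm = TRUE)    # group median per transcript
    sub[which(abs(sub - med) > k)] <- NA          # blank outlier values
    expr_log[, idx] <- sub
  }
  expr_log
}

## Performed twice, so that the second pass uses the recomputed medians
expr_log <- blank_outliers(blank_outliers(expr_log, group), group)
```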

Afterwards data was again filtered for missing values using the same settings as described above.

Imputation. The remaining missing values were imputed on the log-transformed data using the BPCA imputation described by Oba et al. (2003) via the Bioconductor package pcaMethods (Stacklies et al., 2007). BPCA is recommended as a robust method suitable for microarray data by various reviews and imputation comparison studies, as elucidated in chapter 2.1.3.2. The imputation call is sketched below.
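A sketch of the imputation with pcaMethods, assuming expr_log is the log2 matrix with the remaining missing values (the number of principal components is an illustrative choice):

```r
library(pcaMethods)

## pcaMethods expects samples in rows, hence the transposition
res          <- pca(t(expr_log), method = "bpca", nPcs = 5)
expr_imputed <- t(completeObs(res))   # completed matrix, transposed back
```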


Log-transformation. After inter-microarray normalization, technical replicates are averaged using the arithmetic mean per transcript. One procedural question remains: when to log-transform the data. In the current workflow the data are log-transformed as the last step before statistical analysis, but for other case studies it may be advantageous to log-transform at an earlier point in data preparation. The background correction methods subtract and half are not affected by log-transformation. For normexp background correction, Ritchie et al. (2007) recommend performing a started log-transformation afterwards. Median and Quantile normalization are independent of the prior distribution of the transcripts. Procedures filtering transcripts with missing values or values below the detection threshold are not affected by whether the data are log-transformed or not, but imputation procedures are affected. The evaluation of literature sources (Aittokallio, 2010; Oba et al., 2003; Oh et al., 2011; Troyanskaya et al., 2001) indicates a preference for log-transformation of the data before the imputation step. Troyanskaya and colleagues (2001) also point out that log-transformation reduces the effect of outliers present in the data set on transcript similarity detection. For averaging technical replicates it is also beneficial to use log-transformed data, as this brings the data closer to a normal distribution, for which the mean is the most accurate statistic. A sketch of the averaging step follows below.
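A minimal sketch of averaging technical replicates on the log2 scale, assuming sample_id maps each array (column) to its biological sample (object names are assumptions); limma::avearrays provides equivalent functionality:

```r
## Arithmetic mean of the log2 values over the technical replicates of each sample
expr_avg <- sapply(split(seq_len(ncol(expr_log)), sample_id),
                   function(cols) rowMeans(expr_log[, cols, drop = FALSE]))
```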
