Bioinformatic and statistical microarray data analysis

2 Material and methods

2.2 Methods

2.2.11 Bioinformatic and statistical microarray data analysis

All microarray data analyses were performed in GeneSpring GX 10.0.2 software, unless stated otherwise.

2.2.11.1 Data preprocessing

Preprocessing of Affymetrix microarrays involves three key steps: background correction/adjustment, normalization and probe summarization.

For preprocessing, the original CEL files were imported into GeneSpring GX and the GC-RMA (GC Robust Multi-array Average) algorithm (Wu et al. 2004) was applied, which covers all three steps: background correction, normalization and probe summarization. The GC-RMA method is an improved form of RMA (Bolstad et al. 2003, Irizarry et al. 2003a, Irizarry et al. 2003b) with regard to background correction. In contrast to RMA, the background correction implemented in GC-RMA takes into account the differing tendencies of probes to undergo non-specific binding (based on their GC content), thus avoiding underestimation of background noise. These sequence-specific probe affinities are ignored by the background correction of the original RMA method. Normalization and probe summarization are identical in RMA and GC-RMA. For probe summarization, the signal intensities of mismatch probes are ignored and the summarized expression value of each probe set is based on the perfect match probes only.

The expression measures produced by the GC-RMA algorithm are on a log2 scale. Finally, these measures were baseline-transformed to the median of all samples: the median value of each probe set across all arrays is computed (on the log scale) and subtracted from each individual value of that probe set.
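The baseline transformation can be illustrated with a minimal sketch. It assumes the log2 expression values are held as a list of rows, one row per probe set and one column per array; the function name `baseline_to_median` and the example values are hypothetical and are not part of GeneSpring GX.

```python
from statistics import median


def baseline_to_median(log2_matrix):
    """Subtract each probe set's median across arrays from its own values.

    log2_matrix: list of rows, one row per probe set, one column per array.
    Values are assumed to be on log2 scale already.
    """
    transformed = []
    for row in log2_matrix:
        row_median = median(row)
        transformed.append([value - row_median for value in row])
    return transformed


# Hypothetical log2 expression values: 2 probe sets x 3 arrays.
example = [
    [8.0, 9.0, 10.0],
    [5.0, 5.5, 7.0],
]
centered = baseline_to_median(example)
```

After the transformation, each probe set has a median of zero across arrays, so positive values indicate expression above and negative values expression below the median of all samples.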

2.2.11.2 Quality control

To verify the quality of the microarray data, a correlation analysis across GeneChips® was performed. The correlation coefficient of each GeneChip® to all other GeneChips® was calculated and the coefficients were plotted against each other as a heatmap (correlation plot). This type of analysis allows a visual assessment of whether outlier chips are present in the data set or whether all chips are comparable to one another.
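The matrix behind such a correlation plot can be sketched as follows. The text does not state which correlation coefficient was used; the sketch assumes the Pearson coefficient, and the function names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def correlation_matrix(arrays):
    """Pairwise correlation of every array with every other array."""
    return [[pearson(a, b) for b in arrays] for a in arrays]


# Hypothetical expression profiles: 3 arrays x 4 probe sets.
arrays = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 1.0, 3.5, 0.5],  # behaves differently from the other two
]
corr = correlation_matrix(arrays)
```

In this toy example the third array correlates poorly with the first two, which is the pattern an outlier chip would produce in the heatmap.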

2.2.11.3 Comparison of independent groups ‒ Identification of significantly differentially expressed genes

Different clinical parameters (e.g. histological type, recurrence of disease) were used to form patient groups. Significantly differentially expressed genes between groups were identified by Welch's t-test, a modification of the unpaired t-test that does not assume equal variances in the groups compared.
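For a single gene, the Welch statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed as in the following sketch. The function name and the expression values are hypothetical; the p-value, which would be read from a t-distribution with the returned degrees of freedom, is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    # Sample variances (denominator n - 1), estimated separately per group.
    var_a, var_b = variance(group_a), variance(group_b)
    se_squared = var_a / n_a + var_b / n_b
    t = (mean(group_a) - mean(group_b)) / sqrt(se_squared)
    df = se_squared ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical log2 expression values of one gene in two patient groups.
group_a = [7.0, 8.0, 9.0]
group_b = [4.0, 5.0, 6.0, 5.0]
t_stat, dof = welch_t(group_a, group_b)
```

Because the variances are not pooled, the degrees of freedom are generally not an integer and are smaller than in the equal-variance t-test whenever the group variances or sizes differ.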

In a typical microarray study, several thousand genes are subjected to a statistical test simultaneously. Each gene is considered independently of the others, so the test is performed on each gene separately. The incidence of false positives (genes that pass the test but in reality do not differ between groups) is proportional to the number of tests performed and to the chosen significance level (p-value cutoff). The p-value is the probability that a gene with no true difference between groups yields a result at least as extreme as the one observed, i.e. passes the test by chance alone. A p-value cutoff of 0.05 thus means a 5% probability that such a gene passes the test by chance. If, for example, 10,000 genes are tested, 5% or 500 genes might be called significant just by chance. This is why it is important to correct the p-value of each gene when performing a statistical test on a group of genes. Multiple testing correction algorithms adjust the individual p-value of each gene so that the overall error rate (or false-positive rate) stays less than or equal to the user-defined p-value cutoff or error rate.

In this study, two multiple testing corrections were used: the Benjamini and Hochberg false discovery rate (Benjamini and Hochberg 1995) and the Bonferroni family-wise error rate (Bonferroni 1935, Bonferroni 1936).
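Both corrections can be written as adjusted p-values, as in the following sketch. It follows the standard textbook formulation of the two procedures and is not taken from GeneSpring GX; the function names and p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment controlling the false discovery rate."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected p-values of five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
p_bonf = bonferroni(raw)
p_bh = benjamini_hochberg(raw)
significant_bonf = [p < 0.05 for p in p_bonf]
significant_bh = [p < 0.05 for p in p_bh]
```

The Bonferroni adjustment is never smaller than the Benjamini-Hochberg adjustment for the same gene, which reflects that controlling the family-wise error rate is the stricter criterion.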

Statistical significance was accepted at corrected p<0.05. If no features passed the test after multiple testing correction, statistical significance was accepted at uncorrected p<0.001.

The fold change (FC) of expression between two groups was calculated as the fold difference between group means.
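A possible reading of this calculation is sketched below. The text does not state whether the group means were taken on the log2 or on the linear scale; the sketch assumes log2 values, so that the difference of the group means is converted back to a ratio. The function name and the values are hypothetical.

```python
from statistics import mean


def fold_change(log2_group_a, log2_group_b):
    """Fold change of group A over group B from log2 expression values.

    Assumption: means are taken on the log2 scale, so the difference of
    means is a log2 ratio and 2 ** difference is the fold change.
    """
    log2_ratio = mean(log2_group_a) - mean(log2_group_b)
    return 2.0 ** log2_ratio


# Hypothetical log2 expression values of one gene in two groups.
fc = fold_change([7.0, 8.0, 9.0], [4.0, 5.0, 6.0, 5.0])
```

Under this assumption, a difference of three log2 units between the group means corresponds to an eightfold difference in expression.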

2.2.11.4 Gene Ontology analysis

The results of high-throughput experimental techniques like microarrays are lists/groups of interesting genes, e.g. lists of differentially expressed genes, which require further biological interpretation and evaluation. One possibility to accomplish this task is to use the gene-specific functional annotations provided by the Gene Ontology (GO) system.

For this study, GO analysis was performed using GOSSIP, a freely available software package that tests whether a molecular function, biological process or cellular location described in the Gene Ontology system (the so-called GO terms) is significantly associated with a group of interesting genes when compared to a reference group (Bluthgen et al. 2005). The result is a list of statistically enriched GO terms. In order to avoid misleading results, GOSSIP applies multiple testing corrections when determining statistical significance.

Thus, GOSSIP represents a powerful and reliable tool to identify and examine the biological relevance of gene groups of interest.
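The kind of test underlying such an enrichment analysis can be sketched for a single GO term as a one-sided hypergeometric test. This is a generic over-representation test and not GOSSIP's own code; it assumes that the group of interesting genes is a subset of the reference group, and all names and counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_reference, n_reference_with_term, n_test, n_test_with_term):
    """Probability of seeing at least n_test_with_term annotated genes
    when n_test genes are drawn at random from the reference group."""
    total = comb(n_reference, n_test)
    upper = min(n_test, n_reference_with_term)
    tail = sum(
        comb(n_reference_with_term, k)
        * comb(n_reference - n_reference_with_term, n_test - k)
        for k in range(n_test_with_term, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 reference genes, 5 of them annotated with the term;
# 4 genes of interest, 3 of them annotated with the term.
p_enriched = enrichment_pvalue(20, 5, 4, 3)
```

Since one such test is carried out per GO term, the resulting p-values are subject to the same multiple testing problem as the gene-wise tests and need to be corrected accordingly.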

2.2.11.5 Clustering analyses

Clustering is the assignment of the observations/objects of a data set to subsets, the so-called clusters, such that objects within the same cluster are more closely related/similar to one another than to the objects in other clusters. Central to all goals of cluster analysis is the intention to identify similarity (or dissimilarity) between the individual objects investigated.

In microarray studies, the primary goal is to find “co-behaving” subsets of genes or samples. Clustering of genes/probe sets is based on the expression profile of each individual probe set across all microarrays, and probe sets that “behave” similarly (e.g. up- or downregulated in the same arrays) are clustered together.

For clustering of samples/microarrays, the expression values of a certain set of genes/probe sets are used, and individual microarrays are clustered according to their expression profile. As a result, microarrays exhibiting similar expression profiles are assigned to the same cluster.

All clustering analyses were performed with hierarchical clustering algorithms. The result of such a hierarchical cluster analysis is a tree-like structure, the so-called dendrogram, which displays the hierarchy of the clusters formed. “Euclidean distance” and “complete linkage” were used as distance metric and linkage algorithm for all clustering analyses performed (clustering of microarrays/samples and of probe sets/genes).
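The combination of Euclidean distance and complete linkage can be illustrated with a minimal agglomerative clustering sketch. It is a naive implementation for small inputs, not the GeneSpring GX algorithm; the function name and the expression profiles are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def complete_linkage(profiles):
    """Agglomerative clustering with Euclidean distance and complete linkage.

    Returns the merge history as (members_a, members_b, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the LARGEST
                # pairwise distance between members of the two clusters.
                d = max(
                    dist(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical expression profiles: 4 samples x 2 probe sets.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = complete_linkage(profiles)
```

In this example, samples 0 and 1 are joined first, then samples 2 and 3, and the two resulting clusters are joined last at a height equal to the largest distance between any of their members.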

2.2.12 Statistical evaluation of quantitative real-time

In the document Gene expression profiling of human lymph node-positive gastric adenocarcinomas (pages 65-69)