2 Materials and Methods

2.2 Methods

2.2.8 Statistical analyses of conventional experimental data

For comparison of two data columns, the two-tailed Student's t-test was employed. For all tests, a Gaussian distribution was assumed, and the confidence level was set to 95 %.
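As an illustration of the test described above, the following minimal R sketch compares two simulated data columns. The variable names and values are hypothetical, and the use of the pooled-variance form (var.equal = TRUE) is an assumption, since the text does not state which variant was applied.

```r
# Hypothetical example data: two measurement columns (names are illustrative).
set.seed(1)
control   <- rnorm(6, mean = 10, sd = 1)
treatment <- rnorm(6, mean = 12, sd = 1)

# Two-tailed Student's t-test. var.equal = TRUE selects the classical
# pooled-variance form instead of R's default Welch test (an assumption here).
result <- t.test(control, treatment,
                 alternative = "two.sided",
                 var.equal   = TRUE,
                 conf.level  = 0.95)

result$p.value    # two-sided p-value
result$conf.int   # 95 % confidence interval for the difference in means
```

A p-value below 0.05 would correspond to significance at the 95 % level used here.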

2.2.8.1 Statistical analysis of high-throughput data

The data obtained from TaqMan low density arrays (TLDA) and Illumina HT12 BeadChip arrays required extensive correlation studies and statistical correction for large sample sizes. These advanced analyses were carried out by Dr. Annalisa Marsico, assisted by Dr. Brian Caffrey, Max Planck Institute for Molecular Genetics, Berlin.

The analysis of all 9 Illumina HT12 BeadChip arrays was carried out with the lumi R/Bioconductor package, which is specifically designed to process Illumina microarray data.

After background correction, the variance stabilization and normalization procedure from the vsn R package was applied. This simultaneous normalization of intensities and variance-stabilizing transformation corrects for the fact that the variance of array replicates is not independent of the mean signal intensity, but increases at higher intensities. Differentially expressed genes were identified by means of a moderated t-test (R limma package), including Benjamini-Hochberg correction for multiple testing. Genes with an adjusted p-value < 0.1 and a linear fold change > 1.5 were considered differentially expressed.
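The selection step can be sketched in base R as follows. This is not the lumi/vsn/limma pipeline itself: the data are simulated, all object names are hypothetical, and an ordinary per-gene t-test stands in for the moderated t-test of limma. Treating the fold-change threshold symmetrically (up- and down-regulation) is likewise an assumption.

```r
# Simulated stand-in for normalized log2 intensities: 200 genes x 6 arrays
# (3 control, 3 treated). All names and values are hypothetical.
set.seed(42)
expr <- matrix(rnorm(200 * 6, mean = 8, sd = 0.3), nrow = 200,
               dimnames = list(paste0("gene", 1:200), NULL))
expr[1:10, 4:6] <- expr[1:10, 4:6] + 2   # spike in 10 regulated genes
group <- factor(rep(c("control", "treated"), each = 3))

# Per-gene test. An ordinary t-test is a simple substitute for the
# moderated t-test of limma, which is not reproduced here.
p_raw <- apply(expr, 1, function(x) {
  t.test(x[group == "treated"], x[group == "control"],
         var.equal = TRUE)$p.value
})

# Benjamini-Hochberg adjustment for multiple testing.
p_adj <- p.adjust(p_raw, method = "BH")

# log2 fold change; a linear fold change > 1.5 equals |log2FC| > log2(1.5).
log2_fc <- rowMeans(expr[, group == "treated"]) -
           rowMeans(expr[, group == "control"])

selected <- names(which(p_adj < 0.1 & abs(log2_fc) > log2(1.5)))
```

Because the data are on a log2 scale, the linear threshold of 1.5 is converted to log2(1.5) before comparison.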

For interpretation of the TLDA analyses, the HTqPCR R package was used. All miRNAs with little or no variation among samples were removed prior to testing for differential expression.

For each miRNA, the interquartile range (IQR) among samples was calculated, and miRNAs with an IQR < 1.2 were not considered for further analysis. The ∆∆Ct model was used for quantification of differential expression. Statistical significance of miRNA differential expression was assessed by means of a moderated t-test. Since Ct values are on a logarithmic (log2) scale, a |∆∆Ct| of 1 corresponds to a fold change of 2; miRNAs with a |∆∆Ct| > 1 and a p-value < 0.1 were considered differentially expressed.
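The filtering and quantification steps can be illustrated with a short base R sketch. It does not use HTqPCR; the matrix of normalized Ct values (dCt), the group layout and all object names are hypothetical, and the significance test is omitted.

```r
# Hypothetical dCt matrix: 50 miRNAs x 6 samples (3 control, 3 treated).
# Values are simulated for illustration only.
set.seed(7)
dct <- matrix(rnorm(50 * 6, mean = 5, sd = 0.2), nrow = 50,
              dimnames = list(paste0("miR", 1:50), NULL))
dct[1:5, 4:6] <- dct[1:5, 4:6] - 2       # 5 miRNAs with lower Ct after treatment
group <- factor(rep(c("control", "treated"), each = 3))

# Remove miRNAs with little variation across samples (IQR < 1.2).
iqr  <- apply(dct, 1, IQR)
kept <- dct[iqr >= 1.2, , drop = FALSE]

# ddCt = mean dCt (treated) - mean dCt (control); fold change = 2^(-ddCt).
ddct <- rowMeans(kept[, group == "treated", drop = FALSE]) -
        rowMeans(kept[, group == "control", drop = FALSE])
fold_change <- 2^(-ddct)

# miRNAs passing the |ddCt| > 1 threshold (at least a two-fold change).
candidates <- names(ddct)[abs(ddct) > 1]
```

A negative ∆∆Ct indicates a lower Ct after treatment and therefore higher expression, which is why the fold change is computed as 2^(-∆∆Ct).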

In order to identify functional miRNA targets and to reduce the number of false positives as far as possible, an adjusted ranking score for the prediction of miRNA-mRNA interactions was employed. The adjusted score was computed using the formula

This formula integrates the following parameters:

miRSVR (miRanda) prediction score (A) [107]

positive target prediction by both miRanda and TargetScan [108] (B)

conservation across species (C)

Todorovski distance of miRNA and mRNA expression data (D)

Published experimental validation (E)

Number of miRNA binding sites in the mRNA 3' UTR (Fn)

Each factor is weighted by a negative coefficient (b, c, d, e, fn).

Expression values of mRNA and miRNA are given as log2 of linear expression data. This transformation corrects for the high absolute standard deviation of highly expressed targets, and it allows the data set to be treated as Gaussian, which is a prerequisite for the Student's t-test.

2.2.8.2 Principal Component Analysis

Prerequisites

In order to visually represent the global sample variation within the mRNA and miRNA array experiments, a principal component analysis (PCA) was performed.

Log2-transformed expression data of genes or dCt values of miRNAs that were determined to be subject to significant regulation after treatment were provided by in-depth bioinformatic analyses (section 2.2.8.1). Prior to extraction of the first principal components of each dataset, a test of sampling adequacy was performed to ensure eligibility of the data for subsequent analyses. The Kaiser-Meyer-Olkin (KMO) criterion was calculated on the basis of each data matrix, here termed “transcriptome” by way of example. The “paf” command was retrieved from the R package rela.

paf(transcriptome)$KMO

The KMO is an index between 0 and 1 that measures the suitability of the attributes for PCA, with higher values being better. The KMO takes into account the inter-sample correlation and is computed on the basis of a correlation matrix. A value > 0.8 indicates low partial correlation between the samples, while a value > 0.6 is considered acceptable [109].

Low partial correlation between the samples is a prerequisite for PCA.

Furthermore, the following R command was used to compute the measure of sampling adequacy (MSA):

paf(transcriptome)$MSA

While the KMO provides a single index number to characterize the dataset, the MSA returns an individual value for each sample that describes its eligibility for a factor analysis. Like the KMO, the MSA takes a value of 1 for uncorrelated values and declines as a reciprocal function of partial sample correlation. Value interpretation is analogous to the KMO (see above).
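To make the two indices concrete, the following base R sketch computes them from their textbook definitions, i.e. from the squared correlations and squared partial correlations. It is not a reproduction of the internals of rela's paf; the data are simulated and all object names are hypothetical.

```r
# Hypothetical data matrix: 100 genes (rows) x 6 samples (columns) sharing a
# common signal, so that the samples are correlated. Values are simulated.
set.seed(3)
signal <- rnorm(100)
transcriptome <- sapply(1:6, function(i) signal + rnorm(100, sd = 0.5))
colnames(transcriptome) <- paste0("sample", 1:6)

R     <- cor(transcriptome)          # correlation matrix between samples
R_inv <- solve(R)
# Partial correlations derived from the inverse correlation matrix.
P <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))

diag(R) <- 0                         # only off-diagonal entries enter the sums
diag(P) <- 0

kmo <- sum(R^2) / (sum(R^2) + sum(P^2))               # single overall index
msa <- colSums(R^2) / (colSums(R^2) + colSums(P^2))   # one value per sample
```

Both indices approach 1 as the partial correlations become small relative to the ordinary correlations, which matches the interpretation given above.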

Principal Component Analysis

A principal component analysis reduces a high-dimensional dataset by summarizing variables and expressing them as a single composite numeric value, i.e. a principal component. Once the first principal component has been fit to the data, the following principal components are incrementally added to the first one at orthogonal axes along the directions of maximum variance in the data. Each principal component is an eigenvector of the covariance matrix that is computed on the basis of the original data. Once every eigenvector has been added, the orthogonal body of eigenvectors is rotated to optimize the fitting of all principal components to the variables in the dataset. The principal components with the highest explanatory power, i.e. representing the directions of maximum variation, can then be extracted to represent the original dataset with both reduced complexity and highest possible fidelity.
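The relationship between principal components and the eigenvectors of the covariance matrix can be checked with a small base R sketch. The data are simulated and the object names are hypothetical; the comparison only illustrates that an explicit eigendecomposition of the z-transformed data agrees with the output of prcomp.

```r
# Simulated data: 50 observations x 4 variables (hypothetical values).
set.seed(11)
x <- matrix(rnorm(200), nrow = 50)
x[, 2] <- x[, 1] + 0.3 * x[, 2]      # introduce correlation between columns

z   <- scale(x, center = TRUE, scale = TRUE)   # z-transform each column
eig <- eigen(cov(z))                 # eigenvectors = principal axes

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigenvalues correspond to the component variances reported by prcomp.
same_variances <- all.equal(eig$values, pca$sdev^2)
# Eigenvectors correspond to prcomp's rotation matrix up to sign.
same_axes <- all.equal(abs(eig$vectors), abs(unname(pca$rotation)))
```

The sign of each eigenvector is arbitrary, which is why the comparison is made on absolute values.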

A principal component analysis was performed on the mRNA and miRNA array data using the “prcomp” R command. It z-transforms and rotates the data matrix and returns an object (“pca”) that contains the list of eigenvectors computed from the covariance matrix (i.e. the principal components).

pca <- prcomp(transcriptome, center = TRUE, scale. = TRUE)

For a graphic representation, the first principal components (i.e. those with the highest explanatory power) were selected in order to achieve an explained variance > 95 %. The cumulative proportion of variance explained by the leading principal components was calculated as the ratio of the cumulative sum of the component variances (the squared standard deviations) to the total variance.

variance <- pca$sdev^2

explained <- cumsum(variance) / sum(variance)
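The choice of the number of components can be sketched as follows. The data are simulated so that three directions dominate, and all object names are hypothetical; the sketch only shows how a 95 % threshold could be applied to the cumulative proportion of variance.

```r
# Simulated data with three dominant directions of variation (hypothetical).
set.seed(5)
scores <- matrix(rnorm(300), nrow = 100)     # 3 latent factors
mixing <- matrix(rnorm(3 * 9), nrow = 3)     # spread over 9 columns
transcriptome <- scores %*% mixing + matrix(rnorm(900, sd = 0.05), nrow = 100)

pca <- prcomp(transcriptome, center = TRUE, scale. = TRUE)

variance   <- pca$sdev^2
cumulative <- cumsum(variance) / sum(variance)

# Smallest number of components whose cumulative share exceeds 95 %.
n_components <- which(cumulative > 0.95)[1]
```

The same cumulative proportions are listed in the "Cumulative Proportion" row of summary(pca).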

The result identified the explanatory power of the first three principal components to be sufficient, as it amounted to > 95 % of total variance. A 3D cube was used for graphic representation. The “plot3d” and “spheres3d” commands were retrieved from the R package rgl.

plot3d(pca$rotation[,1:3], xlab = "x")

Color and shape were given to the data points by

spheres3d(pca$rotation[,1:3], radius = 0.02, col = rep(c("red", "blue", "darkgreen"), each = 3))

3 Results