8.4 Stress Response Analysis of S. meliloti

8.4.4 Identification of Significant Functional Categories

The data from the microarray experiment were analyzed using multivariate analysis of variance (MANOVA), principal component analysis (PCA), and subsequent cluster analysis to identify genes of particular interest. Prior information from the genome annotations stored in the GenDB project was available for a fraction of the genes. No strong hypothesis about a probable correspondence of gene expression profiles could be stated before the analysis; nevertheless, a functional classification of the genes was present within the database and was to be used. For S. meliloti, the functional classification of COG, enzymatic reaction pathways (KEGG), and the functional subsystems of Overbeek and coworkers were available within the annotation system.

A sequence of EMMA2 pipeline functions was applied to the preprocessed datasets. First, it was necessary to find out whether the available datasets contained enough information at all to distinguish between the different classes of each annotation formalism. For a multivariate microarray experiment, the classification is the response variable and the predictor variable is a vector of length 6.

Thus, a MANOVA pipeline was applied to compare the multivariate inter-class variance with the intra-class variance. For this purpose, only genes with a meaningful annotation in any of the classification schemes were retained. After the filtering step, 2737 genes were assigned to one of 21 COG categories and 532 genes were assigned to a KEGG enzymatic reaction; a rather small subset of 99 sequences was annotated as belonging to one of 8 functional subsystems. The functional classification was extracted automatically from the GenDB annotations via the BRIDGE communication layer during pipeline execution.

For the identified genes, replicate measurements were joined by their arithmetic mean. Each joint gene expression vector was assigned its class label and tested against the null hypothesis that there is no difference in expression profiles between classes. The null hypothesis could be rejected for COG, KEGG, and subsystems. This means that there is at least one group, category, or pathway for which the expression data differ significantly from the mean; in other words, some aspect of the prior annotation knowledge is reflected in the data.
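The two steps described above (averaging replicates, then comparing within-class with total multivariate scatter) can be sketched as follows. This is only an illustration in plain Python, not the EMMA2 implementation: the function names and the toy data are hypothetical, and the sketch computes only Wilks' lambda, one common MANOVA test statistic. Turning it into a p-value would additionally require an F or chi-square approximation, which is omitted here.

```python
from collections import defaultdict
from statistics import fmean

def average_replicates(replicates):
    """Join the replicate profiles of one gene by their arithmetic mean."""
    return [fmean(column) for column in zip(*replicates)]

def sscp(rows, center):
    """Sum of squares and cross-products of the rows around a center."""
    p = len(center)
    matrix = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        for i in range(p):
            for j in range(p):
                matrix[i][j] += d[i] * d[j]
    return matrix

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def wilks_lambda(profiles, labels):
    """det(W) / det(T): values near 0 indicate strongly separated classes.

    Assumes the total scatter matrix T is non-singular.
    """
    groups = defaultdict(list)
    for profile, label in zip(profiles, labels):
        groups[label].append(profile)
    grand_mean = [fmean(col) for col in zip(*profiles)]
    p = len(grand_mean)
    total = sscp(profiles, grand_mean)
    within = [[0.0] * p for _ in range(p)]
    for rows in groups.values():
        group_mean = [fmean(col) for col in zip(*rows)]
        scatter = sscp(rows, group_mean)
        for i in range(p):
            for j in range(p):
                within[i][j] += scatter[i][j]
    return determinant(within) / determinant(total)

# Made-up two-dimensional profiles (the experiment above uses vectors of length 6).
profiles = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
            [2.0, 1.9], [2.2, 2.1], [1.9, 2.2]]
separated = wilks_lambda(profiles, ["A", "A", "A", "B", "B", "B"])
mixed = wilks_lambda(profiles, ["A", "B", "A", "B", "A", "B"])
```

With labels that match the structure of the data, lambda is close to 0; with labels that cut across it, lambda moves toward 1.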

In contrast, if no external knowledge had been available, the only sensible analysis step would have been to identify differentially expressed genes by applying a gene-wise statistical test.

The next question to investigate is which classes do in fact react to the experimental condition. The MANOVA approach only provides information about whether there is at least one class with a significant reaction, but not which classes or how many.

Box-and-whisker plots are a suitable means to address this problem by visualization, provided there are not too many classes to compare. In addition to the mean expression profile of all genes in a group, they provide a visual measure of the within-group variation. For each functional subsystem in GenDB, a boxplot was produced. Four types of reaction patterns could be identified by inspecting the plots: (1) subsystems that show no response; (2) subsystems that show a directed response over time, that is, they tend to be either down-regulated or up-regulated with a clear trend and without increased variance; (3) subsystems that show a trend and also an increase in variance; (4) subsystems that show no clear trend but react with increased variance.
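The quantities a boxplot displays can also be computed directly, which may help when comparing trend and spread across subsystems. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the numbers are made up, and the quartile definition (`method="inclusive"`) is one of several common conventions and may differ from that of the plotting software used in the original analysis.

```python
from statistics import quantiles

def box_stats(values):
    """Five-number summary plus interquartile range (needs >= 2 values)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,  # measure of within-group variation
    }

def summarize_subsystem(profiles):
    """One summary per time point; `profiles` holds one vector per gene."""
    return [box_stats(list(column)) for column in zip(*profiles)]

# Made-up M-values for three genes of one subsystem at two time points.
subsystem = [[0.0, 1.0], [0.2, -1.0], [0.1, 0.0]]
per_timepoint = summarize_subsystem(subsystem)
```

In this toy example the median stays near zero while the interquartile range grows from the first to the second time point, which corresponds to pattern (4) above.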

Most subsystems showed an increase in variance with the treatment. The biosynthesis of the amino acid tryptophan, however, could be shown to exhibit no response to the stress condition. This behavior contrasts with the synthesis of other amino acids such as methionine, which show down-regulation and an increased variance. At this step, the hypothesis that the regulation of tryptophan synthesis is markedly different from the regulatory mechanisms of other amino acids could be readily derived from the automated pipeline.

The Stable-Pathway Approach

In order to apply the boxplot method to all 87 KEGG pathways, it was necessary to reduce the data for an appropriate visualization. Boxplots of complete expression profiles would be confusing, but this problem can be solved by dimensionality reduction. Reducing the dimensionality to the principal direction of variance provides a means to plot only one box per pathway.

Therefore, principal component analysis was applied to find the direction of maximum variance within the data. The main focus of this analysis step was to detect pathways that are rather stable in their expression profiles while possibly sharing a common trend in their expression. The original expression data were projected onto the first principal component, which described more than 80 percent of the experimental variance in this experiment. A plot of the factor loadings showed that the first principal component is a linear combination of all original axes with almost equal weights.
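The projection onto the first principal component can be illustrated with a small, self-contained sketch. It uses power iteration on the covariance matrix, which is a stand-in chosen to keep the example free of dependencies and is not necessarily the method used by the EMMA2 pipeline; the function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import fmean

def first_principal_component(rows, iterations=500):
    """Return (loadings, scores, explained variance fraction) of PC 1.

    Power iteration on the sample covariance matrix. Assumes that the
    data are not constant and that the equal-weight start vector is not
    orthogonal to the first component.
    """
    means = [fmean(col) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    n, p = len(rows), len(means)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    if sum(v) < 0:  # fix the arbitrary sign of the eigenvector
        v = [-x for x in v]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(x * l for x, l in zip(row, v)) for row in centered]
    return v, scores, eigenvalue / total_variance

# Made-up profiles in which all three axes move together.
rows = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 2.9, 3.1],
        [4.0, 4.1, 4.0], [-1.0, -0.9, -1.1]]
loadings, scores, explained = first_principal_component(rows)
```

For data of this kind the loadings come out nearly equal, so each score is roughly proportional to the mean of the centered profile, mirroring the observation about equal weights made above.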

Boxplots of the first component were then produced and visually inspected for interesting groups (see Figure 8.9). It appeared that the vast majority of pathways exhibit an increased variance, as in the subsystem plots, but a small subset remains relatively stable. To detect these pathways automatically, the variance of each pathway, $s_{P_k}$, was compared to the overall data variance, $s_D$, by an F-test. The F-test computes the ratio

\[ F = \frac{s_{P_k}}{s_D} \]

and compares it to a critical value of the F-distribution. The alternative hypothesis was F < 1, that is, the test looks for groups whose variance is significantly smaller than the overall variance.
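A one-sided variance-ratio test of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the pipeline's actual code: the function names are hypothetical, the degrees of freedom (n_k - 1 for the pathway, N - 1 for the whole data set) are an assumption not stated in the text, and the F-distribution is evaluated through the regularized incomplete beta function with the standard continued-fraction expansion.

```python
from math import exp, lgamma, log
from statistics import variance

def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction of the incomplete beta function (Lentz's method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for a, b > 0 and 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_cdf(f, dfn, dfd):
    """P(F <= f) for an F-distribution with (dfn, dfd) degrees of freedom."""
    if f <= 0.0:
        return 0.0
    return regularized_incomplete_beta(dfn / 2.0, dfd / 2.0,
                                       dfn * f / (dfn * f + dfd))

def stable_pathway_test(pathway_scores, all_scores):
    """Return (F, p) for the alternative that the pathway variance is smaller."""
    ratio = variance(pathway_scores) / variance(all_scores)
    p = f_cdf(ratio, len(pathway_scores) - 1, len(all_scores) - 1)
    return ratio, p

# Made-up first-component scores: a tight pathway within a widely spread data set.
pathway = [0.1, -0.1, 0.2, -0.2, 0.0, 0.05]
everything = pathway + [-12.0, -9.0, -6.0, -3.0, 3.0, 6.0]
ratio, p_value = stable_pathway_test(pathway, everything)
```

Note that the pathway genes are themselves part of the overall data, so the two variance estimates are not strictly independent; the p-value is best read as a ranking score for stable pathways.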

The stable-pathway pipeline was added to the function repository of EMMA2. For each pathway, it returns a p-value and a p-value adjusted by Bonferroni's method. As a result, a small number of stable pathways could be identified as significant with p ≤ 0.05. The top-scoring pathway with respect to its p-value is the Glycolysis/Gluconeogenesis pathway; it is also the only significant finding after Bonferroni correction.
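Bonferroni's method multiplies each raw p-value by the number of tests and caps the result at 1. The short sketch below shows why several pathways can be significant on their raw p-values while only one survives the correction; the pathway names and p-values are invented for illustration and are not results from the experiment.

```python
def bonferroni(p_values):
    """Adjust p-values for multiple testing by Bonferroni's method."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up raw p-values for three hypothetical pathways.
raw = {"pathway A": 0.0004, "pathway B": 0.03, "pathway C": 0.4}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))

significant_raw = [name for name, p in raw.items() if p <= 0.05]
significant_adjusted = [name for name, p in adjusted.items() if p <= 0.05]
```

Here two pathways pass the threshold before the correction and one afterwards. With 87 pathways the multiplier is correspondingly larger, which makes the correction considerably more conservative.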

In conclusion, it seems reasonable to investigate the Glycolysis/Gluconeogenesis pathway further. In the next step, a line plot of the expression profiles of this pathway was produced. The plot exhibited an increase in the M-values at the last time point for the majority of the genes in this pathway. When each of the genes was tested individually by a t-test, however, no significant change in expression was detected for any of them. This supports the hypothesis that the analysis of per-pathway expression profiles yields relevant findings that cannot be made on a per-gene basis and that can, at least to some extent, be detected automatically by an analysis pipeline.

[Figure 8.9 plot: one box per KEGG pathway; the pathway labels run alphabetically from "1- and 2-Methylnaphthalene degradation" to "Vitamin B6 metabolism"; numeric axis ticks at −15, −10, −5, 0, 5.]

Figure 8.9: Boxplot of the data projected on the first principal component for each KEGG pathway.
