
8.5. The Plant Microarray Databases 159

Integrated structural, functional and comparative genomics of the model legume Medicago truncatula”.

The Grainlegumes Integrated Project (GLIP) is an international project co-funded by the European Commission (http://www.eugrainlegumes.org); it builds on the activities of the MEDICAGO project. A stated goal of the project is to improve the use of legume plants such as peas, lupins, and beans as crops for protein nutrition in European agriculture. A particular problem for European farmers growing grain legumes is yield inconsistency caused by a variety of biotic and abiotic stresses. The project currently comprises 53 contractors from 18 countries. Microarray data analysis is a substantial part of legume research within GLIP. Currently, work package 5.3 'transcription profiling' is concerned with microarrays, and work package 6.1 'bioinformatics' covers data analysis and the development of databases for sharing the resulting data between the institutions that are part of GLIP. Another deliverable of this work package is the integration of the heterogeneous data sources that emerge during the project, both with each other and with plant-related databases from other projects.

Within the GLIP project, Medicago truncatula and Pisum sativum (pea) are used as model organisms. For neither plant has the complete genomic sequence been finished. For Medicago truncatula the sequencing project is ongoing, but the sequence and its annotations are not yet stable. Therefore, tentative consensus sequences (TCs) of ESTs have been used to generate 70-mer oligonucleotides for the microarray designs. The use of ESTs and the existence of an only partially sequenced genome mark a fundamental difference between the GLIP project and microbial projects, where fully sequenced genomes are available. The clustering of ESTs into TCs is subject to frequent changes; a direct mapping of reporter oligomer sequences is feasible in principle, but it would not be final.

Within the GLIP project, three different microarray designs are used. The Mt16kOli1 design (Hohnjec et al., 2005) contains 16,086 unique reporters spotted in duplicate, representing all TCs of the TIGR Medicago truncatula Gene Index (footnote 1). This array design has 33,696 features laid out in 4 horizontal and 12 vertical grids.
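The layout figures quoted for the Mt16kOli1 design can be cross-checked with a few lines of arithmetic. The following Python sketch uses only the numbers given in the text. Reading the features that are not duplicated reporters as controls or empty spots is an assumption; the text does not say what they are.

```python
# Figures taken from the description of the Mt16kOli1 design.
UNIQUE_REPORTERS = 16_086
REPLICATES_PER_REPORTER = 2          # "spotted in duplicate"
TOTAL_FEATURES = 33_696
GRIDS_HORIZONTAL, GRIDS_VERTICAL = 4, 12

grids = GRIDS_HORIZONTAL * GRIDS_VERTICAL
features_per_grid, remainder = divmod(TOTAL_FEATURES, grids)
reporter_features = UNIQUE_REPORTERS * REPLICATES_PER_REPORTER

# Features not explained by the duplicated reporters.
# Assumption: these are controls or empty spots; the text does not specify.
other_features = TOTAL_FEATURES - reporter_features

print(f"{grids} grids with {features_per_grid} features each (remainder {remainder})")
print(f"{reporter_features} reporter features, {other_features} other features")
```

The feature count divides evenly over the grids, which is consistent with the stated layout.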

The Mt16kOli1Plus layout (Thompson et al., 2005) is based on the Mt16kOli1 oligo set, extended by 384 reporters that mainly represent transcription factors and other known regulatory elements. With 34,944 spots, the Mt16kOli1Plus microarray is one of the largest cDNA microarray designs currently available. An overview of EMMA2 and its application in GLIP is given in Küster and Dondrup (2006). A general overview of all transcriptomics and bioinformatics tools provided by the CeBiTec for legume plants is given in the Medicago truncatula handbook (Bekel et al., 2006).

1 www.tigr.org


| Species | Array design | Array type | ArrayExpress identifier | Sequences on the array |
|---|---|---|---|---|
| M. truncatula | Mt6kRIT | cDNA macroarray | A-MEXP-80 | Probes representing 5648 EST-clusters from M. truncatula root nodules, AM roots, and uninfected roots (Küster et al., 2004) |
| M. truncatula | Mt6kRIT | cDNA microarray | A-MEXP-81 | Probes representing 5648 EST-clusters from M. truncatula root nodules, AM roots, and uninfected roots (Küster et al., 2004) |
| M. truncatula | Mt8k | cDNA microarray | A-MEXP-84 | Mt6kRIT probe set plus 1776 probes representing EST-clusters from M. truncatula flowers and pods (Firnhaber et al., 2005) |
| M. truncatula | Mt16kOLI1 | 70mer oligonucleotide microarray | A-MEXP-85 | Probes representing 16,086 tentative consensus sequences of the TIGR M. truncatula Gene Index version 5 (Hohnjec et al., 2005) |
| M. truncatula | Mt16kOLI1Plus | 70mer oligonucleotide microarray | A-MEXP-138 | Mt16kOLI1 probe set plus 384 probes primarily representing transcription factors (Thompson et al., 2005) |
| P. tremula + A. muscaria | Pt2.4kOLI1 | 70mer oligonucleotide microarray | A-MEXP-202 | Probes representing 2350 EST-clusters from poplar ECM (Uwe Nehls, Universität Tübingen) |

Table 8.4: Expression profiling tools generated in the MolMyk, MEDICAGO, and GLIP projects.

8.5.1 Project Specific Requirements

One goal of all Medicago-related projects is to deliver integrated databases that store the microarray-based expression data obtained in the course of the project, together with relevant information on the experimental conditions profiled and the protocols used to obtain the transcriptome profiles.

The sequencing of the Medicago truncatula genome is currently an ongoing project. Consequently, the reporter sequences of the microarrays were designed against ESTs. From the point of view of data representation, this has no direct implications: the PCR primer pairs and oligonucleotide sequences can be entered directly into EMMA2 sequence objects. The EST sequences are organized into tentative consensus sequences by clustering. As new ESTs are sequenced or existing ESTs are resequenced, the assignment of ESTs to TCs may change, and with it the annotation of the TCs. This re-annotation is carried out frequently, creating the need for dynamic updates of the annotations.

The EST annotation was to be kept up to date by linking the internal sequence annotations to an external component by means of the BRIDGE integration component. Initially, no suitable BRIDGE-aware component was available to link against. As the internal data representation of EMMA2 allows free-text annotations of sequences to be stored, this was used as a fallback. The ESTs were then linked to the TIGR Medicago Gene Index to facilitate mapping the ESTs to their TCs. The sequence descriptions in EMMA2 were regularly updated by a script. At a later stage, the sequence data was linked to the BRIDGE-aware SAMS application (see below).
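The fallback described here, free-text descriptions that a script refreshes whenever the EST-to-TC assignment changes, can be illustrated with a minimal Python sketch. It is not the script used in the project. The `Reporter` class, the function name, and all identifiers and annotations in the example data are hypothetical and stand in for EMMA2 sequence objects and gene index exports.

```python
from dataclasses import dataclass


@dataclass
class Reporter:
    """Hypothetical stand-in for a sequence object with a free-text description."""
    name: str
    est_id: str
    description: str = ""


def refresh_descriptions(reporters, est_to_tc, tc_annotation):
    """Rewrite each description from the current EST -> TC assignment.

    Returns the names of the reporters whose description changed.
    """
    changed = []
    for reporter in reporters:
        tc = est_to_tc.get(reporter.est_id)
        if tc is None:
            new_description = "no TC assigned"
        else:
            new_description = f"{tc}: {tc_annotation.get(tc, 'unannotated')}"
        if new_description != reporter.description:
            reporter.description = new_description
            changed.append(reporter.name)
    return changed


# Invented example data: EST_A has moved from TC100 to TC101 after re-clustering.
reporters = [
    Reporter("MT000001", "EST_A", "TC100: unknown protein"),
    Reporter("MT000002", "EST_B", "TC200: nodulin"),
]
est_to_tc = {"EST_A": "TC101", "EST_B": "TC200"}
tc_annotation = {"TC101": "putative transcription factor", "TC200": "nodulin"}

changed = refresh_descriptions(reporters, est_to_tc, tc_annotation)
print(changed)
```

Returning only the changed reporters keeps each periodic run cheap to audit, since unchanged descriptions are left untouched.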

Another specific requirement that emerged early in the GLIP project was a data-mining component for sequence annotations. Users should be able to query expression data from the whole project, not only from specific experiments. Queries should be possible either by known unique sequence identifiers or by a boolean full-text search within the annotations. This search strategy should resemble the search mechanisms employed by web search engines.
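To make the requirement concrete, the following Python sketch shows one simple way a boolean full-text search over annotations can work: an inverted index from tokens to sequence identifiers, with plain terms combined by AND and a leading '-' excluding a term, as in common web search syntax. This is an illustration under those assumptions, not the EMMA2 implementation; the query syntax, function names, and example annotations are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(annotations):
    """Map each token to the set of sequence identifiers whose annotation contains it."""
    index = defaultdict(set)
    for seq_id, text in annotations.items():
        for token in tokenize(text):
            index[token].add(seq_id)
    return index


def search(index, all_ids, query):
    """Evaluate a query such as 'transcription factor -putative'.

    Plain terms are combined with AND; a leading '-' excludes a term.
    """
    hits = set(all_ids)
    for word in query.split():
        negate = word.startswith("-")
        for token in tokenize(word):
            matches = index.get(token, set())
            hits = hits - matches if negate else hits & matches
    return sorted(hits)


# Invented example annotations keyed by hypothetical sequence identifiers.
annotations = {
    "TC1": "putative MYB transcription factor",
    "TC2": "nodule-specific transcription factor",
    "TC3": "leghemoglobin, nodule-specific",
}
index = build_index(annotations)

print(search(index, annotations, "transcription factor"))
print(search(index, annotations, "transcription factor -putative"))
print(search(index, annotations, "nodule -transcription"))
```

Because the index covers all annotations at once, a query of this kind is not tied to a single experiment; the matching identifiers can then be used to look up expression data across the whole project.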

8.5.2 Project Setup

The GLIP microarray database was set up using the standardized procedures. Databases for the ArrayLIMS and EMMA2 systems were created using the standard table definitions. No additional optimizations were required at first. The standard role definitions were also found to be sufficient for the GLIP database. The project administrator (termed 'Chief') was the only user registered initially. All other users were registered by the administrator on request.

Concerns of data privacy raised by the user community resulted in the creation of several user-groups making collaboration within an institution possible, while disclosing the data to other users for a limited period of time.

The requested integration of the novel data-mining tool made it necessary to modify the database and the EMMA2 modules. Owing to the flexible modular design, this step required only limited effort. To make specific data fields searchable, a single full-text index was added to the Description object of the GLIP database. A display module and a backend module were added to the standard modules to provide the requested functionality (see Figure 8.10). After this implementation step and an automated database update, the data-mining functionality was automatically available within all other projects. Data integration with the ESTs stored in the SAMS system was established by adding BRIDGE URIs to the array layouts in EMMA2.

8.5.3 Results

After the set-up phase, the users were able to upload experimental data to the ArrayLIMS and to use the analysis pipelines autonomously. Currently, the plant databases contain a total of over 300 hybridizations.

Several relevant findings were obtained by using the microarray data in conjunction with the EMMA2 analysis pipelines. In a study by Yahyaoui et al. (2004), over 750 genes, including a large proportion of transcription factors, were found to be differentially expressed during root nodulation using the Mt6kRIT macroarrays and microarrays. The authors applied a pipeline of normalization and t-tests with a combined filtering strategy, in combination with hierarchical cluster analysis. By visual inspection of the cluster results, the authors arrived at five independent clusters and concluded that there is a clear switch between a general root-specific and a nodule-specific gene expression program.
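The general shape of such a pipeline (normalize, test each gene across replicates, filter) can be sketched in Python as follows. This is a simplified illustration and neither the EMMA2 pipeline nor the exact method of the cited study. It assumes log2 expression ratios, global median centering as the normalization, a one-sample t statistic per gene, and a combined filter on mean log ratio and on the t statistic (used here in place of a p-value cutoff). The thresholds and data are invented, and the hierarchical clustering step is omitted.

```python
import math
import statistics


def median_center(log_ratios):
    """Global median normalization: shift one array so its median log2 ratio is 0."""
    shift = statistics.median(log_ratios)
    return [value - shift for value in log_ratios]


def t_statistic(values):
    """One-sample t statistic of replicate log ratios against a mean of 0."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(values)))


def differentially_expressed(arrays, genes, min_abs_log2=1.0, min_abs_t=4.0):
    """Return genes passing both the fold-change filter and the t-statistic filter."""
    normalized = [median_center(array) for array in arrays]
    selected = []
    for i, gene in enumerate(genes):
        replicates = [array[i] for array in normalized]
        if (abs(statistics.mean(replicates)) >= min_abs_log2
                and abs(t_statistic(replicates)) >= min_abs_t):
            selected.append(gene)
    return selected


# Invented log2 ratios: three replicate arrays, five genes each.
genes = ["g1", "g2", "g3", "g4", "g5"]
arrays = [
    [2.2, 0.2, 0.1, 0.3, -1.4],
    [1.9, -0.1, 0.0, -0.2, 0.4],
    [2.5, 0.3, 0.5, 0.4, -2.0],
]
print(differentially_expressed(arrays, genes))
```

In the example, g5 has a large mean change but inconsistent replicates, so it passes the fold-change filter and fails the t-statistic filter; only g1 passes both, which is the point of combining the two criteria.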

Based on Mt6kRIT microarray hybridizations, several comparative transcription profiling studies of root nodules and of root tissues during AM formation (Küster et al., 2004; Manthey et al., 2004) now allow a more global comparison of expression profiles during nodulation and mycorrhiza formation. The two endosymbioses, although known to share common mechanisms, were found to have only a limited overlap in their genetic programs, with 75 genes being co-induced in the two interactions.

The article by Firnhaber and colleagues provides insights into the regulation of gene expression during the development of M. truncatula flowers and pods. The authors describe the extension of the Mt6kRIT towards the Mt8k microarrays and their subsequent application to identify more than 700 genes with developmentally regulated expression (Firnhaber et al., 2005).

In a recent study, the more comprehensive Mt16kOli1 70mer oligonucleotide microarrays were applied to specify the overlapping genetic program activated by two commonly studied microsymbionts, Glomus mosseae and Glomus intraradices. In total, 201 plant genes were significantly co-induced at least 2-fold in either interaction (Hohnjec et al., 2005), using normalization functions and statistical analysis pipelines implemented in EMMA2. A set of well-known marker genes was found to be co-activated, thus validating the transcriptomics data (Hohnjec et al., 2006).
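The notion of co-induction at a fold-change threshold can be expressed as a set intersection. The Python sketch below assumes that a gene counts as co-induced when its log2 fold change reaches the threshold in both interactions; the significance testing mentioned in the text is left out. Gene names and values are invented, and the function names are hypothetical.

```python
import math


def induced(log2_fold_changes, min_fold=2.0):
    """Genes whose induction reaches the given fold change (log2 scale input)."""
    threshold = math.log2(min_fold)
    return {gene for gene, value in log2_fold_changes.items() if value >= threshold}


def co_induced(interaction_a, interaction_b, min_fold=2.0):
    """Genes induced at least min_fold in both interactions."""
    return induced(interaction_a, min_fold) & induced(interaction_b, min_fold)


# Invented log2 fold changes for two mycorrhizal interactions.
with_g_mosseae = {"geneA": 1.5, "geneB": 0.4, "geneC": 2.3, "geneD": 1.0}
with_g_intraradices = {"geneA": 1.1, "geneB": 1.8, "geneC": 0.9, "geneD": 1.0}

shared = sorted(co_induced(with_g_mosseae, with_g_intraradices))
print(shared)
```

A 2-fold induction corresponds to a log2 fold change of exactly 1, so geneD at 1.0 in both interactions is included, while geneB and geneC each miss the threshold in one interaction.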

As EMMA2 is used throughout all three plant-related functional genomics projects, a cross-project integration of the obtained expression data is feasible.

Figure 8.10: The Datamining Wizard of EMMA2. It allows users to search for expression data by means of a boolean full-text search. The search can be restricted to experiments and conditions. Below the search mask, a table containing the search results is shown.

8.6. The Mamma Carcinoma Microarray Database 165