Extension of EMMA2 to store and analyze Affymetrix GeneChip® expression datasets

CHAPTER 5

Implementation

This chapter describes the implementation of the previously designed applications.

First, it illustrates the implementation for storing and analyzing Affymetrix GeneChip® expression datasets with EMMA2 (cf. Section 2.2.4) in the same way as conventional oligonucleotide microarrays. Second, the implementation of the TRUNCATULIX data warehouse is presented, focusing on data handling, data access, and frontend visualization. Finally, the implementation of the tool MediPlEx, which combines different gene expression analyses, is outlined in detail.

5.1 Extension of EMMA2 to store and analyze Affymetrix GeneChip® expression datasets

Layout import

Importing a GeneChip® array layout is more complex than importing the layout of a classical oligonucleotide microarray, because the data to be imported is not stored in a single file but spread over three files (for details see Section 4.1).

Due to the size of a GeneChip® dataset (about 1,000,000 reporters), the import of the layout is divided into two successive steps:

In the first step, the essential information of the layout is imported. This includes the reporters, genes, and basic layout information. In the second step, all additional information is added to the array layout, such as the sequences of the genes and reporters, the x and y coordinates, and the PM and MM information.

This two-step method makes it possible to create the layout quickly and to add all additional data afterwards, which is more memory-efficient.

Implementation of the GeneChip® array layout importer:

The GeneChip® array layout importer is implemented in Perl so that it can be integrated into the existing pipeline framework of EMMA2. The main steps of the import process are:

• Read the CDF file, loading only the basic layout information and the reporter names.

• Create an ArrayLayout in the database of the EMMA2 project, containing the basic information loaded before.

• Read the CDF file again, loading all stored information (including the reporter sequences and the x and y coordinates of the spots).

• Store this information and link the objects in the ArrayLayout.

• Read the SIF file, which contains the gene names and FASTA sequences.

• Store this information in the ArrayDesign and link it to the existing objects.
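The two-pass idea behind these steps can be sketched as follows. This is a minimal illustration in Python, not the Perl importer of EMMA2: the classes `Reporter` and `ArrayLayout`, the record fields, and both functions are hypothetical names chosen for the example, and the in-memory records stand in for the parsed CDF content.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reporter:
    # Hypothetical stand-in for a reporter object in the layout.
    name: str
    gene: str
    sequence: Optional[str] = None
    x: Optional[int] = None
    y: Optional[int] = None
    is_pm: Optional[bool] = None


@dataclass
class ArrayLayout:
    # Hypothetical stand-in for the ArrayLayout stored in the database.
    name: str
    reporters: dict = field(default_factory=dict)


def import_basic_layout(name, records):
    """Step 1: create the layout with reporter and gene names only."""
    layout = ArrayLayout(name)
    for rec in records:
        layout.reporters[rec["reporter"]] = Reporter(rec["reporter"], rec["gene"])
    return layout


def add_details(layout, records):
    """Step 2: attach sequences, coordinates, and PM/MM flags to existing reporters."""
    for rec in records:
        rep = layout.reporters[rec["reporter"]]
        rep.sequence = rec.get("sequence")
        rep.x, rep.y = rec.get("x"), rec.get("y")
        rep.is_pm = rec.get("is_pm")
    return layout


# Toy records; real data would come from the layout files.
records = [
    {"reporter": "probe_1", "gene": "gene_A", "sequence": "ACGT", "x": 0, "y": 0, "is_pm": True},
    {"reporter": "probe_2", "gene": "gene_A", "sequence": "ACCT", "x": 0, "y": 1, "is_pm": False},
]
layout = import_basic_layout("demo_layout", records)
basic_only = layout.reporters["probe_1"].sequence  # still None after step 1
add_details(layout, records)
```

After the first pass the layout is already usable, since every reporter exists with its gene assignment; the second pass only fills in attributes of objects that are already linked.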

The web interface of EMMA2 is extended to load Affymetrix GeneChip layouts as shown in Figure 5.1.

As an additional option, a script is implemented to import the sequences of the spotted reporters stored in the probe tab file. This information is generally not of interest, so no corresponding option is integrated into the web interface of EMMA2. If a user wants to add this information to the imported ArrayLayout, an administrator can run the script to do so.

Import of Affymetrix GeneChip® datasets

The raw GeneChip® microarray expression datasets can be uploaded into the ArrayLIMS application in the same way as oligonucleotide microarrays. The raw files are stored and administered internally. The EMMA2 software can connect to the ArrayLIMS application and load these raw datasets during the experiment creation step. In this step, the microarray layout has to be chosen. If an Affymetrix GeneChip® layout is selected, only GeneChip® datasets can be imported from ArrayLIMS. The datasets are loaded via the R statistical programming language (using the affy library from the Bioconductor project). Transferred back to the Perl O2DBI API (using RSPerl), the raw values are stored in the EMMA2 database in MBAD (BioAssayData → MeasuredBioAssayData) objects. A screenshot of the web interface for the import of Affymetrix GeneChip® datasets into EMMA2 is shown in Figure 5.2.

Figure 5.1: A screenshot of the EMMA2 web interface focusing on the import of an Affymetrix GeneChip® layout.

Data pre-processing and processing

For Affymetrix GeneChip® microarrays, a range of pre-processing and normalization methods has been developed in recent years. For the implementation in the EMMA2 system, the most commonly used methods are integrated as pipeline tools that are computed for all arrays in one experiment. As for oligonucleotide microarrays, the R statistical programming language is used for the computation, as it is very fast and efficient. The functions adapted for EMMA2 rely on the functions provided by the affy package by Gautier et al. (2004) (MAS5, RMA, and MBEI) from Bioconductor and the expresso package by Wu et al. (2003) (GCRMA). The different normalization functions offer different options that can be used to fine-tune the calculations.

These functions are explained below; their options can be adjusted in the web interface:

MAS5.0:

MAS5.0 normalization is performed on each of the GeneChips® separately, using one GeneChip® as a reference. A background correction is performed using perfect-match (PM) and mismatch (MM) probes.
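The idea of scaling each array against a reference array can be sketched as follows. This is a simplified Python illustration and not the MAS5.0 code used by EMMA2: the function names are hypothetical, the trim fraction of 2% is an assumption, and background correction and probe set summarization are left out.

```python
def trimmed_mean(values, trim=0.02):
    """Mean after discarding the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def scale_to_reference(arrays, reference, trim=0.02):
    """Scale every array so that its trimmed mean equals that of the reference array."""
    target = trimmed_mean(arrays[reference], trim)
    scaled = {}
    for name, values in arrays.items():
        factor = target / trimmed_mean(values, trim)
        scaled[name] = [v * factor for v in values]
    return scaled


# Toy intensities; chip_b is systematically half as bright as the reference.
arrays = {
    "chip_ref": [100.0, 200.0, 300.0, 400.0],
    "chip_b": [50.0, 100.0, 150.0, 200.0],
}
scaled = scale_to_reference(arrays, "chip_ref")
```

Because each array is treated on its own, adding or removing an array from the experiment does not change the scaled values of the other arrays.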

Figure 5.2: A screenshot of the EMMA2 dialog for importing Affymetrix GeneChip® arrays into a new experiment in EMMA2.

RMA:

Using the RMA (Robust Multichip Average) normalization defined by Irizarry et al. (2003), all GeneChips® in the experiment are normalized together. The algorithm uses a pool of perfect-match (PM) probes to normalize each value. As background correction, the PM distribution is used to get an overall background level. Then a transformation based on a background noise and signal model is applied.
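Normalizing all arrays together can be illustrated with quantile normalization, the between-array step of the published RMA method. The following Python sketch shows only this step under simplifying assumptions: the function name is hypothetical, ties are not averaged, and neither the background correction nor the median polish summarization is included.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: dict mapping array name -> list of intensities, all of equal length.
    """
    names = list(arrays)
    n = len(arrays[names[0]])
    sorted_cols = [sorted(arrays[name]) for name in names]
    # The mean of the k-th smallest value over all arrays forms the reference distribution.
    reference = [sum(col[k] for col in sorted_cols) / len(names) for k in range(n)]
    normalized = {}
    for name in names:
        values = arrays[name]
        # Indices of the values in ascending order, i.e. the rank of each probe.
        order = sorted(range(n), key=lambda i: values[i])
        result = [0.0] * n
        for rank, idx in enumerate(order):
            result[idx] = reference[rank]
        normalized[name] = result
    return normalized


arrays = {"chip_a": [5.0, 2.0, 3.0], "chip_b": [4.0, 1.0, 6.0]}
normalized = quantile_normalize(arrays)
```

In contrast to a per-array scaling, the result for one array depends on all other arrays in the experiment, which is why the complete experiment has to be passed to the normalization at once.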

GCRMA:

GCRMA uses the RMA normalization, but its background correction takes the probe sequences, in particular their GC content, into account. The perfect-match (PM) values are background-corrected, normalized, and finally summarized, resulting in a set of expression measures.

Expresso:

The expresso package offers more options that can be adjusted to the datasets. It implements nearly all available algorithms for background correction, normalization, PM adjustment, and expression value calculation. The available background correction methods are rma, rma2, mas, and none.

For the normalization, the user can select from the following algorithms: quantiles, scaling (mas5-like), constant, invariant set (aka dChip), paired loess, contrast, quantiles.probeset, qspline, and quantiles.robust. The available PM adjustment methods are pmonly, subtractmm (mas4), and mas (mas5). For the calculation of an expression value, the algorithms mas, medianpolish (rma), playerout, liwong (aka dChip), and avgdiff are available.
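Since each of the four processing steps accepts only a fixed set of methods, a selection made in the web interface can be checked before a job is started. The following Python sketch shows one way to do this; the dictionary keys, the function name, and the use of the option labels from the text as identifiers are assumptions made for illustration and do not reflect the EMMA2 code.

```python
# Option labels as listed in the text; the step names used as keys are hypothetical.
EXPRESSO_OPTIONS = {
    "background": {"rma", "rma2", "mas", "none"},
    "normalization": {
        "quantiles", "scaling", "constant", "invariant set", "paired loess",
        "contrast", "quantiles.probeset", "qspline", "quantiles.robust",
    },
    "pm_adjustment": {"pmonly", "subtractmm", "mas"},
    "summary": {"mas", "medianpolish", "playerout", "liwong", "avgdiff"},
}


def validate_selection(selection):
    """Return a list of error messages; an empty list means the selection is valid."""
    errors = []
    for step, allowed in EXPRESSO_OPTIONS.items():
        choice = selection.get(step)
        if choice not in allowed:
            errors.append(f"{step}: {choice!r} is not one of {sorted(allowed)}")
    return errors


# A combination that mirrors the RMA defaults named in the text.
rma_like = {
    "background": "rma",
    "normalization": "quantiles",
    "pm_adjustment": "pmonly",
    "summary": "medianpolish",
}
errors_ok = validate_selection(rma_like)
errors_bad = validate_selection({**rma_like, "summary": "median"})
```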

Figure 5.3: The screenshot shows the EMMA2 interface for preprocessing and normalization of Affymetrix GeneChip® microarrays.

The user can also decide to log-transform the results of the computation.

The normalization of the GeneChips® in one experiment is set up as follows:

• Load all MBAD (MeasuredBioAssayData) objects from the database.

• Create a job to be computed on the compute cluster using R, starting the selected normalization function with the selected options and the complete datasets of the experiment.

• Store the expression datasets in DBAD (DerivedBioAssayData) objects in the database.
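The flow from measured to derived data can be sketched as follows. This Python illustration makes several assumptions: plain dictionaries stand in for the MBAD and DBAD objects, the cluster job is replaced by a direct function call, the logarithm is taken to base 2, and the function names are hypothetical.

```python
import math


def run_normalization(mbad_objects, normalize, log_transform=True):
    """Apply a normalization to all arrays of an experiment at once.

    mbad_objects: dict mapping array name -> raw intensities (stand-in for MBAD objects).
    normalize: callable that takes and returns such a dict.
    Returns a dict standing in for the DBAD objects.
    """
    normalized = normalize(mbad_objects)
    dbad_objects = {}
    for name, values in normalized.items():
        if log_transform:
            # Base 2 is an assumption; the text only states that results can be log-transformed.
            values = [math.log2(v) for v in values]
        dbad_objects[name] = list(values)
    return dbad_objects


def identity(arrays):
    """Placeholder normalization that leaves the values unchanged."""
    return {name: list(values) for name, values in arrays.items()}


raw = {"chip_a": [8.0, 16.0], "chip_b": [4.0, 32.0]}
derived = run_normalization(raw, identity)
```

Passing the normalization as a callable keeps the surrounding steps identical for every method, so that only the selected function and its options differ between jobs.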

A screenshot of the web interface of EMMA2 for selecting the preprocessing and normalization method is presented in Figure 5.3; a screenshot presenting the selectable options is shown in Figure 5.4.

For quality control, the R package affyQCReport is integrated and can create PDF documents with various statistics and plots.

Significance tests

Figure 5.4: A screenshot showing the EMMA2 interface to select normalization options for the integrated GCRMA normalization.

The significance tests to be used with Affymetrix GeneChip® datasets are nearly the same as those used for classical oligonucleotide microarrays. One new significance test was added to the EMMA2 system, an Affymetrix-optimized two-sample t-test (Affy two-sample test). The pipelines load the normalized datasets (DBAD objects) and run the selected significance test in the R environment. The calculated values are stored in the database as DBAD objects afterwards. Figure 5.5 shows a screenshot of the EMMA2 web databrowser and the loaded GeneChip® expression datasets.
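For orientation, the statistic behind a two-sample t-test can be sketched as follows. The text does not specify how the Affymetrix-optimized variant differs, so this Python example shows a generic Welch t-test for the normalized expression values of one gene under two conditions; the function name and the example values are made up, and no p-value is computed.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    var_a, var_b = variance(group_a), variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the mean difference
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df


# Hypothetical log-scale expression values of one gene, three replicates per condition.
t, df = welch_t([7.1, 7.3, 7.2], [8.0, 8.4, 8.2])
```

In a pipeline, such a statistic would be computed once per gene, and the resulting values stored alongside the expression data.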

Clustering

The clustering pipelines used for conventional oligonucleotide microarrays can be used for GeneChip® datasets as well, because the DBAD objects can be handled like classical oligonucleotide datasets. The pipeline "Hierarchical clustering (Top 1000)" is the common clustering pipeline; it allows the clustering to be adjusted via prefiltering options, significance filters, distance method, and clustering method.
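A prefilter followed by hierarchical clustering can be sketched as follows. This Python example is only an illustration of the general technique: ranking genes by variance, the Euclidean distance, and average linkage are assumptions standing in for the selectable options, and all function names are hypothetical.

```python
import math
from statistics import pvariance


def top_variable(profiles, n):
    """Prefilter: keep the n genes whose expression varies most across arrays."""
    ranked = sorted(profiles, key=lambda g: pvariance(profiles[g]), reverse=True)
    return {g: profiles[g] for g in ranked[:n]}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (member genes, distance) in order."""
    clusters = {g: [g] for g in profiles}  # cluster id -> member genes
    merges = []
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for i, ci in enumerate(ids):
            for cj in ids[i + 1:]:
                # Average distance over all pairs of members of the two clusters.
                dists = [euclidean(profiles[a], profiles[b])
                         for a in clusters[ci] for b in clusters[cj]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, ci, cj)
        d, ci, cj = best
        merged = clusters.pop(ci) + clusters.pop(cj)
        clusters[ci + "+" + cj] = merged
        merges.append((sorted(merged), d))
    return merges


# Toy expression profiles (gene -> values over three arrays).
profiles = {
    "g1": [1.0, 1.0, 1.0],
    "g2": [1.0, 2.0, 1.0],
    "g3": [5.0, 9.0, 5.0],
    "g4": [5.0, 8.0, 5.0],
}
filtered = top_variable(profiles, 3)
merges = average_linkage(filtered)
```

Restricting the clustering to a fixed number of genes keeps the pairwise distance computation manageable, since its cost grows quadratically with the number of genes.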
