On the one hand, a simpler representation loses some biological information. On the other hand, the inherent complexity of more sophisticated models can limit their applicability and interpretability.

Nevertheless, various models of a pathway can be used for integration with experimental data, provided they are in a computer-readable form.

1.5 Enrichment analysis

Enrichment analysis is a frequently used bioinformatic approach for annotating lists of DEGs. Many tests aim to detect pathways that are significantly altered between two experimental conditions based on expression profiles (Khatri et al., 2012; Maciejewski, 2013). These tests have been implemented in a plethora of tools. The enrichment tools differ in many aspects, ranging from the null hypothesis that is tested, through the statistical formulation, pathway data encoding and database support, to the software implementation and utility.

Within the scope of this thesis, two categories of enrichment methods are of interest, defined by the way in which pathway information is incorporated into the analysis. In the traditional enrichment approach, a pathway is treated as a simple gene list, omitting any knowledge of the relations between genes and proteins.

Methods belonging to this category are referred to as gene-set (GS) analysis methods. Another way of integrating a pathway into enrichment analysis takes into account its topological structure. These methods are referred to as pathway topology-based (PT-based).

The particular methods of focus in this thesis are described in detail in the sections 2.2.1 Gene-set enrichment methods and 2.2.2 Pathway topology-based methods. Here I provide the reader with a general overview of the methodological principles within these two enrichment analysis approaches.

1.5.1 Gene-set enrichment approach

The initial tools for enrichment analysis fall into the category of GS methods, and most of the classification strategies and criteria were defined within this approach. Notably, the earlier GS methods, such as Onto-Express, GoMiner or GoStat (Khatri et al., 2002; Zeeberg et al., 2003; Beißbarth and Speed, 2004), did not consider a gene set as a curated pathway gene list but as a Gene Ontology term (Ashburner et al., 2000).

With regard to the null hypothesis of the statistical tests they employ, enrichment methods can be categorized into two groups: competitive and self-contained (Goeman and Bühlmann, 2007). Competitive methods compare the genes in a pathway to its complement, usually represented by the rest of the genes measured in the experiment. This approach is naturally linked with gene sampling for p-value calculation. Self-contained methods consider only the genes within a pathway and test their association with the phenotype, using subject sampling for significance assessment. Therefore, competitive methods indicate whether the gene set differs from random gene sets of the same size in terms of association with the phenotype, whereas self-contained methods state how strong the association is, without considering other gene sets at all. Both approaches have their limitations. On the one hand, competitive methods coupled with gene sampling assume independence of genes, which does not hold in most cases. On the other hand, self-contained methods have been criticized for being too powerful and yielding too many significant gene sets. Furthermore, the number of experimental replicates is often too low for the purpose of subject sampling (Goeman and Bühlmann, 2007).
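The difference between gene sampling and subject sampling can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: the function names (`gene_score`, `competitive_p`, `self_contained_p`), the gene-level score (absolute difference of group means), the set statistic (mean gene score) and the toy data are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def gene_score(values, labels):
    """Hypothetical gene-level score: absolute difference of the two group means."""
    case = [v for v, lab in zip(values, labels) if lab == 1]
    ctrl = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(case) - mean(ctrl))


def competitive_p(expr, gene_set, labels, n_perm=1000, seed=0):
    """Gene sampling: compare the set with random gene sets of the same size."""
    rng = random.Random(seed)
    scores = {g: gene_score(v, labels) for g, v in expr.items()}
    observed = mean(scores[g] for g in gene_set)
    genes = sorted(scores)
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(genes, len(gene_set))
        if mean(scores[g] for g in sampled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def self_contained_p(expr, gene_set, labels, n_perm=1000, seed=0):
    """Subject sampling: shuffle sample labels, use only genes inside the set."""
    rng = random.Random(seed)

    def statistic(lab):
        return mean(gene_score(expr[g], lab) for g in gene_set)

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed): 20 genes, 3 case and 3 control samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {f"g{i}": [1.0] * 6 for i in range(20)}
pathway = ["g0", "g1", "g2", "g3"]
for g in pathway:
    expr[g] = [5.0, 5.0, 5.0, 0.0, 0.0, 0.0]  # strongly differential genes

p_comp = competitive_p(expr, pathway, labels)
p_self = self_contained_p(expr, pathway, labels)
print(p_comp, p_self)
```

In this toy setting the replicate problem mentioned above becomes visible: with three samples per group there are only 20 distinct label assignments, two of which reproduce the observed statistic, so the subject-sampling p-value stays close to 0.1 no matter how strong the signal is.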

Another classification of enrichment methods separates them into over-representation analysis (ORA) and functional class scoring (FCS) groups (Khatri et al., 2012). ORA is the earliest strategy for enrichment analysis and refers to 2×2 table methods such as Fisher’s exact test, the hypergeometric test and the chi-squared test. It represents exclusively the competitive approach. The main drawback of ORA is that it requires a strict cut-off in the list of DEGs and that the enrichment results depend strongly on the chosen threshold.
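As an illustration of the ORA principle and its threshold dependence, the following Python sketch computes an upper-tail hypergeometric p-value for one pathway at two different cut-offs. The function names (`ora_p_value`, `ora_from_cutoff`), the gene-level p-values and the pathway membership are invented for this example and do not come from any of the cited tools.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_deg, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_universe: number of measured genes
    n_pathway:  number of measured genes that belong to the pathway
    n_deg:      number of genes declared differentially expressed
    n_overlap:  number of DEGs that belong to the pathway
    """
    max_overlap = min(n_pathway, n_deg)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_deg - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(n_universe, n_deg)


def ora_from_cutoff(gene_p_values, pathway_genes, cutoff):
    """Select DEGs with a strict cut-off, then test the pathway for enrichment."""
    degs = {g for g, p in gene_p_values.items() if p <= cutoff}
    pathway = set(pathway_genes) & set(gene_p_values)
    return ora_p_value(
        len(gene_p_values), len(pathway), len(degs), len(degs & pathway)
    )


# Toy gene-level p-values (assumed) and a four-gene pathway.
gene_p_values = {
    "g1": 0.001, "g2": 0.01, "g3": 0.03, "g4": 0.2,
    "g5": 0.5, "g6": 0.04, "g7": 0.6, "g8": 0.9,
}
pathway_genes = {"g1", "g2", "g3", "g4"}

# The same pathway yields different enrichment p-values at different cut-offs.
for cutoff in (0.05, 0.01):
    print(cutoff, ora_from_cutoff(gene_p_values, pathway_genes, cutoff))
```

The upper-tail hypergeometric probability used here is equivalent to a one-sided Fisher's exact test on the corresponding 2×2 table.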

The FCS approach was therefore suggested to overcome this difficulty. FCS methods usually work in three steps: first, genes are scored; then, the gene-level scores are transformed into a pathway-level score; and finally, the significance of the observed pathway-level score is assessed. The FCS group includes both competitive and self-contained methods, depending on the pathway-level transformation and the significance assessment of the pathway-level score.
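The three FCS steps can be sketched in Python as follows. This is only one possible instance under stated assumptions: the gene-level score is a Welch t-statistic, the pathway-level score is the mean absolute t-statistic, and significance is assessed by gene sampling, which makes this particular variant competitive. The function names and the toy expression values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t-statistic for two groups (each needs at least two values)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se if se > 0 else 0.0


def score_genes(expr, labels):
    """Step 1: one gene-level score per gene; no DEG cut-off is applied."""
    scores = {}
    for gene, values in expr.items():
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = welch_t(case, ctrl)
    return scores


def pathway_score(gene_scores, pathway):
    """Step 2: aggregate gene-level scores into a pathway-level score."""
    return mean(abs(gene_scores[g]) for g in pathway)


def assess_significance(gene_scores, pathway, n_perm=1000, seed=0):
    """Step 3: compare the observed score with scores of random gene sets."""
    rng = random.Random(seed)
    observed = pathway_score(gene_scores, pathway)
    genes = sorted(gene_scores)
    hits = sum(
        pathway_score(gene_scores, rng.sample(genes, len(pathway))) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)


# Toy expression data (assumed): 3 case samples followed by 3 control samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.1, 4.9, 5.3, 1.0, 1.2, 0.8],
    "g2": [3.9, 4.2, 4.0, 2.0, 2.1, 1.8],
    "g3": [1.0, 1.1, 0.9, 1.0, 0.9, 1.1],
    "g4": [2.0, 2.2, 1.9, 2.1, 2.0, 1.9],
    "g5": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],
    "g6": [3.0, 3.1, 2.9, 3.0, 2.9, 3.1],
}
scores = score_genes(expr, labels)
observed, p_value = assess_significance(scores, ["g1", "g2"])
print(observed, p_value)
```

Replacing the gene sampling in step 3 with a permutation of the sample labels, followed by a recomputation of steps 1 and 2, would turn the same pipeline into a self-contained test.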

1.5.2 Pathway topology-based approach

One of the first PT-based methods was impact analysis, introduced by Draghici et al. (2007). Since then, this approach has become very popular, and a number of PT-based algorithms have been published (Mitrea et al., 2013).

In contrast to the traditional GS methods, the methodological concepts described in the previous section have not been defined for the new PT-based group in such explicit terms. In many cases the concepts can be easily extended. For instance, if a PT-based method applies a strict threshold to the gene list, it falls into the ORA category. If, in addition to the topological information, only expression data of the pathway genes are used to infer pathway significance, thereby omitting expression information of the genes outside the pathway, the approach reflects the self-contained concept. However, due to the inherent complexity of PT-based algorithms, in some cases it might be difficult to draw a strict line.

For the PT-based methods, an additional important categorization is based on how the topological information is exploited and incorporated into the calculations. With regard to the extent of topological information, some methods consider the position of a gene in the entire pathway structure, e.g. by an impact factor or a betweenness measure, whereas others account only for close interaction partners, termed neighbors (see the sections 1.6.2 Network analysis and 2.2.2 Pathway topology-based methods for an explanation of the topological measures).

Next, the topological information itself can be incorporated into an algorithm in various ways. The most straightforward approach is weighted GS analysis, where weights are assigned based on the topological measures (Gu et al., 2012).
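The idea of a weighted GS analysis can be sketched in Python as follows. This is a generic illustration, not the method of Gu et al. (2012): the choice of node degree as the topological measure, the function names (`degree_weights`, `weighted_pathway_score`) and the toy pathway are assumptions made for the example.

```python
def degree_weights(edges, genes):
    """Hypothetical topological weight: 1 + number of direct neighbors.

    The offset of 1 ensures that isolated genes still contribute.
    Every gene appearing in `edges` is assumed to be listed in `genes`.
    """
    degree = {g: 0 for g in genes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {g: 1 + d for g, d in degree.items()}


def weighted_pathway_score(gene_scores, weights):
    """Weighted mean of absolute gene-level scores."""
    total = sum(weights.values())
    return sum(w * abs(gene_scores[g]) for g, w in weights.items()) / total


# Toy pathway (assumed): gene A is a hub connected to B, C and D.
genes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D")]
weights = degree_weights(edges, genes)

# The same set of scores, with the strong signal placed on the hub or on a leaf.
hub_altered = {"A": 3.0, "B": 0.5, "C": 0.5, "D": 0.5}
leaf_altered = {"A": 0.5, "B": 3.0, "C": 0.5, "D": 0.5}

print(weighted_pathway_score(hub_altered, weights))
print(weighted_pathway_score(leaf_altered, weights))
```

An unweighted GS score would be identical for both configurations, whereas the weighted score is higher when the altered gene occupies a central position in the pathway.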

Further methods combine information from the standard ORA/FCS with a specific topology-based scoring system (Tarca et al., 2009; Dutta et al., 2012) or estimate pathway significance by using multivariate scoring models (Massa et al., 2010). For a detailed survey of PT-based methods with different pathway scoring systems the review of Mitrea et al. (2013) is recommended.

Within this thesis, the term enrichment analysis is used as the most general label comprising methods of all categories. Usage of the term pathway analysis can be ambiguous: sometimes it is used in the most general sense (Khatri et al., 2012), but sometimes it implies that the method accounts for the pathway graph structure. Therefore, the latter is here referred to as PT-based enrichment analysis to avoid confusion.