Validation approaches - Identification of Differentially Expressed Gene Modules in Heterogeneou

removes all biclusters which overlap in more than 0<o<1-fraction of its area with a bigger one.

3.4 Validation approaches

Validation approaches can be classified as supervised, when ground truth is available, and unsupervised, when it is not. In the first case, method performance can be calculated directly by comparing a set of found biclusters with a set of known biclusters. Many similarity measures suitable for comparing two sets of biclusters have been developed for the evaluation of biclustering results [110]. The choice of an optimal metric depends on the task, for example on the tolerance to type I and type II errors, the redundancy of the results, etc. In this thesis, the Relevance and Recovery scores proposed by Prelic et al. [227] and used in previous benchmarks [76, 258] were chosen (see subsection 4.5.1).
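To make the idea of comparing found and known biclusters concrete, the following is a minimal sketch of a gene-based match score in the spirit of Prelic et al. It assumes that each bicluster is represented only by its gene set and that the Jaccard index is used as the similarity; the exact variant used in this thesis is given in subsection 4.5.1 and may differ. The function names are illustrative.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(from_set, to_set):
    """Average, over biclusters in `from_set`, of the best Jaccard match in `to_set`.

    Assumption: biclusters are compared by gene sets only.
    """
    if not from_set or not to_set:
        return 0.0
    best = [max(jaccard(b1, b2) for b2 in to_set) for b1 in from_set]
    return sum(best) / len(best)


def relevance(found, known):
    """How well each found bicluster is matched by some known bicluster."""
    return match_score(found, known)


def recovery(found, known):
    """How well each known bicluster is matched by some found bicluster."""
    return match_score(known, found)


# Toy example: one of two known biclusters is partially found.
known = [{"A", "B", "C", "D"}, {"E", "F"}]
found = [{"A", "B", "C"}]
rel = relevance(found, known)
rec = recovery(found, known)
```

In this toy example the single found bicluster is a close match to a known one, so Relevance is high, while Recovery is lower because the second known bicluster is missed entirely.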

Direct performance evaluation is complicated by the fact that ground truth data may be unavailable or unreliable. Breast cancer, chosen for this thesis, is known to have several well-characterized subtypes that are distinguishable at the level of gene expression. However, even though some molecular subtypes of breast cancer are specified based on gene expression, we cannot treat them as absolute ground truth, because:

• unknown disease subtypes determined by the expression of different genes may exist along with the known ones;

• the known subtypes may be defined imprecisely. Indeed, some recent works have suggested extensions [224, 226] of the PAM50 molecular classification [209, 216].

Similar considerations may concern almost every biological dataset. This means that the evaluation of biclustering on real data may result in a biased performance estimate.

An alternative way of supervised evaluation in the absence of ground truth is a benchmark on synthetic data. This approach also allows the direct computation of performance and is therefore widely used by the community for the evaluation of biclustering methods. In this setting, the experimenter has full control of the data, which makes it possible to investigate how method performance depends on various data properties such as the number of biclusters, their overlap, the level of noise, etc. The results of benchmarks performed by Bozdag et al. [26], Eren et al. [76], and Padilha et al. [206] have demonstrated that method performance may vary widely depending on these characteristics of the data.

The main disadvantage of this approach is that simulated data may not reflect the complexity of real-world data or may miss some of its important aspects. This may lead to the over- or underestimation of method performance (see the discussion in chapter 5 for the details).
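As an illustration of a synthetic benchmark, the sketch below generates a noise matrix and implants one bicluster with shifted expression, returning the implanted genes and samples as ground truth. The Gaussian background, the additive shift, and all parameter names are assumptions made for this example and are not the simulation protocol used in this thesis.

```python
import random


def make_synthetic(n_genes=50, n_samples=20, bic_genes=5, bic_samples=6,
                   shift=3.0, noise_sd=1.0, seed=0):
    """Return (matrix, genes, samples) with one implanted bicluster.

    matrix[g][s] is the expression of gene g in sample s. Background values
    are drawn from N(0, noise_sd); cells of the implanted bicluster are
    additionally shifted by `shift` (illustrative model, an assumption).
    """
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
              for _ in range(n_genes)]
    genes = sorted(rng.sample(range(n_genes), bic_genes))
    samples = sorted(rng.sample(range(n_samples), bic_samples))
    for g in genes:
        for s in samples:
            matrix[g][s] += shift
    return matrix, genes, samples


matrix, true_genes, true_samples = make_synthetic(seed=1)
```

Varying `shift`, `noise_sd`, or the bicluster size in such a generator is one way to study how method performance depends on the properties of the data.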

In the absence of reliable ground truth, indirect validation of the results obtained on real data is still possible. Genes falling into the same bicluster demonstrate a similar pattern of expression and are expected to be functionally related. Therefore, to obtain indirect evidence of method performance, the resulting biclusters are tested for biological significance.

Almost all of the methods discussed in this chapter test gene sets for overlap with Gene Ontology (GO) categories. GO is a controlled vocabulary of gene attributes that annotates genes with the molecular functions they perform, the biological processes they participate in, and the cellular components in which they act. Overrepresentation of genes labeled with the same GO term in a bicluster compared to background genes points to their functional coherence and supports the reliability of this bicluster.
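Overrepresentation of a GO term in a bicluster is commonly assessed with a one-sided hypergeometric test. The text above does not specify which test is used, so the following is only a minimal sketch under that assumption; the function and argument names are illustrative.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: number of background genes
    K: number of background genes annotated with the GO term
    n: number of genes in the bicluster
    k: number of bicluster genes annotated with the GO term
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Toy example: 10 background genes, 3 annotated; a bicluster of 2 genes,
# both of which carry the annotation.
p = hypergeom_pvalue(k=2, n=2, K=3, N=10)
```

When many GO terms are tested for each bicluster, the resulting p-values would normally need a correction for multiple testing.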

A similar idea can be applied to the evaluation of patient groupings obtained as a result of biclustering. They can be tested for associations with various biological variables such as known disease subtypes or survival. Of course, the absence of an association between a bicluster and any functional group or clinical variable does not necessarily mean that the bicluster is defined incorrectly.

Chapter 4 Methods

The lack of biclustering methods specifically aimed at the detection of differentially expressed biclusters motivated the development of a novel biclustering method called DESMOND [298]. To reduce the search space and obtain more robust biclusters, we suggested adding a gene network to the problem definition and searching for network-constrained differentially expressed biclusters.

This chapter starts with the formal problem definition (section 4.1, published in [298]), presents the first version of DESMOND (section 4.2, published in [298]), and introduces the second version of the method (section 4.3). Theoretical analyses of the runtime complexity of both versions of DESMOND are provided in section 4.4. Sections 4.5 and 4.6 (also adapted from [298]) explain data preprocessing and validation approaches, respectively. Details on the implementation of the methods are provided in section 4.7.

4.1 Problem definition

The problem addressed in this thesis is the discovery of connected groups of genes differentially expressed in an unknown subgroup of samples, given a network of gene interactions and a matrix of gene expression profiles (Fig. 4.1). This problem can be classified as network-constrained biclustering, or, alternatively, as unsupervised active subnetwork detection, when the desired sample subgroups are unknown.
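The network constraint requires that the genes of a module form a connected group in the interaction network. A minimal sketch of such a check, assuming the network is given as a list of undirected gene pairs (the representation and the function name are illustrative), is a breadth-first search restricted to the candidate genes:

```python
from collections import deque


def is_connected(genes, edges):
    """Return True if `genes` induce a connected subgraph of the network.

    genes: iterable of gene identifiers (candidate module)
    edges: iterable of (gene_u, gene_v) undirected interactions
    """
    genes = set(genes)
    if not genes:
        return False
    # Keep only interactions between genes of the candidate module.
    adjacency = {g: set() for g in genes}
    for u, v in edges:
        if u in genes and v in genes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(genes))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == genes


network = [("A", "B"), ("B", "C"), ("D", "E")]
ok = is_connected({"A", "B", "C"}, network)
broken = is_connected({"A", "C"}, network)
```

Note that `{"A", "C"}` is rejected here because the only path between the two genes passes through `B`, which is not part of the candidate set.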

Formally speaking, given the expression of genes in $G$ measured in the samples of a set $S$, and an undirected and unweighted graph $N = (G, I)$ representing interactions $I$ between the genes $G$, the aim is to find subsets of genes $G' \subset G$ and samples $S' \subset S$ such that the genes $G'$ are differentially expressed in the subset of samples $S'$ compared to the background samples $\bar{S'} = S \setminus S'$, and $G'$ forms a connected component in the network $N$. Such pairs $(G', S')$ are called modules, which is a synonym of bicluster in the context of this thesis.

A gene $g$ is differentially expressed in a set of samples $S' \subset S$ compared to $\bar{S'} = S \setminus S'$ if $\mu_{g,S'}$, its median expression in $S'$, is different from its median expression $\mu_{g,\bar{S'}}$ in $\bar{S'}$. Since the aim of this thesis is the discovery of gene subsets that differentiate putative disease subtypes, it is important to find biomarkers whose expression in $S'$ is well separated from the background. To control how well the expression of the gene $g$ distinguishes the group of samples $S'$ from the background, one can employ the signal-to-noise ratio (SNR) [100, 187].

The SNR for the expression of gene $g$ in the samples $S'$ is defined as

$$\mathrm{SNR}(g, S') = \frac{\mu_{g,S'} - \mu_{g,\bar{S'}}}{\sigma_{g,S'} + \sigma_{g,\bar{S'}}}, \tag{4.1}$$

where µ and σ denote mean and standard deviation of gene expression in a subgroup of samples.
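Equation (4.1) can be sketched in a few lines. The example assumes the sample (n − 1) standard deviation, which the text does not specify, and represents a gene as a list of expression values indexed by sample; the function name `snr` is illustrative.

```python
from statistics import mean, stdev


def snr(expression, group):
    """Signal-to-noise ratio of one gene for a sample group, as in Eq. (4.1).

    expression: list of expression values of the gene, one per sample
    group: set of sample indices forming S'; all other samples are background
    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    inside = [x for i, x in enumerate(expression) if i in group]
    outside = [x for i, x in enumerate(expression) if i not in group]
    if len(inside) < 2 or len(outside) < 2:
        raise ValueError("each group needs at least two samples")
    denominator = stdev(inside) + stdev(outside)
    if denominator == 0:
        raise ValueError("SNR is undefined when both groups have zero variance")
    return (mean(inside) - mean(outside)) / denominator


# Toy gene: high expression in samples 0 and 1, low in samples 2 and 3.
gene = [5.0, 7.0, 1.0, 3.0]
value = snr(gene, {0, 1})
```

Swapping the group and the background changes only the sign of the result, which is why the absolute value is used when genes are aggregated.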

Similarly, a set of genes $G'$ is called differentially expressed in the samples of the set $S'$ if every gene $g \in G'$ is differentially expressed in $S'$. The average of the absolute SNR over all genes in $G'$ is used as a measure of differential expression of a bicluster $B(G', S')$:

$$\mathrm{avg.}|\mathrm{SNR}(B(G', S'))| = \frac{1}{|G'|} \sum_{g \in G'} |\mathrm{SNR}(g, S')| \tag{4.2}$$

A higher average absolute SNR value indicates that a subset of samples $S'$ is well separated from the background in the subspace of $G'$. Such gene sets are promising biomarker candidates for distinguishing unknown but biologically relevant subtypes of samples.
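The bicluster score of Eq. (4.2) averages the absolute per-gene SNR of Eq. (4.1) over the genes of the module. The sketch below assumes the expression matrix is a dictionary mapping gene names to lists of values and uses the sample standard deviation; both choices, and all names, are illustrative.

```python
from statistics import mean, stdev


def snr(expression, group):
    """Per-gene SNR as in Eq. (4.1); `group` is a set of sample indices."""
    inside = [x for i, x in enumerate(expression) if i in group]
    outside = [x for i, x in enumerate(expression) if i not in group]
    return (mean(inside) - mean(outside)) / (stdev(inside) + stdev(outside))


def avg_abs_snr(matrix, genes, samples):
    """Average absolute SNR of the bicluster (genes, samples), as in Eq. (4.2).

    matrix: dict mapping gene name -> list of expression values per sample
    """
    return mean(abs(snr(matrix[g], samples)) for g in genes)


# Toy data: g1 is up-regulated and g2 is down-regulated in samples 0 and 1.
matrix = {
    "g1": [5.0, 7.0, 1.0, 3.0],
    "g2": [1.0, 3.0, 5.0, 7.0],
    "g3": [2.0, 4.0, 2.0, 4.0],
}
score = avg_abs_snr(matrix, {"g1", "g2"}, {0, 1})
```

Because absolute values are averaged, up- and down-regulated genes contribute equally to the score of a module.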

In the standard setting of differential expression analysis, all genes are tested in two given groups, e.g. disease vs. control. In contrast, in the biclustering problem, the groups of samples are undefined and are to be discovered. If genes are up-regulated in more than half of all samples, the remaining samples also form a down-regulated module, and vice versa. Therefore, it makes sense to search for groups of samples of size not bigger than $|S|/2$.

Furthermore, the desired module should not be too small in terms of samples, because a smaller module has a higher probability of appearing just by chance. To avoid finding modules that are too small, the user can select an appropriate $s_{min}$ value based on the size of the dataset and the intended downstream analysis.
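The two size constraints discussed above, at most $|S|/2$ samples and at least $s_{min}$ samples, can be sketched as a small normalization step. The function name and the return convention are assumptions made for this illustration and do not describe how DESMOND implements the constraints.

```python
def normalize_sample_group(samples, all_samples, s_min):
    """Apply the size constraints s_min <= |S'| <= |S| / 2 to a sample group.

    If the group covers more than half of all samples, it is replaced by its
    complement, and `flipped` is True (the direction of regulation reverses).
    Returns (group, flipped), or None if the resulting group is too small.
    """
    samples = set(samples)
    all_samples = set(all_samples)
    flipped = False
    if len(samples) > len(all_samples) / 2:
        samples = all_samples - samples
        flipped = True
    if len(samples) < s_min:
        return None
    return samples, flipped


cohort = range(10)
large = normalize_sample_group({0, 1, 2, 3, 4, 5, 6}, cohort, s_min=2)
small = normalize_sample_group({0}, cohort, s_min=2)
```

Here a group of seven out of ten samples is replaced by the three remaining samples, while a single-sample group is rejected for `s_min=2`.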

In the document Identification of Differentially Expressed Gene Modules in Heterogeneous Diseases (pages 73-76)