
[Figure 4.3 panel labels: Pathways in cancer; Ribosome subunits (translation); Proteasome subunits (protein degradation); Protein transport; MAPK/VEGF/ErbB signaling pathway]

Figure 4.3: Different biological subprocesses within the largest CCS from human, fly, worm and yeast. This CCS consists of 61 proteins and 108 interologs and encompasses different biochemical activities, such as protein degradation, translation, signaling and protein transport.

[Figure 4.4 node labels: proteins P1 to P5; sub-subgraphs CCS_P1 to CCS_P5]

Figure 4.4: Processing large CCS for function prediction. CCS with more than 25 proteins are split into smaller, overlapping sub-subgraphs by considering each protein of the CCS as the seed (green node) of a new, smaller CCS. All direct neighbors of this seed are added to the new CCS. Sub-subgraphs with fewer than three proteins are removed. For example, P1 is used as seed and its direct neighbors P3 and P4 are added to form the new sub-subgraph CCS_P1. Splitting the entire CCS results in five sub-subgraphs, but only CCS_P1 and CCS_P4 are considered further for function prediction, while the others are pruned because they contain fewer than three proteins.
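The splitting step described in the caption of Figure 4.4 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `split_ccs` and the toy edge list are assumptions, chosen so that the outcome matches the caption (only the sub-subgraphs seeded by P1 and P4 survive). The check that a CCS has more than 25 proteins is omitted for the toy example.

```python
from collections import defaultdict


def split_ccs(edges, min_size=3):
    """Split a CCS into seed-based sub-subgraphs.

    Every protein is used once as a seed; the sub-subgraph consists of the
    seed plus all of its direct neighbors. Sub-subgraphs with fewer than
    `min_size` proteins are pruned.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    subgraphs = {}
    for seed, direct_neighbors in neighbors.items():
        members = {seed} | direct_neighbors
        if len(members) >= min_size:
            subgraphs[seed] = members
    return subgraphs


# Hypothetical toy CCS with five proteins (assumed edge list).
edges = [("P1", "P3"), ("P1", "P4"), ("P2", "P4"), ("P4", "P5")]
result = split_ccs(edges)
for seed in sorted(result):
    print(seed, sorted(result[seed]))
```

Because every protein acts as a seed, the resulting sub-subgraphs overlap: in the toy example, P1 and P4 each appear in both surviving sub-subgraphs.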

4.3 Evaluation methods

To assess the performance of our CCS-based function prediction approach we use precision (P) and recall (R). Both are well-established measures for evaluating prediction algorithms. Precision indicates the fraction of correctly predicted annotations, true positives (TP), amongst all predictions, both true positives and false positives (FP):

$$P = \frac{TP}{TP + FP} \tag{4.10}$$

Recall, on the other hand, is the fraction of all known functions that are correctly predicted, where the known functions comprise true positives and false negatives (FN):

$$R = \frac{TP}{TP + FN} \tag{4.11}$$
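As a small worked example of Eqs. 4.10 and 4.11, the following sketch derives TP, FP and FN from a set of predicted and a set of known annotations. The function name, the placeholder term labels and the convention of returning 0.0 for an empty denominator are assumptions made for illustration only.

```python
def precision_recall(predicted, known):
    """Compute precision and recall from two sets of annotation terms."""
    tp = len(predicted & known)   # correctly predicted annotations
    fp = len(predicted - known)   # predictions that are not known annotations
    fn = len(known - predicted)   # known annotations that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical term labels, not real GO identifiers.
predicted = {"t1", "t2", "t3", "t4", "t5"}
known = {"t1", "t2", "t3", "t4", "t6", "t7", "t8", "t9"}
p, r = precision_recall(predicted, known)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here four of the five predictions are correct (P = 4/5), while four of the eight known annotations are recovered (R = 4/8).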

We assess the performance of CCS-based function prediction according to the following criteria (which are explained in detail below):

• First, leave-one-out cross-validation is used to estimate the expected precision and recall of each single and the combined function prediction methods.

• Second, we evaluate our approach by comparing it with two baselines, namely orthology and neighbor baseline.

• Third, we validate our approach against two classical prediction methods: Neighbor counting (Schwikowski et al., 2000) and χ² (Hishigaki et al., 2001). We also compare it with FS-Weighted Averaging (Chua et al., 2006), a method that considers indirect functional associations and topological weights.

Cross-validation and the two baselines are defined below while the detailed description of the three related prediction strategies is provided in the Related Work of this chapter (see Section 4.4.1).

4.3.1 Cross-validation

For cross-validation we blind the known annotations for each protein before applying our algorithm. Predicted terms are then compared to the held out annotations. We count a GO term as correctly predicted if the proposed term is an ancestor of the original term on the path to the root or the term itself. Otherwise, the prediction is considered to be incorrect (false positive).
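The correctness criterion above can be illustrated with a short sketch. It assumes the ontology is available as a mapping from each term to its direct parents; the term names are hypothetical placeholders rather than real GO identifiers, and the function names are illustrative.

```python
def ancestors(term, parents):
    """Return all terms on any path from `term` to the root (excluding `term`)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_correct(predicted, held_out_terms, parents):
    """A prediction is correct if it equals a held-out term or one of its ancestors."""
    return any(
        predicted == term or predicted in ancestors(term, parents)
        for term in held_out_terms
    )


# Hypothetical toy ontology: child -> list of direct parents.
parents = {
    "protein_binding": ["binding"],
    "binding": ["root"],
    "catalysis": ["root"],
}
held_out = {"protein_binding"}

print(is_correct("protein_binding", held_out, parents))  # the term itself
print(is_correct("binding", held_out, parents))          # an ancestor
print(is_correct("catalysis", held_out, parents))        # unrelated branch
```

Note that under this rule a prediction that is more specific than the held-out term (a descendant) is counted as a false positive, since only the term itself and its ancestors qualify.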

Precision and recall are determined for proteins within CCS that exceed a given similarity threshold. Note that for all methods involving CCS, we report recall on the basis of all annotations of proteins within qualifying CCS. We call this measure per-protein recall. It must be distinguished from the traditional per-species recall (Eq. 4.11), which is also frequently used but penalizes all methods that first filter proteins. When determining the per-protein recall (R_pp), we consider only proteins p that are part of a CCS:

$$R_{pp} = \frac{\sum_{p \in CCS} TP_p}{\sum_{p \in CCS} \left( TP_p + FN_p \right)} \tag{4.12}$$
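The difference between per-protein recall (Eq. 4.12) and per-species recall (Eq. 4.11) can be made concrete with a small sketch. The per-protein counts and protein names below are invented for illustration, and the function name `recall_over` is an assumption.

```python
def recall_over(counts, proteins):
    """Recall summed over `proteins`; `counts` maps protein -> (TP, FN)."""
    tp = sum(counts[p][0] for p in proteins)
    fn = sum(counts[p][1] for p in proteins)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical per-protein counts of true positives and false negatives.
counts = {"A": (3, 1), "B": (2, 2), "C": (0, 4)}
ccs_proteins = {"A", "B"}  # protein C is not part of any qualifying CCS

r_pp = recall_over(counts, ccs_proteins)     # per-protein recall, Eq. 4.12
r_species = recall_over(counts, set(counts)) # per-species recall, Eq. 4.11
print(f"R_pp={r_pp:.3f} R_species={r_species:.3f}")
```

Protein C receives no predictions because it was filtered out, so all of its annotations count as false negatives in the per-species measure (5/12) but are ignored in the per-protein measure (5/8).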

To also give an idea of the per-species performance, we always complement precision and recall values with coverage, which simply counts the total number of predictions.

4.3.2 Baselines

For evaluation we also define an orthology and a neighbor baseline. The orthology baseline considers only OrthoMCL orthology relationships, ignoring structural network conservation. We randomly select 500 orthologous protein groups, remove the annotations from one protein in each group and predict its functions using only its orthologs. The neighbor baseline takes only direct interaction partners into account, independent of evolutionary and structural network conservation. For each species we randomly choose one third of the proteins from the corresponding interaction network and use their direct neighbors to derive novel functions. We repeat this procedure 100 times for each baseline and compute the mean and standard deviation of precision and recall across all runs.
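The repeated-sampling scheme of the neighbor baseline might be sketched as follows. Several details are assumptions, since the text does not fix them: the transfer rule (here, the union of all annotations of the direct neighbors), exact term matching instead of the ancestor-aware criterion of Section 4.3.1, the toy network, and all function names. Neighbors that happen to be sampled in the same run keep their annotations, which is a simplification.

```python
import random
import statistics


def neighbor_baseline_run(network, annotations, rng, fraction=1 / 3):
    """One run: sample a fraction of proteins, predict from direct neighbors."""
    proteins = sorted(network)
    sample = rng.sample(proteins, max(1, round(len(proteins) * fraction)))
    tp = fp = fn = 0
    for protein in sample:
        known = annotations.get(protein, set())  # held-out annotations
        predicted = set()
        for neighbor in network[protein]:
            predicted |= annotations.get(neighbor, set())  # assumed transfer rule
        tp += len(predicted & known)
        fp += len(predicted - known)
        fn += len(known - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def repeat_baseline(network, annotations, runs=100, seed=0):
    """Mean and standard deviation of precision and recall across runs."""
    rng = random.Random(seed)
    results = [neighbor_baseline_run(network, annotations, rng) for _ in range(runs)]
    precisions, recalls = zip(*results)
    return (statistics.mean(precisions), statistics.stdev(precisions),
            statistics.mean(recalls), statistics.stdev(recalls))


# Hypothetical toy interaction network and annotations.
network = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"},
           "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
annotations = {"A": {"f1"}, "B": {"f1"}, "C": {"f1", "f2"},
               "D": {"f2"}, "E": {"f2", "f3"}, "F": {"f3"}}

mean_p, sd_p, mean_r, sd_r = repeat_baseline(network, annotations)
print(f"precision {mean_p:.2f} +/- {sd_p:.2f}, recall {mean_r:.2f} +/- {sd_r:.2f}")
```

Seeding the random number generator makes the 100 runs reproducible, which is convenient when the baseline is compared against other methods on the same data.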

4.3.3 Further evaluations

We shall use the cross-validation setting described above to assess further features of our approach according to the following aspects (see Section 5.3.5):

• First, we study CCS-based function prediction with respect to the three GO sub-ontologies: molecular function, biological process and cellular component, and determine subontology-specific precision and recall. Further, we examine the average depth of predicted terms in the GO hierarchy (see Section 5.3.5.4).

• Second, we assess whether specific GO branches are easier to predict than others and whether those correlate with evolutionarily conserved functions and processes. To this end, we determine for each GO term a term-specific precision and recall (see Section 5.3.5.4).

• Third, we analyze how CCS-based function prediction performs on proteins with no or only very little functional information by considering all novel predictions for these proteins, which are counted as false positives in the cross-validation (see Section 5.3.5.5).

• Fourth, we study whether prediction performance differs between more general genes, such as housekeeping genes, and specific genes. To this end, we extract tissue-specific and housekeeping genes from microarray studies. Human proteins are then classified according to this list (if possible). Protein-specific precision and recall are determined and compared between the two groups (see Section 5.3.5.6).

Finally, we discuss predicted functions for selected proteins that are highly relevant for colorectal cancer (see Section 5.5). Specifically, we study the gene products MLH1, PMS2 and EPHB4, which receive 14, 16, and 15 novel annotations using our method.