6.3 Network-based disease gene identification

6.3.2 Disease Network Centrality Analysis

Once a disease-specific network has been generated, we apply network centrality analysis to identify the most relevant candidates for the disease. Different centrality measures have been proposed for analyzing various types of biological networks (Junker et al., 2006; Koschützki and Schreiber, 2008). We investigated the following centrality measures (see Section 2.3.2.3 for definitions):

• Degree centrality

• Closeness centrality

• Betweenness centrality

• PageRank centrality.

We chose betweenness centrality for all further experiments because it (a) performs best on our data (see Section 7.2 for a comparison of the four measures) and (b) has been shown by others to have favorable properties for generating new hypotheses on disease-gene associations (Özgür et al., 2008). Accordingly, we rank all proteins with respect to their betweenness centrality within the network using the igraph library in R (Csardi and Nepusz, 2006).
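The ranking step can be illustrated with a small, self-contained sketch. It is written in Python rather than R and does not use igraph; it implements Brandes' algorithm for unweighted, undirected graphs, which yields the standard (unnormalized) betweenness scores. The toy network and the names `betweenness`, `rank_by_betweenness`, and `toy_adj` are assumptions made for illustration only.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm.

    adj maps each node to the set of its neighbours
    (unweighted, undirected graph).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is visited from both endpoints.
    return {v: score / 2.0 for v, score in bc.items()}


def rank_by_betweenness(adj):
    """Return nodes ordered by decreasing betweenness (ties by name)."""
    scores = betweenness(adj)
    return sorted(scores, key=lambda v: (-scores[v], v))


# Hypothetical toy network: two seeds attached to H, which links to X and Y.
toy_adj = {
    "S1": {"H"},
    "S2": {"H"},
    "H": {"S1", "S2", "X"},
    "X": {"H", "Y"},
    "Y": {"X"},
}
toy_scores = betweenness(toy_adj)
toy_ranking = rank_by_betweenness(toy_adj)
```

In this toy network, H lies on the shortest paths of five node pairs and X on three, so H is ranked first and X second.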

6.3.2.1 Normalization for Hub Proteins

Betweenness centrality is applied to identify proteins that are central within disease-specific networks (local hubs). However, the ranking of disease-relevant elements becomes more difficult in large disease networks, for instance, when integrating d2 neighbors or considering diseases with a large number of seed genes.

An important property of whole-cell protein interaction networks is their scale-free topology (Albert, 2005), as discussed in Section 2.3.2.2. Thus, the more proteins are integrated into a disease network, the higher the likelihood of including global hubs, i.e., proteins with many interaction partners independent of any disease context. These hubs affect the ranking since they are often central due to their generally high (but unspecific) connectivity rather than due to a particular relevance for a disease. However, hubs cannot simply be removed from a disease network because this would destroy the network's topology and might also affect disease-relevant local hubs (through missing links).

To account for these effects, we adjust the ranking for all proteins by considering their individual distribution across many disease networks. This is based on the assumption that proteins that are involved in various disease networks are less disease-specific than those that occur only in particular networks. Highly ranked proteins that are integrated in many disease networks are likely to represent global hubs that are not disease relevant.

We generate all disease networks for OMIM diseases and count for each protein P in how many disease networks it is involved. We define a normalized betweenness centrality score BCN for P in a disease network D by normalizing the betweenness centrality score BC by the frequency of P across all disease networks:

$$\mathrm{BCN}(P \mid D) = \frac{\mathrm{BC}(P \mid D)}{\left|\{\, k \mid P \in D_k \,\}\right|} \tag{6.1}$$

Proteins are then ordered according to their BCN score for further analysis. Thus, proteins occurring in many disease networks (especially global hubs) are adjusted downwards.
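A minimal sketch of the normalization in Equation 6.1 follows, assuming the betweenness scores of all disease networks are already available as nested dictionaries. The function name `normalize_betweenness`, the protein labels, and all scores are hypothetical.

```python
def normalize_betweenness(bc_by_disease):
    """Divide BC(P|D) by the number of disease networks containing P.

    bc_by_disease: {disease: {protein: BC(P|D)}}
    Returns a dictionary of the same shape holding BCN(P|D).
    """
    # |{k | P in D_k}|: in how many disease networks does each protein occur?
    freq = {}
    for scores in bc_by_disease.values():
        for protein in scores:
            freq[protein] = freq.get(protein, 0) + 1
    return {
        disease: {p: bc / freq[p] for p, bc in scores.items()}
        for disease, scores in bc_by_disease.items()
    }


# Hypothetical scores: HUB occurs in all three networks, SEED_A in only one.
toy_bc = {
    "D1": {"HUB": 12.0, "SEED_A": 6.0, "X": 1.0},
    "D2": {"HUB": 9.0, "SEED_B": 4.0},
    "D3": {"HUB": 8.0, "SEED_C": 5.0, "X": 2.0},
}
toy_bcn = normalize_betweenness(toy_bc)
ranking_d1 = sorted(toy_bcn["D1"], key=lambda p: -toy_bcn["D1"][p])
```

In network D1 of this toy data, the hub drops from a raw score of 12.0 to 4.0 and falls behind SEED_A, whose score of 6.0 is unchanged because it occurs in only one network.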

The effect of the proposed hub normalization is illustrated by an example in Figure 6.5. The figure shows the prioritized d1 disease networks, with and without hub correction, for Familial Atypical Mycobacteriosis (OMIM Id 209950), a tuberculosis-like disease caused by mycobacteria other than Mycobacterium tuberculosis. The mycobacteriosis network has been generated from five seed proteins and comprises 119 proteins in total, of which six are global hubs with more than 23 interactions (see Section 7.2). Proteins in the network are ranked according to their betweenness centrality, and the rank of each protein is reflected in the node size, i.e., the larger the node, the higher its centrality and its rank.

Figure 6.5(a) indicates that most disease proteins are fairly central. Two seeds are ranked among the top five proteins while the remaining seeds are found among the top 53 proteins. However, hub proteins are also very central due to their high number of interactions, which compromises the ranking of disease proteins. For instance, three hubs (of which one is a seed) are among the top five proteins, yet not all of them are disease relevant. Normalizing the centrality scores according to the protein frequencies estimated across all disease networks corrects most of the hubs downwards (see Figure 6.5(b)). As a consequence, only one hub protein, the (hub-)seed, is found among the top five proteins.

In turn, the ranking of true disease proteins improves considerably, e.g., all seed proteins can now be found within the top 23 proteins. Figure 6.5(b) also demonstrates that our normalization mostly affects non-specific hubs, as the rank of the hub-seed protein is not altered by the correction. Note that, for clarity, we considered a fairly simple d1 example disease network. The impact of hub proteins and of the hub normalization on the ranking is much more pronounced in d2 disease networks, as we will show in Section 7.2.

6.3.3 Evaluation methods

We shall evaluate our method in three ways. First, we verify whether (known) disease proteins are highly ranked in their disease-specific networks. Second, we assess the ability of our method to discover novel disease proteins by performing a leave-one-out-validation over all known disease proteins. For both cases, we study the top k% ranked proteins within a disease network for different values of k (from 1% to 100%). We compare the performance of our methods against two related methods, namely random walk with restart (RWR) applied by Köhler et al. (2008) and PRINCE (Vanunu et al., 2010). Centrality analysis and cross-validation are defined below while the detailed description of the two related strategies is provided in Section 6.4.2.

Figure 6.5: Effect of hub normalization on the protein ranking in a d1 disease network generated for Familial Atypical Mycobacteriosis. (a) Ranking without hub normalization. (b) Ranking with hub normalization. Known disease proteins are shown in red. Hub proteins are represented as hexagons (green for non-seeds and red for seeds). The size of each node correlates with its betweenness centrality score and thus with its rank.

6.3.3.1 Centrality of disease proteins

We determine the number of highly ranked disease proteins in a disease network by counting the seed proteins among the top k% ranked proteins of the network.

Clearly, we expect the majority of seed proteins to be highly ranked in the prioritized list, since we build the disease networks around them, which naturally puts them in a central position. However, not all seed proteins are central in their disease networks, and many non-seed proteins are highly ranked. We are especially interested in the latter since these represent promising candidates for novel disease-gene relationships (Özgür et al., 2008).

Note that this type of evaluation is often used for analyzing the performance of disease gene identification methods (see Section 6.4). Yet, this evaluation only reflects a method's ability to score and to rank candidates with respect to a particular disease rather than its predictive power, as all disease genes remain in the data set. To assess whether a method is capable of de novo identification, disease gene associations have to be removed from the data (see below).
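The seed count among the top k% of a ranked list can be sketched as follows. How the cutoff is rounded is not specified in the text; rounding up is an assumption of this sketch, as are the function name `seeds_in_top_k_percent` and the toy protein labels.

```python
import math


def seeds_in_top_k_percent(ranked, seeds, k):
    """Count seed proteins among the top k% of a ranked list (best first).

    The cutoff is rounded up (an assumption), so any k > 0 keeps at
    least one protein of a non-empty list.
    """
    cutoff = math.ceil(len(ranked) * k / 100.0)
    return sum(1 for protein in ranked[:cutoff] if protein in seeds)


# Hypothetical ranked list of ten proteins, three of which are seeds.
toy_ranked = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"]
toy_seeds = {"P1", "P4", "P9"}
counts = {k: seeds_in_top_k_percent(toy_ranked, toy_seeds, k) for k in (10, 50, 100)}
```

For this toy list, one seed falls within the top 10%, two within the top 50%, and all three within the top 100%.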

6.3.3.2 Cross-validation

For leave-one-out cross-validation, we consider all OMIM diseases with two or more known disease proteins, since one protein is left out in each run and our method requires at least one remaining disease protein as seed.

For each disease, we remove one associated protein from its set while using the remaining ones as seeds. We apply our method as described and count how often the blinded disease-associated protein is re-discovered. We consider only those proteins as re-discovered that rank among the top k% proteins of the prioritized list. We repeat this procedure for each known disease protein per disease and determine an average relative recovery rate for different values of k (using macro average).
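The leave-one-out procedure can be sketched as below. The ranking function is passed in as a parameter and stands in for network construction plus centrality ranking. The toy ranker used here (ordering non-seed proteins by their number of seed neighbours) is purely hypothetical and is not the method of this chapter. The macro average is computed as the mean of the per-disease recovery rates, and the top-k% cutoff is rounded up; both readings are assumptions of this sketch.

```python
import math


def leave_one_out_recovery(disease_proteins, rank_candidates, k):
    """Macro-averaged recovery rate of blinded proteins in the top k%.

    disease_proteins: {disease: set of known disease proteins}
    rank_candidates:  callable(seeds) -> ranked protein list, best first
    """
    per_disease = []
    for proteins in disease_proteins.values():
        if len(proteins) < 2:
            continue  # at least one seed must remain after blinding
        hits = 0
        for held_out in sorted(proteins):
            ranked = rank_candidates(proteins - {held_out})
            cutoff = math.ceil(len(ranked) * k / 100.0)
            if held_out in ranked[:cutoff]:
                hits += 1
        per_disease.append(hits / len(proteins))
    return sum(per_disease) / len(per_disease) if per_disease else 0.0


# Hypothetical interaction data and ranker, for illustration only.
TOY_EDGES = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F"), ("F", "G")]


def toy_ranker(seeds):
    """Order non-seed proteins by how many seeds they interact with."""
    counts = {}
    for u, v in TOY_EDGES:
        for a, b in ((u, v), (v, u)):
            if a not in seeds:
                counts[a] = counts.get(a, 0) + (1 if b in seeds else 0)
    return sorted(counts, key=lambda p: (-counts[p], p))


toy_diseases = {"X": {"A", "B", "C"}, "Y": {"E", "G"}, "Z": {"D"}}
recovery_at_20 = leave_one_out_recovery(toy_diseases, toy_ranker, 20)
```

In this toy setting, all three proteins of disease X are recovered and neither protein of disease Y is, so the macro average is 0.5, whereas pooling all five runs (a micro average) would give 0.6. Disease Z is skipped because it has only one known protein.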

The inclusion of additional evidence leads to larger networks and thus to a higher number of potential candidates. Hence, the ratio between promising and false positive candidates decreases and the number of proteins in a top-k% list increases. To assess whether we truly gain additional information from our extended networks, we also study the absolute recovery rate by performing cross-validation as explained above using only the top 100 proteins within each network.

Further evaluations

We shall use the two evaluation settings described above to further assess the performance of our approach according to the following aspects:

• First, we measure the effect of integrating predicted protein functions into the framework. We compare the ranking of disease-related proteins in functionally enriched disease networks with their ranking in non-enriched networks. We will demonstrate that predicted functions enhance the ranking of disease-related proteins (see Section 7.2).

• Second, we assess the impact of utilizing indirect interactions on finding disease-related gene products. To this end, we compare the recovery rates achieved when considering only direct interaction partners (d1) with those achieved when also considering indirect interaction partners (d2). We will show that the inclusion of indirectly linked proteins significantly improves the cross-validation recovery rate (see Section 7.3).

• Third, we verify our hypothesis that the inclusion of indirect interaction partners also yields a higher number of (disease-unspecific) hub proteins which in turn compromises the ranking of proteins relevant to the disease of interest. In addition, we assess the impact of the hub normalization on the ranking and show that the proposed strategy is most effective for filtering proteins unrelated to a particular disease (see also Section 7.2).

• Fourth, we quantify the influence of the number of known disease-related genes on the performance of our method. Therefore, we analyze the cross-validation recovery rates with respect to the number of initial seed proteins (see Section 7.3.2).

• Fifth, we study whether the specific disease type influences the prediction quality of our method. To this end, we group OMIM diseases into 22 distinct disease classes according to a disease classification scheme proposed by Goh et al. (2007) and consider the recovery rates with respect to each disease class (see Section 7.3.3).

• Sixth, we assess how much our method could benefit from utilizing information on genomic regions, i.e., disease loci. Related methods are mostly evaluated on artificial linkage intervals with around 100 to 110 genes including the target gene, which is not comparable to evaluations considering an entire genome (see Section 6.4). Therefore, we perform a cross-validation in which we filter all proteins from the ranked candidate list that are not located on the same chromosome as the left-out protein. This mimics a scenario where the candidate search is restricted to a particular genomic region, i.e., a chromosome in this case (see Section 7.3.1).
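The chromosome filter described in the sixth aspect can be sketched as follows, assuming a simple protein-to-chromosome lookup table is available. The function name `filter_by_chromosome`, the protein labels, and the chromosome assignments are hypothetical.

```python
def filter_by_chromosome(ranked, chromosome_of, left_out):
    """Keep only candidates on the same chromosome as the left-out protein.

    ranked:        candidate list, best first
    chromosome_of: {protein: chromosome label}
    Proteins without a chromosome annotation are dropped.
    """
    target = chromosome_of[left_out]
    return [p for p in ranked if chromosome_of.get(p) == target]


# Hypothetical ranking and annotations.
toy_candidates = ["P1", "P2", "P3", "P4", "P5"]
toy_chromosomes = {"P1": "1", "P2": "7", "P3": "X", "P4": "7", "P5": "7"}
filtered = filter_by_chromosome(toy_candidates, toy_chromosomes, "P4")
rank_before = toy_candidates.index("P4") + 1
rank_after = filtered.index("P4") + 1
```

In this toy example, the left-out protein P4 moves from rank 4 in the genome-wide list to rank 2 once candidates on other chromosomes are removed, while the relative order of the remaining candidates is preserved.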

Finally, we show in two biologically relevant use-cases that our approach is highly applicable for diseases with complex and incompletely known genetic background. First, we apply our method to investigate classical Hodgkin Lymphoma (cHL) and colon cancer (see Sections 7.3.4 and 7.3.5). Second, we utilize our method to study surface membrane factors that might contribute to HIV-1 infection, a phenotype which cannot be limited to particular chromosomal regions (see Section 7.5).

6.4 Related Work

In the following, we discuss related work in the field of computational disease gene identification, focusing on (interaction) network-based approaches. We start with a classification of the different methodologies based on their underlying ideas. Representative methods will be briefly discussed regarding their main concepts and distinctive features with respect to our approach.

The currently existing prioritization strategies can be classified into three categories:

1. Local methods infer disease association for a gene product by investigating either its direct or indirect interaction partners or the shortest paths between the candidate and known disease genes (Oti et al., 2006; George et al., 2006).

2. Global methods model the flow of information within the cell to assess the connectivity and proximity between known disease genes and candidate genes (Franke et al., 2006; Köhler et al., 2008; Vanunu et al., 2010).

3. Disease module-based methods associate proteins with diseases based on the hypothesis that common phenotypes are associated with dysfunction in proteins participating in the same complex or pathway. These methods first construct disease-specific networks around a set of genes related to the condition of interest, which are assumed to represent modular disease machineries (Chen et al., 2006; Gonzalez et al., 2007; Özgür et al., 2008). Different scoring functions are then used to score and rank proteins in such networks according to their relevance to the disease.

According to this classification scheme, we follow a module-based strategy by generating disease-specific networks and employing a global similarity measure for identifying disease-related genes within such networks.
