
4.3 Semi-global graph comparison

4.3.1 SEGA - SEmi-global Graph Alignment

4.3.1.3 Defining a distance measure

Algorithm 3 construct matrix: Constructs a cost matrix from ambiguous mapping candidates

Require: S set of candidate mappings of nodes, $U_1^k$, $U_2^k$

C = MaxValue
for all $v_i^{(1)} \in U_1^k$ do
    for all $v_j^{(2)} \in U_2^k$ do
        if $(v_i^{(1)}, v_j^{(2)}) \notin S$ then
            $M_{ij} = C$
        else
            $M_{ij} = \sum_{q=1,2,\ldots,|W_1|} |v_i^{(1)} - w_q^{(1)}| - |v_j^{(2)} - w_q^{(2)}|$, with $w_q^{(1)} \in W_1$, $w_q^{(2)} \in W_2$
return cost matrix M, constant C
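The following Python sketch illustrates Algorithm 3 under several assumptions that are not stated in the text: nodes carry 3D coordinates, |v − w| is the Euclidean distance, W1[q] and W2[q] are reference nodes that have already been matched to each other, and the summand is taken as the absolute value of the difference of the two distances (the extracted typesetting does not show whether outer absolute-value bars are present). The names `coords1`, `coords2` and `MAX_VALUE` are hypothetical.

```python
import math

MAX_VALUE = 1e9  # hypothetical stand-in for the constant C = MaxValue


def construct_matrix(U1, U2, S, W1, W2, coords1, coords2, C=MAX_VALUE):
    """Build the cost matrix M for the candidate nodes U1 x U2.

    S is the set of admissible (node1, node2) candidate pairs.
    W1[q] and W2[q] are assumed to be already matched reference nodes.
    coords1/coords2 map node identifiers to 3D coordinates (assumption).
    """
    M = []
    for vi in U1:
        row = []
        for vj in U2:
            if (vi, vj) not in S:
                # Pairs that are not mapping candidates get the maximal cost.
                row.append(C)
            else:
                # Sum over reference pairs of the (absolute) difference
                # between the two node-to-reference distances.
                cost = sum(
                    abs(math.dist(coords1[vi], coords1[w1])
                        - math.dist(coords2[vj], coords2[w2]))
                    for w1, w2 in zip(W1, W2)
                )
                row.append(cost)
        M.append(row)
    return M, C
```

A pair whose distances to all reference nodes agree in both graphs receives cost 0, so geometrically consistent candidates are preferred in the subsequent assignment step.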

solving a weighted optimal assignment problem. Thus, the SEGAHA (SEmi-global Graph Alignment - Hungarian Algorithm) variant is proposed as an alternative, in which the incremental assignment of nodes is replaced by the Hungarian algorithm (Kuhn, 2005). The question of which of the two approaches is more suitable for protein binding site comparison will be addressed in Chapter 5.
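To make the notion of a cost-minimal assignment concrete, the sketch below solves the weighted assignment problem on a small cost matrix by brute force. It is not the Hungarian algorithm: it enumerates all injective row-to-column mappings and is only practical for tiny matrices, whereas the Hungarian algorithm reaches the same optimum in polynomial time. The function name `optimal_assignment` is hypothetical.

```python
from itertools import permutations


def optimal_assignment(M):
    """Return (total_cost, columns) minimizing sum(M[i][columns[i]]).

    Brute-force stand-in for an optimal assignment solver: tries every
    injective mapping of rows to columns. Requires rows <= columns.
    """
    n_rows, n_cols = len(M), len(M[0])
    if n_rows > n_cols:
        raise ValueError("need at least as many columns as rows")
    best_cost, best_cols = float("inf"), None
    for cols in permutations(range(n_cols), n_rows):
        cost = sum(M[i][j] for i, j in enumerate(cols))
        if cost < best_cost:
            best_cost, best_cols = cost, cols
    return best_cost, list(best_cols)
```

Applied to a cost matrix such as the one returned by Algorithm 3, row i is assigned to column `columns[i]`; pairs carrying the maximal cost C are avoided whenever a cheaper complete assignment exists.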


in G2:

\[
\delta(G_1, G_2) = \frac{\sum_{(v_i^{(1)}, v_j^{(2)}) \in A} d_{ij} + c_p \cdot (|A| - |G_1|)}{|G_1|} \tag{4.30}
\]

The constant cp is a penalty for unmatched nodes; it can simply be set to the highest distance obtainable when no triangles can be matched. A degree of inclusion of G2 in G1 is defined analogously.
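As a small worked illustration of (4.30), the following Python sketch computes a degree of inclusion from the distances of the matched node pairs. It assumes that the penalty cp is charged once per unmatched node and leaves the number of unmatched nodes to the caller (hypothetical parameter `n_unmatched`), because how this count follows from |A| and |G1| depends on how the alignment A is defined.

```python
def degree_of_inclusion(matched_distances, n_unmatched, size_g1, c_p):
    """Normalized inclusion distance in the spirit of (4.30).

    matched_distances: distances d_ij of the node pairs in the alignment.
    n_unmatched: number of nodes that receive the penalty (assumption:
                 supplied by the caller).
    size_g1: number of nodes |G1| used for normalization.
    c_p: penalty per unmatched node.
    """
    if size_g1 <= 0:
        raise ValueError("G1 must contain at least one node")
    return (sum(matched_distances) + c_p * n_unmatched) / size_g1
```

For example, two matched pairs with distances 0.5 and 1.5, two penalized nodes, |G1| = 4 and cp = 3 give (2 + 6) / 4 = 2.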

Based on (4.30), two measures of distance between G1 and G2 can be defined, a “conjunctive” and a “disjunctive” one, in analogy to the scoring scheme for the GAVEO approach:

\[
\Delta_{\max}(G_1, G_2) = \max\{\delta(G_1, G_2), \delta(G_2, G_1)\} \tag{4.31}
\]

\[
\Delta_{\min}(G_1, G_2) = \min\{\delta(G_1, G_2), \delta(G_2, G_1)\} \tag{4.32}
\]

In this case, the measure (4.31) can be seen as a relaxed equality in terms of two-sided inclusion (A ⊂ B and B ⊂ A), while the disjunctive combination, favoring a one-sided inclusion, is given by (4.32). Again,

\[
\Delta_{\min}(G_1, G_2) \le \Delta_{\max}(G_1, G_2) .
\]

The question which of these two measures, the conjunctive or the disjunctive one, yields more suitable degrees of similarity cannot be answered in general and instead depends on the problem setting, in particular on the purpose for which the similarity is used (e.g., function prediction) and the way in which protein binding sites are extracted and modeled (e.g., whether or not the model may include parts of the protein not belonging to the binding site itself).

Again, to account for both possible extremes while allowing for a certain degree of flexibility, the ultimate distance measure of the SEGA algorithm is defined as a (linear) combination of (4.31) and (4.32):

\[
\Delta(G_1, G_2) = \alpha \cdot \Delta_{\max}(G_1, G_2) + (1 - \alpha) \cdot \Delta_{\min}(G_1, G_2) \tag{4.33}
\]

This distance is again inversely related to a similarity score.

Similar to GAVEO, (4.33) is a special case of an OWA (ordered weighted average) aggregation of the two degrees of inclusion, δ(G1, G2) and δ(G2, G1), and the parameter α ∈ [0,1] controls the trade-off between the two extreme aggregation modes: the closer α is to 1, the closer the aggregation is to the conjunctive measure ∆max (the minimum in terms of similarity), i.e., the more demanding it becomes.

The value α corresponds to the “degree of andness” of the aggregation (4.33), i.e., the degree to which this aggregation behaves like a conjunctive combination (Fodor and Roubens, 1994); likewise, 1−α corresponds to the “degree of orness”.

In principle, choosing a high value of α favors the detection of largely similar binding sites, thus yielding results closer to those of global methods. This can be of interest for proteins belonging to the same protein family or fold. A low value of α would be beneficial for the detection of more remote similarities, which could be more useful for proteins of different folds.
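The combination of the conjunctive and the disjunctive measure can be sketched in a few lines of Python. The function name `sega_distance` and its argument names are hypothetical; the inputs are the two degrees of inclusion δ(G1, G2) and δ(G2, G1), assumed to be computed beforehand.

```python
def sega_distance(d12, d21, alpha):
    """Linear combination of the conjunctive and disjunctive distance.

    d12, d21: the two degrees of inclusion delta(G1, G2) and delta(G2, G1).
    alpha: degree of andness in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d_max = max(d12, d21)  # conjunctive measure: both inclusions must be good
    d_min = min(d12, d21)  # disjunctive measure: one good inclusion suffices
    return alpha * d_max + (1.0 - alpha) * d_min
```

With d12 = 0.2 and d21 = 0.8, α = 1 yields 0.8 (only a two-sided inclusion counts as similar), α = 0 yields 0.2 (a one-sided inclusion suffices), and α = 0.5 yields the midpoint 0.5.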

5 Results and Discussion

In the following, the approaches introduced in the previous chapters will be experimentally validated and compared on different datasets as well as in different problem settings. The following algorithms will be used in the experiments:

• BFPH (Bin-FingerPrints Hamming): Crisp fingerprints using a binning of the edge lengths with bin size b. Fingerprints are compared using the Hamming distance (4.2.2.1).

• BFPJ (Bin-FingerPrints Jaccard): Crisp fingerprints using a binning of the edge lengths with bin size b. Fingerprints are compared using the Jaccard distance (4.2.2.1).

• GAVEO (Graph Alignments Via Evolutionary Optimization): Evolutionary algorithm that optimizes an objective function based on a graph edit distance (4.1.1).

• GAVEO*: GAVEO in combination with the scoring function originally used by the greedy heuristic (4.1.1).

• GAVEOc (GAVEO with preserved clique): A variant of GAVEO where the maximal clique is calculated prior to the optimization and preserved throughout the calculation (4.1.1.4).

• FPH (ε-FingerPrints Hamming): Crisp fingerprints using a fixed set of edge labels l ∈ {1, ..., 12} in conjunction with an ε threshold. Fingerprints are compared using the Hamming distance (4.2.2.1).

• FPJ (ε-FingerPrints Jaccard): Crisp fingerprints using a fixed set of edge labels l ∈ {1, ..., 12} in conjunction with an ε threshold. Fingerprints are compared using the Jaccard distance (4.2.2.1).

• FFP (Fuzzy FingerPrints): Fuzzy fingerprints using triangular membership functions controlled by the radius parameter η. Fingerprints are compared using a generalization of the Jaccard measure (4.2.2.2).

• SEGA (SEmi-global Graph Alignment): A semi-global approach using local similarities and global information to construct a global graph alignment from a distance matrix D (4.3.1).

• SEGAHA (SEGA with Hungarian Algorithm): A variant of SEGA, where the Hungarian algorithm (Kuhn, 2005) is used to construct a global alignment from the distance matrix D by calculating a cost-minimal assignment of nodes (4.3.1).
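The two comparison measures used for the crisp fingerprints in the list above can be sketched as follows. The sketch assumes that a fingerprint is a fixed-length sequence of 0/1 entries and uses the textbook definitions of the Hamming and Jaccard distance; the exact definitions of Section 4.2.2.1 (for instance, any normalization) may differ. Treating two empty fingerprints as identical is also an assumption.

```python
def hamming_distance(fp1, fp2):
    """Number of positions in which two equal-length fingerprints differ."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(fp1, fp2))


def jaccard_distance(fp1, fp2):
    """1 - |intersection| / |union| of the sets of bits that are set."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp1, fp2) if a and b)
    either = sum(1 for a, b in zip(fp1, fp2) if a or b)
    # Assumption: two all-zero fingerprints are considered identical.
    return 0.0 if either == 0 else 1.0 - both / either
```

Unlike the Hamming distance, the Jaccard distance ignores positions that are unset in both fingerprints, so it is not affected by the total fingerprint length.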

To better judge the performance of the introduced algorithms, some baseline algorithms will be employed, using sequence information as well as structural information retrieved from CavBase.

• BK (Bron-Kerbosch algorithm): A clique-enumeration algorithm (Bron and Kerbosch, 1973) commonly used in graph-based protein structure comparison (Kinoshita and Nakamura, 2005; Redfern et al., 2007; Schmitt et al., 2001; cf. Chapter 2). In CavBase, the Bron-Kerbosch algorithm is used to calculate only the first 100 cliques instead of a full enumeration, which already proved sufficient to create meaningful solutions (Schmitt et al., 2002). The Bron-Kerbosch algorithm is therefore used analogously here.

• CB (CavBase clique algorithm): The original algorithm used in CavBase, which represents a combination of the Bron-Kerbosch approach with a surface-based scoring scheme (Schmitt et al., 2001).

• GH (Greedy Heuristic): A greedy heuristic based on clique detection. This approach was developed in a previous work and represents the most recent approach for the comparison of CavBase data (Weskamp, 2007).

• RW (Random Walk kernel): The random walk kernel of Gärtner (2003) with the extensions introduced in Chapter 4 (4.2.1.1).

• SA (Sequence Alignment): A local sequence alignment using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the jaligner tool (Moustafa, 2005).

• SP (Shortest Path kernel): The shortest path kernel of Borgwardt and Kriegel (2005) with the extensions introduced in Chapter 4 (4.2.1.2).

• SPSA (SP with Sequence Alignment): The shortest path kernel expansion using sequence alignment on paths (4.2.1.2).
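For the SA baseline in the list above, the core of the Smith-Waterman algorithm can be sketched as a short dynamic program. This is a simplified illustration, not the jaligner implementation: it assumes a fixed match/mismatch score and a linear gap penalty (hypothetical default values), whereas tools for protein sequences typically use substitution matrices and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b.

    H[i][j] holds the best score of a local alignment ending at
    a[i-1] and b[j-1]; scores never drop below 0, which allows an
    alignment to start anywhere in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The returned score is a similarity value: higher scores indicate a longer or better conserved common subsequence region.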

Although the random walk and the shortest path kernels were extended in this thesis to be applicable to the protein binding site model, they were originally suggested elsewhere. Hence, it is more appropriate to regard them as baseline approaches for the local comparison approaches. In the experiments, the kernel measures are used directly as similarity measures.¹

Note that some of the approaches might fail to calculate comparisons for pairs of exceedingly large graphs. This especially pertains to the CavBase approach and the random walk kernel, and to a lesser extent also to BK and GH. In these cases, the corresponding score is set to −∞ for similarity scores and to ∞ for distance values.

This chapter is organized as follows: First, an overview of the datasets used in the experiments is given prior to the actual experimental part. The experimental part starts with several preliminary experiments aiming at deriving suitable parameter settings for the different approaches before more time-consuming studies are conducted.

¹ In preliminary experiments (not shown), the use of the kernel distance (4.10) showed no improvement of classification results over using the kernel measure directly.

Abbr.     Algorithm
BFPH      Bin-FingerPrints using the Hamming distance
BFPJ      Bin-FingerPrints using the Jaccard coefficient
BK        Bron-Kerbosch algorithm
CB        CavBase approach
GAVEO     Graph Alignment Via Evolutionary Optimization
GAVEO*    GAVEO + original similarity measure
GAVEOc    GAVEO + preserved clique
GH        Greedy Heuristic
FPH       ε-FingerPrints using the Hamming distance
FPJ       ε-FingerPrints using the Jaccard coefficient
FFP       Fuzzy FingerPrints
RW        Random Walk kernel
SA        Sequence Alignment (Smith-Waterman)
SEGA      SEmi-global Graph Alignment
SEGAHA    SEGA using the Hungarian Algorithm
SP        Shortest Path kernel
SPSA      Shortest Path kernel with Sequence Alignment

Table 5.1: Algorithms used during the experiments.

This is followed by an assessment of the algorithmic performance of the different approaches when confronted with different levels of structural and mutational distortion. Section 5.7 presents a number of classification experiments on different datasets used to assess the performance and suitability of the presented methods for classification tasks, as the main goal of the graph comparison algorithms presented in this thesis is to discriminate between different classes of protein binding sites. Section 5.6 presents results for another typical application of protein structure comparison tools, the retrieval of similar structures from a reference dataset.

In Section 5.8, the suitability of the presented algorithms for comparison tasks beyond experimentally derived structures and protein binding sites is addressed.