4.2 Local graph comparison

4.2.1 Extension of existing R-convolution kernels

4.2.1.2 Shortest path kernel

The random walk kernel implicitly considers all possible walks by definition. This can be problematic, as it introduces a certain redundancy into the similarity measure.

Moreover, the random walk kernel suffers from two problems known as tottering and halting. Tottering occurs if nodes or edges are visited repeatedly, which attributes more weight to these nodes or edges and might lead to an overestimation of the similarity. This is especially severe for the graph models used here, as the protein binding sites are modeled by undirected graphs. As a result, a walk can even totter back and forth between the same two nodes.
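To make the tottering effect concrete, the following minimal sketch counts walks of a given length via powers of the adjacency matrix. The three-node path graph and the helper names `mat_mul` and `count_walks` are illustrative assumptions, not part of the graph models used in this thesis.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_walks(adj, length):
    """Entry [i][j] of the result is the number of walks of the given length from i to j."""
    n = len(adj)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    for _ in range(length):
        result = mat_mul(result, adj)
    return result


# Hypothetical undirected path graph 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]

# Number of walks from node 0 to node 2 for increasing walk lengths.
walks = {k: count_walks(adj, k)[0][2] for k in (2, 4, 6)}
print(walks)
```

Only the walk of length 2 is a path without repeated nodes. All longer walks between nodes 0 and 2 arise purely from moving back and forth along the same edges, yet a walk-based kernel counts each of them.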

Halting refers to the phenomenon that the similarity measure is dominated by shorter walks. Walk kernels suffer from this problem due to the decay factor λ, which down-weights longer walks. Thus, to ensure the convergence of the series, one inevitably inherits a bias towards shorter walks.

Another problem is the complexity of the random walk kernel. While a complexity of $O(M^6)$ is of course preferable to solving an NP-complete problem, it is still rather high. Thus, as an alternative, Borgwardt and Kriegel (2005) introduced the shortest path kernel, which considers only the shortest paths between any two nodes in order to reduce the number of considered graph components.

Again, for the purpose of this thesis, the following extension of the shortest path kernel is proposed in order to make it applicable to the given graph models: Let $(v_{\varphi_1}, \ldots, v_{\varphi_k})$ denote the shortest path between two nodes $v_i, v_j \in G$ with $v_i = v_{\varphi_1}$ and $v_j = v_{\varphi_k}$. Let the length of the shortest path be defined by $lp(v_i, v_j)$ with:

\[
lp(v_i, v_j) = \sum_{l=1}^{k-1} w(v_{\varphi_l}, v_{\varphi_{l+1}}) . \tag{4.15}
\]

Testing for equality on real-valued edge weights would obviously not be reasonable, due to measurement inaccuracies and uncertainties. Therefore, edge lengths are discretized into bins of size 1. Now, the shortest path can be represented as a triple $sp(v_i, v_j) = (l(v_i), l(v_j), lp(v_i, v_j))$. Obviously, setting a maximum edge weight is even necessary in this case, as otherwise the shortest path between two distinct nodes would always have a length of one.

In other words, the shortest path is defined by the label of the starting node, the end node and the sum of the discretized edge weights. On the one hand, discretization introduces some error tolerance in order to deal with the inherent noise associated with the edge weights. On the other hand, this also introduces another source of error.
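The triple representation can be sketched as follows. This is a minimal illustration under stated assumptions: discretization is taken to mean flooring each edge weight to its bin index, and the function name `sp_triple`, the `bin_size` parameter, and the node labels `"donor"` and `"acceptor"` are hypothetical.

```python
import math


def sp_triple(label_start, label_end, edge_weights, bin_size=1.0):
    """Return (l(v_i), l(v_j), lp) for one shortest path.

    Assumption: each real-valued edge weight is floored to its bin index
    before the weights are summed along the path.
    """
    discretized = [math.floor(w / bin_size) for w in edge_weights]
    return (label_start, label_end, sum(discretized))


# Two paths whose raw edge weights differ slightly but fall into the same bins.
a = sp_triple("donor", "acceptor", [3.7, 2.2, 4.9])
b = sp_triple("donor", "acceptor", [3.1, 2.8, 4.0])
print(a, b, a == b)
```

Both paths map to the same triple although their real-valued lengths differ (10.8 versus 9.9), which shows the error tolerance gained and the additional error introduced by the binning.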

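Based on this simple representation, one can use the Dirac kernel to compare two shortest paths $sp(v_i^1, v_j^1)$ and $sp(v_i^2, v_j^2)$, with $v_i^1, v_j^1 \in V_1$ and $v_i^2, v_j^2 \in V_2$:

\[
\kappa_{path}(sp(v_i^1, v_j^1), sp(v_i^2, v_j^2)) =
\begin{cases}
1 & \text{if } sp(v_i^1, v_j^1) = sp(v_i^2, v_j^2) \\
0 & \text{else}
\end{cases}
.
\]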

With this, the generalized shortest paths kernel is defined as follows:

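\[
\kappa_{SP}(G_1, G_2) = \frac{1}{C} \sum_{v_i^1, v_j^1 \in V_1} \sum_{v_i^2, v_j^2 \in V_2} \kappa_{path}(sp(v_i^1, v_j^1), sp(v_i^2, v_j^2)) , \tag{4.16}
\]

where $C = \frac{1}{4}(|V_1|^2 - |V_1|) \cdot (|V_2|^2 - |V_2|)$ is a normalizing factor that guarantees $0 \le \kappa_{SP}(G_1, G_2) \le 1$ and, more importantly, ensures that $\kappa_{SP}$ is size invariant.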
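The kernel in (4.16) can be sketched as a double loop over shortest path triples. The sketch assumes that each graph is given as a list with one triple per unordered node pair and that the node labels within a triple are already in a canonical order. The function names and the example triples are hypothetical.

```python
def dirac(sp_a, sp_b):
    """Dirac kernel on two shortest path triples."""
    return 1 if sp_a == sp_b else 0


def sp_kernel(triples1, n1, triples2, n2):
    """Normalized shortest path kernel following (4.16).

    triples1, triples2: one triple (label, label, length) per unordered node pair.
    n1, n2: number of nodes |V1| and |V2|.
    """
    c = 0.25 * (n1 * n1 - n1) * (n2 * n2 - n2)  # normalizing factor C
    matches = sum(dirac(s, t) for s in triples1 for t in triples2)
    return matches / c


# Hypothetical graphs with three nodes each, i.e., three node pairs per graph.
g1 = [("A", "B", 2), ("A", "C", 5), ("B", "C", 3)]
g2 = [("A", "B", 2), ("A", "C", 4), ("B", "C", 3)]
k12 = sp_kernel(g1, 3, g2, 3)
print(k12)
```

Here two of the nine pairwise comparisons match, so the kernel value is 2/9. Note that $C$ equals the product of the numbers of unordered node pairs in both graphs, which is why the value cannot exceed 1.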

To compute the shortest paths in a graph, several algorithms exist, one of the most prominent being the Floyd-Warshall algorithm (Floyd, 1962), which will be used in this case. The Floyd-Warshall algorithm has a cubic complexity. The shortest path kernel considers all shortest paths in two graphs in a pairwise fashion and compares each pair using the Dirac kernel $\kappa_{path}$, which has a complexity of $O(1)$. Therefore, the calculation of $\kappa_{SP}$ according to (4.16) amounts to $O(M^4)$, assuming $|V| = |V'| = M$, since $M^4$ comparisons have to be made.
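For reference, a minimal sketch of the Floyd-Warshall algorithm for an undirected weighted graph is given below. The edge list format, the function name, and the example graph with already discretized integer weights are assumptions made for illustration.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected graph.

    n: number of nodes, indexed 0..n-1.
    edges: iterable of (u, v, weight) tuples.
    Unreachable pairs keep the distance math.inf.
    """
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        # Undirected graph: store each edge in both directions.
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Three nested loops over the nodes yield the cubic complexity.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# Hypothetical graph with four nodes.
edges = [(0, 1, 2), (1, 2, 3), (0, 2, 7), (2, 3, 1)]
dist = floyd_warshall(4, edges)
print(dist[0][2], dist[0][3])
```

In the example, the direct edge between nodes 0 and 2 has weight 7, but the detour via node 1 has length 5 and is therefore reported as the shortest path length.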

The described kernel avoids the above-mentioned tottering problem. Moreover, its runtime complexity of $O(M^4)$ is lower than the $O(M^6)$ of the random walk kernel.

Realizing the graph comparison approach as a local method inevitably leads to loss of information, since one neglects the overall structure of the graph. This is of course also true for the kernel methods presented above. However, in the case of the shortest path kernel, the loss of information incurred by reducing the information of the shortest path to the labels of start and end nodes and the associated path length might be too drastic. To put it differently, can the performance of the shortest path kernel be improved by utilizing the information given by the intermediate nodes and edges as well? Thus, a natural alternative would be to represent the shortest path simply as the sequence of node and edge labels that constitutes the path, i.e.,

\[
sp_{full}(v_1, v_k) = (l(v_1), \lfloor w(v_1, v_2) \rfloor, \ldots, \lfloor w(v_{k-1}, v_k) \rfloor, l(v_k)) . \tag{4.17}
\]

Obviously, using the Dirac kernel to compare two such shortest path sequences would not be reasonable, as this would result in a relatively crude “all or nothing” evaluation.

Instead, since $sp_{full}$ is a sequence of node labels and edge weights, one could utilize sequence analysis methods to obtain a more fine-grained measure. One possibility would be to compare two path sequences using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which is usually employed for the calculation of pairwise sequence alignments. This algorithm utilizes a scoring function based on the Levenshtein distance (Levenshtein, 1966), which can be used as a score that indicates how well two path sequences agree. If a suitable scoring parameterization is used (1 for a match, 0 for a gap or mismatch), this scoring function fulfills the properties of a metric.

However, to employ the Needleman-Wunsch algorithm, an adaptation is necessary to prevent edge weights from being matched to node labels and vice versa. This can easily be realized by setting the score for a node-to-edge mapping to $-\infty$. As an additional modification, the score for each comparison of two path sequences is normalized by dividing it by the length of the longest path sequence in the graphs, which leads to an up-weighting of longer path sequences. The rationale behind this is that longer path sequences carry more information, simply because more nodes and edges are visited and thus more of the overall topology is covered. Thereby, a similarity measure in the interval [0,1] is obtained, which can be used in (4.16) instead of the Dirac kernel.
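The adapted alignment can be sketched as follows, using the scoring stated above (1 for a match, 0 for a gap or mismatch, $-\infty$ for a node-to-edge mapping). The tagged-tuple representation of sequence elements, the function names, and the `norm_len` parameter (the length of the longest path sequence, supplied by the caller) are assumptions made for illustration.

```python
import math

NEG_INF = -math.inf


def path_sequence(labels, weights):
    """Interleave node labels and floored edge weights as in (4.17)."""
    seq = [("node", labels[0])]
    for w, label in zip(weights, labels[1:]):
        seq.append(("edge", math.floor(w)))
        seq.append(("node", label))
    return seq


def score(a, b):
    """1 for a match, 0 for a mismatch, -inf for a node-to-edge mapping."""
    if a[0] != b[0]:
        return NEG_INF
    return 1.0 if a[1] == b[1] else 0.0


def nw_similarity(s, t, norm_len):
    """Global alignment score with gap score 0, divided by norm_len."""
    m, n = len(s), len(t)
    # With a gap score of 0, the first row and column stay at 0.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                           dp[i - 1][j],   # gap in t
                           dp[i][j - 1])   # gap in s
    return dp[m][n] / norm_len


s = path_sequence(["A", "B", "C"], [3.4, 2.9])
t = path_sequence(["A", "D", "C"], [3.1, 2.2])
u = path_sequence(["A", "C"], [4.0])
print(nw_similarity(s, t, 5), nw_similarity(s, u, 5))
```

The sequences `s` and `t` differ only in the intermediate node label and obtain a similarity of 0.8, whereas the Dirac kernel on the full sequences would yield 0. The shorter sequence `u` shares only its end node labels with `s` and obtains 0.4.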

The downside of this variant is again an increased runtime. The total complexity amounts to $O(M^3) + O(M^2 \cdot M^2 \cdot M^2) = O(M^6)$, since first the Floyd-Warshall algorithm is used to obtain all shortest paths, and the comparison via the adapted Needleman-Wunsch algorithm, which takes $O(n^2)$ for two sequences of length $n$, has to be performed for $M^2 \cdot M^2$ pairs of sequences (assuming that the number of nodes in both graphs is $M$).

The shortest path kernel essentially avoids the problem of tottering and, at least in its simpler form, offers a better runtime behavior than the random walk kernel.

However, by focusing explicitly on shortest paths, the problem of halting is still an issue, perhaps even more so, as longer paths are not only down-weighted but completely neglected. Thus, it is impossible to judge in advance which variant will be best suited for the comparison of protein binding sites.

Both kernels in their original form have already been applied successfully to the comparison of whole proteins, although in a different problem setting. More precisely, both kernels have been used on an SSE-based graph representation of protein folds (Borgwardt et al., 2005). Whether the above kernels with the introduced extensions will be equally useful for the comparison of protein binding sites will be investigated in Chapter 5.