Crisp fingerprints - Local graph comparison

4.2 Local graph comparison

4.2.2 Fingerprints

4.2.2.1 Crisp fingerprints


since this would increase the possibility that a certain pattern would only be present in a handful of graphs, if it occurs at all. On the other hand, the patterns need to be specific enough to be of discriminative value. From a technical point of view, it would be advisable to keep the patterns simple to limit fingerprint size and keep the calculations feasible. Since the construction of the fingerprints involves checking for the presence of each pattern, this means that some kind of isomorphism test needs to be performed for each entry. Hence, the smaller the patterns, the more efficient the runtime performance.

Given these considerations, subgraphs of size three were chosen as patterns to ensure a high runtime efficiency, which is one major motivation to use local methods in the first place. Obviously, not all possible subgraphs of size three can be considered, given that the graphs used in this thesis feature real-valued edge weights and thus an infinite number of possibilities exist. Also, sampling patterns from existing graphs would not be reasonable, as the result would strongly depend on the graphs chosen for the sampling. Instead, one can again resort to discretization by considering n distinct node labels and k distinct edge weights, which gives rise to a finite number of possible patterns given by:

N(n, k) = \binom{n}{3} \cdot k^3 + n(n-1) \cdot k \cdot \binom{k+1}{2} + n \cdot \binom{k+2}{3}    (4.18)

This is easily verified since only three cases can occur:

1. The pattern contains three different node labels: There are \binom{n}{3} possibilities to choose three distinct labels. As this also uniquely identifies the edges, there are k³ possibilities for the edge labels.

2. The pattern contains two equal node labels that differ from the third: There are n(n−1) possibilities to choose two distinct labels, one for the identically labeled nodes and one for the third. In this case, swapping the two edges emanating from the uniquely labeled node yields an isomorphic graph. To account for this, one can sort these two edges according to their weight, which maps isomorphic patterns to the same representation. With k choices for the edge between the identically labeled nodes and \binom{k+1}{2} choices for the sorted pair, this leads to k \cdot \binom{k+1}{2} possible edge combinations.

3. The pattern contains only identical node labels: There are n possibilities to choose the node label. Again, to find a unique representation that accounts for isomorphism, all edges can be sorted, which leads to \binom{k+2}{3} possible edge combinations.
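The count in Eq. (4.18) can be cross-checked numerically. The following is a minimal sketch, not part of the original work: `pattern_count` evaluates the closed form, and `brute_force_count` (a hypothetical helper) enumerates all labeled triangles and collapses isomorphic ones by taking the smallest encoding over all six node orderings.

```python
from itertools import permutations, product
from math import comb


def pattern_count(n: int, k: int) -> int:
    """Closed-form number of patterns N(n, k) from Eq. (4.18)."""
    all_distinct = comb(n, 3) * k**3             # case 1: three different labels
    two_equal = n * (n - 1) * k * comb(k + 1, 2)  # case 2: two equal labels
    all_equal = n * comb(k + 2, 3)                # case 3: one label only
    return all_distinct + two_equal + all_equal


def brute_force_count(n: int, k: int) -> int:
    """Count labeled triangles up to label-preserving isomorphism."""
    seen = set()
    for labels in product(range(n), repeat=3):
        for w01, w12, w02 in product(range(k), repeat=3):
            weight = {
                frozenset((0, 1)): w01,
                frozenset((1, 2)): w12,
                frozenset((0, 2)): w02,
            }
            # Smallest encoding over all node orderings is the same for
            # every member of an isomorphism class.
            seen.add(min(
                (
                    tuple(labels[v] for v in p),
                    weight[frozenset((p[0], p[1]))],
                    weight[frozenset((p[1], p[2]))],
                    weight[frozenset((p[0], p[2]))],
                )
                for p in permutations(range(3))
            ))
    return len(seen)
```

For example, n = 2 and k = 2 give 0 + 12 + 8 = 20 patterns, and the enumeration agrees with the closed form for small n and k.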

To test for the presence of a given pattern, an ε-threshold can again be employed in analogy to the GAVEO approach introduced in Section 4.1.1. A pattern t_i is contained in a graph G, if there is a subgraph in G_s which is ε-isomorphic to t_i.

Alternatively, one could use a simple binning strategy, by partitioning the set of real-valued edge weights into several intervals of the bin size b. In this case, instead of designating a certain set of discrete edge labels and a tolerance threshold ε, a bin size b is used to specify the fingerprints. Accordingly, a pattern t_i is contained in a graph G, if there is a subgraph in G_s whose edge weights fall into the bins specified by the edge labels of t_i.
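The two discretization strategies can be contrasted on a single edge weight. This is only an illustrative sketch: the function names are hypothetical, the bins are assumed to be half-open intervals starting at zero, and the exact definition of ε-isomorphism from Section 4.1.1 is not reproduced here.

```python
import math
from typing import Optional, Sequence


def label_within_eps(weight: float, labels: Sequence[float], eps: float) -> Optional[float]:
    """Return the discrete edge label closest to `weight` if it lies
    within the tolerance eps, otherwise None (no label matches)."""
    best = min(labels, key=lambda lab: abs(lab - weight))
    return best if abs(best - weight) <= eps else None


def bin_index(weight: float, b: float) -> int:
    """Index i of the assumed half-open bin [i*b, (i+1)*b) containing `weight`."""
    return math.floor(weight / b)
```

With the threshold variant, a weight may match no label at all, whereas binning always assigns exactly one bin to every weight.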

In both cases, a fingerprint is defined in the following way:

Given a graph G, let

f_G = (G \sqsupseteq t_1, G \sqsupseteq t_2, \ldots, G \sqsupseteq t_{N(n,k)}) \in \mathbb{N}^{N(n,k)},

where {t₁, . . . , t_N(n,k)} is the set of all non-isomorphic patterns of size three defined by fixed sets of node labels and edge weights, numbered in an arbitrary but fixed order.

The predicate G ⊒ t_i tests whether t_i is contained in G and returns the number of occurrences. Again, setting an upper limit δ for edge weights is necessary here and has the effect of limiting the size of the fingerprints, which positively affects runtime efficiency.

To improve runtime performance during the construction of such fingerprint vectors, one can make use of a hashing function based on canonical forms of the given patterns, instead of employing a brute-force approach. In this work, the canonical forms are based on the above distinctions between the types of possible patterns of size three (though in principle other conventions are also possible):

1. All node labels are identical. In this case, the canonical form is given by the node label followed by the edge lengths in increasing order.

2. Two nodes have an identical label. The canonical form starts with the node label that appears once in the graph followed by the label that appears twice, the edge weight between the nodes with the same label, and finally the remaining two edge weights in increasing order.

3. All nodes have different labels. The canonical form is then defined by the three occurring labels, sorted in a lexicographic order, the edge length between the first and the second, the second and the third, and finally the first and the third node.
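The three rules above can be sketched as a single function. The input convention is an assumption made for illustration: a triangle is passed as a tuple of three node labels plus the weights of the edges between nodes 0 and 1, 1 and 2, and 0 and 2; the function name is hypothetical.

```python
from typing import Hashable, Tuple


def canonical_form(labels: Tuple[Hashable, Hashable, Hashable],
                   w01: float, w12: float, w02: float) -> tuple:
    """Canonical form of a labeled triangle following the three cases."""
    weight = {
        frozenset((0, 1)): w01,
        frozenset((1, 2)): w12,
        frozenset((0, 2)): w02,
    }
    distinct = set(labels)
    if len(distinct) == 1:
        # Case 1: the label, then all edge weights in increasing order.
        return (labels[0], *sorted((w01, w12, w02)))
    if len(distinct) == 2:
        # Case 2: unique label, repeated label, the edge between the
        # equally labeled nodes, then the other two edges sorted.
        unique = next(i for i in range(3) if labels.count(labels[i]) == 1)
        twins = [i for i in range(3) if i != unique]
        same_edge = weight[frozenset(twins)]
        others = sorted(weight[frozenset((unique, t))] for t in twins)
        return (labels[unique], labels[twins[0]], same_edge, *others)
    # Case 3: labels sorted, then edges first-second, second-third, first-third.
    a, b, c = sorted(range(3), key=lambda i: labels[i])
    return (labels[a], labels[b], labels[c],
            weight[frozenset((a, b))],
            weight[frozenset((b, c))],
            weight[frozenset((a, c))])
```

Because the result is a plain tuple, it is hashable and can serve directly as a dictionary key, which is what makes the constant-time lookup of a pattern's position possible.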

All three cases are illustrated by an example in Fig. 4.6.

[Figure 4.6: The three possible cases that can occur: all labels identical, two labels identical and all labels unique. Canonical forms shown in the figure: A 3 4 5; B A 4 3 5; A B C 5 3 4.]

We denote the set of canonical forms by Γ. The above representation enables the definition of a bijective function i : Γ → {1, . . . , N(n, k)} ⊂ ℕ assigning a unique number to each form and, therefore, to each subgraph of size 3.

Using this mapping, the calculation of the fingerprint vector for a graph G = (V, E) can be done in a more efficient way by enumerating all subgraphs of size 3 in G.

For each subgraph g_i of size 3 in G, the transformation to its canonical form σ_i is performed (in time O(1)) and the function i(σ_i) is evaluated to determine the position of g_i in the fingerprint vector (in time O(1)). Finally, the entry at this position in the vector is incremented by one. Doing this for all \binom{M}{3} = O(M³) subgraphs of size 3 leads to a runtime complexity of O(M³).
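The enumeration described above can be sketched as follows. Several choices here are assumptions for illustration and not the convention used in this work: the graph is given as a dict of node labels and a dict of already discretized edge weights keyed by `frozenset` node pairs, triples lacking one of their three edges are skipped, the canonical form is the smallest encoding over all six node orderings, and the fingerprint is kept sparse in a `Counter` keyed by canonical form instead of a dense vector indexed by i(σ).

```python
from collections import Counter
from itertools import combinations, permutations
from typing import Dict, FrozenSet, Hashable


def canonical(labels: Dict[Hashable, str],
              weight: Dict[FrozenSet, int],
              nodes: tuple) -> tuple:
    """Stand-in canonical form: smallest encoding over all six orderings."""
    return min(
        (labels[a], labels[b], labels[c],
         weight[frozenset((a, b))],
         weight[frozenset((b, c))],
         weight[frozenset((a, c))])
        for a, b, c in permutations(nodes)
    )


def fingerprint(labels: Dict[Hashable, str],
                weight: Dict[FrozenSet, int]) -> Counter:
    """Count the occurrences of each pattern over all 3-node subsets."""
    counts: Counter = Counter()
    for nodes in combinations(labels, 3):
        pairs = [frozenset(p) for p in combinations(nodes, 2)]
        if all(p in weight for p in pairs):  # skip incomplete triples
            counts[canonical(labels, weight, nodes)] += 1
    return counts
```

Each triple is handled with a constant amount of work, so the loop over all 3-node subsets dominates the cost, in line with the cubic bound stated above.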

Given such a feature representation, the comparison of two graphs G₁ and G₂ is transferred to the comparison of their respective fingerprint vectors f_G₁ and f_G₂. For this purpose, different distance measures can be employed, one of the simplest being the Hamming distance.

Hamming fingerprints

If one is merely interested in the presence or absence of a pattern, a simple distance function can be devised based on the Hamming distance. For each pattern, the simultaneous absence or presence in both graphs is rewarded and aggregated to the following similarity measure:

k_{FPH}(G_1, G_2) = \frac{1}{N(n,k)} \sum_{i=1}^{N(n,k)} k_\delta([f_{G_1}]^i, [f_{G_2}]^i),    (4.19)

where [f_G₁]ⁱ denotes the i-th entry in the vector f_G₁, and

k_\delta(x, y) = \begin{cases} 1 & (x > 0 \land y > 0) \lor (x = 0 \land y = 0) \\ 0 & \text{otherwise} \end{cases}    (4.20)
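Eqs. (4.19) and (4.20) translate almost directly into code. The sketch below assumes dense fingerprint vectors of equal length with non-negative counts; the function name is hypothetical.

```python
from typing import Sequence


def hamming_similarity(f1: Sequence[int], f2: Sequence[int]) -> float:
    """Fraction of positions where both fingerprints agree on the
    presence (count > 0) or absence (count == 0) of a pattern."""
    if len(f1) != len(f2) or len(f1) == 0:
        raise ValueError("fingerprints must be non-empty and of equal length")
    # For non-negative counts, (x > 0) == (y > 0) is exactly k_delta(x, y).
    agree = sum((x > 0) == (y > 0) for x, y in zip(f1, f2))
    return agree / len(f1)
```

Note that the actual counts are ignored: a pattern occurring once and a pattern occurring ten times are treated alike.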

Jaccard fingerprints

A potential disadvantage of using the Hamming distance is the fact that it does not only reward the simultaneous presence of a pattern, but also its absence. This is somewhat counterintuitive, since the absence of a certain pattern can obviously not hint at a shared functionality of the corresponding binding pockets. Therefore, an alternative measure from the field of set theory can be employed that avoids this problem. By utilizing the well-known Jaccard coefficient

J(A, B) = \frac{|A \cap B|}{|A \cup B|},    (4.21)

an alternative similarity measure can be obtained:

k_{FPJ}(G_1, G_2) = \frac{\sum_{i=1}^{N(n,k)} \min([f_{G_1}]^i, [f_{G_2}]^i)}{\sum_{i=1}^{N(n,k)} \max([f_{G_1}]^i, [f_{G_2}]^i)}.    (4.22)

Of course, a plethora of other possible distance measures could also be used instead, for example cosine similarity, the Minkowski metric, etc. However, for the sake of brevity, the focus will be on the introduced methods as a proof of concept.
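The Jaccard-based measure of Eq. (4.22) can be sketched in the same style. The function name is hypothetical, and returning 1.0 for two all-zero fingerprints is an assumption, since the equation leaves that case undefined.

```python
from typing import Sequence


def jaccard_similarity(f1: Sequence[int], f2: Sequence[int]) -> float:
    """Sum of entry-wise minima divided by the sum of entry-wise maxima."""
    if len(f1) != len(f2):
        raise ValueError("fingerprints must be of equal length")
    numerator = sum(min(x, y) for x, y in zip(f1, f2))
    denominator = sum(max(x, y) for x, y in zip(f1, f2))
    if denominator == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return numerator / denominator
```

In contrast to the Hamming variant, positions where both counts are zero contribute nothing to either sum, so shared absence is not rewarded.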

4.2 Local graph comparison

[Figure 4.7: Example of a discontinuity problem (edge weights shown: 6.1, 5.9, 5.1). Given that edge weights are separated into the intervals [5,6[ and [6,7[, the left and the center graph would be considered dissimilar, while the center and right graph would correspond to the same pattern. This is clearly counterintuitive, since the left and center graph show a much lower difference in edge lengths.]
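The effect can be made explicit with a short calculation. It assumes a bin size of b = 1 and reads the three edge weights in the figure as 6.1 (left), 5.9 (center) and 5.1 (right), which is inferred from the caption:

```latex
\begin{align*}
\lfloor 6.1 \rfloor &= 6, & \lfloor 5.9 \rfloor &= 5, & \lfloor 5.1 \rfloor &= 5, \\
|6.1 - 5.9| &= 0.2, & |5.9 - 5.1| &= 0.8 .
\end{align*}
```

The pair that differs by only 0.2 is split across two bins, while the pair that differs by 0.8 shares one.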

From the document: Graph-Based Approaches to Protein Structure Comparison - From Local to Global Similarity (pages 114-119)