
1.4 Review on Graph Kernels

1.4.2 Primer on Graph Kernels

Kernels on structured data almost exclusively belong to one single class of kernels: R-convolution kernels as defined in a seminal paper by Haussler [Haussler, 1999].

R-Convolution Kernels R-convolution kernels provide a generic way to construct kernels for discrete compound objects. Let $x \in \mathcal{X}$ be such an object, and let $\vec{x} := (x_1, x_2, \ldots, x_D)$ denote a decomposition of $x$, with each $x_i \in \mathcal{X}_i$. We can define a boolean predicate

$$R : \vec{\mathcal{X}} \times \mathcal{X} \to \{True, False\}, \qquad (1.24)$$

where $\vec{\mathcal{X}} := \mathcal{X}_1 \times \ldots \times \mathcal{X}_D$ and $R(\vec{x}, x)$ is True whenever $\vec{x}$ is a valid decomposition of $x$.

This allows us to consider the set of all valid decompositions of an object:

$$R^{-1}(x) := \{\vec{x} \mid R(\vec{x}, x) = True\}. \qquad (1.25)$$

Like [Haussler, 1999] we assume that $R^{-1}(x)$ is countable. We define the R-convolution $\star$ of the kernels $\kappa_1, \kappa_2, \ldots, \kappa_D$ with $\kappa_i : \mathcal{X}_i \times \mathcal{X}_i \to \mathbb{R}$ to be

$$k(x, x') = \kappa_1 \star \kappa_2 \star \ldots \star \kappa_D(x, x') := \sum_{\substack{\vec{x} \in R^{-1}(x) \\ \vec{x}' \in R^{-1}(x')}} \mu(\vec{x}, \vec{x}') \prod_{i=1}^{D} \kappa_i(x_i, x'_i), \qquad (1.26)$$

where $\mu$ is a finite measure on $\vec{\mathcal{X}} \times \vec{\mathcal{X}}$ which ensures that the above sum converges.² [Haussler, 1999] showed that $k(x, x')$ is positive semi-definite and hence admissible as a kernel [Schölkopf and Smola, 2002], provided that all the individual $\kappa_i$ are. The deliberate vagueness of this setup with regard to the nature of the underlying decomposition leads to a rich framework: many different kernels can be obtained simply by changing the decomposition.

² [Haussler, 1999] implicitly assumed this sum to be well-defined, and hence did not use a measure $\mu$ in his definition.
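To make the construction concrete, the following minimal Python sketch evaluates an R-convolution kernel from a user-supplied decomposition function and part kernels. Taking $\mu$ to be the counting measure and using a prefix/suffix decomposition of strings in the toy example are illustrative assumptions, not part of Haussler's formulation.

```python
def r_convolution_kernel(x, x_prime, decompose, part_kernels):
    """Sketch of equation (1.26) with mu taken as the counting measure.

    decompose(obj)  -> iterable of valid decompositions, each a tuple (x_1, ..., x_D)
    part_kernels    -> list of the D base kernels kappa_i
    """
    total = 0.0
    for parts in decompose(x):                    # all valid decompositions of x
        for parts_prime in decompose(x_prime):    # all valid decompositions of x'
            prod = 1.0
            for kappa, xi, xi_prime in zip(part_kernels, parts, parts_prime):
                prod *= kappa(xi, xi_prime)       # product over the D part kernels
            total += prod
    return total

# Toy usage: decompose a string into (prefix, suffix) pairs and compare the
# parts with a Dirac (equality) kernel.
decompose = lambda s: [(s[:i], s[i:]) for i in range(1, len(s))]
dirac = lambda a, b: 1.0 if a == b else 0.0
print(r_convolution_kernel("abc", "abc", decompose, [dirac, dirac]))  # -> 2.0
```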

In this thesis, we are interested in kernels between two graphs. We will refer to those as graph kernels. Note that in the literature, the term graph kernel is sometimes used to describe kernels between two nodes in one single graph. Although we are exploring the connection between these two concepts in ongoing research [Vishwanathan et al., 2007b], in this thesis, we exclusively use the term graph kernel for kernel functions comparing two graphs to each other.

The natural and most general R-convolution on graphs would decompose two graphs $G$ and $G'$ into all of their subgraphs and compare these pairwise. This all-subgraphs kernel is defined as follows.

Definition 19 (All-Subgraphs Kernel) Let $G$ and $G'$ be two graphs. Then the all-subgraphs kernel on $G$ and $G'$ is defined as

$$k_{subgraph}(G, G') = \sum_{S \sqsubseteq G} \sum_{S' \sqsubseteq G'} k_{isomorphism}(S, S'), \qquad (1.27)$$

where

$$k_{isomorphism}(S, S') = \begin{cases} 1 & \text{if } S \simeq S', \\ 0 & \text{otherwise.} \end{cases} \qquad (1.28)$$

In an early paper on graph kernels, [Gärtner et al., 2003] show that computing this all-subgraphs kernel is NP-hard. Their proof rests on the fact that computing the all-subgraphs kernel is as hard as deciding subgraph isomorphism. This can be seen as follows: given a subgraph $S$ of $G$, if there is a subgraph $S'$ of $G'$ such that $k_{isomorphism}(S, S') = 1$, then $S$ is a subgraph of $G'$. Hence computing the all-subgraphs kernel requires solving subgraph isomorphism problems, which are known to be NP-hard.
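The following brute-force Python sketch illustrates why this kernel is intractable in practice: it enumerates connected induced subgraphs (a simplifying assumption of this sketch; Definition 19 ranges over all subgraphs) and counts isomorphic pairs with networkx, which is feasible only for tiny toy graphs.

```python
from itertools import combinations
import networkx as nx

def all_subgraphs_kernel(G, Gp):
    """Brute-force sketch of equation (1.27), restricted to connected induced
    subgraphs for brevity. The number of node subsets alone is exponential,
    mirroring the NP-hardness discussed above."""
    def connected_induced_subgraphs(H):
        for r in range(1, H.number_of_nodes() + 1):
            for nodes in combinations(H.nodes(), r):
                S = H.subgraph(nodes)
                if nx.is_connected(S):
                    yield S
    node_match = nx.algorithms.isomorphism.categorical_node_match("label", None)
    k = 0
    for S in connected_induced_subgraphs(G):
        for Sp in connected_induced_subgraphs(Gp):
            if nx.is_isomorphic(S, Sp, node_match=node_match):  # k_isomorphism = 1
                k += 1
    return k

# Toy usage: a labeled triangle versus a labeled path (hypothetical labels)
G, Gp = nx.Graph([(0, 1), (1, 2), (2, 0)]), nx.Graph([(0, 1), (1, 2)])
nx.set_node_attributes(G, "C", "label")
nx.set_node_attributes(Gp, "C", "label")
print(all_subgraphs_kernel(G, Gp))
```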

Random Walk Kernels As an alternative to the all-subgraphs kernel, two types of graph kernels based on walks have been defined in the literature: the product graph kernels of [Gärtner et al., 2003], and the marginalized kernels on graphs of [Kashima et al., 2003].

We will review the definitions of these random walk kernels in the following. For the sake of clearer presentation, we assume without loss of generality that all graphs have identical size $n$; the results also hold when this condition is not met.

Product Graph Kernel [Gärtner et al., 2003] propose a random walk kernel counting common walks in two graphs. For this purpose, they employ a type of graph product, the direct product graph, also referred to as the tensor or categorical product [Imrich and Klavzar, 2000].

Definition 20 The direct product of two graphs $G = (V, E, \mathcal{L})$ and $G' = (V', E', \mathcal{L}')$ shall be denoted as $G_\times = G \times G'$. The node and edge set of the direct product graph are respectively defined as:

$$V_\times = \{(v_i, v'_{i'}) : v_i \in V \wedge v'_{i'} \in V' \wedge \mathcal{L}(v_i) = \mathcal{L}'(v'_{i'})\}$$

$$E_\times = \{((v_i, v'_{i'}), (v_j, v'_{j'})) \in V_\times \times V_\times : (v_i, v_j) \in E \wedge (v'_{i'}, v'_{j'}) \in E' \wedge \mathcal{L}(v_i, v_j) = \mathcal{L}'(v'_{i'}, v'_{j'})\} \qquad (1.29)$$
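A minimal sketch of Definition 20 in Python, assuming the graphs are given as adjacency matrices with node labels; edge labels are ignored for brevity in this sketch.

```python
import numpy as np

def direct_product_adjacency(A, labels, Ap, labels_p):
    """Sketch of Definition 20: node set V_x and adjacency matrix A_x of the direct
    product graph, given adjacency matrices and node labels of G and G'.
    Edge labels are ignored for brevity (an assumption of this sketch)."""
    # V_x: pairs of nodes carrying identical labels
    V_x = [(i, ip) for i in range(len(labels))
                   for ip in range(len(labels_p))
                   if labels[i] == labels_p[ip]]
    A_x = np.zeros((len(V_x), len(V_x)))
    for a, (i, ip) in enumerate(V_x):
        for b, (j, jp) in enumerate(V_x):
            # an edge in G_x requires an edge in G and an edge in G'
            if A[i, j] and Ap[ip, jp]:
                A_x[a, b] = 1.0
    return V_x, A_x

# Toy usage: a labeled triangle and a labeled path (hypothetical labels)
A  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]); labels   = ["C", "C", "O"]
Ap = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]); labels_p = ["C", "O", "C"]
V_x, A_x = direct_product_adjacency(A, labels, Ap, labels_p)
print(len(V_x), int(A_x.sum()))
```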

Using this product graph, they define the random walk kernel as follows.

Definition 21 Let $G$ and $G'$ be two graphs, let $A_\times$ denote the adjacency matrix of their product graph $G_\times$, and let $V_\times$ denote the node set of $G_\times$. With a sequence of weights $\lambda = \lambda_0, \lambda_1, \ldots$ ($\lambda_i \in \mathbb{R}$, $\lambda_i \geq 0$ for all $i \in \mathbb{N}$) the product graph kernel is defined as

$$k_\times(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{k=0}^{\infty} \lambda_k A_\times^k \right]_{ij} \qquad (1.30)$$

if the limit exists.

The limit of $k_\times(G, G')$ can be computed rather efficiently for two particular choices of $\lambda$: the geometric series and the exponential series.

Setting $\lambda_k = \lambda^k$, i.e., to a geometric series, we obtain the geometric random walk kernel

$$k_\times(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{k=0}^{\infty} \lambda^k A_\times^k \right]_{ij} = \sum_{i,j=1}^{|V_\times|} \left[ (I - \lambda A_\times)^{-1} \right]_{ij} \qquad (1.31)$$

if $\lambda < \frac{1}{a}$, where $a \geq \Delta_{max}(G_\times)$, the maximum degree of a node in the product graph.

Similarly, setting $\lambda_k = \frac{\beta^k}{k!}$, i.e., to an exponential series, we obtain the exponential random walk kernel

$$k_\times(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{k=0}^{\infty} \frac{(\beta A_\times)^k}{k!} \right]_{ij} = \sum_{i,j=1}^{|V_\times|} \left[ e^{\beta A_\times} \right]_{ij} \qquad (1.32)$$

Both these kernels require $O(n^6)$ runtime, which can be seen as follows: the geometric random walk kernel requires inversion of the $n^2 \times n^2$ matrix $(I - \lambda A_\times)$, an effort cubic in the size of the matrix, hence $O(n^6)$. For the exponential random walk kernel, diagonalization of the $n^2 \times n^2$ matrix $A_\times$ is necessary to compute $e^{\beta A_\times}$, which is again an operation with runtime cubic in the size of the matrix.

Marginalized Graph Kernels Though motivated differently, the marginalized graph kernels of [Kashima et al., 2003] are closely related. Their kernel is defined as the expectation of a kernel over all pairs of label sequences from two graphs.

For extracting features from a graph $G = (V, E, \mathcal{L})$, a set of label sequences is produced by performing random walks. At the first step, $v_1 \in V$ is sampled from an initial probability distribution $p_s(v_1)$ over all nodes in $V$. Subsequently, at the $i$-th step, the next node $v_i \in V$ is sampled subject to a transition probability $p_t(v_i|v_{i-1})$, or the random walk ends with probability $p_q(v_{i-1})$:

$$\sum_{v_i=1}^{|V|} p_t(v_i|v_{i-1}) + p_q(v_{i-1}) = 1 \qquad (1.33)$$

Each random walk generates a sequence of nodes $w = (v_1, v_2, \ldots, v_\ell)$, where $\ell$ is the length of $w$ (possibly infinite).

The probability for the walk $w$ is described as

$$p(w|G) = p_s(v_1) \prod_{i=2}^{\ell} p_t(v_i|v_{i-1}) \, p_q(v_\ell). \qquad (1.34)$$

Associated with a walk $w$, we obtain a sequence of labels

$$h_w = (\mathcal{L}(v_1), \mathcal{L}(v_1, v_2), \mathcal{L}(v_2), \ldots, \mathcal{L}(v_\ell)) = (h_1, h_2, \ldots, h_{2\ell-1}), \qquad (1.35)$$

which is an alternating sequence of node labels and edge labels from the space of labels $\mathcal{Z}$:

$$h_w = (h_1, h_2, \ldots, h_{2\ell-1}) \in \mathcal{Z}^{2\ell-1}. \qquad (1.36)$$

The probability for the label sequence $h$ is equal to the sum of the probabilities of all walks $w$ emitting a label sequence $h_w$ identical to $h$,

$$p(h|G) = \sum_{w} \delta(h = h_w) \left\{ p_s(v_1) \prod_{i=2}^{\ell} p_t(v_i|v_{i-1}) \, p_q(v_\ell) \right\} \qquad (1.37)$$

where $\delta$ is a function that returns 1 if its argument holds, and 0 otherwise.
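The generative model above is easy to simulate. The following sketch samples one walk and emits its label sequence $h_w$ as in equation (1.35); the uniform choices for $p_s$ and $p_t$ and the constant stopping probability are illustrative assumptions.

```python
import random

def sample_label_sequence(adj, node_labels, edge_labels, p_quit=0.3, rng=random):
    """Sample one walk w = (v_1, ..., v_l) and emit its label sequence h_w as in
    equation (1.35). Uniform p_s and p_t and a constant stopping probability p_q
    are illustrative assumptions."""
    v = rng.choice(list(adj))                  # v_1 ~ p_s (uniform)
    h = [node_labels[v]]
    while adj[v] and rng.random() >= p_quit:   # continue with probability 1 - p_q
        w = rng.choice(adj[v])                 # v_i ~ p_t(. | v_{i-1}) (uniform)
        h.append(edge_labels[(v, w)])          # edge label L(v_{i-1}, v_i)
        h.append(node_labels[w])               # node label L(v_i)
        v = w
    return h

# Toy usage on a hypothetical labeled triangle
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
node_labels = {0: "C", 1: "C", 2: "O"}
edge_labels = {(i, j): "-" for i in adj for j in adj[i]}
print(sample_label_sequence(adj, node_labels, edge_labels))
```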

[Kashima et al., 2003] then define a kernel $k_z$ between two label sequences $h$ and $h'$. Assuming that $k_v$ is a nonnegative kernel on nodes and $k_e$ is a nonnegative kernel on edges, the kernel for label sequences is defined as the product of label kernels when the lengths of the two sequences are identical ($\ell = \ell'$):

$$k_z(h, h') = k_v(h_1, h'_1) \prod_{i=2}^{\ell} k_e(h_{2i-2}, h'_{2i-2}) \, k_v(h_{2i-1}, h'_{2i-1}) \qquad (1.38)$$

The label sequence graph kernel is then defined as the expectation of $k_z$ over all possible $h$ and $h'$:

$$k(G, G') = \sum_{h} \sum_{h'} k_z(h, h') \, p(h|G) \, p(h'|G'). \qquad (1.39)$$


In terms of R-convolution, the decomposition corresponding to this graph kernel is the set of all possible label sequences generated by a random walk.

The runtime of the marginalized graph kernel $k(G, G')$ is easiest to check if we transform the above equations into matrix notation. For this purpose we define two matrices $S$ and $Q$ of size $n \times n$. Let $S$ be defined as

$$S_{ij} = p_s(v_i) \, p'_s(v'_j) \, k_v(\mathcal{L}(v_i), \mathcal{L}'(v'_j)), \qquad (1.40)$$

and $Q$ as

$$Q_{ij} = p_q(v_i) \, p'_q(v'_j). \qquad (1.41)$$

Furthermore, let $T$ be an $n^2 \times n^2$ transition matrix:

$$T_{(i-1)n+j,\,(i'-1)n+j'} = p_t(v_j|v_i) \, p'_t(v'_{j'}|v'_{i'}) \, k_v(\mathcal{L}(v_j), \mathcal{L}'(v'_{j'})) \, k_e(\mathcal{L}(v_i, v_j), \mathcal{L}'(v'_{i'}, v'_{j'})). \qquad (1.42)$$

The matrix form of the kernel in terms of these three matrices is then [Kashima et al., 2003]

$$k(G, G') = ((I - T)^{-1} \mathrm{vec}(Q))^\top \mathrm{vec}(S) = \mathrm{vec}(Q)^\top (I - T)^{-1} \mathrm{vec}(S), \qquad (1.43)$$

where the vec operator flattens an $n \times n$ matrix into an $n^2 \times 1$ vector and $I$ is the identity matrix of size $n^2 \times n^2$. We observe that the computation of the marginalized kernel requires an inversion of an $n^2 \times n^2$ matrix. Like the random walk kernel, the runtime of the marginalized kernel on graphs is hence in $O(n^6)$.
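The matrix form translates almost directly into code. The sketch below builds $S$, $Q$ and $T$ and solves the linear system of equation (1.43); uniform start and transition probabilities, a constant stopping probability, Dirac node-label kernels, and a trivial edge-label kernel ($k_e \equiv 1$) are simplifying assumptions of this sketch.

```python
import numpy as np

def marginalized_kernel(A, L, Ap, Lp, p_quit=0.3):
    """Sketch of equations (1.40)-(1.43). Uniform start/transition probabilities,
    a constant stopping probability, Dirac node-label kernels and a constant
    edge-label kernel (k_e = 1) are simplifying assumptions."""
    n, m = len(L), len(Lp)
    # start, stop and transition probabilities for both graphs
    ps,  pq  = np.full(n, 1.0 / n), np.full(n, p_quit)
    psp, pqp = np.full(m, 1.0 / m), np.full(m, p_quit)
    pt  = (1 - p_quit) * A  / np.maximum(A.sum(axis=1),  1)[:, None]
    ptp = (1 - p_quit) * Ap / np.maximum(Ap.sum(axis=1), 1)[:, None]

    kv = np.array([[1.0 if a == b else 0.0 for b in Lp] for a in L])  # Dirac k_v
    S = (ps[:, None] * psp[None, :]) * kv          # equation (1.40)
    Q = pq[:, None] * pqp[None, :]                 # equation (1.41)
    T = np.kron(pt, ptp) * kv.reshape(1, -1)       # equation (1.42) with k_e = 1

    vecS, vecQ = S.reshape(-1), Q.reshape(-1)
    # equation (1.43): vec(Q)^T (I - T)^{-1} vec(S)
    return float(vecQ @ np.linalg.solve(np.eye(n * m) - T, vecS))

# Toy usage with two small labeled graphs (hypothetical labels)
A  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float); L  = ["C", "C", "O"]
Ap = np.array([[0, 1], [1, 0]], float);                  Lp = ["C", "O"]
print(marginalized_kernel(A, L, Ap, Lp))
```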

Note the similarity between equation (1.43) and equation (1.31), i.e., the definitions of the random walk kernel and the marginalized kernel on graphs. This similarity is not by chance. In Section 2.1, we will show that both these graph kernels are instances of a common unifying framework for walk-based kernels on graphs.

Discussion Graph kernels based on random walks intuitively seem to be a good measure of similarity on graphs, as they take the whole structure of the graph into account, but require polynomial runtime only. However, these kernels suffer from several weaknesses, which we will describe in the following.

Bad News: The Runtime Complexity Random walk kernels were developed as an alternative to the NP-hard all-subgraphs kernel. So do these $O(n^6)$ graph kernels save the day?

Unfortunately, although polynomial, $n^6$ is a huge computational effort. For small graphs, $n^6$ operations (neglecting constant factors) are even more than $2^n$ operations, as can be seen from Figure 1.5. Hence for graphs with fewer than 30 nodes, $n^6$ is slower than $2^n$. Interestingly, the average node number of typical benchmark datasets frequently used in graph mining is less than 30 (MUTAG 17.7, PTC 26.7)!
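The crossover point is easy to verify with a quick, purely illustrative check:

```python
# Quick check of the crossover point between n^6 and 2^n
print(next(n for n in range(2, 100) if n ** 6 < 2 ** n))   # -> 30
print(all(n ** 6 > 2 ** n for n in range(2, 30)))          # -> True
```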

This high computational runtime severely limits the applicability of random walk graph kernels to real-world data. They are not efficient enough for dealing with large datasets of graphs, and do not scale up to large graphs with many nodes. As our first contribution in this thesis, we show how to speed up the random walk kernel to $O(n^3)$ in Section 2.1.


Figure 1.5: Runtime versus graph size $n$ for two algorithms requiring $n^6$ and $2^n$ operations per comparison.

Tottering In addition to lack of efficiency, walk kernels suffer from a phenomenon called tottering [Mahé et al., 2004]. Walks allow for repetitions of nodes and edges, which means that the same node or edge is counted repeatedly in a similarity measure based on walks.

In an undirected graph, a random walk may even start tottering between the same two nodes in the product graph, leading to an artificially high similarity score caused by a single common edge in the two graphs. Furthermore, a random walk on any cycle in the graph can in principle be infinitely long and drastically increase the similarity score, although the structural similarity between the two graphs is minor.

Halting Walk kernels show a second weakness. The decaying factor $\lambda$ down-weights longer walks, which makes short walks dominate the similarity score. We describe this problem, which we refer to as "halting", in more detail in Section 2.1.5. Approaches to overcome both halting and tottering are the topic of Section 2.2 and Section 2.3.

Due to the shortcomings of random walk kernels, extensions of these and alternative kernels have been defined in the literature. We will summarize these next.

Extensions of Marginalized Graph Kernels [Mahé et al., 2004] designed two extensions of marginalized kernels to overcome a) the problem of tottering and b) their computational expensiveness. Both extensions are particularly relevant for chemoinformatics applications.

The first extension is to relabel each node automatically in order to insert information about the neighborhood of each node into its label via the so-called Morgan Index. This has an effect both in terms of feature relevance, because label paths then also carry information about their neighborhood, and in terms of computation time, because the number of identically labeled paths decreases significantly. This speed-up is successfully demonstrated on real-world datasets. However, this node label enrichment only slightly improved classification accuracy.
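The idea behind Morgan-index relabeling can be sketched in a few lines; the concrete update rule below (node degrees, then repeated neighborhood sums) is one common variant and may differ in detail from the scheme used by [Mahé et al., 2004].

```python
def morgan_index(adj, iterations=2):
    """Morgan-index-style relabeling sketch: start from node degrees and repeatedly
    replace each node's index by the sum of its neighbors' indices. The exact
    scheme of [Mahé et al., 2004] may differ in detail."""
    index = {v: len(nbrs) for v, nbrs in adj.items()}            # initialise with degrees
    for _ in range(iterations):
        index = {v: sum(index[w] for w in adj[v]) for v in adj}  # aggregate neighborhoods
    return index

# Toy usage: enrich node labels of a hypothetical 4-node graph
adj = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
labels = {0: "C", 1: "C", 2: "O", 3: "C"}
mi = morgan_index(adj)
print({v: f"{labels[v]}|{mi[v]}" for v in adj})                  # enriched labels
```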

Second, they show how to modify the random walk model in order to remove tottering between two nodes (but not on longer cycles). This removal of length-2 tottering did not improve classification performance uniformly.

Subtree-Pattern Kernels As an alternative to walk kernels on graphs, graph kernels comparing subtree-patterns were defined in [Ramon and Gärtner, 2003]. Intuitively, this kernel considers all pairs of nodes from $G$ and $G'$ and iteratively compares their neighborhoods. 'Subtree-pattern' refers to the fact that this kernel counts subtree-like structures in two graphs. In contrast to the strict definition of trees, subtree-patterns may include several copies of the same node or edge. Hence they are not necessarily isomorphic to subgraphs of $G$ or $G'$, let alone subtrees of $G$ and $G'$. To be able to regard these patterns as trees, [Ramon and Gärtner, 2003] treat copies of identical nodes and edges as if they were distinct nodes and edges.

More formally, let $G(V, E)$ and $G'(V', E')$ be two graphs. The idea of the subtree-pattern kernel $k_{v,v',h}$ is to count pairs of identical subtree-patterns in $G$ and $G'$ with height less than or equal to $h$, the first one rooted at $v \in V(G)$ and the second one rooted at $v' \in V(G')$. If $h = 1$ and $\mathcal{L}(v) = \mathcal{L}'(v')$, we have $k_{v,v',h} = 1$; if $h = 1$ and $\mathcal{L}(v) \neq \mathcal{L}'(v')$, we have $k_{v,v',h} = 0$. For $h > 1$, one can compute $k_{v,v',h}$ as follows:

• Let $M_{v,v'}$ be the set of all matchings from the set $\delta(v)$ of neighbors of $v$ to the set $\delta(v')$ of neighbors of $v'$, i.e.,

$$M_{v,v'} = \{R \subseteq \delta(v) \times \delta(v') \mid (\forall (v_i, v'_i), (v_j, v'_j) \in R : v_i = v_j \Leftrightarrow v'_i = v'_j) \wedge (\forall (v_k, v'_k) \in R : \mathcal{L}(v_k) = \mathcal{L}'(v'_k))\} \qquad (1.44)$$

• Compute

$$k_{v,v',h} = \lambda_v \lambda_{v'} \sum_{R \in M_{v,v'}} \prod_{(u, u') \in R} k_{u,u',h-1} \qquad (1.45)$$

Here $\lambda_v$ and $\lambda_{v'}$ are positive values smaller than 1, causing taller trees to carry a smaller weight in the overall sum.

Given two graphs $G(V, E)$ and $G'(V', E')$, the subtree-pattern kernel of $G$ and $G'$ is given by

$$k_{tree,h}(G, G') = \sum_{v \in V} \sum_{v' \in V'} k_{v,v',h}. \qquad (1.46)$$

Like the walk kernel, the subtree-pattern kernel suffers from tottering. Due to the more complex patterns it examines, its runtime is even worse than that of the random walk kernel: it grows exponentially with the height $h$ of the subtree-patterns considered.

Cyclic Pattern Kernels [Horvath et al., 2004] decompose a graph into cyclic patterns, then count the number of common cyclic patterns which occur in both graphs. Their kernel is plagued by computational issues; in fact, they show that computing the cyclic pattern kernel on a general graph is NP-hard. They consequently restrict their attention to practical problem classes where the number of simple cycles is bounded.

Fingerprint and Depth First Search Kernels [Ralaivola et al., 2005] define graph kernels based on molecular fingerprints and length-d paths from depth-first search. These kernels are tailored for applications in chemical informatics, and exploit the small size and low average degree of these molecular graphs.

Optimal Assignment Kernels In the aforementioned graph kernels, R-convolution often boils down to an all-pairs comparison of substructures from two composite objects. Intuitively, finding a best match, an optimal assignment between the substructures of $G$ and $G'$, would be more attractive than an all-pairs comparison. In this spirit, [Fröhlich et al., 2005] define an optimal assignment kernel on composite objects that includes graphs as a special instance.

Definition 22 (Optimal Assignment Kernel) Let $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be some non-negative, symmetric and positive definite kernel. Assume that $x$ and $x'$ are two composite objects that have been decomposed into their parts $x := (x_1, x_2, \ldots, x_{|x|})$ and $x' := (x'_1, x'_2, \ldots, x'_{|x'|})$. Let $\Pi(x)$ denote all possible permutations of $x$, and analogously $\Pi(x')$ all possible permutations of $x'$. Then $k_A : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with

$$k_A(x, x') := \begin{cases} \max_{\pi \in \Pi(x')} \sum_{i=1}^{|x|} \kappa(x_i, x'_{\pi(i)}) & \text{if } |x'| > |x|, \\ \max_{\pi \in \Pi(x)} \sum_{j=1}^{|x'|} \kappa(x_{\pi(j)}, x'_j) & \text{otherwise} \end{cases} \qquad (1.47)$$

is called an optimal assignment kernel.

While based on a nice idea, the optimal assignment kernel is unfortunately not positive definite [Vishwanathan et al., 2007b], seriously limiting its use in SVMs and other kernel methods.

Other Graph Kernels Two more types of graph kernels have been described in the literature: graph edit distance kernels that employ results of a graph edit distance to give extra weight to matching vertices [Neuhaus, 2006], and weighted decomposition kernels that decompose a graph into small subparts and reward similarity of these subparts with different weights [Menchetti et al., 2005]. However, while the former fail to be positive definite, the latter can deal efficiently only with highly simplified representations of graphs.

Quality Criteria in Graph Kernel Design From our review of the state of the art in graph kernels, it becomes apparent that all current graph kernels suffer from different kinds of weaknesses. The open question remains: how does one design a 'good' graph kernel?

The definition of 'good' is the key to answering this question. Here we formulate several central requirements that graph kernel design has to fulfill. A good graph kernel that is theoretically sound and widely applicable should show the following characteristics:

• positive definiteness. A valid kernel function guarantees a globally optimal solution when the graph kernel is employed within a convex optimization problem, as in SVMs.

• not restricted to a special class of graphs. While kernels that are specialized to certain classes of graphs may be helpful in some applications, it is much more attractive to define a graph kernel that is generally applicable. This way, one need not worry whether the graph kernel is applicable to the particular problem at hand.

• efficient to compute. In practice it is not only desirable to define a kernel on graphs theoretically, but also to guarantee that it is fast to compute and has a low theoretical runtime complexity. A graph kernel needs to be efficient to compute, because otherwise one might as well employ one of the many expensive graph matching and graph isomorphism approaches from Section 1.3, and then apply a kernel to the similarity scores obtained by these approaches.

• expressive. A graph kernel has to represent an expressive, non-trivial measure of similarity on graphs. It has to compare features or subgraphs of two graphs that allow one to tell whether the topology and/or node and edge labels of two graphs are similar.

Some of these goals may be at loggerheads. Graph kernels for special classes of graphs, for example trees, can be computed highly efficiently, requiring quadratic [Lodhi et al., 2002] or, with canonical ordering, even linear runtime [Vishwanathan and Smola, 2004]. These kernels, however, cannot be applied to graphs in general. The graph kernels proposed in [Neuhaus, 2006] are expressive measures of similarity on graphs, but they lack validity, i.e., they are not positive definite. The all-subgraphs kernel [Gärtner et al., 2003] is extremely expressive, as it considers all pairs of common subgraphs of two graphs, but its computation is NP-hard.

For all these reasons, one central challenge in this thesis was the development of graph kernels that overcome the limitations of current graph kernels.
