
2.3 Graphlet Kernels for Large Graph Comparison

2.3.6 Summary

In this section, motivated by the matrix reconstruction theorem and the graph reconstruction conjecture, we have defined a graph kernel counting common size-4 subgraphs, so-called graphlets, in two graphs. Kernel computation involves two expensive steps: enumeration of all graphlets in each graph and pairwise isomorphism checks on these graphlets.

The latter step can be performed efficiently by exploiting the limited size of graphlets and by precomputing isomorphism groups among them. We speed up the former step with an efficient sampling scheme that allows us to estimate the distribution over graphlet isomorphism classes from a constant number of sampled graphlets, as sketched below. Together, these two methods allow us to apply our novel kernel to graph sizes that no other graph kernel has been able to handle so far.
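To make the sampling scheme concrete, here is a minimal sketch of estimating the distribution over size-4 graphlet isomorphism classes by uniform sampling. It is not the thesis's implementation; the function name and the adjacency-dict input format are illustrative. It exploits the fact that, for simple graphs on four vertices, the sorted degree sequence of the induced subgraph already distinguishes all eleven isomorphism classes, so it can stand in for an explicit isomorphism check.

```python
import random
from collections import Counter

def sample_graphlet_distribution(adj, num_samples=1000, rng=None):
    """Estimate the distribution over size-4 graphlet isomorphism
    classes by sampling node quadruples uniformly at random.

    adj: dict mapping each node to the set of its neighbours
    (simple undirected graph with at least 4 nodes).
    """
    rng = rng or random.Random()
    nodes = list(adj)
    counts = Counter()
    for _ in range(num_samples):
        quad = rng.sample(nodes, 4)
        # Sorted degree sequence within the induced subgraph: for
        # 4-node simple graphs this is a complete isomorphism
        # invariant (all 11 classes have distinct degree sequences).
        label = tuple(sorted(
            sum(1 for v in quad if v != u and v in adj[u]) for u in quad
        ))
        counts[label] += 1
    return {cls: c / num_samples for cls, c in counts.items()}
```

A kernel value for two graphs can then be taken, for instance, as the inner product of their estimated frequency vectors, so that graphs agreeing on their graphlet distributions receive a high similarity score.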

In our experimental evaluation on unlabeled graphs, the novel graphlet kernel achieved excellent results, consistently reaching high levels of classification accuracy and becoming more competitive in runtime as graph size increases. Future work will look into ways of reducing the sample size required for labeled graphs, and into speeding up isomorphism checks on labeled graphs.

To conclude, in this chapter we have sped up the random walk graph kernel to O(n^3) and defined a novel kernel on shortest paths that is efficient, avoids tottering and halting, and is an expressive measure of graph similarity. In the last section, we have defined a graph kernel based on sampling small subgraphs from the input graphs that is likewise efficient, avoids tottering and halting, and is an expressive measure of graph similarity, and that in addition scales up to very large graphs hitherto not handled by graph kernels.


Chapter 3

Two-Sample Tests on Graphs

While we have enhanced the efficiency of graph kernels so far, we have not tackled another problem: graph kernel values per se are a rather unintuitive measure of similarity on graphs. When comparing two graphs or two sets of graphs, the (average) graph kernel value will be large if the graphs or sets of graphs are very similar, and small otherwise. But how do we judge what counts as small and what counts as large in terms of graph kernel values?

Ideally, we would employ a statistical test to decide whether graph similarity is significant. Yet little attention has been paid to the question of whether the similarity of graphs is statistically significant. Even the question itself is problematic: what does it mean for the similarity of two graphs to be statistically significant?

For sets of graphs, this question can be answered more easily than for pairs of graphs.

Given two sets of graphs, we can regard each of these sets as a sample from an underlying distribution of graphs. We then have to define a statistical test to decide whether the underlying distributions of the two samples are identical; this is known as the two-sample problem, and an associated test is called a two-sample test. Unfortunately, no two-sample test for graphs is known from the literature.
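Stated formally, in the standard formulation we adopt below: given i.i.d. samples $X = \{x_1, \ldots, x_{m_1}\}$ drawn from a distribution $p$ and $Y = \{y_1, \ldots, y_{m_2}\}$ drawn from a distribution $q$, a two-sample test decides between the hypotheses

$$H_0\colon\; p = q \qquad \text{and} \qquad H_1\colon\; p \neq q.$$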

In this chapter, we define the first two-sample test that is applicable to sets of graphs, as it is based on a test statistic whose empirical estimate can be expressed in terms of kernels.

In Section 3.1, we present this test statistic, the Maximum Mean Discrepancy (MMD), and its associated two-sample tests, and evaluate its performance on classic feature vector data.

In Section 3.2.1, we explain how the two-sample tests based on MMD can be applied to sets of graphs, and evaluate them on two datasets of protein structures represented as graphs. We then show in Section 3.2.2 that MMD can even be used to define a statistical test of graph similarity on pairs of individual graphs, and employ it to measure similarity between protein-protein interaction networks of different species.

A note to the reader: our presented method uses several concepts and results from functional analysis and statistics. If you are not familiar with these domains, we recommend reading the primers on these topics in Appendix A.1 and Appendix A.2 of this thesis before continuing with this chapter. To make the presentation easier to follow, we have also moved three long proofs from this chapter to a separate Appendix B.


3.1 Maximum Mean Discrepancy

In this section, we address the problem of comparing samples from two probability distributions by proposing a statistical test of the hypothesis that these distributions are different (this is called the two-sample or homogeneity problem). This test has applications in a variety of areas. In bioinformatics, it is of interest to compare microarray data from different tissue types, either to determine whether two subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences between healthy and cancerous tissue. In database attribute matching, it is desirable to merge databases containing multiple fields where it is not known in advance which fields correspond: the fields are matched by maximizing the similarity in the distributions of their entries.

We propose to test whether distributions p and q are different on the basis of samples drawn from each of them, by finding a smooth function which is large on the points drawn from p, and small (as negative as possible) on the points from q. We use as our test statistic the difference between the mean function values on the two samples; when this is large, the samples are likely from different distributions. We call this statistic the Maximum Mean Discrepancy (MMD).
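In symbols, writing $\mathcal{F}$ for the class of candidate functions, the population statistic and its empirical estimate on samples $X$ and $Y$ read (this is the standard formulation; the formal definition follows in Section 3.1.1):

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \Big( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \Big),$$

$$\mathrm{MMD}[\mathcal{F}, X, Y] = \sup_{f \in \mathcal{F}} \Big( \frac{1}{m_1} \sum_{i=1}^{m_1} f(x_i) - \frac{1}{m_2} \sum_{j=1}^{m_2} f(y_j) \Big).$$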

Clearly, the quality of MMD as a statistic depends heavily on the class F of smooth functions that define it. On the one hand, F must be "rich enough" so that the population MMD vanishes if and only if p = q. On the other hand, for the test to be consistent, F needs to be "restrictive enough" for the empirical estimate of MMD to converge quickly to its expectation as the sample size increases. We shall use the unit balls in universal Reproducing Kernel Hilbert Spaces [Steinwart, 2002] as our function class, since these will be shown to satisfy both of the foregoing properties. On a more practical note, MMD is cheap to compute: given m_1 points sampled from p and m_2 from q, the cost is O((m_1 + m_2)^2) time.
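When $\mathcal{F}$ is the unit ball of an RKHS with kernel $k$, the squared empirical MMD reduces to sums of kernel evaluations, which explains the quadratic cost just mentioned. The following is a minimal numpy sketch of the biased estimate with a Gaussian RBF kernel; the function names are illustrative, and the unbiased variant used later in Section 3.1.4 differs only in how the diagonal terms are handled.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD.

    MMD_b^2 = mean(k(X, X)) + mean(k(Y, Y)) - 2 * mean(k(X, Y)),
    costing O((m_1 + m_2)^2) kernel evaluations as noted in the text.
    """
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())
```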

We define two non-parametric statistical tests based on MMD. The first, which uses distribution-independent uniform convergence bounds, provides finite sample guarantees of test performance, at the expense of being conservative in detecting differences between p and q. The second test is based on the asymptotic distribution of MMD, and is in practice more sensitive to differences in distribution at small sample sizes.
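For intuition about how such a test is calibrated in practice, the sketch below shows a generic permutation approach: under $H_0$ the pooled sample is exchangeable, so re-splitting it at random approximates the null distribution of any two-sample statistic. This is a common practical alternative, not the bound-based or asymptotic calibration derived in this section, and the function and parameter names are illustrative.

```python
import numpy as np

def permutation_threshold(X, Y, stat, num_perm=200, alpha=0.05, rng=None):
    """Estimate a level-alpha rejection threshold for a two-sample
    statistic stat(X, Y) by re-splitting the pooled sample at random."""
    rng = rng or np.random.default_rng()
    Z = np.vstack([X, Y])
    m = len(X)
    null_stats = []
    for _ in range(num_perm):
        perm = rng.permutation(len(Z))
        # Under H0 the pooled sample is exchangeable, so a random
        # re-split has the same distribution as the original split.
        null_stats.append(stat(Z[perm[:m]], Z[perm[m:]]))
    return np.quantile(null_stats, 1.0 - alpha)

# Usage with the MMD estimate above: reject H0 when the observed
# statistic exceeds the estimated threshold.
# reject = mmd2_biased(X, Y) > permutation_threshold(X, Y, mmd2_biased)
```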

We begin our presentation in Section 3.1.1 with a formal definition of the MMD, and a proof that the population MMD is zero if and only if p = q when F is the unit ball of a universal RKHS. We also give an overview of hypothesis testing as it applies to the two-sample problem, and review previous approaches in Section 3.1.2. In Section 3.1.3, we provide a bound on the deviation between the population and empirical MMD, as a function of the Rademacher averages of F with respect to p and q. This leads to a first hypothesis test. We take a different approach in Section 3.1.4, where we use the asymptotic distribution of an unbiased estimate of the squared MMD as the basis for a second test.

Finally, in Section 3.1.5, we demonstrate the performance of our method on problems from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage approach. Our approach performs well on high-dimensional data with low sample size. In addition, we will show in Section 3.2 that we are able to successfully apply our test to graph data, for which no alternative tests exist.
