
In the document Graph Kernels (pages 102-105)

3.2 Graph Similarity via Maximum Mean Discrepancy

3.2.1 Two-Sample Test on Sets of Graphs

Given two sets of graphs X and Y, each of size m (assuming m = m1 = m2), drawn from distributions p and q, and a universal graph kernel k, we can estimate MMD²u via Lemma 40 and employ the asymptotic test from Section 3.1.4 to decide whether to reject the null hypothesis p = q. As an alternative to the asymptotic test, we could employ the biased estimate from Theorem 34 and the statistical test based on uniform convergence bounds from Section 3.1.3.
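For concreteness, the following is a minimal sketch of one standard unbiased empirical estimate of MMD² computed from precomputed Gram matrices; the exact form stated in Lemma 40 may differ slightly, and the Gaussian kernel and random vector data below are purely illustrative stand-ins for a graph kernel and graph data.

```python
import numpy as np

def mmd2_u(Kxx, Kyy, Kxy):
    """Unbiased estimate of MMD^2 from precomputed kernel matrices.

    Kxx, Kyy are the m x m Gram matrices within each sample, Kxy the
    m x m cross-kernel matrix; the diagonals of Kxx and Kyy are
    excluded, as in the unbiased U-statistic."""
    m = Kxx.shape[0]
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = Kxy.sum() / (m * m)
    return term_x + term_y - 2.0 * term_xy

def rbf(A, B, gamma=0.5):
    # Gaussian kernel between the row vectors of A and B
    # (an illustrative choice, not a graph kernel)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (100, 2))   # sample from p
Y = rng.normal(0.0, 1.0, (100, 2))   # second sample, also from p
Z = rng.normal(3.0, 1.0, (100, 2))   # sample from a shifted q
same = mmd2_u(rbf(X, X), rbf(Y, Y), rbf(X, Y))  # close to zero
diff = mmd2_u(rbf(X, X), rbf(Z, Z), rbf(X, Z))  # clearly positive
```

For graph data, one would simply fill Kxx, Kyy, Kxy with evaluations of a graph kernel instead.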

However, there are two open questions in this context: Which of the existing graph kernels is universal in the sense of [Steinwart, 2002]? If there are none, or none that are efficient to compute, can we still employ MMD on sets of graphs using a non-universal kernel? We consider these two questions in turn.

Universal Kernels on Graphs

While many examples of universal kernels on compact subsets of ℝᵈ are known [Steinwart, 2002], little attention has been given to finite domains. It turns out that the issue is considerably easier in this case: the weaker notion of strict positive definiteness (kernels inducing nonsingular Gram matrices (Kij = k(xi, xj)) for arbitrary sets of distinct points xi) ensures that every function on a discrete domain X = {x1, . . . , xm} lies in the corresponding RKHS, and hence that the kernel is universal. To see this, let f ∈ ℝᵐ be an arbitrary function on X. Then α = K⁻¹f ensures that the function f = Σj αj k(·, xj) satisfies f(xi) = fi for all i.
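This interpolation argument is easy to check numerically; a small sketch with a Gaussian kernel on a handful of random points (all values illustrative):

```python
import numpy as np

# On a finite domain, a strictly positive definite kernel lets us hit
# any target function exactly by solving K alpha = f.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))          # 5 distinct points (hypothetical data)
K = np.exp(-((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))  # Gaussian Gram matrix
f = rng.normal(size=5)               # arbitrary target values f_i
alpha = np.linalg.solve(K, f)        # alpha = K^{-1} f
f_hat = K @ alpha                    # f_hat(x_i) = sum_j alpha_j k(x_i, x_j)
```

Here f_hat reproduces f exactly (up to numerical precision), so every function on the five points lies in the span of the kernel sections.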

While there are strictly positive definite kernels on strings [Borgwardt et al., 2006], for graphs, unfortunately, no strictly positive definite kernels exist that are efficient to compute. Note first that strict positive definiteness requires φ(x) to be injective: otherwise we would have φ(x) = φ(x′) for some x ≠ x′, implying that the kernel matrix obtained from X = {x, x′} is singular. However, as [Gärtner et al., 2003] show, an injective φ(x) allows one to match graphs by computing ‖φ(x) − φ(x′)‖² = k(x, x) + k(x′, x′) − 2k(x, x′). In Section 1.4, we have seen that the corresponding all-subgraphs kernel is NP-hard to compute, and hence impractical in real-world applications.
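The feature-space distance used in this argument requires only kernel evaluations, never the explicit map φ; a brief illustrative sketch (with a vector kernel standing in for a graph kernel):

```python
import numpy as np

def feature_distance_sq(k, x, xp):
    """||phi(x) - phi(x')||^2 computed from kernel evaluations alone:
    k(x, x) + k(x', x') - 2 k(x, x')."""
    return k(x, x) + k(xp, xp) - 2.0 * k(x, xp)

# Illustrative kernel on vectors; for graphs, k would be a graph kernel,
# and with an injective phi a zero distance would certify a graph match.
rbf = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))
a = np.array([1.0, 2.0])
b = np.array([1.0, 2.5])
d_aa = feature_distance_sq(rbf, a, a)  # zero: identical inputs
d_ab = feature_distance_sq(rbf, a, b)  # strictly positive
```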

Due to these efficiency problems, let us discuss the consequences of employing a non-universal kernel with MMD next.

MMD and Non-Universal Kernels

So far, we have focused on the case of universal kernels, as MMD using universal kernels is a test for identity of arbitrary Borel probability distributions.

However, note that for instance in pattern recognition, there might well be situations where the best kernel for a given problem is not universal. In fact, the kernel corresponds to the choice of a prior, and thus using a kernel which does not afford approximations of arbitrary continuous functions can be very useful — provided that the functions it does approximate are known to be solutions of the given problem.

The situation is similar for MMD. Consider the following example: suppose we knew that the two distributions we are testing are both Gaussians (with unknown mean vectors and covariance matrices). Since the empirical means of products of input variables up to order two are sufficient statistics for the family of Gaussians, we should work in an RKHS spanned by products of order up to two: any higher order products contain no information about the underlying Gaussians and can therefore mislead us. It is straightforward to see that for c > 0, the polynomial kernel k(x, x′) = (⟨x, x′⟩ + c)² does the job: it equals

Σ_{i,j=1}^d xi xj x′i x′j + 2c Σ_{i=1}^d xi x′i + c² = ⟨φ(x), φ(x′)⟩,

where φ(x) = (c, √(2c) x1, . . . , √(2c) xd, xi xj | i, j = 1, . . . , d)ᵀ. If we want to test for differences in higher order moments, we use a higher order kernel² k(x, x′) = (⟨x, x′⟩ + c)ᵖ. To get a test for comparing two arbitrary distributions, we need to compare all moments, which is precisely what we do when we consider the infinite-dimensional RKHS associated with a universal kernel.
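This expansion can be verified numerically by building the explicit feature map and comparing inner products; the vectors and the value of c below are arbitrary illustrative choices.

```python
import numpy as np

def poly2_features(x, c):
    """Explicit feature map of (<x, x'> + c)^2: the constant c,
    the scaled coordinates sqrt(2c) * x_i, and all pairwise
    products x_i * x_j for i, j = 1, ..., d."""
    d = len(x)
    pair = np.array([x[i] * x[j] for i in range(d) for j in range(d)])
    return np.concatenate(([c], np.sqrt(2 * c) * x, pair))

rng = np.random.default_rng(2)
x, y = rng.normal(size=4), rng.normal(size=4)
c = 1.5
lhs = (x @ y + c) ** 2                               # kernel evaluation
rhs = poly2_features(x, c) @ poly2_features(y, c)    # explicit inner product
```

Both sides agree up to floating-point error, confirming that the RKHS of this kernel is spanned by products of input variables of order up to two.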

Based on these considerations and to keep computation practical, we resort to our graph kernels from Section 2 and from [Borgwardt et al., 2005] that are more efficient to

²Kernels with an infinite-dimensional RKHS can be viewed as a nonparametric generalization in which we have infinitely many sufficient statistics.

compute and provide useful measures of similarity on graphs, as demonstrated in several experiments.

Combining MMD with graph kernels, we are now in a position to compare two sets of graphs and to decide whether they are likely to originate from the same distribution, based on a significance level α. Recall that the design parameter α is the probability of erroneously concluding that two sets of graphs follow different distributions although they are drawn from the same distribution.

Experiments

We can employ this type of statistical test for the similarity of sets of graphs to find corresponding groups of graphs in two databases. Problems of this kind arise in data integration, when two collections of graph-structured data are to be matched. We explore this application in our subsequent experimental evaluation. As we found the uniform convergence-based test to be very conservative in our experimental evaluation in Section 3.1.5, we used the asymptotic test, which showed superior performance on small sample sizes, for our experiments.

To evaluate MMD on graph data, we obtained two datasets of protein graphs (Protein and Enzyme) and used the random walk graph kernel for proteins from [Borgwardt et al., 2005] for table matching via the Hungarian method (the other tests were not applicable to this graph data). The challenge here is to match tables representing one functional class of proteins (or enzymes) from dataset A to the corresponding tables (functional classes) in B.
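As a sketch of the matching step: given a matrix of MMD² values between the classes of dataset A (rows) and dataset B (columns), the Hungarian method selects the assignment minimizing total discrepancy. Here we use SciPy's linear_sum_assignment as one available implementation; the matrix entries are made up for illustration, not real MMD values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical MMD^2 values: small on the diagonal (matching classes),
# large off the diagonal (non-matching classes).
mmd_matrix = np.array([
    [0.01, 0.80, 0.75],
    [0.70, 0.02, 0.90],
    [0.85, 0.78, 0.03],
])
row, col = linear_sum_assignment(mmd_matrix)  # Hungarian method
# col[i] is the class of B matched to class i of A; here each class
# of A is paired with the B class of smallest discrepancy.
```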

Enzyme Graph Data

In more detail, we study the following scenario: Two researchers have each dealt with 300 enzyme protein structures. These two sets of 300 proteins are disjoint, i.e., there is no protein studied by both researchers. Each researcher has assigned the proteins to six different classes according to their enzyme activity. However, the two have used different protein function classification schemas for these classes and are not sure which of these classes correspond to each other.

To find corresponding classes, MMD can be employed. We obtained 600 proteins modeled as graphs from [Borgwardt et al., 2005], and randomly split these into two subsets A and B of 300 proteins each, such that 50 enzymes in each subset belong to each of the six EC top level classes (EC1 to EC6). We then computed MMD for all pairs of the six EC classes from subset A and subset B to check whether the null hypothesis is rejected or accepted. To compute MMD, we employed the random walk kernel for protein graphs, following [Borgwardt et al., 2005]. This random walk kernel measures similarity between two graphs by counting matching walks in the two graphs.

We compared all pairs of classes via MMD²u B, and repeated the experiment 100 times.

Note that a comparison to competing statistical tests is unnecessary, as — to the best of our knowledge — no other distribution test for structured data exists.

We report results in Table 3.4. For a significance level of α = 0.05, MMD rejected the null hypothesis that both samples are from the same distribution whenever enzymes from

two different EC classes were compared. When enzymes from the same EC classes were compared, MMD accepted the null hypothesis. MMD thus achieves error-free data-based schema matching here.

Protein Graph Data

We consider a second schema matching problem on complex data which is motivated by bioinformatics: If two protein databases are merged, we want to automatically find out which tables represent enzymes and which do not represent enzymes. We assume that these molecules are represented as graphs in both databases.

We repeat the above experiments for graph representations of 1128 proteins, 665 of which are enzymes and 463 of which are non-enzymes. This time we consider 200 graphs per sample, i.e., two samples of 200 protein graphs each are compared via the protein random walk kernel from above. Again, we compare samples from the same class (both enzymes or both non-enzymes) or from different classes (one of enzymes, one of non-enzymes) via MMD.

As on the enzyme dataset, MMD²u B made no errors. Results are shown in Table 3.4.

Dataset   Data type   No. attributes   Sample size   Repetitions   % correct matches
Enzyme    graph       6                50             50            100.0
Protein   graph       2                200            50            100.0

Table 3.4: Matching database tables via MMD²u B on graph data (Enzyme, Protein) (α = 0.05; '% correct matches' is the percentage of the correct attribute matches detected over all repetitions).
