Figure 5.2: An illustration of the predictive pipeline. The data is first distributed into clusters via relational neural gas (blue circles and orange squares). Each cluster performs an independent prediction (half-transparent blue circle and orange square) for the new data point (black diamond). Finally, these predictions are merged into a final prediction (black diamond) via the robust Bayesian committee machine (rBCM).

4. For any test time series x̄ we compute the vector of distances d(x̄, x_i) or kernel values k(x̄, x_i) to the training data. In the case of distances, we need to transform these distances to similarities and extend the eigenvalue correction of the training data to these new values via an out-of-sample extension, as described by Gisbrecht and Schleif (2015).

5. We perform rBCM to infer a prediction f(φ(x̄)) in the form of an affine coefficient vector ~α.

6. We extend our distance matrix or kernel matrix to the predicted point using Theorem 2.3.

7. We apply downstream distance- or kernel-based methods on the predicted point as desired.

The pipeline is illustrated in Figure 5.2, where data points are shown as small shapes and points within the same time series are connected via arrows. First, we cluster the data via relational neural gas, which places prototypes (large circle and square) into the data (small circles and squares) and thereby partitions the data points into disjoint clusters (distinguished by shape). For each cluster, we train a separate GPR model. For a test data point (diamond shape), each of the GPs provides a separate predictive Gaussian distribution, given in terms of its mean (half-transparent circle and square) and its variance (dashed, half-transparent circles). The predictive distributions are merged into an overall predictive distribution with the mean from Equation 2.53 (solid diamond shape) and the variance from Equation 2.52 (dashed circle). Note that the overall predictive distribution is more similar to the prediction of the circle cluster because the test data point is closer to this cluster and thus the predictive variance for the circle cluster is lower, giving it a higher weight in the merge process.
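To make the merge step concrete, the following is a minimal sketch of an rBCM-style combination of cluster-wise GP predictions, assuming the standard entropy-based weighting of Deisenroth and Ng (2015); the function name `rbcm_merge` and the example numbers are illustrative, and the precise form of Equations 2.52 and 2.53 should be taken from the thesis itself.

```python
import numpy as np

def rbcm_merge(means, variances, prior_var):
    """Merge per-cluster GP predictions into one Gaussian (rBCM-style).

    means, variances: arrays of shape (M,) with the predictive mean and
    variance of each of the M cluster-wise GP experts.
    prior_var: prior variance of the GP (e.g. sigma_prior squared).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Weight of each expert: half the difference between prior and
    # posterior log-variance (differential entropy based weighting).
    beta = 0.5 * (np.log(prior_var) - np.log(variances))
    # Merged precision: weighted expert precisions plus a prior correction
    # term, so that uninformative experts fall back to the prior.
    precision = np.sum(beta / variances) + (1.0 - np.sum(beta)) / prior_var
    var = 1.0 / precision
    # Merged mean: precision-weighted average of the expert means.
    mean = var * np.sum(beta * means / variances)
    return mean, var

# Example: the expert with the lower predictive variance (closer cluster)
# dominates the merged prediction, mirroring the behaviour described above.
mean, var = rbcm_merge(means=[0.2, 1.0], variances=[0.1, 0.8], prior_var=1.0)
```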

This concludes our description of the predictive pipeline. We now go on to evaluate our pipeline experimentally.

We evaluate the root mean square error (RMSE) of the prediction for each method in a leave-one-out cross-validation over the time series in our dataset. We apply a Markov assumption, thus only considering the end points of all time series. More specifically, we denote the current test time series as $x^0_1, \ldots, x^0_T$, the training time series as $\{x^j_1, \ldots, x^j_{T_j}\}_{j=1,\ldots,N}$, the predicted affine coefficients for point $x^0_{t_0}$ as $\vec\alpha_{t_0} = (\alpha^1_{t_0,1}, \ldots, \alpha^N_{t_0,T_N}, \alpha^0_{t_0})$, and the matrix of squared pairwise distances (including the test data points) as $D^2$. Accordingly, the RMSE for each fold has the following form (resulting from Theorem 2.3):

$$E = \sqrt{\frac{1}{T-1} \sum_{t_0=1}^{T-1} \left( \sum_{j=1}^{N} \sum_{t=1}^{T_j} \alpha^j_{t_0,t} \, d(x^j_t, x^0_{t_0+1})^2 + \alpha^0_{t_0} \, d(x^0_{t_0}, x^0_{t_0+1})^2 - \frac{1}{2} \vec\alpha_{t_0}^\top D^2 \vec\alpha_{t_0} \right)} \qquad (5.2)$$

We evaluate our four regression models, namely one-nearest neighbor regression (1-NN), kernel regression (KR), Gaussian process regression (GPR), and the robust Bayesian committee machine (rBCM), as well as the identity function as a baseline, i.e. we predict the current point as the next point.
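As a sketch of how Equation 5.2 can be evaluated purely from pairwise distances, the following hypothetical helpers compute the squared error of an affine combination via the identity of Theorem 2.3 and average it over one test series; all function and argument names are illustrative.

```python
import numpy as np

def affine_prediction_sq_error(alpha, d_sq_to_target, D_sq):
    """Squared distance between an affine combination of points (coefficients
    alpha, summing to one) and a target point, computed from distances only
    (cf. Theorem 2.3): sum_i alpha_i d(x_i, target)^2 - 1/2 alpha^T D^2 alpha.

    alpha:          (n,) affine coefficients, one per point covered by D_sq
                    (zero for points that receive no weight).
    d_sq_to_target: (n,) squared distances d(x_i, target)^2.
    D_sq:           (n, n) squared pairwise distances among the same points.
    """
    return float(alpha @ d_sq_to_target - 0.5 * alpha @ D_sq @ alpha)

def fold_rmse(alphas, d_sq_to_targets, D_sq):
    """RMSE of one leave-one-out fold as in Equation 5.2: average the squared
    errors of the predictions for x^0_{t0+1}, t0 = 1, ..., T-1, take the root.

    alphas:          (T-1, n) predicted affine coefficient vectors.
    d_sq_to_targets: (T-1, n) squared distances of all points to the true
                     next point x^0_{t0+1} of each step.
    """
    errors = [affine_prediction_sq_error(a, d, D_sq)
              for a, d in zip(alphas, d_sq_to_targets)]
    return float(np.sqrt(np.mean(errors)))
```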

We optimized the hyperparameters for all methods using a random search with 10 random trials (Bergstra and Bengio 2012). In particular, given the average distance d̄ in the training data, we drew the radial basis function bandwidth ξ from a uniform distribution in the range [0.05 · d̄, d̄] for the theoretical datasets and fixed it to 0.3 · d̄ for the Java datasets to avoid the need for a new eigenvalue correction in each random trial. We drew σ̃ from an exponential distribution in the range [10⁻³ · d̄, d̄] for the theoretical and [10⁻² · d̄, d̄] for the Java datasets. We fixed the prior standard deviation σ_prior = d̄ for all datasets. In each trial of the random search, we evaluated the RMSE in a nested leave-one-out cross-validation over the training time series and chose the hyperparameters that corresponded to the lowest RMSE.
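A minimal sketch of one random-search trial under the ranges stated above; note that the draw for σ̃ below uses a log-uniform stand-in rather than the clipped exponential distribution of the thesis, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters(d_mean, theoretical=True):
    """Draw one random-search trial for the regression hyperparameters
    (cf. Bergstra and Bengio, 2012).

    d_mean: average pairwise distance in the training data (d-bar).
    """
    if theoretical:
        # bandwidth xi ~ Uniform[0.05 * d_mean, d_mean]
        xi = rng.uniform(0.05 * d_mean, d_mean)
    else:
        # fixed bandwidth for the Java data to avoid redoing the
        # eigenvalue correction in every trial
        xi = 0.3 * d_mean
    # noise standard deviation sigma-tilde: drawn log-uniformly between
    # 1e-3 * d_mean (theoretical) or 1e-2 * d_mean (Java) and d_mean;
    # a stand-in for the clipped exponential draw used in the thesis.
    low = (1e-3 if theoretical else 1e-2) * d_mean
    sigma_tilde = float(np.exp(rng.uniform(np.log(low), np.log(d_mean))))
    sigma_prior = d_mean   # prior standard deviation fixed to d-bar
    return xi, sigma_tilde, sigma_prior

trials = [sample_hyperparameters(d_mean=1.0) for _ in range(10)]  # 10 trials
```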

For rBCM we preprocessed the data via relational neural gas clustering with M = 100 clusters for all datasets. As this pre-processing could be applied before hyperparameter selection, the runtime overhead of clustering was negligible and we did not need to rely on the linear-time speedup described above, but could compute the clustering on the whole training dataset.
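For illustration, the following is a simplified sketch of batch relational neural gas operating directly on a squared distance matrix, using the same relational distance identity as Theorem 2.3; the initialization and annealing schedule are assumptions of this sketch and will differ from the actual implementation.

```python
import numpy as np

def relational_neural_gas(D_sq, n_clusters=100, n_epochs=50, seed=0):
    """Cluster data given only a squared distance matrix D_sq (n x n).

    Prototypes are represented as convex combinations of data points
    (coefficient matrix A with rows summing to one); point-to-prototype
    distances follow the relational identity
        d^2(x_i, w_k) = (D_sq @ a_k)[i] - 0.5 * a_k @ D_sq @ a_k.
    Returns the index of the closest prototype for every data point.
    """
    rng = np.random.default_rng(seed)
    n = D_sq.shape[0]
    A = rng.random((n_clusters, n))
    A /= A.sum(axis=1, keepdims=True)
    # annealed neighborhood range (assumed schedule)
    lambdas = np.geomspace(n_clusters / 2.0, 0.5, n_epochs)
    for lam in lambdas:
        # squared distances of every point (column) to every prototype (row)
        d2 = A @ D_sq - 0.5 * np.sum(A * (A @ D_sq), axis=1, keepdims=True)
        # rank of each prototype per data point (0 = closest prototype)
        ranks = np.argsort(np.argsort(d2, axis=0), axis=0)
        H = np.exp(-ranks / lam)
        # batch update of the prototype coefficients
        A = H / H.sum(axis=1, keepdims=True)
    d2 = A @ D_sq - 0.5 * np.sum(A * (A @ D_sq), axis=1, keepdims=True)
    return np.argmin(d2, axis=0)
```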

Our experimental hypotheses are that all prediction methods should yield a lower RMSE compared to the identity baseline (H1), that rBCM should outperform 1-NN and KR (H2), and that rBCM should not be significantly worse than GPR (H3). To evaluate significance, we use a Wilcoxon signed-rank test.
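The significance test itself is a standard paired test; a minimal sketch with SciPy follows, where the RMSE values are illustrative and the choice of a one-sided alternative is an assumption.

```python
from scipy.stats import wilcoxon

# Paired per-fold RMSE values for two methods (illustrative numbers only).
rmse_rbcm     = [0.06, 0.05, 0.07, 0.06, 0.05]
rmse_baseline = [0.14, 0.13, 0.15, 0.14, 0.12]

# One-sided Wilcoxon signed-rank test: does rBCM yield lower errors?
stat, p_value = wilcoxon(rmse_rbcm, rmse_baseline, alternative="less")
```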

Theoretical Data Sets

We investigate the following theoretical datasets:

Barabási-Albert model: A simple stochastic model of graph growth in undirected graphs (Barabási and Albert 1999). The growth process starts with a fully connected initial graph of m0 nodes and adds m − m0 nodes one by one. Each newly added node is connected to k of the existing nodes. The existing nodes are randomly selected with probability $P(u) = \deg_t(u) / \sum_v \deg_t(v)$, where $\deg_t$ is the node degree at time t, i.e. $\deg_t(v) = \sum_u \rho((u,v),t)$. We generated time series data using this model by treating the graph after every newly generated node as a new entry of the time series.

Figure 5.3: An excerpt of a time series resulting from the Barabási-Albert model (panels t = 1 to t = 4). From left to right, the model starts with a fully connected graph with m0 = 3 nodes and then grows, one node at a time, where each new node is connected with k = 2 new edges to the existing nodes. New edges preferentially attach to nodes with a high degree.

[Figure panels: blinker (t = 1, 2), beacon (t = 1, 2), toad (t = 1, 2), clock (t = 1, 2), glider (t = 1, 2, 3, 4)]

Figure 5.4: The standard patterns used for the Game of Life dataset, except for the block and glider pattern. All unique states of the patterns are shown. Note that the state of the glider at t = 3 equals the state at t = 1 up to rotation.

In particular, we generated 20 time series, each starting with a fully connected graph with m0 = 3 nodes and then growing, one node at a time, to a total of m = 27 nodes with k = 2 new edges per node. This resulted in 500 graphs overall. Also refer to Figure 5.3 for an illustration of the growth process.
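A minimal sketch of the data generation described above; whether the k attachment targets are drawn with or without replacement, and the random-number handling, are assumptions of this sketch.

```python
import random

def barabasi_albert_series(m0=3, m=27, k=2, seed=None):
    """Generate one time series of growing undirected graphs: start from a
    fully connected graph with m0 nodes and attach each new node to k
    existing nodes, chosen proportionally to their degree.
    Returns a list of edge sets, one per time step."""
    rng = random.Random(seed)
    nodes = list(range(m0))
    edges = {(u, v) for u in range(m0) for v in range(u + 1, m0)}
    series = [set(edges)]
    for new in range(m0, m):
        # current node degrees
        degree = {u: 0 for u in nodes}
        for (u, v) in edges:
            degree[u] += 1
            degree[v] += 1
        # preferential attachment: sample k distinct targets by degree
        targets = set()
        while len(targets) < k:
            r = rng.uniform(0, sum(degree.values()))
            acc = 0.0
            for u, d in degree.items():
                acc += d
                if r <= acc:
                    targets.add(u)
                    break
        edges |= {(u, new) for u in targets}
        nodes.append(new)
        series.append(set(edges))
    return series

series = [barabasi_albert_series(seed=i) for i in range(20)]  # 20 series, 500 graphs
```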

Conway's Game of Life: John Conway's Game of Life (Gardner 1970) is a simple, two-dimensional cellular automaton model. Nodes are ordered in a regular, two-dimensional grid and connected to their eight neighbors in the grid. Let N(v) denote this eight-neighborhood in the grid. Then, we can describe Conway's Game of Life with the following sequential dynamical system for the node presence function ψ and the edge presence function ρ, respectively:

$$\psi(v,t) = \begin{cases} 1 & \text{if } 5 \le \psi(v,t-1) + 2 \cdot \sum_{u \in N(v)} \psi(u,t-1) \le 7 \\ 0 & \text{otherwise} \end{cases} \qquad (5.3)$$

$$\rho((u,v),t) = \begin{cases} 1 & \text{if } \psi(u,t) = 1 \wedge \psi(v,t) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$
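A minimal sketch of one update step according to Equation 5.3 on a dense 0/1 grid, together with the edge construction of Equation 5.4; the blinker initialization is illustrative and the random activation noise used in the experiments is omitted here.

```python
import numpy as np

def game_of_life_step(psi):
    """One update of the node presence function psi (Equation 5.3): a cell
    is alive at time t iff 5 <= psi(v,t-1) + 2 * (alive neighbors) <= 7."""
    h, w = psi.shape
    padded = np.pad(psi, 1)  # zero border for the eight-neighborhood
    neighbors = np.zeros_like(psi)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) != (0, 0):
                neighbors = neighbors + padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    score = psi + 2 * neighbors
    return ((score >= 5) & (score <= 7)).astype(int)

def edges_from_state(psi):
    """Edge presence (Equation 5.4): two grid neighbors are connected
    iff both cells are alive; each undirected edge is listed once."""
    alive = set(zip(*np.nonzero(psi)))
    return {((i, j), (i + di, j + dj))
            for (i, j) in alive
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and (i + di, j + dj) in alive
            and (i, j) < (i + di, j + dj)}

grid = np.zeros((20, 20), dtype=int)
grid[10, 9:12] = 1                      # place a blinker
series = [grid]
for _ in range(9):                      # T = 10 states per time series
    grid = game_of_life_step(grid)
    series.append(grid)
```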

graph: an example graph with nodes 1, 2, 3, 4

shortest path length matrix:

u \ v   1   2   3   4
1       0   1   1   2
2       2   0   3   1
3       ∞   ∞   0   ∞
4       1   2   2   0

features:

length   count
∞        3
1        4
2        4
3        1

Figure 5.5: An example graph, the associated matrix of shortest path lengths as returned by the Floyd-Warshall algorithm (Floyd 1962), and the histogram over path lengths used as feature representation for our approach. Note that self-distances are ignored.

Table 5.1: The mean RMSE and runtime across cross-validation trials for both theoretical datasets (columns) and all methods (rows). The standard deviation is shown in brackets. Runtime entries with 0.000 had a shorter runtime (and standard deviation) than 10⁻³ milliseconds. The best (lowest) value in each column is highlighted by bold print.

            Barabási-Albert                     Game of Life
method      RMSE           runtime [ms]         RMSE           runtime [ms]
identity    0.137 (0.005)  0.000 (0.000)        1.199 (0.455)  0.000 (0.000)
1-NN        0.073 (0.034)  0.111 (0.017)        1.191 (0.442)  0.112 (0.025)
KR          0.095 (0.039)  0.122 (0.016)        0.986 (0.398)  0.120 (0.040)
GPR         0.064 (0.028)  0.148 (0.022)        0.965 (0.442)  0.127 (0.026)
rBCM        0.062 (0.015)  0.312 (0.083)        0.967 (0.461)  0.267 (0.077)

Note that Conway's Game of Life is Turing-complete and its evolution is, in general, unpredictable without computing every single step according to the rules (Adamatzky 2002). We created 30 time series by initializing a 20×20 grid with one of six standard patterns at a random position, namely blinker, beacon, toad, clock, glider, and block and glider (see Figure 5.4). The first four patterns are simple oscillators with a period of two, the glider is an infinitely moving structure with a period of two (up to rotation), and the block and glider is a chaotic structure which converges to a block of four and a glider after 105 steps¹. We let the system run for T = 10 time steps, resulting in 300 graphs overall.

In every step, we further activated 5% of the cells at random, simulating observational noise.

As data representation for both theoretical datasets we use an explicit feature embedding inspired by the shortest-path kernel of Borgwardt and Kriegel (2005). In particular, we compute the pairwise shortest paths between all nodes in the graph via the Floyd-Warshall algorithm (Floyd 1962) and then use the histogram over the lengths of these shortest paths as features. Figure 5.5 displays the feature computation for an example graph. We use the standard Euclidean distance on these features as our graph distance and normalize this distance by the average distance across the dataset. We obtained a kernel via the radial basis function transformation from Equation 2.44.
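A minimal sketch of the feature computation and kernel construction described above; the maximum path length max_len and the exact constants of the radial basis function in Equation 2.44 are assumptions of this sketch.

```python
import numpy as np

def shortest_path_histogram(adj, max_len):
    """Histogram over pairwise shortest path lengths (Floyd-Warshall),
    ignoring self-distances; unreachable pairs go into a separate
    'infinity' bin, as in Figure 5.5.

    adj: (n, n) adjacency matrix (nonzero entry = edge)."""
    n = adj.shape[0]
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):  # Floyd-Warshall relaxation over intermediate node k
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    off_diag = dist[~np.eye(n, dtype=bool)]
    counts = [np.sum(off_diag == l) for l in range(1, max_len + 1)]
    counts.append(np.sum(np.isinf(off_diag)))  # infinity bin
    return np.array(counts, dtype=float)

def rbf_kernel_from_distances(D, xi):
    """Radial basis function transform of a distance matrix (a common form;
    the exact Equation 2.44 of the thesis may differ in its constants)."""
    return np.exp(-D ** 2 / (2 * xi ** 2))
```

The graph distance is then the Euclidean distance between such histograms, normalized by the average pairwise distance across the dataset, as stated above.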

The RMSE and runtimes for the two theoretical datasets are shown in Table 5.1.

As expected, KR, GPR, and rBCM outperform the identity baseline (p < 10⁻³ for both datasets), supporting H1.

¹ Also refer to the Life Wiki (http://conwaylife.com/wiki/) for more information on the patterns.

Table 5.2: The mean RMSE and runtime across cross-validation trials for both Java datasets (columns) and all methods (rows). The standard deviation is shown in brackets. Runtime entries with 0.000 had a shorter runtime (and standard deviation) than 10⁻³ seconds. The best (lowest) value in each column is highlighted by bold print.

            MiniPalindrome                       Sorting
method      RMSE           runtime [s]           RMSE           runtime [s]
identity    0.295 (0.036)  0.000 (0.000)         0.391 (0.029)  0.000 (0.000)
1-NN        0.076 (0.047)  0.000 (0.000)         0.090 (0.042)  0.000 (0.000)
KR          0.115 (0.031)  1.308 (0.171)         0.112 (0.027)  1.979 (0.231)
GPR         0.075 (0.064)  111.417 (0.304)       0.020 (0.034)  114.394 (0.301)
rBCM        0.044 (0.052)  11.698 (0.085)        0.010 (0.025)  18.5709 (0.121)

1-NN outperforms the baseline only on the Barabási-Albert dataset (p < 10⁻³). Also, our results lend support to H2, as rBCM outperforms 1-NN on both datasets (p < 0.05 for Barabási-Albert and p < 0.01 for Conway's Game of Life). However, rBCM is significantly better than KR only for the Barabási-Albert dataset (p < 0.001), indicating that for simple datasets such as our theoretical ones, KR might already provide sufficient predictive quality. Finally, we do not observe a significant difference between rBCM and GPR, as expected in H3. Interestingly, for these datasets, rBCM is slower than GPR, which is explained by the overhead of maintaining multiple models.

Java Programs

Our two real-world Java datasets are MiniPalindrome and Sorting from Section 4.2. The motivation for time series prediction on such data is to help students achieve a correct solution in an intelligent tutoring system (ITS). In such an ITS, students incrementally work on their programs until they get stuck and do not know how to proceed. Then, we would like to predict the most likely next state of their program, given the time series of other students who have already correctly solved the problem; a setting that we will investigate in more detail in Chapter 6.

Note that our datasets only contain final, working versions of the programs. We simulated the graph growth as follows. First, we represented the programs as abstract syntax trees and then recursively removed the last node that opened a new scope in the Java program, until the abstract syntax tree was entirely deleted. Reversing this deletion process results in time series of a growing program. In particular, we thus obtained 834 syntax trees for the MiniPalindrome dataset and 800 trees for the Sorting dataset, respectively. As a distance, we employed the learned affine edit distance from Section 3.2 and obtained a kernel via the radial basis function in Equation 2.44 and the clip eigenvalue correction as described in the previous sections.

We show the RMSEs and runtimes for both Java datasets in Table 5.2. In line with H1, 1-NN, KR, GPR, and rBCM all outperform the identity baseline (p < 0.01 in all cases).

Further, rBCM outperforms both 1-NN and KR (p < 0.01 in all cases), which supports H2.

Interestingly, rBCM apparently achieves better results than GPR, which might be due to the additional smoothing provided by the averaging operation over all cluster-wise GPR results. This result supports H3. Finally, we observe that rBCM is about 10 times faster than GPR on these data.
