
Ablation Studies

In our ablation studies, we examined the differences between GESL and BEDL in more detail. In particular, we tested the following design choices:

1. classic GESL (G1),

2. GESL using co-optimal frequency matrices instead of a single tree mapping matrix (G2),

3. GESL using co-optimal frequency matrices and the prototypes from MGLVQ as neighbors N+ and N− (G3),

4. LVQ tree edit distance learning, directly learning the cost function parameters instead of an embedding, with a pseudo-metric normalization after each gradient step (L1), and

5. BEDL as proposed (L2).

Note that, for the ablation studies, we reused the hyper-parameters that were optimal for the reference versions of the methods (G1 and L2).

Figure 4.4 shows the average classification error and standard deviation (as error bars) for all tree-structured datasets and the string dataset, both for the pseudo-edit distance as in Equation 4.2 and for the actual tree edit distance using the learned cost function.

We observe that using co-optimal frequency matrices (G2) and MGLVQ prototypes instead of ad-hoc nearest neighbors (G3) improved GESL on the MiniPalindrome dataset, worsened it on the strings dataset, and otherwise made no remarkable difference on the Sorting, Cystic, and Leukemia datasets.

Regarding the LVQ tree edit distance learning variants L1 and L2, we note that BEDL improved the error for the actual tree edit distance but worsened the result for the pseudo-edit distance.

In general, GESL variants performed better for the pseudo-edit distance than for the actual tree edit distance, whereas LVQ variants performed better for the actual tree edit distance than for the pseudo-edit distance.

[Figure 4.4 appears here: one row of plots per dataset (Strings, MiniPalindrome, Sorting, Cystic, Leukemia), each showing the average error (y-axis, 0 to 0.5) for the design choices G1 to L2 (x-axis), with one line per classifier (KNN, MGLVQ, SVM, goodness).]

Figure 4.4: Ablation results for all tree-structured datasets and the strings dataset. Each row of the figure shows the results for one dataset. The left column shows the results for the pseudo-edit distance, the right column for the actual tree edit distance. The x-axis in each plot displays the different design choices as described in the text (from G1 to L2), the y-axis displays the mean classification error after metric learning, averaged across cross-validation trials, with error bars displaying the standard deviation. The different lines in each plot display the different classifiers used for evaluation.

In summary, our evaluation covered syntax trees of computer programs, tree-based molecule representations from a biomedical task, and syntax trees in natural language processing.

Now that we have developed methods to obtain viable edit distances for various cases of structured data, our next challenge is to utilize these edit distances for downstream predictive tasks. We have already demonstrated our ability to perform classification. In the next chapter, we cover time series prediction.

5 Time Series Prediction for Structured Data

Summary: Graph theory is a flexible and general formalism providing rich models in various important domains, such as distributed computing, intelligent tutoring systems, or social network analysis. In many cases, such models need to take changes in the graph structure into account, that is, changes in the number of nodes or in the graph connectivity. Predicting such changes within graphs can be expected to yield insight with respect to the underlying dynamics, e.g. with respect to user behavior. However, predictive techniques in the past have almost exclusively focused on single edges or nodes. In this chapter, we attempt to predict the future state of a graph as a whole.

Using the theory of pseudo-Euclidean and kernel embeddings outlined in Section 2.1, we propose to phrase time series prediction as a regression problem in an implicit vectorial space. Under this perspective, we can perform time series prediction via non-parametric regression techniques, such as 1-nearest neighbor regression, kernel regression, or Gaussian process regression. The output of the regression is another point in the implicit space, which can be subsequently processed using distance-based or kernel techniques.

We evaluate our approach on two well-established theoretical models of graph evolution as well as two real datasets from the domain of intelligent tutoring systems. We find that simple regression methods, such as kernel regression, are sufficient to capture the dynamics in the theoretical models, but that Gaussian process regression significantly improves the prediction error for real-world data.

Publications: This chapter is based on the following publications.

• Paaßen, Benjamin, Christina Göpfert, and Barbara Hammer (2016). “Gaussian process prediction for time series of structured data”. In: Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2016). (Bruges, Belgium). Ed. by Michel Verleysen. i6doc.com, pp. 41–46. URL: http://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2016-109.pdf.

• — (2018). “Time Series Prediction for Graphs in Kernel and Dissimilarity Spaces”. In: Neural Processing Letters 48.2, pp. 669–689. DOI: 10.1007/s11063-017-9684-5.

Source Code: The MATLAB(R) source code is available at http://doi.org/10.4119/unibi/2913104.

Graphs provide an ideal theoretical framework to model connective structure between entities, for example traffic connections between and within cities (Papageorgiou 1990), data lines between computing nodes (Casteigts et al. 2012), communication between people in social networks (Liben-Nowell and Kleinberg 2007), or the structure of a student’s solution to a learning task in an intelligent tutoring system (Mokbel, Gross, et al. 2013, also refer to Chapter 6). However, a static view of graphs is seldom sufficient.

In all the previous examples, nodes as well as connections change significantly over time. In traffic graphs, the traffic load changes significantly over the course of a day, making optimal routing a time-dependent problem (Papageorgiou 1990); in distributed computing, the distribution of computing load and communication between machines crucially depends on the availability and speed of connections and the current load of the machines, which changes over time (Casteigts et al. 2012); in social networks or communication networks, new users may enter the network, old users may leave, and the interactions between users may change rapidly (Liben-Nowell and Kleinberg 2007); and in intelligent tutoring systems, students change their solution over time to get closer to a correct solution (Koedinger et al. 2013; Mokbel, Gross, et al. 2013, also refer to Chapter 6).

In all these cases it would be beneficial to predict the next state of the graph in question, because it provides the opportunity to optimize system behavior in light of possible future developments, for example by re-routing traffic, providing additional bandwidth where required, or by providing helpful hints to students.

Traditionally, predicting the future development based on knowledge of the past is the topic of time series prediction, which has wide-ranging applications in physics, sociology, medicine, engineering, finance, and other fields (Sapankevych and Sankar 2009; Shumway and Stoffer 2013). However, classic models in time series prediction such as ARIMA, NARX, Kalman filters, recurrent neural networks, or reservoir models focus on vectorial data representations and thus are not equipped to handle time series of graphs (Shumway and Stoffer 2013). Accordingly, past work on predicting changes in graphs has focused on simpler sub-problems that can be phrased as vectorial prediction problems, e.g. predicting the overall load in an energy network (A. Ahmad et al. 2014) or predicting the appearance of single edges in a social network (Liben-Nowell and Kleinberg 2007).

In this contribution, we develop an approach to address the time series prediction problem for graphs, which we frame as a regression problem with structured data as input and as output. Our approach has two key steps: First, we represent graphs via pairwise distances or kernel values, which are well-researched in the scientific literature (refer to Section 2.2). This representation implicitly embeds the discrete set of graphs in a continuous vectorial space (refer to Section 2.1). Second, within this space, we can apply non-parametric regression methods, such as nearest neighbor regression, kernel regression (Nadaraya 1964), or Gaussian processes (Rasmussen and Williams 2005) to predict the next position in the kernel space given the current position. Note that this does not provide us with the graph that corresponds to the predicted point in the kernel space.

Indeed, identifying the corresponding graph in the primal space is a kernel pre-image problem that is in general hard to solve (Bakır, Weston, and Schölkopf 2003; Bakır, Zien, and Tsuda 2004; Kwok and I. W.-H. Tsang 2004). However, we will show that this data point can still be analyzed with subsequent kernel- or distance-based methods.
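To make the two-step pipeline concrete, consider the following minimal sketch of Nadaraya-Watson kernel regression operating purely on a precomputed kernel matrix. The function names (`kernel_regression_predict`, `sq_dist_to_prediction`) and the Gaussian weighting are illustrative assumptions, not the implementation evaluated in this chapter. The predicted point is never constructed explicitly; it exists only as a coefficient vector over observed successor embeddings, and its distance to any graph is evaluated via the kernel expansion.

```python
import numpy as np

def kernel_regression_predict(K, t, bandwidth=1.0):
    """Nadaraya-Watson prediction of the embedding of g_{t+1}.

    K is the (T x T) kernel matrix over observed graphs g_0, ..., g_{T-1};
    the pairs (g_i, g_{i+1}) for i < t serve as training examples. The
    returned vector alpha represents the predicted point
    sum_i alpha[i] * phi(g_{i+1}) in the implicit feature space."""
    idx = np.arange(t)
    # squared feature-space distances ||phi(g_t) - phi(g_i)||^2
    d2 = K[t, t] - 2.0 * K[t, idx] + K[idx, idx]
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian weights
    return w / w.sum()

def sq_dist_to_prediction(K, alpha, j):
    """Squared feature-space distance between phi(g_j) and the predicted
    point, computed entirely from kernel values (no explicit embedding)."""
    succ = np.arange(1, len(alpha) + 1)        # indices of the successors
    return (K[j, j] - 2.0 * alpha @ K[succ, j]
            + alpha @ K[np.ix_(succ, succ)] @ alpha)
```

Because the prediction exists only as the coefficient vector, recovering an actual graph would require solving the pre-image problem; for distance-based post-processing, `sq_dist_to_prediction` suffices.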

A drawback of GPR is its cubic computational complexity in the number of data points due to a kernel matrix inversion. Fortunately, Deisenroth and Ng (2015) have developed a simple strategy to permit predictions in linear time, namely distributing the prediction to multiple Gaussian processes, each of which handles only a constant-sized subset of the data.
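The following sketch illustrates the distribution idea in the spirit of Deisenroth and Ng (2015) via a plain product-of-experts combination: each expert is an exact GP on a small data chunk, and predictions are merged by precision weighting. Scalar inputs, the RBF kernel, the hyper-parameters, and all function names are illustrative assumptions rather than the realization used in this chapter.

```python
import numpy as np

def gp_fit(X, y, ls=1.0, noise=1e-2):
    """Exact GP regression on one chunk (RBF kernel, unit prior variance)."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / ls) ** 2)
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, L, alpha

def gp_predict(model, xstar, ls=1.0):
    """Posterior mean and variance of one expert at a test input."""
    X, L, alpha = model
    kstar = np.exp(-0.5 * ((xstar - X) / ls) ** 2)
    mean = kstar @ alpha
    v = np.linalg.solve(L, kstar)
    return mean, max(1.0 - v @ v, 1e-12)

def poe_predict(models, xstar):
    """Product-of-experts: precision-weighted combination of all experts.
    Each expert costs O(c^3) for chunk size c, so total cost is linear
    in the number of chunks."""
    means, precs = zip(*[(mu, 1.0 / var)
                         for mu, var in (gp_predict(m, xstar) for m in models)])
    prec = sum(precs)
    return sum(p * mu for p, mu in zip(precs, means)) / prec, 1.0 / prec
```

Experts far away from the test input contribute little precision, so the combined mean is dominated by the experts whose chunks cover the query region.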

The key contributions of our work are the following. First, we provide an integrative overview of research on time-varying graphs. Second, we provide a novel scheme for time series prediction in pseudo-Euclidean and kernel spaces. This scheme is compatible with explicit vectorial embeddings, as are provided by some graph kernels (Borgwardt and Kriegel 2005; Aiolli, Martino, and Sperduti 2015; Bacciu, Errica, and Micheli 2018), but does not require such a representation. Third, we discuss how the predictive result, which is a point in an implicit kernel feature space, can be analyzed using subsequent

Figure 5.1: An example of a time-varying graph modeling a public transportation graph, drawn for three points in time: night time (left), the early morning (middle), and mid-day (right). Present edges or nodes are drawn as solid lines, non-present edges or nodes are drawn as dashed lines.

kernel- or distance-based methods. Fourth, we provide an efficient realization of our prediction pipeline for Gaussian processes in linear time. Finally, we evaluate our proposed approaches on two theoretical and two practical data sets.

5.1 Background and Related Work

Time-varying graphs are relevant in many different fields, such as traffic (Papageorgiou 1990), distributed computing (Casteigts et al. 2012), social networks (Liben-Nowell and Kleinberg 2007), or intelligent tutoring systems (Koedinger et al. 2013; Mokbel, Gross, et al. 2013). Due to the breadth of the field, we focus here on relatively general concepts that can be applied to a wide variety of domains.

Models of Graph Dynamics

Time-Varying Graphs: Time-varying graphs have been introduced by Casteigts et al. (2012) in an effort to integrate different notations found in the fields of delay-tolerant networks, opportunistic-mobility networks, and social networks. The authors note that changes in graphs for these domains should not be regarded as anomalies, but rather as an “integral part of the nature of the system” (Casteigts et al. 2012). Here, we present a slightly simplified version of the notation developed in their work.

Definition 5.1 (Time-Varying Graph (Casteigts et al. 2012)). A time-varying graph is defined as a five-tuple G = (V, E, T, ψ, ρ) where

• V is an arbitrary set called nodes,

• E ⊆ V × V is a set of node tuples called edges,

• T = {t ∈ N | t0 ≤ t ≤ T} for some t0, T ∈ N is called the lifetime of the graph,

• ψ : V × T → {0, 1} is called the node presence function, and a node x is called present at time t if and only if ψ(x, t) = 1, and

• ρ : E × T → {0, 1} is called the edge presence function, and an edge e is called present at time t if and only if ρ(e, t) = 1.
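Definition 5.1 maps naturally onto a small data structure. The following sketch is an illustrative encoding (the class and attribute names are our own, not notation from Casteigts et al.): presence functions are stored as sets of (element, time) pairs, and everything not listed is absent.

```python
from dataclasses import dataclass, field

@dataclass
class TimeVaryingGraph:
    """Minimal encoding of Definition 5.1."""
    nodes: set
    edges: set            # subset of nodes x nodes
    lifetime: range       # discrete lifetime {t0, ..., T}
    node_presence: set = field(default_factory=set)   # pairs (v, t)
    edge_presence: set = field(default_factory=set)   # pairs ((u, v), t)

    def psi(self, v, t):
        """Node presence function: 1 iff node v is present at time t."""
        return 1 if (v, t) in self.node_presence else 0

    def rho(self, e, t):
        """Edge presence function: 1 iff edge e is present at time t."""
        return 1 if (e, t) in self.edge_presence else 0
```

For instance, the transportation scenario of Figure 5.1 would list all stations as present at every time step, while each train connection is present only at the times its line is active.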

In Figure 5.1, we show an example of a time-varying graph modeling the connectivity in a simple public transportation graph over the course of a day. In this example, nodes model stations and edges model train connections between stations. In the night (left), all nodes may be present but no edges, because no lines are active yet. During the early morning (middle), some lines become active while others remain inactive. Finally, in mid-day (right), all lines are scheduled to be active, but due to a disturbance (e.g. construction work), a station is closed and all adjacent connections become unavailable.

Note that the concept of time-varying graphs generally assumes all nodes and edges to be known in advance. In domains where that is not the case, one can frame the underlying graph as a fully connected graph with infinitely many nodes, from which only a finite subset is present at any given time.

Using the notion of a presence function, we can generalize many interesting concepts from classic graph theory to a dynamic version. In particular, we can define the temporal subgraph Gt of graph G at time t as the graph of all nodes and edges of G that are present at time t, that is, Gt := (Vt, Et) where

Vt := {v ∈ V | ψ(v, t) = 1},   Et := {(u, v) ∈ E | ρ((u, v), t) = 1}   (5.1)

Further, we can define the neighborhood of a node u ∈ Vt at time t as the set of nodes Nt(u) := {v ∈ Vt | (u, v) ∈ Et}; we can define a path between u ∈ Vt and v ∈ Vt at time t as a sequence of nodes v0, . . . , vK ∈ Vt such that v0 = u, vK = v, and for all k ∈ {1, . . . , K} it holds that (vk−1, vk) ∈ Et; and we can call two nodes u ∈ Vt and v ∈ Vt connected at time t if a path between them exists at time t.
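These notions translate directly into code. The sketch below (with illustrative helper names, treating the presence functions as callables) computes the temporal subgraph of Equation 5.1 and decides connectivity at time t by breadth-first search over present edges only.

```python
from collections import deque

def temporal_subgraph(V, E, psi, rho, t):
    """Equation (5.1): the nodes and edges of G that are present at time t."""
    Vt = {v for v in V if psi(v, t) == 1}
    Et = {(u, v) for (u, v) in E if rho((u, v), t) == 1}
    return Vt, Et

def neighborhood(Vt, Et, u):
    """N_t(u): all nodes reachable from u via one present edge."""
    return {v for v in Vt if (u, v) in Et}

def connected(Vt, Et, u, v):
    """True iff a path from u to v exists in the temporal subgraph,
    found by breadth-first search."""
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in neighborhood(Vt, Et, x) - seen:
            seen.add(y)
            queue.append(y)
    return False
```

Note that connectivity is evaluated per time step: two nodes may be connected at one time t and disconnected at another.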

Note that we have assumed discrete time in our definition of a time-varying graph. This is justified by the following consideration. Even if time is continuous, changes to the graph take the form of discrete value changes in the node or edge presence function, because a presence function can only take the values 0 or 1. Let us call such discrete change points events. Assuming that there are only finitely many such events, we can write all events in the lifetime of a graph as an ascending sequence t1, . . . , tT. Accordingly, all changes in the graph are fully described by the sequence of temporal subgraphs Gt1, . . . , GtT (Casteigts et al. 2012; Scherrer et al. 2008). Therefore, even time-varying graphs defined on continuous time can be fully described by considering the discrete lifetime {1, . . . , T}.

Sequential Dynamical Systems: Sequential dynamical systems (SDS) have been introduced by Barrett, Mortveit, and Reidys (2000) as a generalization of cellular automata to arbitrary neighborhood structures. In essence, SDSs assign a binary state ψ(x, t) to each node x in a static graph G = (V, E). This state is updated according to a transition function fx, which maps the current states of the node and all of its neighbors to the next state of the node x itself. This induces a discrete dynamical system on graphs (where edges and neighborhoods stay fixed) (Barrett, Mortveit, and Reidys 2000; Barrett, Mortveit, and Reidys 2003; Barrett and Reidys 1999). Interestingly, SDSs can be related to time-varying graphs by interpreting the binary state of a node x at time t as the value of its presence function ψ(x, t). Note that we can predict the future state of an SDS by simply executing the SDS transition function fx for all nodes x repeatedly. As such, SDSs provide elegant and compact models for time series prediction on graphs. Indeed, we use an SDS in our experimental section to compactly describe Conway’s Game of Life (Gardner 1970). Unfortunately, there are no learning schemes to date that can infer an SDS from data. Therefore, other predictive methods are required.
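A minimal sketch of the update mechanism, using Conway's Game of Life as the local transition function, is given below. Note one simplification: a proper SDS applies the local functions node by node in a fixed order, whereas this sketch uses the synchronous special case of classical cellular automata, which is what the Game of Life requires; all names are illustrative.

```python
def synchronous_step(neighbors, state, f):
    """One synchronous update: every node's next state is computed from its
    own current state and the current states of its neighbors."""
    return {x: f(state[x], [state[y] for y in neighbors[x]])
            for x in state}

def life_rule(s, neighbor_states):
    """Conway's Game of Life as the local transition function:
    a dead cell with exactly 3 live neighbors is born, a live cell
    with 2 or 3 live neighbors survives, everything else dies."""
    alive = sum(neighbor_states)
    return 1 if alive == 3 or (s == 1 and alive == 2) else 0
```

On a grid graph with 8-neighborhoods, iterating `synchronous_step` with `life_rule` reproduces the familiar Game of Life dynamics, e.g. the period-2 "blinker" oscillator.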

Predicting Changes in Graphs

To our knowledge, there does not yet exist a time series prediction method for graphs as a whole. However, ample prior work has focused on more specific predictive problems, namely the prediction of new edges and nodes.

Link Prediction: In the realm of social network analysis, Liben-Nowell and Kleinberg (2007) have formulated the link prediction problem, which can be stated as follows: Given a time series of temporal subgraphs G0, . . . , Gt for a time-varying graph G, which edges will be added to the graph in the next time step, i.e. for which edges do we find ρ(e, t) = 0 but ρ(e, t + 1) = 1? For example, given all past collaborations in a scientific community, can we predict new collaborations in the future?

The simplest approach to address this problem is to compute a similarity index s(u, v) between nodes (u, v) for which ρ((u, v), t) = 0, and to predict ρ((u, v), t + 1) = 1 if and only if s(u, v) exceeds a certain threshold (Liben-Nowell and Kleinberg 2007; Lichtenwalter, Lussier, and Chawla 2010). Typical similarity indices for this purpose include the number of common neighbors at time t, the Jaccard index at time t, or the Adamic-Adar index at time t (Liben-Nowell and Kleinberg 2007). A more recent approach is to train a classifier that predicts the value of the edge presence function ρ(e, t + 1) for all edges with ρ(e, t) = 0 using a vectorial feature representation of the edge e at time t, where features include the similarity indices discussed above (Lichtenwalter, Lussier, and Chawla 2010). In a survey, Lü and Zhou (2011) further list maximum-likelihood approaches on stochastic models and probabilistic relational models for link prediction.
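The threshold-based scheme can be sketched directly with these standard similarity indices. In the sketch below (function names are illustrative), the neighborhood structure at time t is given as a dict mapping each node to its current neighbor set.

```python
import math

def common_neighbors(N, u, v):
    """Number of shared neighbors of u and v."""
    return len(N[u] & N[v])

def jaccard(N, u, v):
    """Shared neighbors relative to the union of both neighborhoods."""
    union = N[u] | N[v]
    return len(N[u] & N[v]) / len(union) if union else 0.0

def adamic_adar(N, u, v):
    """Shared neighbors weighted by inverse log-degree; shared neighbors
    of degree 1 would yield log(1) = 0, so they are skipped."""
    return sum(1.0 / math.log(len(N[z]))
               for z in N[u] & N[v] if len(N[z]) > 1)

def predict_links(N, threshold, score=jaccard):
    """Predict rho((u, v), t+1) = 1 for every currently absent edge whose
    similarity index at time t reaches the threshold."""
    nodes = sorted(N)
    return {(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
            if v not in N[u] and score(N, u, v) >= threshold}
```

The classifier-based approach of Lichtenwalter et al. would use these scores not as direct decision criteria but as features of a vectorial edge representation.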

Growth models: In a seminal paper, Barabási and Albert (1999) described a simple model to incrementally grow an undirected graph node by node from a small, fully connected seed graph (also refer to the experimental section below). Since then, many other models of graph growth have emerged, most notably stochastic block models and latent space models (Clauset 2013; Goldenberg et al. 2010). Stochastic block models assign each node to a block and model the probability of an edge between two nodes only dependent on their respective blocks (Holland, Laskey, and Leinhardt 1983). Latent space models embed all nodes in an underlying, latent space and model the probability of an edge depending on the distance in this space (Hoff, Raftery, and Handcock 2002). Both classes of models can be used for link prediction as well as graph generation. Further, they can be trained with pre-observed data in order to provide more accurate models of the data. However, graph growth models have two severe drawbacks. First, they do not cover deletions of nodes or edges, and second, they typically cannot guarantee accurate predictions in detail, but only high-level properties, such as a certain edge degree distribution. As such, using growth models for time series prediction would likely yield unsatisfactory results.
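The preferential-attachment mechanism of Barabási and Albert can be sketched compactly; the repeated-stub list makes degree-proportional sampling trivial because each node appears in it once per incident edge. This is an illustrative sketch under our own conventions (seed size m + 1, function name, and edge representation), not the evaluation code of this chapter.

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow an undirected graph with n nodes: start from a fully connected
    seed of m + 1 nodes, then attach every new node to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = {(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)}
    # every node appears once per incident edge, so a uniform draw from
    # this list realizes degree-proportional (preferential) attachment
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        edges |= {(t, new) for t in targets}
        stubs += [v for t in targets for v in (t, new)]
    return edges
```

Note that the model only ever adds nodes and edges, which illustrates the first drawback mentioned above: deletions cannot be expressed.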

In the next section, we develop our own method to predict general changes in graphs.
