
Conclusion and Limitations


Figure 3.3: The weights $\vec{\lambda}$ after metric learning on the affine edit distance, normalized by their frequency in the dataset (features shown: type, scope, parent, codePosition, name, className, returnType, externalDependencies, numberOfEdges).

Most of the feature weights are reduced to zero, whereas the scope and the type feature are strongly emphasized. This makes intuitive sense, as the type feature is invariant against many stylistic choices and captures the local function of the current syntactic element, whereas the scope feature indicates the rough position of the current syntactic element in the tree. Less frequent elements, such as name, className, and returnType, can fulfill an auxiliary function and disambiguate special types of nodes, namely class declarations, method declarations, and variable declarations.

and corrected distances may differ significantly (refer, e.g., to Nebel, Kaden, et al. 2017).

Third, our proposed method is relatively slow because each gradient step requires us to compute the gradients of all pairwise distances in the data set, which is only feasible for small datasets and a small number of gradient steps. Fourth, there is no guarantee regarding the approximation quality of the softmin-approximated edit distance. While we have shown that a single softmin application approximates the actual minimum, approximation errors may accumulate over the computation, leading to higher errors for longer input sequences. Fifth, we are currently limited to sequence edit distances, which cannot directly process tree or graph data. In the next chapter, we address all of these limitations with an extended metric learning scheme that works on trees.

4 Tree Edit Distance Learning

Summary: Trees are versatile data structures, which can be used to model the syntax of natural and formal languages as well as biological data such as RNA secondary structures or glycan molecules. In all these cases, the tree edit distance offers an interpretable and actionable metric, which is useful for various downstream applications. However, the tree edit distance may be misleading if its parameters do not fit the task at hand. In this chapter, we present embedding edit distance learning (BEDL), an effective and scalable tree edit distance learning approach. In our evaluation on datasets from natural language processing, biology, and intelligent tutoring systems, we demonstrate that our method can not only improve the default tree edit distance, but can also outperform the state of the art.

Publications: This chapter is based on the following publications.

• Paaßen, Benjamin (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv:1805.06869 [cs.DS].

• Paaßen, Benjamin, Claudio Gallicchio, et al. (2018). “Tree Edit Distance Learning via Adaptive Symbol Embeddings”. In: Proceedings of the 35th International Conference on Machine Learning (ICML 2018). (Stockholm, Sweden). Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research, pp. 3973–3982. URL: http://proceedings.mlr.press/v80/paassen18a.html.

Source Code: The Java(R) source code for the tree edit distance and co-optimal frequency matrix computation is available at https://openresearch.cit-ec.de/projects/tcs.

The Java(R) source code for median generalized learning vector quantization is available at https://gitlab.ub.uni-bielefeld.de/bpaassen/median_relational_glvq.

The MATLAB(R) source code for tree edit distance learning is available at http://doi.org/10.4119/unibi/2919994.

Trees occur in many shapes across various application domains, such as syntax trees of natural language (Socher, Perelygin, et al. 2013), syntax trees of programming languages (Rivers and Koedinger 2015), or descriptors of RNA secondary structures and glycan molecules in biology (Akutsu 2010). In all these areas, the tree edit distance (Zhang and Shasha 1989) is a useful measure of distance, as it can support information retrieval and other downstream tasks (Akutsu 2010). For example, the tree edit distance has achieved increasing popularity in recent years in the field of intelligent tutoring systems for computer programming (Mokbel, Gross, et al. 2013; Gross, Mokbel, et al. 2014; Price, Dong, and Lipovac 2017; Rivers and Koedinger 2015). In such systems, the tree edit distance can pinpoint exactly which nodes in an abstract syntax tree of a student's current program have to be changed in order to arrive at a correct solution, and we can use these edits to guide a student through a programming task (Mokbel, Gross, et al. 2013; Gross, Mokbel, et al. 2014; Price, Dong, and Lipovac 2017; Rivers and Koedinger 2015).

As with sequence edit distances, the tree edit distance is only useful if its cost function fits the task at hand. Per default, the tree edit distance regards all possible tree nodes as equidistant, which may be misleading for all the domains above. In particular, natural language words may have overlapping semantics (Pennington, Socher, and Manning 2014; Socher, Perelygin, et al. 2013), glycan molecule descriptors may refer to biologically similar elements (Gallicchio and Micheli 2013), and syntactic nodes in computer programs may fulfill the same or similar functions (Paaßen, Jensen, and Hammer 2016), in which case the pairwise replacement costs should be lowered. This begs the question of how the tree edit distance can be adapted to better suit the domain and task at hand.

In this chapter, we propose a novel method to learn the tree edit distance that goes beyond the state of the art of good edit similarity learning (GESL, Bellet, Habrard, and Sebban 2012, also refer to Section 2.4.1) in several respects.

• We consider all co-optimal pairwise tree mappings instead of just one tree mapping by means of a novel forward-backward algorithm.

• We select reference pairs for metric learning in a principled fashion, based on learning vector quantization prototypes for each class instead of ad-hoc selection schemes.

• Most importantly, we learn an embedding of the syntactic elements of trees instead of direct cost parameters, thus guaranteeing metric properties and higher efficiency for large alphabets.

We call our resulting metric learning approach embedding edit distance learning (BEDL).

In our evaluation, we show that BEDL outperforms GESL in terms of classification accuracy across several classifiers and several real-world datasets from natural language processing, biology, and intelligent tutoring systems. We also demonstrate that BEDL can be scaled up to a large natural language processing dataset. In the next section, we describe BEDL in detail, before we continue to our experimental evaluation.

4.1 Method

Our aim is to adapt the cost function of the tree edit distance of Zhang and Shasha (1989) to improve the accuracy of a classifier based on this edit distance. More specifically, we intend to improve the accuracy of a median generalized learning vector quantization (MGLVQ) model. For our purposes, MGLVQ has two key advantages compared to RGLVQ, which we used in the previous chapter. First, MGLVQ does not require a Euclidean distance as input, such that we can avoid eigenvalue correction. Second, MGLVQ represents prototypes sparsely by setting each prototype to a single data point.

As such, we only need to consider linearly many distance values to optimize the GLVQ cost function in Equation 2.28, namely the data-to-prototype distances.

More precisely, assume that we wish to optimize the parameters $\vec{\lambda}$ of the cost function $c_{\vec{\lambda}}$ via gradient-based techniques on the GLVQ cost function $E_{\mathrm{GLVQ}}$ with the gradient from Equation 3.2. To compute this gradient, we require the tree edit distance gradients $\nabla_{\vec{\lambda}} d_i^+$ and $\nabla_{\vec{\lambda}} d_i^-$. Computing these gradients is our next step. To do so, we decompose the tree edit distance into a scalar product of tree mapping matrices and pairwise costs.
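Equations 2.28 and 3.2 are not restated in this chapter; assuming the common GLVQ form $E_{\mathrm{GLVQ}} = \sum_i \mu_i$ with $\mu_i = (d_i^+ - d_i^-)/(d_i^+ + d_i^-)$, the following minimal sketch (our own illustration, not code from the thesis) shows the chain-rule factors that the tree edit distance gradients are multiplied with:

    import numpy as np

    def glvq_cost_and_factors(d_plus, d_minus):
        # Assumed GLVQ cost E = sum_i mu_i with mu_i = (d_i+ - d_i-)/(d_i+ + d_i-),
        # together with the chain-rule factors dE/dd_i+ and dE/dd_i- that are
        # combined with the tree edit distance gradients.
        d_plus = np.asarray(d_plus, dtype=float)    # d_i+: distance to closest prototype of the same class
        d_minus = np.asarray(d_minus, dtype=float)  # d_i-: distance to closest prototype of a different class
        denom = d_plus + d_minus
        mu = (d_plus - d_minus) / denom
        dE_dplus = 2.0 * d_minus / denom ** 2       # pulls the correct prototype closer
        dE_dminus = -2.0 * d_plus / denom ** 2      # pushes the wrong prototype away
        return np.sum(mu), dE_dplus, dE_dminus

    E, f_plus, f_minus = glvq_cost_and_factors([0.4, 1.2], [0.9, 0.8])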

Co-Optimal Frequency Matrices

Our decomposition is similar to the GESL approach of Bellet, Habrard, and Sebban (2012, also refer to Section 2.4.1). However, in contrast to their approach, we consider not only a single tree mapping matrix, but an average of all co-optimal tree mapping matrices.

Recall that we denote the tree edit distance with respect to a cost function $c$ as $d_c$ (refer to Definition 2.12), the tree mapping edit distance as $D_c$ (refer to Definition 2.14), and the tree mapping matrix with respect to a tree mapping $M$ between two trees $\tilde x$ and $\tilde y$ as $P(M, \tilde x, \tilde y)$ (refer to Definition 2.16). We define the co-optimal frequency matrix as follows.

Definition 4.1 (Co-optimal Frequency Matrix). Let $\tilde x$ and $\tilde y$ be trees over some alphabet $\mathcal{A}$, let $M$ be a tree mapping between $\tilde x$ and $\tilde y$, and let $c$ be a cost function over $\mathcal{A}$. Further, let $\mathcal{M}(c, \tilde x, \tilde y)$ be the set of all co-optimal tree mappings between $\tilde x$ and $\tilde y$, i.e. all tree mappings $M$ such that $c(M, \tilde x, \tilde y) = D_c(\tilde x, \tilde y)$.

We define the co-optimal frequency matrix $P_c(\tilde x, \tilde y)$ as the $|\tilde x| \times |\tilde y|$ matrix

$$P_c(\tilde x, \tilde y) := \frac{\sum_{M \in \mathcal{M}(c, \tilde x, \tilde y)} P(M, \tilde x, \tilde y)}{|\mathcal{M}(c, \tilde x, \tilde y)|} \tag{4.1}$$

In other words, the co-optimal frequency matrix $P_c(\tilde x, \tilde y)$ is defined as the average tree mapping matrix $P(M, \tilde x, \tilde y)$ over all co-optimal tree mappings $M$. As an example, consider the trees $\tilde x = a(b(c,d),e)$ and $\tilde y = f(g)$. All co-optimal tree mappings between $\tilde x$ and $\tilde y$ according to default costs, along with the respective tree mapping matrices and the resulting co-optimal frequency matrix, are listed in Figure 4.1.

Using the concepts of tree mapping matrices and co-optimal frequency matrices, we obtain the following decomposition for the tree edit distance.

Theorem 4.1. Let $\tilde x$ and $\tilde y$ be trees over some alphabet $\mathcal{A}$ and let $c$ be a cost function over $\mathcal{A}$. Then, if $c$ is non-negative, self-equal, and conforms to the triangular inequality, it holds:

$$d_c(\tilde x, \tilde y) = \sum_{i=1}^{|\tilde x|} \sum_{j=1}^{|\tilde y|} P_c(\tilde x, \tilde y)_{i,j} \cdot c(x_i, y_j) + \sum_{i=1}^{|\tilde x|} \Bigl(1 - \sum_{j=1}^{|\tilde y|} P_c(\tilde x, \tilde y)_{i,j}\Bigr) \cdot c(x_i, -) + \sum_{j=1}^{|\tilde y|} \Bigl(1 - \sum_{i=1}^{|\tilde x|} P_c(\tilde x, \tilde y)_{i,j}\Bigr) \cdot c(-, y_j) \tag{4.2}$$

Proof. First, Theorem 2.5 implies that $d_c(\tilde x, \tilde y) = D_c(\tilde x, \tilde y)$.

Further, note that for any $M \in \mathcal{M}(c, \tilde x, \tilde y)$ it holds: $c(M, \tilde x, \tilde y) = D_c(\tilde x, \tilde y)$. Accordingly, we obtain:

$$d_c(\tilde x, \tilde y) = \frac{\sum_{M \in \mathcal{M}(c, \tilde x, \tilde y)} c(M, \tilde x, \tilde y)}{|\mathcal{M}(c, \tilde x, \tilde y)|}$$

By virtue of Equation 2.26 we obtain:

$$d_c(\tilde x, \tilde y) = \frac{1}{|\mathcal{M}(c, \tilde x, \tilde y)|} \cdot \sum_{M \in \mathcal{M}(c, \tilde x, \tilde y)} \Biggl[ \sum_{i=1}^{|\tilde x|} \sum_{j=1}^{|\tilde y|} P(M, \tilde x, \tilde y)_{i,j} \cdot c(x_i, y_j)$$
$$\qquad + \sum_{i=1}^{|\tilde x|} \Bigl(1 - \sum_{j=1}^{|\tilde y|} P(M, \tilde x, \tilde y)_{i,j}\Bigr) \cdot c(x_i, -) + \sum_{j=1}^{|\tilde y|} \Bigl(1 - \sum_{i=1}^{|\tilde x|} P(M, \tilde x, \tilde y)_{i,j}\Bigr) \cdot c(-, y_j) \Biggr]$$

Figure 4.1: All co-optimal tree mappings between the trees $\tilde x = a(b(c,d),e)$ and $\tilde y = f(g)$ according to default costs, the corresponding tree mapping matrices $P(M, \tilde x, \tilde y)$, and the resulting co-optimal frequency matrix $P_c(\tilde x, \tilde y)$. The six co-optimal mappings are $\{(1,1),(2,2)\}$, $\{(1,1),(3,2)\}$, $\{(1,1),(4,2)\}$, $\{(1,1),(5,2)\}$, $\{(2,1),(3,2)\}$, and $\{(2,1),(4,2)\}$; averaging their mapping matrices yields
$$P_c(\tilde x, \tilde y) = \begin{pmatrix} 2/3 & 0 \\ 1/3 & 1/6 \\ 0 & 1/3 \\ 0 & 1/3 \\ 0 & 1/6 \end{pmatrix}.$$

which can be re-written into Equation 4.2, which completes the proof.
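As a quick sanity check of Equation 4.2 (our own illustration, not part of the thesis), we can plug the co-optimal frequency matrix from Figure 4.1 into the decomposition and recover the default tree edit distance between $\tilde x = a(b(c,d),e)$ and $\tilde y = f(g)$, which is 5 (two relabelings and three deletions):

    import numpy as np

    # Co-optimal frequency matrix P_c(x, y) from Figure 4.1
    # (rows: nodes a, b, c, d, e of x; columns: nodes f, g of y).
    P = np.array([[2/3, 0.0],
                  [1/3, 1/6],
                  [0.0, 1/3],
                  [0.0, 1/3],
                  [0.0, 1/6]])

    # Default costs: every replacement, deletion, and insertion costs 1,
    # because no label of x also occurs in y.
    C_rep = np.ones(P.shape)      # C_rep[i, j] = c(x_i, y_j)
    c_del = np.ones(P.shape[0])   # c_del[i] = c(x_i, -)
    c_ins = np.ones(P.shape[1])   # c_ins[j] = c(-, y_j)

    # Equation 4.2: replacements + unmatched deletions + unmatched insertions.
    d = (np.sum(P * C_rep)
         + np.dot(1.0 - P.sum(axis=1), c_del)
         + np.dot(1.0 - P.sum(axis=0), c_ins))
    print(d)  # 5.0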

Note that it is not trivial to compute the co-optimal frequency matrix, because the set $\mathcal{M}(c, \tilde x, \tilde y)$ may have exponential size. To address this issue, we introduce a novel forward-backward algorithm.

Theorem 4.2. Let $\tilde x$ and $\tilde y$ be trees over some alphabet $\mathcal{A}$ and let $c$ be a cost function over $\mathcal{A}$ that conforms to the triangular inequality. Then, Algorithm 4.1 computes $P_c(\tilde x, \tilde y) \cdot |\mathcal{M}(c, \tilde x, \tilde y)|$ as first output and $|\mathcal{M}(c, \tilde x, \tilde y)|$ as second output. Further, Algorithm 4.1 runs in $O(|\tilde x|^6 \cdot |\tilde y|^6)$ time complexity and $O(|\tilde x|^2 \cdot |\tilde y|^2)$ space complexity in the worst case.

Proof. Refer to Appendix A.13.

Algorithm 4.1 An algorithm that computes the matrix $P_c(\tilde x, \tilde y)$ for two given trees $\tilde x$ and $\tilde y$ and a cost function $c$. More specifically, the first output argument is $P_c(\tilde x, \tilde y) \cdot |\mathcal{M}(c, \tilde x, \tilde y)|$, and the second output argument is $|\mathcal{M}(c, \tilde x, \tilde y)|$. For the forward and backward algorithm, refer to Appendix A.13.

1: function COOPTIMALS(two trees $\tilde x$ and $\tilde y$, the matrices $d$ and $D$ after executing Algorithm 2.1, and a cost function $c$)
2:   $(C, A) \leftarrow$ FORWARD($\tilde x$, $\tilde y$, $d$, $D$, $c$). ▷ Refer to Algorithm A.1.
3:   $B \leftarrow$ BACKWARD($\tilde x$, $\tilde y$, $d$, $D$, $c$, $C$). ▷ Refer to Algorithm A.2.
4:   Initialize $\Gamma$ as a $|\tilde x| \times |\tilde y|$ matrix of zeros.
5:   for $(i, j) \in C$ do
6:     if $i = |\tilde x| + 1 \vee j = |\tilde y| + 1$ then
7:       continue
8:     end if
9:     if $\bigl(\mathrm{rl}_{\tilde x}(i) = |\tilde x| \wedge \mathrm{rl}_{\tilde y}(j) = |\tilde y|\bigr) \vee c(x_i, y_j) = c(x_i, -) + c(-, y_j)$ then
10:      if $D_{i,j} = D_{i+1,j+1} + c(x_i, y_j)$ then
11:        $\Gamma_{i,j} \leftarrow \Gamma_{i,j} + A_{i,j} \cdot B_{i+1,j+1}$.
12:      end if
13:    else
14:      if $D_{i,j} = D_{\mathrm{rl}_{\tilde x}(i)+1, \mathrm{rl}_{\tilde y}(j)+1} + d_{i,j}$ then
15:        $\gamma \leftarrow A_{i,j} \cdot B_{\mathrm{rl}_{\tilde x}(i)+1, \mathrm{rl}_{\tilde y}(j)+1}$.
16:        Compute $D'$ and $d'$ via Algorithm 2.1 for the subtrees $\tilde x_i$ and $\tilde y_j$.
17:        $D'_{1,2} \leftarrow \infty$. $D'_{2,1} \leftarrow \infty$.
18:        $(\Gamma', |\mathcal{M}(c, \tilde x_i, \tilde y_j)|) \leftarrow$ COOPTIMALS($\tilde x_i$, $\tilde y_j$, $D'$, $d'$, $c$).
19:        for $i' \leftarrow 1, \ldots, |\tilde x_i|$ do
20:          for $j' \leftarrow 1, \ldots, |\tilde y_j|$ do
21:            $\Gamma_{i+i'-1, j+j'-1} \leftarrow \Gamma_{i+i'-1, j+j'-1} + \Gamma'_{i',j'} \cdot \gamma$.
22:          end for
23:        end for
24:      end if
25:    end if
26:  end for
27:  return $(\Gamma, A_{|\tilde x|+1, |\tilde y|+1})$.
28: end function

In rough terms, Algorithm 4.1 works as follows. We first compute the matrix $A$, where $A_{i,j}$ essentially contains the number of co-optimal tree mappings between $\tilde x$ and $\tilde y$ up to nodes $x_i$ and $y_j$ respectively. Then, we compute the matrix $B$, where $B_{i,j}$ essentially contains the number of co-optimal tree mappings between $\tilde x$ and $\tilde y$ starting from nodes $x_i$ and $y_j$ respectively. Accordingly, the number of co-optimal tree mappings which contain the pairing $(i, j)$ can be computed as $\Gamma_{i,j} = A_{i,j} \cdot B_{i+1,j+1}$, as is visible in line 11 of Algorithm 4.1. What complicates this process, however, is that we also need to compute co-optimal tree mappings between subtrees, which are recursively computed in lines 15-23 of Algorithm 4.1. After executing the algorithm, the desired frequency matrix $P_c(\tilde x, \tilde y)$ can easily be computed as $\Gamma / A_{|\tilde x|+1, |\tilde y|+1}$, where $/$ denotes element-wise division.
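The same counting principle is easier to see on plain sequences, where no subtree recursion is required. The following sketch is not Algorithm 4.1 but a sequence edit distance analogue written for intuition (the function name and structure are ours): $A$ counts co-optimal alignments of prefixes, $B$ counts co-optimal alignments of suffixes, and the frequency of the replacement $(i, j)$ is $A_{i,j} \cdot B_{i+1,j+1}$ divided by the total number of co-optimal alignments. It assumes that every replacement is strictly cheaper than the corresponding deletion plus insertion; the tie case is exactly what line 9 of Algorithm 4.1 treats separately.

    import numpy as np

    def cooptimal_frequency_sequences(x, y, c_rep, c_del=1.0, c_ins=1.0):
        # Sequence analogue of the co-optimal frequency matrix:
        # P[i, j] = fraction of co-optimal alignments that replace x[i] by y[j].
        # Assumes c_rep(a, b) < c_del + c_ins whenever a != b (no ties).
        m, n = len(x), len(y)
        # Forward edit distances: D[i, j] = distance between x[:i] and y[:j].
        D = np.zeros((m + 1, n + 1))
        D[:, 0] = np.arange(m + 1) * c_del
        D[0, :] = np.arange(n + 1) * c_ins
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i, j] = min(D[i-1, j-1] + c_rep(x[i-1], y[j-1]),
                              D[i-1, j] + c_del, D[i, j-1] + c_ins)
        # Backward edit distances: E[i, j] = distance between x[i:] and y[j:].
        E = np.zeros((m + 1, n + 1))
        E[:, n] = (m - np.arange(m + 1)) * c_del
        E[m, :] = (n - np.arange(n + 1)) * c_ins
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                E[i, j] = min(E[i+1, j+1] + c_rep(x[i], y[j]),
                              E[i+1, j] + c_del, E[i, j+1] + c_ins)
        # A[i, j]: number of co-optimal alignments of x[:i] with y[:j].
        A = np.zeros((m + 1, n + 1)); A[0, 0] = 1.0
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0 and j == 0:
                    continue
                if i > 0 and j > 0 and np.isclose(D[i, j], D[i-1, j-1] + c_rep(x[i-1], y[j-1])):
                    A[i, j] += A[i-1, j-1]
                if i > 0 and np.isclose(D[i, j], D[i-1, j] + c_del):
                    A[i, j] += A[i-1, j]
                if j > 0 and np.isclose(D[i, j], D[i, j-1] + c_ins):
                    A[i, j] += A[i, j-1]
        # B[i, j]: number of co-optimal alignments of x[i:] with y[j:].
        B = np.zeros((m + 1, n + 1)); B[m, n] = 1.0
        for i in range(m, -1, -1):
            for j in range(n, -1, -1):
                if i == m and j == n:
                    continue
                if i < m and j < n and np.isclose(E[i, j], E[i+1, j+1] + c_rep(x[i], y[j])):
                    B[i, j] += B[i+1, j+1]
                if i < m and np.isclose(E[i, j], E[i+1, j] + c_del):
                    B[i, j] += B[i+1, j]
                if j < n and np.isclose(E[i, j], E[i, j+1] + c_ins):
                    B[i, j] += B[i, j+1]
        # Gamma[i, j]: number of co-optimal alignments replacing x[i] by y[j].
        Gamma = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                if np.isclose(D[i, j] + c_rep(x[i], y[j]) + E[i+1, j+1], D[m, n]):
                    Gamma[i, j] = A[i, j] * B[i+1, j+1]
        return Gamma / A[m, n], A[m, n]

    P, num = cooptimal_frequency_sequences("ab", "c", lambda a, b: 0.0 if a == b else 1.0)
    print(num)  # 2.0: either 'a' or 'b' is replaced by 'c', the other is deleted
    print(P)    # [[0.5], [0.5]]

The tree case additionally has to recurse into subtrees whenever a co-optimal choice matches a whole subtree, which is what lines 14-23 of Algorithm 4.1 take care of.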

Note that the version of the algorithm presented here is dedicated to minimize space complexity. By additionally tabulating $\Gamma$ for all subtrees, space complexity rises to $O(|\tilde x|^4 \cdot |\tilde y|^4)$ in the worst case, but runtime complexity is reduced to $O(|\tilde x|^3 \cdot |\tilde y|^3)$. Another point to note is that the worst case for this algorithm is quite unlikely. First, both input trees would have to be left-heavy, such as in the worst case for the original tree edit distance (Zhang and Shasha 1989). Second, in every step of the computation, multiple options have to be co-optimal, which only occurs in degenerate cases where, for example, the deletion or insertion cost for all symbols is zero.

After obtaining the co-optimal frequency matrix $P_c(\tilde x, \tilde y)$, we can utilize the decomposition above to compute the gradient of the tree edit distance $d_{c_{\vec{\lambda}}}$ with respect to the parameters $\vec{\lambda}$. Similar to Bellet, Habrard, and Sebban (2012), we assume that the co-optimal frequency matrices $P_{c_{\vec{\lambda}}}(\tilde x, \tilde y)$ stay constant under changes of $\vec{\lambda}$. Thus, we obtain the gradient:

$$\nabla_{\vec{\lambda}}\, d_{c_{\vec{\lambda}}}(\tilde x, \tilde y) \overset{P_{c_{\vec{\lambda}}}\,\text{const.}}{=} \sum_{i=1}^{|\tilde x|} \sum_{j=1}^{|\tilde y|} P_{c_{\vec{\lambda}}}(\tilde x, \tilde y)_{i,j} \cdot \nabla_{\vec{\lambda}}\, c_{\vec{\lambda}}(x_i, y_j) \tag{4.3}$$
$$+ \sum_{i=1}^{|\tilde x|} \Bigl(1 - \sum_{j=1}^{|\tilde y|} P_{c_{\vec{\lambda}}}(\tilde x, \tilde y)_{i,j}\Bigr) \cdot \nabla_{\vec{\lambda}}\, c_{\vec{\lambda}}(x_i, -) + \sum_{j=1}^{|\tilde y|} \Bigl(1 - \sum_{i=1}^{|\tilde x|} P_{c_{\vec{\lambda}}}(\tilde x, \tilde y)_{i,j}\Bigr) \cdot \nabla_{\vec{\lambda}}\, c_{\vec{\lambda}}(-, y_j)$$

This gradient expression is efficiently computable and permits optimization via gradient-based techniques, such as stochastic gradient descent or L-BFGS (Liu and Nocedal 1989).

Note that optimizing the parameters $\vec{\lambda}$ with respect to the GLVQ cost function 2.28 may yield a tree edit distance under which a better MGLVQ model is possible. Therefore, we recommend an alternating optimization scheme with two steps, namely MGLVQ and metric learning, which are iterated until convergence. This yields Algorithm 4.2.

Algorithm 4.2 The tree edit distance learning algorithm for arbitrary parameters $\vec{\lambda}$.

1: function TED_LEARN(a dataset of trees $\tilde x_1, \ldots, \tilde x_M$ over some alphabet $\mathcal{A}$, class labels $y_1, \ldots, y_M$, no. of prototypes $K$, a cost function $c_{\vec{\lambda}}$, and initial parameters $\vec{\lambda}$)
2:   while $E$ has changed do
3:     Compute pairwise tree edit distances $D_{i,j} = d_{c_{\vec{\lambda}}}(\tilde x_i, \tilde x_j)$ via Algorithm 2.1.
4:     $(w_1, \ldots, w_K, E) \leftarrow$ MGLVQ for $D$.
5:     Compute $P_{c_{\vec{\lambda}}}(\tilde x_i, w_k)$ for all $i, k$ via Algorithm 4.1.
6:     Optimize $\vec{\lambda}$ with respect to the GLVQ cost function $E$ using Equation 4.3.
7:   end while
8:   return $(\vec{\lambda}, E)$.
9: end function

Regarding runtime complexity, computing all pairwise tree edit distances requires $O(M^2 \cdot |\tilde x|^2 \cdot |\tilde y|^2)$ steps, training MGLVQ requires $O(M^2)$ steps, and computing the co-optimal frequency matrices for all datapoint-prototype pairs requires $O(M \cdot |\tilde x|^6 \cdot |\tilde y|^6)$ steps in the worst case, after which each gradient computation according to Equations 3.2 and 4.3 requires only $O(M \cdot |\tilde x| \cdot |\tilde y|)$ steps. In terms of space, we require $O(M^2)$ to represent the pairwise distance matrix and $O(|\tilde x|^2 \cdot |\tilde y|^2)$ for computing the co-optimal frequency matrices. Therefore, assuming a constant number of iterations $\tau$, we obtain an overall runtime complexity in the order of $O\bigl(\tau \cdot M \cdot |\tilde x|^2 \cdot |\tilde y|^2 \cdot (M + |\tilde x|^4 \cdot |\tilde y|^4)\bigr)$ and an overall space complexity of $O(M^2 + |\tilde x|^2 \cdot |\tilde y|^2)$.

Note that the decomposition in Equation 4.2 only holds under the assumption that $c_{\vec{\lambda}}$ is non-negative, self-equal, and fulfills the triangular inequality. Unfortunately, ensuring metric axioms on $c_{\vec{\lambda}}$ imposes additional constraints on the optimization in line 6 of Algorithm 4.2, which prevent us from directly applying gradient-based solvers. To avoid such explicit constraints, we introduce an additional innovation, namely a representation of the alphabet $\mathcal{A}$ via symbol embeddings.

Adaptive Symbol Embeddings

In this section, we introduce symbol embeddings, that is, we represent the elements of an alphabet $\mathcal{A}$ by vectors in a Euclidean space. The main motivation for this alternative representation is that the Euclidean distance guarantees pseudo-metric properties, which in turn ensure that the assumptions of the tree edit distance algorithm as well as the decomposition in Equation 4.2 are fulfilled without having to constrain the optimization process. Another advantage is that we can potentially interpret the positions of the vectorial representations and thus gain additional insight regarding the role different symbols play.

We define a symbol embedding as follows.

Definition 4.2 (Symbol Embedding). Let $\mathcal{A}$ be an alphabet with $- \notin \mathcal{A}$. Then, a symbol embedding of $\mathcal{A}$ is defined as a mapping $\phi: \mathcal{A} \to \mathbb{R}^m$ for some $m \in \{1, \ldots, |\mathcal{A}|\}$. For any symbol embedding, we define $\phi(-) := \vec{0}$, where $\vec{0}$ is the $m$-dimensional zero vector.

Further, we define the cost function $c_\phi$ with respect to the symbol embedding $\phi$ as follows:

$$c_\phi(x, y) := \|\phi(x) - \phi(y)\|$$

In other words, $c_\phi$ is a Euclidean distance on $\mathcal{A} \cup \{-\}$ with the spatial mapping $\phi$.

Note that the gradient of $c_\phi(x, y)$ with respect to $\phi(x)$ is given as:

$$\nabla_{\phi(x)} c_\phi(x, y) = \nabla_{\phi(x)} \sqrt{\bigl(\phi(x) - \phi(y)\bigr)^\top \cdot \bigl(\phi(x) - \phi(y)\bigr)} = \frac{\nabla_{\phi(x)} \bigl(\phi(x) - \phi(y)\bigr)^\top \cdot \bigl(\phi(x) - \phi(y)\bigr)}{2 \cdot \sqrt{\bigl(\phi(x) - \phi(y)\bigr)^\top \cdot \bigl(\phi(x) - \phi(y)\bigr)}} = \frac{\phi(x) - \phi(y)}{\|\phi(x) - \phi(y)\|} \tag{4.4}$$

Plugging this result into Equation 4.3, we obtain a gradient of the tree edit distance with respect to the embedding vectors for every symbol, under the assumption that the co-optimal frequency matrices stay constant:

$$\nabla_{\phi(x)}\, d_{c_\phi}(\tilde x, \tilde y) \overset{P_{c_\phi}\,\text{const.}}{=} \sum_{i=1}^{|\tilde x|} \delta(x, x_i) \cdot \Biggl[ \sum_{j=1}^{|\tilde y|} P_{c_\phi}(\tilde x, \tilde y)_{i,j} \cdot \frac{\phi(x) - \phi(y_j)}{\|\phi(x) - \phi(y_j)\|} + \Bigl(1 - \sum_{j=1}^{|\tilde y|} P_{c_\phi}(\tilde x, \tilde y)_{i,j}\Bigr) \cdot \frac{\phi(x)}{\|\phi(x)\|} \Biggr] \tag{4.5}$$
$$+ \sum_{j=1}^{|\tilde y|} \delta(x, y_j) \cdot \Biggl[ \sum_{i=1}^{|\tilde x|} P_{c_\phi}(\tilde x, \tilde y)_{i,j} \cdot \frac{\phi(x) - \phi(x_i)}{\|\phi(x) - \phi(x_i)\|} + \Bigl(1 - \sum_{i=1}^{|\tilde x|} P_{c_\phi}(\tilde x, \tilde y)_{i,j}\Bigr) \cdot \frac{\phi(x)}{\|\phi(x)\|} \Biggr]$$

where $\delta$ is the Kronecker delta, i.e. $\delta(x, y) = 1$ if $x = y$ and $0$ otherwise.

Finally, we can plug this gradient into Equation 3.2 and thus obtain a gradient of the GLVQ cost function with respect to the rows of our symbol embedding matrix. Using this gradient, we can learn a symbol embedding via any gradient-based technique.

Two challenges remain in this setup. First, we have to obtain a viable initialization for the embedding $\phi$. In this regard, we suggest using an initialization that yields the default edit costs of the tree and sequence edit distance, that is, $c(x, y)$ should be 1 in all cases, except if $x = y$, where it is zero. Indeed, such an initialization exists.

Theorem 4.3. Let $\mathcal{A} = \{x_1, \ldots, x_n\}$ be an alphabet. Then, the following function $\phi: \mathcal{A} \to \mathbb{R}^n$ with

$$\phi(x_i)_j := \begin{cases} 0 & \text{if } j > i \\ \rho_j & \text{if } j < i \\ \rho_i \cdot (i+1) & \text{if } j = i \end{cases} \tag{4.6}$$

where

$$\rho_i = 1 \Big/ \sqrt{2 \cdot i \cdot (i+1)} \tag{4.7}$$

is a symbol embedding of $\mathcal{A}$ such that:

$$c_\phi(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases}$$

Proof. Refer to Appendix A.14.

As an example, consider the two-dimensional embedding generated for the alphabet $\mathcal{A} = \{a, b\}$ in Figure 4.2.
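A minimal sketch of this initialization (our own code, directly following Equations 4.6 and 4.7): it builds the embedding matrix row by row and checks that all pairwise costs, including the cost of deleting or inserting a symbol, equal 1.

    import numpy as np

    def default_embedding(n):
        # Initial embedding from Theorem 4.3: row i of Phi is phi(x_i).
        # All symbols end up at pairwise Euclidean distance 1 and at
        # distance 1 from the zero vector phi(-), i.e. the default edit costs.
        Phi = np.zeros((n, n))
        rho = 1.0 / np.sqrt(2.0 * np.arange(1, n + 1) * np.arange(2, n + 2))  # rho_i from Eq. 4.7
        for i in range(1, n + 1):
            Phi[i - 1, :i - 1] = rho[:i - 1]          # entries with j < i
            Phi[i - 1, i - 1] = rho[i - 1] * (i + 1)  # entry with j = i
        return Phi

    Phi = default_embedding(2)
    print(Phi)                              # [[1, 0], [0.5, sqrt(3)/2]], as in Figure 4.2
    print(np.linalg.norm(Phi[0] - Phi[1]))  # 1.0: cost of replacing 'a' by 'b'
    print(np.linalg.norm(Phi, axis=1))      # [1, 1]: cost of deleting or inserting a symbol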

The second challenge concerns the handling of the following degenerate cases for the embedding $\phi$. First, $c_\phi$ is non-differentiable with respect to $\phi(x)$ at the point $\phi(x) = \vec{0}$. However, this may not pose a problem as such. In particular, $\phi(x) = \vec{0}$ implies that $x$ is unimportant to distinguish between trees, which is an interesting observation to make. Therefore, we simply extend the definition of the gradient of $c_\phi$ to be zero at $\phi(x) = \vec{0}$.

Figure 4.2: The initial two-dimensional embedding of the alphabet $\mathcal{A} = \{a, b\}$, with $\phi(-) = (0, 0)^\top$, $\phi(a) = (1, 0)^\top$, and $\phi(b) = (1/2, \sqrt{3}/2)^\top$. This embedding ensures that the pairwise cost of all symbols, including the $-$ symbol, is 1.

Second, $\phi$ may “flatten” the data too much, in the sense that the matrix $\Phi = \bigl(\phi(x_1), \ldots, \phi(x_n)\bigr)$ for the alphabet $\mathcal{A} = \{x_1, \ldots, x_n\}$ becomes low-rank. Such oversimplification effects have previously been observed in generalized matrix learning vector quantization (GMLVQ) as well (Schneider, Bunte, et al. 2010). Indeed, we can make the similarity between GMLVQ and our metric learning scheme more obvious by re-writing $c_\phi(x_i, x_j)$ as follows:

$$c_\phi(x_i, x_j) = \sqrt{\bigl(\phi(x_i) - \phi(x_j)\bigr)^\top \cdot \bigl(\phi(x_i) - \phi(x_j)\bigr)} = \sqrt{\bigl(\Phi \cdot e_i - \Phi \cdot e_j\bigr)^\top \cdot \bigl(\Phi \cdot e_i - \Phi \cdot e_j\bigr)} = \sqrt{\bigl(e_i - e_j\bigr)^\top \cdot \Phi^\top \cdot \Phi \cdot \bigl(e_i - e_j\bigr)}$$

where $e_i$ is the $i$th unit vector and $e_j$ is the $j$th unit vector.

In this representation it is obvious that $\Phi$ plays the role of the projection matrix in GMLVQ (compare to Equation 2.29). As Biehl et al. (2015) have shown, this learned projection matrix tends to very low-rank solutions, which may be overly simplistic in practical cases. To prevent such a low-rank solution, we follow the recommendation of Schneider, Bunte, et al. (2010) and add the regularization term $-\lambda \cdot \log\bigl(\det(\Phi \cdot \Phi^\top)\bigr)$ for some constant $\lambda > 0$ to the GLVQ cost function 2.28, which becomes large if any eigenvalue of $\Phi \cdot \Phi^\top$ gets close to zero. The gradient of this regularization term with respect to $\Phi$ can be expressed via the Moore-Penrose pseudo-inverse of $\Phi$ (Schneider, Bunte, et al. 2010; Petersen and Pedersen 2012).

Finally, the GLVQ cost function 2.28 is inherently scale-invariant in terms of the distances, that is, if we multiply all pairwise distances by some constant, the loss stays the same. Given this degree of freedom, we may converge to an embedding with needlessly large scaling. To prevent such a case, we additionally add the regularization term $\lambda \cdot \|\Phi\|_F$, where $\|\Phi\|_F$ is the Frobenius norm of $\Phi$.

This concludes our basic setup for tree edit distance learning via adaptive symbol embeddings. We call our approach embedding edit distance learning (BEDL).

Before we evaluate BEDL experimentally, we wish to highlight one additional desirable property of BEDL, namely that we can utilize alternative initializations and metrics if suitable for the domain at hand. For example, in the domain of natural language processing, we do not need to learn an embedding of words from scratch but can rely on existing word embeddings, such as GloVe (Pennington, Socher, and Manning 2014). We can then learn to adapt the word embedding instead of learning it from scratch by only learning a linear transformation that maps the pre-existing word embedding into another space in which classification is simpler. Furthermore, for word embeddings, the cosine similarity is typically favored over the Euclidean distance (Pennington, Socher, and Manning 2014). We can include the cosine distance in BEDL easily by re-defining the cost function $c_{\phi,\Omega}$ as follows.

$$c_{\phi,\Omega}(x, y) := \frac{1}{2}\Bigl(1 - s\bigl(\phi(x), \phi(y)\bigr)\Bigr) \quad \text{where} \quad s\bigl(\phi(x), \phi(y)\bigr) = \frac{\bigl(\Omega \cdot \phi(x)\bigr)^\top \cdot \Omega \cdot \phi(y)}{\|\Omega \cdot \phi(x)\| \cdot \|\Omega \cdot \phi(y)\|} \tag{4.8}$$

For the gradient with respect to $\Omega$, we obtain:

$$\nabla_\Omega\, c_{\phi,\Omega}(x, y) = -\frac{1}{2} \nabla_\Omega\, s\bigl(\phi(x), \phi(y)\bigr)$$
$$= -\frac{1}{2} \cdot \frac{\Omega \cdot \bigl(\phi(x) \cdot \phi(y)^\top + \phi(y) \cdot \phi(x)^\top\bigr)}{\|\Omega \cdot \phi(x)\| \cdot \|\Omega \cdot \phi(y)\|} + \frac{1}{2} \cdot \Omega \cdot s\bigl(\phi(x), \phi(y)\bigr) \cdot \Bigl[\frac{\phi(x) \cdot \phi(x)^\top}{\|\Omega \cdot \phi(x)\|^2} + \frac{\phi(y) \cdot \phi(y)^\top}{\|\Omega \cdot \phi(y)\|^2}\Bigr]$$
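To make the cosine variant concrete, the following sketch (ours, reading Equation 4.8 as $c_{\phi,\Omega} = (1 - s)/2$) implements the cost and its gradient with respect to $\Omega$ and verifies the analytic expression against finite differences; the helper num_grad is introduced only for this check.

    import numpy as np

    def cos_cost(Omega, a, b):
        # Cosine cost of Eq. 4.8 for embedding vectors a = phi(x), b = phi(y).
        u, v = Omega @ a, Omega @ b
        s = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return 0.5 * (1.0 - s)

    def cos_cost_grad(Omega, a, b):
        # Analytic gradient of the cosine cost with respect to Omega.
        u, v = Omega @ a, Omega @ b
        p, q = np.linalg.norm(u), np.linalg.norm(v)
        s = (u @ v) / (p * q)
        return (-0.5 * Omega @ (np.outer(a, b) + np.outer(b, a)) / (p * q)
                + 0.5 * s * Omega @ (np.outer(a, a) / p**2 + np.outer(b, b) / q**2))

    def num_grad(f, Omega, eps=1e-6):
        # Finite-difference gradient, used only to check the analytic expression.
        G = np.zeros_like(Omega)
        for idx in np.ndindex(*Omega.shape):
            E = np.zeros_like(Omega); E[idx] = eps
            G[idx] = (f(Omega + E) - f(Omega - E)) / (2.0 * eps)
        return G

    rng = np.random.default_rng(1)
    Omega = rng.standard_normal((3, 5))
    a, b = rng.standard_normal(5), rng.standard_normal(5)
    print(np.allclose(cos_cost_grad(Omega, a, b),
                      num_grad(lambda W: cos_cost(W, a, b), Omega), atol=1e-5))  # True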

We will utilize both the basic Euclidean version of BEDL and the cosine distance variation in the next section, in which we evaluate BEDL experimentally.
