
We evaluate our metric learning scheme on two artificial datasets, namely:

Strings: An artificial, balanced two-class dataset of 200 strings of length 12. Strings in class 1 consist of 6 a or b symbols, followed by a c or d, followed by another 5 a or b symbols. Which of the two respective symbols is selected is chosen uniformly at random. Strings in class 2 are constructed in much the same way, except that they consist of 5 a or b symbols, followed by a c or d, followed by another 6 a or b symbols. Note that the classes can be discriminated neither via length nor via symbol frequency features. The decisive discriminative feature is where cs and ds are located in the string.

Gap: An artificial, balanced two-class dataset of 200 uniform random strings over the alphabet $\mathcal{A} = \{a, b, c, d\}$, where the strings in class 1 have length 10 and the strings in class 2 have length 12. In this dataset, the discriminative feature is the length, with replacement costs being irrelevant.
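To make the construction concrete, the following sketch generates both artificial datasets (the function and variable names are our own; the original generation code is not shown in this section):

import random

def strings_sample(cls):
    """Generate one string for the Strings dataset.

    Class 1: 6 a/b symbols, one c/d symbol, 5 a/b symbols (length 12).
    Class 2: 5 a/b symbols, one c/d symbol, 6 a/b symbols (length 12).
    """
    left, right = (6, 5) if cls == 1 else (5, 6)
    return (''.join(random.choice('ab') for _ in range(left))
            + random.choice('cd')
            + ''.join(random.choice('ab') for _ in range(right)))

def gap_sample(cls):
    """Generate one string for the Gap dataset: uniform random over
    {a, b, c, d}; class 1 has length 10, class 2 has length 12."""
    length = 10 if cls == 1 else 12
    return ''.join(random.choice('abcd') for _ in range(length))

# 100 strings per class for each balanced 200-string dataset
strings_data = [(strings_sample(c), c) for c in (1, 2) for _ in range(100)]
gap_data = [(gap_sample(c), c) for c in (1, 2) for _ in range(100)]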

For these data, our aim is to optimize the standard string edit distance with signature $\mathcal{S}_{\mathrm{ALI}} = (\{\mathrm{del}\}, \{\mathrm{rep}\}, \{\mathrm{ins}\})$ and edit tree grammar $\mathcal{G}_{\mathrm{ALI}}$ as defined in Equation 2.16. For each alphabet $\mathcal{A} = \{x_1, \ldots, x_m\}$ we employ the following algebra $F_{\vec{\lambda}}$:

\[
c_{\mathrm{rep}}(x_i, x_j) = \lambda_{(m+1) \cdot (i-1) + j} \qquad
c_{\mathrm{del}}(x_i) = \lambda_{(m+1) \cdot i} \qquad
c_{\mathrm{ins}}(x_j) = \lambda_{(m+1) \cdot m + j}
\]

In other words, we consider the replacement, deletion, and insertion costs as parameters, which results in $(|\mathcal{A}|+1)^2 - 1$ parameters overall. As initialization, we use the algebra $F_{\mathrm{ALI}}$ specified in Equation 2.15.
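The indexing scheme amounts to arranging the parameter vector $\vec{\lambda}$ as an $(m+1) \times (m+1)$ cost matrix over the alphabet plus a gap symbol, with the (gap, gap) entry unused. A minimal sketch of this correspondence, using 1-based symbol indices as in the formulas above (the helper is our own illustration):

def unpack_costs(lam, m):
    """Read the parameter vector lam of length (m+1)**2 - 1 as
    replacement, deletion, and insertion costs for an alphabet
    x_1, ..., x_m.  i and j are 1-based symbol indices, matching
    the subscripts in the formulas above (lam itself is 0-based)."""
    def c_rep(i, j):
        return lam[(m + 1) * (i - 1) + j - 1]  # lambda_{(m+1)(i-1)+j}
    def c_del(i):
        return lam[(m + 1) * i - 1]            # lambda_{(m+1) i}
    def c_ins(j):
        return lam[(m + 1) * m + j - 1]        # lambda_{(m+1) m + j}
    return c_rep, c_del, c_ins

# Example: alphabet {a, b, c, d} (m = 4) yields 5 * 5 - 1 = 24 parameters
lam = [1.0] * 24
c_rep, c_del, c_ins = unpack_costs(lam, 4)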

For metric learning, we train an RGLVQ model with one prototype per class and then perform ten gradient descent steps to learn the parameters. As learning rate for gradient descent we employ $\eta = 1/M$ for both datasets, where $M$ is the number of data points.

After each gradient step, we normalize the parameters by clipping negative values to zero, by setting self-replacement costs to zero, by symmetrizing the parameters, and by using the Floyd-Warshall algorithm for pairwise shortest paths to enforce the triangle inequality (Floyd 1962). We set the crispness parameter to $\beta = 1$.
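A sketch of this normalization step, assuming the parameters are arranged as a square cost matrix $C$ over the alphabet plus the gap symbol, so that deletion and insertion costs appear as replacements with the gap:

import numpy as np

def normalize_costs(C):
    """Normalize a cost matrix over the alphabet plus gap symbol:
    clip negatives, zero the diagonal (self-replacements), symmetrize,
    and enforce the triangle inequality via Floyd-Warshall."""
    C = np.maximum(C, 0.0)            # clip negative costs to zero
    np.fill_diagonal(C, 0.0)          # self-replacement costs are zero
    C = 0.5 * (C + C.T)               # symmetrize
    n = C.shape[0]
    for k in range(n):                # Floyd-Warshall shortest paths
        C = np.minimum(C, C[:, k:k+1] + C[k:k+1, :])
    return C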

We evaluate the average classification error on our learned metric in a crossvalidation with 20 folds using four different classifiers, namely a 1-nearest neighbor classifier (1-NN), an RGLVQ classifier with one prototype per class, a support vector machine (SVM) with radial basis function kernel (refer to Equation 2.44) and clip eigenvalue correction, and the goodness classifier of Balcan, Blum, and Srebro (2008) with the similarity $s_d(\bar{x}, \bar{y}) = 2 \cdot \exp(-d_c(\bar{x}, \bar{y})) - 1$ as suggested in Section 2.4.1.
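The similarity transform for the goodness classifier maps distances $d \in [0, \infty)$ into $(-1, 1]$, with $d = 0$ yielding similarity 1; as a one-line sketch:

import numpy as np

def goodness_similarity(D):
    """Map pairwise distances d >= 0 to similarities in (-1, 1]
    via s = 2 * exp(-d) - 1; d = 0 yields similarity 1."""
    return 2.0 * np.exp(-np.asarray(D)) - 1.0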

Table 3.2: The mean classification error ± standard deviation of multiple classifiers across 20 crossvalidation trials on both artificial datasets. The first column lists the results for the standard string edit distance, the second column for GESL, and the final column for our proposed metric learning scheme. Datasets and the different classifiers are listed as rows. The best results for each dataset are highlighted in bold print.

classifier    Initial         GESL            proposed

Strings
1-NN          20.0 ± 9.2%     10.5 ± 9.4%     0.0 ± 0.0%
RGLVQ         39.5 ± 14.3%    18.0 ± 12.8%    0.0 ± 0.0%
SVM            7.0 ± 8.6%     11.0 ± 8.5%     0.0 ± 0.0%
goodness       4.5 ± 5.1%      4.5 ± 5.1%     0.0 ± 0.0%

Gap
1-NN          51.5 ± 15.7%    10.0 ± 11.2%    5.0 ± 15.4%
RGLVQ         54.5 ± 10.5%    34.5 ± 11.9%    7.0 ± 17.2%
SVM           20.0 ± 10.3%    45.0 ± 6.9%     5.0 ± 15.4%
goodness       6.5 ± 6.7%      1.0 ± 3.1%     5.0 ± 15.4%

We optimized the radial basis function bandwidth $\xi$ for the SVM and the sparsity hyperparameter $\nu$ for the goodness classifier in a nested crossvalidation with 5 folds. We compare these classification errors with the errors obtained via the initial string edit distance and the pseudo-edit distance obtained via good edit similarity learning (GESL; Bellet, Habrard, and Sebban 2012, also refer to Section 2.4.1) with $K = 1$ reference point from the same and from the other class for each point.

The results are displayed in Table 3.2. For the Strings dataset, our proposed metric learning scheme could improve the string edit distance such that all classifiers could classify the data perfectly in all folds, whereas GESL only yielded improvements for the 1-NN and RGLVQ classifiers. Overall, our proposed sequence edit distance learning scheme significantly outperformed both the initial string edit distance and GESL for all classifiers ($p < 0.01$ according to a Wilcoxon signed-rank test). Deeper inspection revealed that our proposed scheme did indeed reduce the pairwise replacement costs $c_{\mathrm{rep}}(a, b) = c_{\mathrm{rep}}(b, a)$ to zero in all folds, as expected.

For the Gap dataset, our proposed scheme could improve the metric such that perfect classification was possible in 17 out of 20 folds. In these cases, our approach did correctly set the pairwise replacement costs $c_{\mathrm{rep}}(a, b) = c_{\mathrm{rep}}(b, a)$ to zero while gap costs remained nonzero. This result was surpassed by GESL for the goodness classifier, where GESL achieved 1% error. However, GESL achieved much worse results for the other classifiers, indicating that the learned metric is rather specific to the goodness classifier, whereas our proposed approach achieves a metric that is viable across the board. For all classifiers except the goodness classifier, our proposed approach achieved significantly better results than the initial string edit distance ($p < 0.01$); and for both RGLVQ and SVM we significantly outperformed GESL ($p < 0.001$).
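The significance tests compare the per-fold errors of the same classifier under two metrics; with scipy, such a paired test looks like the following sketch (the error values below are placeholders, not the actual fold results):

from scipy.stats import wilcoxon

# per-fold test errors of one classifier under two metrics
# (placeholder numbers; in the experiment these are the 20 fold errors)
errors_initial = [0.20, 0.10, 0.30, 0.15, 0.25, 0.10, 0.20, 0.30, 0.15, 0.25,
                  0.20, 0.10, 0.30, 0.15, 0.25, 0.10, 0.20, 0.30, 0.15, 0.25]
errors_learned = [0.00] * 20

# paired, non-parametric test across folds
stat, p = wilcoxon(errors_initial, errors_learned)
print(f'Wilcoxon signed-rank test: p = {p:.4f}')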

public class Adder {
    public int add(int a, int b) {
        return a + b;
    }
}

(AST: a class node with one method child, whose children are two variable nodes and one return node)

Figure 3.1: Left: A snippet of example Java source code. Right: The corresponding abstract syntax tree (AST). Note that the AST also includes backreferences from the “return” node to both variable nodes, because both variables are referenced in the return statement.

Real-World Data

We consider the following real-world datasets.

Copenhagen Chromosomes: A balanced two-class dataset of 400 strings, consisting of the classes 4 and 5 of the Copenhagen Chromosomes database (Lundsteen, Philip, and Granum 1980). Each string describes the density of a human chromosome in differential coding with a 13-letter alphabet $\mathcal{A} = \{f, \ldots, a, =, A, \ldots, F\}$, where lower case letters mark negative changes in density, upper case letters mark positive changes in density, and = codes equal density.

Sorting: An unbalanced, two-class dataset consisting of 64 Java programs collected from 37 different web sites (Paaßen 2016a). All programs are implementations of sorting algorithms for an input array of integers. In particular, 35 programs implement BubbleSort, and 29 programs implement InsertionSort. We preprocess all programs by extracting their abstract syntax trees (ASTs) using the Oracle Java™ Compiler API.

To each node of these ASTs, we attach a feature vector incorporating characteristic properties, namely a discrete type label (e.g. class declaration, method, variable declaration, for loop, etc.), an encoding of the visibility scope the node is located in, the index of the parent node, the row and column index of the node within the original source code, the name of a declared class, method, or variable if applicable, the name of the class of the declared variable if applicable, the name of the class of a returned variable if applicable, the number of references to other nodes within the AST, and a list of strings of references to external classes, methods, or variables.

Finally, we flatten all ASTs to sequences by considering the sequence of nodes in depth-first search order. As an example, consider the source code listed in Figure 3.1. The corresponding depth-first search sequence is shown in Table 3.3.
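Flattening a tree into its depth-first node sequence can be sketched as follows (the Node type is a generic stand-in of our own; the actual ASTs come from the Oracle Java™ Compiler API):

from dataclasses import dataclass, field

@dataclass
class Node:
    """A generic AST node with a feature dictionary and children."""
    features: dict
    children: list = field(default_factory=list)

def flatten(node):
    """Return the nodes of the tree in depth-first (pre-)order."""
    sequence = [node]
    for child in node.children:
        sequence.extend(flatten(child))
    return sequence

# Example: the "Adder" AST from Figure 3.1 (features abbreviated)
ast = Node({'type': 'class'}, [
    Node({'type': 'method'}, [
        Node({'type': 'variable'}),
        Node({'type': 'variable'}),
        Node({'type': 'return'}),
    ])
])
print([n.features['type'] for n in flatten(ast)])
# ['class', 'method', 'variable', 'variable', 'return']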

For the Copenhagen Chromosomes dataset, we again evaluate the standard string edit distance and learn the pairwise replacement and gap costs directly. For the Sorting dataset, we learn both the standard string edit distance as well as the affine edit distance of Gotoh (1982) with the signature $\mathcal{S}_{\mathrm{LOCAL}}$ as defined in Equation 2.17, the edit tree grammar $\mathcal{G}_{\mathrm{AFFINE}}$ as defined in Equation 2.18, and an adaptable algebra $F_{\vec{\lambda}}$, which we state after Table 3.3.

Table 3.3: The depth-first search sequence generated for the “Adder” source code listed in Figure 3.1. Each column corresponds to one AST node, each row to one feature.

type            class          method         variable        variable        return
scope           []             [0]            [0,0]           [0,0]           [0,0]
parent          -1             0              1               1               1
codePosition    (1,1)-(5,2)    (2,3)-(4,4)    (2,18)-(2,23)   (2,25)-(2,30)   (3,5)-(3,16)
name            Adder          add            a               b               −
className       −              −              int             int             −
returnType      −              int            −               −               −
numberOfEdges   1              3              0               0               2
externalDeps    −              int            −               −               −

The algebra $F_{\vec{\lambda}}$ is given by

\[
\begin{aligned}
c_{\mathrm{del}}(x) = c_{\mathrm{ins}}(x) = c_{\mathrm{skip}}^{l,o}(x) = c_{\mathrm{skip}}^{r,o}(x) &= 1 \quad && \forall x \in \mathcal{A} \\
c_{\mathrm{skip}}^{l}(x) = c_{\mathrm{skip}}^{r}(x) &= 0.5 \quad && \forall x \in \mathcal{A} \\
c_{\mathrm{rep}}(x, y) &= \sum_{r=1}^{9} \lambda_r \cdot c_r(x_r, y_r) \quad && \forall x, y \in \mathcal{A}
\end{aligned}
\]

where $x_r$ denotes the $r$th feature of $x$, $\lambda_r$ is a real number in the range $[0, 1]$ such that $\sum_{r=1}^{9} \lambda_r = 1$, and $c_r$ is a specific metric for the $r$th feature. In particular, for the type feature, we assign a distance of 1 if the types are not equal and a distance of 0 otherwise. For the scope feature, we use one minus the length of the longest common prefix divided by the length of the longer scope. For the parent feature, the code position, and the number of edges, we use the Manhattan distance. For the name, className, returnType, and externalDeps features, we compute character frequencies and use the Manhattan distance on the character frequency vectors. Our adaptable metric parameters are the weights $\lambda_r$, which we initialize as $\lambda_r = 1/9$.
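A sketch of the resulting replacement cost, with a simplified illustrative subset of the nine per-feature metrics (helper names and the feature-dictionary representation are our own):

from collections import Counter

def scope_distance(s, t):
    """One minus the longest common prefix length divided by the
    length of the longer scope (distance 0 for two empty scopes)."""
    prefix = 0
    while prefix < min(len(s), len(t)) and s[prefix] == t[prefix]:
        prefix += 1
    longest = max(len(s), len(t))
    return 1.0 - prefix / longest if longest > 0 else 0.0

def char_frequency_distance(s, t):
    """Manhattan distance between character frequency vectors."""
    fs, ft = Counter(s), Counter(t)
    return sum(abs(fs[ch] - ft[ch]) for ch in set(fs) | set(ft))

def replacement_cost(x, y, lam, metrics):
    """c_rep(x, y) = sum_r lam[r] * c_r(x_r, y_r) over the features,
    where x and y are feature dictionaries."""
    return sum(l * c(x[f], y[f]) for l, (f, c) in zip(lam, metrics))

# per-feature metrics (illustrative subset of the nine features)
metrics = [
    ('type', lambda a, b: 0.0 if a == b else 1.0),  # discrete metric
    ('scope', scope_distance),
    ('parent', lambda a, b: abs(a - b)),            # Manhattan distance
    ('name', char_frequency_distance),
]
lam = [1.0 / len(metrics)] * len(metrics)           # uniform initialization

x = {'type': 'variable', 'scope': [0, 0], 'parent': 1, 'name': 'a'}
y = {'type': 'variable', 'scope': [0, 0], 'parent': 1, 'name': 'b'}
print(replacement_cost(x, y, lam, metrics))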

As with the artificial data, we first train an RGLVQ model with one prototype per class and then perform ten gradient descent steps to learn the respective parameters. As learning rate for gradient descent, we employ $\eta = 0.45/M$ for the Copenhagen Chromosomes dataset and $\eta = 2/(M \cdot |\bar{x}|)$ for the Sorting dataset, where $M$ is the number of data points and $|\bar{x}|$ is the average sequence length in the dataset. After each gradient step, we normalize the parameters, using the same normalization as for the artificial data in case of the Copenhagen Chromosomes dataset, and by clipping negative values to zero and normalizing the sum to one for the Sorting dataset. We set the crispness parameter $\beta$ to 7 for the Copenhagen Chromosomes dataset, and to 200 for the Sorting dataset.
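For the Sorting dataset, this normalization is a projection back to nonnegative weights summing to one; a minimal sketch, assuming at least one weight stays positive:

import numpy as np

def normalize_weights(lam):
    """Clip negative weights to zero and renormalize the sum to one
    (assumes at least one weight remains positive after clipping)."""
    lam = np.maximum(lam, 0.0)
    return lam / lam.sum()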

We evaluate the average classification error across 5 crossvalidation folds for three different classifiers, namely a 5-nearest neighbor classifier (5-NN), RGLVQ, and an SVM with a kernel obtained via double-centering (refer to Equation 2.6). Note that we do not compare to GESL at this point because our parametrization for the Sorting dataset is not compatible with GESL.
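The double-centering construction referenced here is, in its standard form, $K = -\frac{1}{2} J D^{(2)} J$ with the elementwise squared distance matrix $D^{(2)}$ and centering matrix $J = I - \frac{1}{M} \mathbf{1}\mathbf{1}^\top$ (the details of Equation 2.6 may differ); as a sketch:

import numpy as np

def double_centering_kernel(D):
    """Turn a pairwise distance matrix D into a kernel matrix via
    double-centering: K = -0.5 * J @ (D**2) @ J, where D**2 squares
    the distances elementwise and J centers rows and columns."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M   # centering matrix
    return -0.5 * J @ (D ** 2) @ J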

We repeat the crossvalidation 10 times for Copenhagen Chromosomes and 5 times for Sorting.

The results are displayed in Table 3.4. For both datasets, metric learning improves the classification accuracy across classifiers. In particular, we improve the RGLVQ accuracy by about 6% for the standard string edit distance and by about 11% for the affine edit distance.

Table 3.4: The mean classification error of multiple classifiers across crossvalidation trials and repeats on the Copenhagen Chromosomes and Sorting datasets. Columns correspond to classifiers, while rows correspond to different metrics on different datasets. The best results for each dataset are highlighted in bold print.

metric            RGLVQ    SVM    5-NN

Copenhagen Chromosomes
initial           11%      4%     3%
learned            5%      3%     3%

Sorting
global initial    26%      35%    23%
global learned    20%      37%     8%
affine initial    26%      26%    38%
affine learned    15%      22%     0%

Figure 3.2: Two-dimensional t-SNE embeddings of the Sorting dataset without metric learning (left) and with metric learning (right) for the affine edit distance. BubbleSort programs are visualized as blue circles, InsertionSort programs as orange triangles.

For SVM and 5-NN we also observe improvements, except for the SVM with the standard sequence edit distance on the Sorting dataset. For 5-NN, we observe particularly striking improvements on the Sorting dataset, with 15% for the standard sequence edit distance and 38% for the affine edit distance. We can also inspect the change in representation visually: Figure 3.2 displays two-dimensional t-SNE embeddings (Van der Maaten and Hinton 2008) of the Sorting dataset for the default affine edit distance and the learned affine edit distance. As visible in the figure, the classes become more compact and more distinct.
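Embeddings like those in Figure 3.2 can be computed directly from the pairwise edit distances, for instance with scikit-learn's t-SNE in precomputed mode (a sketch; the distance matrix D and the class labels are assumed inputs):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding(D, labels):
    """Embed a precomputed pairwise distance matrix D in 2D with t-SNE
    and scatter-plot the points with one marker per class."""
    labels = np.asarray(labels)
    embedding = TSNE(n_components=2, metric='precomputed',
                     init='random').fit_transform(D)
    for cls, marker in [('BubbleSort', 'o'), ('InsertionSort', '^')]:
        mask = labels == cls
        plt.scatter(embedding[mask, 0], embedding[mask, 1],
                    marker=marker, label=cls)
    plt.legend()
    plt.show()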

The resulting weights $\vec{\lambda}$ after metric learning on the Sorting dataset (normalized by their frequency in the data) are shown in Figure 3.3.

(bar chart of the normalized weight per feature: type, scope, parent, codePosition, name, className, returnType, externalDependencies, numberOfEdges)

Figure 3.3: The weights $\vec{\lambda}$ after metric learning on the affine edit distance. The weights were normalized by their frequency in the dataset.

As we can see, the weights for the numberOfEdges feature, the codePosition feature, and the parent feature have been reduced to zero, whereas the scope and type features are strongly emphasized. This makes intuitive sense, as the type feature is invariant against many stylistic choices and captures the local function of the current syntactic element, whereas the scope feature indicates the rough position of the current syntactic element in the tree. Less frequent elements such as name, className, and returnType can fulfill auxiliary functions and disambiguate special types of nodes, namely class declarations, method declarations, and variable declarations.
