
and Manning 2014). We can include the cosine distance in BEDL easily by re-defining the cost function c_{φ,Ω} as follows.

$$c_{\phi,\Omega}(x,y) := \frac{1}{2}\Big(1 - s\big(\phi(x),\phi(y)\big)\Big), \qquad \text{(4.8)}$$

where

$$s\big(\phi(x),\phi(y)\big) = \frac{\big(\Omega \cdot \phi(x)\big)^\top \cdot \Omega \cdot \phi(y)}{\lVert \Omega \cdot \phi(x) \rVert \cdot \lVert \Omega \cdot \phi(y) \rVert}$$

For the gradient, we obtain:
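Computationally, the cosine-based cost amounts to the following (a minimal NumPy sketch; the function name and interface are our own, not code from the published implementation):

```python
import numpy as np

def cosine_cost(Omega, phi_x, phi_y):
    """Cosine-based cost c_{phi,Omega}(x, y) = (1 - s) / 2 as in Eq. 4.8.

    Omega: (m, n) linear transformation of the embedding.
    phi_x, phi_y: (n,) embedding vectors of the symbols x and y.
    """
    u, v = Omega @ phi_x, Omega @ phi_y
    # cosine similarity between the transformed embedding vectors
    s = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 * (1.0 - s)
```

Note that the cost is 0 for identical directions and 0.5 for orthogonal ones, so it stays in [0, 1] as required for an edit cost.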

$$\nabla_\Omega \, c_{\phi,\Omega}(x,y) = -\frac{1}{2}\, \nabla_\Omega \, s\big(\phi(x),\phi(y)\big)$$

$$= -\frac{1}{2} \cdot \frac{\Omega \cdot \phi(x) \cdot \phi(y)^\top + \Omega \cdot \phi(y) \cdot \phi(x)^\top}{\lVert \Omega \cdot \phi(x) \rVert \cdot \lVert \Omega \cdot \phi(y) \rVert} + \frac{1}{2} \cdot s\big(\phi(x),\phi(y)\big) \cdot \left[ \frac{\Omega \cdot \phi(x) \cdot \phi(x)^\top}{\lVert \Omega \cdot \phi(x) \rVert^2} + \frac{\Omega \cdot \phi(y) \cdot \phi(y)^\top}{\lVert \Omega \cdot \phi(y) \rVert^2} \right]$$
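One way to validate the gradient above is to compare it against a central-difference approximation (a NumPy sketch under our own naming; not part of the published implementation):

```python
import numpy as np

def cosine_cost(Omega, x, y):
    """c = (1 - cosine similarity) / 2 of the transformed embeddings."""
    u, v = Omega @ x, Omega @ y
    return 0.5 * (1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_cost_grad(Omega, x, y):
    """Analytic gradient of the cosine-based cost with respect to Omega."""
    u, v = Omega @ x, Omega @ y
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    s = (u @ v) / (nu * nv)
    # first term: -(1/2) * (Omega x y^T + Omega y x^T) / (|Omega x| |Omega y|)
    # second term: +(1/2) * s * (Omega x x^T / |Omega x|^2 + Omega y y^T / |Omega y|^2)
    return (-0.5 * (np.outer(u, y) + np.outer(v, x)) / (nu * nv)
            + 0.5 * s * (np.outer(u, x) / nu**2 + np.outer(v, y) / nv**2))

def numerical_grad(Omega, x, y, eps=1e-6):
    """Central-difference approximation, entry by entry."""
    g = np.zeros_like(Omega)
    for i in range(Omega.shape[0]):
        for j in range(Omega.shape[1]):
            E = np.zeros_like(Omega)
            E[i, j] = eps
            g[i, j] = (cosine_cost(Omega + E, x, y)
                       - cosine_cost(Omega - E, x, y)) / (2 * eps)
    return g
```

The two gradients agree up to the finite-difference error, which is a useful sanity check before plugging the cosine cost into the BEDL optimization.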

We will utilize both the basic Euclidean version of BEDL and the cosine distance variation in the next section, in which we evaluate BEDL experimentally.

neighbors for KNN in the range [1, 15], the kernel bandwidth for SVM in the range [0.1, 10], the sparsity parameter ν for the goodness classifier in the range [10^-5, 10], and the regularization strength λ for GESL and BEDL in the range 2·K·M·[10^-6, 10^-2]. We chose the number of prototypes for BEDL, as well as the number of neighbors for GESL, as the optimal number of prototypes K for MGLVQ.
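The hyper-parameter grids described above might be laid out as follows (a sketch; the dictionary layout, the grid resolutions, and the example values for K and M are our own illustrative choices, not taken from the thesis):

```python
import numpy as np

# Example values for the number of prototypes K and data points M
K, M = 5, 100

# Hypothetical search grids mirroring the ranges given in the text
grids = {
    'knn_neighbors': list(range(1, 16)),                    # [1, 15]
    'svm_bandwidth': np.logspace(-1, 1, 10),                # [0.1, 10]
    'goodness_nu':   np.logspace(-5, 1, 10),                # [1e-5, 10]
    'regularization': 2 * K * M * np.logspace(-6, -2, 10),  # 2*K*M*[1e-6, 1e-2]
}
```

Log-spaced grids are the natural choice here because the ranges for ν and λ span several orders of magnitude.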

As implementations, we used custom implementations of KNN, MGLVQ, the goodness classifier, GESL, and BEDL, which are available at https://doi.org/10.4119/unibi/2919994. For SVM, we utilized the LIBSVM standard implementation (Chang and Lin 2011). All experiments were performed on a consumer-grade laptop with an Intel Core i7-7700HQ CPU.

Artificial Datasets

We evaluate the default tree edit distance, GESL, and BEDL on the Strings and the Gap datasets from Section 3.2. The results are shown in Table 4.1. On both datasets, BEDL consistently reduced the error to 0%. Closer inspection revealed that BEDL did indeed consistently identify the desired representation, namely embedding the symbols a and b, as well as c and d, at the same point respectively (also refer to Figure 4.3, left).

By contrast, GESL only achieved low errors for the goodness classifier, while remaining at high errors for all other classifiers. Using a one-sided Wilcoxon signed-rank test, we found that BEDL significantly outperformed GESL and the initial edit distance for the KNN and MGLVQ classifiers on both datasets (p < 0.001 after Bonferroni correction).
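Such a comparison can be sketched with SciPy's `scipy.stats.wilcoxon`; the per-fold error values below are hypothetical placeholders, not the actual experimental results:

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold test errors for two metric learners (10 folds)
bedl_err = [0.00, 0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00]
gesl_err = [0.25, 0.18, 0.30, 0.22, 0.19, 0.28, 0.24, 0.21, 0.26, 0.23]

# One-sided test: are the BEDL errors systematically lower than the GESL errors?
stat, p = wilcoxon(bedl_err, gesl_err, alternative='less')

# Bonferroni correction across, e.g., 4 classifier comparisons
n_tests = 4
p_corrected = min(1.0, p * n_tests)
significant = p_corrected < 0.05
```

The Bonferroni correction simply multiplies the raw p-value by the number of simultaneous tests, which keeps the family-wise error rate at the nominal level.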

Note that the GESL results differ from those reported in Section 3.2. This is likely due to different cross-validation folds, the fact that we used an MGLVQ instead of an RGLVQ classifier, and the fact that we optimized classifier hyper-parameters, which may have led to different choices compared to the previous experiments.

Regarding runtime, we note that GESL is clearly faster due to its convex programming structure, with a runtime advantage of roughly a factor of 20 to 30.

Real-World Data

Beyond the artificial data, we evaluated our methods on six real-world datasets.

CopenhagenChromosomes: A balanced two-class dataset of 400 chromosome density strings, as described in Section 3.2.

MiniPalindrome: A balanced eight-class dataset of 48 Java programs, where each class represents one strategy to detect whether an input string contains only palindromes (Paaßen 2016b). The programs are represented by their abstract syntax tree, where the label corresponds to one of 24 programming concepts (e.g. class declaration, function declaration, method call, etc.).

Sorting: A two-class dataset of 64 Java sorting programs as described in Section 3.2.

Cystic: A dataset of 160 glycan molecules, where the class label 1 is assigned to every molecule associated with cystic fibrosis and 0 is assigned to all other molecules. The molecules were extracted from the KEGG/Glycan database (Hashimoto et al. 2006)

Table 4.1: The mean test classification error and runtimes for metric learning on the artificial datasets, averaged over the cross-validation trials, as well as the standard deviation. The columns show the metric learning schemes, the rows the different classifiers used for evaluation. The table is sub-divided for each dataset. The lowest classification error for each dataset is highlighted in bold print.

classifier    initial       GESL           BEDL

Strings
KNN           21.0±10.2%    23.0±10.8%      0.0±0.0%
MGLVQ         36.0±15.7%    34.0±11.0%      0.0±0.0%
SVM            9.0±11.2%    10.0±8.6%       0.0±0.0%
goodness      11.5±9.3%      0.5±2.2%       0.0±0.0%
runtime [s]    0±0           0.030±0.002    1.077±0.098

Gap
KNN           30.0±10.8%    22.5±16.8%      0.0±0.0%
MGLVQ         49.5±17.0%    48.5±16.6%      0.0±0.0%
SVM            0.0±0.0%      5.0±13.6%      0.0±0.0%
goodness       0.5±2.2%      0.5±2.2%       0.0±0.0%
runtime [s]    0±0           0.037±0.004    0.865±0.139

according to the scheme described by Gallicchio and Micheli (2013). Each molecule is represented as a tree, where the label corresponds to mono-saccharide identifiers (one out of 29) and the roots are chosen according to biological meaning (Hashimoto et al. 2006).

Leukemia: A dataset of 442 glycan molecules from the same source as the Cystic dataset. For this dataset, a class label of 1 represents that the molecule is associated with leukemia.

Sentiment: A large-scale two-class dataset of 9613 sentences from movie reviews, where one class (4650 trees) corresponds to negative and the other class (4963 trees) to positive reviews. The sentences are represented by their syntax trees, where inner nodes are unlabeled and leaves are labeled with one of over 30,000 words (Socher, Pennington, et al. 2011). Note that GESL is not practically applicable to this dataset, as the number of parameters to learn scales quadratically with the number of words, i.e. > 30,000². To make BEDL applicable in this case, we do not learn a full embedding; instead, we initialize the embedding matrix with the 300-dimensional Common Crawl GloVe embedding (Pennington, Socher, and Manning 2014), which we reduce via PCA, retaining 95% of the data variance (m = 16.4 ± 2.3 dimensions on average ± standard deviation). We adapt this initial embedding via a linear transformation, using the cosine distance (refer to Equation 4.8) instead of the Euclidean distance, as introduced in the previous section.
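The PCA reduction step can be sketched in plain NumPy as follows (`reduce_embedding` is our own illustrative name, and the input matrix stands in for the pretrained GloVe vectors of the words in the dataset):

```python
import numpy as np

def reduce_embedding(E, variance=0.95):
    """Reduce an (n_words x 300) embedding matrix via PCA, retaining the
    given fraction of the data variance (a sketch, not the thesis code).

    Returns the reduced embedding and the number of retained dimensions m.
    """
    E_centered = E - E.mean(axis=0)
    # The squared singular values are proportional to per-component variance
    _, sing, Vt = np.linalg.svd(E_centered, full_matrices=False)
    explained = sing**2 / np.sum(sing**2)
    # Smallest m such that the first m components reach the variance target
    m = int(np.searchsorted(np.cumsum(explained), variance)) + 1
    return E_centered @ Vt[:m].T, m
```

In the experiments described above, this reduction would be applied per cross-validation fold, which explains why the retained dimensionality m varies (16.4 ± 2.3 on average).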

The results of our experiments are displayed in Table 4.2. On all datasets and for all classifiers, BEDL yields a lower classification error than GESL. Furthermore, on four of the six datasets, BEDL yields the best overall classification results (the exceptions being CopenhagenChromosomes and Cystic). In five out of six cases, BEDL could improve the

Table 4.2: The mean test classification error and runtimes for metric learning on the real-world datasets, averaged over the cross-validation trials, as well as the standard deviation. The columns show the metric learning schemes, the rows the different classifiers used for evaluation. The table is sub-divided for each dataset. The lowest classification error for each dataset is highlighted in bold print.

classifier    initial       GESL           BEDL

CopenhagenChromosomes
KNN            4.5±4.6%     14.8±7.7%       6.2±7.6%
MGLVQ         13.2±7.8%     26.8±9.4%      11.2±8.4%
SVM            2.7±3.4%     21.2±10.6%      5.3±7.2%
goodness       3.0±4.1%      7.0±6.2%       6.0±7.7%
runtime [s]    0±0           4.833±1.200   10.267±1.954

MiniPalindrome
KNN           12.5±11.2%    12.5±7.9%      10.4±9.4%
MGLVQ          2.1±5.1%      4.2±6.5%       0.0±0.0%
SVM            4.2±6.5%     20.8±15.1%      0.0±0.0%
goodness       6.2±6.8%     14.6±5.1%       8.3±10.2%
runtime [s]    0±0           0.103±0.014    2.785±0.631

Sorting
KNN           15.6±8.8%     18.8±16.4%     10.9±8.0%
MGLVQ         14.1±10.4%    14.1±8.0%      14.1±8.0%
SVM           10.9±8.0%      9.4±8.8%       9.4±8.8%
goodness      15.6±11.1%    17.2±14.8%     17.2±9.3%
runtime [s]    0±0           0.352±0.102    3.358±0.748

Cystic
KNN           31.2±6.6%     32.5±10.1%     28.1±8.5%
MGLVQ         34.4±6.8%     33.1±9.8%      30.0±10.1%
SVM           28.1±9.0%     33.1±8.9%      29.4±12.5%
goodness      28.1±8.5%     26.2±14.4%     24.4±13.3%
runtime [s]    0±0           0.353±0.292    0.864±0.767

Leukemia
KNN            7.5±2.6%      8.2±4.6%       7.3±4.3%
MGLVQ          9.5±4.0%     10.9±4.7%       9.5±3.0%
SVM            7.0±4.1%      8.8±2.9%       6.8±4.7%
goodness       6.1±4.3%     10.0±4.4%       6.3±3.8%
runtime [s]    0±0           2.208±0.919    6.550±2.706

Sentiment
KNN           40.2±2.8%     −              38.2±3.3%
MGLVQ         44.0±2.6%     −              41.3±5.7%
SVM           34.3±3.0%     −              33.3±3.6%
goodness      43.7±1.9%     −              42.5±3.1%
runtime [s]    0±0          −              69.385±58.064

[Figure residue removed: two PCA scatter plots. In the left panel (Strings), the symbols a, b, and − as well as c and d appear at shared points; in the right panel (MiniPalindrome), the concepts block, modifiers, while, and parameterized type are marked.]

Figure 4.3: A PCA of the learned embeddings for the Strings (left) and the MiniPalindrome dataset (right), covering 100% and 83.54% of the variance respectively.

accuracy for KNN (the exception being CopenhagenChromosomes), in four out of six cases for SVM (the exceptions being CopenhagenChromosomes and Cystic), in four out of six cases for MGLVQ (for Sorting and Leukemia it stayed equal), and in two out of six cases for the goodness classifier. For the Sentiment dataset, we can also verify this result statistically, with p < 0.05 for all classifiers.

Note that the focus of our work is to improve classification accuracy via metric learning, not to develop state-of-the-art classifiers as such. However, we note that our results on the Sorting dataset outperform the best reported result of 15% by Paaßen, Mokbel, and Hammer (2016). For the Cystic dataset, we improve the AUC from 76.93 ± 0.97% (mean and standard deviation across cross-validation trials) to 79.2 ± 13.6%, and for the Leukemia dataset from 93.8 ± 3.3% to 94.6 ± 4.5%. Both values are competitive with the results obtained via recursive neural networks and a multitude of graph kernels by Gallicchio and Micheli (2013). For the Sentiment dataset, we obtain an SVM classification error of 27.51% on the validation set, which is noticeably worse than the reported literature results of around 12.5% (Socher, Pennington, et al. 2011). However, we note that we used considerably less data to train our classifier due to the cost of eigenvalue correction (only 500 points for validation).

While most embeddings of BEDL were too intrinsically high-dimensional to inspect visually, the embedding for the MiniPalindrome dataset revealed that most symbols could be embedded close to zero, while a few discriminative syntactic concepts remained distinct from zero, thus giving an indication of the relevant syntactic concepts for the given task (refer to Figure 4.3).
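This kind of inspection can be sketched as follows (assuming NumPy; the function name, the threshold, and the example data are our own illustrative choices, not part of the thesis code):

```python
import numpy as np

def discriminative_symbols(embedding, symbols, threshold=0.1):
    """Return the symbols whose learned embedding vector lies far from zero.

    embedding: (n_symbols, m) matrix of learned embedding vectors.
    symbols: list of symbol names, e.g. syntactic concepts.
    threshold: minimum Euclidean norm for a symbol to count as discriminative.
    """
    norms = np.linalg.norm(embedding, axis=1)
    return [sym for sym, n in zip(symbols, norms) if n > threshold]
```

Symbols embedded near zero contribute little to the learned distance, so the surviving symbols indicate which concepts the metric considers relevant.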

Interestingly, GESL tended to decrease classification accuracy compared to the initial tree edit distance. Likely, GESL requires more neighbors K for better results (Bellet, Habrard, and Sebban 2012). However, scaling up to a high number of neighbors led to prohibitively high runtimes for our experiments, such that we do not report these results here. These high runtimes can be explained by the fact that the number of slack variables in GESL increases with O(M·K), where M is the number of data points and K is the number of neighbors. This scaling behavior is also visible in our experimental results: for datasets with few data points and neighbors, such as Strings, MiniPalindrome, and Sorting, GESL is 10 to 30 times faster than BEDL, whereas for CopenhagenChromosomes, Cystic, and Leukemia, the runtime advantage shrinks to a factor of 2 to 3.

Ablation Studies

In ablation studies, we studied the difference between GESL and BEDL in more detail. In particular, we tested the following design choices:

1. Classic GESL (G1),

2. GESL using cooptimal frequency matrices instead of a single tree mapping matrix (G2),

3. GESL using cooptimal frequency matrices and the prototypes from MGLVQ as neighbors N+ and N− (G3),

4. LVQ tree edit distance learning, directly learning the cost function parameters instead of an embedding, with a pseudo-metric normalization after each gradient step (L1), and

5. BEDL as proposed (L2).

Note that, for the ablation studies, we re-used the hyper-parameters which were optimal for the reference versions of the methods (G1 and L2).

Figure 4.4 shows the average classification error and standard deviation (as error bars) for all tree-structured datasets and the string dataset, both for the pseudo-edit distance as in Equation 4.2 and for the actual tree edit distance using the learned cost function.

We observe that using cooptimal frequency matrices (G2) and MGLVQ prototypes instead of ad-hoc nearest neighbors (G3) improved GESL on the MiniPalindrome dataset, worsened it on the Strings dataset, and otherwise showed no remarkable difference on the Sorting, Cystic, and Leukemia datasets.

Regarding the LVQ tree edit distance learning variants L1 and L2, we note that BEDL improved the error for the actual tree edit distance but worsened the result for the pseudo-edit distance.

In general, GESL variants performed better for the pseudo-edit distance than for the actual tree edit distance, while LVQ variants performed better for the actual tree edit distance than for the pseudo-edit distance.
