

5.2.4 Multi-Ontology Classification

Here, we will investigate the effect of joining and separating the labelsets of a multi-ontology dataset. In the reviewed literature, no such case was examined; each labelset was always predicted independently. In this experiment, each instance of a dataset is assigned labels from multiple labelsets. As proposed in several approaches in the MLC literature, using the whole labelset is not always advisable, so some approaches break the unique labels into smaller ones (see Sections 2.1.5 and 3.1.2). We will not examine such approaches, because they introduce an additional parameter (to control the splitting of the unique labels) and, in the case of an evolving label stream, they can miss changes and theoretically hinder the learning procedure (which also conflicts with online learning).

The question here is also much more straightforward to explore: we will use the predefined labelsets as the split point, i.e. we will compare the results of predicting both labelsets simultaneously with the results of the previous, separate setup.
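As an illustration of the two setups, the following sketch uses scikit-learn's binary-relevance wrapper around a linear SVM as a stand-in for the classifiers evaluated here, and randomly generated toy data instead of the actual feature and label matrices; mF1 is assumed to correspond to the micro-averaged F1. One model is trained per labelset and, alternatively, a single model is trained on the column-wise concatenation of both indicator matrices.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-ins: X are features, Y1/Y2 are the binary indicator matrices
# of the two labelsets (e.g. Topics and Industries).
n_train, n_test, n_feat = 200, 100, 50
X_train, X_test = rng.random((n_train, n_feat)), rng.random((n_test, n_feat))
Y1_train, Y1_test = rng.integers(0, 2, (n_train, 6)), rng.integers(0, 2, (n_test, 6))
Y2_train, Y2_test = rng.integers(0, 2, (n_train, 8)), rng.integers(0, 2, (n_test, 8))

def br_svm():
    # binary relevance: one linear SVM per label
    return OneVsRestClassifier(LinearSVC())

# separate setup: one model per labelset
f1_separate = [
    f1_score(Y_te, br_svm().fit(X_train, Y_tr).predict(X_test), average="micro")
    for Y_tr, Y_te in [(Y1_train, Y1_test), (Y2_train, Y2_test)]
]

# joint setup: concatenate the indicator matrices column-wise and train once
Y_joint_train = np.hstack([Y1_train, Y2_train])
Y_joint_test = np.hstack([Y1_test, Y2_test])
f1_joint = f1_score(Y_joint_test,
                    br_svm().fit(X_train, Y_joint_train).predict(X_test),
                    average="micro")
print(f1_separate, f1_joint)
```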

Reuters Topics-Industries

Table 5.11 shows the results for the multi-label classification comparison between ML-ARAM, ML-HARAM and the selected classifiers.

30 A side remark regarding the C++ implementation and the stripped SVM BR version: the C++ sparse ML-HARAM with clustering vigilance 0.975 took about 2.8 seconds to train, 230 seconds to create the clusters, and about 50 seconds to classify per slice and voter. The stripped SVM BR took about 2400 seconds to train and 23 seconds to test per slice. As can be seen from the test times in the table, the stripped version was used here.

Table 5.11: Results for labelset Topics-Industries

Classifier   Accuracy       mF1            IF1            LF1            tr(s)           ts(s)
ML-ARAM      0.513±0.006    0.652±0.005    0.635±0.005    0.255±0.006    103.328±4.8     1490.025±62.2
ML-HARAM     0.486±0.006    0.628±0.005    0.607±0.006    0.247±0.007    101.062±5.0     667.751±23.7
ML-kNN       0.342±0.006    0.503±0.006    0.472±0.006    0.172±0.007    855.1±15.8      91.2±1.6
SVM BR       0.619±0.006    0.744±0.005    0.739±0.005    0.334±0.009    927.6±27.5      141.4±4.7
SVM LP       0.501±0.006    0.622±0.005    0.622±0.005    0.263±0.009    5522.2±314.3    4.6±0.2
TWCNB        0.198±0.003    0.295±0.004    0.312±0.004    0.192±0.004    17.277±0.356    10.575±0.533

The performance of SVM BR can be estimated from Tables 5.7 and 5.8 using a weighted sum over the label cardinalities of the datasets: mF1e = (3.65*0.829 + 3.61*0.645)/(3.65 + 3.61) = 0.738, a small difference of 0.004 from the actual value, which can be explained by the variances of the means (used in the label cardinalities). For SVM LP, the estimate, if the labelsets were predicted separately and then remerged, would be mF1e = (3.65*0.808 + 3.61*0.600)/(3.65 + 3.61) = 0.705 against an actual 0.622. This gap of about 0.08 points is significant and cannot be ignored. ML-ARAM (and similarly ML-HARAM) performs better: the combined values would give an estimated mF1e = 0.682, while it achieves only mF1 = 0.659, a much smaller gap than for SVM LP. The estimated and actual mF1 are nearly the same for ML-kNN, which makes sense, since the labels are evaluated independently.

The TWCNB performance is the same as for Industries, mostly because the threshold is simply too high; this will be discussed later in the improvement section. For LF1, this estimation method must be modified: instead of weighting with the cardinality, the weights are the numbers of labels in the respective labelsets. The estimate for SVM BR is LF1e = (103*0.445 + 364*0.301)/(103 + 364) = 0.332 against an actual 0.334, which is very close. For LP, the estimate is LF1e = (103*0.363 + 364*0.302)/(103 + 364) = 0.315 versus an actual 0.263, a noticeable difference of around 0.05. ML-(H)ARAM perform even worse, with differences of around 0.10 to the estimated values.
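To make the estimation procedure explicit, the following minimal Python sketch computes the weighted averages used for both measures, with the per-labelset scores and weights taken from the tables and text above.

```python
def weighted_estimate(scores, weights):
    """Weighted mean of per-labelset scores: for mF1 the weights are the
    label cardinalities, for LF1 the numbers of labels per labelset."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# SVM BR on Reuters, estimated from the separate Topics/Industries runs
mF1_est = weighted_estimate([0.829, 0.645], [3.65, 3.61])  # ~0.738
LF1_est = weighted_estimate([0.445, 0.301], [103, 364])    # ~0.332
print(mF1_est, LF1_est)
```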

The difficulty for the LP-based classifiers lies, to a large part, in the number of unique labels in the dataset: 12,790. While Topics had 2,236 unique labels and Industries 3,645, the combination of both creates many new unique labels. This consequently diminishes the number of training samples per unique label, greatly increasing the difficulty of the task for pure LP methods such as SVM LP. One reason why ML-(H)ARAM performed better is that it does not use a WTA rule on unique labels as SVM LP does. In the next experiment, we will examine further how unique labels influence the prediction quality of simultaneously predicted labelsets.
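The growth in unique label combinations caused by joining labelsets can be counted directly on the binary indicator matrices. A standalone sketch with random toy matrices as stand-ins (the real counts quoted above being 2,236, 3,645 and 12,790):

```python
import numpy as np

def n_unique_labelsets(Y):
    # number of distinct label combinations (rows) of a binary indicator matrix
    return np.unique(Y, axis=0).shape[0]

rng = np.random.default_rng(1)
Y_topics = rng.integers(0, 2, (1000, 12))      # stand-in for the Topics labelset
Y_industries = rng.integers(0, 2, (1000, 15))  # stand-in for the Industries labelset

n_joint = n_unique_labelsets(np.hstack([Y_topics, Y_industries]))
print(n_unique_labelsets(Y_topics), n_unique_labelsets(Y_industries), n_joint)
# the joint count is at least the larger of the two single counts and
# typically much larger, so each unique combination has fewer training samples
```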

EUR-Lex

Table 5.12 depicts the results of this experiment. Most of the results are very similar to those of EUR-Lex EUROVOC, since that labelset has a higher cardinality and more labels and unique labels than the DC labelset, and therefore dominates the results.

Table 5.12: Results for dataset DC-EUROVOC: ML-ARAM and ML-HARAM (H) trained with vigilance=0.999, threshold=0.0001, voters=5, and NAC=3, clustering vigilance in brackets

Classifier   Accuracy       mF1            IF1            LF1            tr(s)          ts(s)
ML-ARAM      0.438±0.010    0.522±0.018    0.542±0.013    0.304±0.008    23.4±1.7       4.2E3±286.4
H [0.9]      0.489±0.007    0.642±0.007    0.610±0.007    0.339±0.007    71.3±1.1       1.3E3±34.3
H [0.99]     0.517±0.005    0.657±0.005    0.643±0.004    0.349±0.004    17.7±1.4       1.3E4±1.1E3
ML-kNN       0.412±0.008    0.577±0.008    0.541±0.009    0.239±0.005    2.0E3±16.8     225.0±2.0
SVM BR       0.536±0.005    0.683±0.005    0.671±0.005    0.392±0.008    1.5E4±208.6    1.9E3±60.3
SVM LP       0.462±0.006    0.591±0.006    0.586±0.005    0.308±0.006    5.4E3±619.8    31.5±0.4
TWCNB        0.224±0.007    0.253±0.009    0.347±0.009    0.028±0.002    378.0±23.1     68.9±1.8

We can further see here that the weighted sum over the label cardinalities yields the same relation between estimated and actual values: for SVM BR, mF1e = (3.7474*0.835 + 15.4963*0.627)/(3.7474 + 15.4963) = 0.668 against an actual mF1 = 0.683 (i.e. the actual is 0.015 higher). The same holds for SVM LP: mF1e = (3.7474*0.828 + 15.4963*0.550)/(3.7474 + 15.4963) = 0.604 against an actual 0.591. The estimation also holds for SVM BR and LF1: LF1e = (440*0.594 + 4207*0.363)/(440 + 4207) = 0.385, which is close to the actual 0.392. The estimated LF1e = 0.304 for SVM LP is, this time, close to the actual 0.308. The number of unique labels increased only slightly, from 16,400 for EUROVOC to 16,965 for DC-EUROVOC; apparently, the DC structure is partly embedded in EUROVOC. This is our main motivation for focusing on increasing the performance on the classes with smaller support: with more training samples, the DC classes can be predicted more easily, and so the EUROVOC classes “connected” to these DC classes can be better predicted. We will investigate this in Section 5.4.
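One way to probe such an embedding, assuming both labelsets are available as binary indicator matrices, is to count how many distinct DC combinations co-occur with each distinct EUROVOC combination; a ratio close to one would support the observation above. The following sketch is purely illustrative, and the matrices Y_eurovoc and Y_dc are hypothetical placeholders, not part of the published setup.

```python
import numpy as np
from collections import defaultdict

def mean_companions(Y_large, Y_small):
    """Average number of distinct Y_small combinations observed per distinct
    Y_large combination; a value near 1.0 suggests Y_small is embedded."""
    seen = defaultdict(set)
    for row_large, row_small in zip(Y_large, Y_small):
        seen[tuple(row_large)].add(tuple(row_small))
    return np.mean([len(v) for v in seen.values()])

# toy check: Y_small fully determined by Y_large -> ratio is exactly 1.0
Y_large = np.array([[1, 0], [1, 0], [0, 1]])
Y_small = np.array([[1], [1], [0]])
print(mean_companions(Y_large, Y_small))  # 1.0

# hypothetical real call on the EUR-Lex indicator matrices:
# print(mean_companions(Y_eurovoc, Y_dc))
```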

5.2.5 Discussion

In this section, we performed various classification experiments with two objectives: to show that ML-HARAM is an important development of ML-ARAM and to point out the promising potential for improving MLC, especially with large labelsets.

To the best of the author’s knowledge, the comparison of MLC prediction on two ontologies predicted separately versus simultaneously is new in the literature; similar experiments had so far only been performed on a single labelset. We showed evidence of substructures of the smaller labelset being embedded in the larger labelset. The prediction quality was higher for the smaller labelset, suggesting that an explicit connection between the two can improve the prediction.

Further, we showed evidence for the advantages of ML-HARAM over ML-ARAM.

Based on the findings of these experiments, we can state that ML-(H)ARAM can outperform SVM BR in some cases (Yeast, EUR-Lex DC, and Reuters Topics and Industries showed comparable results). Where results were comparable, SVM BR was most of the time better in the mF1 and IF1 measures, whereas ML-(H)ARAM performed better in LF1. Although for EUR-Lex EUROVOC SVM LP was much worse than SVM BR, ML-HARAM's results lay between the two, and ML-ARAM was slightly worse than SVM LP. As shown on the several datasets, there is strong indication that ML-(H)ARAM has state-of-the-art prediction quality.

Scalability

ML-HARAM can easily outperform ML-ARAM in terms of speed with comparable prediction quality. In difficult tasks with large labelsets, it can even achieve better prediction quality, as in EUR-Lex EUROVOC. Further, with an increasing number of labelsets and examples, it can classify comparatively faster. Using an LP approach without match tracking for training and a clustering of the input space for testing allows ML-HARAM to compare the input to only a fraction of the possible objects. This is an important consideration in MLC, since the possible multi-label combinations can become a problem in large datasets with large labelsets.
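The clustering idea can be sketched independently of the actual C++ implementation: prototypes are grouped offline (here with k-means as a stand-in), and at test time the query is compared only to the members of its nearest cluster, i.e. to roughly a 1/k fraction of all prototypes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
prototypes = rng.random((10_000, 64))   # stand-in for the learned prototypes
query = rng.random(64)

# offline: cluster the prototypes once
k = 100
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(prototypes)

# online: restrict the similarity computation to the query's cluster
cluster_id = km.predict(query[None, :])[0]
candidates = prototypes[km.labels_ == cluster_id]  # ~1/k of all prototypes
similarities = candidates @ query                  # placeholder similarity
best_local = candidates[np.argmax(similarities)]
```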

We will now investigate how to select some of these multi-label combinations to fulfill different tasks.