
5.3 Knowledge Extraction

5.3.4 EUR-Lex

In this experiment, there was no ground truth rule set to be discovered, but we investigated how often evident rules appeared in the top 200 rules for each measure, with the objective of seeing how the measures and the ATS method behave on this dataset. These evident rules were obvious ones such as "Motor vehicles → motor vehicle" or "Silkworms → sericulture" and should receive high scores from all IMs. Although 200 rules are a very small fraction of the whole rule set, the pruning methods should be able to retain them.

We must emphasize that a bad F-1 score for the top 200 is not always a sign that the measure is unsuitable for the task, but rather that it may discover unexpected rules. Since our focus lies on the ATS pruning method on an exploratory dataset, we leave a more thorough examination of the rule set for each metric as future work.

Another comparison was performed between true and predicted labels with the aim of selecting a measure for the ATS method that behaved similarly with both. The reason is that the distribution of labels in the predicted label set is usually expected to be similar to that of the true label set, and the combination of an IM with pruning should not distort it. Thus, each deviation can point to an important aspect of the classification process, which can later extend the knowledge about the domain or improve the classifier accuracy. The true labels were used to extract the reference rules, and the predictions were used to extract the rules to be compared to them. The predictions were obtained by the classifier trained in 10-fold cross-validation fashion.

The number of rules extracted without any pruning was 103,185 for true labels and 84,199 for predicted labels. There were more rules for true than for predicted labels because the classifier did not predict labels deep in the hierarchy unless they were ranked sufficiently high. Another reason is that some unique multi-labels were preferred over others, i.e. were predicted more often than they appeared in the training set, changing the co-occurrence of the labels. Thus, some co-occurrences of labels in the true set did not happen in the predicted one. The results of the rule extraction and pruning are depicted in Table 5.15. The starting rule sets are the results of pruning with PAR, i.e. a rule is pruned if its value is not higher than that of any of its parent rules.
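The PAR criterion described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `par_prune` and the representation of rules as (antecedent, consequent) pairs are our own assumptions, and the `parents` callback stands in for however the label hierarchy supplies parent rules.

```python
def par_prune(rules, parents):
    """Keep a rule only if its IM value is strictly higher than the
    value of every parent rule present in the rule set.

    rules:   dict mapping (antecedent, consequent) -> IM value
    parents: callable mapping a rule to its parent rules, e.g. the
             rules whose consequent is the hierarchical parent label
    """
    kept = {}
    for rule, value in rules.items():
        parent_values = [rules[p] for p in parents(rule) if p in rules]
        # a rule with no parent rules in the set is always kept
        if all(value > pv for pv in parent_values):
            kept[rule] = value
    return kept
```

For example, a rule "A → car" scoring no higher than its parent rule "A → vehicle" would be pruned, while a child rule with a strictly higher score survives.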

For assessing the quality of the intersection between the rule sets of true labels and predicted ones, we used an analog of the F-1 measure, denoted as F0. It was calculated as the harmonic mean of precision and recall, where the size of the intersection divided by the number of rules for true labels corresponded to recall, and the same size divided by the number of rules for predicted labels corresponded to precision. The last columns of Table 5.15 show the results of searching the top 200 rules of each rule set for evident rules.

Table 5.15: EUR-Lex DC-EUROVOC results: extracted rules with PAR; recall (R), precision (P) and F0 are computed over the intersection of true labels (tl) and predicted ones (pl). Pt: number of evident rules divided by 2 in the top 200 rules for true labels; Pp: the same as Pt for predicted labels. fint: number of rules in the intersection between the top 200 for true and predicted labels; eint: number of evident rules among the rules of fint.

Measure |         PAR         |           TPS            |      top 200
        |    tl     pl     F0 |    tl    pl   int     F0 |   Pt    Pp fint eint
--------+---------------------+--------------------------+---------------------
Sup     | 18843  14973  71.58 |   495   415   345  75.82 | 49.5    43  168   78
Cnf     | 79557  62774  58.38 | 10383  8887  5717  59.34 |   23  13.5   69    6
Jac     | 63357  50550  57.64 |  3377  3137  1952  59.93 | 84.5    69   99   84
Cos     | 56770  45162  57.28 |  5448  4323  2894  59.24 |   83    65   95   80
ACnf    | 63568  50033  58.06 |  3882  3122  2036  58.14 | 81.5  63.5  106   83
Kulc    | 50431  40617  55.26 |  9614  9445  4951  51.95 |   80  59.5   92   75
Lift    | 79557  62774  58.38 |  1076  1093   347  32.00 | 41.5  39.5   43   24
Cnv     | 79557  62774  58.38 |   990   644   349  42.72 | 80.5    58   59   49
Crf     | 79557  62774  58.38 |  6627  6383  3792  58.29 |   23  13.5   58    6
Seb     | 79557  62774  58.38 |   939   767   414  48.53 |   77  55.5   71   54
PS      | 62041  53960  56.01 |   840   949   620  69.31 | 63.5  56.5  150   94
BF      | 79557  62774  58.38 |  1287  1115   395  32.89 |   57    49   44   28
Loe     | 79557  62774  58.38 |  7620  6986  4258  58.30 |   23  13.5   57    6
CCnf    | 79557  62774  58.38 |  7413  7352  4254  57.62 |   39    29   27    8
JacDif  | 66461  52408  56.23 |  1253  1142   673  56.20 | 80.5  64.5   95   75
CnfDif  | 74536  57442  56.66 |  4390  3834  2180  53.02 |   34    21   16    7
SupDif  | 44054  37394  54.98 |   191   180   141  76.01 | 57.5  48.5  153   86

Figure 5.7: Relation factors between each iteration step; each gray level stands for an iteration step of TPS for true labels. (Y-axis: logarithmic factor for each iteration step of TPS.)
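The F0 measure defined above is straightforward to compute from the two rule sets. This is a small sketch; the function name `f0_score` is our own, and rules are represented as hashable items so that set intersection applies.

```python
def f0_score(rules_true, rules_pred):
    """F0: harmonic mean of precision and recall over the intersection
    of two rule sets.
    recall    = |intersection| / |rules_true|
    precision = |intersection| / |rules_pred|
    """
    true_set, pred_set = set(rules_true), set(rules_pred)
    inter = len(true_set & pred_set)
    if inter == 0:
        return 0.0
    recall = inter / len(true_set)
    precision = inter / len(pred_set)
    return 2 * precision * recall / (precision + recall)
```

Note that the harmonic mean simplifies to 2·|intersection| / (|rules_true| + |rules_pred|), so F0 grows with the overlap relative to the combined sizes of both rule sets.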

The relative difference between the sizes of the true label rule sets for PAR and TPS and that of the predicted label rule sets was almost the same: this means that TPS cut proportionally in the case of true labels and predicted ones. After PAR, there were more rules for true labels than for predicted ones for all measures. The same also applied after TPS, except for Lift and PS. Since the ATS method's thresholds resulted in rule sets of similar sizes for true and predicted labels, and since these sets were similar, the F0 values were relatively high for all measures.

Another interesting observation was that Cnf (and the measures ranking like Cnf), Crf and Loe had many more rules than all the other measures. This is mainly because there were 466 rules with the maximum value and several with values near the maximum but with a shallow slope, which shifted the peak of the difference values; therefore, the threshold was reached later than for the other measures. For TPS, Cnf, Crf and Loe had very different results. This is related to the fact that TPS did not treat these measures equally, since the lines connecting the maximum and the minimum of each curve did not have the same slope. In addition, for TPS, the global aspects of the curve affect the choice of the threshold. Crf and Loe had negative values, so that their tangents' slopes were steeper than that of Cnf, despite a strong similarity at the beginning of the curves.

All of the measures had more than 200 entries after the first step of pruning by the ATS method, except for CCnf and SupDif, as well as SupDif for TPS. The TPS method does not need any adjustment of parameters, making it very appealing for a fully automated setup. Furthermore, it can be applied iteratively until the rule set falls below a predefined size. Thus, in each iteration step, zones with poor discrimination can be pruned.
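The iterative application can be sketched as follows. This is only an illustration of the loop structure: the exact TPS cut-off criterion is not reproduced here, and the maximum-distance-to-chord ("elbow") step used below is an assumed stand-in based on the line connecting the maximum and minimum of the sorted curve; the names `elbow_cutoff` and `iterative_tps` are our own.

```python
def elbow_cutoff(values):
    """One cut-off step on IM values sorted in descending order:
    return the index with the largest vertical distance between the
    curve and the straight line from its maximum to its minimum."""
    n = len(values)
    vmax, vmin = values[0], values[-1]
    best_i, best_d = 0, -1.0
    for i, v in enumerate(values):
        line = vmax + (vmin - vmax) * i / (n - 1)
        d = line - v  # distance of the curve below the line
        if d > best_d:
            best_i, best_d = i, d
    return best_i

def iterative_tps(values, min_rules=20):
    """Apply the cut-off step iteratively until fewer than min_rules
    remain or all remaining values are equal (no discrimination left)."""
    values = sorted(values, reverse=True)
    steps = []
    while len(values) >= min_rules and values[0] != values[-1]:
        cut = elbow_cutoff(values)
        if cut == 0:  # no further progress possible
            break
        steps.append(cut)
        values = values[:cut]  # keep only the rules above the elbow
    return values, steps
```

Because each step rescales to the range of the surviving values, the cut-off always adapts to the discrimination range at hand, which is the property exploited in the experiments.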

Further, the TPS method does not depend on any parameter since it is auto-scaling, i.e. each time rules are pruned, it adapts the discrimination range so that the cut-off threshold depends on the range at hand. Therefore, we applied this method until fewer than 20 rules were extracted or the method could not remove any more rules because all the remaining rules had the same value. Figure 5.7 shows the compression rates of each iteration step for each measure. The whiter the bar, the more steps were needed to reach the final step, with fewer than 20 rules. The mean number of steps was 4.7; most measures required three or five steps, as can be seen in the dark bars of Figure 5.7. SupDif had the highest compression factor on the first step, since the number of rules in the rule set decreased from 44,054 to 191 and then to 18. This indicates that the most significant rules were selected in only two steps. For BF, TPS needed only three steps to prune all but 16 rules. In contrast, CCnf and CnfDif required 11 and 10 steps to reach 12 and 15 rules, respectively. This was mainly because their curves were very flat on the top 100 rules (i.e. had a poor discrimination) and did not have a smooth course, as can be seen in Figure 5.8, which shows how TPS performed iteratively on Cnf, Jac, BF and CCnf. The curves are plotted on a logarithmic scale over the number of rules so that the tangents are better visible and do not hide each other. The graphs also show that TPS placed the cut-off points almost equally spaced on the logarithmic scale, which highlights the advantage of being scale-invariant. Another interesting finding regards how the thresholds were selected for Cnf: they were first 1/8, then 1/3, 1/2, 2/3, and finally 1, which means that they were well distributed over the range [0,1].

Figure 5.8: Sorted values of AR for Cnf and Jac. Iteration step cut-off points are marked with crosses and the dashed lines are their respective tangents for true labels.

We used the 200 evident rules for a twofold purpose: first, to identify which metrics ranked the most evident rules at the top. These evident rules would have a high correlation and should be top ranked by the measures that are not designed for surprisingness. Second, we wanted to see how many rules of the intersection of true and predicted labels were actually shared for a fixed number of rules for all measures, and how many evident rules were among them.
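The statistics reported in the last columns of Table 5.15 follow directly from the top-ranked rule lists. A minimal sketch, assuming the 200 evident rules are given as a set (so that dividing the hit count by 2 yields a percentage, as in the table); the function name `top_k_stats` is hypothetical.

```python
def top_k_stats(ranked_true, ranked_pred, evident, k=200):
    """Compute the Table-5.15-style top-k statistics.
    Pt/Pp: number of evident rules in the top-k list for true/predicted
           labels, divided by 2 (a percentage of 200 evident rules);
    fint:  size of the intersection of the two top-k lists;
    eint:  number of evident rules inside that intersection."""
    top_t, top_p = set(ranked_true[:k]), set(ranked_pred[:k])
    inter = top_t & top_p
    return {
        "Pt": sum(r in evident for r in top_t) / 2,
        "Pp": sum(r in evident for r in top_p) / 2,
        "fint": len(inter),
        "eint": sum(r in evident for r in inter),
    }
```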

Obviously, Sup had the best results, confirming the good classification results: the micro F-1 was 68.59 for DC and 51.50 for EUROVOC. PS also had very good results, even better than Sup for the number of evident rules in true and predicted labels. Cnf did not show good results, and even its intersection contained only six evident rules.

Due to its iterative application, TPS constitutes a good automatic solution for rule pruning and for discovering zones of high discrimination. The rules found by Sup and SupDif with TPS had a high F0 value, indicating triviality of the rule set. This was expected, since these measures point to rules with high support.

In the next part, we will again focus on the hierarchical IMs and on rare association rules in order to find interesting rules. For this purpose, we will use a setup with a very large ontology.