

5.3.2 Movies Dataset: automatic Threshold

This experiment focuses on the impact of using different IMs on discovering the hand-coded associations of the Movies dataset. We first investigated how the measures rank the true rules among an increasing number of extracted rules (Figure 5.5). The whole rule set consisted of 4895 rules. On the Y-axis, the graph shows the number of true rules among the top X rules as ranked by the measures Jac and Cnf as well as by their respective Dif counterparts and Int. Obviously, the steeper the increase of a curve, the better the IM. One can see that Cnf was the worst among the presented measures. Jac, Cos, and ACnf were among the best measures in this task, the latter two having curves very similar to Jac (not shown in the figure). Note that all three measures are considered to be well suited for rare ARs. Indeed, the manually created associations of this dataset were rather rare: in only eight of them was the support higher than 5%. This was the reason why they were difficult to find.

Figure 5.5: Number of true rules found in the top X rules extracted by Cnf and Jac, their respective Difs, and Int. (X-axis: number of rules extracted; Y-axis: number of true rules found.)

Thus, all hand-coded relations could be found by Jac only after about 3000 rules had been extracted in total. The other curves grew even more slowly.
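
For illustration, such a curve can be computed by sorting the extracted rules by an IM and counting, for each prefix length X, how many hand-coded rules appear among the top X. The names in the following sketch are hypothetical; it is not the implementation used in this work.

```python
def true_rules_curve(rules, score, true_rules):
    """rules: iterable of rule ids; score: dict mapping rule id -> IM value;
    true_rules: set of hand-coded rule ids.
    Returns curve, where curve[x-1] is the number of true rules among the
    top x rules according to the measure."""
    ranked = sorted(rules, key=lambda r: score[r], reverse=True)
    curve, found = [], 0
    for rule in ranked:
        if rule in true_rules:
            found += 1
        curve.append(found)
    return curve
```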

The comparison of the curves of Cnf and CnfDif is especially interesting, as is the one between Jac and JacDif. Both Dif measures significantly outperformed their conventional counterparts at first, but lost their advantage towards the end. This can be explained by the hierarchically redundant nature of a large part of the true rules. Both Dif curves increased steeply towards the end because redundant true rules, i.e. rules that are expected from the hierarchy, were discovered by these measures very late. This is due to the fact that such rules typically have a small difference between their actual values and the expected ones and are therefore ranked very low by the Dif measures. Int behaved similarly: it had a good start, but it could not find general rules and therefore performed worse than Cnf towards the end.
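
The Dif idea can be sketched as follows; the exact definition used in this work is given in the earlier chapters, so the snippet below is only a hypothetical illustration of scoring a rule by its deviation from a hierarchy-based expectation.

```python
def dif_score(measure, rule, ancestor_rule):
    """Hypothetical Dif-style score: the IM value of a rule compared with the
    value expected from its hierarchical ancestor (here naively approximated
    by the ancestor's own IM value). Hierarchically redundant rules deviate
    little from this expectation and therefore receive a low score."""
    actual = measure(rule)
    expected = measure(ancestor_rule)
    return abs(actual - expected)
```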

We are also interested in the impact of using the ATS method with different IMs on discovering the hand-coded associations of the Movies dataset. Table 5.13 shows, for each measure, the number of extracted rules and the number of true rules among them after PAR and after applying the ATS method, as well as the corresponding performance values. In addition, we iteratively removed one rule at a time from the PAR rule set of each measure and recorded the corresponding F-1 value until no rule was left; the best of these F-1 values and the corresponding counts are reported in the column “best possible” (rule set). The aim was to compare the ATS method with this special case of constructing the optimal rule set.
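
The following sketch illustrates this construction of the best possible rule set. The names and the bottom-up removal order (following the measure's ranking) are assumptions for illustration, not the implementation used in this work.

```python
def best_possible(ranked_rules, true_rules, n_true_total):
    """ranked_rules: rules left after PAR, sorted from highest to lowest IM value;
    true_rules: set of hand-coded rules; n_true_total: size of the gold set.
    Removes rules from the bottom one at a time, records F-1 after each removal,
    and returns the best F-1 together with the corresponding rule set size."""
    tp_at, tp = [], 0
    for rule in ranked_rules:
        tp += rule in true_rules          # true positives among the top i rules
        tp_at.append(tp)
    best_f1, best_size = 0.0, len(ranked_rules)
    for size in range(len(ranked_rules), 0, -1):
        tp = tp_at[size - 1]
        if tp == 0:
            continue
        precision = tp / size
        recall = tp / n_true_total
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_size = f1, size
    return best_f1, best_size
```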

The ATS method greatly improved the F-1 values of every measure compared to the PAR method. TPS produced results not far from the best possible F-1 value, with some exceptions. The recall was relatively high for all measures, despite the fact that some of the true rules had already been pruned by PAR and some had very low support, i.e. could be seen as rare rules.

Table 5.13: Movies: the number of rules found (f), the number of true rules among them (t) and the compression rate (comp) after each pruning step. Performances are in %. The three best values are shown in bold.

Measure     PAR                  TPS                              Best possible
            f      t    F-1      f     comp     t    F-1          f     comp      t    F-1
Sup          862   16   0.04      57   15.12     6   11.43         34    24.63     6   14.63
Cnf         2954   40   0.03     356    8.27    32   15.84        246    11.96    29   19.73
Jac         1980   36   0.04     135   14.67    32   34.97         45    43.04    22   47.31
Cos         1292   35   0.05     129   10.02    28   31.64         53    23.93    22   43.56
ACnf        2482   37   0.03     142   17.48    30   31.58         57    42.79    26   49.52
Kulc        1125   21   0.04     226    4.98    18   13.14         17    62.50    10   30.77
Lift        2954   40   0.03     183   16.14    27   23.38         35    82.06    17   40.96
Cnv         2954   40   0.03      80   36.92    18   28.13         89    32.82    22   32.12
Crf         2954   40   0.03     171   17.17    30   27.40         89    32.82    22   32.12
Seb         2954   40   0.03      96   30.77    11   15.28        246    11.96    29   19.73
PS          2684   26   0.02      89   30.16    21   30.66         89    29.82    21   30.66
BF          2954   40   0.03     121   24.41    31   36.69         43    67.14    21   46.15
Loe         2954   40   0.03     171   17.17    30   27.40         89    32.82    22   32.12
CCnf        2954   40   0.03     219   13.49    33   24.72         55    52.75    21   40.78
JacDif      2822   33   0.02      67   42.12    26   45.22         38    72.36    23   53.49
CnfDif      2985   34   0.02     105   28.43    22   28.76         83    35.54    20   30.53
SupDif      2402   22   0.02      36   66.72     8   19.05         12   184.77     7   23.33
mean        2486   34   0.03     140   23.18    24   26.17         78    51.44    20   34.56


The improvement of the F-1 values is explained by the fact that the gain in precision caused by pruning more rules outweighed the accompanying degradation of recall, which was in turn caused by the loss of true rules through this severe pruning. The high F-1 values also indicate that the true rules were, for the most part, ranked very high by most of the IMs. The mean F-1 value was 26.2 for TPS; for the best possible rule set, it was 34.4. JacDif was actually the IM with the best possible F-1 value, and the ATS method was able to reveal this.
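
To make this trade-off concrete, consider the Jac row of Table 5.13 and let N denote the total number of hand-coded rules (it cancels out of the comparison):

\[
P_{\mathrm{PAR}} = \frac{36}{1980} \approx 0.018, \qquad
P_{\mathrm{TPS}} = \frac{32}{135} \approx 0.237, \qquad
\frac{R_{\mathrm{TPS}}}{R_{\mathrm{PAR}}} = \frac{32/N}{36/N} \approx 0.89,
\]

i.e. precision grows by a factor of about 13 while recall drops by only about 11%, so F-1 = 2PR/(P + R) increases sharply.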

The reduction from the rule set obtained after PAR to the one obtained after the ATS method can best be described by the compression rate: the overall compression rate was 23.2 for TPS and 51.4 for the best possible rule set. SupDif had the best compression value both for TPS and for the best possible rule set. The worst measure in terms of compression was Kulc for TPS and, in contrast, Cnf for the best possible rule set. This means that SupDif could discriminate very well between the rules, whereas Cnf distributed the rule values evenly over the whole value range, i.e. there was no relatively high concentration of rules within a specific slice of the value range. The picture is, however, not that clear for Kulc and Cos, since, compared to the other measures, both had the most rules already filtered out by PAR. Therefore, most of the uninteresting rules had already been ruled out and the reference rule set was much smaller. After PAR, only the rules at the top of the hierarchy remained for Sup, which is why there were only very few of them. SupDif allows rules from deeper levels of the hierarchy and therefore had more true rules and better values than Sup.
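
As a rough check of how the compression rate reads, the TPS values in Table 5.13 are consistent with the ratio of the rule set size after PAR to the size after pruning (the best-possible column deviates slightly from this reading, so take it as an assumption), e.g.:

\[
\mathrm{comp}_{\mathrm{TPS}}(\mathit{Sup}) = \frac{862}{57} \approx 15.12, \qquad
\mathrm{comp}_{\mathrm{TPS}}(\mathit{SupDif}) = \frac{2402}{36} \approx 66.7 .
\]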

Figure 5.6: Cut-off points for Cnf, Jac and CCnf in the Movies dataset.


Figure 5.6 shows examples of the cut-off points obtained by TPS for Cnf, Jac, Cnv, and CCnf. It also shows the tangent on pl0 (tan) (see Section 4.1.2) and the true rules (True). We selected these IMs because the graphs of the other IMs had shapes similar to one of them, i.e. they represent them well. As can be seen in Figure 5.6a, the Cnf curve did not decrease as fast as the Jac curve, and therefore the method left a relatively large number of rules before the point at which pruning starts. The ATS method managed to find a reasonable threshold on CCnf, although its curve did not have a strictly concave form.

Most of the true rules were distributed from around pl0 up to the highest value along the curve, as shown in the graphs of Figure 5.6.
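
The actual cut-off points are found with the tangent-based procedure of Section 4.1.2, which is not reproduced here. As a rough, generic stand-in, a cut-off on a sorted measure-value curve can be located at the point of maximum distance below the chord connecting the curve's endpoints:

```python
import numpy as np

def knee_cutoff(sorted_values):
    """sorted_values: IM values sorted in descending order. Returns the index
    at which the curve lies farthest below the straight line (chord) joining
    its first and last point -- a common generic 'elbow' heuristic, not the
    TPS method itself."""
    y = np.asarray(sorted_values, dtype=float)
    x = np.arange(len(y))
    chord = y[0] + (y[-1] - y[0]) * x / (len(y) - 1)   # straight line between the endpoints
    return int(np.argmax(chord - y))                    # largest gap below the chord
```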

With the ATS method, most of the measures achieved values near their best possible F-1 value, but there were exceptions, as mentioned before. Kulc ended up with a very different number of rules after the ATS method than in the best possible rule set. There are several reasons for this: not many true rules were ranked high, the curve was very steep and concave, and the rule set extracted after PAR was small and contained only few true rules, as can be seen in Figure 5.6c. Remarkably, JacDif had the best result for TPS, although the best possible rule set shows that even more rules could have been dropped (29 more rules) to obtain a greater density of true rules.

Despite the fact that the measures show very different curves, as seen in the graphs of Figure 5.6, several measures obeyed the confidence condition in the PAR method and therefore had the same rule set after PAR as Cnf. It is thus interesting to analyze the differences between these measures, namely Cnf, Lift, Cnv, Crf, Seb, BF, Loe, and CCnf: not only did they start from the same rule set, but the results of some of them were also very similar, indicating the same ranking of rules, as with Crf and Loe. In fact, both measures had exactly the same rules extracted by the ATS method. Although the two measures are not obviously connected, for Cnf > pb they do rank the rules identically, which is especially the case for the high values extracted by TPS.
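
This connection can be checked briefly using the definitions of the certainty factor (Crf) and Loevinger's measure (Loe) that are common in the AR literature (the definitions used in this work are given in the earlier chapters), writing c = Cnf(A ⇒ B) and p_B = P(B):

\[
\mathit{Crf}(A \Rightarrow B) = \frac{c - p_B}{1 - p_B}, \qquad
\mathit{Loe}(A \Rightarrow B) = 1 - \frac{P(A, \neg B)}{P(A)\,P(\neg B)} = 1 - \frac{1 - c}{1 - p_B} = \frac{c - p_B}{1 - p_B},
\]

so, for Cnf > p_B, both are the same monotone function of the confidence and induce the same rule ranking.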

We can state that JacDif had the best F-1 value for the TPS method. The ATS method retained far fewer than 100 rules and achieved a high F-1 performance.

TPS delivered good results and has the important advantage of requiring no expert intervention in the form of parameters. Although it achieved good values, it is difficult to find the right cut-off without an objective target, as will be seen later in the EUR-Lex experiment. The next experiment is similar to this one but uses a much larger dataset.