

3.3.2 Label Constraints with Association Rules

One approach that meets our requirements involves the use of AR mining to capture the correlations between labels and to subsequently post-process the predictions of the classifiers so that they comply with the rules over the labels.

In [PF08] and [CHDH13] (the latter is based on the former), ARs were used in a post-processing step. The authors sought to extract rules of the type itemset → item from the training set (i.e. constraints in the form of an implication from a multi-item set towards a single item). These constraints can be positive or negative, setting or removing the given label.
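Such itemset → item constraints can be applied to a predicted label set in a straightforward way. The following is a minimal sketch of this post-processing idea; the function name, the rule representation as triples, and the order-dependent application are our own assumptions, not the exact procedure of [PF08] or [CHDH13].

```python
# Hypothetical sketch: post-processing a multi-label prediction with
# itemset -> item constraint rules. Rule format and names are assumptions.

def apply_constraints(prediction, rules):
    """Set or remove a consequent label whenever a rule's antecedent
    itemset is fully contained in the current label set.

    prediction: set of predicted labels
    rules: list of (antecedent_set, consequent_label, positive_flag)
    """
    result = set(prediction)
    for antecedent, consequent, positive in rules:
        if antecedent <= result:            # all antecedent labels present
            if positive:
                result.add(consequent)      # positive constraint: set label
            else:
                result.discard(consequent)  # negative constraint: remove label
    return result

# Example: {A, B} -> C (positive), {A} -> not D (negative)
rules = [({"A", "B"}, "C", True), ({"A"}, "D", False)]
print(apply_constraints({"A", "B", "D"}, rules))
```

Note that the result can depend on the order in which the rules are applied; the original approaches avoid this issue by operating on rankings rather than on the binary label set directly.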

In the former approach, pairwise classifiers are used in a CLR setup instead of BRs (SVMs serve as base classifiers in both approaches). The constraint rules are used to define a distance metric that assesses the difference between the predicted ranking and a ranking that would satisfy all considered constraints. To minimize this distance, the authors apply two strategies to change the ranking order: preference swapping and neighborhood swapping. The first strategy swaps the CLR preferences in order to change the ranking, whereas the second changes the ranking directly. The first method does not change the labels directly in order to satisfy the constraints and output a valid prediction, but it must change several pairwise classification outputs. Because the method uses label calibration (i.e. a pseudo-label defines the threshold above which labels are set), this calibration classifier must also be “bypassed”, raising doubts as to its suitability for threshold setting in this task. The second method directly alters the ranking, a simpler and probably more effective strategy. The constraints are also extracted using the standard AR mining framework, with the restriction that the rules must be many-to-one. The results reported in [PF08] were mixed: the methods achieved better results than the baseline on synthetically generated data, but not on real-world datasets. The authors’ explanation is that the extracted rules were of no use; moreover, the synthetic dataset was built to suit the algorithms, and pairwise preferences might not be compatible with the ranking improvement. In our opinion, the real-world datasets were too small for this approach to be successful, as they lacked the kinds of correlations present in the synthetic datasets. Furthermore, the extracted ARs might be of poor quality, but this might be due to the AR framework itself; changing this framework might improve the results, as will be shown in the next chapter.
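The idea of directly altering the ranking can be illustrated with a simple repair step: if a constraint l_{i1}, …, l_{ik} → l_j is violated because the consequent is ranked far below its antecedents, the consequent is moved up. This is only our own paraphrase of the general idea, not the exact swapping procedure of [PF08].

```python
# Illustrative sketch of direct ranking alteration. The repair rule below
# (pull the consequent up behind its worst-ranked antecedent) is an
# assumption for illustration, not the published algorithm.

def repair_ranking(ranking, antecedents, consequent):
    """Move `consequent` so it ranks directly below the worst-ranked
    antecedent, making the ranking consistent with antecedents -> consequent."""
    pos = {label: i for i, label in enumerate(ranking)}
    worst_antecedent = max(pos[a] for a in antecedents)
    if pos[consequent] > worst_antecedent:
        ranking = [l for l in ranking if l != consequent]
        ranking.insert(worst_antecedent + 1, consequent)
    return ranking

# A and B are ranked high, so C is pulled up right behind them:
print(repair_ranking(["A", "B", "X", "Y", "C"], {"A", "B"}, "C"))
```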

In the latter approach [CHDH13], the objective is to create simple implication rules to facilitate a binary ensemble strategy. First, binary SVMs are trained, and then the rules are extracted from multi-label subsets of the training data; these subsets are discovered by clustering the multi-label training set (using Affinity Propagation (AP) clustering [FD07]). In the testing step, the outputs of the binary SVMs are combined using the ensemble strategy. A final threshold strategy converts the consensus probability output of the ensemble strategy into a multi-label prediction. The key aspect of this approach is the division into subsets of labels through the AP clustering method, which does not require the number of clusters to be defined in advance. The partitioning of the training set avoids the explosion of ARs that could potentially be discovered between the labels. After separating the training set into subsets with clustered labels, the AR learning method is applied to mine label constraint rules. Afterward, the scores produced by the SVM ranking method (note that the output is a probability, not a binary value) are employed in an ensemble strategy to combine the local predictions of the base classifiers with the constraint decisions arising from the other classifiers, as in Eq. 3.2.

p_j(x) = w \cdot \bar{p}_j(x) + \frac{1 - w}{|\phi_j(x)|} \sum_{i \in \phi_j(x)} \bar{p}_i(x)    (3.2)

The consensus probability p_j(x) of class j depends on the local prediction \bar{p}_j(x) of classifier j, but also on the local predictions \bar{p}_i(x) of the constraint labels \phi_j(x) = \{l_{i1}, l_{i2}, \ldots, l_{ik}\} extracted from rules of the form l_{i1}, l_{i2}, \ldots, l_{ik} \to l_j, where l_j is the label of class j. Thus, the rankings of classifiers belonging to the consequents of the rules indirectly depend on the rankings of the classifiers from the antecedents.
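Eq. 3.2 can be stated compactly in code. The sketch below follows the equation term by term; the concrete data layout (a dict of local probabilities) is an assumption for illustration.

```python
# Sketch of the consensus probability of Eq. 3.2. Variable names follow
# the equation; the data layout is an assumption.

def consensus_probability(p_local, j, phi_j, w):
    """p_j(x) = w * p_bar_j(x) + (1-w)/|phi_j(x)| * sum_{i in phi_j(x)} p_bar_i(x)

    p_local: dict mapping label -> local SVM probability p_bar_i(x)
    phi_j:   labels appearing in antecedents of rules with consequent j
    w:       weight balancing local prediction against constraint information
    """
    if not phi_j:                     # no constraint rules involve label j
        return p_local[j]
    constraint_part = sum(p_local[i] for i in phi_j) / len(phi_j)
    return w * p_local[j] + (1 - w) * constraint_part

p_local = {"A": 0.9, "B": 0.7, "C": 0.2}
# Rule A, B -> C lets C's score profit from the confident A and B:
print(consensus_probability(p_local, "C", {"A", "B"}, w=0.5))
```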

A previous version of this approach is presented in [GCH10]; it uses a simplistic method for extracting the ARs but invests considerable effort in determining the value of the weight w. The value of w and the rules are selected by applying cross-validation on the training data, with the criterion of minimizing the ranking loss. The approach using AP can cause a heavy workload and is not recommendable for large datasets. For the sake of comparison, we implemented the approach as proposed in [GCH10] for the experimental part.
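The selection criterion can be sketched as a grid search over candidate weights that minimizes the mean ranking loss on held-out data. This is a deliberately minimal stand-in for the cross-validation procedure of [GCH10]; the `combine` callback and the data layout are our own assumptions.

```python
# Hedged sketch: choosing w by grid search with ranking loss as criterion.
# The actual procedure in [GCH10] uses full cross-validation.

def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs ordered wrongly,
    i.e. where the irrelevant label scores at least as high."""
    irrelevant = [l for l in scores if l not in relevant]
    pairs = [(r, i) for r in relevant for i in irrelevant]
    if not pairs:
        return 0.0
    wrong = sum(1 for r, i in pairs if scores[r] <= scores[i])
    return wrong / len(pairs)

def select_weight(validation, candidates, combine):
    """Pick the w from `candidates` minimizing mean ranking loss;
    combine(sample, w) returns the consensus scores for one sample."""
    def mean_loss(w):
        return sum(ranking_loss(combine(s, w), s["relevant"])
                   for s in validation) / len(validation)
    return min(candidates, key=mean_loss)

# Toy validation set where the constraint information is more reliable
# than the local prediction, so a small w should win:
validation = [{"local": {"A": 0.3, "B": 0.8},
               "constraint": {"A": 0.9, "B": 0.1},
               "relevant": {"A"}}]

def combine(sample, w):
    return {l: w * sample["local"][l] + (1 - w) * sample["constraint"][l]
            for l in sample["local"]}

print(select_weight(validation, [0.0, 0.5, 1.0], combine))
```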

In [CRdJH14] a method for the reduction of the label dimensionality problem is presented. First, an AR mining algorithm (FP-growth [HPY00]) extracts candidate rules (given a minimum Support and Conviction threshold). Subsequently, the labels that only appear in consequents of these rules are removed, and the classifier is applied to the reduced labelset. After classification, the rules are applied to recover the missing labels.

Table 3.7: Comparison of MLC Improvement Approaches with Constraints

Approach              Online  IM  Use ranks  Multi-taxonomy
[BBS11a]              Y       N   N          N
LC  [PF08]            N       N   N          N
LCS [CHDH13]          N       N   Y          N
LCS [GCH10]           N       N   Y          N
LI-MLC [CRdJH12]      N       N   N          N
LI-MLC [CRdJH14]      N       Y   N          N
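The reduce-then-recover scheme of [CRdJH14] can be sketched as follows. For brevity, the AR mining step (FP-growth with Support/Conviction thresholds) is replaced by a fixed rule list, and all names are our own assumptions.

```python
# Minimal sketch of LI-MLC-style label-space reduction and recovery.
# Rules are (antecedent_set, consequent_label); mining is omitted.

def reduce_labelset(labels, rules):
    """Remove labels that appear only in consequents of the rules."""
    consequents = {c for _, c in rules}
    antecedent_labels = set().union(*(a for a, _ in rules))
    removable = consequents - antecedent_labels
    return [l for l in labels if l not in removable], removable

def recover_labels(prediction, rules):
    """After classifying on the reduced labelset, re-add consequents
    whose antecedents were predicted (rules may chain)."""
    result = set(prediction)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= result and consequent not in result:
                result.add(consequent)
                changed = True
    return result

rules = [({"A"}, "C"), ({"B", "C"}, "D")]
labels, removed = reduce_labelset(["A", "B", "C", "D"], rules)
print(labels, removed)   # C also occurs in an antecedent, so only D is removed
print(recover_labels({"A", "B"}, rules))
```

The sketch also illustrates the systematic error discussed below: a removed label can only ever be recovered through its rules, never predicted directly.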

One significant problem with this approach is the unification of labels. Although it may improve precision and recall in some noisy cases, it introduces a systematic error, as rare associations and labels will not be affected by the approach.

The approaches are compared with respect to our objectives in Table 3.7. Only one approach (in two different versions) pursues the strategy of changing the predicted label rankings. By changing the ranking, the threshold may be altered indirectly; thus, such changes will also influence labels that are not involved in the rules. The approach (in the version presented in [GCH10]) does consider that some labels (rules) might be more useful than others (prediction is easier for the antecedent but not for the consequent), but it uses a computation-intensive method for the selection of such rules.

As discussed earlier, we expect that some labels will be better suited to be antecedents, and this is especially true in the multi-taxonomy setup. Unfortunately, none of the approaches consider this aspect.

Only the approach of [BBS11a] involves an online learning method. LI-MLC is designed for large datasets, because the approach tries to diminish the labelset, but it was not tested on actual large datasets. It requires a preprocessing step, which is still instance-incremental learning compatible; however, depending on the data, it can decrease the prediction quality instead of increasing it. Only a variation of LI-MLC ([CRdJH14]) deviates from the standard AR framework and uses a different IM. Nonetheless, the drawbacks for our DMS remain.

3.4 Discussion

In this chapter, we presented the state-of-the-art approaches that come closest to our goals and requirements. In Section 3.1, we examined MLC methods in relation to the requirements of online learning and the ability to handle large datasets and multiple taxonomies while maintaining a simple model. In the discussion, we analyzed the various disadvantages of the approaches; in particular, the lack of a simple model for debugging and knowledge extraction makes the investigated approaches unsuitable for our objectives.

In Section 3.2, we discussed approaches that can extract interesting rules from large data.

The cross-ontology approaches present a special case, since they come close to our requirements. Although all approaches deviate from the standard AR framework, the majority uses a different IM and the hierarchy, only two use hierarchy expectation, and only three focus on cross-ontology rules. None of them presented new developments for using rare ARs with a hierarchy in order to obtain specific interesting rules.

In Section 3.3, methods seeking to improve MLC predictions were reviewed. Only one approach used the scores that the classifiers predict for the labels (i.e. the ranking). In addition, only one used online learning classifiers, and only one did not use Confidence. None of the strategies investigated a multi-taxonomy setup, and only one regarded some labels as more suitable for antecedents than others, but through a very cost-intensive method that is not compatible with online learning.

None of the reviewed approaches fulfill all of our objectives for each subtopic. More importantly, their focus usually lies where we do not expect any improvement towards our goals: they do not investigate how to create understandable models for large datasets, search for rare ARs, or seek to improve predictions to cope with large datasets.

We now present our contributions and explain their advantages over these approaches.

The challenge we target is to efficiently and precisely classify large amounts of samples in a cross-ontology setup. Being able to understand the classification process is also important. The reviewed approaches cannot fulfill all these requirements.

Our approach is based on a simple premise: labels with small Support (later referred to as small-support labels) are often considered noise or too unimportant to be linked to other labels, but they compose a large part of the classification problem. When more labels and samples are added to the classification problem, multi-label approaches tend to become inaccurate and very expensive. A combinatorial explosion takes place in such setups, and only a small, privileged number of labels have enough data to be learned well.

The approach searches for cross-ontology rules to improve these labels. This is the main idea of this thesis. To cope with the main challenge, it is supported by two other contributions that further improve the DMS.

The main contribution of this study, Multi-label Improvement with Rare Association Rules (MIRAR), will be presented later; first, the knowledge extraction process of the multi-label classification will be discussed. In this regard, two methods that are important for MIRAR to obtain the right amount of IMRARs will be introduced.1 The innovation of MIRAR involves the use of very large data and a focus on small-support labels with suitable IMs. The first point is complementary to the second, since the relations between labels with small support or labels positioned deep in large hierarchies may be unknown.

Afterward, we present ML-HARAM, a method that can efficiently handle large data and enables easy rule extraction and retraceability of the rule learning procedure.

Lastly, the Rule Explorer is presented: a graphical user interface for the DMS that facilitates the examination of IM rules together with the rules and predictions of the classification process.

4.1 Multi-label Classification Improvement by RAR

Multi-label classification and association analysis have been combined in studies to classify and improve predictions. However, many aspects related to association analysis have been neglected. The previously introduced methods in our approach can be used in combination to improve the labels. In this section, we introduce the strategies we devised and analyzed in key cases of improvement.

1 We count these methods as part of the MIRAR contribution.