

2.1.6 Performance Measures

The quality of MLC results can be assessed by many different performance measures, since the predictions cannot simply be classified as right or wrong. How to weight the prediction quality is therefore a decisive question when discussing results under a given performance measure. First, we describe common approaches that simply count true and false predicted labels. Other ways to measure the prediction quality of MLC approaches are presented afterwards.

Instance-based

Two basic concepts can be defined instead of a simple right-or-wrong prediction: the precision P, which measures how many of the predicted labels are actually present, and the recall R, which measures how many of the present labels have been predicted [TSK06]. The instance-based F-measure can then be defined as the harmonic mean of precision and recall for each sample i [TKV10]:

\[
P_i := \frac{|y_i \cap \breve{y}_i|}{|\breve{y}_i|}, \qquad
R_i := \frac{|y_i \cap \breve{y}_i|}{|y_i|}, \qquad
F_i := \frac{2 \cdot P_i \cdot R_i}{P_i + R_i},
\]

where $\breve{y}_i$ is the predicted multi-label of $x_i$ and $y_i$ is its true multi-label. The accuracy A of the predictions refers to the proportion of correctly predicted labels:

\[
A := \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i \cap \breve{y}_i|}{|y_i \cup \breve{y}_i|}.
\]
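As an illustration only (not taken from the thesis experiments), the instance-based measures above can be computed as in the following Python sketch, assuming the true and predicted multi-labels are given as sets per instance; all names are ours:

def instance_based_scores(y_true, y_pred):
    """Instance-based precision, recall, F-measure and accuracy.

    y_true, y_pred: lists of label sets, one set per instance.
    Returns the measures averaged over all instances.
    """
    n = len(y_true)
    p_sum = r_sum = f_sum = a_sum = 0.0
    for yi, yp in zip(y_true, y_pred):
        inter = len(yi & yp)
        union = len(yi | yp)
        p_i = inter / len(yp) if yp else 0.0      # precision of instance i
        r_i = inter / len(yi) if yi else 0.0      # recall of instance i
        f_i = 2 * p_i * r_i / (p_i + r_i) if (p_i + r_i) > 0 else 0.0
        a_i = inter / union if union > 0 else 1.0  # both sets empty: perfect match
        p_sum += p_i
        r_sum += r_i
        f_sum += f_i
        a_sum += a_i
    return p_sum / n, r_sum / n, f_sum / n, a_sum / n

For example, calling instance_based_scores([{"A", "B"}, {"C"}], [{"A"}, {"C", "D"}]) yields P = 0.75, R = 0.75, F ≈ 0.67 and A = 0.5.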

Label-based

However, averaging over the instances is only one way to average and weight the results. Two other averaging approaches are micro-averaging and macro-averaging. The former builds a single global contingency table and thus calculates only one MLC measure for the whole dataset. In the latter, on the other hand, the overall MLC measure for the whole dataset is given by the mean measure value over the labels. As a consequence, micro-averaging gives equal weight to each multi-label, whereas macro-averaging assigns equal weight to each label [Yan99]. A single label with high support can have more influence on the outcome of micro-averaged measures than many labels with small support; with macro-averaging, the contrary is generally true. A contingency table can be constructed based on the number of true positives TP, false positives FP, true negatives TN and false negatives FN. These will be calculated here as follows:¹⁴

\[
\mathrm{TP}_\lambda := |\{\, i : \lambda \in y_i \wedge \lambda \in \breve{y}_i \,\}|, \qquad
\mathrm{FP}_\lambda := |\{\, i : \lambda \notin y_i \wedge \lambda \in \breve{y}_i \,\}|,
\]
\[
\mathrm{FN}_\lambda := |\{\, i : \lambda \in y_i \wedge \lambda \notin \breve{y}_i \,\}|, \qquad
\mathrm{TN}_\lambda := |\{\, i : \lambda \notin y_i \wedge \lambda \notin \breve{y}_i \,\}|.
\]

The micro-averaged F-1 measure can then be defined as:

\[
mF := \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad (2.10)
\]

where TP, FP and FN are obtained by summing the corresponding per-label counts over all labels.
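As a hedged illustration of equation (2.10) (the function and variable names are ours, not from the thesis), the pooled counts and the micro-averaged F-1 could be computed as follows:

def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F-1: pool TP/FP/FN over all labels, then compute F-1 once."""
    tp = fp = fn = 0
    for lam in labels:
        for yi, yp in zip(y_true, y_pred):
            in_true, in_pred = lam in yi, lam in yp
            if in_true and in_pred:
                tp += 1
            elif in_pred:          # predicted but not present
                fp += 1
            elif in_true:          # present but not predicted
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0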

Instead of the term macro F-1 measure, we will use Label F-1 (LF-1), or just LF, since we want to emphasize that this measure gives more weight to the labels.

Because we are interested in the performance of the classifier on rare labels, as described in the Introduction, assigning equal weights to the labels allows a more fine-grained evaluation of the prediction quality among the rare labels:

\[
LF := \frac{1}{Q} \sum_{\lambda=1}^{Q} \frac{2 \cdot \mathrm{TP}_\lambda}{2 \cdot \mathrm{TP}_\lambda + \mathrm{FP}_\lambda + \mathrm{FN}_\lambda}.
\]

An additional problem with the macro F-1 is that it can be understood either as the calculation of LF (as in [TKV10]) or as the harmonic mean of macro-averaged recall and precision, but the two do not generally yield the same result.¹⁵

¹⁴ Here, we sum over the labels so that it is easier to define macro-averaging, but for micro-averaging we can also sum over the instances.

¹⁵ Since, in general, $\frac{1}{Q} \sum_{\lambda=1}^{Q} \frac{2 \cdot P_\lambda \cdot R_\lambda}{P_\lambda + R_\lambda} \neq \frac{2 \cdot \bar{P} \cdot \bar{R}}{\bar{P} + \bar{R}}$, where $\bar{P} := \frac{1}{Q} \sum_{\lambda=1}^{Q} P_\lambda$ and $\bar{R} := \frac{1}{Q} \sum_{\lambda=1}^{Q} R_\lambda$.
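Under the same assumptions as above (illustrative names, label sets per instance), LF can be sketched by computing a per-label F-1 from the label-wise counts and averaging over the labels; note that this corresponds to the first of the two readings of the macro F-1 discussed above:

def label_f1(y_true, y_pred, labels):
    """Label F-1 (LF): per-label F-1, averaged over all labels (macro-averaging)."""
    f_sum = 0.0
    for lam in labels:
        tp = fp = fn = 0
        for yi, yp in zip(y_true, y_pred):
            in_true, in_pred = lam in yi, lam in yp
            if in_true and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_true:
                fn += 1
        denom = 2 * tp + fp + fn
        f_sum += 2 * tp / denom if denom > 0 else 0.0  # labels never present or predicted count as 0
    return f_sum / len(labels)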

All of the measures above (instance- and label-based) take values in [0, 1], with higher values indicating better predictions. The choice of averaging can therefore make a substantial difference, leaning the discussion of the results towards certain directions. However, there are approaches that do not rely directly on simple contingency tables and are thus more independent of these averaging choices (usually also at a much higher cost and with a more difficult analysis).

Ranking-based

There are many measures for estimating the prediction quality of classifiers based on label ranking [TKV10], e.g. one-error, ranking loss, coverage and average precision. In our experiments, the results of such measures will not be analyzed, since we are interested in the label predictions and not in their ranking. This would also entail an analysis of the rankings produced by each classifier, which would surpass the scope of this study.

Only the ranking loss, defined as the average fraction of label pairs that are ordered incorrectly, is important in the context of this dissertation, where it serves as a cost function.
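A minimal sketch of the ranking loss, assuming the classifier provides a real-valued score for every label of an instance (the data layout and names are illustrative assumptions, not the thesis implementation):

def ranking_loss(y_true, scores, labels):
    """Average fraction of (relevant, irrelevant) label pairs ordered incorrectly."""
    per_instance = []
    for yi, si in zip(y_true, scores):   # si: dict mapping each label to a real-valued score
        relevant = [l for l in labels if l in yi]
        irrelevant = [l for l in labels if l not in yi]
        if not relevant or not irrelevant:
            continue                      # no label pairs to compare for this instance
        wrong = sum(1 for r in relevant for ir in irrelevant if si[r] <= si[ir])
        per_instance.append(wrong / (len(relevant) * len(irrelevant)))
    return sum(per_instance) / len(per_instance) if per_instance else 0.0

Lower values are better here, in contrast to the measures above; a value of 0 means every relevant label is ranked above every irrelevant one.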

Hierarchy-based

A wide range of HMC performance measures was compared in [BBS11b]. We presented there a method for evaluating a set of performance measures for hierarchical multi-label classification with regard to redundancy and discrimination power. The most important finding of that study was that some measures can be combined to identify different characteristics of the predictions, but biases and similarities between the measures must be taken into account. This can be useful in very thorough, hierarchical evaluations, but such evaluations would also surpass the scope of this study.

By definition, HMC datasets are also MLC datasets, and hence traditional, “flat” MLC performance measures can also be used to evaluate HMC performance. They lack the semantics of the hierarchy, so these “flat” measures can misjudge the difficulty of predicting a certain label, but they are easier to understand.

Voting of Measures

Many studies have assessed the quality of the classifiers' predictions by counting how many wins the MLC algorithms achieve over the selected performance measures.

However, they do not verify the correlation between these measures. If two measures seek to evaluate the same aspect, they may merely produce different values, but the ordering of the predictions, i.e. whether one prediction is better than another, will still hold (the discriminancy power will be the same). This is principally an issue in the context of hierarchical multi-label classification. [BBS11b] presents a methodology for assessing the correlation between performance measures and for determining whether a voting system will be fair to a classifier. Because we have already chosen to use only a few flat measures, this methodology can be avoided here.