

known animal species; they additionally use more information than merely visual cues, e.g. they group animals as insects, mammals, or fish.

2.5 Ranking for Multi-label Datasets with hierarchies

The first preliminary is that we need to consider, for each image, the set of all labels.

For the ATax score, a taxonomy-aware extension of the AP score, we consider, instead of the single binary label $y_k^{(c)}$ for the class in question $c$, the set of labels over all classes of the multi-label problem $\{y_k^{(r)} \in \{0,1\},\ r \in \{1,\dots,C\}\}$, where $y_k^{(r)}$ is the label for data sample $k$ and class $r$.

The second preliminary is a representation of the AP score as an average of top-rank-list precisions derived from distance functions over a set of samples.

Let us define, for a $[0,1]$-bounded distance function $l(y)$, the top-rank-list precision of the top-ranked $i$ samples, $\mathrm{Prec}[l](i)$, to be

$$\mathrm{Prec}[l](i) = \frac{1}{i} \sum_{k=1}^{i} \bigl(1 - l(y_k)\bigr) \qquad (2.22)$$
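As a quick illustration, Equation (2.22) amounts to a truncated mean over the top of the ranked list; a minimal Python sketch (function and variable names are ours, not from the text):

```python
def top_rank_list_precision(distances, i):
    """Prec[l](i) of Eq. (2.22): the mean of 1 - l(y_k) over the top-ranked
    i samples. `distances` holds the [0,1]-bounded distances l(y_k), already
    sorted by descending classifier output z_k."""
    return sum(1.0 - d for d in distances[:i]) / i

# Zero-one distances: 0 = correct label, 1 = wrong label
print(top_rank_list_precision([0, 1, 0, 0, 1], 3))  # 2/3: two of the top three are correct
```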

Then average precision can be seen as an average of top-rank-list precisions over a particular set $S$ of samples:

$$AP^{(c)} = \frac{1}{|S|} \sum_{i \in S} \mathrm{Prec}[l_{01}^{(c)}](i) \qquad (2.23)$$

where the set of samples $S$ is given according to Equation (2.21) as $S = \{i \mid I\{y_i^{(c)} = 1\}\}$ and

$$l_{01}^{(c)}(y_k) = I\{y_k^{(c)} \neq 1\} \qquad (2.24)$$

is the zero-one discretized distance of the class label $y^{(c)} \in \{-1,+1\}$ to the label value $1$.

This representation holds because of

$$\begin{aligned}
\frac{1}{|S|} \sum_{i \in S} \mathrm{Prec}[l_{01}^{(c)}](i)
&= \frac{1}{|S|} \sum_{i \in S} \frac{1}{i} \sum_{k=1}^{i} \bigl(1 - l_{01}^{(c)}(y_k)\bigr) \\
&= \frac{1}{|S|} \sum_{i \in S} \frac{1}{i} \sum_{k=1}^{i} \bigl(1 - I\{y_k^{(c)} \neq 1\}\bigr) \\
&= \frac{1}{|S|} \sum_{i \in S} \frac{1}{i} \sum_{k=1}^{i} I\{y_k^{(c)} = 1\} \\
&= \frac{1}{n_+^{(c)}} \sum_{i \in \{m \mid I\{y_m^{(c)} = 1\}\}} \frac{1}{i} \sum_{k=1}^{i} I\{y_k^{(c)} = 1\} \\
&= \frac{1}{n_+^{(c)}} \sum_{i=1}^{n} I\{y_i^{(c)} = 1\} \, \frac{1}{i} \sum_{k=1}^{i} I\{y_k^{(c)} = 1\} \\
&= AP^{(c)}\bigl((z_k^{(c)}, y_k^{(c)})_{k=1}^{n}\bigr), \quad \text{see Equation (2.21).}
\end{aligned} \qquad (2.25)$$
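The chain of equalities in (2.25) is easy to verify numerically: averaging top-rank-list precisions over the positive positions reproduces the usual AP. A minimal sketch with hypothetical names, labels assumed sorted by descending SVM output:

```python
def average_precision(labels):
    """AP as in Eq. (2.21): `labels` are 0/1 indicators I{y_k^(c) = 1},
    sorted by descending score z_k^(c)."""
    n_pos = sum(labels)
    ap, hits = 0.0, 0
    for i, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            ap += hits / i
    return ap / n_pos

def ap_via_top_rank_precisions(labels):
    """AP as in Eq. (2.23): average of top-rank-list precisions over the set S
    of positive positions, using the zero-one distance of Eq. (2.24)."""
    S = [i for i, y in enumerate(labels, start=1) if y == 1]
    prec = lambda i: sum(1 - (0 if labels[k] == 1 else 1) for k in range(i)) / i
    return sum(prec(i) for i in S) / len(S)

labels = [1, 0, 1, 1, 0, 1]
assert abs(average_precision(labels) - ap_via_top_rank_precisions(labels)) < 1e-12
```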

We compute a ranking score for a fixed class in question $c$ of the multi-label problem. To this end, note that in the original AP score we can replace the hierarchy-unaware precision term $l_{01}^{(c)}$ by a term depending on the a priori fixed class $c$. The ATax score will be defined by the replacement term given in Equation (2.26), based on the minimal taxonomy distance $\delta_T$ between the fixed class $c$ and all positive labels in the ground truth $\{y_k^{(r)},\ r \in \{1,\dots,C\} \mid y_k^{(r)} = 1\}$ of a fixed sample $k$:

$$l_T^{(c)}(\{y_k^{(r)},\ r = 1,\dots,C\}) = \min_{r \in \{1,\dots,C\} \mid y_k^{(r)} = 1} \delta_T(c, r) \qquad (2.26)$$

Again, assume that the data samples $(x_k, \{y_k^{(r)},\ r = 1,\dots,C\})$, and thus their labels $y_k^{(r)}$ for all classes $r$, are sorted in descending order of the SVM outputs $z_k^{(c)}$ for the fixed class $c$. The set of samples $S$ is again given as $S = \{i \mid I\{y_i^{(c)} = 1\}\}$.

Then we define the ATax score for class $c$ to be:

$$\begin{aligned}
ATax^{(c)} &= \frac{1}{|S|} \sum_{i \in S} \mathrm{Prec}[l_T^{(c)}](i) &(2.27) \\
&= \frac{1}{n_+^{(c)}} \sum_{i=1}^{n} I\{y_i^{(c)} = 1\} \, \frac{1}{i} \sum_{k=1}^{i} \Bigl(1 - \min_{r \in \{1,\dots,C\} \mid y_k^{(r)} = 1} \delta_T(c, r)\Bigr) &(2.28)
\end{aligned}$$
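For illustration, Equations (2.26)–(2.28) can be sketched in a few lines. The taxonomy distance matrix `delta_T`, the labels, and all names are hypothetical toy values, not taken from the datasets; samples are assumed sorted by descending SVM output for the fixed class `c`:

```python
def atax_score(c, multilabels, delta_T):
    """ATax of Eq. (2.28). multilabels[k][r] is the 0/1 label of sample k for
    class r, sorted by descending SVM output z_k^(c) for the fixed class c;
    delta_T[c][r] is the [0,1]-scaled taxonomy distance between classes c and r.
    Every sample is assumed to carry at least one positive label."""
    # l_T^(c) of Eq. (2.26): minimal taxonomy distance from c to a positive label
    l_T = lambda y: min(delta_T[c][r] for r in range(len(y)) if y[r] == 1)
    n_pos = sum(1 for y in multilabels if y[c] == 1)
    score = 0.0
    for i in range(1, len(multilabels) + 1):  # outer sum of Eq. (2.28)
        if multilabels[i - 1][c] == 1:        # I{y_i^(c) = 1}
            score += sum(1.0 - l_T(multilabels[k]) for k in range(i)) / i
    return score / n_pos

# Toy 3-class taxonomy: class 1 is close to class 0, class 2 is distant
delta_T = [[0.0, 0.25, 1.0],
           [0.25, 0.0, 1.0],
           [1.0, 1.0, 0.0]]
labels = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
print(atax_score(0, labels, delta_T))  # ~0.958
```

On this toy ranking the plain AP for class 0 is $(1 + 2/3)/2 = 5/6 \approx 0.833$; the ATax score of about $0.958$ is larger because the second-ranked sample, although negative for class 0, carries the nearby class 1. This matches the observation that ATax is never smaller than AP.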

The above derivation shows that the ATax score can be seen as a taxonomy-aware extension of the established AP score. Since the taxonomy distance $\delta_T$ from Equation (2.2) is scaled to lie in $[0,1]$, and a correct prediction implies scores of $I\{y_k^{(c)} = 1\} = 1$ and $1 - l_T^{(c)}(y_k) = 1$, respectively, the ATax score is never smaller than the AP score. The precision function used in the AP score can be interpreted as a zero-one discretization of the taxonomy score $1 - l_T^{(c)}(y_k)$. Both scores, AP and ATax, have the advantage of being invariant to the classification threshold, and they evaluate the ranking of images. We did not use the ranking-based scores for the multi-class problem, however. Inspecting the constraints of the structured prediction formulation from (2.8) shows that it aims at classifying each image correctly in the sense of obtaining a correct ranking of classes for each image. Its optimization does not aim at obtaining a correct ranking of images for each class. Thus, using a ranking score would be a measure biased against the structured approaches.


2.5.2 Datasets

VOC2006 multi-label data

We use the VOC2006 dataset (84), consisting of 10 object classes and 5301 images, with its original, unmodified labels. The full taxonomy is given in Figure 2.3.

VOC2009 multi-label classification task data

This dataset consists of 20 classes with 7054 labeled images. It serves as a second multi-label setting for the local algorithms. The full taxonomy is given in Figure 2.13.

2.5.3 Experimental Results

Note that for multi-label data the structured algorithms cannot be applied in their current form, as the multi-class constraints are no longer well-defined. Therefore we compare one-versus-all classification against local hierarchical approaches. As this frees us from the time and memory consumption problems related to the structured algorithms, we use crossvalidation with 20 folds. We use the same features and kernels as described in Sections 2.4.2 and 2.4.3 and measure with AP and ATax scores.

Table 2.16: Ranking scores on VOC06 as multi-label problem, 20-fold crossvalidation.

Higher scores are better.

Method                              ATax            AP
one versus all                      90.10 ± 3.46    80.13 ± 7.21
local tax. scaled, geometric mean   91.29 ± 3.34    79.96 ± 7.23
local tax. scaled, harmonic mean    90.85 ± 3.28    80.61 ± 7.06

Table 2.17: Ranking scores on VOC09 as multi-label problem, 20-fold crossvalidation.

Higher scores are better.

Method                              ATax            AP
one versus all                      79.02 ± 8.72    55.92 ± 15.91
local tax. scaled, geometric mean   80.68 ± 8.20    54.62 ± 16.08
local tax. scaled, harmonic mean    80.03 ± 8.33    56.43 ± 15.77

Tables 2.16 and 2.17 show that even in a multi-label setting, introducing a taxonomy can improve taxonomy-based as well as flat ranking scores, even though we no longer have a notion of avoiding confusions.

This may become relevant when using classifier scores to rank images for retrieval. A higher ATax score implies that the desired class and similar classes are ranked higher than more distant classes, which in effect leads to a subjectively improved ranking result from a human viewpoint. When looking for cats, humans tend to be more impressed by results that erroneously return other pets than by results that return cars. Highly ranked images from very distant categories tend to be perceived as strong outliers.

Figure 2.11 shows examples where the hierarchical classifier is able to improve rankings simultaneously for classes which are far apart in the taxonomy given in Figure 2.3. This shows that taxonomy learning for multi-label problems does not necessarily lead to mutual exclusion of taxonomy branches. In both images, the classes under consideration are separated already at the top level. We observe that images can be re-ranked to top positions despite average rankings at all edges. For the upper image this occurs for the cow class, for the lower image for the motorbike class, as can be seen from the rankings given along the paths.

This can be explained by the property of the nonpositive p-means of being dominated by the smallest score (see Section 2.2.5). Many images which achieved higher scores and ranks at some edges along the considered path were effectively ranked lower because they received very low scores at at least one edge in the same path. Note that the observed improvement in ranking is independent of the ranking loss.
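This effect can be reproduced with a toy comparison using made-up edge scores (a sketch; the precise bound is the one stated in Section 2.2.5): under nonpositive p-means, a path with uniformly average scores can outrank a path that scores highly on all edges but one.

```python
from math import prod

def p_mean(scores, p):
    """Generalized p-mean of positive scores; p = 0 is taken as the
    geometric mean, the usual limit convention."""
    n = len(scores)
    if p == 0:
        return prod(scores) ** (1.0 / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)

average_path = [0.5, 0.5, 0.5]   # mediocre scores on every edge
one_bad_edge = [0.9, 0.9, 0.01]  # high scores, but one near-zero edge

for p in (0, -1):  # geometric and harmonic means
    assert p_mean(one_bad_edge, p) < p_mean(average_path, p)
```

For $p = -1$, the harmonic mean of the bad-edge path is about $0.029$, far below $0.5$ for the average path, so the single low edge dominates the combined score.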

Table 2.18 compares the performance of scaled versus unscaled combinations of scores for both multi-label problems. We clearly see that scaling the scores onto a compact interval contributes to the good performance of the local models. The good performance of scaled scores is not surprising, as one can expect the SVM outputs to have different distribution statistics, such as variances, across the edges. Please note that for one-versus-all classification the scaling has no influence on the ranking scores, as it is monotonic and rank-preserving and the score computation is done for each class separately.