6.4 External Evaluation – Comparison of Systems

6.4.2 Comparison of Systems

The comparison of two CBIR-systems is a straightforward approach to present the benefits of the individual ones. One situation where two CBIR-systems may be compared is the presentation of an individual (new) image retrieval approach. The comparison with another, commonly known or at least earlier published system supports the presentation of the individual benefits. The comparison with an earlier version of the same CBIR-system may be the simplest way to expose the benefits of the new version.

Only a few comparisons of different, independently developed systems have been published:

GIFT/Viper vs. Histogram Intersection

At the University of Geneva, an evaluation framework is developed in addition to the CBIR-system GIFT/Viper [GIFT]. Its main aspect is the communication protocol MRML, which implements a client/server architecture. This approach should be established in an evaluation event called Benchathlon. In [Müller et al., 2003] the suitability of this evaluation framework is presented based on the comparison of Viper with a histogram intersection approach to retrieve images.
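As a rough illustration of the MRML client/server idea, the following Python fragment sends an XML-encoded request over a TCP socket and reads the reply. The message content, the host and the port are schematic placeholders, not the normative MRML schema or a real GIFT configuration.

```python
# Sketch of an MRML-style client/server exchange: the client sends an
# XML request and reads an XML response over a plain TCP connection.
# The XML below is a schematic placeholder, NOT valid MRML.
import socket

request = b"""<mrml>
  <query example-image="img42.jpg" result-size="20"/>
</mrml>"""

# Host and port are illustrative assumptions.
with socket.create_connection(("localhost", 12789)) as sock:
    sock.sendall(request)
    sock.shutdown(socket.SHUT_WR)   # signal end of request
    response = b"".join(iter(lambda: sock.recv(4096), b""))

print(response.decode("utf-8", errors="replace"))
```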

For both retrieval approaches a number of performance measures (see section 6.2) are presented. In order to evaluate the relevance feedback, the precision is plotted over a number of iterations.
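A minimal sketch of such a per-iteration precision curve, assuming a simple set-based relevance model (the function names and the data layout are illustrative, not the setup of [Müller et al., 2003]):

```python
# Sketch: precision per relevance-feedback iteration.
# `retrieved_per_iter` holds one result list per iteration,
# `relevant` is the set of ground-truth relevant image ids.

def precision(result_list, relevant):
    """Fraction of retrieved images that are relevant."""
    if not result_list:
        return 0.0
    hits = sum(1 for img in result_list if img in relevant)
    return hits / len(result_list)

def precision_per_iteration(retrieved_per_iter, relevant):
    """One precision value per feedback iteration, ready for plotting."""
    return [precision(results, relevant) for results in retrieved_per_iter]

# Example: precision typically rises over the feedback iterations.
relevant = {"img03", "img07", "img09", "img12"}
iterations = [
    ["img01", "img03", "img05", "img08"],   # iteration 0
    ["img03", "img07", "img05", "img02"],   # iteration 1
    ["img03", "img07", "img09", "img12"],   # iteration 2
]
print(precision_per_iteration(iterations, relevant))  # [0.25, 0.5, 1.0]
```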

Although showing the benefits of Viper was only a secondary intention, they can be observed in the presented comparison.

PicSOM vs. GIFT/Viper

In [Rummukainen et al., 2003] PicSOM is compared with another CBIR-system.

For this purpose, PicSOM was modified to be able to communicate using MRML. Since GIFT/Viper is the only other system using MRML, it is chosen as the comparative system. GIFT/Viper is thereby used in a black-box fashion.

Computation time and storage requirements are compared only marginally; PicSOM is clearly faster and requires less storage. For evaluating the retrieval performance a PicSOM-specific measure is used, which is based on assumptions that GIFT does not fulfil: GIFT may present single pictures repeatedly. Furthermore, precision-recall diagrams are used.

Thus, the images presented in each iteration step determine the performance. An overlap of the non-relevant images with preceding iterations is acceptable, although obviously not very satisfying for the user.
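Under this reading, precision can be taken per iteration while recall accumulates over the relevant images seen so far; the following sketch makes that assumption explicit (it is an interpretation, not the measure defined in [Rummukainen et al., 2003]):

```python
# Sketch: one (precision, recall) point per feedback iteration.
# Precision is computed on the current result list only; recall
# counts every relevant image seen so far, so repeated non-relevant
# images do not distort it, matching the tolerated overlap above.

def pr_per_iteration(retrieved_per_iter, relevant):
    seen_relevant = set()
    points = []
    for results in retrieved_per_iter:
        hits = set(results) & relevant
        seen_relevant |= hits
        precision = len(hits) / len(results)
        recall = len(seen_relevant) / len(relevant)
        points.append((precision, recall))
    return points
```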

Based on the six analysed classes, no clear winner in performance could be determined.

Nevertheless, some insights regarding the PicSOM system are gained: the classes where PicSOM loses (horses, planes, cars) are identified, although the reason for this is still unknown.

However, it becomes obvious that a variety of classes is important for a fair benchmark. The same holds for using different performance measures.

INDI vs. PicSOM

Starting from the INDI point of view, a comparative evaluation is based on the pr-measure.

To get comparable values, at least the underlying image set and the search task have to be equal. Further variables like the used image features or the rating levels for the relevance feedback are regarded as system-specific; for deeper analyses they should be equal as well. A first rough evaluation, however, compares only the overall performance.

A set of example queries is used on the already introduced artexplosion image collection. The relevance feedback ratings are based on the predefined categories: positive and negative ratings (+ and –) are given for "is in the same category as the query" and "is not in the same category", respectively. Figure 6.15 presents an example plot of the averaged maximum precision-recall.
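This rating rule is simple enough to automate; a minimal sketch, assuming a hypothetical category lookup table:

```python
# Sketch: derive feedback ratings from the predefined categories.
# An image is rated "+" iff it belongs to the same category as the
# query image, "-" otherwise. `category_of` is a hypothetical
# lookup (image id -> category name).

def rate_results(query_img, results, category_of):
    query_cat = category_of[query_img]
    return {img: "+" if category_of[img] == query_cat else "-"
            for img in results}

category_of = {"q1": "underthesea", "r1": "underthesea", "r2": "zoo"}
print(rate_results("q1", ["r1", "r2"], category_of))  # {'r1': '+', 'r2': '-'}
```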

For the PicSOM system a better performance can be observed compared to the INDI system. INDI depends strongly on the initial configuration, and the retrieved images form one cluster in the feature space. PicSOM retrieves images based on relevance values.

Thus, results from different map regions are possible and the retrieved images may come from different clusters. More detailed insights, however, are not possible here.

Further analyses are necessary if two systems are compared on a performance measure only, since the systems have different striking attributes: PicSOM computes the relevance based on TS-SOMs (see section 2.2.1), while INDI focuses on the user interaction. To cover the special qualities of these systems they are usually evaluated in completely different ways:

PicSOM presents convolution maps (see figure 6.16) to visualise the relevance landscapes; a rough sketch of such a smoothing follows below.

INDI arranges user experiments to analyse the interface (see figure 6.17).
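A relevance landscape of this kind can be pictured as a relevance value per SOM unit, low-pass filtered over the map. The following sketch is an illustrative assumption (kernel, map size and the use of scipy), not the PicSOM implementation:

```python
# Sketch: smooth a relevance map defined on a SOM grid so that
# coherent "relevance landscapes" (cf. figure 6.16) become visible.
import numpy as np
from scipy.ndimage import convolve

relevance = np.zeros((16, 16))            # one value per SOM unit
relevance[4, 5] = relevance[5, 5] = 1.0   # units hit by relevant images

kernel = np.array([[1., 2., 1.],
                   [2., 4., 2.],
                   [1., 2., 1.]])
kernel /= kernel.sum()                    # normalised low-pass kernel

landscape = convolve(relevance, kernel, mode="constant")
print(landscape[3:7, 4:7].round(2))       # smeared-out relevance peak
```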

Figure 6.15: The averaged maximum precision-recall pr, plotted over 20 feedback iterations, for an underwater category search with INDI (lower line) and PicSOM (upper line). PicSOM adapts better and outperforms INDI.

Figure 6.16: Convolved SOMs of the seven artexplosion categories (from left to right: underthesea, zoo, doorsandwindows, teddybears, sunrise, venezuela and iceland). The used feature is a colour layout feature. The dark map regions contain the relevant images with respect to the different categories.

Figure 6.17: Evaluation of the user handling in INDI. Different interaction modalities are investigated regarding (from left to right) accommodation, efficiency, handling, nasty, fun, learn and patience. The interaction modalities are mouse+speech, touchscreen, touchscreen+speech and mouse. The extent along each axis represents how far the user has agreed with the corresponding thesis in a questionnaire. (For details see [Bauckhage et al., 2003])

AQUISAR vs. INDI

The benefits of the image retrieval system AQUISAR (section 2.2.5) are exposed in comparison to the INDI system (section 2.2.4). Table 6.1 presents the precision of both systems for retrieving aquarium images based on the same query set. Figure 6.18 shows extracts of some retrieval results.

The striking advantage of AQUISAR is that it can retrieve images with similar entities from different webcam settings, i.e. angles of view. The results of INDI contain just images taken from the same angle of view as the reference image. This disadvantage of a conventional CBIR-system like INDI is rooted in the fact that the main part of each aquarium image is covered by the background. Therefore, the surrounding is dominant for calculating the result lists. The comparison shows that the used INDI release depends strongly on the background of the images.

AQUISAR   0.65   0.04   0.25   1.00
INDI      0.48   0.04   0.13   0.88

Table 6.1: The achieved precision Pr(i) = N_i^+ / i is presented. A human user has rated the results and determined the number of interesting images N_i^+ for i = 8 retrieved images. For a comparison the same has been done for the CBIR-system INDI.
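As a hedged illustration of this definition (the helper below is hypothetical):

```python
# Sketch: Pr(i) = N_i_plus / i, the fraction of interesting images
# among the first i retrieved ones (here i = 8, rated by a human).

def precision_at(i, n_interesting):
    """n_interesting: number of interesting images N_i^+ among the first i."""
    return n_interesting / i

# Example: 2 of the first 8 retrieved images rated interesting
# yields 0.25, as in the third column of table 6.1.
print(precision_at(8, 2))  # 0.25
```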


Figure 6.18: Retrieval results of AQUISAR and INDI.

This is overcome by the segmentation step of AQUISAR. Thus, the integration of a segmentation module into INDI is motivated by such a comparison. Note that the relevance feedback which INDI offers is not used in this comparison.