

7.4 Experimental Evaluation

Table 7.1: Datasets used in the evaluation.

Name           Rows     Dimensions  Type
ALOI-27        110,250  27          color histograms, Zipfian
CLUSTERED-50   300,000  20          synthetic, uniform, 50 Gaussian clusters
PHOG           10,715   110         gradient histograms

• CLUSTERED-50: Synthetic dataset of 300,000 20-dimensional feature vectors, organized in 50 clusters. The means of the clusters are uniformly distributed in the data space, and each cluster follows a multivariate Gaussian distribution.

• PHOG: 10,715 feature vectors with 110 dimensions, derived from pyramid histograms of oriented gradients (HoG features). The features were provided by the work of [87] and represent gradient histograms extracted from medical computed tomography images. Their dimensionality was already reduced by applying a Principal Component Analysis (PCA) [116], and the dimensions are ordered by decreasing eigenvalue.
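For intuition, the eigenvalue ordering that the PHOG features rely on can be reproduced with a few lines of NumPy. This is an illustrative sketch of a standard PCA on synthetic data, not the actual pipeline of [87] or [116]; all names and parameters are assumptions:

```python
import numpy as np

# Illustrative sketch: project data onto principal components ordered by
# decreasing eigenvalue, so the leading dimensions carry the most variance --
# the property exploited when columns are resolved front-to-back.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # reorder by decreasing eigenvalue
components = eigvecs[:, order]
Z = Xc @ components                      # PCA-transformed features

var = Z.var(axis=0)
print(np.all(np.diff(var) <= 1e-9))      # per-dimension variance decreases
```

The check at the end confirms that, after the reordering, each resolved dimension contributes no more variance than the previous one.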

7.4.2 Pruning Power Evaluation

In the experiments, 50 k-NN queries (k = 10) were submitted to the database, and the number of feature vectors that could be pruned after each data column was resolved and the distance approximations were recomputed was measured. The charts in Figures 7.2, 7.3 and 7.4 show the averaged cumulative number of feature vectors pruned after each resolved column; the x-axis counts the number of resolved dimensions, and the achieved pruning power is plotted on the y-axis. The areas under the curves can thus be regarded as the amount of data that does not need to be resolved from disk, whereas the areas above the curves indicate the amount of data that must be fetched from disk for further refinement and for the computation of the distance approximations.

This observation can be regarded as a simple visual proof that tighter approximations yield higher pruning power, as more feature vectors can be pruned at a very early stage of the computation, so that further data columns of these feature vectors do not have to be resolved. In the ideal case, only a few columns have to be resolved until the k final nearest neighbors remain in the dataset.
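The measurement loop can be sketched as follows. This is a simplified, self-contained simulation of column-wise pruning under a lower-bound filter, not the evaluated implementation; the sampling-based pruning threshold and all names are illustrative:

```python
import random

def knn_column_pruning(data, query, k):
    """Simulated column-wise k-NN pruning (illustrative sketch).

    Lower bound: partial squared Euclidean distance over resolved columns
    (valid because unresolved columns only add non-negative terms).
    Threshold: k-th smallest exact distance among a small random sample
    that is resolved completely up front.
    Returns the cumulative number of vectors pruned after each column."""
    d = len(query)
    sample = random.sample(data, min(4 * k, len(data)))
    exact = sorted(sum((x - q) ** 2 for x, q in zip(v, query)) for v in sample)
    threshold = exact[k - 1]

    active = {i: 0.0 for i in range(len(data))}   # partial distances
    pruned_after = []
    pruned = 0
    for col in range(d):
        for i in list(active):
            active[i] += (data[i][col] - query[col]) ** 2
            if active[i] > threshold:   # lower bound already exceeds threshold
                del active[i]
                pruned += 1
        pruned_after.append(pruned)
    return pruned_after

random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(1000)]
counts = knn_column_pruning(data, data[0], k=10)
print(counts[-1], "of", len(data), "vectors pruned before full resolution")
```

The returned list corresponds to one curve in the charts: its cumulative, monotonically non-decreasing counts are what the y-axis of the pruning-power plots shows.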

Comparing ALOI-27 with the other datasets, it can be observed that Bond+ performs as expected on Zipfian-distributed, histogram-like datasets. Nevertheless, Bond+ resolves about half of the data on the CLUSTERED-50 dataset and almost all columns on PHOG.

This behavior demonstrates the strong dependence on the data distribution, which is addressed in this chapter.

In the first improvement step of BeyOND (cf. Subsection 7.3.2), the approach presented in this chapter proposes to divide the feature space into subcubes and thus to refine the simple Euclidean distance approximation Bond. The gain in pruning power on the ALOI-27 dataset is clearly visible in Figure 7.2. Nevertheless, Bond+ still achieves higher pruning power. On the contrary, on CLUSTERED-50, Beyond-1 outperforms Bond+, as the subcubes provide good approximations of the clusters. These approximations are, in this case, superior to the bounds of Bond+, since the underlying distribution of the data is not Zipfian (cf. Figure 7.3). Finally, the impact of Beyond-1 on PHOG is significantly higher than that of both Bond and Bond+, which do not perform well on PCA-transformed data at all and achieve a comparably high pruning power only while resolving the last dimensions.

Figure 7.2: Pruning power on ALOI-27 (x-axis: resolved dimensions; y-axis: pruned vectors; curves: Bond, Bond+, Beyond-1, BeyondMBR-1, Beyond-2).

Figure 7.3: Pruning power on CLUSTERED-50 (x-axis: resolved dimensions; y-axis: pruned vectors; curves: Bond, Bond+, Beyond-1, BeyondMBR-1, Beyond-2).

Figure 7.4: Pruning power on PHOG (x-axis: resolved dimensions; y-axis: pruned vectors; curves: Bond, Bond+, Beyond-1, BeyondMBR-1, Beyond-2).

Table 7.2 provides “snapshots” of the pruning power curves depicted in Figures 7.2, 7.3 and 7.4. The columns show the number of resolved columns (and their percentage of all dimensions) at which more than 25%, 50% and 90% of the candidates were pruned. The observations from rows 1-3 support the above statements; the best results for Beyond-1 are achieved on the CLUSTERED-50 dataset, where only half of the dimensions have to be resolved to achieve a pruning power of 90%.

The intuitive approach of adding more splits per dimension (Beyond-2) and thus decreasing the size of the subcubes performs well on ALOI-27 and CLUSTERED-50. This can be observed from the curve for this approach in Figures 7.2 and 7.3 and also from rows 4-6 of Table 7.2. In particular, the CLUSTERED-50 dataset benefits most from the quadratic growth in the number of subcubes (2^d → 4^d), which yields a very good approximation of the clusters. Figure 7.4 shows that the improvement on PHOG is negligible.
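The growth in the number of subcubes can be made concrete with a small sketch. The grid-assignment helper below is hypothetical (not from the thesis) and only illustrates how a regular grid maps vectors to subcubes and why one extra split squares the subcube count:

```python
def subcube_id(vector, cells_per_dim, lo=0.0, hi=1.0):
    """Map a vector in [lo, hi)^d onto the id of its grid subcube.
    Illustrative helper; names and the [0, 1) domain are assumptions."""
    code = 0
    for x in vector:
        # clamp the top edge into the last cell
        cell = min(int((x - lo) / (hi - lo) * cells_per_dim), cells_per_dim - 1)
        code = code * cells_per_dim + cell
    return code

d = 20  # dimensionality of CLUSTERED-50
print(2 ** d, "subcubes with one split per dimension (Beyond-1)")
print(4 ** d, "subcubes with the refined grid (Beyond-2)")
# 4^d == (2^d)^2: the refined grid squares the number of subcubes.
```

With 2 cells per dimension, each vector gets a d-bit code; doubling the cells per dimension refines every subcube into 2^d finer cells.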

The second improvement of BeyOND precomputes MBRs in case a subcube contains more than a single feature vector, the MBR would be small enough, and the maximum number of MBRs has not been reached yet (cf. Subsection 7.3.3). In this case, the portion of created MBRs that yield the largest volume decrease within the respective subcube w.r.t. the score function f(MBR) was limited to 1% of the overall number of feature vectors. The results are shown for the approach BeyondMBR-1, which denotes the MBR-based variant of Beyond-1. Here again, the results can be observed in Figures 7.2, 7.3 and 7.4, where BeyondMBR-1 is indicated by the dotted line, and in rows 7-9 of Table 7.2. On the ALOI-27 dataset, the initial pruning power in the first dimension is even comparable to Bond+. On CLUSTERED-50, BeyondMBR-1 yields results comparable to Beyond-2; here, 98% of the data could be pruned at once. PHOG again poses the hardest challenge due to its very high dimensionality. However, there is a slight improvement compared to the basic subcube approaches Beyond-1 and Beyond-2.

Table 7.2: Pruning power of Beyond-1, Beyond-2 and BeyondMBR-1.

Dataset        Approach      25%       50%       90%
ALOI-27        Beyond-1      16 (59%)  19 (70%)  23 (85%)
CLUSTERED-50   Beyond-1       7 (35%)   8 (40%)  10 (50%)
PHOG           Beyond-1      45 (41%)  58 (53%)  80 (73%)
ALOI-27        Beyond-2       7 (26%)   9 (33%)  21 (75%)
CLUSTERED-50   Beyond-2       1 (5%)    1 (5%)    1 (5%)
PHOG           Beyond-2      45 (41%)  55 (50%)  79 (72%)
ALOI-27        BeyondMBR-1    1 (4%)    1 (4%)   10 (37%)
CLUSTERED-50   BeyondMBR-1    1 (5%)    1 (5%)    1 (5%)
PHOG           BeyondMBR-1   37 (34%)  50 (45%)  77 (70%)

Table 7.3: Total amount of data viewed with the different approaches.

Dataset        Bond    Bond+   Beyond-1  Beyond-2  BeyondMBR-1
ALOI-27        96.3%    3.2%   66.9%     38.3%      7.7%
CLUSTERED-50   81.6%   51.4%   36.3%      1.6%      1.4%
PHOG           97.6%   99.0%   52.6%     52.3%     45.4%

7.4.3 Additional Splits vs. MBRs

Table 7.3 shows the total amount of resolved data, which is computed by

r = ( Σ_{i=1}^{d} (# resolved vectors)_i · i ) / ( (# vectors) · d − k ).
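Under this reading of the formula, where a vector whose resolution stopped after dimension i contributes i resolved columns, the ratio can be computed as in the following sketch (function and argument names are illustrative, not from the thesis):

```python
def resolved_data_ratio(pruned_per_dim, n_vectors, k):
    """Fraction of resolved data: a vector pruned after dimension i
    contributes i resolved columns; the denominator follows the formula
    in the text. Illustrative helper with assumed names."""
    d = len(pruned_per_dim)
    resolved = sum(count * i for i, count in enumerate(pruned_per_dim, start=1))
    return resolved / (n_vectors * d - k)

# Toy example: 100 vectors, d = 4, k = 10; 60 vectors pruned after
# dimension 1, 20 after dimension 2, 10 after dimension 3, 10 never pruned.
r = resolved_data_ratio([60, 20, 10, 0], n_vectors=100, k=10)
print(round(r, 3))  # -> 0.333
```

Smaller values of r correspond to the areas under the pruning-power curves: the less data resolved, the better the approximation.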

It can be observed that in the case of ALOI-27 and PHOG, it is more profitable to extend the original idea of BOND with a one-level VA-file and additional MBRs (BeyondMBR-1) than to simply add more layers (Beyond-2), which generates more subcubes.

On the CLUSTERED-50 dataset, there is almost no difference between BeyondMBR-1 and Beyond-2. Nevertheless, BeyondMBR-1 offers more flexibility regarding the choice of MBRs and the control of additional memory consumption than simply increasing the split level.
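The kind of bound that MBR caching enables can be illustrated with the classical minimum distance between a query point and an MBR. This is a generic sketch of the MINDIST bound, not the thesis implementation:

```python
def mindist_sq(query, mbr_lo, mbr_hi):
    """Squared minimum Euclidean distance from a query point to an MBR
    (classical MINDIST bound; illustrative sketch). Every point inside
    the MBR is at least this far from the query, so one comparison can
    prune all vectors cached under the MBR at once."""
    total = 0.0
    for q, lo, hi in zip(query, mbr_lo, mbr_hi):
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
        # else: the query coordinate lies inside the extent -> contributes 0
    return total

print(mindist_sq([0.0, 5.0], [2.0, 4.0], [6.0, 8.0]))  # prints 4.0
```

Because the bound covers every vector inside the MBR, tight MBRs translate directly into group-wise pruning, which is why limiting their number and volume matters.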

Overall, it can clearly be observed that on ALOI-27, Bond+ significantly outperforms the other approaches. This was expected due to the distribution of the dataset. However, the MBR caching of BeyondMBR-1 achieves almost the same pruning power. On CLUSTERED-50, the improvements of the BeyOND approaches are significant, in particular with one more split (Beyond-2) or with MBR caching (BeyondMBR-1). Finally, PHOG is a hard challenge for both Bond and Bond+, whereas Beyond-1 provides reasonably tight bounds with one split of the data cube. This result, however, can hardly be improved further.