Index-Based SSS – Bottom-Up - Similarity processing in multi-observation data

query subspace (d_S ≤ d) represented by a corresponding vector S of weights, a subspace k-NN query retrieves the set NN(k, S, q) that contains k objects from D for which the following condition holds:

$$
\forall x \in NN(k, S, q),\ \forall y \in \mathcal{D} \setminus NN(k, S, q):\ dist_S(x, q) \le dist_S(y, q). \qquad (8.3)
$$

Some of the rare existing approaches for subspace similarity search focus on ε-range queries, which is a considerable drawback (cf. Chapter 1). This is even more evident when searching subspaces of different dimensionality, because in this case the value of ε also needs to be adjusted to the subspace dimensionality in order to produce meaningful results. This is a non-trivial task even for expert users, since recall and precision of an ε-sphere become highly sensitive to even small changes of ε, depending on the dimensionality of the data space. Also, many applications, such as data mining algorithms that further process the results of subspace similarity queries, need to control the cardinality of such query results [137].

Therefore, the approaches that will be introduced in this chapter will focus on k-NN queries.
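As a point of reference for condition (8.3), the following minimal sketch answers a subspace k-NN query by a full scan. It is only a baseline under stated assumptions, not one of the approaches of this chapter: the function names, the tuple representation of objects, and the use of a weighted L_p distance with weight vector S are choices made for this illustration.

```python
def subspace_dist(x, q, weights, p=2.0):
    """Weighted L_p distance; dimensions with weight 0 are ignored."""
    return sum(w * abs(xi - qi) ** p for xi, qi, w in zip(x, q, weights)) ** (1.0 / p)


def subspace_knn_scan(data, q, weights, k, p=2.0):
    """Return the positions of the k objects of data closest to q in the subspace."""
    order = sorted(range(len(data)), key=lambda j: subspace_dist(data[j], q, weights, p))
    return order[:k]


# Three 3-dimensional objects; the query subspace consists of dimensions 0 and 2.
data = [(1.0, 9.0, 2.0), (4.0, 0.0, 2.5), (1.5, 5.0, 9.0)]
q = (1.0, 0.0, 2.0)
in_subspace = subspace_knn_scan(data, q, weights=(1, 0, 1), k=2)
in_full_space = subspace_knn_scan(data, q, weights=(1, 1, 1), k=2)
```

In this toy data set the two results differ: object 0 is far from q only in the ignored dimension 1, so it is part of the subspace result but not of the full-space result.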

8.3 Index-Based SSS – Bottom-Up

8.3.1 The Dimension-Merge Index

The solution to subspace similarity search that will be proposed in this section (referred to as Dimension-Merge Index) is based on the ad hoc combination of one-dimensional index structures. The combination technique is algorithmically inspired by top-k queries on a number of different rankings of objects according to different criteria. In the current scenario, if the objects are assumed to be ranked w.r.t. the distance to the query object in each dimension, respectively, it is possible to apply top-k methods to solve subspace k-NN queries with the rankings of the given subspace dimensions.

8.3.2 Data Structures

The key idea is to vertically decompose the data contained in D for the organization of full-dimensional space, as also performed in Chapter 7. For the restriction to subspaces, each dimension is organized separately in an index I_i (1 ≤ i ≤ d), using the feature value of the dimension as spatial key and the ID of the corresponding object as value. For this purpose, a B⁺-tree seems adequate, as it is specialized for indexing one-dimensional data. In this problem, however, it is even more appropriate to use a (one-dimensional) R*-tree [23], as it heuristically tries to minimize the extension of nodes, which was shown to be important for spatial queries. The used R*-tree has the following two modifications:

• Each leaf node has a link to its left and right neighbor. This relation is well defined since the tree only organizes a single dimension on which a canonical order is defined.

• Each leaf node stores the values of the facing boundaries of its two neighbors.
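The two modifications can be pictured with a small sketch of the leaf level only. This is a simplified stand-in, not an R*-tree: the leaves are produced by cutting the sorted values of one dimension into fixed-size pages, and the names Leaf, build_leaf_chain, left_facing and right_facing are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Leaf:
    entries: List[Tuple[float, int]]      # (feature value, object ID), sorted by value
    left: Optional["Leaf"] = None         # link to the left neighbor
    right: Optional["Leaf"] = None        # link to the right neighbor
    left_facing: Optional[float] = None   # upper boundary of the left neighbor
    right_facing: Optional[float] = None  # lower boundary of the right neighbor


def build_leaf_chain(values, page_size=2):
    """Sort one dimension and cut it into doubly linked leaves."""
    entries = sorted((v, oid) for oid, v in enumerate(values))
    leaves = [Leaf(entries[j:j + page_size]) for j in range(0, len(entries), page_size)]
    for a, b in zip(leaves, leaves[1:]):
        a.right, b.left = b, a
        a.right_facing = b.entries[0][0]   # boundary of b that faces a
        b.left_facing = a.entries[-1][0]   # boundary of a that faces b
    return leaves


leaves = build_leaf_chain([4.0, 0.5, 7.0, 2.0, 1.0, 3.5])
middle = leaves[1]
```

A plausible reading of the second modification is that, with the facing boundaries stored locally, the distance from the query to the next unread page is known without loading that page. For a query value of 3.0 inside the middle leaf, this distance is min(3.0 - 1.0, 4.0 - 3.0) = 1.0.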

Figure 8.1: Initial situation of the data space.

objectTable
Object   dist_I₁   dist_I₂   minDist_S   maxDist_S
(no entries)

Index Bounds
Index    I^min    I^max
I₁       0.3      9.0
I₂       0.0      8.5
L₂       0.3      12.4

Table 8.1: Initial situation of indexes and objectTable.

The second data structure needed is a hash table for storing (possibly incomplete) object information with the object ID as key. This table will be referred to as objectTable. It is used to store, for each object, the distance to the query object in each dimension. If this information is not known, the corresponding field remains empty. In Figure 8.1 and Table 8.1, an example for a two-dimensional subspace query is shown. In the example, the leaf nodes (pages) of the two relevant index structures I₁ and I₂ organizing the objects in the dimensions of the subspace are illustrated at the borders of the data space.

Initially, the objectTable is empty. Alongside the fields for the distance values of each dimension, the objectTable holds a lower and an upper bound for each object; these can be computed from the current index bounds of the one-dimensional indexes I₁ and I₂. The computation of these bounds will be detailed in the following.

8.3.3 Query Processing

When a subspace query (q, S) arrives, only those indexes I_i are considered where S_i = 1. On these one-dimensional indexes, incremental NN queries (where q_i is the query for I_i) are performed. A call of the function getNext() on the index I_i returns the leaf node closest to the query q_i in dimension i, whose contained objects have not yet been reported.

The challenge is to combine the results of the single dimensions into a result on the whole subspace. This is done by the objectTable, which is empty at the beginning of the query process. For each object x that is reported by an index I_i, an entry in the objectTable is created. If it already exists, the corresponding entry is updated (i.e., the distance w.r.t. dimension i of object x is set to dist_{I_i}(q_i, x_i), where the distance is restricted to the subspace corresponding to the current dimension i). If an object x has not yet been seen in index I_j (j ≠ i), its value in dimension j in the objectTable is undefined.

Algorithm 5 k-NN Query on Dimension-Merge Index: kNN-DMI(q, S, k, I)
Require: q, S, k, I
1: maxKnnDist ← ∞
2: while maxKnnDist ≥ minObjectDist_S(q, I) do
3:   i ← chooseIndex(I)
4:   leafNode ← I_i.getNextNode(q_i)
5:   objectTable.insert(leafNode.elements)
6:   maxKnnDist ← objectTable.getMaxKnnDist(k)
7: end while
8: objectTable.refine()

The distance dist_S(q, x) between an object x ∈ D and q in the subspace S can be upper bounded by

$$
\mathit{maxDist}_S(q, x) = \sqrt[p]{\sum_{i=1}^{d} S_i \cdot
\begin{cases}
|x_i - q_i|^p & \text{(Case 1)} \\
\max\left(|I_i^{min} - q_i|,\ |I_i^{max} - q_i|\right)^p & \text{(Case 2)}
\end{cases}}
\qquad (8.4)
$$

where I_i^min and I_i^max are the lower and upper bound of the data contained in D in dimension i, respectively. These bounds can be obtained directly from the index I_i, as they correspond to the boundaries of the root node. Obviously, it holds that maxDist_S(q, x) ≥ dist_S(q, x).

For the calculation of maxDist_S(q, x), two cases have to be considered: if object x has been found in index I_i (Case 1), the exact value in this dimension can be used. Otherwise, the bounds of the data space have to be used in order to approximate the value in this dimension (Case 2). Using Equation (8.4) and the information contained in the objectTable, an upper bound for the distance dist_S(q, x) can be obtained for each object x ∈ D. Therefore, it is also possible to calculate an upper bound for the distance of the kth-nearest neighbor to the query object, which can be used as pruning distance. The upper bound is recorded in the objectTable, and updated if necessary.
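The two cases of Equation (8.4) and the derived pruning distance can be sketched as follows. This is a minimal illustration for unit weights (S_i = 1 on the listed dimensions); the function names and the dictionary layout of the objectTable are assumptions. The numbers are read off Table 8.2, assuming that its I^max column already holds the larger of the two boundary distances and that a single per-dimension value belongs to the dimension that reproduces the listed bounds.

```python
import math


def max_dist(known, far_bound, subspace, p=2.0):
    """Upper bound of Eq. (8.4) for one object.

    known     -- distances |x_i - q_i| for the dimensions already reported (Case 1)
    far_bound -- max(|I_i^min - q_i|, |I_i^max - q_i|) per dimension (Case 2)
    """
    total = 0.0
    for i in subspace:
        d = known[i] if i in known else far_bound[i]
        total += d ** p
    return total ** (1.0 / p)


def max_knn_dist(object_table, far_bound, subspace, k, p=2.0):
    """k-th smallest upper bound over all seen objects (infinity if fewer than k)."""
    uppers = sorted(max_dist(known, far_bound, subspace, p)
                    for known in object_table.values())
    return uppers[k - 1] if len(uppers) >= k else math.inf


far_bound = {1: 9.0, 2: 8.5}
object_table = {"x1": {2: 0.5}, "x3": {1: 0.4}, "x7": {1: 1.9, 2: 1.5}}
upper_x3 = max_dist(object_table["x3"], far_bound, subspace=(1, 2))
pruning_distance = max_knn_dist(object_table, far_bound, subspace=(1, 2), k=1)
```

Rounded to two decimals, upper_x3 is 8.51 and the pruning distance for k = 1 is 2.42, the upper bound of x₇, which is the only object known in both dimensions.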

Analogously, a lower bound for each object in the objectTable can be obtained by

$$
\mathit{minDist}_S(q, x) = \sqrt[p]{\sum_{i=1}^{d} S_i \cdot
\begin{cases}
|x_i - q_i|^p & \text{(Case 1)} \\
|I_i^{next} - q_i|^p & \text{(Case 2)}
\end{cases}}
\qquad (8.5)
$$

where I_i^next is the position of the query-facing boundary of the page obtained by the next call of the function getNext() on I_i. Again, it is necessary to distinguish the cases where x_i has been reported (Case 1) or where it is undefined at the moment (Case 2). This lower bound is important for the refinement step of the query algorithm; it is recorded in the objectTable and updated if necessary.
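Unlike the upper bound, the lower bound of Equation (8.5) depends on how far each index has already been read, so it tightens as the query proceeds. The following sketch illustrates this for unit weights; the function name is hypothetical, and the values are the index bounds of Tables 8.1 and 8.2 combined with the distance 0.5 of x₁, which is assumed here to belong to dimension 2.

```python
def min_dist(known, next_bound, subspace, p=2.0):
    """Lower bound of Eq. (8.5) for one object.

    known      -- distances |x_i - q_i| for the dimensions already reported (Case 1)
    next_bound -- |I_i^next - q_i| per dimension, i.e. the distance of the
                  query-facing boundary of the next unread page (Case 2)
    """
    total = 0.0
    for i in subspace:
        d = known[i] if i in known else next_bound[i]
        total += d ** p
    return total ** (1.0 / p)


known_x1 = {2: 0.5}
# Same object, evaluated against the initial and the later index bounds.
initial = min_dist(known_x1, {1: 0.3, 2: 0.0}, subspace=(1, 2))
later = min_dist(known_x1, {1: 2.9, 2: 1.9}, subspace=(1, 2))
```

With the later bounds the result rounds to 2.94, the value listed for x₁ in Table 8.2. With the initial bounds it would only be about 0.58, because the unread part of I₁ could still contain a value very close to q₁.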

The pseudocode for a subspace k-NN query on the Dimension-Merge Index is given in Algorithm 5. Initially, the upper bound of the kth-nearest neighbor distance (maxKnnDist) is set to infinity. As long as there exists an object which could have a lower distance than the current maxKnnDist and which is not in the objectTable, the filter step has to be continued and, thus, more points have to be inserted into the objectTable. The minimum distance of an object which is not in the objectTable is given by

$$
\mathit{minObjectDist}_S(q, I) = \sqrt[p]{\sum_{i=1}^{d} S_i \cdot |I_i^{next} - q_i|^p}. \qquad (8.6)
$$

At the point where minObjectDist is larger than the maxKnnDist (as seen in Figure 8.2 and Table 8.2 for k = 1), the algorithm enters the refinement step. The objects were retrieved in ascending order of their indices. In the current state, minObjectDist exceeds maxKnnDist = maxDist_S(q, x₇) for the first time. Now, no object which is not in the objectTable can be part of the result; therefore, only objects contained in the objectTable at this time have to be considered. In order to keep the number of resolved objects (corresponding to the number of expensive page accesses) low, the technique for refinement from optimal multi-step processing, proposed in [135], is used.

Figure 8.2: Situation of the data space after four getNext()-calls.

objectTable
Object   dist_I₁   dist_I₂   minDist_S   maxDist_S
x₁       -         0.5       2.94        9.01
x₂       -         1.0       3.06        9.06
x₃       0.4       -         1.94        8.51
x₄       1.0       -         2.15        8.56
x₅       1.5       -         2.42        8.63
x₆       -         1.0       3.07        9.06
x₇       1.9       1.5       2.42        2.42
x₈       1.7       -         2.55        8.67

Index Bounds
Index    I^min    I^max
I₁       2.9      9.0
I₂       1.9      8.5
L₂       3.47     12.4

Table 8.2: Situation of indexes and objectTable after four getNext()-calls.
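In Table 8.2, Equation (8.6) evaluates to (2.9² + 1.9²)^(1/2) ≈ 3.47 for p = 2, which exceeds maxKnnDist = 2.42, so the filter step stops. The whole filter loop of Algorithm 5 can be sketched as follows, under several loudly stated assumptions: each index I_i is replaced by a sorted list cut into pages (the hypothetical class Cursor) instead of an R*-tree, chooseIndex is a placeholder that picks the index with the closest unread page, and the refinement simply resolves every candidate rather than using the technique of [135].

```python
import math


class Cursor:
    """Incremental nearest-page scan over one sorted dimension (stand-in for I_i)."""

    def __init__(self, values, qi, page_size):
        entries = sorted((v, oid) for oid, v in enumerate(values))
        self.pages = [entries[j:j + page_size] for j in range(0, len(entries), page_size)]
        self.qi = qi
        home = 0
        while home + 1 < len(self.pages) and self.pages[home + 1][0][0] <= qi:
            home += 1
        self.left, self.right = home, home + 1  # next unread page on each side

    def _gap(self, idx):
        if not 0 <= idx < len(self.pages):
            return math.inf
        lo, hi = self.pages[idx][0][0], self.pages[idx][-1][0]
        return max(lo - self.qi, self.qi - hi, 0.0)

    def next_bound(self):  # |I_i^next - q_i|
        return min(self._gap(self.left), self._gap(self.right))

    def far_bound(self):  # max(|I_i^min - q_i|, |I_i^max - q_i|)
        return max(abs(self.pages[0][0][0] - self.qi), abs(self.pages[-1][-1][0] - self.qi))

    def get_next(self):
        if self._gap(self.left) <= self._gap(self.right):
            idx, self.left = self.left, self.left - 1
        else:
            idx, self.right = self.right, self.right + 1
        return self.pages[idx]


def knn_dmi(data, q, subspace, k, p=2.0, page_size=2):
    def lp(parts):
        return sum(d ** p for d in parts) ** (1.0 / p)

    cursors = {i: Cursor([x[i] for x in data], q[i], page_size) for i in subspace}
    table = {}          # objectTable: object ID -> {dimension: distance in it}
    max_knn = math.inf  # line 1
    while True:
        min_object_dist = lp(c.next_bound() for c in cursors.values())  # Eq. (8.6)
        if max_knn < min_object_dist or math.isinf(min_object_dist):
            break  # condition of line 2 fails, or an index is exhausted
        i = min(cursors, key=lambda j: cursors[j].next_bound())  # placeholder chooseIndex
        for value, oid in cursors[i].get_next():                 # lines 4-5
            table.setdefault(oid, {})[i] = abs(value - q[i])
        uppers = sorted(lp(known.get(j, cursors[j].far_bound()) for j in subspace)
                        for known in table.values())             # Eq. (8.4)
        if len(uppers) >= k:
            max_knn = uppers[k - 1]                              # line 6
    # Simplified refinement: resolve every candidate exactly and keep the best k.
    exact = {oid: lp(abs(data[oid][j] - q[j]) for j in subspace) for oid in table}
    return sorted(exact, key=exact.get)[:k]


data = [(0, 0), (1, 5), (2, 1), (9, 9)]
nearest = knn_dmi(data, q=(1, 1), subspace=(0, 1), k=1)
```

In this toy run the loop stops after two page reads, and the object (9, 9) never enters the objectTable. The extra exhaustion check is needed in the sketch because an exhausted index yields an infinite bound, which would otherwise keep the loop condition true while maxKnnDist is still infinite.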

Algorithm 5 can easily be adapted to ε-range queries. Only the maxKnnDist has to be set to ε and does not have to be updated (i.e., line 6 is to be omitted).

8.3.4 Index Selection Heuristics

With respect to performance, the most important part of the algorithm is the chooseIndex method (line 3). For a fast termination of the filter step it is necessary to

• find and minimize the upper bound of maxKnnDist and

• increase the minimum distance a page can have

