Accelerated Probability Computation - Similarity processing in multi-observation data

Assuming that the observation x_i,j is currently processed, the following characteristics hold: since the observations are processed in ascending order w.r.t. the distance to q, the observation table entry OT[h][1] reflects the probability that object X_h is closer to q than the observation x_i,j. On the other hand, OT[h][0] reflects the probability that x_i,j is closer to q than X_h.

The entries of the probability table can now be computed by iteratively fetching the observations from B. Thereby, all entries of the probability table are initially set to 0.

Then, the distance browsing B, which reports one observation of an uncertain object in each iteration, is started. Each reported observation x_i,j is used to compute, for all r (1 ≤ r ≤ N), the probability value that corresponds to the table entry PT[r][i][j]. After filling the (i,j)-column of the probability table, the next observation is fetched from B and processed in the same way as x_i,j. This procedure is repeated until all observations are fetched from B.

The computation of the probability table can be very costly in space and time. One reason is the size of the table, which grows drastically with the number of objects and the number of observations per object. Another problem is the very expensive computation of the probability table entries PT[r][i][j], which is the computational bottleneck of the proposed probabilistic ranking algorithm. For each entry PT[r][i][j], the probabilities have to be computed for

$$\binom{N-1}{r-1}$$

different r-set permutations, which have to be summed up to the final probability value. For example, assuming N−1 = 100 and r−1 = 20, about 5.36·10²⁰ r-set permutations need to be considered. Therefore, Section 11.4 will propose methods that achieve a considerable reduction of the overall query cost.

11.4 Accelerated Probability Computation

each row of PT that corresponds to a ranking position which is not within a particular ranking range can be skipped as well. This range is given by the minimum and maximum ranking position of uncertain objects for which currently a column of the probability table has to be maintained. The following lemmata utilize the bounds for uncertain distances that were introduced in Definition 11.2. A lower bound for the ranking position of an uncertain object is defined as follows.

Lemma 11.1 (Minimum Ranking Position) Let X ∈ D be an uncertain object and let q be the query observation. Furthermore, let N₁ < N objects Y ∈ D \ {X} have a maximum distance that is smaller than the minimum distance of X, i.e., |{Y ∈ D \ {X} : maxDist(Y, q) < minDist(X, q)}| = N₁. Then, the ranking position of object X must be at least N₁ + 1.

Analogously, an upper bound for the ranking position of an uncertain object is defined as follows.

Lemma 11.2 (Maximum Ranking Position) Let X ∈ D be an uncertain object and let q be the query observation. Furthermore, let N₂ < N objects Y ∈ D \ {X} have a minimum distance that is higher than the maximum distance of X, i.e., |{Y ∈ D \ {X} : minDist(Y, q) > maxDist(X, q)}| = N₂. Then, the ranking position of object X must be at most N − N₂.
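The two lemmata can be illustrated with a minimal Python sketch. It assumes that each object is summarized only by its pair (minDist, maxDist) w.r.t. q; the function name rank_bounds and the example values are hypothetical and are not taken from this chapter.

```python
from typing import Dict, Tuple


def rank_bounds(bounds: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[int, int]]:
    """Map each object id to its (minimum, maximum) ranking position.

    `bounds` maps an object id to (minDist, maxDist) w.r.t. the query q.
    """
    n = len(bounds)
    result = {}
    for x, (x_min, x_max) in bounds.items():
        # N1: objects that are certainly closer to q than X (Lemma 11.1).
        n1 = sum(1 for y, (_, y_max) in bounds.items() if y != x and y_max < x_min)
        # N2: objects that are certainly farther from q than X (Lemma 11.2).
        n2 = sum(1 for y, (y_min, _) in bounds.items() if y != x and y_min > x_max)
        result[x] = (n1 + 1, n - n2)
    return result


example = {"A": (1.0, 2.0), "B": (1.5, 4.0), "C": (3.0, 5.0), "D": (6.0, 7.0)}
print(rank_bounds(example))
```

In this example, object D is certainly preceded by all three other objects, so its ranking range collapses to the single position 4 and its rank needs no probability computation at all.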

As mentioned above, the computation of the object probabilities for the ranking position i only requires considering those objects whose minimum and maximum ranking positions cover the ranking position i. This holds for those objects having at least one observation within the current ranking position range. For all other objects, this rule of spatial pruning can be applied. Usually, in practice, this is the case for only a small set of objects, depending on their spatial variance, also referred to as degree of uncertainty.

The experimental section of this chapter (Section 11.5) will reflect the degree of uncertainty of an object by the spatial variance of its observations. This definition will slightly vary in Chapters 12 and 13, where the degree of uncertainty will correspond to the side length of the hyperrectangle in which the observations are distributed and to the standard deviation of the observations. However, the information contained in these different semantics can be regarded as similar.

11.4.2 Bisection-Based Algorithm

In the case of subsequently fetching observations belonging to the same object, the ranking probabilities according to this object do not change. Hence, obviously only one computation of the probability value is required. However, the general case where two adjacent observations reported from the ranking belong to different objects occurs more frequently.

For this case, the computational cost can be significantly reduced if a bisection-based algorithm is utilized, as proposed in [45]. The bisection-based algorithm uses a divide-and-conquer technique which computes, for a query observation q and a database object X, the probability that the object X is on rank r w.r.t. the distance to the query observation q, i.e., that exactly r−1 other objects are closer to q than the object X.

Algorithm 6 Bisection-Based Algorithm: bisection(OT, min, max, r)
Require: OT, min, max, r
 1: result ← 0
 2: N ← max − min + 1
 3: if r = 1 then
 4:     result ← ∏_{i=min}^{max} OT[i][0]
 5: else if r ≥ N then
 6:     result ← ∏_{i=min}^{max} OT[i][1]
 7: else
 8:     mid ← ⌈(min + max)/2⌉
 9:     for (i = 0 → min(⌈(max − min)/2⌉, r − 1)) do
10:         P_left ← bisection(OT, min, mid − 1, r − i − 1)
11:         P_right ← bisection(OT, mid, max, i)
12:         result ← result + (P_left · P_right)
13:     end for
14: end if
15: return result

Hence, the number of r-set permutations that have to be computed can be reduced drastically. The main idea is to recursively perform a binary split of the set of relevant objects, i.e., objects which have to be taken into account for the probability computation. Instead of considering all r−1 out of N−1 permutations, the r-set is split into two subsets of equal size. Then, only r−i−1 out of (N−1)/2 permutations for i ∈ {0, ..., r−1} have to be considered for the one subset, combined with the i out of (N−1)/2 permutations of the other subset. As a consequence, instead of considering

$$\binom{N-1}{r-1}$$

r-set permutations, the number of r-set permutations to be considered can be reduced to

$$\sum_{i=0}^{r-1} \left[ \binom{\frac{N-1}{2}}{r-i-1} + \binom{\frac{N-1}{2}}{i} \right].$$

The pseudocode for the computation of the rank probability is illustrated in Algorithm 6.

The bucket range of the r-set that is currently worked on is limited by the parameters min and max. The observation table, which is used for probability computation (cf. Subsection 11.3.2), is denoted by the additional parameter OT. The r-set split can be recursively repeated for each subset. The recursive decomposition of a subset into two buckets for each recursion, from which r−1 (0 < r < N) out of N−1 permutations have to be computed, stops if r ≥ N. Then, there exists only one permutation σ_r(i) in the current bucket that can be immediately computed and reported to the calling function of the recursion (line 6). Otherwise, the actual recursive splitting, which computes the results for the two factors P_left and P_right in each recursion, is performed in lines 9ff. The size of the divided permutations σ_r(i) is determined by the minimum of the bucket size ⌈(max − min)/2⌉ and r−1. If r = 1, the probability that there is no object closer to q than x_i,j is computed (line 4).

Afterwards, the corresponding results can be efficiently merged into the final result.

Although this approach significantly reduces the cost of computing the entries PT[r][i][j], the asymptotic cost is still exponential in the ranking range.
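The divide-and-conquer idea can be sketched in Python as follows. This is an illustrative variant and not a transcription of Algorithm 6: the hypothetical function prob_exactly_closer counts k = r−1 closer objects instead of passing the rank r, and it assumes that OT[h][1] + OT[h][0] = 1 for every object h and that the objects are independent.

```python
from math import prod
from typing import Sequence


def prob_exactly_closer(ot: Sequence[Sequence[float]], lo: int, hi: int, k: int) -> float:
    """Probability that exactly k of the objects ot[lo..hi] (inclusive) are closer to q.

    ot[h][1] is the probability that object h is closer to q than the current
    observation, ot[h][0] the probability that it is not.
    """
    n = hi - lo + 1
    if k < 0 or k > n:
        return 0.0
    if k == 0:  # no object of the bucket is closer
        return prod(ot[h][0] for h in range(lo, hi + 1))
    if k == n:  # every object of the bucket is closer
        return prod(ot[h][1] for h in range(lo, hi + 1))
    mid = (lo + hi + 1) // 2  # left bucket: lo..mid-1, right bucket: mid..hi
    total = 0.0
    for i in range(k + 1):
        # k - i closer objects in the left bucket, i in the right bucket
        total += (prob_exactly_closer(ot, lo, mid - 1, k - i)
                  * prob_exactly_closer(ot, mid, hi, i))
    return total


ot = [[0.5, 0.5], [0.8, 0.2], [0.1, 0.9]]
# Probability of rank r is the probability that exactly r - 1 objects are closer.
ranks = [prob_exactly_closer(ot, 0, len(ot) - 1, r - 1) for r in range(1, len(ot) + 2)]
print(ranks)
```

The sketch makes the remaining cost visible: each call still branches into up to k+1 pairs of recursive calls, which is why the approach stays exponential in the ranking range.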

11.4.3 Dynamic-Programming-Based Algorithm

In the following, an algorithm will be introduced that accelerates the computation by several orders of magnitude. This algorithm utilizes a dynamic-programming scheme, also known as Poisson Binomial Recurrence, first introduced in [147]. In the context of uncertain top-k queries, this scheme was originally proposed in [214] on the x-relation model, which was the first approach to solve probabilistic queries efficiently by means of dynamic-programming techniques. Here, this scheme is extended for use with spatial data and computes the probability that an uncertain object X ∈ D is assigned to a certain ranking position w.r.t. the distance to a query observation q.

The probabilities of PT can thereby be computed efficiently with a runtime complexity of O(N³).

The key idea of this approach is based on the following property. Consider a query observation q, an observation x of an uncertain database object X, and a set of h objects S = {Z₁, Z₂, ..., Z_h} for which the probability P_x(Z) that Z ∈ S is closer to the query observation q than x is known (i.e., all objects Z for which at least one observation has been retrieved from B). The probability P_x(Z) can be computed according to the following lemma.

Lemma 11.3 Let q be the query object and let (x, P(X = x)) be the observation x of an object X fetched from the distance browsing B in the current processing iteration. The probability that an object Z ≠ X is closer to q than x is

$$P_x(Z) = \sum_{i=1}^{j} P(Z = z_i),$$

where z_i ∈ Z, 1 ≤ i ≤ j, are the observations of Z fetched in previous processing iterations.

Lemma 11.3 says that it is possible to accumulate, in overall linear space, the probabilities of all observations for all objects which have been seen so far and to use them to compute P_x(Z), given the current observation x and any object Z ∈ D \ {X}.
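The accumulation described by Lemma 11.3 can be sketched as follows. The sketch assumes that an observation is given as a triple (object id, distance to q, probability), that sorting by distance stands in for the distance browsing B, and that no two observations have the same distance; the names Observation and closer_probabilities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Observation = Tuple[str, float, float]  # (object id, distance to q, P(X = x))


def closer_probabilities(
    observations: Iterable[Observation],
) -> Iterator[Tuple[Observation, Dict[str, float]]]:
    """Yield each observation x together with P_x(Z) for all objects Z seen so far."""
    seen: Dict[str, float] = defaultdict(float)  # one accumulator per object
    for obs in sorted(observations, key=lambda o: o[1]):  # ascending distance to q
        obj, _dist, prob = obs
        # P_x(Z) is the accumulated probability of Z's earlier observations (Z != X).
        yield obs, {z: p for z, p in seen.items() if z != obj}
        seen[obj] += prob


observations = [("A", 1.0, 0.5), ("B", 2.0, 0.4), ("A", 3.0, 0.5), ("B", 4.0, 0.6)]
for obs, p_x in closer_probabilities(observations):
    print(obs, p_x)
```

Only one accumulator per object is kept, which corresponds to the linear space stated above. Objects that have not yet been seen have P_x(Z) = 0 and are simply absent from the dictionary.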

Now, the probability P_i,S,x that exactly i objects Z ∈ S are ranked higher than x w.r.t. the distance to q can be computed efficiently, utilizing the following lemma.

Lemma 11.4 The event that i objects of S are closer to q than x occurs if one of the following conditions holds. In the case that an object Z ∈ S is closer to q than x, then i−1 objects of S \ {Z} must be closer to q. Otherwise, if the assumption is made that object Z ∈ S is farther from q than x, then i objects of S \ {Z} must be closer to q.

Figure 11.3: Visualization of the dynamic-programming scheme.

The above lemma leads to the following recursion, which allows P_i,S,x to be computed by means of the paradigm of dynamic programming:

$$P_{i,S,x} = P_{i-1,S \setminus \{Z\},x} \cdot P_x(Z) + P_{i,S \setminus \{Z\},x} \cdot (1 - P_x(Z)), \tag{11.1}$$

where P_0,∅,x = 1 and P_i,S,x = 0 if i < 0 ∨ i > |S|.

An illustration of this dynamic-programming scheme is given in Figure 11.3, where the size of S is marked along the x-axis and the number of objects that are closer to q than the currently processed observation x is marked along the y-axis. The shaded cells represent the probabilities that have to be determined during the process of the RPD computation.

As illustrated, each grid cell (one of which is marked with a dot in Figure 11.3 as an example) can be computed using the values contained in the left and the lower left cells. If the ranking depth is restricted to k, the probabilities that up to k−1 out of N−1 objects are closer to q than x are needed (N−1 rather than N objects, as x cannot be preceded by the object it belongs to). In each iteration of the dynamic-programming algorithm, O(N·k) cells have to be computed (which is O(N²) in the setting of this chapter). Performing this for each observation that is retrieved from the distance browsing B yields an overall runtime of O(N³), as it can be assumed that the total number of observations in the database is linear in the number of database objects.

Regarding the storage requirements for the probability values, the computation of each probability P_i,S,x only requires information stored in the current line and the previous line to access the probabilities P_{i−1,S\{Z},x} and P_{i,S\{Z},x}. Therefore, only these two lines (of length N) need to be preserved, requiring O(N) space. The probability table PT used in the straightforward and in the divide-and-conquer-based approach (cf. Subsection 11.3.2), in contrast, had to store N²·m values, resulting in an overall space requirement of O(N³).
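The recursion for P_i,S,x, with its base case P_0,∅,x = 1, can be sketched in a few lines of Python. The sketch is a space-saving variant under stated assumptions, not necessarily the layout used in this chapter: instead of two lines it keeps a single array of length k that is updated in place from high to low i, and it assumes k ≥ 1 and independent objects. The name rank_distribution is hypothetical.

```python
from typing import List, Sequence


def rank_distribution(p_closer: Sequence[float], k: int) -> List[float]:
    """Return [P_{0,S,x}, ..., P_{k-1,S,x}] for S given by p_closer.

    p_closer[h] is P_x(Z_h), the probability that object Z_h is closer to q
    than the current observation x.
    """
    row = [0.0] * k
    row[0] = 1.0  # base case: P_{0,∅,x} = 1
    for p in p_closer:  # add one object Z to S per iteration
        # Go from high to low i so that row[i - 1] still refers to S \ {Z}.
        for i in range(k - 1, 0, -1):
            row[i] = row[i - 1] * p + row[i] * (1.0 - p)
        row[0] *= 1.0 - p  # i = 0: Z must not be closer
    return row


dist = rank_distribution([0.5, 0.2, 0.9], k=4)
print(dist)  # dist[i] is the probability that exactly i objects are closer than x
```

Each object contributes O(k) work, so one observation costs O(N·k), in line with the cell count stated above. Restricting k below N simply truncates the array, since the entry for i never depends on entries above i.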

While the bisection-based algorithm still requires exponential asymptotical runtime for the computation of the RPD, the dynamic-programming-based algorithm only requires a
