Problem Definition - Similarity processing in multi-observation data

Figure 11.1: Distance range query on objects with different uncertainty representations. (a) Query on objects represented by mean positions. (b) Query on objects with full uncertainty.

object consists of a set of multiple observations which are mutually exclusive.

In the following, Section 11.2 formally defines different semantics of probabilistic ranking on uncertain objects. Section 11.3 then introduces a framework containing the essential modules for computing the rank probabilities of uncertain observations. Section 11.4 presents two approaches that speed up the computation of the rank probability distribution. These approaches are evaluated w.r.t. effectiveness and efficiency in Section 11.5. Finally, Section 11.6 concludes this chapter.

11.2 Problem Definition

Here, the condition $\sum_{(d,p) \in dist_u(X,Y)} p = 1$ holds.

Since distance computations between uncertain objects are very expensive, computationally inexpensive distance approximations are required in order to reduce the candidate set in a filter step. For this reason, it makes sense to introduce distance approximations that lower- and upper-bound the uncertain distance between two uncertain objects.

Definition 11.2 (Minimum and Maximum Object Distance) Let X = {x_1, . . . , x_m} and Y = {y_1, . . . , y_{m'}} be two uncertain objects. Then, the distance

$$minDist(X, Y) = \min_{i \in \{1,\dots,m\},\, j \in \{1,\dots,m'\}} dist(x_i, y_j)$$

is called the minimum distance between the objects X and Y, and

$$maxDist(X, Y) = \max_{i \in \{1,\dots,m\},\, j \in \{1,\dots,m'\}} dist(x_i, y_j)$$

is called the maximum distance between X and Y.
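The two bounds of Definition 11.2 can be illustrated with a minimal sketch. It assumes that observations are points in Euclidean space and that dist is the Euclidean distance; the function names `min_dist` and `max_dist` and the sample coordinates are illustrative and not taken from the source.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def min_dist(X, Y):
    """Smallest distance over all observation pairs (x_i, y_j)."""
    return min(dist(x, y) for x, y in product(X, Y))


def max_dist(X, Y):
    """Largest distance over all observation pairs (x_i, y_j)."""
    return max(dist(x, y) for x, y in product(X, Y))


# Two hypothetical uncertain objects with two observations each.
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(4.0, 0.0), (4.0, 3.0)]

print(min_dist(X, Y))  # 3.0, attained by (1, 0) and (4, 0)
print(max_dist(X, Y))  # 5.0, attained by (0, 0) and (4, 3)
```

Every possible uncertain distance between X and Y lies between these two values, which is what makes them usable in a filter step. This naive version still enumerates all m · m' observation pairs.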

11.2.2 Probabilistic Ranking on Uncertain Objects

A probabilistic ranking query assigns a set of probability values to each result object, one value for each ranking position. This Rank Probability Distribution (RPD) is defined as follows:

Definition 11.3 (Rank Probability Distribution (RPD)) Let Q be an uncertain query object and let D be a database containing N uncertain objects. A Rank Probability Distribution (RPD) is a function P_Q : D × {1, . . . , k} → [0,1] that reports, for a database object X ∈ D and a ranking position i ∈ {1, . . . , k}, the probability that X is at the i-th ranking position when the objects are ranked in ascending order of their uncertain distance dist_u(X, Q) to the query object Q.

Table 11.1 summarizes the notation used most frequently in this chapter. Chapter 12 also relies on this notation.

Assuming k = N, the RPD represents a complete probabilistic assignment of each database object to its possible ranking positions, which can be visualized by a bipartite graph in which the zero probabilities P_Q(X, i) = 0 (1 ≤ i ≤ N) can be omitted (cf. Figure 9.2 in Chapter 9). Because a user may be overwhelmed by such ambiguous ranking results, several query variants are proposed in the following. They specify how the results of an RPD can be aggregated and reported in a form that is easier to read. In particular, only one object is reported for each ranking position i, namely the object that is most likely to appear at position i. The final unambiguous rankings can be built in a postprocessing step.

Notation     Description
D            an uncertain database
N            the cardinality of D
k            the ranking depth that determines the number of ranking positions of the ranking query result
m            the number of observations belonging to an object
Q            an uncertain query object with respect to which a rank probability distribution (RPD) is computed
q            a query observation belonging to Q with respect to which an RPD is computed
B            a distance browsing of D w.r.t. q
X, Y, Z      uncertain objects, each corresponding to a finite set of alternative observations
x, y, z      observations belonging to the objects X, Y, Z, respectively
P(X = x)     the probability that object X is represented by observation x
S            a set of objects that have already been retrieved, i.e., the set that contains an object X iff at least one observation of X has already been returned by the distance browsing B
P_q(X, i)    the probability that object X is assigned to the i-th ranking position, i.e., the probability that exactly i−1 objects in D \ {X} are closer to q than X
P_q(x, i)    the probability that an observation x of object X is assigned to the i-th ranking position, i.e., the probability that exactly i−1 objects in D \ {X} are closer to q than x
P_{i,S,x}    the probability that exactly i objects Z ∈ S are closer to q than observation x
P_x(Z)       the probability that object Z is closer to the query observation q than the observation x; computable using Lemma 11.3

Table 11.1: Table of notations used in this chapter and in Chapter 12.

U-kRanks Query

A U-kRanks query is defined according to [192, 214] as follows:

Definition 11.4 (U-kRanks) A U-kRanks query incrementally retrieves for a ranking position i a result tuple of the form (X, P_Q(X, i)), where X ∈ D has a probability of appearing at ranking position i that is at least as high as that of every other object; formally, for all Z ∈ D \ {X},

P_Q(X, i) ≥ P_Q(Z, i).

In a U-kRanks query, an object can be assigned to multiple ranking positions, or it may not occur in the result at all.
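As a minimal sketch of Definition 11.4, the following assumes that the RPD has already been computed and is stored as a dictionary mapping each object to its list of rank probabilities. The values are those of Table 11.2; the name `u_k_ranks` and the dictionary layout are illustrative assumptions.

```python
# RPD: object -> [P(rank 1), P(rank 2), P(rank 3)], values from Table 11.2.
rpd = {
    "A": [0.8, 0.1, 0.1],
    "B": [0.0, 0.5, 0.5],
    "C": [0.2, 0.4, 0.4],
}


def u_k_ranks(rpd, k):
    """For each ranking position, report the object with the highest probability."""
    result = []
    for i in range(k):
        # Every object is a candidate at every position, so repeats are possible.
        best = max(rpd, key=lambda obj: rpd[obj][i])
        result.append((best, rpd[best][i]))
    return result


print(u_k_ranks(rpd, 3))  # [('A', 0.8), ('B', 0.5), ('B', 0.5)]
```

The output shows both effects mentioned above: B is reported twice, and C is never reported.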

Probabilistic Ranking Query Based on Maximum Confidence

A similar query definition reports the objects in such a way that the ith reported object has the highest confidence to be at the given ranking position i, but without multiple or empty assignments of objects.

Definition 11.5 (PRQ MC) A probabilistic ranking query based on maximum confidence (PRQ MC) incrementally retrieves for a ranking position i a result tuple of the form (X, P_Q(X, i)), where X ∈ D has not been reported at previous ranking iterations (i.e., at ranking positions j < i) and, for all Z ∈ D \ {X} that have not been reported at previous ranking iterations either,

P_Q(X, i) ≥ P_Q(Z, i).
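A minimal sketch of the PRQ MC semantics follows, again assuming a precomputed RPD stored as a dictionary with the values of Table 11.2. The name `prq_mc` and the data layout are illustrative assumptions, and ties are broken arbitrarily by iteration order.

```python
# RPD: object -> [P(rank 1), P(rank 2), P(rank 3)], values from Table 11.2.
rpd = {
    "A": [0.8, 0.1, 0.1],
    "B": [0.0, 0.5, 0.5],
    "C": [0.2, 0.4, 0.4],
}


def prq_mc(rpd, k):
    """Report, per position, the most probable object not reported before."""
    reported = set()
    result = []
    for i in range(min(k, len(rpd))):
        # Only objects that were not reported at positions j < i are candidates.
        candidates = [obj for obj in rpd if obj not in reported]
        best = max(candidates, key=lambda obj: rpd[obj][i])
        reported.add(best)
        result.append((best, rpd[best][i]))
    return result


print(prq_mc(rpd, 3))  # [('A', 0.8), ('B', 0.5), ('C', 0.4)]
```

Excluding already reported objects guarantees that each object occupies exactly one position.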

These two types of queries only consider the probability that an object is ranked at one particular ranking position. The confidences of an object at prior ranking positions are ignored if they are exceeded by those of another object. However, the confidences at prior ranking positions might also be relevant for the final ranking position of an object. This assumption is taken into account by the next query definition.

Probabilistic Ranking Query Based on Maximum Aggregated Confidence

The next query definition, PRQ MAC, takes aggregated confidence values of ranking positions into account. Contrary to the previous definition, this query assigns to each object X a unique ranking position i by aggregating the confidences of X over position i and all prior ranking positions j < i. Thus, this definition extends the semantics of PRQ MC by aggregation.

Definition 11.6 (PRQ MAC) A probabilistic ranking query based on maximum aggregated confidence (PRQ MAC) incrementally retrieves for a ranking position i a result tuple of the form $(X, \sum_{j \in \{1,\dots,i\}} P_Q(X, j))$, where X ∈ D has not been reported at previous ranking iterations (i.e., at ranking positions j < i) and, for all Z ∈ D \ {X} that have not been reported at previous ranking iterations either,

$$\sum_{j=1}^{i} P_Q(X, j) \geq \sum_{j=1}^{i} P_Q(Z, j).$$
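The aggregation in Definition 11.6 amounts to comparing prefix sums of the rank probabilities. The following minimal sketch assumes a precomputed RPD with the values of Table 11.2; the name `prq_mac`, the dictionary layout, and the rounding of reported values (used only to hide floating-point noise) are illustrative assumptions.

```python
from itertools import accumulate

# RPD: object -> [P(rank 1), P(rank 2), P(rank 3)], values from Table 11.2.
rpd = {
    "A": [0.8, 0.1, 0.1],
    "B": [0.0, 0.5, 0.5],
    "C": [0.2, 0.4, 0.4],
}


def prq_mac(rpd, k):
    """Report, per position i, the unreported object with the largest prefix sum."""
    # cumulative[obj][i] = sum of P_Q(obj, j) over positions 1..i+1.
    cumulative = {obj: list(accumulate(probs)) for obj, probs in rpd.items()}
    reported = set()
    result = []
    for i in range(min(k, len(rpd))):
        candidates = [obj for obj in rpd if obj not in reported]
        best = max(candidates, key=lambda obj: cumulative[obj][i])
        reported.add(best)
        result.append((best, round(cumulative[best][i], 10)))
    return result


print(prq_mac(rpd, 3))  # [('A', 0.8), ('C', 0.6), ('B', 1.0)]
```

Compared with PRQ MC, object C now takes position 2, because its probability of appearing at position 1 or 2 (0.6) exceeds that of B (0.5).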

The query types defined above specify the ranking position of each object X by comparing the ranking position confidence of X with the confidences of the other objects.

Probabilistic Ranking Query Based on Expected Matching

This query assigns to each object its expected ranking position without taking the confidences of the other objects into account.

Rank   A     B     C     U-kRanks   PRQ MC    PRQ MAC   EM
1      0.8   0.0   0.2   A (0.8)    A (0.8)   A (0.8)   A (1.3)
2      0.1   0.5   0.4   B (0.5)    B (0.5)   C (0.6)   C (2.2)
3      0.1   0.5   0.4   B (0.5)    C (0.4)   B (1.0)   B (2.5)

Table 11.2: Object-rank probabilities from Example 11.1.

Definition 11.7 (PRQ EM) A probabilistic ranking query based on expected matching (PRQ EM) globally retrieves a result tuple of the form (X, µ(X)) and assigns to the ranking position i the object X ∈ D which has the i-th highest expected rank, i.e., the i-th smallest value µ(X); formally

$$\mu(X) = \sum_{i=1}^{N} i \cdot P_Q(X, i).$$

In other words, the objects are reported in ascending order of their expected ranking position. This corresponds to the expected rank semantics [83].
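The expected rank can be computed directly from the RPD. The sketch below assumes a complete RPD (k = N) with the values of Table 11.2; the names `expected_rank` and `prq_em` and the rounding of reported values are illustrative assumptions.

```python
# RPD: object -> [P(rank 1), P(rank 2), P(rank 3)], values from Table 11.2.
rpd = {
    "A": [0.8, 0.1, 0.1],
    "B": [0.0, 0.5, 0.5],
    "C": [0.2, 0.4, 0.4],
}


def expected_rank(probs):
    """mu(X): sum of i * P_Q(X, i) over all ranking positions i (1-based)."""
    return sum(i * p for i, p in enumerate(probs, start=1))


def prq_em(rpd):
    """Report all objects in ascending order of their expected rank."""
    mu = {obj: round(expected_rank(probs), 10) for obj, probs in rpd.items()}
    return sorted(mu.items(), key=lambda item: item[1])


print(prq_em(rpd))  # [('A', 1.3), ('C', 2.2), ('B', 2.5)]
```

Unlike the incremental query types, this computation is global: the expected rank of each object depends only on its own distribution, and the final order results from a single sort.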

Discussion

As already stated, the suggested unambiguous probabilistic ranking output types have different semantics. The following example provides an overview of their advantages and drawbacks.

Example 11.1 Consider three uncertain objects A, B and C, for which an RPD according to Definition 11.3 has been computed. These ranking probabilities are listed in Table 11.2. Object A has a probability of 80% to appear on rank 1 and of 10% each to appear on ranks 2 and 3. Object B will never appear on rank 1, but appears with 50% each on ranks 2 and 3. The probabilities of object C are 20% for rank 1, 40% for rank 2 and 40% for rank 3. According to the definition of U-kRanks [192, 214], the object with the highest probability will appear on the corresponding ranking position, even if it has already been returned for a previous ranking position. Thus, this output assigns A to rank 1, B to rank 2 and again B to rank 3. The drawback here is that B appears twice, whereas C does not appear in the result at all. The PRQ MC semantics remedies this shortcoming, as no object can be reported for a ranking position if it has already been reported before. Thus, B is again reported for rank 2, and for rank 3 the only object left is C. This approach avoids multiple assignments of objects.

However, it does not consider the probability distribution of the other objects and the prior ranks. The PRQ MAC semantics provides a solution using the aggregated probabilities, also considering the values for the previous ranks. Then, A remains on rank 1 due to the highest probability. For rank 2, the aggregated probability of C is 0.2 + 0.4 = 0.6 and therefore higher than the aggregated probability of B, which is 0 + 0.5 = 0.5; hence, C is ranked second. A has already been assigned before and is not considered here. Finally, B is ranked third, as it is the only remaining object. The expected rank approach EM assigns the objects to the ranking positions w.r.t. an ascending order of their expected ranks; thus, A is ranked first with an
