Probabilistic Query Processing - Similarity processing in multi-observation data

10.3.1 Probabilistic Similarity Ranking

In the context of probabilistic ranking, significant work has been done in the field of probabilistic top-k retrieval, which yields unambiguous rankings from probabilistic data. A detailed summary of most of these approaches can be found in [114].

[55] applies the Gauss-tree [54] in order to incrementally retrieve those k objects which have a sufficiently high probability of being located inside a given query area.

Probabilistic top-k queries were first studied by Soliman et al. [192] on the x-relation model. The authors propose two different ways of ranking tuples: the uncertain top-k query (U-Topk) and the uncertain k-ranks query (U-kRanks). At the same time, Ré et al. proposed in [177] an efficient but approximate probabilistic ranking based on the concept of Monte Carlo simulation.

The approach proposed in [214] was the first efficient exact probabilistic ranking approach for the x-relation model. The results for U-Topk and U-kRanks are computed by means of a dynamic-programming technique, known as Poisson Binomial Recurrence (PBR) [147], combined with early stopping conditions for accessing the tuples. The work proposed in Chapters 11 and 12 uses this technique as a module for computing the object-rank probabilities which can, among others, be used to solve the U-kRanks problem efficiently.
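The core recurrence can be illustrated with a minimal Python sketch. It assumes fully independent tuples (no mutual-exclusion rules as in the x-relation model) and omits the early stopping conditions; the function names `poisson_binomial` and `rank_distribution` are hypothetical and not taken from [214] or [147].

```python
def poisson_binomial(probs):
    """Return dist, where dist[j] = P(exactly j of the independent events occur)."""
    dist = [1.0]  # before any event is considered, "0 occurrences" is certain
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)  # current event does not occur
            nxt[j + 1] += mass * p      # current event occurs
        dist = nxt
    return dist


def rank_distribution(higher_probs, p_exist):
    """Rank probabilities of one tuple (assumption: independent tuples).

    higher_probs: existence probabilities of all tuples with a higher score.
    p_exist:      existence probability of the tuple itself.
    The entry at index i - 1 is the probability of appearing at rank i.
    """
    counts = poisson_binomial(higher_probs)
    return [p_exist * c for c in counts]
```

Each tuple extends the distribution by one entry, so processing n tuples costs O(n^2) in this untruncated form; restricting the distribution to the first k entries is what makes top-k variants cheaper.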

In [155], the probabilistic ranked query for the context of distributions over spatial data is based upon the same definition as U-kRanks.

The Probabilistic Threshold Top-k (PT-k) query problem [108] aggregates the probabilities of an observation appearing on rank k or higher. Given a user-specified probability threshold p, PT-k returns all observations that have a probability of at least p of being on rank k or higher. In this definition, the number of results is not limited by k, but depends on the threshold parameter p. This approach also utilizes the PBR.
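The PT-k semantics can be sketched in Python as follows. This is only an illustration under the assumption of independent observations that are already sorted by descending score; the function name `pt_k` and the input layout are hypothetical and do not reproduce the algorithm of [108].

```python
def pt_k(observations, k, threshold):
    """Return (id, top-k probability) for all observations reaching the threshold.

    observations: list of (id, existence_prob), sorted by descending score.
    Assumption: observations exist independently of each other.
    """
    # prefix[j] = P(exactly j of the observations seen so far exist), for j < k
    prefix = [1.0] + [0.0] * (k - 1)
    result = []
    for oid, p in observations:
        # the observation exists and fewer than k better observations exist
        topk_prob = p * sum(prefix)
        if topk_prob >= threshold:
            result.append((oid, topk_prob))
        # fold the current observation into the truncated distribution;
        # iterating downwards keeps prefix[j - 1] at its old value
        for j in range(k - 1, 0, -1):
            prefix[j] = prefix[j] * (1.0 - p) + prefix[j - 1] * p
        prefix[0] *= 1.0 - p
    return result
```

For the input `[("a", 0.5), ("b", 0.5), ("c", 1.0)]` with k = 2 and a threshold of 0.5, all three observations qualify, which illustrates that the result size is governed by the threshold rather than by k.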

The Global top-k approach [219] is very similar to PT-k. It ranks the observations by their top-k probability and then takes the top-k of these. The advantage here is that, unlike in PT-k, the number of results is fixed, and there is no user-specified threshold parameter that has to be set.

Cormode et al. [83] reviewed alternative top-k ranking approaches for uncertain data, including U-Topk and U-kRanks, and argued for a more robust definition of ranking, namely the expected rank for each tuple (or x-tuple). The expected rank is defined as the weighted sum of the ranks of the tuple in all possible worlds, where each world in the sum is weighted by its probability. The k tuples with the lowest expected ranks are argued to be a more appropriate definition of a top-k query than previous approaches.
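The definition of the expected rank can be made concrete by a brute-force enumeration of all possible worlds. The following Python sketch is exponential in the number of tuples and serves as an illustration only; it assumes independent tuples, ranks starting at 0, and the convention that a tuple absent from a world receives the size of that world as its rank. The function name `expected_ranks` is hypothetical.

```python
from itertools import product


def expected_ranks(tuples):
    """tuples: list of (id, score, existence_prob); tuples are assumed independent."""
    exp = {tid: 0.0 for tid, _, _ in tuples}
    # every combination of present/absent flags is one possible world
    for present in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            world_prob *= p if here else 1.0 - p
        world = [t for t, here in zip(tuples, present) if here]
        for (tid, score, _), here in zip(tuples, present):
            if here:
                # number of existing tuples with a strictly higher score
                rank = sum(1 for _, s, _ in world if s > score)
            else:
                rank = len(world)  # assumed convention for absent tuples
            exp[tid] += world_prob * rank
    return exp
```

The k tuples with the smallest values in the returned dictionary would then form the answer of an expected-rank top-k query.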

Nevertheless, experiments have shown that such a definition may not be appropriate for ranking objects (i.e., x-tuples) whose observations have a large variance (i.e., they are scattered far from each other in space). Therefore, a follow-up work [114] computes the median rank and the quantile rank in order to obtain measures that are more robust against outliers and high variances. These approaches run in loglinear time and, thus, outperform exact approaches that do not use any estimation. The main drawback of these approaches is that, by using an aggregated estimator, information about the distribution of the objects is lost. This is the reason why Chapters 11 and 12 focus on the computation of the RPD, at the same time presenting a solution which also requires loglinear runtime complexity.

The goal of [191] is to rank uncertain objects (i.e., x-tuples) where the scores are uncertain and can be described by a range of values. Based on these ranges, the authors define a graph that captures the partial orders among objects. This graph is then processed to compute U-kRanks and other queries. Although [191] has similar objectives to the approaches of Chapters 11 and 12, it operates on a different input, where the distribution of uncertain scores is already known, as opposed to the ranking approaches of this work, which dynamically compute this distribution by performing a linear scan over the ordered observations.

The work of [215] studies probabilistic ranking of objects according to their distance to a query point. However, the solutions are limited to existentially uncertain data with a single observation.

Related to probabilistic top-k queries, [180] introduced queries on uncertain data with aggregations, such as probabilistic count and probabilistic sum queries, where the number of tuples is determined that have a higher (uncertain) score than the current tuple. The consideration of all possible worlds is, however, again very inefficient. The authors of [96] apply the continuous probabilistic count query on wireless sensor network environments, which, in this context, reports the number of sensor nodes whose measured values satisfy a given query predicate. An efficient result computation is achieved by applying the PBR.

The work [109] proposes the continuous probabilistic sum query in wireless sensor networks by applying Generating Functions, which were introduced by [154] for solving a wide class of probabilistic top-k queries in the same time complexity. This concept has similar objectives to the PBR w.r.t. dynamic-programming techniques and incremental processing and will also be used in this work, namely in the context of probabilistic frequent itemset mining in Chapter 15.
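The idea behind generating functions can be illustrated for a probabilistic sum. Under the assumption of independent items with non-negative integer values, the distribution of the sum is given by the coefficients of the product of the polynomials (1 - p_i) + p_i * x^(v_i). The Python sketch below expands this product incrementally; it is a generic illustration with the hypothetical function name `sum_distribution`, not the algorithm of [109] or [154].

```python
def sum_distribution(items):
    """Return coeffs, where coeffs[s] = P(sum of the values of existing items == s).

    items: list of (value, prob) with non-negative integer values.
    Assumption: items exist independently of each other.
    """
    coeffs = [1.0]  # the empty product is the constant polynomial 1
    for value, p in items:
        # multiply the current polynomial by (1 - p) + p * x**value
        nxt = [0.0] * (len(coeffs) + value)
        for s, mass in enumerate(coeffs):
            nxt[s] += mass * (1.0 - p)   # item absent: sum unchanged
            nxt[s + value] += mass * p   # item present: sum shifted by its value
        coeffs = nxt
    return coeffs
```

With all values set to 1, the coefficients coincide with the counting distribution obtained from the PBR, which reflects the close relationship between the two techniques.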

10.3.2 Probabilistic Inverse Ranking

In contrast to ranking in uncertain data, where there exists abundant work, there is limited research on the inverse variant of ranking uncertain data. The inverse ranking query on certain data was first introduced by Li [152]. Chen et al. [158] apply inverse ranking to probabilistic databases by introducing the probabilistic inverse ranking (PIR) query. According to [158], the output of a PIR query consists of all possible ranks for a (certain) query object q, for which q has a probability higher than a given threshold. Another approach for answering PIR queries has been proposed by [149], which computes the expected inverse rank of an object. The expected inverse rank can be computed very efficiently; however, it is lacking from a semantic point of view. In particular, an object that has a very high chance of being on rank 1 may indeed have an expected rank far from rank 1, and may not be in the result using expected ranks. Thus, no conclusion can be made about the actual rank probabilities if the expected rank is used, since the expected rank is an aggregation that drops important information. The first exact PIR approach for continuously changing data in the context of observation streams will be presented in Chapter 13.
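The semantic difference between the two PIR variants can be seen in a small numerical example. The rank distribution below is made up for illustration, and the helper names `pir_threshold` and `expected_inverse_rank` are hypothetical.

```python
def pir_threshold(rank_probs, tau):
    """Threshold semantics: all ranks whose probability is higher than tau."""
    return [rank for rank, p in sorted(rank_probs.items()) if p > tau]


def expected_inverse_rank(rank_probs):
    """Aggregated semantics: probability-weighted mean of the ranks."""
    return sum(rank * p for rank, p in rank_probs.items())


# hypothetical rank distribution of a query object q:
# rank 1 with probability 0.8, rank 100 with probability 0.2
rank_probs = {1: 0.8, 100: 0.2}
likely_ranks = pir_threshold(rank_probs, 0.5)    # only rank 1 qualifies
mean_rank = expected_inverse_rank(rank_probs)    # about 20.8
```

Although q is on rank 1 in four out of five cases, its expected rank is roughly 20.8, a rank that q never actually occupies in this example.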

In order to deal with massive datasets that arrive online and have to be monitored, managed and mined in real time, the data stream model has become popular. Surveys of systems and algorithms for data stream management are given in [18, 168]. A generalized stream model, the probabilistic stream model, was introduced in [113]. In this model, each item of a stream represents a discrete probability distribution together with a probability that the element is actually present in the stream. There has been work of interest on clustering uncertain streams [7], as well as on processing more complex event queries over streams of uncertain data [178]. [82] presents algorithms that capture essential features of the stream, such as quantiles, heavy hitters, and frequency moments. In [115], the authors propose a framework for processing continuous top-k queries on uncertain streams.

10.3.3 Further Probabilistic Query Types

Beyond probabilistic ranking, there is a variety of work tackling other query types in uncertain data, including probabilistic range queries, probabilistic nearest neighbor (PNN) queries and some variants, and probabilistic reverse nearest neighbor (PRNN) queries.

Probabilistic range queries have been addressed in [76, 78, 112, 138, 195].

There exist approaches for PNN queries based on certain query objects [77] and for uncertain queries [110, 139]. The authors of [74] add threshold constraints and propose the constrained PNN query for certain query points in order to retrieve only objects whose probability of being the nearest neighbor exceeds a user-specified threshold. A combination of the concepts of PNN queries and top-k retrieval in probabilistic databases is provided by top-k PNN queries [50]. Here, the idea is to return the k objects that are most likely to be the nearest neighbor of a single-observation (certain) query point.

[162] proposed a solution for probabilistic k-nearest neighbor (Pk-NN) queries based on expected distances. [75] introduced the probabilistic threshold k-NN (PTk-NN) query, which requires an uncertain object to exceed a probability threshold of being part of the

