
Experimental Evaluation

Dataset    PRQ MC   PRQ MAC   PRQ EM   MP

O3         0.51     0.65      0.53     0.63
NSP_h      0.36     0.43      0.29     0.35
NSP_frq    0.62     0.70      0.41     0.60

Table 11.3: Avg. precision for probabilistic ranking queries on different real-world datasets.

uncertain objects created from time series, each comprising a set of measurements of the ozone concentration in the air collected within one month¹. Each observation corresponds to a daily ozone concentration curve. The dataset covers the years 2000 to 2004 and is labeled according to the months of a year. NSP is a chronobiologic dataset describing the cell activity of Neurospora² within sequences of day cycles. This dataset is used to investigate endogenous rhythms. It can be classified w.r.t. two parameters, among others: day cycle and fungal type. For the experiments, two subsets of the NSP dataset were used: NSP_h and NSP_frq. NSP_h is labeled according to the day cycle length.

It consists of 36 objects forming three classes of day cycle length (16, 18, and 20 hours). The NSP_frq dataset consists of 48 objects and is labeled w.r.t. the fungal type (frq1, frq7, and frq+).

11.5.2 Effectiveness Experiments

The first experiments evaluate the quality of the different probabilistic ranking query semantics (PRQ MC, PRQ MAC, PRQ EM) proposed in Subsection 11.2.2. To make the evaluation fair, the results obtained with these approaches were compared with the results of a non-probabilistic ranking (MP), which ranks the objects by the distance between their mean positions. For these experiments, the three real-world datasets O3, NSP_h, and NSP_frq were used, each consisting of uncertain objects labeled as described above.

In order to evaluate the quality of the semantics, a k-nearest neighbor (k-NN) classification was performed. According to the semantics of a classification [103], objects are divided into positive (P) and negative (N) objects, which denote the number of objects that are returned by a classifier w.r.t. a label and the number of objects that have been discarded, respectively. In the context of document retrieval, a popular measure that rates the overall significance of query results when, for example, retrieving the k most similar documents is the precision [103], which denotes the percentage of relevant objects that have been retrieved and thus also serves as a measure of the quality of the similarity ranking schemes. Formally, the precision is defined as TP/P, which yields values between 0 and 1. Here, TP denotes the number of retrieved relevant objects.
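As a minimal illustration of this measure (the function and label names below are hypothetical, not taken from the evaluated implementation), the precision of a retrieved k-NN result list can be computed as follows:

```python
def precision(retrieved_labels, query_label):
    """Precision = TP / P for a retrieved result list.

    TP: number of retrieved objects carrying the query label.
    P:  total number of retrieved (positive) objects.
    """
    if not retrieved_labels:
        return 0.0
    tp = sum(1 for label in retrieved_labels if label == query_label)
    return tp / len(retrieved_labels)

# A 5-NN result for a query of class "16h": 4 of 5 neighbors match.
print(precision(["16h", "16h", "18h", "16h", "16h"], "16h"))  # 0.8
```

Averaging this value over all class labels of a dataset yields the figures reported in Table 11.3.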

The average precision over all class labels (cf. dataset description in Subsection 11.5.1)

¹ The O3 dataset has been provided by the Bavarian State Office for Environmental Protection, Augsburg, Germany (http://www.lfu.bayern.de/).

² Neurospora is the name of a fungal genus containing several distinct species. For further information see The Neurospora Home Page: http://www.fgsc.net/Neurospora/neurospora.html


Figure 11.5: Query processing cost w.r.t. UD. Both panels plot the query time [ms] (log scale) against the variance for the strategies IT, TP, BS, TP+BS, and DP: (a) number of observations m = 10; (b) number of observations m = 30.

can be observed from Table 11.3. In conclusion, PRQ MAC provides result quality superior to the other approaches, including the non-probabilistic ranking approach MP. Interestingly, the approach PRQ MC, which has a definition quite similar to that of the U-kRanks query proposed in [192, 214], does not work very well and shows quality similar to MP.

The approach PRQ EM clearly loses and is even significantly below the non-probabilistic ranking approach MP. This observation points out that the postprocessing step, i.e., the way in which the results of the RPD are combined into a definite result, indeed affects the quality of the result.

11.5.3 Efficiency Experiments

The next experiment evaluates the performance of the probabilistic ranking acceleration strategies proposed in Section 11.4 w.r.t. the query processing time. The proposed strategies were compared with the straightforward solution without any additional strategy. The competing methods are the following:

• IT: Iterative fetching of the observations from the distance browsing B and computation of the probability table PT entries without any acceleration strategy.

• TP: Table pruning strategy using the reduced table space.

• BS: Bisection-based computation of the probability permutations.

• TP+BS: Combination of TP and BS.

• DP: Dynamic-programming-based computation of the probability permutations.
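To make the DP strategy concrete, the following sketch shows one standard dynamic programming scheme for rank distributions; it is an illustration under the simplifying assumption that, for a fixed observation, each competing object is closer to the query independently with some probability (the function and variable names are hypothetical, not taken from the thesis implementation):

```python
def rank_probabilities(closer_probs):
    """Distribution over the number of competing objects ranked
    before a given observation.

    closer_probs[j] is the (assumed independent) probability that
    object j is closer to the query than the observation at hand.
    Returns p with p[k] = P(exactly k objects are closer), i.e. the
    observation obtains ranking position k+1. The nested loops give
    the quadratic runtime per observation discussed below.
    """
    p = [1.0]  # before processing any object: 0 closer objects, certainly
    for q in closer_probs:
        nxt = [0.0] * (len(p) + 1)
        for k, pk in enumerate(p):
            nxt[k] += pk * (1.0 - q)   # object not closer: count unchanged
            nxt[k + 1] += pk * q       # object closer: count shifted by one
        p = nxt
    return p

# Two competitors, each closer with probability 0.5:
# ranks 1, 2, 3 obtained with probabilities 0.25, 0.5, 0.25.
print(rank_probabilities([0.5, 0.5]))
```

Note that the cost of this iteration depends only on the number of competing objects, not on how strongly their distance ranges overlap, which explains the UD-independent behavior of DP observed below.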

Influence of the Degree of Uncertainty

The first experiment compares all strategies (including the straightforward solution) for the computation of the RPD on the artificial datasets with different values of UD. The evaluation of the query processing time of the proposed approaches is illustrated in Figure 11.5.

In particular, the differences between the computation strategies are depicted for two different numbers of observations per object (m = 10 and m = 30). Here, a database of 20 uncertain objects in a ten-dimensional vector space was used.

The plain iterative fetching of observations (IT) is hardly affected by an increasing UD value, as it has to consider all possible worlds for the computation of the probabilistic rank distribution anyway. The table pruning strategy TP significantly decreases the required computation time. For a low UD, many objects cover only a small range of ranking positions and can thus be neglected; an increasing UD leads to a higher overlap of the objects and requires more computational effort. For the divide-and-conquer-based computation of BS, the query time increases only slightly with increasing UD; however, the required runtime is quite high even for a low UD value. The runtime of TP is much lower than that of BS for low degrees of uncertainty, since TP is then likely to prune a high number of objects that are either completely processed or not yet seen at all. Combining the benefits of the TP and BS strategies results in quite good performance, but it is outperformed by the DP approach. This is because the dynamic programming iterations are independent of the degree of uncertainty: the iterations require quadratic runtime in any case.
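The divide-and-conquer idea behind such a bisection scheme can be sketched as follows; this is a hypothetical illustration under the same independence assumption as above, not necessarily the exact BS procedure of Section 11.4, which operates on r-sets:

```python
def closer_count_distribution(closer_probs):
    """Divide-and-conquer sketch: split the competitors into two
    halves, recursively compute the distribution of the number of
    closer objects in each half, and merge the halves by convolving
    the two distributions."""
    if len(closer_probs) == 1:
        q = closer_probs[0]
        return [1.0 - q, q]       # base case: one competitor
    mid = len(closer_probs) // 2
    left = closer_count_distribution(closer_probs[:mid])
    right = closer_count_distribution(closer_probs[mid:])
    merged = [0.0] * (len(left) + len(right) - 1)
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            merged[i + j] += a * b  # counts add across the two halves
    return merged
```

With the naive convolution shown here the total work is still quadratic, and the recursive splitting and merging incurs a fixed overhead regardless of UD, which is consistent with the observation that the BS runtime is comparatively high even for low degrees of uncertainty.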

Finally, it can be observed that the behavior of each approach with increasing UD remains stable for different values of m. However, a higher number of observations per object increases the computational cost of each approach by about an order of magnitude. Thus, these experiments confirm that the runtime required to compute the RPD depends heavily on m, so the need for efficient solutions is obvious.

Scalability

The next experiment evaluates the scalability based on the ART datasets of different sizes.

The BS approach is omitted in the following, as the combination TP+BS proved to be more effective. Here again, different combinations of strategies were considered. The results are depicted in Figure 11.6 for two different values of UD.

Figure 11.6(a) illustrates the results for a low UD value. Since it considers all possible worlds, the simple approach IT incurs exponential cost, such that experiments for a database size above 30 objects are not feasible. The application of TP yields a significant performance gain. Assuming a low UD value, the ranges of possible ranking positions of the objects hardly overlap. Furthermore, there are objects that do not have to be considered for all ranking positions, since the minimum and maximum ranking positions of all objects are known (cf. Subsection 11.4.1). It can clearly be observed that the combination TP+BS significantly outperforms the case where only TP is applied, as the split of the r-sets reduces the number of combinations of higher ranked objects that have to be considered when computing a rank probability for an observation. For small databases where N < 100, there is a splitting and merging overhead of the BS optimization, which, however, pays off with increasing database size. For N < 700, TP+BS even beats the DP approach, which is due to the fact that TP+BS takes advantage from the