
12.4.3 Global Top-k

Global top-k [219] is very similar to PT-k. It ranks the observations by their top-k probability and then takes the top k of these. This approach has a runtime of O(k·N²). The advantage here is that, unlike in PT-k, the number of results is fixed and there is no user-specified threshold parameter. Here, the ranking order information that has been acquired for PT-k using the proposed framework can be exploited to solve Global top-k in O(N·log(N) + k·N) time.

The framework is used to create the RPD in O(N·log(N) + k·N) as explained in the previous section. For each observation x, the probability that x appears at position k or higher is computed (in O(k·N)) as in PT-k. Then, the k observations with the highest probability are returned in O(k·log(k)).
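For illustration, the following minimal sketch shows how Global top-k can be answered once the rank probability distribution is available; the dictionary layout of rank_probs and the function name global_top_k are assumptions made for this example, not part of the proposed framework.

    import heapq

    def global_top_k(rank_probs, k):
        """Global top-k from a precomputed rank probability distribution (RPD).

        rank_probs: dict mapping each observation id to a list p in which p[i]
        is the probability that the observation appears at ranking position i
        (0-based, i < k), as produced by the framework in O(N log N + k*N).
        Returns the k observations with the highest top-k probability.
        """
        # Top-k probability of x = probability that x appears at position k or higher.
        topk_prob = {x: sum(p[:k]) for x, p in rank_probs.items()}   # O(k*N)
        # Select the k observations with the largest top-k probability; a heap
        # keeps this selection step in O(N*log(k)).
        return heapq.nlargest(k, topk_prob, key=topk_prob.get)

Note that in this sketch the final selection costs O(N·log(k)) rather than O(k·log(k)); the summation step with O(k·N) dominates either way.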

12.5 Experimental Evaluation

12.5.1 Datasets and Experimental Setup

Extensive experiments were performed to evaluate the performance of the probabilistic ranking approach proposed in this chapter. The parameters that were evaluated are the database size N (the number of uncertain objects), the ranking depth k and the degree of uncertainty (UD) as defined below. In the following, the ranking framework proposed in this chapter is briefly denoted by PSR.

The probabilistic ranking was applied to a scientific semi-real-world dataset SCI and several artificial datasets ART X of varying size and degree of uncertainty. All datasets are based on the discrete uncertainty model according to Definition 9.2 in Chapter 9.

The SCI dataset is a set of 1,600 objects, which was synthetically created based on a dataset comprising 1,600 environmental time series². In the original time series dataset, each object consists of 48 ten-dimensional environmental sensor measurements taken on one single day, one per 30 minutes. The ten measured attributes were temperature, humidity, speed and direction of wind w.r.t. degree and sector, as well as concentrations of CO, SO2, NO, NO2 and O3. An uncertain object X was then created based on one single time series as follows, incorporating a real as well as a synthetic component. Addressing the real component, each sensor measurement can be considered as an observation in the feature space which is spanned by the ten dimensions enumerated above. The dimensions were normalized within the interval [0,1] to give each attribute the same weight. Thus, a time series is translated to a ten-dimensional spatial object with 48 alternative observations xi, i ∈ {1, . . . , 48}. Finally, addressing the synthetic component, each xi ∈ X has to possess a likelihood to represent X: P(X = xi). Here, the probability for each xi was set to 1/48, summing up to an overall probability of 1; thus, the dataset complies with the uncertain data model of Definition 9.2 of Chapter 9.

2The environmental time series have been provided by the Bavarian State Office for Environmental Protection, Augsburg, Germany (http://www.lfu.bayern.de/).

It is important to note that the method of creating the dataset does not imply a specific degree of uncertainty, since the attribute values are given by the original measurements. The SCI dataset was used to evaluate the scalability and the ranking depth.
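To make the construction concrete, the following sketch assembles the uncertain objects from the raw time series as described above; the function name and the choice to normalize each attribute over the whole dataset (rather than per object) are assumptions made for this example.

    import numpy as np

    def build_sci_objects(all_series):
        """all_series: array of shape (1600, 48, 10), one time series per object."""
        data = np.asarray(all_series, dtype=float)
        # Normalize each of the ten attributes to [0, 1] over the whole dataset,
        # so that every attribute carries the same weight.
        lo = data.min(axis=(0, 1))
        hi = data.max(axis=(0, 1))
        span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
        data = (data - lo) / span
        # Each of the 48 measurements becomes an alternative observation with
        # likelihood 1/48, so the probabilities of one object sum to 1.
        probabilities = np.full(data.shape[1], 1.0 / data.shape[1])
        return [(observations, probabilities) for observations in data]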

The ART 1 dataset was used for the scalability experiments and consists of 1,000,000 objects. Here, each uncertain object is represented by a set of 20 three-dimensional observations that are uniformly distributed within a three-dimensional hyperrectangle. The degree of uncertainty of this object then corresponds to the size (i.e., the side length) of this hyperrectangle. All rectangles are uniformly distributed within a 10×10×10 feature space. For the evaluation of the performance w.r.t. the ranking depth and the degree of uncertainty, two collections of datasets, ART 2 and ART 3, were applied. Each dataset of the collections is composed of 10,000 objects with 20 observations each and differs in the degree of uncertainty of the corresponding objects. In ART 2, the observations of an object are also uniformly distributed within a three-dimensional hyperrectangle. In ART 3, the observations of an object follow a three-dimensional Gaussian distribution. The datasets of ART 3 vary in the degree of uncertainty as well. For this dataset, the degree of uncertainty simply denotes the standard deviation of the Gaussian distribution of the objects.
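The following sketch indicates how such artificial objects could be generated; the function name, the parameter names and the interpretation of the hyperrectangle position as its center are assumptions made for this example, and the equal observation probabilities of 1/20 per object are left implicit.

    import numpy as np

    def generate_art_objects(n_objects, n_obs=20, extent=10.0, degree=1.0,
                             distribution="uniform", seed=None):
        """Generate uncertain objects in the style of ART 1 / ART 2 / ART 3.

        Each object receives n_obs three-dimensional observations.  For the
        uniform variant they are drawn inside a hyperrectangle of side length
        `degree` placed uniformly in the [0, extent]^3 feature space; for the
        Gaussian variant `degree` is used as the standard deviation.
        """
        rng = np.random.default_rng(seed)
        objects = []
        for _ in range(n_objects):
            center = rng.uniform(0.0, extent, size=3)
            if distribution == "uniform":
                obs = center + rng.uniform(-degree / 2, degree / 2, size=(n_obs, 3))
            else:  # "gaussian"
                obs = center + rng.normal(0.0, degree, size=(n_obs, 3))
            objects.append(obs)
        return objects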

The degree of uncertainty is interesting in the performance evaluation, since it is expected to have a significant influence on the runtime. The reason is that a higher degree of uncertainty obviously leads to a higher overlap between the objects, which influences the size of the active object list AOL (cf. Section 12.3) during the distance browsing. The higher the object overlap, the more objects are expected to be in the AOL at a time. Since the size of the AOL influences the runtime of the rank probability computation, a higher degree of uncertainty is expected to lead to a higher runtime. This characteristic will be experimentally evaluated in Subsection 12.5.3.
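To make this effect tangible, the following back-of-the-envelope sketch estimates the average AOL size from the observation distances, assuming, in line with the distance browsing of Section 12.3, that an object enters the AOL when its closest observation is fetched and leaves it once its farthest observation has been processed; the function name and input layout are hypothetical.

    def average_aol_size(observation_distances):
        """Estimate the average size of the active object list (AOL).

        observation_distances: dict mapping each object id to the list of
        distances of its observations to the query point.
        """
        # Browse all observations in ascending order of distance to the query.
        events = sorted((d, obj) for obj, dists in observation_distances.items()
                        for d in dists)
        # Remember for every object when its farthest observation is processed.
        last_index = {obj: idx for idx, (_, obj) in enumerate(events)}

        active, total = set(), 0
        for idx, (_, obj) in enumerate(events):
            active.add(obj)              # object enters the AOL with its first observation
            total += len(active)
            if last_index[obj] == idx:   # all observations processed: object leaves the AOL
                active.discard(obj)
        return total / len(events) if events else 0.0

The larger the overlap of the objects' observations, the later each object reaches its last observation in the sorted order, and the larger this average becomes.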

12.5.2 Scalability

This section gives an overview of the experiments regarding the scalability of PSR. The obtained results are compared to the rank probability computation based on dynamic programming as proposed by Yi et al. in [214]. This method, in the following denoted by YLKS, has been the best prior approach for solving the U-kRanks (cf. Table 12.1). For a fair comparison, the PSR framework was used to compute the same (observation-based) rank probability problem as described in Section 12.2. As mentioned in Subsection 12.2.3, the cost required to solve the object-based rank probability problem is similar to that required to solve the observation-based rank probability problem. Furthermore, the cost required to build a final unambiguous ranking (e.g., the rankings proposed in Section 12.4) from the rank probabilities can be neglected, because this ranking can also be computed on-the-fly by simple aggregations of the corresponding (observation-based) rank probabilities.

For the sorting of the distances of the observations to the query point, a tuned quicksort adapted from [26] was used. This algorithm offers O(N · log(N)) performance on many datasets that cause other quicksort algorithms to degrade to quadratic runtime.
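For illustration, a minimal version of this presorting step is sketched below; it uses Python's built-in O(N·log(N)) sort in place of the tuned quicksort of [26], and the function name and input layout are assumptions made for this example.

    import numpy as np

    def presort_observations(objects, query):
        """Sort all observations of all objects by their distance to the query point.

        objects: list of (object_id, observations) pairs, where observations is
        an array of shape (m, d).
        Returns a list of (distance, object_id, observation_index) triples in
        ascending distance order, i.e. the input order of the distance browsing.
        """
        entries = []
        query = np.asarray(query, dtype=float)
        for obj_id, obs in objects:
            dists = np.linalg.norm(np.asarray(obs, dtype=float) - query, axis=1)
            entries.extend((d, obj_id, i) for i, d in enumerate(dists))
        entries.sort(key=lambda e: e[0])   # O(N*log(N)) comparison sort
        return entries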

The results of the first scalability tests on the real-world dataset SCI are depicted in Figures 12.3(a) and 12.3(b).


Figure 12.3: Scalability evaluated on SCI for different values of k. (a) Runtime [s] of PSR w.r.t. the database size for k = 1, 10, 50, 100; (b) runtime [s] of YLKS w.r.t. the database size; (c) speed-up factor runtime(YLKS)/runtime(PSR) w.r.t. k.

It can be observed that the required runtime for computing the probabilistic ranking using the PSR framework increases linearly in the database size, whereas YLKS has a runtime that is quadratic in the database size with the same parameter settings. It can also be observed that this effect persists for different settings of k. The effect of the O(N·log(N)) sorting of the distances of the observations is insignificant on this relatively small dataset. The direct speed-up of the rank probability computation using PSR in comparison to YLKS is depicted in Figure 12.3(c). It shows, for different values of k, the speed-up factor, which is defined as the ratio runtime(YLKS)/runtime(PSR) and describes the performance gain of PSR w.r.t. YLKS. It can be observed that, for a constant number of objects in the database (N = 1,600), the ranking depth k has no impact on the speed-up factor. This can be explained by the observation that both approaches scale linearly in k.

The next experiment evaluates the scalability w.r.t. the database size based on the ART 1 dataset. The results of this experiment are depicted in Figures 12.4(a) and 12.4(b).

Figure 12.4: Scalability evaluated on ART 1 for different values of k. (a) Runtime [s] of PSR w.r.t. the database size (up to 1,000,000 objects) for k = 1, 10, 50, 100; (b) runtime [s] of YLKS w.r.t. the database size (up to 10,000 objects); (c) speed-up factor YLKS/PSR w.r.t. the database size N (k = 100).

The former shows that the approach proposed in this chapter performs ranking queries in a reasonable time of less than 120 seconds, even for very large databases containing 1,000,000 and more objects, each having 20 observations (thus a total of 20,000,000 observations). An almost perfect linear scale-up can be seen despite the O(N·log(N)) cost for sorting the database. This is due to the very efficient quicksort implementation of [26], which the experiments have shown to require only slightly worse than linear time. Furthermore, it can be observed that, due to its quadratic scaling, the YLKS algorithm is already inapplicable for relatively small databases of size 5,000 or more. The direct speed-up of the rank probability computation using PSR in comparison to YLKS for a varying database size is depicted in Figure 12.4(c). Here, it can be observed that the speed-up of PSR in comparison to YLKS increases linearly with the size of the database, which is consistent with the runtime analysis in Subsection 12.2.3.


Figure 12.5: Runtime w.r.t. the degree of uncertainty on ART 2 (uniform) and ART 3 (Gaussian). (a) Runtime [s] of PSR for an increasing degree of uncertainty; (b) YLKS vs. PSR on a logarithmic scale w.r.t. different average |AOL| values.

12.5.3 Influence of the Degree of Uncertainty

The next experiment varies the degree of uncertainty (cf. Subsection 12.5.1) on the datasets ART 2 and ART 3. In the following experiments, the ranking depth is set to a fixed value of k = 100. As previously discussed, an increasing UD leads to an increase of the overlap between the observations of the objects and thus, objects will remain in the AOL for a longer time.

The influence of the UD depends on the probabilistic ranking algorithm. This statement is underlined by the experiments shown in Figure 12.5. It can be seen in Figure 12.5(a) that PSR scales superlinearly in the UD at first, until a maximum value is reached. This maximum value is reached when the UD becomes so large that the observations of an object cover the whole vector space. In this case, objects remain in the AOL in most cases until almost the whole database is processed, due to the increased overlap of observations.

In this case of extremely high uncertainty, almost no objects can be pruned for a ranking position, thus slowing down the algorithm by several orders of magnitude. It is also worth noting that, in the used setting, the algorithm performs worse on Gaussian distributed data than on uniformly distributed data. This is explained by the fact that the space covered by a Gaussian distribution with standard deviation σ in each dimension is generally larger than a hyperrectangle with a side length of σ in each dimension. A runtime comparison of YLKS and PSR w.r.t. the average AOL size is depicted in Figure 12.5(b) for both the uniformly and the Gaussian distributed datasets. The UD has a similar influence on both YLKS and PSR.

12.5.4 Influence of the Ranking Depth

The influence of the ranking depth k on the runtime performance of the probabilistic ranking method PSR is studied in the next experiment. The experiments were performed using both the SCI and the ART 2 dataset; the results are illustrated in Figure 12.6.

Figure 12.6: Runtime [s] of PSR w.r.t. the ranking depth k on SCI (|DB| = 1,600) and ART 2 (|DB| = 10,000).

As illustrated in the figure, an increasing k has a linear effect on the runtime of PSR that does not depend on the type of the dataset. This effect can be explained by the fact that each iteration of Case 2 or Case 3 of the incremental probability computation (cf. Subsection 12.2.2) requires a probability computation for each ranking position i ∈ {0, . . . , k}. The overall runtime requirements on ART 2 are higher than on SCI due to the different database sizes, which could already be observed in Subsection 12.5.2.

12.5.5 Conclusions

The experiments presented in this section show that the theoretical analysis of the proposed approach, which was given in Subsection 12.2.3, can be confirmed empirically on both artificial and real-world data. The performance studies showed that the proposed framework for computing the rank probabilities indeed reduces the quadratic runtime complexity of state-of-the-art approaches to linear complexity. The cost required to presort the observations is negligible in the used settings due to the tuned quicksort. It could be shown that the proposed approach scales very well even for large databases. The speed-up gain of the proposed approach w.r.t. the ranking depth k has been shown to be constant, which proves that both approaches scale linearly in k. Furthermore, it could be observed that the proposed approach is applicable for databases with a high degree of uncertainty (i.e., a high variance of the observation distribution).