Summary - Similarity processing in multi-observation data


Chapter 14 Hot Item Detection in Uncertain Data

14.1 Introduction

Beyond similarity ranking in probabilistic databases, for which efficient solutions were given in Chapters 11 and 12, data mining tasks also have to cope with uncertainty. An important task is to rate the significance of uncertain objects. Chapter 13 tackled the problem of probabilistic inverse ranking, where the ranking position w.r.t. a given score function indicated the significance of a particular (uncertain) object among its peers. This chapter focuses on a different semantics for rating the significance of objects. According to this semantics, an object is considered important if it shows characteristics that are similar to those of a sufficiently large population of other objects in the database.

The detection of objects that form dense regions with other objects in a vector space is a foundation of several density-based data mining techniques, in particular density-based clustering [90, 184], outlier detection and other density-based mining applications [61, 143, 194]. A (certain) object x for which there exists a sufficiently large population of other objects in a database D that are similar to x is called a hot item. Intuitively, an item that shares its attributes with many other items is potentially of interest, as it represents a typical occurrence of items in the database.

Application areas in which the detection of hot items is potentially important include scientific applications, e.g., astrophysics (cf. Figure 14.1(a)), as well as biomedical, sociological and economic applications. In particular, the following applications motivate the efficient detection of hot items:

• Detection of “hot” research topics: Given a large database of current research papers and articles, the task is to identify those articles addressing problems that might be relevant for a research community. A paper might be relevant if there exist enough other papers which address a similar problem.

• Online shopping advertising: Online shopping advertising often profits from software tools that extract items with a high number of bids from online auction and shopping websites, e.g., the Hot Item Finder¹ for eBay². One can imagine that a product which is quite similar to many other products that already have a high number of bids is a potential candidate for becoming a well-selling product as well. The detection of such products could be very valuable for online shopping advertising.

• Pre-detection of criminal activities: After a soccer game, one might be interested in detecting larger groups of hooligans that should be accompanied by guards in order to prevent criminal incidents. If we assume that the locations of all hooligans are monitored, then it is of interest which of these individuals have many other hooligans in their immediate vicinity. Another example is the detection of conspicuous crime, e.g., cases of drug abuse in areas with a high density of drug offenses, as depicted in Figure 14.1(b)³.

Figure 14.1: Applications for hot item detection. (a) Astrophysical hot items in terms of interesting constellations. (b) Hot item detection for crime defense applications (hot spots in terms of drug offenses).

The applications mentioned above require special methods supporting the efficient search in modern databases that have to cope with uncertain or imprecise data. This chapter will propose the first approach addressing the retrieval of hot items in uncertain domains.

A hot item x has the property that the number of other items (objects) which are in the proximity of x, i.e., which are similar to x, is at least a given minimum population value. This chapter will give a general definition of hot items by relaxing the similarity predicate between the objects.

Definition 14.1 (Hot Item) Let D be a database of objects and let minItems be a minimum population threshold. Furthermore, let f_score : D × D → ℝ₀⁺ be a score function defined on pairs of objects in D, and let φ_ε : ℝ₀⁺ → {true, false} be a similarity predicate, where φ_ε ∈ {< ε, ≤ ε, = ε, ≥ ε, > ε} and ε ∈ ℝ₀⁺ is a given scalar. An object x ∈ D is called a hot item iff there exist at least minItems objects y ∈ D \ {x} which satisfy the predicate φ_ε, formally

|{y ∈ D \ {x} : φ_ε(f_score(x, y)) = true}| ≥ minItems ⇔ x is a hot item.

¹ http://www.hotitemfinder.com

² http://www.ebay.com

³ Source: https://www.amethyst.gov.uk/crime_map/crimedrugs.htm

Figure 14.2: Examples of hot items. (a) Hot items in certain data (one object labeled as a hot item, one as not a hot item). (b) Hot items in uncertain data (one object labeled as a possible hot item).

The value of the score function f_score reflects the degree of similarity of two objects w.r.t. the predicate φ_ε, where a small value indicates high similarity and a large value indicates high dissimilarity. In particular, two objects x and y are considered equal if f_score(x, y) = 0. For spatial data, this corresponds to the semantics of the distance between x and y.
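For certain (non-uncertain) data, Definition 14.1 can be illustrated by a brute-force count. The following is only a minimal sketch: the function names, the use of the Euclidean distance as f_score and the string encoding of the predicate are assumptions made for illustration and are not taken from this chapter.

```python
import math
import operator

# Comparison operators standing in for the similarity predicate phi_eps.
PREDICATES = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def euclidean(x, y):
    """Assumed score function f_score: Euclidean distance between two points."""
    return math.dist(x, y)


def hot_items(db, eps, min_items, score=euclidean, op="<="):
    """Return the indices of all objects that are hot items in db.

    An object x is hot if at least min_items other objects y satisfy
    phi_eps(f_score(x, y)). Objects are excluded by index, so exact
    duplicates of x still count as "other" objects.
    """
    phi = PREDICATES[op]
    result = []
    for i, x in enumerate(db):
        count = sum(
            1 for j, y in enumerate(db) if j != i and phi(score(x, y), eps)
        )
        if count >= min_items:
            result.append(i)
    return result


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
print(hot_items(points, eps=1.0, min_items=2))  # prints [0]
```

Only the first point has two other points within distance 1, so it is the only hot item for minItems = 2. The quadratic number of score evaluations is acceptable for a sketch but not for large databases.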

In the case of uncertain objects, an exact score cannot be determined, in particular if the score relates to object attributes which are assumed to be uncertain (cf. Figure 14.2). Consequently, uncertain objects lead to uncertain scores, which in turn lead to uncertain predicate results. Thus, the result of the predicate φ_ε is no longer binary and, instead, yields a probability value. This probabilistic predicate result can be estimated. Based on this estimation, it is possible to compute, for each probabilistic object X of an uncertain database, a probability value which reflects the likelihood that X is a hot item.
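To make the step from uncertain predicate results to a hot item probability more concrete, the following sketch computes such a probability under strong simplifying assumptions that are not stated in this section: each uncertain object is given as a finite list of equally likely observations, objects are mutually independent, the predicate is "distance ≤ ε", and f_score is the Euclidean distance. All function names are hypothetical, and the sketch is not meant to reproduce the method developed later in this chapter.

```python
import math


def prob_at_least(probs, k):
    """P(at least k of the given independent Bernoulli events occur)."""
    # dist[c] = probability that exactly c events have occurred so far.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, q in enumerate(dist):
            nxt[c] += q * (1.0 - p)   # event does not occur
            nxt[c + 1] += q * p       # event occurs
        dist = nxt
    return sum(dist[k:])


def hot_item_probability(db, i, eps, min_items):
    """Probability that uncertain object db[i] is a hot item.

    db is a list of uncertain objects, each a non-empty list of equally
    likely observations (points). For every observation a of db[i], the
    predicate events for the other objects are independent, so their
    count distribution is computed exactly and then averaged over a.
    """
    x_obs = db[i]
    total = 0.0
    for a in x_obs:
        probs = [
            sum(1 for b in y_obs if math.dist(a, b) <= eps) / len(y_obs)
            for j, y_obs in enumerate(db)
            if j != i
        ]
        total += prob_at_least(probs, min_items)
    return total / len(x_obs)


uncertain_db = [
    [(0.0, 0.0)],               # object 0: one observation
    [(0.0, 1.0), (0.0, 3.0)],   # object 1: two observations
    [(1.0, 0.0), (4.0, 0.0)],   # object 2: two observations
]
print(hot_item_probability(uncertain_db, 0, eps=1.0, min_items=2))  # prints 0.25
```

In this example each of the two other objects lies within distance 1 of object 0 with probability 0.5, so both do so simultaneously with probability 0.25. Conditioning on each observation of X before counting avoids treating the predicate events as independent when they share the same uncertain X.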

In the context of this chapter, hot items can be abstracted to objects that satisfy a given similarity predicate together with a reasonably large set of other items. If the equality predicate is assumed, i.e., φ_ε(f_score(x, y)) := “f_score(x, y) = 0”, then a hot item x satisfies the frequent item property, as x is equal to many other items and, thus, occurs frequently in the database.

The detection of hot items can be efficiently supported by a similarity join query used in a preprocessing step, in particular the distance-range self-join. Approaches for an efficient join on uncertain data are proposed in [138]. The main advantage of the approach in [138] is that discrete positions in space can be indexed efficiently using traditional spatial access methods, which reduces the computational cost of complex query types.
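The role of a distance-range self-join as a preprocessing step can be sketched for certain point data as follows. The uniform grid used here is merely a simple stand-in for a spatial access method and is not the join technique of [138]; the function names and the restriction to ε > 0 are assumptions of this sketch.

```python
import math
from collections import defaultdict
from itertools import product


def range_self_join(points, eps):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps."""
    if eps <= 0:
        raise ValueError("eps must be positive for the grid-based join")

    def cell_of(p):
        # Cells have side length eps, so join partners lie in adjacent cells.
        return tuple(math.floor(c / eps) for c in p)

    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_of(p)].append(idx)

    pairs = []
    for idx, p in enumerate(points):
        cell = cell_of(p)
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for other in grid.get(neighbor, ()):
                if idx < other and math.dist(p, points[other]) <= eps:
                    pairs.append((idx, other))
    return sorted(pairs)


def neighbor_counts(n, pairs):
    """Turn join pairs into the number of join partners per object."""
    counts = [0] * n
    for i, j in pairs:
        counts[i] += 1
        counts[j] += 1
    return counts


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
pairs = range_self_join(points, eps=1.0)
print(pairs)                                # prints [(0, 1), (0, 2)]
print(neighbor_counts(len(points), pairs))  # prints [2, 1, 1, 0]
```

Comparing the resulting counts against minItems yields the hot items for the predicate "≤ ε". The grid restricts distance computations to points in adjacent cells, which mirrors the idea of pruning candidate pairs with an index before the more expensive refinement.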

The approach proposed in this chapter exploits the similarity join approach of [138]. However, the cost of the probabilistic detection of hot items is originally highly CPU-bound, which will be demonstrated in the experimental evaluation. The advantage of an I/O-cost-efficient approach for the preprocessing step only becomes noticeable when the methods are applied in such a way that the CPU cost dominates the overall query cost to a lesser extent.

The remainder of this chapter is organized as follows. Section 14.2 will formally introduce the problem of probabilistic identification of hot items in uncertain databases. The solution for the efficient computation of hot item probabilities can be found in Section 14.3. A performance evaluation of the proposed approach will be given in Section 14.4. Section 14.5 will conclude this chapter.
