Similarity Predicates - Efficient similarity-based operations for data integration

selection, join, and grouping operations. For the latter, apart from handling atransitivity etc., the implicitly specified equivalence predicate must be replaced by an explicit similarity predicate. The impact of similarity relations on database operations is rarely considered in existing systems, mainly because they support only a limited view of similarity and of the corresponding operations, basically only selections.

Level 3 - Query and data model: the introduction of probabilistic aspects may require changes or extensions to the underlying query and data model of the system in order to express the possible vagueness of facts derived by similarity-based operations. Though this is currently not addressed in existing systems and, furthermore, is not a focus of this thesis, the problem has been addressed in research. In [DS96] Dey et al. propose an extended relational model and algebra supporting probabilistic aspects. Fuhr describes a probabilistic Datalog in [Fuh95]. Especially for data integration issues, probabilistic approaches were verified and yielded useful results, as described by Tseng et al. in [TCY92].

Each level builds on the respective lower levels: similarity-based operations can only be applied on the basis of actual similarity predicates, and a probabilistic data model only makes sense if the corresponding operations are supported. On the other hand, the support of one level does not necessarily imply any higher levels. Even where database systems support similarity predicates to some degree, operations may still be processed in a conventional way, possibly with decreased efficiency or accuracy. Likewise, similarity-based operations as proposed in this thesis can be used without any explicit modifications to the data model. The focus of this thesis is therefore on levels 1 and 2 described above, which are introduced in more detail in the following sections.

In this section the semantics of similarity predicates are described as extensions to the standard relational algebra, assuming the following basic notations:

Let $r$ be a relation with the schema $R = (A_1, \ldots, A_m)$, let $t^r \in r$ be a tuple from the relation $r$, and let $t^r(A_i)$ denote the value of attribute $A_i$ of the tuple $t^r$.

If we furthermore distinguish between predicates defined between attributes, used for instance in join conditions, and those defined between an attribute and a constant value, as typically used in selection conditions, basic similarity predicates $sim\_pred$ can be specified as follows.

$$sim\_pred ::= sim(A_i, const) \geq l \;\mid\; dist(A_i, const) \leq k \;\mid\; sim(A_i, A_j) \geq l \;\mid\; dist(A_i, A_j) \leq k$$

where the predicate is specified using either a normalised similarity or a distance measure according to the description in Section 3.2. The semantics of the similarity and distance predicates are as follows.

Normalised similarity predicate on attribute and constant value:

$$sim\_pred(t) \Leftrightarrow sim(t(A_i), const) \geq l$$

Distance predicate on attribute and constant value:

$$sim\_pred(t) \Leftrightarrow dist(t(A_i), const) \leq k$$

Normalised similarity predicate on attributes:

$$sim\_pred(t^r, t^s) \Leftrightarrow sim(t^r(A_i), t^s(A_j)) \geq l$$

Distance predicate on attributes:

$$sim\_pred(t^r, t^s) \Leftrightarrow dist(t^r(A_i), t^s(A_j)) \leq k$$

The last two cases explicitly include the cases $r = s$ and even $i = j$. The latter is for instance useful when expressing a predicate for similarity-based grouping as introduced below, where the implicit equality predicate given in the conventional GROUP BY clause must be replaced by a similarity predicate. We use $sim(t^r(A_i)) \geq l$ and $dist(t^r(A_i)) \leq k$ as shorthands for such predicates on one attribute within one relation.
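The four predicate forms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tuples are modelled as dictionaries, `difflib`'s ratio stands in for a normalised similarity measure, and the Levenshtein edit distance stands in for a distance measure. Neither measure nor any of the function names is prescribed by the text.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def dist(a: str, b: str) -> int:
    """Levenshtein edit distance (an assumed stand-in distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


# Attribute vs. constant, as typically used in selection conditions.
def sim_pred_const(t, a_i, const, l):
    return sim(t[a_i], const) >= l


def dist_pred_const(t, a_i, const, k):
    return dist(t[a_i], const) <= k


# Attribute vs. attribute, as used in join conditions (r = s is allowed).
def sim_pred_attrs(t_r, t_s, a_i, a_j, l):
    return sim(t_r[a_i], t_s[a_j]) >= l


def dist_pred_attrs(t_r, t_s, a_i, a_j, k):
    return dist(t_r[a_i], t_s[a_j]) <= k


t_r = {"name": "Meyer"}
t_s = {"surname": "Meier"}
```

Because every predicate compares a measure against a threshold, each call returns a plain boolean, e.g. `dist_pred_attrs(t_r, t_s, "name", "surname", 1)` holds since the two values differ by one substitution.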

Due to being based on two-valued logic, conventional predicates based on operators such as equality ($=$), inequality ($\neq$), or order comparison ($<$, $>$, $\leq$, $\geq$), etc. can be considered conceptually on the same level. We use $conv\_pred$ as a shorthand for such conventional predicates. Both kinds can be combined freely through the logical operators $\wedge$, $\vee$, and $\neg$. We refer to conditions containing at least one similarity predicate as similarity conditions $sim\_cond$, i.e.

$$sim\_cond ::= sim\_pred \;\mid\; \neg\, sim\_cond \;\mid\; sim\_cond \;\theta\; sim\_pred \;\mid\; sim\_cond \;\theta\; conv\_pred$$

where $\theta \in \{\wedge, \vee\}$. A special case considered for purposes of the evaluation of predicates during query processing are conjunctive similarity conditions, where only the logical conjunction operator $\theta = \wedge$ is used to combine predicates.

This basic concept of similarity predicates can be used in most current database management systems by applying user-defined functions for implementing similarity or distance measures. Yet, the recognition or explicit qualification of these similarity predicates is necessary if special support for similarity-based operations through algorithms and indexes is intended.
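As a minimal sketch of the user-defined-function route, the following registers a similarity measure with SQLite through Python's standard `sqlite3` module and uses it in a selection condition. The table `person`, its contents, the threshold 0.75, and the choice of `difflib`'s ratio as the measure are all assumptions made for illustration.

```python
import sqlite3
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function named "sim".
conn.create_function("sim", 2, sim)

conn.execute("CREATE TABLE person (name TEXT)")
conn.executemany("INSERT INTO person VALUES (?)",
                 [("Meyer",), ("Meier",), ("Schmidt",)])

# Similarity selection: sim(name, const) >= l with l = 0.75.
rows = conn.execute(
    "SELECT name FROM person WHERE sim(name, ?) >= ? ORDER BY name",
    ("Meier", 0.75),
).fetchall()
conn.close()
```

In this setting the system sees `sim` only as an opaque function and evaluates it tuple by tuple, which is why the explicit recognition of similarity predicates matters when special algorithms or indexes are to be used.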

If such special support is intended, further considerations regarding the properties of the applied measure are required. Similarity predicates which are not reflexive or symmetric have a severe impact on the operations they are used in. Atransitivity must generally be considered. The consequences of this are discussed in relation to the operations later on. At this point, however, it is necessary to mention that these aspects must be known to the system performing the corresponding operations. For system-defined predicates this is straightforward. For user-defined predicates, which will often be required due to the strong context dependence of similarity, there must be ways to declare these properties to the system.
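The atransitivity mentioned above is easy to observe with a small sketch. It assumes an edit-distance predicate with threshold $k = 1$ and three sample strings chosen for illustration. The helper `check_properties` is hypothetical: it tests the properties on a finite sample only, so it can refute a property but never prove it, and it is no substitute for declaring the properties to the system.

```python
from itertools import product


def dist(a: str, b: str) -> int:
    """Levenshtein edit distance (an assumed stand-in distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similar(a: str, b: str, k: int = 1) -> bool:
    """Distance predicate dist(a, b) <= k."""
    return dist(a, b) <= k


def check_properties(pred, values):
    """Test reflexivity, symmetry and transitivity on a finite sample."""
    return {
        "reflexive": all(pred(v, v) for v in values),
        "symmetric": all(pred(a, b) == pred(b, a)
                         for a, b in product(values, repeat=2)),
        "transitive": all(pred(a, c)
                          for a, b, c in product(values, repeat=3)
                          if pred(a, b) and pred(b, c)),
    }


props = check_properties(similar, ["cat", "cot", "cog"])
```

Here "cat" is similar to "cot" and "cot" to "cog", but "cat" and "cog" are two edits apart, so the predicate is reflexive and symmetric on this sample yet not transitive.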

Because the previous definition of predicates is an extension of the standard relational algebra, we do not have to deal with probabilities in conditions: by using a similarity threshold we can always rely on boolean values of true or false for such predicates and for complex conditions derived from them. The alternatively considered approach of using fuzzy similarity predicates returning a degree of truth between 0 and 1 would require a special handling of complex conditions, for which a probabilistic result must be derived. Given two fuzzy predicates $p$ and $q$, two often used ways of computing this score are:

Minimum/maximum combination, which is for instance applied during query optimisation when dealing with selectivities in commercial database management systems:

$$P(p \wedge q) = \min(P(p), P(q)), \qquad P(p \vee q) = \max(P(p), P(q)), \qquad P(\neg p) = 1 - P(p)$$

Probabilistic combination assuming independence between the predicates, as for instance used in information retrieval, probabilistic database approaches like the one by Fuhr in [Fuh95], or data integration approaches like Cohen's WHIRL described in [Coh98]:

$$P(p \wedge q) = P(p)\,P(q), \qquad P(p \vee q) = 1 - (1 - P(p))(1 - P(q)), \qquad P(\neg p) = 1 - P(p)$$

Throughout this thesis the latter approach is used. A discussion of the approaches regarding data integration is given by Cohen in [Coh98]. The integration of predicates which are not fuzzy can easily be done by assigning the values 0 and 1 if the result is false or true, respectively.
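The two combination schemes can be written down directly. The following sketch takes scores as plain floats in $[0, 1]$; the function names are chosen here for illustration, and `crisp` shows the mapping of non-fuzzy predicates to 0 and 1.

```python
from math import isclose


# Minimum/maximum combination.
def and_min(p: float, q: float) -> float:
    return min(p, q)


def or_max(p: float, q: float) -> float:
    return max(p, q)


# Probabilistic combination, assuming independence of the predicates.
def and_prob(p: float, q: float) -> float:
    return p * q


def or_prob(p: float, q: float) -> float:
    return 1 - (1 - p) * (1 - q)


# Negation is the same in both schemes.
def negate(p: float) -> float:
    return 1 - p


def crisp(result: bool) -> float:
    """Embed a two-valued predicate: false -> 0, true -> 1."""
    return 1.0 if result else 0.0


p, q = 0.8, 0.5
scores = {
    "and_min": and_min(p, q),
    "or_max": or_max(p, q),
    "and_prob": and_prob(p, q),
    "or_prob": or_prob(p, q),
}
```

For $P(p) = 0.8$ and $P(q) = 0.5$, the probabilistic scheme gives a lower conjunction score and a higher disjunction score than min/max, while a crisp predicate that evaluates to true leaves a conjunction score unchanged in both schemes.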

The score of such a complex condition including fuzzy predicates is again between 0 and 1 and can for instance be used to specify a global threshold for the condition instead of thresholds for the single fuzzy predicates. To regain the expressive power of per-predicate thresholds, a weighting of predicates would have to be introduced. Alternatively, the score can be used for further processing, such as for ranking the results for an output, or for ordered and pipelined processing in following operations.
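The difference between a global threshold and per-predicate thresholds, and the use of the score for ranking, can be illustrated with a small sketch. The candidate identifiers, the scores, and both threshold values are invented for this example; the conjunction is scored with the probabilistic combination.

```python
from math import prod

# Hypothetical fuzzy scores of two predicates for three candidate results.
candidates = {
    "r1": [0.95, 0.60],
    "r2": [0.75, 0.75],
    "r3": [0.40, 0.90],
}

GLOBAL_THRESHOLD = 0.5      # applied to the score of the whole condition
PREDICATE_THRESHOLD = 0.7   # applied to every single predicate


def score(ps):
    """Conjunction score under the independence assumption."""
    return prod(ps)


# Global threshold, followed by ranking on descending score.
ranked = sorted(
    (key for key, ps in candidates.items() if score(ps) >= GLOBAL_THRESHOLD),
    key=lambda key: score(candidates[key]),
    reverse=True,
)

# Per-predicate thresholds: every predicate has to pass on its own.
per_predicate = [
    key for key, ps in candidates.items()
    if all(p >= PREDICATE_THRESHOLD for p in ps)
]
```

With these numbers, `r1` passes the global threshold because its strong first score compensates for the weak second one, but it fails the per-predicate thresholds. This compensation effect is what a weighting of predicates would have to control.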
