Similarity-based Operations - Efficient similarity-based operations for data integration

However, we may want to maintain and improve the initial frequency information continuously. For this purpose, the result sets obtained during query processing can be used. Updating selectivity information continuously may seem problematic for structures which are pruned based on a count or frequency threshold: each new entry in an already established structure would fall prey to the pruning rule and never have a chance to reach the threshold. A solution currently being developed by Ingolf Geist, not described here, is based on an aging algorithm for the count information in the data structures.

$$\tilde\sigma_{dist(s, attr) \le k}(r(R)) = \{\, t \mid t \in r(R) \wedge dist(s, t(attr)) \le k \,\}$$

The variant of such a similarity predicate considered here is based on the edit distance of strings edist:

$$\tilde\sigma_{edist(s, attr) \le k}(r(R)) = \{\, t \mid t \in r(R) \wedge edist(s, t(attr)) \le k \,\}$$
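The edit distance selection can be illustrated with a minimal Python sketch. It assumes unit-cost Levenshtein distance and an in-memory list of dictionaries standing in for the relation; the names `edist` and `sim_select` are chosen here for illustration only.

```python
def edist(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def sim_select(relation, attr, s, k):
    """Return all tuples t of the relation with edist(s, t[attr]) <= k."""
    return [t for t in relation if edist(s, t[attr]) <= k]


# Hypothetical example data.
titles = [{"title": "similarity"}, {"title": "simlarity"}, {"title": "integration"}]
result = sim_select(titles, "title", "similarity", 1)
```

Evaluated this way, the selection computes the edit distance for every tuple, which is what the pre-selection discussed next is meant to avoid.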

Without loss of generality we focus on simple predicates only. Complex predicates, e.g. predicates connected by or, can be handled by applying the following steps to each atomic predicate while taking into account the query capabilities of the sources.

Furthermore, we assume that source systems do not support such predicates but only the primitive predicate contains(a, b) introduced above. The problem is now to rewrite a query containing $\tilde\sigma_{SIM}$ into the following form:

$$\tilde\sigma_{SIM}(r(R)) = \tilde\sigma_{SIM}(\sigma_{PRESIM}(r(R)))$$

where $\sigma_{PRESIM}$ is pushed to the source system and $\tilde\sigma_{SIM}$ is performed in the mediator.

Assuming SIM is an atomic predicate of the form $edist(s, attr) \le k$, the selection condition PRESIM can be derived using the mapping functions $map_{qgram}$, $map_{substring}$, and $map_{token}$ from Section 6.2, which we consider in the generalised form $map$. This mapping function returns a set of q-samples, substrings, or keywords according to the mappings described in Section 6.2. The disjunctive query represented by this set in general contains $k + 1$ strings, unless the length of $s$ does not allow this number of substrings to be retrieved. In this case, the next smaller possible set is returned, representing a query that returns a partial result as described before. In any case, the estimated selectivity of the represented query must be better than a given selectivity threshold.

Based on this we can derive the expression PRESIM from the similarity predicate as follows:

$$PRESIM := \bigvee_{q \in map(s)} contains(q, attr)$$
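As a minimal sketch, the pre-selection condition can be built in Python as follows. Several points are assumptions made for illustration: q-samples are taken to be non-overlapping q-grams picked by position (the selectivity-based choice from Section 6.2 is not modelled), `contains(q, attr)` is read as a substring test of `q` within the attribute value, and the function names are hypothetical.

```python
def map_qgram(s: str, q: int, k: int) -> list[str]:
    """Return up to k + 1 non-overlapping q-grams (q-samples) of s.

    If s is too short, fewer samples are returned, so the pre-selection
    may then miss matches (a partial result).
    """
    samples = [s[i:i + q] for i in range(0, len(s) - q + 1, q)]
    return samples[:k + 1]


def contains(q: str, value: str) -> bool:
    """Stand-in for the source's primitive predicate: substring test."""
    return q in value


def presim(s: str, q: int, k: int):
    """Build PRESIM as a predicate: the disjunction of contains(g, value)."""
    samples = map_qgram(s, q, k)
    return lambda value: any(contains(g, value) for g in samples)


pre = presim("similarity", 3, 1)          # samples: "sim", "ila"
hits = [v for v in ["simlarity", "dissimilar", "integration"] if pre(v)]
```

With k + 1 non-overlapping samples, k edit operations can destroy at most k of them, so at least one sample survives in every string within distance k. The pre-selection still admits false positives such as "dissimilar" here, which the similarity predicate in the mediator removes afterwards.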

In case of using the edit distance as similarity predicate we can further optimise the query expression by applying length filtering. This means we can omit the expensive computation of the edit distance between two strings $s_1$ and $s_2$ if

$$|length(s_1) - length(s_2)| > k$$

for a given maximum distance value $k$. This holds because in this case the edit distance is already greater than $k$: every insertion or deletion changes the length of a string by exactly one. Thus, the final query expression is

$$\tilde\sigma_{edist(s, attr) \le k}(\sigma_{|length(s) - length(attr)| \le k}(\sigma_{PRESIM}(r(R))))$$

where the placement of the length filtering selection depends on the query capabilities of the source.
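The length filter itself is a one-line test. The following Python sketch shows it applied in the mediator, which is only one of the possible placements; the function name and the example values are hypothetical.

```python
def length_filter(s: str, value: str, k: int) -> bool:
    """True if value can still be within edit distance k of s."""
    return abs(len(s) - len(value)) <= k


# Hypothetical candidates returned by a pre-selection for s = "similarity".
candidates = ["simlarity", "sim", "similarity measures"]
kept = [v for v in candidates if length_filter("similarity", v, 1)]
```

Only the strings in `kept` need an edit distance computation; the others are discarded by comparing lengths alone.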

A second optimisation rule deals with complex disjunctively connected similarity conditions of the form $SIM(s_1, attr) \vee SIM(s_2, attr)$. In this case the pre-selection condition can be simplified to

$$\bigvee_{q_1 \in map(s_1)} contains(q_1, attr) \;\vee\; \bigvee_{q_2 \in map(s_2)} contains(q_2, attr)$$

A general problem that can occur in this context is a query string exceeding the length limit given by the source system. This has to be handled by splitting the query condition into two or more parts $PRESIM_1, \ldots, PRESIM_n$ and building the union of the partial results afterwards:

$$\tilde\sigma_{SIM}(\sigma_{PRESIM_1}(r(R)) \cup \cdots \cup \sigma_{PRESIM_n}(r(R)))$$
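One way to split the condition is a greedy packing of the search strings, sketched below in Python. The separator `" OR "`, the way the query length is measured, and both function names are assumptions for illustration; a real source interface will have its own query syntax and limit.

```python
def split_presim(samples: list[str], limit: int, sep: str = " OR ") -> list[list[str]]:
    """Greedily pack samples into groups whose joined query fits the limit.

    A single sample longer than the limit still gets a group of its own.
    """
    parts, current, length = [], [], 0
    for g in samples:
        extra = len(g) + (len(sep) if current else 0)
        if current and length + extra > limit:
            parts.append(current)          # close the current partial query
            current, length = [], 0
            extra = len(g)
        current.append(g)
        length += extra
    if current:
        parts.append(current)
    return parts


def union_preselect(values, parts):
    """Evaluate each partial pre-selection and union the partial results."""
    result = set()
    for part in parts:
        result |= {v for v in values if any(g in v for g in part)}
    return result


parts = split_presim(["sim", "ila", "rit"], limit=10)
matches = union_preselect(["simlarity", "clarity", "other"], parts)
```

Here the three samples are split into two partial queries, and the union of their results is what the similarity predicate is applied to afterwards.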

Obviously, the above mentioned optimisation of applying length filtering can be used here, too.

6.4.2 Similarity Join

Based on the idea of implementing similarity operations by introducing a pre-selection we can realise similarity join operations, too. We consider a similarity join $r_1(R_1) \,\tilde\bowtie_{SIM}\, r_2(R_2)$ where the join condition is an approximate string criterion of the form $SIM(R_1.attr_1, R_2.attr_2) \ge threshold$ or $edist(R_1.attr_1, R_2.attr_2) \le k$. As in the previous sections, we consider only simple edit distance predicates in the following.

A first approach for computing the join is to use a bind join implementation.

Here, we assume that one relation is either restricted by a selection criterion or can be scanned completely. Then, the bind join works as shown in Algorithm 4. For each tuple of the outer relation $r_1$ we take the (string) value of the join attribute $attr_1$ and perform a similarity selection on the inner relation.

This is performed in the same way as described in Section 6.4.1 by

1. mapping the string to a set of q-grams,

2. sending the disjunctive selection to the source,

3. post-processing the result by applying the similarity predicate, and then

4. combining each tuple of this selection result with the current tuple of the outer relation.

Algorithm 4: Bind join

foreach $t_1 \in r_1(R_1)$ do
    $s := t_1(R_1.attr_1)$
    foreach $t_2 \in \tilde\sigma_{edist(s, attr_2) \le k}(\sigma_{PRESIM}(r_2(R_2)))$ do
        output $t_1 \bowtie t_2$
    od
od
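The bind join can be sketched in Python as follows. This is an illustration under simplifying assumptions: the inner relation is an in-memory list standing in for the source, the pre-selection is a substring test on positionally chosen non-overlapping q-grams, and all function names and data are hypothetical. The counter makes the number of source queries visible.

```python
def edist(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_qgram(s: str, q: int, k: int) -> list[str]:
    """Up to k + 1 non-overlapping q-grams of s."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)][:k + 1]


def bind_join(r1, attr1, r2, attr2, k, q=3):
    """For each outer tuple: pre-select inner tuples, then verify with edist."""
    out, queries = [], 0
    for t1 in r1:
        s = t1[attr1]
        samples = map_qgram(s, q, k)
        queries += 1  # one (simulated) source query per outer tuple
        pre = [t2 for t2 in r2 if any(g in t2[attr2] for g in samples)]
        out.extend((t1, t2) for t2 in pre if edist(s, t2[attr2]) <= k)
    return out, queries


r1 = [{"name": "similarity"}, {"name": "integration"}]
r2 = [{"title": "simlarity"}, {"title": "integratoin"}, {"title": "mediator"}]
pairs, n_queries = bind_join(r1, "name", r2, "title", k=2)
```

In this sketch the number of source queries equals the cardinality of the outer relation, which is why the smaller relation is the preferred outer relation.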

The roles of the participating relations (inner or outer relation) are determined by taking into account relation cardinalities as well as the query capabilities. If a relation is not restricted using a selection condition and does not support a full table scan it has to be used as inner relation. Otherwise, the smaller relation is chosen as the outer relation in order to reduce the number of source queries.

A significant reduction of the number of the source queries can be achieved by using a semi-join variant. Here, the following principal approach is used.

1. One of the relations is first processed completely.

2. The string values of the join attribute are collected and the map function is applied to each of them.

3. The resulting set $S$ of q-grams, tokens, or substrings is used to build a single pre-selection condition.

4. The result of the corresponding query is joined with the tuples from the first relation using the similarity condition.

This is shown in Algorithm 5.

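Algorithm 5: Semi join

$S := \emptyset$
foreach $t_1 \in r_1(R_1)$ do
    $S := S \cup map(t_1(R_1.attr_1))$
od
$r_{tmp} := \sigma_{\bigvee_{s \in S} contains(s, attr_2)}(r_2(R_2))$
foreach $t_1 \in r_1(R_1)$ do
    foreach $t_2 \in r_{tmp}$ do
        if $edist(t_1(R_1.attr_1), t_2(R_2.attr_2)) \le k$ then
            output $t_1 \bowtie t_2$
        fi
    od
od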
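A Python sketch of the semi-join variant follows, under the same illustrative assumptions as before: an in-memory list stands in for the second source, the pre-selection is a substring test on positionally chosen non-overlapping q-grams, and all names and data are hypothetical. Splitting an over-long pre-selection condition is not modelled.

```python
def edist(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_qgram(s: str, q: int, k: int) -> list[str]:
    """Up to k + 1 non-overlapping q-grams of s."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)][:k + 1]


def semi_join(r1, attr1, r2, attr2, k, q=3):
    """Collect all samples first, pre-select once, then verify pairwise."""
    samples = set()
    for t1 in r1:
        samples |= set(map_qgram(t1[attr1], q, k))
    # A single (simulated) source query with one large disjunction.
    r_tmp = [t2 for t2 in r2 if any(g in t2[attr2] for g in samples)]
    return [(t1, t2) for t1 in r1 for t2 in r_tmp
            if edist(t1[attr1], t2[attr2]) <= k]


r1 = [{"name": "similarity"}, {"name": "integration"}]
r2 = [{"title": "simlarity"}, {"title": "integratoin"}, {"title": "mediator"}]
pairs = semi_join(r1, "name", r2, "title", k=2)
```

Compared with the bind join, the second source is queried once instead of once per outer tuple, at the cost of a longer query string and more edit distance computations in the mediator.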

If the pre-selection condition exceeds the query string limit of the source, the pre-selection has to be performed in multiple steps. In the best case, this approach requires only 2 source queries, assuming that the first relation is cached in the mediator, or 3 source queries otherwise. The worst case depends on the query length limit as well as the number of derived q-grams. However, if the number of queries is greater than $|r_1| + 1$, one can always switch to the bind join implementation.

A further kind of join operation can be used if neither of the two input relations is restricted by a selection condition. Assuming that a full fetch / scan is not possible or not allowed, one could use the index containing frequent q-grams / tokens / substrings together with their selectivity for retrieving possibly matching data from both relations. By processing the results (i.e. extracting q-grams) the index can be adjusted and extended, and in this way the subsequent retrieval operations can be focused on promising q-grams. Of course, this discovery join cannot guarantee a complete result but is helpful in identifying existing approximate matches.
