
However, we may want to maintain and improve the initial frequency information continuously. For this purpose, the result sets obtained during query processing can be used. Updating selectivity information continuously may seem problematic for structures which are pruned based on a count or frequency threshold: each new entry in an already established structure would fall prey to the pruning rule and never have a chance to reach the threshold. A solution currently being developed by Ingolf Geist, and not described here, is based on an aging algorithm for the count information in the data structures.

$$\tilde{\sigma}_{dist(s,\,attr) \le k}(r(R)) := \{\, t \mid t \in r(R) \wedge dist(s, t(attr)) \le k \,\}$$

The variant of such a similarity predicate considered here is based on the edit distance of strings edist:

$$\tilde{\sigma}_{edist(s,\,attr) \le k}(r(R)) := \{\, t \mid t \in r(R) \wedge edist(s, t(attr)) \le k \,\}$$

Without loss of generality we focus on simple predicates only. Complex predicates, e.g. predicates connected by or, can be handled by applying the following steps to each atomic predicate and taking into account the query capabilities of the sources.

Furthermore, we assume that source systems do not support such predicates but only the primitive predicate contains(a, b) introduced above. Now, the problem is to rewrite a query containing $\tilde{\sigma}_{SIM}$ in the following form:

$$\tilde{\sigma}_{SIM}(r(R)) = \tilde{\sigma}_{SIM}(\sigma_{PRESIM}(r(R)))$$

where $\sigma_{PRESIM}$ is pushed to the source system and $\tilde{\sigma}_{SIM}$ is performed in the mediator.

Assuming SIM is an atomic predicate of the form $edist(s, attr) \le k$, the selection condition PRESIM can be derived using the mapping functions map_qgram, map_substring, and map_token from Section 6.2, which we consider in the generalised form map. Applied to the search string s, this mapping function returns a set of q-samples, substrings, or keywords according to the mappings described in Section 6.2. The disjunctive query represented by this set in general contains $k + 1$ strings, unless the length of s does not allow this number of substrings to be retrieved. In that case, the next possible smaller set is returned, representing a query that returns a partial result as described before. In any case, the estimated selectivity of the represented query must be better than a given selectivity threshold.

Based on this we can derive the expression PRESIM from the similarity predicate as follows:

$$PRESIM := \bigvee_{q \in map(s)} contains(q, attr)$$
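The rewriting can be illustrated by a minimal Python sketch. It is not the implementation described in this chapter: `map_qsamples` is a hypothetical stand-in for map that simply takes the first k + 1 non-overlapping q-grams and ignores selectivity estimates, the source is simulated by an in-memory list of dictionaries, and `contains` is modelled by Python's substring test. The pre-selection is complete whenever k + 1 non-overlapping samples exist, because k edit operations can damage at most k of them.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def map_qsamples(s, k, q=3):
    """Hypothetical map: up to k + 1 non-overlapping q-grams of s.

    If s is too short, fewer samples are returned (partial result case).
    """
    samples = [s[i:i + q] for i in range(0, len(s) - q + 1, q)]
    return samples[:k + 1]


def similarity_select(rows, attr, s, k, q=3):
    """Evaluate the similarity selection as post-filter over a pre-selection."""
    samples = map_qsamples(s, k, q)
    # sigma_PRESIM: the disjunction of contains(q, attr), pushed to the source.
    candidates = [t for t in rows if any(x in t[attr] for x in samples)]
    # sigma~_edist: exact similarity predicate, evaluated in the mediator.
    return [t for t in candidates if edit_distance(s, t[attr]) <= k]


rows = [{"title": "similarity"}, {"title": "simularity"}, {"title": "database"}]
matches = similarity_select(rows, "title", "similarity", k=1)
```

With k = 1 and q = 3 the sketch sends the two samples "sim" and "ila" to the simulated source, and only the candidates returned for them are compared using the edit distance.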

If the edit distance is used as similarity predicate, we can further optimise the query expression by applying length filtering. This means that we can omit the expensive computation of the edit distance between two strings $s_1$ and $s_2$ if

$$|length(s_1) - length(s_2)| > k$$

for a given maximum distance value k. This holds because the length difference has to be compensated by at least as many insertions or deletions, so in this case the edit distance value is already greater than k. Thus, the final query expression is

$$\tilde{\sigma}_{edist(s,\,attr) \le k}(\sigma_{|length(s) - length(attr)| \le k}(\sigma_{PRESIM}(r(R))))$$

where the placement of the length filtering selection depends on the query capabilities of the source.
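As an illustration of length filtering, the following Python sketch applies the filter in the mediator before the edit distance is computed. The function names and the counter of edit distance computations are assumptions made for this example only; a source with suitable query capabilities could evaluate the length condition itself.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def passes_length_filter(s, value, k):
    """True if the length difference alone does not rule out edist <= k."""
    return abs(len(s) - len(value)) <= k


def filtered_select(candidates, s, k):
    """Return the matching strings and the number of edit distance calls."""
    result, computed = [], 0
    for value in candidates:
        if not passes_length_filter(s, value, k):
            continue  # edit distance is certainly greater than k
        computed += 1
        if edit_distance(s, value) <= k:
            result.append(value)
    return result, computed


candidates = ["mediator", "mediators", "mediation services", "med"]
result, computed = filtered_select(candidates, "mediator", k=1)
```

In this example only two of the four candidates reach the edit distance computation, since the other two differ from the search string by more than one character in length.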

A second optimisation rule deals with complex, disjunctively connected similarity conditions of the form $SIM(s_1, attr) \vee SIM(s_2, attr)$. In this case the pre-selection condition can be simplified to

$$\bigvee_{q_1 \in map(s_1)} contains(q_1, attr) \;\vee\; \bigvee_{q_2 \in map(s_2)} contains(q_2, attr)$$

A general problem that can occur in this context is that a query string exceeds the length limit for query strings given by the source system. This has to be handled by splitting the query condition into two or more parts $PRESIM_1, \dots, PRESIM_n$ and building the union of the partial results afterwards:

$$\tilde{\sigma}_{SIM}(\sigma_{PRESIM_1}(r(R)) \cup \dots \cup \sigma_{PRESIM_n}(r(R)))$$

Obviously, the above-mentioned optimisation of applying length filtering can be used here, too.
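The splitting step can be sketched in Python as follows. The textual query syntax (`contains('...', attr)` terms joined by `OR`), the greedy packing strategy, and the function names are assumptions for this example and do not correspond to any particular source system.

```python
def split_preselection(samples, attr, max_len):
    """Greedily pack contains-terms into query strings of at most max_len.

    A single term longer than max_len is still emitted on its own.
    """
    parts, current = [], []
    for q in samples:
        term = f"contains('{q}', {attr})"
        if current and len(" OR ".join(current + [term])) > max_len:
            parts.append(" OR ".join(current))
            current = [term]
        else:
            current.append(term)
    if current:
        parts.append(" OR ".join(current))
    return parts


def union_partial_results(partials):
    """Union of the partial results, keeping the first occurrence of a tuple."""
    seen, merged = set(), []
    for rows in partials:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


parts = split_preselection(["sim", "ila", "rit"], "title", max_len=50)
merged = union_partial_results([[(1, "similarity")],
                                [(1, "similarity"), (2, "clarity")]])
```

Here the three terms do not fit into one query string of 50 characters, so two partial conditions are produced. Tuples returned by more than one partial query appear only once in the union.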

6.4.2 Similarity Join

Based on the idea of implementing similarity operations by introducing a pre-selection we can realise similarity join operations, too. We consider a similarity join $r_1(R_1) \,\tilde{\bowtie}_{SIM}\, r_2(R_2)$ where the join condition is an approximate string criterion of the form $SIM(R_1.attr_1, R_2.attr_2) \ge threshold$ or $edist(R_1.attr_1, R_2.attr_2) \le k$. As in the previous sections, we consider only simple edit distance predicates in the following.

A first approach for computing the join is to use a bind join implementation. Here, we assume that one relation is either restricted by a selection criterion or can be scanned completely. Then, the bind join works as shown in Algorithm 4. For each tuple of the outer relation $r_1$ we take the (string) value of the join attribute $attr_1$ and perform a similarity selection on the inner relation. This is performed in the same way as described in Section 6.4.1 by

1. mapping the string to a set of q-grams,

2. sending the disjunctive selection to the source,

3. post-processing the result by applying the similarity predicate, and then

4. combining each tuple of this selection result with the current tuple of the outer relation.

Algorithm 4: Bind join

foreach $t_1 \in r_1(R_1)$ do
  $s := t_1(R_1.attr_1)$
  foreach $t_2 \in \tilde{\sigma}_{edist(s,\,attr_2) \le k}(\sigma_{PRESIM}(r_2(R_2)))$ do
    output $t_1 \bowtie t_2$
  od
od
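A minimal Python sketch of the bind join is given below. The class `Source`, which simulates a source supporting only disjunctive contains queries and counts the queries it receives, as well as `map_qsamples` as a simplified map function, are hypothetical helpers introduced for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_qsamples(s, k, q=3):
    """Hypothetical map: up to k + 1 non-overlapping q-grams of s."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)][:k + 1]


class Source:
    """Simulated source that only answers disjunctive contains queries."""

    def __init__(self, rows, attr):
        self.rows, self.attr, self.queries = rows, attr, 0

    def select_contains_any(self, samples):
        self.queries += 1
        return [t for t in self.rows
                if any(x in t[self.attr] for x in samples)]


def bind_join(outer_rows, attr1, inner, k, q=3):
    """One pre-selection query against the inner source per outer tuple."""
    for t1 in outer_rows:
        s = t1[attr1]
        for t2 in inner.select_contains_any(map_qsamples(s, k, q)):
            if edit_distance(s, t2[inner.attr]) <= k:
                yield t1, t2


outer = [{"a": "similarity"}, {"a": "selection"}]
inner = Source([{"b": "simularity"}, {"b": "selektion"}, {"b": "mediator"}], "b")
pairs = list(bind_join(outer, "a", inner, k=1))
```

The counter `inner.queries` equals the number of outer tuples after the join, which is why the smaller relation is preferred as the outer relation.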

The roles of the participating relations (inner or outer relation) are determined by taking into account relation cardinalities as well as the query capabilities. If a relation is not restricted using a selection condition and does not support a full table scan it has to be used as inner relation. Otherwise, the smaller relation is chosen as the outer relation in order to reduce the number of source queries.

A significant reduction of the number of source queries can be achieved by using a semi-join variant. Here, the following general approach is used.

1. One of the relations is first processed completely.

2. The string values of the join attribute are collected and the map function is applied to each of them.

3. The resulting set $S$ of q-grams, tokens, or substrings is used to build a single pre-selection condition.

4. The result of the corresponding query is joined with the tuples from the first relation using the similarity condition.

This is shown in Algorithm 5.

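Algorithm 5: Semi join

$S := \emptyset$
foreach $t_1 \in r_1(R_1)$ do
  $S := S \cup map(t_1(R_1.attr_1))$
od
$r_{tmp} := \sigma_{\bigvee_{s \in S} contains(s,\,attr_2)}(r_2(R_2))$
foreach $t_1 \in r_1(R_1)$ do
  foreach $t_2 \in r_{tmp}$ do
    if $edist(t_1(R_1.attr_1), t_2(R_2.attr_2)) \le k$
      output $t_1 \bowtie t_2$
    fi
  od
od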
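The semi-join variant can be sketched in Python in the same style. As before, `Source` is a hypothetical simulated source that counts the queries it receives, `map_qsamples` is a simplified stand-in for map, and the case of a pre-selection exceeding the query length limit is not handled.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_qsamples(s, k, q=3):
    """Hypothetical map: up to k + 1 non-overlapping q-grams of s."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)][:k + 1]


class Source:
    """Simulated source that only answers disjunctive contains queries."""

    def __init__(self, rows, attr):
        self.rows, self.attr, self.queries = rows, attr, 0

    def select_contains_any(self, samples):
        self.queries += 1
        return [t for t in self.rows
                if any(x in t[self.attr] for x in samples)]


def semi_join(outer_rows, attr1, inner, k, q=3):
    """Collect all samples first, then issue a single pre-selection query."""
    samples = set()
    for t1 in outer_rows:
        samples |= set(map_qsamples(t1[attr1], k, q))
    r_tmp = inner.select_contains_any(samples)
    return [(t1, t2) for t1 in outer_rows for t2 in r_tmp
            if edit_distance(t1[attr1], t2[inner.attr]) <= k]


outer = [{"a": "similarity"}, {"a": "selection"}]
inner = Source([{"b": "simularity"}, {"b": "selektion"}, {"b": "mediator"}], "b")
pairs = semi_join(outer, "a", inner, k=1)
```

In contrast to the bind join, the inner source receives one query regardless of the number of outer tuples, at the price of a longer pre-selection condition and a nested comparison in the mediator.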

If the pre-selection condition exceeds the query string limit of the source, the pre-selection has to be performed in multiple steps. In the best case, this approach requires only 2 source queries, assuming that the first relation is cached in the mediator, or 3 source queries otherwise. The worst case depends on the query length limit as well as the number of derived q-grams. However, if the number of queries is greater than $|r_1| + 1$, one can always switch to the bind join implementation.

A further kind of join operation can be used if neither of the two input relations is restricted by a selection condition. Assuming that a full fetch or scan is not possible or not allowed, one could use the index containing frequent q-grams, tokens, or substrings together with their selectivity for retrieving possibly matching data from both relations. By processing the results (i.e. extracting q-grams) the index can be adjusted and extended, and in this way the subsequent retrieval operations can be focused on promising q-grams. Of course, this discovery join cannot guarantee a complete result, but it is helpful in identifying existing approximate matches.