
relative distance. Figure 5.9 shows the distribution of the relative edit distances in the previously mentioned example relation. Using the first global minimum, around 0.8, as a threshold and analysing the matches in this area shows that this choice produces very few over- and under-identified tuples.
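
The following sketch illustrates this kind of threshold derivation. It is a minimal sketch in Python, assuming the relative distance is the edit distance normalised by the length of the longer string; the bucket width and the simple dip detection are stand-ins for the fuller analysis behind Figure 5.9.

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def relative_distance(a: str, b: str) -> float:
    # Edit distance normalised by the longer string, yielding values in [0, 1].
    return levenshtein(a, b) / max(len(a), len(b), 1)

def first_minimum(distances, bucket=0.05):
    # Bucket the observed relative distances and return the left edge of the
    # first local minimum of the histogram -- the threshold heuristic above.
    hist = Counter(int(d / bucket) for d in distances)
    counts = [hist.get(i, 0) for i in range(int(1 / bucket) + 1)]
    for i in range(1, len(counts) - 1):
        if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]:
            return i * bucket
    return None
```

Applied to all pairwise relative distances of an attribute column, first_minimum() would return a value near 0.8 for a distribution shaped like the one in Figure 5.9.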

A successive adjustment of similarity predicates using information from analytical data processing is also of interest for the creation of user-defined similarity predicates. For instance, directly applying the edit distance to author names and their various representations will yield poor results. Combining analytical processing with a stepwise addition of canonising techniques, such as transformation to lower or upper case, tokenising, abbreviation matching, etc., as mentioned in Section 3.3, quickly leads to more meaningful distributions that can be used to derive a threshold value.
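
As a minimal sketch of such a stepwise canonisation, assuming purely illustrative name formats (the abbreviation-matching step from Section 3.3 is only hinted at in a comment):

```python
import re

def canonise(name: str) -> str:
    # Stepwise canonisation before applying the edit distance:
    # 1. case folding, 2. tokenising on whitespace and punctuation,
    # 3. sorting tokens so that "Doe, John" and "John Doe" align.
    # A further step could expand abbreviations via a lookup table.
    tokens = [t for t in re.split(r"[^\w]+", name.lower()) if t]
    return " ".join(sorted(tokens))

# "Doe, J." and "J. Doe" canonise to the same string "doe j", so their
# edit distance drops to 0, where the raw forms differ substantially.
assert canonise("Doe, J.") == canonise("J. Doe")
```

Each added step changes the distance distribution, which can then be re-inspected analytically to check whether a clearer separation between matches and non-matches emerges.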

often a viable approach.

However, choosing the right thresholds and combinations of predicates during the design phase of an integrated system often requires several trial-and-error cycles. This process can be supported by analytical processing steps as shown in Section 5.5 and by corresponding tools.

Chapter 6

Re-writing Similarity-based Queries for Virtual Data Integration

While Chapter 4 introduced the foundations of similarity-based operations and Chapter 5 covered the implementation of such operations for temporarily or persistently materialised result sets, this chapter addresses the problems of distributed processing of similarity-based operations in heterogeneous environments. For this purpose, special concepts are required to handle similarity predicates that the integrated systems possibly do not support. Again, the description focuses on string similarity measures and on re-writing queries containing such predicates in a way that allows source systems to answer them.

The implementations provided here include similarity-based selections and joins, but not the previously described similarity-based grouping. This is because that operation is hardly applicable across various sources when there are no further constraints on the input set. If such constraints exist, the source selections representing them are processed first, as introduced in this chapter, and then grouping and aggregation can take place as described in the previous chapter.

6.1 Introduction

To address the problem of data-level conflicts in weakly related or overlapping data sets from different sources, similarity-based operations were introduced in data integration research. Unfortunately, the support for such operations in current data management solutions is rather limited. Worse, interfaces provided over the Web are even more limited and almost never allow any similarity-based lookup of information. The specification of query capabilities is addressed, for instance, by Vassalos et al. in [VP97] and by the author of this thesis and Endig in [SE00]. The majority of attributes used for querying are string attributes, but while string similarity can be expressed using, for instance, the Levenshtein distance, common interfaces only support lookups based on equality or on substring and keyword containment. While such predicates cannot be used to perform similarity selections or joins directly, they can be used to efficiently find candidate sets, as described in this chapter.

The principal idea of the presented approach is to provide a pre-selection for string similarity operations using string containment operations, which are provided by all databases and most information systems. Regarding the pre-selection, this approach is similar to the ones introduced by Gravano et al. in [GIJ+01] and extended in [GIKS03]. Contrary to their pre-selection strategy, the one presented here is not only applicable in scenarios where integrated data sets, or data sets in general, are materialised in one database, but also allows re-writing string similarity queries for the virtual integration of autonomous sources. This way, it is applicable in Web integration scenarios.

Another related approach, applicable only in materialised scenarios, is described by Jin et al. in [JLM03]. It is based on FastMap, introduced by Faloutsos and Lin in [FL95] and briefly described in Section 3.2 of this thesis.

Nevertheless, this approach requires the full domain of string values to define a mapping to an n-dimensional space, as well as appropriate interfaces for efficient lookup.

The pre-selection proposed here is based on the edit or Levenshtein distance as introduced in Sections 3.2.2 and 3.3 of this thesis, which expresses the dissimilarity of two strings by the minimal number k of operations necessary to transform one string into a comparison string. A basic observation, described for instance by Navarro and Baeza-Yates in [NBY98], is that if we pick any k + 1 non-overlapping substrings of one string, at least one of them must be fully contained in the comparison string. This corresponds to Count Filtering as introduced by Gravano, where the number of common q-grams (substrings of fixed length q) in two strings is used as a criterion. So, when searching a large pool of string data, we may find a candidate set by selecting all strings containing at least one of these k + 1 chosen substrings. Based on this observation, Navarro and Baeza-Yates use q-gram indexes in their approach for approximate searches within texts in an information retrieval context.
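
A minimal sketch of this observation, assuming an equal-length split of the query string (a selectivity-aware choice of substrings is discussed below):

```python
def pieces(s: str, k: int) -> list[str]:
    # Split s into k + 1 non-overlapping substrings of (almost) equal length.
    # Each of the at most k edit operations can affect only one piece, so at
    # least one piece survives unchanged in any string within distance k.
    # The guarantee requires len(s) >= k + 1, so that no piece is empty.
    step = max(1, len(s) // (k + 1))
    return [s[i:i + step] for i in range(0, step * k, step)] + [s[step * k:]]

def candidates(strings, query, k):
    # Pre-selection by containment: keep every string holding at least one
    # piece verbatim; the actual edit distance is verified afterwards.
    ps = [p for p in pieces(query, k) if p]
    return [t for t in strings if any(p in t for p in ps)]

print(pieces("levenshtein", 2))   # ['lev', 'ens', 'htein']
```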

The problem with selecting suitable substrings for the pre-selection is that we cannot use Length Filtering and Position Filtering as described in [GIJ+01] to further refine the pre-selection, because we cannot access the necessary information in a non-materialised scenario. Moreover, if we choose inappropriate substrings, the candidate sets can be huge. In this case, the question is: which substrings are appropriate? Obviously, we can minimise the size of the intermediate result by finding the k + 1 non-overlapping substrings having the best selectivity when combined in one disjunctive query. Processing a string similarity predicate then requires the following steps (a sketch in code follows the list):

1. Transform the similarity predicate into an optimal disjunctive substring pre-selection query, considering selectivity information.

2. Process the pre-selection using the standard functionality of the information system, yielding a candidate set.

3. Evaluate the actual similarity predicate within a mediator, or implement it as a user-defined function in a standard DBMS.
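
A sketch of step 1 under simplifying assumptions: the table and column names are hypothetical, the substrings come from the equal split shown above rather than from selectivity statistics, and the naive string interpolation stands in for proper parameter binding:

```python
def pre_selection_sql(table: str, column: str, query: str, k: int) -> str:
    # Step 1: rewrite the similarity predicate into a disjunctive substring
    # pre-selection that any SQL source can answer (step 2).  A selectivity-
    # aware variant would instead pick the k + 1 non-overlapping substrings
    # with the smallest combined match frequency.
    ps = [p for p in pieces(query, k) if p]   # pieces() from the sketch above
    conds = " OR ".join(f"{column} LIKE '%{p}%'" for p in ps)
    return f"SELECT * FROM {table} WHERE {conds}"

# Step 3 then verifies each candidate row in the mediator, keeping only
# rows with levenshtein(row[column], query) <= k.
print(pre_selection_sql("authors", "name", "levenshtein", 2))
# -> SELECT * FROM authors WHERE name LIKE '%lev%' OR name LIKE '%ens%' OR name LIKE '%htein%'
```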

While this sketches only a simple selection, we will describe later how, for instance, similarity joins over diverse sources can be executed based on bind joins as described by Roth and Schwarz in [RS97]. Furthermore, we will discuss the advantages and disadvantages of the kinds of substrings used, whether arbitrary substrings, q-samples (substrings of fixed length), or tokens.
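
To make the bind-join idea concrete, here is a sketch reusing pre_selection_sql() and levenshtein() from the earlier sketches; inner_query is a hypothetical wrapper that ships a SQL string to the remote source and returns rows as dictionaries:

```python
def similarity_bind_join(outer_rows, inner_query, table, column, k):
    # Bind join in the style of [RS97]: for each binding from the outer
    # source, a rewritten pre-selection is shipped to the inner source,
    # and the mediator verifies the candidates with the edit distance.
    for row in outer_rows:
        sql = pre_selection_sql(table, column, row[column], k)
        for cand in inner_query(sql):
            if levenshtein(row[column], cand[column]) <= k:
                yield row, cand
```

This way only the comparatively small candidate sets cross the network, which matches the optimisation goal discussed below.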

We have to point out that even though substring queries can easily be optimised, many systems, including well-known relational DBMS, fail to do so. Hence, step 2 of the processing outlined above may or may not be executed efficiently by the integrated source systems. Nevertheless, in virtual integration the key aspect very often is to minimise the size of the intermediate results that have to be transferred from a source to the mediator. And most of all, in such scenarios we cannot expect the source systems to provide any interface for similarity searches.