Traversal-Based Query Execution - Execution of Queries over a Web of Linked Data 97



5.3.2. Integrated Execution Approaches

Integrated approaches may allow a query execution system to report first solutions for a (monotonic) query early, that is, before data retrieval has been completed. Furthermore, integrated approaches have the potential to require significantly less query-local memory than any separated execution approach; this holds in particular for integrated approaches that process retrieved data in a streaming manner and thus do not need to store all retrieved data until the end of the query execution process.

As with separated approaches, any type of source selection can be used as a basis for an integrated execution approach. Many combinations are conceivable; in particular, live exploration may be combined with an integrated approach in a multitude of ways. We refer to query executions that present such a combination (i.e., live exploration with an integrated execution approach) as traversal-based query executions.

5.4. Traversal-Based Query Execution

For a simple example of a traversal-based query execution strategy we recall the ER machine that we use in our proof of Theorem 4.3 (cf. Definition 4.12, page 92). The query execution strategy of this machine combines the idea of an integrated execution with source selection by live exploration: The machine alternates between link traversal phases and result computation phases. Each link traversal phase consists of traversing all those relevant data links that the machine finds in the data retrieved during the previous link traversal phase. Hence, with each link traversal phase the machine expands its information about (the reachable subweb of) the queried Web of Linked Data. After each link traversal phase the machine generates a (potentially incomplete) query result; from such a result the machine reports those solutions that did not appear in the previously generated result. While this execution strategy is sufficient for proving Theorem 4.3, the frequent recomputation of partial query results is inefficient and, thus, the strategy may not be suitable in practice.
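The alternation between link traversal phases and result computation phases can be illustrated with a minimal sketch. It is not the ER machine of Definition 4.12. It assumes a toy, in-memory "Web" (the hypothetical `TOY_WEB` dictionary from URIs to sets of triples), treats every term that starts with `http://` as a data link worth following, and recomputes the result from scratch after each phase, which is exactly the inefficiency noted above. All names (`alternate_phases`, `match_bgp`, `TOY_WEB`, `TOY_QUERY`) are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Solution = FrozenSet[Tuple[str, str]]  # hashable set of (variable, value) pairs


def is_var(term: str) -> bool:
    return term.startswith("?")


def extend(pattern: Triple, triple: Triple,
           binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Try to extend a binding so that the pattern matches the triple."""
    candidate = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if candidate.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return candidate


def match_bgp(bgp: List[Triple], data: Set[Triple]) -> Set[Solution]:
    """Evaluate a basic graph pattern over a set of triples (nested loops)."""
    partial: List[Dict[str, str]] = [{}]
    for pattern in bgp:
        extended = []
        for binding in partial:
            for triple in data:
                candidate = extend(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        partial = extended
    return {frozenset(b.items()) for b in partial}


def alternate_phases(web: Dict[str, Set[Triple]], seeds: Set[str],
                     bgp: List[Triple]) -> Iterator[Tuple[int, Set[Solution]]]:
    """Yield (phase number, solutions first seen in that phase)."""
    retrieved: Set[Triple] = set()
    visited: Set[str] = set()
    reported: Set[Solution] = set()
    frontier = set(seeds)
    phase = 0
    while frontier:
        phase += 1
        # Link traversal phase: look up every URI on the current frontier.
        new_data: Set[Triple] = set()
        for uri in frontier:
            new_data |= web.get(uri, set())
        visited |= frontier
        retrieved |= new_data
        # Assumption: every http:// term in the new data is a relevant link.
        frontier = {t for triple in new_data for t in triple
                    if t.startswith("http://")} - visited
        # Result computation phase: recompute from scratch, report the delta.
        fresh = match_bgp(bgp, retrieved) - reported
        reported |= fresh
        yield phase, fresh


# Hypothetical toy data, used only for illustration.
TOY_WEB: Dict[str, Set[Triple]] = {
    "http://ex/alice": {("http://ex/alice", "knows", "http://ex/bob")},
    "http://ex/bob": {("http://ex/bob", "knows", "http://ex/carol"),
                      ("http://ex/bob", "name", "Bob")},
    "http://ex/carol": {("http://ex/carol", "name", "Carol")},
}
TOY_QUERY: List[Triple] = [("http://ex/alice", "knows", "?x"),
                           ("?x", "name", "?n")]
```

With this toy data, the single solution becomes available after the second phase; a third phase still retrieves the document for `http://ex/carol` although it contributes nothing to the result.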

Schmedding proposes a traversal-based query execution strategy that addresses this problem [139]. The idea of this strategy is to adjust an intermediate (and potentially incomplete) query result after each link traversal phase. Schmedding’s main contribution is an extension of the SPARQL algebra operators that makes the differences between query results computed on different input data explicit; based on the extended algebra the intermediate query result can be adjusted by using only the data retrieved during the directly preceding link traversal phase (instead of recomputing the intermediate query result from scratch as done by the ER machine).
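The general idea of adjusting an intermediate result from newly retrieved data only can be sketched for a single join. The sketch below does not reproduce Schmedding's extended SPARQL algebra; it uses the standard delta rule for a monotonic join over growing inputs, and it assumes that each delta contains only solution mappings that were not supplied before. The class name `IncrementalJoin` and its methods are hypothetical.

```python
from typing import Dict, List

Mapping = Dict[str, str]  # a solution mapping: variable -> value


def compatible(a: Mapping, b: Mapping) -> bool:
    """Two mappings are compatible if they agree on all shared variables."""
    return all(b.get(var, val) == val for var, val in a.items())


def join(left: List[Mapping], right: List[Mapping]) -> List[Mapping]:
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


class IncrementalJoin:
    """Maintains L join R while both inputs grow over time."""

    def __init__(self) -> None:
        self.left: List[Mapping] = []
        self.right: List[Mapping] = []

    def add(self, delta_left: List[Mapping],
            delta_right: List[Mapping]) -> List[Mapping]:
        """Return only the join results that the deltas make newly derivable."""
        new = join(delta_left, self.right)      # delta_L join R_old
        new += join(self.left, delta_right)     # L_old join delta_R
        new += join(delta_left, delta_right)    # delta_L join delta_R
        # Assumption: deltas are disjoint from what was already stored.
        self.left += delta_left
        self.right += delta_right
        return new
```

Each call to `add` touches the previously stored inputs only to combine them with the new mappings; joins between old mappings are never recomputed.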

An alternative approach to traversal-based query execution is the strategy that we study in the following chapters. This strategy is based on a result construction process that generates any single solution of a query result incrementally (as opposed to generating incrementally the query result as a whole). We shall see that this strategy achieves an even tighter integration of data retrieval and result construction than the two aforementioned approaches.

The idea of integrating the traversal of data links into the application logic was first proposed by Berners-Lee et al. [16]. The authors outline an algorithm that traverses data links in order to obtain more data about the entities presented in the Tabulator Linked Data browser. Shinavier’s functional scripting language Ripple is based on the same idea [146]: While Ripple programs operate on Linked Data, the automatic lookup of recursively discovered URIs is an integral feature of the language. Therefore, it is not necessary to add explicit URI lookup commands to such a program. Instead, at run time the Ripple interpreter traverses data links and incrementally retrieves all Linked Data required for the execution. The earliest integration of link traversal into an execution of Linked Data queries was implemented in the Semantic Web Client Library [20].

The first research publication on Linked Data query execution describes the idea of traversal-based query execution and introduces an efficient implementation of this idea using a synchronized pipeline of iterators [79]. We follow up on this implementation approach in [72], where we propose a heuristics-based approach for query planning. These two publications provide the basis for Chapter 7 in this dissertation. In a more recent publication we introduce a general, implementation-independent formalization of a traversal-based execution strategy [75]; this formalization is the basis for the query execution model that we present in the following chapter (cf. Section 6.3, page 115ff).
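To give a rough impression of what a pipeline of iterators with integrated link traversal can look like, the following sketch chains one Python generator per triple pattern. It is a simplified illustration under stated assumptions, not the implementation from [79] or the model of Section 6.3: the Web is a hypothetical in-memory dictionary, any term starting with `http://` is treated as a URI to look up, and each iterator matches against a snapshot of the query-local data, so triples that downstream iterators retrieve later are not revisited and solutions may be missed. The names `QueryLocalDataset`, `pattern_iterator`, and `execute` are invented for this sketch.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


class QueryLocalDataset:
    """Data retrieved so far, plus the set of URIs already looked up."""

    def __init__(self, web: Dict[str, Set[Triple]]) -> None:
        self.web = web  # hypothetical stand-in for real URI lookups
        self.triples: Set[Triple] = set()
        self.looked_up: Set[str] = set()

    def ensure_looked_up(self, term: str) -> None:
        if term.startswith("http://") and term not in self.looked_up:
            self.looked_up.add(term)
            self.triples |= self.web.get(term, set())


def pattern_iterator(pattern: Triple, inputs: Iterable[Binding],
                     dataset: QueryLocalDataset) -> Iterator[Binding]:
    """Extend each incoming partial solution by matches for one pattern."""
    for binding in inputs:
        # Substitute the variables that are already bound.
        instantiated = tuple(binding.get(t, t) for t in pattern)
        # Traverse links: look up every URI in the instantiated pattern.
        for term in instantiated:
            dataset.ensure_looked_up(term)
        # Snapshot, because downstream iterators may add triples meanwhile.
        for triple in list(dataset.triples):
            extended = dict(binding)
            ok = True
            for p_term, t_term in zip(instantiated, triple):
                if p_term.startswith("?"):
                    if extended.setdefault(p_term, t_term) != t_term:
                        ok = False
                        break
                elif p_term != t_term:
                    ok = False
                    break
            if ok:
                yield extended


def execute(bgp: List[Triple],
            dataset: QueryLocalDataset) -> Iterator[Binding]:
    """Chain one iterator per triple pattern, starting from the empty binding."""
    pipeline: Iterable[Binding] = iter([{}])
    for pattern in bgp:
        pipeline = pattern_iterator(pattern, pipeline, dataset)
    return iter(pipeline)
```

In this sketch each solution is assembled pattern by pattern, and URIs are looked up only when they occur in a partially instantiated pattern, so data retrieval is driven directly by the construction of individual solutions.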

While our work on implementing traversal-based query execution focuses on iterators, other authors introduce alternative implementation approaches:

• Ladwig and Tran propose an implementation approach that uses symmetric hash join operators which are connected via an asynchronous, push-based pipeline [99]. In later work, the authors extend this approach and introduce the symmetric index hash join operator. This operator allows a query execution system to incorporate a query-local RDF data set into the query execution [100].

• Miranker et al. introduce another push-based implementation [115]. The authors implement traversal-based query execution using Forgy’s Rete match algorithm (originally introduced in [51]).
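For readers unfamiliar with the symmetric hash join mentioned in the first item above, the following minimal sketch shows the general operator in a push-based style. It is a textbook-style illustration, not Ladwig and Tran's implementation, and it does not cover the symmetric index hash join. It assumes that both inputs bind all join variables and that non-join variables do not clash; the class name `SymmetricHashJoin` and the `emit` callback are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

Binding = Dict[str, str]


class SymmetricHashJoin:
    """Push-based join that keeps one hash table per input."""

    def __init__(self, join_vars: Tuple[str, ...],
                 emit: Callable[[Binding], None]) -> None:
        self.join_vars = join_vars
        self.emit = emit  # downstream consumer, called once per join result
        self.tables = (defaultdict(list), defaultdict(list))

    def push(self, side: int, binding: Binding) -> None:
        """Accept a binding from input 0 or 1, in any arrival order."""
        key = tuple(binding[v] for v in self.join_vars)
        # Build: insert the binding into the hash table of its own input.
        self.tables[side][key].append(binding)
        # Probe: combine it with matching bindings seen on the other input.
        for other in self.tables[1 - side].get(key, []):
            self.emit({**other, **binding})
```

Because every arriving binding is both inserted and probed, results are emitted as soon as both matching bindings have arrived, regardless of which input delivers first. This property suits a setting in which data arrives asynchronously from URI lookups.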

Since traversal-based query execution approaches combine an integrated execution and live exploration, they inherit the advantages and limitations of these two strategies (as discussed in Sections 5.1.1 and 5.3.2). That is, like all live exploration systems, traversal-based query execution systems are able to make use of data from initially unknown data sources and can readily be used without first populating and maintaining supporting data structures. Furthermore, a traversal-based query execution system can be built to report first solutions early. On the downside, data retrieval may not be parallelized as effectively as is possible with index-based source selection; moreover, a sparsity of data links reduces the chances for discovering potentially relevant data and may thus result in missing a larger number of solutions for queries under full-Web semantics.

5.5. Summary

We conclude our discussion of query execution techniques for Linked Data queries by classifying existing approaches in Table 5.1. For the classification we use the three dimensions introduced in this chapter.


Publication                                    Source Selection    Source Ranking    Integr. Exec.
---------------------------------------------------------------------------------------------------
Harth et al. [67,159]                          index-based         yes               no
Ladwig and Tran [99] (“bottom up”)*            live exploration    yes               yes
Ladwig and Tran [99] (“top down”)              index-based         yes               yes
Ladwig and Tran [99] (mixed strategy)*         hybrid              yes               yes
Ladwig and Tran [100]*                         live exploration    no                yes
Miranker et al. [115]*                         live exploration    no                yes
Paret et al. [126]                             index-based         no                no
Schmedding [139]*                              live exploration    no                yes
Tian et al. [156]                              index-based         no                n/a
Umbrich et al. [159] (multidim. histograms)    index-based         yes               no
Umbrich et al. [159] (schema-level index)      index-based         no                no
Umbrich et al. [159] (inverted URI index)      index-based         no                no
Wagner et al. [162]                            index-based         yes               yes
our work*                                      live exploration    no                yes

Table 5.1.: Classification of existing work on Linked Data query execution along the dimensions of (i) data source selection, (ii) data source ranking, and (iii) integration of data retrieval and result construction. Approaches marked with an asterisk (*) are traversal-based query execution approaches.
