Contributions - Querying a Web of Linked Data

defines queries as functions over an “RDF dataset” [63], that is, a fixed, a priori defined collection of sets of RDF triples; therefore, given that the Web of Data is not such an RDF dataset, the expected result for evaluating a SPARQL expression over the Web of Data remains undefined in most of the literature on Linked Data query processing.

We note that some publications exist that propose such a missing query semantics to use SPARQL for the Web of Data [26, 66, 160] (for a detailed discussion of these proposals refer to Section 3.1.3, page 35ff). However, these proposals are limited to conjunctive queries and they lack an analysis of the computational feasibility (or other properties) of queries under the proposed semantics (let alone a proof showing that a Linked Data query processing approach correctly supports such a query semantics).

1.4. Contributions

The primary goal of this dissertation is to formally establish fundamental properties and limitations of Linked Data query processing. To this end, we address the aforementioned problems by making the following main contributions:

1. We establish well-defined foundations for Linked Data query processing (assuming SPARQL as our query language). In particular, these foundations include:

a) a data model that formally represents the Web of Data,

b) a computation model that captures the capabilities of systems whose access to data sources relies only on the Linked Data principles, and

c) multiple, well-defined query semantics for using the complete core fragment of SPARQL as a query language for Linked Data on the WWW.

2. We study computational feasibility and related properties of queries under the proposed query semantics.

3. We provide a comprehensive overview and classification of query execution techniques that are used in existing query processing approaches for Linked Data.

4. We show soundness and completeness of a “traversal-based query execution” approach [75], which is among the most prevalent query processing approaches for Linked Data.

5. We investigate the suitability of iterators for implementing traversal-based query execution.

In the following we provide a more detailed description of our contributions.

Formal Framework

The basis of this dissertation is a formal framework that enables us to define query semantics of queries over Linked Data and to study the computational feasibility of queries under such query semantics. Consequently, our formal framework consists of a data model and a computation model.

The data model formalizes the idea of Linked Data by introducing an abstract structure called the Web of Linked Data. We emphasize that such a Web may be infinitely large in our data model; allowing for an infinite Web of Linked Data enables us to capture the existence of Web servers that are able to respond to an infinite number of URIs by generating Linked Data on the fly (as illustrated in Example 1.2, page 3). Our data model also introduces the concept of a Linked Data query. This concept formalizes the notion of queries over Linked Data.
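To make the notion of an infinite Web of Linked Data more concrete, the following minimal sketch models a server that generates data on the fly. Everything in it is an illustrative assumption rather than part of the formal data model: the `ex:n/<k>` URI scheme, the `ex:next` predicate, and the function names `lookup` and `walk` are hypothetical.

```python
from itertools import islice

def lookup(uri):
    """Hypothetical server: answers every URI of the form 'ex:n/<k>'."""
    prefix = "ex:n/"
    if not uri.startswith(prefix) or not uri[len(prefix):].isdigit():
        return set()  # no data for any other URI
    k = int(uri[len(prefix):])
    # The returned triple links to the next URI, which is answered likewise.
    return {(uri, "ex:next", f"{prefix}{k + 1}")}

def walk(start):
    """Follow ex:next links lazily, stopping only if a lookup yields no data."""
    uri = start
    while True:
        triples = lookup(uri)
        if not triples:
            return
        (triple,) = triples
        yield triple
        uri = triple[2]

# Only a finite prefix of the infinite chain can ever be materialized.
first_three = list(islice(walk("ex:n/0"), 3))
```

Because each generated document links to a further URI that the same server also answers, a system that follows every link from `ex:n/0` would never finish; this is the kind of situation the infinite data model is meant to capture.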

In addition to the data model, our formal framework comprises a computation model. This model allows us to formally identify those Linked Data queries for which a complete computation is feasible. To this end, the model captures the capabilities of systems that aim to make use of data available as per the Linked Data principles (e.g., for query computation or for answering decision problems about Linked Data on the WWW).

We emphasize that our data model and our computation model are independent of any particular query language (such as SPARQL) or query processing approach (such as traversal-based query execution). Hence, these models present a basis not only for the work in this dissertation but also for future work related to the foundations of query processing for Linked Data.

Query Semantics

This dissertation focuses on Linked Data queries that are represented using the complete core fragment of the RDF query language SPARQL (which includes conjunctions, disjunctions, optional parts, and filter constraints over values bound to query variables).

To use SPARQL in our context, we have to adjust the semantics of SPARQL expressions. More precisely, we redefine the scope for evaluating SPARQL expressions. In this dissertation we propose (and study) two approaches for such an adjustment. The first approach uses a query semantics where the scope of a query is the complete set of Linked Data on the Web. We call this semantics full-Web semantics. The second approach introduces a family of reachability-based semantics which restrict the scope to data that is reachable by traversing a well-defined set of data links.
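The difference between the two scopes can be illustrated with a small sketch over a finite toy Web. This is not the formal definition: the `WEB` dictionary, the `ex:` URIs, the function `reachable_data`, and the two criteria `follow_all` and `follow_none` are hypothetical names introduced here, and the traversal is only guaranteed to terminate because the toy Web is finite.

```python
from collections import deque

# Hypothetical toy Web: dereferencing a URI yields a set of RDF-like triples.
WEB = {
    "ex:alice": {("ex:alice", "ex:knows", "ex:bob")},
    "ex:bob": {("ex:bob", "ex:knows", "ex:carol")},
    "ex:carol": {("ex:carol", "ex:name", "Carol")},
    "ex:dave": {("ex:dave", "ex:name", "Dave")},  # not linked from anywhere
}

def reachable_data(seeds, follow):
    """Collect the triples of all documents reachable from the seed URIs.

    `follow(triple, uri)` decides whether the link to `uri` found in
    `triple` may be traversed; it stands in for a reachability criterion.
    """
    seen, queue, data = set(), deque(seeds), set()
    while queue:
        uri = queue.popleft()
        if uri in seen or uri not in WEB:
            continue  # already visited, or the lookup yields no document
        seen.add(uri)
        for triple in WEB[uri]:
            data.add(triple)
            for term in triple:
                # Only URIs (here: terms with the assumed 'ex:' prefix) are links.
                if term.startswith("ex:") and follow(triple, term):
                    queue.append(term)
    return data

# Scope in the spirit of full-Web semantics: all data in the toy Web.
full_web = set().union(*WEB.values())

# Two illustrative criteria: traverse every link, or traverse none.
follow_all = lambda triple, uri: True
follow_none = lambda triple, uri: False

scope_all = reachable_data(["ex:alice"], follow_all)
scope_none = reachable_data(["ex:alice"], follow_none)
```

In this sketch, `scope_all` contains the data of `ex:alice`, `ex:bob`, and `ex:carol` but not of `ex:dave`, whereas `full_web` contains all four documents; `scope_none` is restricted to the seed document.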

Formal Analysis of SPARQL-Based Linked Data Queries

By introducing the aforementioned query semantics, we contribute a well-defined foundation for Linked Data query processing. However, instead of merely defining query semantics, we aim to understand the consequences of using these semantics. Therefore, we formally analyze several theoretical properties of queries under these semantics: Most importantly, we study the computational feasibility of such queries and the feasibility of deciding whether a query execution terminates. For this analysis we use our computation model. Perhaps the most important result of this analysis is a formal verification of the common, as yet unverified assumption that “processing queries against Linked Data where sources have to be discovered online might not yield all results” [99]. However, we also identify the cases in which, at least in theory, an expected query result may be computed completely by an execution that is guaranteed to terminate.

Furthermore, we study basic properties such as satisfiability and monotonicity, and we discuss the implications of querying an infinite Web of Linked Data. For the reachability-based query semantics, we also discuss the relationships between different notions of reachability and the impact of these notions on queries and their properties. Finally, we identify commonalities and differences between the different query semantics.

Classification of Possible Query Execution Techniques

Multiple approaches to process Linked Data queries have been presented in the literature. Each of these approaches introduces a number of (complementary) query execution techniques. Some of the techniques proposed for different approaches implement the same abstract idea and, thus, are conceptually similar; other techniques are very different from each other (or serve different purposes).

No systematic overview of the state of the art in executing Linked Data queries exists that reviews all of these techniques separately from the particular approaches in whose context they are introduced. To fill this gap, we introduce a classification that categorizes possible query execution techniques along three orthogonal dimensions: (i) data source selection, (ii) data source ranking, and (iii) integration of data retrieval and result construction. For each of the dimensions, we provide a comprehensive conceptual comparison of the techniques in that dimension.

Formal Analysis of Traversal-Based Execution

One of the most prevalent approaches to execute Linked Data queries is “traversal-based query execution” [75], which takes advantage of the characteristics of the Web of Data. The fundamental approach is to intertwine the traversal of data links with the construction of the query result, thus integrating the discovery of data into the query execution process. Hence, in contrast to more traditional query execution approaches, this approach does not assume a fixed set of relevant data sources beforehand; instead, it uses data from initially unknown data sources for answering queries and, therefore, enables applications to tap the full potential of the WWW.

While different implementations of the idea of traversal-based query execution have been published [79, 99, 115], we are interested in whether the general approach is sound and complete w.r.t. the query semantics that we introduce. Therefore, we define an abstract query execution model that formalizes the idea of traversal-based query execution; i.e., this model captures the approach of intertwining link traversal and result construction independently of particular implementation techniques. Based on this model we prove the soundness and completeness of the new query execution paradigm.

Analysis of Link Traversing Iterators

As mentioned before, the idea of traversal-based query execution may be implemented using different techniques. In this dissertation we aim to achieve an understanding of the suitability of classical database techniques for such a purpose. In particular, we focus on the well-known iterator model [56].

We define a link traversing iterator. The main feature of this iterator is that calling its GetNext function (to obtain intermediate results for computing a query) has the desired side effect of a dynamic, traversal-based retrieval of Linked Data. Hence, a pipeline of such iterators continuously augments the query-local dataset over which it operates.

We prove that such a pipeline presents a sound implementation of our abstract model of traversal-based query execution. However, this implementation approach cannot guarantee completeness of computed query results. In a formal and experimental analysis we study this limitation, as well as other properties of the implementation approach.
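The idea of a GetNext call with a link-traversal side effect can be sketched as follows. This is a simplified illustration under stated assumptions, not the SQUIN implementation: the toy `WEB`, the classes `QueryLocalDataset`, `RootIterator`, and `LinkTraversingIterator`, the method name `get_next`, and the convention that variables start with `?` are all hypothetical.

```python
# Hypothetical toy Web: dereferencing a URI yields a set of triples.
WEB = {
    "ex:alice": {("ex:alice", "ex:knows", "ex:bob")},
    "ex:bob": {("ex:bob", "ex:phone", "555-0100")},
}

class QueryLocalDataset:
    """Triples retrieved so far; grows while the pipeline runs."""
    def __init__(self):
        self.triples, self.fetched = set(), set()

    def dereference(self, term):
        # Look up each term at most once; non-URIs simply yield no data here.
        if term not in self.fetched:
            self.fetched.add(term)
            self.triples |= WEB.get(term, set())

class RootIterator:
    """Returns one empty solution mapping, then signals exhaustion."""
    def __init__(self):
        self.done = False

    def get_next(self):
        if self.done:
            return None
        self.done = True
        return {}

class LinkTraversingIterator:
    def __init__(self, pattern, predecessor, dataset):
        self.pattern, self.pred, self.ds = pattern, predecessor, dataset
        self.pending = []  # solutions computed but not yet returned

    def _match(self, binding):
        # Substitute bound variables, then match against a dataset snapshot.
        bound = tuple(binding.get(t, t) for t in self.pattern)
        for triple in list(self.ds.triples):
            candidate, ok = dict(binding), True
            for p, v in zip(bound, triple):
                if p.startswith("?"):
                    if candidate.setdefault(p, v) != v:
                        ok = False
                elif p != v:
                    ok = False
            if ok:
                yield candidate

    def get_next(self):
        while not self.pending:
            binding = self.pred.get_next()
            if binding is None:
                return None  # predecessor exhausted
            self.pending = list(self._match(binding))
        solution = self.pending.pop()
        # Side effect: follow the links in the solution before returning it.
        for value in solution.values():
            self.ds.dereference(value)
        return solution

ds = QueryLocalDataset()
ds.dereference("ex:alice")  # seed the query-local dataset
it1 = LinkTraversingIterator(("ex:alice", "ex:knows", "?p"), RootIterator(), ds)
it2 = LinkTraversingIterator(("?p", "ex:phone", "?n"), it1, ds)
results = list(iter(it2.get_next, None))
```

Here the second iterator finds the phone number only because the first iterator retrieved the data of `ex:bob` as a side effect of returning its solution. Note that each iterator in this sketch matches against the dataset as it exists at call time, so triples discovered later are not revisited; this is one plausible way in which a pipeline of this kind can miss solutions.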

Technical Contributions

In addition to the aforementioned research contributions, we developed a complete query processing system for Linked Data queries during our work on this dissertation. This system, called SQUIN, is implemented in Java, and it consists of more than 10K lines of native source code (which is available as Free Software on the SQUIN project homepage at http://squin.org). The query engine in SQUIN performs a traversal-based execution of Linked Data queries and has been implemented based on the iterator approach as studied in this dissertation.

By using SQUIN, the aforementioned example query (cf. Example 1.3 on page 4) can be executed live on the WWW. On September 16, 2013, such an execution resulted in obtaining the phone numbers of two relevant authors. During this execution, SQUIN discovered (and used) Linked Data from 14 different Web sites.
