SPARQL as a Query Language for Linked Data

3.1.3. SPARQL as a Query Language for Linked Data

Instead of developing a new language for Linked Data queries, in this dissertation we focus on using the RDF query language SPARQL for this purpose. We shall see that this approach allows users who are not interested in prescribing particular navigation paths to express queries over Linked Data without knowing anything about the link graph of the queried Web of Linked Data. Another motivation for studying SPARQL-based Linked Data queries is the focus on such queries in existing works on processing Linked Data queries. While we postpone a discussion of these works to Chapter 5, we emphasize that most of them lack a precise definition of the query semantics assumed for the supported queries.

The theoretical properties of SPARQL as a query language for fixed, a priori defined collections of RDF data are well understood today [7, 127, 128, 140]. Particularly interesting in our context are semantic equivalences between SPARQL expressions [140]; these equivalences may also be used for optimizing SPARQL-based Linked Data queries.

Bouquet et al. were the first to provide a formalization for using (a fragment of) SPARQL as a language for Linked Data queries [26]. Other proposals have been published by Harth and Speiser [66] and by Umbrich et al. [160]; a first version of one of the reachability-based query semantics presented in this dissertation can be found in our earlier work on Linked Data query processing [72]. The remainder of this section describes these proposals in detail and informally compares the respectively introduced query semantics to the query semantics that we study in both this and the following chapter.

Bouquet et al. formalize three “query methods” for conjunctive queries [26]. In terms of our data model these methods can be characterized as follows:

“Bounded method”: This method assumes that queries include a specification of a set of LD documents. The evaluation of such a query is then restricted to the data in these documents. This approach corresponds to the most restrictive version of our reachability-based query semantics, namely, our notion of c_None-semantics (cf. Section 4.1.2, page 63f).

“Navigational method”: This method is based on a notion of reachability that assumes a recursive traversal of all data links in a queried Web. The result of a query must be computed by taking into account all data that can be discovered by starting such a traversal from a designated LD document (specified as part of the query). This navigational method prescribes a query semantics that is equivalent to what we call c_All-semantics (cf. Section 4.1.2); it is the most general of our reachability-based semantics. Bouquet et al.’s navigational method does not support other, more restrictive notions of reachability, as is possible with our model.

“Direct access method”: For this method, Bouquet et al. assume an oracle that, for a given query, provides a set of “relevant” LD documents (from a queried Web of Linked Data). Without discussing their understanding of relevance any further, the authors define an expected query result based on such a set of relevant documents. Due to the undefined basis of this definition, it is unclear how Bouquet et al.’s direct access method is related to the query semantics discussed in this dissertation.

After introducing these query methods, Bouquet et al. use model theory to establish a formalism for interpreting query results [26]. However, an analysis of the properties or the feasibility of the three query methods is missing from Bouquet et al.’s work.
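The difference between the bounded and the navigational method can be illustrated with a minimal sketch. It assumes a toy Web of Linked Data W = (D, data, adoc) encoded as two Python dictionaries; the document names, URIs, and function names are hypothetical and do not appear in [26].

```python
from collections import deque

# Toy Web of Linked Data W = (D, data, adoc); all names and URIs are hypothetical.
DATA = {
    "docA": {("ex:alice", "foaf:knows", "ex:bob")},
    "docB": {("ex:bob", "foaf:knows", "ex:carol")},
    "docC": {("ex:carol", "foaf:name", '"Carol"')},
    "docD": {("ex:dave", "foaf:name", '"Dave"')},  # not linked from docA
}
ADOC = {"ex:alice": "docA", "ex:bob": "docB", "ex:carol": "docC", "ex:dave": "docD"}


def bounded_docs(specified):
    """Bounded method: only the documents named in the query are considered."""
    return {doc for doc in specified if doc in DATA}


def navigational_docs(seed):
    """Navigational method: recursively follow every data link from a seed document."""
    reached, queue = {seed}, deque([seed])
    while queue:
        doc = queue.popleft()
        for triple in DATA[doc]:
            for term in triple:
                target = ADOC.get(term)  # None for literals and URIs without a document
                if target is not None and target not in reached:
                    reached.add(target)
                    queue.append(target)
    return reached


def triples_of(docs):
    """Union of the RDF triples in the given documents."""
    return set().union(*(DATA[doc] for doc in docs))
```

Under these assumptions, `navigational_docs("docA")` discovers docA, docB, and docC but never docD, whereas `bounded_docs(["docA"])` stays with docA alone; a query would then be evaluated over `triples_of(...)` of the respective document set.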

Harth and Speiser also propose several query semantics for conjunctive Linked Data queries [66]. These semantics use authoritativeness of data sources to restrict the evaluation of queries to particular subsets of all data in a queried Web. In terms of our data model, the authors call an LD document d “subject-authoritative” for an RDF triple t = (s, p, o) in a Web of Linked Data W = (D, data, adoc), if s is a URI and adoc(s) = d; analogously, LD documents may be predicate-authoritative and object-authoritative for a given RDF triple. Based on these notions of authoritativeness, the authors introduce a formalism that allows users to express what data they consider relevant for a query.

More precisely, users may specify that the only RDF triples relevant for evaluating a query are those that are available in their authoritative documents (or in particular subsets thereof); any other triple is considered irrelevant and must be ignored. These restrictions may be specified separately for each predicate of a given conjunctive query.
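The notion of authoritativeness stated above can be sketched in a few lines. This is only an illustration in terms of our data model, assuming that adoc is given as a dictionary and that literals are written in double quotes; the helper names and example data are hypothetical, and the filter at the end is a simplified reading of an authority restriction, not Harth and Speiser's formalism.

```python
POSITION = {"subject": 0, "predicate": 1, "object": 2}

# Hypothetical example data: docA also contains a triple about ex:bob,
# although docB is the document that adoc assigns to ex:bob.
EXAMPLE_DATA = {
    "docA": {("ex:alice", "foaf:knows", "ex:bob"), ("ex:bob", "foaf:name", '"Bob"')},
    "docB": {("ex:bob", "foaf:knows", "ex:carol")},
}
EXAMPLE_ADOC = {"ex:alice": "docA", "ex:bob": "docB"}


def is_uri(term):
    # Assumption of this sketch: literals start with a quote, blank nodes with "_:"
    return not (term.startswith('"') or term.startswith("_:"))


def is_authoritative(doc, triple, adoc, position="subject"):
    """True if the term at the given position is a URI u with adoc(u) = doc."""
    term = triple[POSITION[position]]
    return is_uri(term) and adoc.get(term) == doc


def relevant_triples(data, adoc, position="subject"):
    """Keep only (document, triple) pairs where the document is authoritative."""
    return {
        (doc, triple)
        for doc, triples in data.items()
        for triple in triples
        if is_authoritative(doc, triple, adoc, position)
    }
```

With the example data, the triple about ex:bob that is published in docA is dropped under a subject-authority restriction, because adoc maps ex:bob to docB.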

In addition to these data-specific “authority restrictions” [66], Harth and Speiser introduce three “completeness classes.” For any query, such a class designates particular documents from a queried Web such that, by definition, these documents are considered “completely sufficient” for the query [66]. Hence, these completeness classes may be understood as document-specific restrictions on the relevance of data. The authors’ formalism allows users to combine any of the three completeness classes with any possible set of authority restrictions. Then, an RDF triple needs to be considered for a predicate of a given (conjunctive) query if and only if (i) the triple is available in its authoritative documents (or in a specified subset thereof) and (ii) these documents qualify according to the completeness class used. Thus, depending on which completeness class and authority restrictions are used, a different query semantics ensues.

Unfortunately, Harth and Speiser’s work lacks a proper formal definition of one of the key concepts for specifying authority restrictions (that is, the concept of an “authoritative lookup,” represented by a function called deref_a [66, Definition 10]). Therefore, it is impossible to discuss Harth and Speiser’s query semantics in detail or to provide an informed comparison with the query semantics discussed in this dissertation.

Umbrich et al. define five different query semantics for conjunctive Linked Data queries and analyze them empirically [160]. The first of these semantics is the reachability-based query semantics presented in our earlier work [72]. In this dissertation we shall refer to this semantics as c_Match-semantics (cf. Section 4.1.2). In comparison to the reachability-based c_All-semantics that corresponds to the aforementioned navigational method of Bouquet et al. [26], we shall see that c_Match-semantics is more restrictive (w.r.t. what data is considered relevant for evaluating queries).

However, the main contribution of the work by Umbrich et al. consists of several query semantics that extend c_Match-semantics in order to “benefit [from] inferable knowledge” [160]. Thus, these extensions take into account additional RDF triples that can be inferred from data available on the queried Web. In particular, these query semantics integrate (i) lightweight RDFS reasoning [119] (restricted to a fixed, a priori defined set of vocabularies), and (ii) inference rules for RDF triples with the predicate owl:sameAs [142]. The latter allows for making use of information about coreferenced entities because owl:sameAs is commonly used to indicate coreferencing URIs in Linked Data [42].
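To make the effect of owl:sameAs inference concrete, the following is a minimal sketch, not the rule set used by Umbrich et al. It assumes that owl:sameAs is treated as an equivalence relation and that every triple is copied to all coreferent URIs in subject and object position; replacement in predicate position is omitted for brevity, and all URIs and function names are hypothetical.

```python
from itertools import product

SAME_AS = "owl:sameAs"


def coreference_classes(triples):
    """Map each URI that occurs in an owl:sameAs triple to its set of coreferent URIs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)  # merge the two classes

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return {term: members for members in classes.values() for term in members}


def infer_same_as(triples):
    """Return the input triples plus all triples obtained by swapping coreferent URIs."""
    aliases = coreference_classes(triples)
    inferred = set()
    for s, p, o in triples:
        # Terms without any owl:sameAs statement only stand for themselves
        for s2, o2 in product(aliases.get(s, {s}), aliases.get(o, {o})):
            inferred.add((s2, p, o2))
    return inferred
```

For instance, given `("ex:bob", "owl:sameAs", "dbp:Bob")` and `("dbp:Bob", "foaf:name", '"Bob"')`, the sketch also yields `("ex:bob", "foaf:name", '"Bob"')`, so a query that mentions only ex:bob can find the name.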

In an empirical analysis, Umbrich et al. compare their extended, inference-based query semantics to c_Match-semantics (which does not integrate inference rules) [160]. This analysis shows that the number of solutions in a query result under any of the inference-based semantics is usually greater than in the result for the corresponding query under c_Match-semantics. The price for such an increase in “recall” [160] is an increase in average query execution times because, for a complete execution of queries under the inference-based semantics, it becomes necessary to look up more URIs than under c_Match-semantics.

We consider extending Linked Data queries with features for leveraging inferable knowledge a very interesting topic for future research. However, the aim of this dissertation is to establish a comprehensively studied foundation for the base case (that is, Linked Data queries without inference rules).

In summary, a common limitation of the query semantics that have been proposed to use SPARQL as a language for Linked Data queries [26, 66, 72, 160] is their focus on a very basic type of SPARQL expressions, namely, “basic graph patterns” [63], which allow users to express some form of conjunctive queries (cf. Section 6.1, page 111ff). By contrast, this dissertation covers the complete core fragment of SPARQL; in addition to conjunctions, this fragment includes disjunctions, constraints on variable bindings, and optional parts. Furthermore, the aforementioned proposals merely define some query semantics without properly analyzing what a sound and complete support of these semantics entails. That is, a formal analysis of queries under a given semantics (our primary contribution in both this and the following chapter) is missing for all of these proposals.
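The four features of the core fragment named above can be sketched over a fixed collection of triples. The sketch assumes the common algebra over solution mappings (dictionaries from variables to RDF terms), in which conjunction is a join, disjunction a union, an optional part a left outer join, and a constraint a filter; the function names are hypothetical, duplicate elimination is omitted, and the sketch says nothing about how the triples are obtained from a Web of Linked Data.

```python
def match(pattern, triples):
    """Evaluate one triple pattern; terms starting with '?' are variables."""
    solutions = []
    for triple in triples:
        mapping = {}
        for pattern_term, term in zip(pattern, triple):
            if pattern_term.startswith("?"):
                if mapping.setdefault(pattern_term, term) != term:
                    break  # same variable bound to two different terms
            elif pattern_term != term:
                break  # constant does not match
        else:
            solutions.append(mapping)
    return solutions


def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(a, b):
    """Conjunction (AND): merge every pair of compatible mappings."""
    return [{**m1, **m2} for m1 in a for m2 in b if compatible(m1, m2)]


def union(a, b):
    """Disjunction (UNION): solutions of either operand."""
    return a + b


def left_join(a, b):
    """Optional part (OPT): extend a mapping where possible, keep it otherwise."""
    result = []
    for m1 in a:
        extended = [{**m1, **m2} for m2 in b if compatible(m1, m2)]
        result.extend(extended or [m1])
    return result


def filter_solutions(a, condition):
    """Constraint on variable bindings (FILTER)."""
    return [m for m in a if condition(m)]
```

A basic graph pattern uses only `match` and `join`; the remaining three operators are what distinguish the core fragment from the conjunctive queries supported by the proposals discussed in this section.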
