Global-as-View Ontology-Based Data Access for Relational Data [09/2019]


People and Knowledge Networks

WeST

Department 4: Computer Science, Institute for Web Science and Technologies

Global-as-View Ontology-Based Data Access for Relational Data

Master's thesis

submitted for the degree of Master of Science (M.Sc.) in the Computer Science degree program

submitted by

Adrian Skubella

First reviewer: Prof. Dr. Steffen Staab, Institute for Web Science and Technologies

Second reviewer: M. Sc. Daniel Janke, Institute for Web Science and Technologies


Declaration

I hereby confirm that I wrote this thesis independently, that I used no aids other than those indicated (in particular, no Internet sources that are not named in the list of references), and that I have not previously submitted this thesis in any other examination procedure. The submitted written version corresponds to the version on the electronic storage medium (CD-ROM).

Yes No

I agree to this thesis being placed in the library. ◻ ◻

I consent to the publication of this thesis on the Internet. ◻ ◻

The text of this thesis is available under a Creative Commons license (CC BY-SA 4.0). ◻ ◻

The source code is available under a GNU General Public License (GPLv3). ◻ ◻

The collected data is available under a Creative Commons license (CC BY-SA 4.0). ◻ ◻

. . . .


Note

• If you would like us to contact you for the graduation ceremony, please provide your personal E-mail address: . . . .

• If you would like us to send you an invite to join the WeST Alumni


Summary

Ontology-Based Data Access (OBDA) is a technology for mapping different data sources onto a global schema, which can subsequently be queried. This technology can be used, for example, to integrate relational data into knowledge graphs. In this thesis a formal framework for OBDA systems was developed. Based on this formal framework, the OBDA system Ultrawrap^OBDA was formalized. Furthermore, Ultrawrap^OBDA was reimplemented and extended, and the space required by the system was optimized. Results of the Texas Benchmark show that the reimplemented system is on average 3.16 times faster than Ultrawrap^OBDA and 1.87 times faster than the OBDA system Ontop. In addition, the execution times of the reimplementation and the optimized reimplementation are comparable, while the optimized system requires 55% less space than the unoptimized system.

Abstract

Ontology-Based Data Access (OBDA) is a paradigm with which different data sources can be mapped onto a global schema that can be queried. One use case of OBDA is the integration of relational data into knowledge graphs. In this thesis a formal framework for OBDA systems is presented. Based on this framework the OBDA system Ultrawrap^OBDA is formalized. Ultrawrap^OBDA has been reimplemented and extended, and the space consumption of the system has been optimized. Results of the Texas Benchmark show that the reimplemented system is on average 3.16 times faster than Ultrawrap^OBDA and on average 1.87 times faster than the state-of-the-art OBDA system Ontop. Furthermore, the execution times of the reimplemented system and the optimized reimplementation are comparable, while the space consumption of the optimized system is reduced by 55% compared to the unoptimized version.


1. Introduction

Knowledge graphs store knowledge about various domains. Examples of knowledge graphs are the open source knowledge graph Wikidata¹, Microsoft Satori² and the Google knowledge graph³. The Google knowledge graph is used, for instance, to display information that is connected to the search term in a Google search.

One way to store graphs is the triple-based Resource Description Framework (RDF), which is a representation of directed, labelled graphs. Furthermore, ontologies describe a schema for RDF data. With the help of ontologies it is possible to infer new knowledge from existing RDF data. For instance, consider an ontology that defines master student and bachelor student as subclasses of the class student. Furthermore, consider that Alice is a master student and Bob is a bachelor student. With the help of the ontology it can be inferred that Alice and Bob are also instances of student.

A lot of information is only available in relational databases. The query language for relational databases is the Structured Query Language (SQL). One way to integrate information stored in relational databases into knowledge graphs is to extract the relational data, translate it to RDF triples and store it in a database for RDF called a triple store. Such an approach is called extract, transform, load (ETL). Since the relational database will still be used after an ETL process, a drawback of this strategy is that a second database system is needed and the data is therefore stored twice, once in the relational database and once in the triple store. Furthermore, every time the data in the relational database is updated, the data needs to be translated and stored in the triple store again.

An alternative way of integrating relational data into graphs is Ontology-Based Data Access (OBDA). OBDA systems virtualize the information stored in the relations of a relational database as an RDF graph. This means that the relational schema of a relational database is mapped onto an ontology, which serves as the global schema. Queries written in the SPARQL Protocol and RDF Query Language (SPARQL), the standard query language for RDF graphs, can then be issued against the ontology. These queries are translated to SQL queries based on the mappings. The SQL queries retrieve data from the relational database such that the query results are equivalent to the results obtained when the SPARQL query is executed on the actual RDF graph. Figure 1 gives an overview of an OBDA system.

In this thesis a formal framework for OBDA systems that is independent of a particular implementation has been introduced. With this framework the OBDA system Ultrawrap^OBDA [1] has been formalized. Furthermore, Ultrawrap^OBDA has been reimplemented and the system has been optimized. In contrast to the original system, the optimized reimplementation supports instances of superclasses that are not instances of any of the subclasses of the superclass. Furthermore, the space required to use the system was reduced. The implemented system has been benchmarked with the Texas Benchmark and the results have been compared to benchmark results of Ultrawrap^OBDA and the state-of-the-art OBDA system Ontop [2].

¹ https://www.wikidata.org last retrieved 20.09.2019

² https://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing/ last retrieved 20.09.2019

³

[Figure 1 (diagram): components Relational Database (Relational Schema, Relations), Ontology, Virtualized Graph and User / Application; edge labels: schema of, schema of, SPARQL Queries, Mappings, Virtualized as.]

Figure 1: Overview of an OBDA system.

1.1. Research Questions

In order to reimplement and optimize Ultrawrap^OBDA, the following research questions have been answered in this thesis.

• Research question 1: How can Ultrawrap^OBDA be formally defined?

Even though Ultrawrap^OBDA is presented in [1], it is only partly formally defined. Therefore, a formal definition of the complete OBDA system is needed.

• Research question 2: How can the space consumption of materialized views be reduced?

Ultrawrap^OBDA uses views, which are virtual tables based on the result sets of SQL queries, to virtualize an RDF graph. In order to enhance the performance of the OBDA system, views are materialized, which means that the result sets of the SQL queries are physically stored. In these materialized views data is stored redundantly and therefore the space consumption of materialized views can potentially be reduced.

• Research question 3: How can instances of superclasses be used independently of their subclasses?

Ultrawrap^OBDA creates a single SQL view for each class in an ontology. Views for superclasses are defined as the union of all of their subclass views. Consequently, each instance of a superclass also has to be an instance of at least one subclass of the superclass. However, RDF allows for instances of superclasses that are not instances of any of the subclasses of the superclass. Consider an ontology that defines that master student and bachelor student are subclasses of student. Furthermore, consider that Alice is an instance of master student and that Bob is an instance of bachelor student. In addition to Alice and Bob, the PhD student Carol exists. Since Carol is a PhD student, she is an instance of student but she is not an instance of master student or bachelor student. The student view is defined as the union of the master student and the bachelor student views and thereby contains Alice and Bob, but not Carol, who is an exclusive superclass instance. However, Carol should be included in the student view.

• Research question 4: How well does the reimplemented and optimized system perform?

The reimplemented system and the optimized reimplementation should be evaluated using a benchmark for OBDA systems. The benchmark results should be used to evaluate the effect of the optimizations. In order to compare the performance of the new system with the performance of existing OBDA systems, the benchmark results should be compared to the benchmark results of Ultrawrap^OBDA and Ontop.
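The problem behind research question 3 can be illustrated with a small sketch. It uses Python's built-in sqlite3 module and a toy schema; the table and view names (master_student_tbl, student_union_only, student_with_exclusive and so on) are hypothetical and chosen only for this illustration. The second view is merely one conceivable way to include exclusive superclass instances and is not meant to describe the approach taken in section 5.2.

```python
import sqlite3

# Hypothetical toy schema: one base table per kind of student.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master_student_tbl (name TEXT);
    CREATE TABLE bachelor_student_tbl (name TEXT);
    CREATE TABLE phd_student_tbl (name TEXT);
    INSERT INTO master_student_tbl VALUES ('Alice');
    INSERT INTO bachelor_student_tbl VALUES ('Bob');
    INSERT INTO phd_student_tbl VALUES ('Carol');

    -- One view per subclass of student.
    CREATE VIEW master_student AS SELECT name FROM master_student_tbl;
    CREATE VIEW bachelor_student AS SELECT name FROM bachelor_student_tbl;

    -- Superclass view defined only as the union of its subclass views.
    CREATE VIEW student_union_only AS
        SELECT name FROM master_student
        UNION
        SELECT name FROM bachelor_student;

    -- Assumed variant that also covers exclusive superclass instances.
    CREATE VIEW student_with_exclusive AS
        SELECT name FROM student_union_only
        UNION
        SELECT name FROM phd_student_tbl;
""")


def names(view):
    """Return the sorted instance names contained in the given view."""
    rows = conn.execute(f"SELECT name FROM {view}").fetchall()
    return sorted(row[0] for row in rows)


union_only = names("student_union_only")
with_exclusive = names("student_with_exclusive")
print(union_only)      # Carol is missing here
print(with_exclusive)  # Carol is included here
```

In the first view Carol is not returned, which mirrors the limitation described for research question 3.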

1.2. Methodology

In the first step the subsets of RDF, ontologies, SPARQL and relational algebra needed for this thesis have been introduced in section 2. After that, a formal framework for OBDA systems that is independent of the actual OBDA system has been defined in section 3. With the help of this framework Ultrawrap^OBDA has been formalized in section 4, such that research question 1 has been answered.

After having defined all necessary parts of the OBDA system, the system has been reimplemented. After that the optimizations have been defined and implemented. Section 5.1 describes how certain attributes can be omitted in views to reduce the space needed by the OBDA system. This section addresses research question 2. Furthermore, in section 5.2 it has been described how the system supports exclusive superclass instances to answer research question 3.

In order to answer research question 4, the implementation has been benchmarked with the Texas Benchmark [3] and the results have been compared to the benchmark results of the state-of-the-art OBDA system Ontop⁴ and the benchmark results provided for Ultrawrap^OBDA. The results of the benchmark and the comparison of the results have been described in section 6. In section 7 related work in the field of OBDA has been summarized, and finally in section 8 the results of the thesis have been summarized and possible future research has been presented.

⁴


2. Preliminaries

The OBDA system that has been implemented in this thesis enables querying relational data with SPARQL based on mappings, which map relational data onto a given ontology. In this section the data schemas as well as the query languages for RDF and relational data are defined. Because a considerable number of symbols are introduced in this section, table 42 in appendix A summarizes the introduced symbols.

2.1. Resource Description Framework

The Resource Description Framework (RDF) represents a directed labelled graph. In the OBDA system that has been implemented in this work, relational data is virtualized as an RDF graph so that it can be queried with the query language for RDF.

An RDF graph consists of triples called RDF triples.⁵

Definition 1 (RDF Triple and RDF Graph)

$\mathit{IBL}$ denotes the set $I \cup B \cup L$, where $I$, $B$ and $L$ are disjoint sets of IRIs, blank nodes and literals respectively. An RDF triple $tr$ is a triple $(s, p, o) \in (I \cup B) \times I \times \mathit{IBL}$. In an RDF triple, $s$ is called the subject, $p$ the predicate and $o$ the object of the triple. Furthermore, an RDF graph $G$ is a set of RDF triples. [4]

Example 1

A single RDF triple is depicted in listing 1 in the N-Triples format.⁶

    <http://www.university.com/Alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.university.com/MasterStudent> .

Listing 1: A single RDF triple in the N-Triples format.

A graphical representation of the triple is depicted in figure 2.

⁵ https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/ last retrieved 28.03.2019

⁶

[Figure 2 (graph drawing): the vertex http://www.university.com/Alice is connected to the vertex http://www.university.com/MasterStudent by an edge labelled http://www.w3.org/1999/02/22-rdf-syntax-ns#type.]

Figure 2: Graphical representation of an RDF triple.

Example 2

In listing 2 an example of an RDF graph written in the n-triples format is given. The data holds information about two students at a university, namely Alice and Bob. The data says that Alice is a master student and that Bob is a bachelor student. Furthermore, the data defines that Alice and Bob are studying computer science. A graphical representation of the RDF graph resulting from the triples is depicted in figure 3. In this RDF graph "Computer Science" is a literal, which is illustrated in figure 3 by the rectangular shape of the vertex in the graph.

    <http://www.university.com/Alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.university.com/MasterStudent> .
    <http://www.university.com/Alice> <http://www.university.com/field> "Computer Science" .
    <http://www.university.com/Bob> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.university.com/BachelorStudent> .
    <http://www.university.com/Bob> <http://www.university.com/field> "Computer Science" .

[Figure 3 (graph drawing): the vertices http://www.university.com/Alice and http://www.university.com/Bob have edges labelled http://www.w3.org/1999/02/22-rdf-syntax-ns#type to the vertices http://www.university.com/MasterStudent and http://www.university.com/BachelorStudent respectively, and both have an edge labelled http://www.university.com/field to the rectangular literal vertex “Computer Science”.]

Figure 3: Graphical representation of an RDF graph.

2.2. Ontologies

The term ontologies is overloaded because it has different meanings in different fields of research. [5] defines an ontology in the context of computer science as "a means to formally model the structure of a system, i.e., the relevant entities and relations that emerge from its observation, and which are useful to our purpose".

In case of RDF, ontologies describe the schema of RDF data. Ontologies define classes or concepts and the relations between those. Furthermore, ontologies often define class hierarchies.

In this work ontologies provide a global schema onto which relational data will be mapped. SPARQL queries are written against this global schema. Those SPARQL queries are translated based on the underlying relational data such that they can retrieve the desired information from the relational data.

For representing ontologies the Web Ontology Language (OWL) can be used. OWL can be serialized as RDF data. In this work only the subset of OWL that is defined in definition 2 is considered. To distinguish between triples that belong to an ontology and those that do not, ontological triples and assertional triples are defined hereinafter. These definitions of triples are based on [1] and [6].


Definition 2 (Ontological Terms)

The set $T_{\mathit{ontological}} = \{\mathit{subClassOf}, \mathit{subProperty}, \mathit{domain}, \mathit{range}, \mathit{type}, \mathit{equivalentClass}, \mathit{equivalentProperty}, \mathit{inverse}, \mathit{symmetricProperty}\}$ is the set of ontological terms.

For simplicity the full IRIs of ontological terms are omitted.

Definition 3 (Ontological Triples)

An RDF triple $(s, p, o)$ is an ontological triple if

1) $s \in (I \setminus T_{\mathit{ontological}})$ and

2) either $p \in (T_{\mathit{ontological}} \setminus \{\mathit{type}\})$ and $o \in (I \setminus T_{\mathit{ontological}})$, or $p = \mathit{type}$ and $o = \mathit{symmetricProperty}$.

Definition 4 (Assertional Triples)

An RDF triple is assertional if it is not ontological.

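Based on the definition of ontological triples, an ontology can be defined as follows.

Definition 5 (Ontology)

An ontology $O$ is a set of ontological triples.

The semantics $\tau^{G}_{tr}$ of a triple $tr$ in an ontology is presented in the following definition, which is based on [1].

Definition 6 (Semantics of Ontological Triples)

The semantics of an ontological triple $tr$ is the evaluation of the function $\tau^{G}_{tr}$ over the RDF graph $G$.

a) $\tau^{G}_{(s, \mathit{subClassOf}, o)} = \forall x \in \mathit{IBL} \mid (x, \mathit{type}, s) \in G \rightarrow (x, \mathit{type}, o) \in G$

b) $\tau^{G}_{(s, \mathit{subProperty}, o)} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \rightarrow (x, o, y) \in G$

c) $\tau^{G}_{(s, \mathit{domain}, o)} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \rightarrow (x, \mathit{type}, o) \in G$

d) $\tau^{G}_{(s, \mathit{range}, o)} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \rightarrow (y, \mathit{type}, o) \in G$

e) $\tau^{G}_{(s, \mathit{equivalentClass}, o)} = \forall x \in \mathit{IBL} \mid (x, \mathit{type}, s) \in G \leftrightarrow (x, \mathit{type}, o) \in G$

f) $\tau^{G}_{(s, \mathit{equivalentProperty}, o)} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \leftrightarrow (x, o, y) \in G$

g) $\tau^{G}_{(s, \mathit{inverseProperty}, o)} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \leftrightarrow (y, o, x) \in G$

h) $\tau^{G}_{(s, \mathit{type}, \mathit{symmetricProperty})} = \forall x, y \in \mathit{IBL} \mid (x, s, y) \in G \rightarrow (y, s, x) \in G$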

Note that, besides the triples that are inferred as defined in definition 6, additional triples could be inferred. In this work only the subset of inferred triples that is considered useful is dealt with. An inferred triple is considered useful if it is inferred by one of the rules presented in definition 6. Furthermore, an IRI i is called an instance of a class c if there exists an RDF triple (i, type, c).


Example 3

In listing 3 an example of an ontology is given. This ontology defines that each instance of BachelorStudent or MasterStudent is also an instance of Student because the former two classes are subclasses of Student. Based on the ontology shown in listing 3 and definition 6 a), the triples shown in listing 4 can be inferred when the RDF dataset shown in listing 2 and the ontology are combined.

    <http://www.university.com/BachelorStudent> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.university.com/Student> .
    <http://www.university.com/MasterStudent> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://www.university.com/Student> .

Listing 3: An example of an OWL ontology.

    <http://www.university.com/Alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.university.com/Student> .
    <http://www.university.com/Bob> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.university.com/Student> .

Listing 4: Newly inferred triples based on an ontology.
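The inference of example 3 can be reproduced with a minimal sketch that applies the rules of definition 6 until no new triple is produced. The function name infer, the abbreviated IRIs and the plain tuple representation of triples are assumptions made only for this illustration; the sketch is not the implementation used in the thesis.

```python
def infer(graph, ontology):
    """Return graph plus all triples inferable with the rules a) to h)."""
    closure = set(graph)
    while True:
        new = set()
        for (s, p, o) in ontology:
            for (x, q, y) in closure:
                if p == "subClassOf" and q == "type" and y == s:
                    new.add((x, "type", o))            # rule a)
                elif p == "subProperty" and q == s:
                    new.add((x, o, y))                 # rule b)
                elif p == "domain" and q == s:
                    new.add((x, "type", o))            # rule c)
                elif p == "range" and q == s:
                    new.add((y, "type", o))            # rule d)
                elif p == "equivalentClass" and q == "type":
                    if y == s:                         # rule e), both directions
                        new.add((x, "type", o))
                    if y == o:
                        new.add((x, "type", s))
                elif p == "equivalentProperty":
                    if q == s:                         # rule f), both directions
                        new.add((x, o, y))
                    if q == o:
                        new.add((x, s, y))
                elif p in ("inverse", "inverseProperty"):
                    if q == s:                         # rule g), both directions
                        new.add((y, o, x))
                    if q == o:
                        new.add((y, s, x))
                elif p == "type" and o == "symmetricProperty" and q == s:
                    new.add((y, s, x))                 # rule h)
        if new <= closure:  # fixpoint reached, nothing new was inferred
            return closure
        closure |= new


# Listing 2 (data) and listing 3 (ontology) with abbreviated IRIs.
graph = {
    ("Alice", "type", "MasterStudent"),
    ("Alice", "field", "Computer Science"),
    ("Bob", "type", "BachelorStudent"),
    ("Bob", "field", "Computer Science"),
}
ontology = {
    ("BachelorStudent", "subClassOf", "Student"),
    ("MasterStudent", "subClassOf", "Student"),
}
inferred = infer(graph, ontology) - graph
print(sorted(inferred))
```

For this input only rule a) fires, and the newly inferred triples correspond to listing 4.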

2.3. SPARQL Protocol and RDF Query Language

The SPARQL Protocol and RDF Query Language (SPARQL) is the standard query language for querying RDF data. In this subsection the subset of SPARQL needed for this thesis is introduced. The definitions for SPARQL are based on the W3C recommendation for SPARQL 1.1 [7] and [8].

Example 4

A simple SPARQL query retrieving the field in which Alice is aiming to obtain her master’s degree is depicted in listing 5. In this query so-called prefixes are used. Prefixes define abbreviations that can be used in a query to shorten an IRI. For instance, the term PREFIX uni: <http://www.university.com/> defines that uni:Alice actually means <http://www.university.com/Alice>.

    PREFIX uni: <http://www.university.com/>
    SELECT ?field WHERE {
      uni:Alice uni:field ?field .
    }

Listing 5: SPARQL query retrieving the field Alice studies in.

When this query is issued against the graph that is depicted in figure 3, "Computer Science" is bound to the variable ?field. The variable binding that binds "Computer Science" to ?field is then returned because the variable ?field is stated after the SELECT keyword in the query.

Syntax

In the following definition the syntax of SPARQL is introduced. In SPARQL queries graph patterns are used to define which data should be retrieved.

Definition 7 (Graph Pattern and Triple Pattern)

A tuple $tp \in (\mathit{IBL} \cup V) \times (I \cup V) \times (\mathit{IBL} \cup V)$, where $V$ denotes the set of variables, is a graph pattern and is called a triple pattern. The set of triple patterns is denoted by $\mathit{TP}$. With graph patterns $P$, $P_1$ and $P_2$:

$\{P\}$ is a graph pattern.

$P_1 \, . \, P_2$ is a graph pattern and is called join.

$P_1$ OPTIONAL $\{P_2\}$ is a graph pattern and is called optional.

$\{P_1\}$ UNION $\{P_2\}$ is a graph pattern and is called union.

Furthermore, $\mathit{var}(tp)$ is the set of variables that occur in $tp$.

Example 5

An example of a triple pattern is uni:Alice uni:field ?field in the SPARQL query depicted in listing 5. Furthermore, the SPARQL queries depicted in listings 6, 7 and 8 contain examples of join, optional and union graph patterns respectively.

    PREFIX uni: <http://www.university.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?field ?type WHERE {
      uni:Alice uni:field ?field .
      uni:Alice rdf:type ?type
    }

Listing 6: Query containing a join graph pattern.

    PREFIX uni: <http://www.university.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?field WHERE {
      uni:Alice uni:field ?field
      OPTIONAL { uni:Alice rdf:type uni:BachelorStudent }
    }

Listing 7: Query containing an optional graph pattern.

    PREFIX uni: <http://www.university.com/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?type WHERE {
      { uni:Alice rdf:type ?type }
      UNION
      { uni:Bob rdf:type ?type }
    }

Listing 8: Query containing a union graph pattern.

Graph patterns are used within so called SELECT queries in SPARQL. The queries depicted in listings 5, 6, 7 and 8 are SELECT queries.

Definition 8 (SELECT Query)

If P is a graph pattern and V is a set of variables, then SELECT V WHERE {P} and SELECT * WHERE {P} are SELECT queries.

Semantics

In order to retrieve data from an RDF graph with a SELECT query, so called variable bindings are used.

Definition 9 (Variable bindings)

The partial function µ ∶ V → IBL is called a variable binding. For the triple pattern tp, µ(tp) denotes the triple obtained when all variables in tp are replaced according to µ. The domain dom(µ) of a variable binding µ is the set of variables on which µ is defined.

Example 6

An example of a variable binding is µ = {(?type, uni:MasterStudent)}. In this example ?type is the variable on which the mapping is defined and uni:MasterStudent is the value that is bound to ?type. Consider the triple pattern tp = (uni:Alice, rdf:type, ?type). The triple obtained by µ(tp) is (uni:Alice, rdf:type, uni:MasterStudent).
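Example 6 can be mirrored in a few lines of code. This is only a sketch under assumptions: a variable binding is modelled as a Python dict, variables are strings starting with a question mark, and the helper names is_variable and apply_binding are invented for this illustration.

```python
def is_variable(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def apply_binding(mu, tp):
    """Replace every variable of tp that is in dom(mu) by its bound value.

    Because mu is a partial function, variables outside dom(mu) stay as is.
    """
    return tuple(mu.get(term, term) if is_variable(term) else term
                 for term in tp)


mu = {"?type": "uni:MasterStudent"}          # the binding of example 6
tp = ("uni:Alice", "rdf:type", "?type")      # the triple pattern of example 6

domain = set(mu)                             # dom(mu)
triple = apply_binding(mu, tp)
print(domain, triple)
```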


Definition 10 (Compatible Variable Bindings)

Two variable bindings $\mu_1$ and $\mu_2$ are compatible variable bindings when:

$\forall x \in \mathit{dom}(\mu_1) \cap \mathit{dom}(\mu_2) : \mu_1(x) = \mu_2(x)$

Example 7

Consider the variable bindings $\mu_1$ = {(?type, uni:MasterStudent), (?person, uni:Alice)} and $\mu_2$ = {(?type, uni:MasterStudent), (?field, "Computer Science")}. The shared domain of the two variable bindings is $\mathit{dom}(\mu_1) \cap \mathit{dom}(\mu_2)$ = {?type}. Because $\mu_1$(?type) = $\mu_2$(?type) = uni:MasterStudent, the two variable bindings are compatible.

The join, union and difference of two sets of variable bindings can be created as defined in the following definition:

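Definition 11 (Join, Union and Difference of Sets of Variable Bindings)

Let $\Omega_1$ and $\Omega_2$ be two sets of variable bindings. Then the join (1), union (2), difference (3) and left outer join (4) of these sets are defined as follows:

(1) $\Omega_1 \bowtie \Omega_2 = \{\mu_1 \cup \mu_2 \mid \mu_1 \in \Omega_1, \mu_2 \in \Omega_2 \text{ and } \mu_1 \text{ and } \mu_2 \text{ are compatible}\}$

(2) $\Omega_1 \cup \Omega_2 = \{\mu \mid \mu \in \Omega_1 \text{ or } \mu \in \Omega_2\}$

(3) $\Omega_1 \setminus \Omega_2 = \{\mu \in \Omega_1 \mid \forall \mu' \in \Omega_2 : \mu \text{ and } \mu' \text{ are not compatible}\}$

(4) $\Omega_1 ⟕ \Omega_2 = (\Omega_1 \bowtie \Omega_2) \cup (\Omega_1 \setminus \Omega_2)$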
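The operations of definitions 10 and 11 can be sketched as follows. The representation is an assumption made for this illustration: a variable binding is a dict, a set of bindings is a duplicate-free list of dicts, and all function names are hypothetical.

```python
def compatible(mu1, mu2):
    """Definition 10: the bindings agree on every shared variable."""
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())


def dedupe(bindings):
    """Drop duplicate bindings so that the list behaves like a set."""
    seen, result = set(), []
    for mu in bindings:
        key = frozenset(mu.items())
        if key not in seen:
            seen.add(key)
            result.append(mu)
    return result


def join(omega1, omega2):
    """(1) Merge every compatible pair of bindings."""
    return dedupe([{**m1, **m2} for m1 in omega1 for m2 in omega2
                   if compatible(m1, m2)])


def union(omega1, omega2):
    """(2) All bindings that occur in either set."""
    return dedupe(omega1 + omega2)


def difference(omega1, omega2):
    """(3) Bindings of omega1 that are compatible with nothing in omega2."""
    return [m1 for m1 in omega1
            if all(not compatible(m1, m2) for m2 in omega2)]


def left_outer_join(omega1, omega2):
    """(4) Join, plus the bindings of omega1 without a join partner."""
    return union(join(omega1, omega2), difference(omega1, omega2))


omega1 = [
    {"?person": "uni:Alice", "?type": "uni:MasterStudent"},
    {"?person": "uni:Bob", "?type": "uni:BachelorStudent"},
]
omega2 = [{"?type": "uni:MasterStudent", "?field": "Computer Science"}]

joined = join(omega1, omega2)
remaining = difference(omega1, omega2)
optional = left_outer_join(omega1, omega2)
print(joined, remaining, optional, sep="\n")
```

The binding for Bob has no compatible partner in omega2, so it is dropped by the join but kept unextended by the left outer join.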

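Based on these definitions the evaluation of a graph pattern, denoted by the function $\llbracket \cdot \rrbracket_G$, where $G$ is the graph on which the graph pattern is evaluated, can be defined.

Definition 12 (Evaluation of Graph Pattern)

Let $G$ be an RDF graph, let $tp$ be a triple pattern and let $P_1$ and $P_2$ be graph patterns. Then the evaluation $\llbracket P \rrbracket_G$ is defined as:

$\llbracket tp \rrbracket_G = \{\mu \mid \mathit{dom}(\mu) = \mathit{var}(tp) \text{ and } \mu(tp) \in G\}$

$\llbracket P_1 \, . \, P_2 \rrbracket_G = \llbracket P_1 \rrbracket_G \bowtie \llbracket P_2 \rrbracket_G$

$\llbracket P_1 \text{ OPTIONAL } \{P_2\} \rrbracket_G = \llbracket P_1 \rrbracket_G ⟕ \llbracket P_2 \rrbracket_G$

$\llbracket \{P_1\} \text{ UNION } \{P_2\} \rrbracket_G = \llbracket P_1 \rrbracket_G \cup \llbracket P_2 \rrbracket_G$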

With this evaluation of graph patterns the evaluation of a SELECT query is defined as follows.

Definition 13 (Evaluation of SELECT Query)

The evaluation $\llbracket Q \rrbracket_G$ of a query $Q$ of the form SELECT V WHERE {P} on the RDF graph $G$ is the set of all projections $\mu|_V$ of bindings $\mu$ from $\llbracket P \rrbracket_G$ to $V$, where the projection $\mu|_V$ is the binding that coincides with $\mu$ on $V$ and is undefined elsewhere.

The evaluation of SELECT * WHERE {P} is equal to the evaluation of SELECT V WHERE {P} where $V = \mathit{var}(P)$ and $\mathit{var}(P)$ denotes the set of all variables in $P$.
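Definitions 12 and 13 can be turned into a small evaluator. The sketch below rests on several assumptions that are not part of the thesis: graphs are sets of string triples with prefixed names, a triple pattern is a 3-tuple of strings, composite patterns are tuples tagged with the reserved words "join", "optional" and "union", and all function names are invented for this illustration.

```python
def is_var(term):
    return term.startswith("?")

def compatible(mu1, mu2):
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())

def dedupe(bindings):
    seen, result = set(), []
    for mu in bindings:
        key = frozenset(mu.items())
        if key not in seen:
            seen.add(key)
            result.append(mu)
    return result

def join(o1, o2):
    return dedupe([{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)])

def union(o1, o2):
    return dedupe(o1 + o2)

def difference(o1, o2):
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]

def eval_triple_pattern(tp, graph):
    """All bindings mu with dom(mu) = var(tp) and mu(tp) in the graph."""
    result = []
    for triple in sorted(graph):
        mu = {}
        for term, value in zip(tp, triple):
            if is_var(term):
                if mu.setdefault(term, value) != value:
                    break  # a repeated variable would need two values
            elif term != value:
                break      # a constant of the pattern does not match
        else:
            result.append(mu)
    return dedupe(result)

def evaluate(pattern, graph):
    """Definition 12, applied recursively to the pattern structure."""
    if pattern[0] == "join":
        return join(evaluate(pattern[1], graph), evaluate(pattern[2], graph))
    if pattern[0] == "optional":
        left, right = evaluate(pattern[1], graph), evaluate(pattern[2], graph)
        return union(join(left, right), difference(left, right))
    if pattern[0] == "union":
        return union(evaluate(pattern[1], graph), evaluate(pattern[2], graph))
    return eval_triple_pattern(pattern, graph)

def select(variables, pattern, graph):
    """Definition 13; variables=None stands for SELECT *."""
    solutions = evaluate(pattern, graph)
    if variables is None:
        return solutions
    return dedupe([{v: mu[v] for v in variables if v in mu} for mu in solutions])

G = {
    ("uni:Alice", "rdf:type", "uni:MasterStudent"),
    ("uni:Alice", "uni:field", "Computer Science"),
    ("uni:Bob", "rdf:type", "uni:BachelorStudent"),
    ("uni:Bob", "uni:field", "Computer Science"),
}
field_tp = ("uni:Alice", "uni:field", "?field")
q6 = select(["?field", "?type"],
            ("join", field_tp, ("uni:Alice", "rdf:type", "?type")), G)
q7 = select(["?field"],
            ("optional", field_tp,
             ("uni:Alice", "rdf:type", "uni:BachelorStudent")), G)
q8 = select(["?type"],
            ("union", ("uni:Alice", "rdf:type", "?type"),
             ("uni:Bob", "rdf:type", "?type")), G)
print(q6, q7, q8, sep="\n")
```

The three queries correspond to listings 6, 7 and 8, evaluated over the graph of listing 2.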


2.4. Relational Data Model

Relational database systems are the backbone of many web sites and software systems [9]. In this section the basics of relational databases and the underlying relational model that are needed for this thesis are presented. These basics are based on [10] and [1]. In relational databases data is stored as relations.

Example 8

One example of a schema of a relation is given in table 1, where the schema and the relation are depicted as a table. In this relation schema various attributes are defined. The attributes are depicted as column names in table 1, namely ID, Name and Field, with the domains integer, characters and characters respectively.

STUDENT
ID | Name  | Field
1  | Alice | Computer Science
2  | Bob   | Computer Science

Table 1: Table depicting relation schema and relation.

Definition 14 (Domain)

A domain D is a set of atomic values.

Due to the fact that NULL values often appear in real world datasets the NULL values will be defined in the context of relational algebra hereinafter.

Definition 15 (NULL)

NULL ∉ D is the keyword that defines the absence of a value.

Definition 16 (Relation Schema)

A relation schema $R(A_1, A_2, \ldots, A_n)$ is the schema of a single relation, where $A_1, \ldots, A_n$ are the attributes of the relation schema. The arity of the relation schema is equal to $n$.

Based on the relation schema relations can be defined. Informally speaking, a relation is the set of entries in a table defined by the relation schema.

Definition 17 (Relation)

A relation $r$ of a relation schema $R(A_1, A_2, \ldots, A_n)$ is a set of tuples $r = \{tu_1, tu_2, \ldots, tu_m\}$, where each tuple $tu$ is an ordered list of values $\langle v_1, v_2, \ldots, v_n \rangle$ with $v_i \in \mathit{dom}(A_i) \cup \{\mathit{NULL}\}$. The $i$th value of a tuple $tu$ is denoted by $tu[A_i]$. Furthermore, $\mathit{att}(r)$ denotes the set of attributes $\{A_1, A_2, \ldots, A_n\}$ in $r$.

An example of a relation is the set of tuples $r = \{tu_1, tu_2\}$ with the tuples $tu_1 = \langle 1, \text{Alice}, \text{Computer Science} \rangle$ and $tu_2 = \langle 2, \text{Bob}, \text{Computer Science} \rangle$, as depicted in table 1. Thereby, $tu_1[\mathit{ID}] = 1$ and $tu_2[\mathit{Name}] = \text{Bob}$ are examples of how values in tuples can be denoted.
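The notation tu[A] can be made concrete with a minimal sketch. It assumes that a relation schema is represented by an ordered tuple of attribute names and that NULL is represented by Python's None; the names STUDENT_SCHEMA and value are hypothetical.

```python
# Relation schema STUDENT(ID, Name, Field) as an ordered tuple of attributes.
STUDENT_SCHEMA = ("ID", "Name", "Field")

# The relation of table 1 as a set of tuples.
tu1 = (1, "Alice", "Computer Science")
tu2 = (2, "Bob", "Computer Science")
STUDENT = {tu1, tu2}


def value(schema, tu, attribute):
    """Return tu[attribute], the value of the tuple for the given attribute."""
    return tu[schema.index(attribute)]


arity = len(STUDENT_SCHEMA)
print(arity, value(STUDENT_SCHEMA, tu1, "ID"), value(STUDENT_SCHEMA, tu2, "Name"))
```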

So far only the schema of a single relation has been discussed. A complete database also has a schema, called the relational schema.

Definition 18 (Relational Schema)

A relational schema $S = \{R_1, R_2, \ldots, R_n\}$ of a database is a set of relation schemas. Each attribute $A_i$ in $R_j \in S$ has a domain $D$ denoted by $\mathit{dom}(A_i)$.

Furthermore, relational schemas can be instantiated.

Definition 19 (Instance of Relational Schema)

An instance $s = \{r_1, r_2, \ldots, r_n\}$ of a relational schema $S$ is a set of relations where for each relation schema $R_i \in S$ a corresponding relation $r_i$ exists in $s$.

Based on the instance $s$ of a relational schema, an instance of a relation schema can be written as $R^s$. This expression denotes the instance of the relation schema $R$ that is included in $s$.

2.5. Relational Algebra

Query languages that are used to query relational data, such as the well-known Structured Query Language (SQL), are defined based on relational algebra. Relational algebra uses sets, their union and difference, and the Cartesian product from set theory. The definitions in this section are the definitions introduced in [1].

Syntax

In relational algebra, relational algebra expressions are used. A relational algebra expression $\phi$ and its attributes $\mathit{att}(\phi)$ are defined hereinafter. In the following sections $S$ denotes a relational schema, $s$ denotes an instance of a relational schema $S$, $R$ denotes a relation schema and $r$ denotes an instance of a single relation schema $R$.

Definition 20 (Relation in relational algebra)

Let $\phi = R$ and $R \in S$. Then $\phi$ is a relational algebra expression over $S$ such that $\mathit{att}(\phi) = \mathit{att}(R)$.

Definition 21 (NULL)

Let $A$ be an attribute and $\phi = \mathit{NULL}_A$. Then $\phi$ is a relational algebra expression over $S$ where $\mathit{att}(\phi) = \{A\}$.

Definition 22 (Condition)

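A condition $\mathit{cond}_A$ is of the form:

$A = a$

$A \neq a$

$\mathit{isNull}(A)$

$\mathit{isNotNull}(A)$

$\mathit{true}$

If $\mathit{cond1}_A$ and $\mathit{cond2}_A$ are conditions, then $\mathit{cond1}_A \wedge \mathit{cond2}_A$ and $\mathit{cond1}_A \vee \mathit{cond2}_A$ are conditions.

Example 9

Consider the relation depicted in table 1. Let $A = \mathit{att}(\mathit{STUDENT})$. Then $\mathit{cond}_{\{\mathit{ID}, \mathit{Name}, \mathit{Field}\}} = \mathit{isNotNull}(\mathit{Name})$ is a condition.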

Definition 23 (Selection)

Let $\phi_1$ be a relational algebra expression over $S$ and let $A \subseteq \mathit{att}(\phi_1)$. Then the following expression $\phi_2$ is a relational algebra expression with $\mathit{att}(\phi_2) = \mathit{att}(\phi_1)$:

$\phi_2 = \sigma_{\mathit{cond}_A}(\phi_1)$

Example 10

For instance, $\sigma_{\mathit{ID} = 1}(\mathit{STUDENT})$ is a selection on the relation depicted in table 1.

Definition 24 (Projection)

Let $\phi_1$ be a relational algebra expression over $S$ with $U \subseteq \mathit{att}(\phi_1)$. Let $\phi_2 = \pi_U(\phi_1)$. Then $\phi_2$ is a relational algebra expression over $S$ and $\mathit{att}(\phi_2) = U$.

Example 11

An example of a projection on the example relation STUDENT depicted in table 1 is $\pi_{\mathit{ID}, \mathit{Field}}(\mathit{STUDENT})$.
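The selection of example 10 and the projection of example 11 can be sketched as follows. Only the syntax of both operators is defined above, so the sketch assumes the usual set semantics of relational algebra; tuples are modelled as dicts and the function names selection and projection are hypothetical.

```python
# The relation STUDENT of table 1, with one dict per tuple.
STUDENT = [
    {"ID": 1, "Name": "Alice", "Field": "Computer Science"},
    {"ID": 2, "Name": "Bob", "Field": "Computer Science"},
]


def selection(relation, condition):
    """Keep the tuples for which the condition holds."""
    return [tu for tu in relation if condition(tu)]


def projection(relation, attributes):
    """Keep only the given attributes and drop duplicate tuples."""
    result = []
    for tu in relation:
        projected = {a: tu[a] for a in attributes}
        if projected not in result:
            result.append(projected)
    return result


selected = selection(STUDENT, lambda tu: tu["ID"] == 1)   # sigma_{ID=1}
projected = projection(STUDENT, ["ID", "Field"])          # pi_{ID,Field}
fields = projection(STUDENT, ["Field"])                   # pi_{Field}
print(selected, projected, fields, sep="\n")
```

Projecting onto Field alone yields a single tuple, because both students share the same field and a relation is a set.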

Definition 25 (Coalesce)

Let $\phi_1$ be a relational algebra expression over $S$, let $A_1, A_2 \in \mathit{att}(\phi_1)$ and let $A_{new} \notin \mathit{att}(\phi_1)$. Then $\phi_2 = \kappa_{A_1, A_2, A_{new}}(\phi_1)$ is a relational algebra expression with $\mathit{att}(\phi_2) = \mathit{att}(\phi_1) \cup \{A_{new}\}$.

Example 12

Consider the relation depicted in table 2. An example of a coalesce is the following relational algebra expression: $\kappa_{\mathit{PostalCode}, \mathit{City}, \mathit{Location}}(\mathit{ADDRESS})$.
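The coalesce of example 12 can be sketched on the ADDRESS relation of table 2. The semantics used here is an assumption, since only the syntax is defined above: the new attribute is taken to hold the first of the two given attributes that is not NULL, as the SQL function COALESCE does. NULL is represented by None and the function name coalesce is hypothetical.

```python
# The relation ADDRESS of table 2; None stands for NULL.
ADDRESS = [
    {"Name": "Alice", "City": "Koblenz", "PostalCode": "56073"},
    {"Name": "Bob", "City": "Cologne", "PostalCode": None},
    {"Name": "Carol", "City": None, "PostalCode": None},
]


def coalesce(relation, a1, a2, a_new):
    """Add attribute a_new holding the first non-NULL value of a1 and a2.

    Assumed semantics, analogous to SQL's COALESCE(a1, a2).
    """
    result = []
    for tu in relation:
        first = tu[a1] if tu[a1] is not None else tu[a2]
        result.append({**tu, a_new: first})
    return result


located = coalesce(ADDRESS, "PostalCode", "City", "Location")
locations = [tu["Location"] for tu in located]
print(locations)
```

Under this assumption the attribute set grows by exactly the new attribute Location, which matches att(ϕ₂) = att(ϕ₁) ∪ {A_new} in definition 25.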

ADDRESS
Name  | City    | PostalCode
Alice | Koblenz | 56073
Bob   | Cologne | NULL
Carol | NULL    | NULL

Table 2: Table depicting relation that stores cities and postal codes of persons.

Definition 26 (Rename of Attribute)

Let $\phi_1$ be a relational algebra expression over $S$. Furthermore, let $A \in \mathit{att}(\phi_1)$ and let $B \notin \mathit{att}(\phi_1)$. If $\phi_2 = \rho_{A \rightarrow B}(\phi_1)$, then $\phi_2$ is a relational algebra expression over $S$ with $\mathit{att}(\phi_2) = (\mathit{att}(\phi_1) \setminus \{A\}) \cup \{B\}$.

Example 13

Considering the relation depicted in table 1, $\rho_{\mathit{Name} \rightarrow \mathit{FirstName}}(\mathit{STUDENT})$ is an example of a rename.

Definition 27 (Union)

Let $\phi_1, \phi_2$ be relational algebra expressions over $S$ with $\mathit{att}(\phi_1) = \mathit{att}(\phi_2)$. Let $\phi_3 = \phi_1 \cup \phi_2$. Then $\phi_3$ is a relational algebra expression over $S$ and $\mathit{att}(\phi_3) = \mathit{att}(\phi_1)$.

Example 14

Assume the relation depicted in table 3 called PERSON. The union of STUDENT and PERSON can be written as $\mathit{STUDENT} \cup \mathit{PERSON}$.

PERSON
Name  | City    | Age
Alice | Koblenz | 22
Bob   | Cologne | 30
Carol | Koblenz | 23

Table 3: Table depicting relation schema and relation for persons.

Definition 28 (Outer Union)

Let $\phi_1, \phi_2$ be relational algebra expressions over $S$. Let $\phi_3 = \phi_1 \uplus \phi_2$. Then $\phi_3$ is a relational algebra expression over $S$ and $\mathit{att}(\phi_3) = \mathit{att}(\phi_1) \cup \mathit{att}(\phi_2)$.

Example 15

The outer union of the tables STUDENT and PERSON is $\mathit{STUDENT} \uplus \mathit{PERSON}$.

Definition 29 (Difference)

Let ϕ1, ϕ2 be relational algebra expressions over S with att(ϕ1) =att(ϕ₂). Let ϕ₃ = ϕ1∖ϕ₂ then ϕ₃ is a relational algebra expression over S and att(ϕ₃) =att(ϕ₁) Example 16


SINGLE_STUDENT

ID Name Field

2 Bob Computer Science

Table 4: Table depicting relation schema and relation.

Example 17

The difference of the two relations STUDENT and SINGLE_STUDENT can be expressed as $STUDENT \setminus SINGLE\_STUDENT$.

Definition 30 (Cross Join)

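Let $\varphi_1, \varphi_2$ be relational algebra expressions over $S$ and let $att(\varphi_1) \cap att(\varphi_2) = \emptyset$. Let $\varphi_3 = \varphi_1 \times \varphi_2$; then $\varphi_3$ is a relational algebra expression over $S$ and $att(\varphi_3) = att(\varphi_1) \cup att(\varphi_2)$.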

Example 18

Consider the relations STUDENT and CITY depicted in table 1 and table 5, respectively. The cross product of the two relations can be written as $STUDENT \times CITY$.

CITY

CityName PostalCode

Koblenz 56068

Cologne 50667

Table 5: Table depicting relation storing city names and postal codes.

Definition 31 (Theta Join)

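Let $\varphi_1, \varphi_2$ be relational algebra expressions over $S$ and let $att(\varphi_1) \cap att(\varphi_2) = \emptyset$. Let $\varphi_3 = \varphi_1 \bowtie_{cond_A} \varphi_2$ and let $A \subseteq att(\varphi_1) \cup att(\varphi_2)$. Then $\varphi_3$ is a relational algebra expression with $att(\varphi_3) = att(\varphi_1) \cup att(\varphi_2)$.

If no $cond_A$ is given in a theta join $\varphi_1 \bowtie \varphi_2$, then the theta join is equivalent to $\varphi_1 \bowtie_{true} \varphi_2$.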

Example 19

Consider the relation shown in table 2 and the relation shown in table 6, which depicts codes for missing information. A join with a NULL in the join condition is: $ADDRESS \bowtie_{PostalCode=NULL} CODES$.

CODES

Code Description

1 Postal Code

2 Incomplete


Definition 32 (Left Outer Join)

Let $\varphi_1, \varphi_2$ be relational algebra expressions over $S$, let $att(\varphi_1) \cap att(\varphi_2) = \emptyset$, and let $A \subseteq att(\varphi_1) \cup att(\varphi_2)$. Let $\varphi_3 = \varphi_1 ⟕_{cond_A} \varphi_2$; then $\varphi_3$ is a relational algebra expression over $S$ with $att(\varphi_3) = att(\varphi_1) \cup att(\varphi_2)$, called left outer join.

If no $cond_A$ is given in a left outer join $\varphi_1 ⟕ \varphi_2$, then the left outer join is equivalent to $\varphi_1 ⟕_{true} \varphi_2$.

Example 20

An example of a left outer join of the relations STUDENT and GRADES depicted in table 1 and table 7, respectively, is $STUDENT ⟕_{ID=StudentID} GRADES$.

GRADES

StudentID AverageGrade

1 1.7

Table 7: Table depicting grades for students.

It may happen that in a join of two relations the sets of attributes of the relations are not disjoint. In such cases the fully qualified name of an attribute can be used. A fully qualified name takes the source relation of the attribute into consideration. Consider the relation STUDENT depicted in table 1 and the relation PERSON depicted in table 3. Both relations have an attribute called NAME. The fully qualified names of these attributes are STUDENT.NAME and PERSON.NAME, respectively.

If a self join occurs, so that the source relation is not sufficient to identify an attribute unambiguously, the attribute can be renamed or the underlying relation can be named.

Definition 33 (Naming of Relational Algebra Expression)

Let $\varphi$ be a relational algebra expression and let $E$ be an arbitrary but unambiguous name for the algebra expression; then $\rho_E(\varphi)$ is a relational algebra expression such that each $A \in att(\varphi)$ can be addressed with $E.A$.

This definition does not change any value in the actual relation.

Example 21

Consider the relation depicted in table 1 and assume that it should be joined with itself on the attribute ID. In order to ensure that the attributes are unambiguous, the following relational algebra expression may be used:

$STUDENT \bowtie_{STUDENT.ID = renamedSTUDENT.ID} \rho_{renamedSTUDENT}(STUDENT)$

Although parentheses are not explicitly mentioned in the syntax definition, they are sometimes used for clarification.


Semantics

Based on the syntax of relational algebra, the semantics is introduced in the following. Let $S$ be a relational schema, $s$ an instance of $S$, $R$ a relation schema and $r$ an instance of $R$. Furthermore, let $\varphi$ be a relational algebra expression over $S$. The evaluation $\llbracket \varphi \rrbracket_s$ of a relational algebra expression over the instance $s$ of $S$ is defined as follows.

Definition 34 (Evaluation of Relation Name)

Let $\varphi = R$; then $\llbracket \varphi \rrbracket_s = R^s$.

Let student and person be the relations of the relation schemas STUDENT and PERSON, respectively. Furthermore, let the relational schema instance be $s = \{student, person\}$. Then $\llbracket STUDENT \rrbracket_s = STUDENT^s = student$.

Definition 35 (Evaluation of NULL)

Let $\varphi = NULL_A$, where $A$ is an attribute; then $\llbracket \varphi \rrbracket_s = \{tu\}$, where $tu$ is a tuple such that $tu[A] = NULL$.

Definition 36 (Evaluation of Condition)

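The evaluation $\llbracket cond_A \rrbracket_{tu}$ of a condition over a tuple $tu$ is defined as:

$$\llbracket A = a \rrbracket_{tu} := \begin{cases} true, & \text{if } tu[A] = a \\ false, & \text{otherwise} \end{cases}$$

$$\llbracket A \neq a \rrbracket_{tu} := \begin{cases} true, & \text{if } tu[A] \neq a \\ false, & \text{otherwise} \end{cases}$$

$$\llbracket isNull(A) \rrbracket_{tu} := \begin{cases} true, & \text{if } tu[A] = NULL \\ false, & \text{otherwise} \end{cases}$$

$$\llbracket isNotNull(A) \rrbracket_{tu} := \begin{cases} true, & \text{if } tu[A] \neq NULL \\ false, & \text{otherwise} \end{cases}$$

$$\llbracket true \rrbracket_{tu} := true$$

Furthermore, let $cond1_A, cond2_A$ be conditions; then:

$$\llbracket cond1_A \wedge cond2_A \rrbracket_{tu} := \llbracket cond1_A \rrbracket_{tu} \wedge \llbracket cond2_A \rrbracket_{tu}$$

$$\llbracket cond1_A \vee cond2_A \rrbracket_{tu} := \llbracket cond1_A \rrbracket_{tu} \vee \llbracket cond2_A \rrbracket_{tu}$$

Example 22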

Consider the single tuple $tu$ = <2, Bob, Computer Science> depicted in table 4. The evaluation of the condition ID = 2 over this tuple is: $\llbracket ID = 2 \rrbracket_{tu} = true$.

Definition 37 (Evaluation of Selection)

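The evaluation of a selection selects the tuples of a relation that satisfy a condition. Let $\varphi_1$ be a relational algebra expression over $S$ and let $cond_A$ be a condition with $A \subseteq att(\varphi_1)$; then the evaluation of a selection $\sigma_{cond_A}(\varphi_1)$ is defined as:

$$\llbracket \sigma_{cond_A}(\varphi_1) \rrbracket_s := \{tu \in \llbracket \varphi_1 \rrbracket_s \mid \llbracket cond_A \rrbracket_{tu}\}$$

Example 23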

The evaluation of the selection $\sigma_{ID=1}(STUDENT)$ with regard to the relation depicted in table 1 is {<1, Alice, Computer Science>}.

Definition 38 (Evaluation of Projection)

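The evaluation of a projection chooses a subset of attributes from a relation. Let $\varphi_1$ be a relational algebra expression over $S$ and $U \subseteq att(\varphi_1)$.

If $\varphi_2 = \pi_U(\varphi_1)$, then $\llbracket \varphi_2 \rrbracket_s = \{tu' \mid tu \in \llbracket \varphi_1 \rrbracket_s \text{ and } \forall A \in U : tu'[A] = tu[A]\}$

Example 24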

The evaluation of the projection $\pi_{ID,Field}(STUDENT)$ is the following set of tuples: {<1, Computer Science>, <2, Computer Science>}.
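The evaluation rules for selection and projection can be traced with a small Python sketch. The encoding of a relation as a list of dictionaries and the function names `select` and `project` are assumptions made for illustration only; the rows of STUDENT are taken from the examples above.

```python
# Relations are modelled as lists of dicts (attribute -> value). This encoding is an
# assumption made for illustration, not part of the formal definitions.
STUDENT = [
    {"ID": 1, "Name": "Alice", "Field": "Computer Science"},
    {"ID": 2, "Name": "Bob", "Field": "Computer Science"},
]


def select(relation, cond):
    """sigma_cond: keep exactly the tuples for which cond evaluates to true."""
    return [tu for tu in relation if cond(tu)]


def project(relation, attrs):
    """pi_U: restrict each tuple to the attributes in U; duplicates collapse (set semantics)."""
    result = []
    for tu in relation:
        restricted = {a: tu[a] for a in attrs}
        if restricted not in result:
            result.append(restricted)
    return result


selected = select(STUDENT, lambda tu: tu["ID"] == 1)   # sigma_{ID=1}(STUDENT)
projected = project(STUDENT, ["ID", "Field"])          # pi_{ID,Field}(STUDENT)
print(selected)
print(projected)
```

The condition is passed as a Python predicate over a single tuple, which mirrors the fact that conditions are evaluated per tuple.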

Definition 39 (Evaluation of Coalesce)

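Let $\varphi_1$ be a relational algebra expression over $S$. Let $\varphi_2 = \kappa_{A_1,A_2,A_{new}}(\varphi_1)$; then

$$\llbracket \varphi_2 \rrbracket_s := \left\{ tu \;\middle|\; tu_1 \in \llbracket \varphi_1 \rrbracket_s : \forall A \in att(\varphi_1) : tu[A] = tu_1[A] \text{ and } tu[A_{new}] = \begin{cases} tu_1[A_2], & \text{if } tu_1[A_1] = NULL \\ tu_1[A_1], & \text{otherwise} \end{cases} \right\}$$

Example 25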

The evaluation of the coalesce $\kappa_{PostalCode,City,Location}(ADDRESS)$ is depicted in table 8.

Name City PostalCode Location

Alice Koblenz 56073 56073

Bob Cologne NULL Cologne

Carol NULL NULL NULL

Table 8: Table depicting result of a coalesce.

Definition 40 (Evaluation of Rename of Attribute)

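The evaluation of a rename renames an attribute. Let $\varphi_1$ be a relational algebra expression, $A \in att(\varphi_1)$ and $B \notin att(\varphi_1)$. The evaluation of the rename operation $\varrho_{A \to B}(\varphi_1)$ is the set of tuples in which the attribute $A$ is now called $B$.

Let $\varphi_2 = \varrho_{A \to B}(\varphi_1)$; then $\llbracket \varphi_2 \rrbracket_s = \{tu' \mid tu \in \llbracket \varphi_1 \rrbracket_s \text{ and } tu'[B] = tu[A] \text{ and } \forall C \in att(\varphi_1) : C \neq A \Rightarrow tu'[C] = tu[C]\}$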


Example 26

The evaluation of $\varrho_{Name \to FirstName}(STUDENT)$ on table 1 would result in table 9.

STUDENT

ID FirstName Field

1 Alice Computer Science

2 Bob Computer Science

Table 9: Result of rename operation.

Definition 41 (Evaluation of Union)

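Let $\varphi_1 = \varphi_2 \cup \varphi_3$; then the evaluation of the union is defined as:

$$\llbracket \varphi_1 \rrbracket_s = \llbracket \varphi_2 \rrbracket_s \cup \llbracket \varphi_3 \rrbracket_s$$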

Example 27

Consider the STUDENT table depicted in table 1 and the BIOLOGY_STUDENT table depicted in table 10. The union of these two tables, $STUDENT \cup BIOLOGY\_STUDENT$, is depicted in table 11.

BIOLOGY_STUDENT

ID Name Field

3 Carol Biology

Table 10: Table depicting biology student.

ID Name Field

1 Alice Computer Science

2 Bob Computer Science

3 Carol Biology

Table 11: Result of union operation.

Definition 42 (Evaluation of Outer Union)

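Let $\varphi_1 = \varphi_2 \uplus \varphi_3$; then

$$\llbracket \varphi_1 \rrbracket_s := \{tu \mid (\exists tu_2 \in \llbracket \varphi_2 \rrbracket_s : (\forall A_1 \in att(\varphi_2) : tu[A_1] = tu_2[A_1]) \text{ and } (\forall A_2 \in att(\varphi_3) \setminus att(\varphi_2) : tu[A_2] = NULL))$$

$$\text{or } (\exists tu_3 \in \llbracket \varphi_3 \rrbracket_s : (\forall A_3 \in att(\varphi_3) : tu[A_3] = tu_3[A_3]) \text{ and } (\forall A_4 \in att(\varphi_2) \setminus att(\varphi_3) : tu[A_4] = NULL))\}$$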


Example 28

The relation resulting from the outer union of the tables PERSON and STUDENT, $PERSON \uplus STUDENT$, is depicted in table 12.

Name City Age ID Field

Alice Koblenz 22 NULL NULL

Bob Cologne 30 NULL NULL

Carol Koblenz 23 NULL NULL

Alice NULL NULL 1 Computer Science

Bob NULL NULL 2 Computer Science

Table 12: Result of outer union.

Definition 43 (Evaluation of Difference)

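If $\varphi_1 = \varphi_2 \setminus \varphi_3$, then

$$\llbracket \varphi_1 \rrbracket_s = \llbracket \varphi_2 \rrbracket_s \setminus \llbracket \varphi_3 \rrbracket_s.$$

Example 29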

The result of the difference operation STUDENT∖SINGLE_STUDENT is depicted in table 13.

STUDENT

ID Name Field

1 Alice Computer Science

Table 13: Difference of two relations.

Definition 44 (Evaluation of Cross Join)

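Let $\varphi_1 = \varphi_2 \times \varphi_3$; then the evaluation of the cross join is defined as:

$$\llbracket \varphi_1 \rrbracket_s := \{tu \mid \exists tu_1 \in \llbracket \varphi_2 \rrbracket_s : \exists tu_2 \in \llbracket \varphi_3 \rrbracket_s : (\forall A \in att(\varphi_2) : tu[A] = tu_1[A]) \text{ and } (\forall A \in att(\varphi_3) : tu[A] = tu_2[A])\}$$

Example 30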

Consider the relations STUDENT and CITY depicted in table 1 and table 5, respectively. The cross product of the two relations, $STUDENT \times CITY$, is depicted in table 14.

ID Name Field CityName PostalCode

1 Alice Computer Science Koblenz 56068

1 Alice Computer Science Cologne 50667

2 Bob Computer Science Koblenz 56068

2 Bob Computer Science Cologne 50667


Definition 45 (Evaluation of Theta Join)

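Let $\varphi_2, \varphi_3$ be relational algebra expressions over $S$ and let $\varphi_1 = \varphi_2 \bowtie_{cond_A} \varphi_3$. Then the evaluation of the theta join is defined as:

$$\llbracket \varphi_1 \rrbracket_s := \llbracket \sigma_{cond_A}(\varphi_2 \times \varphi_3) \rrbracket_s$$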

Example 31

The evaluation of the theta join $ADDRESS \bowtie_{PostalCode=NULL} CODES$ is depicted in table 15.

Name City PostalCode Code Description

Bob Cologne NULL 1 Postal Code

Bob Cologne NULL 2 Incomplete

Carol NULL NULL 1 Postal Code

Carol NULL NULL 2 Incomplete

Table 15: Table depicting evaluation of join with NULL in join condition.

Example 32

Executing the join $STUDENT \bowtie_{Name=Name} PERSON$ would result in the relation depicted in table 16.

ID Name Field Name City Age

1 Alice Computer Science Alice Koblenz 22

2 Bob Computer Science Bob Cologne 30

Table 16: Result of join operation.

Definition 46 (Evaluation of Left Outer Join)

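Let $\varphi_2, \varphi_3$ be relational algebra expressions over $S$ and let $\varphi_1 = \varphi_2 ⟕_{cond_A} \varphi_3$; then the evaluation of the left outer join is defined as:

$$\llbracket \varphi_1 \rrbracket_s = \llbracket (\varphi_2 \bowtie_{cond_A} \varphi_3) \uplus (\varphi_2 \setminus \pi_{att(\varphi_2)}(\varphi_2 \bowtie_{cond_A} \varphi_3)) \rrbracket_s$$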

Example 33

The relation that results from the evaluation of the left outer join of the two tables STUDENT and GRADES, $STUDENT ⟕_{ID=StudentID} GRADES$, is depicted in table 17.

ID Name Field StudentID AverageGrade

1 Alice Computer Science 1 1.7

2 Bob Computer Science NULL NULL
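Definition 46 reduces the left outer join to a theta join, a projection, a difference and an outer union. The following Python sketch traces this composition on STUDENT and GRADES. The list-of-dictionaries encoding, the use of `None` for NULL and all function names are assumptions made for illustration, and the attribute sets of the two inputs are assumed to be disjoint.

```python
def cross(r1, r2):
    """Cross join: combine every tuple of r1 with every tuple of r2."""
    return [{**t1, **t2} for t1 in r1 for t2 in r2]


def theta_join(r1, r2, cond):
    """Theta join: selection with cond over the cross join."""
    return [tu for tu in cross(r1, r2) if cond(tu)]


def project(relation, attrs):
    """Projection with set semantics."""
    out = []
    for tu in relation:
        restricted = {a: tu[a] for a in attrs}
        if restricted not in out:
            out.append(restricted)
    return out


def difference(r1, r2):
    """Tuples of r1 that do not occur in r2."""
    return [tu for tu in r1 if tu not in r2]


def outer_union(r1, attrs1, r2, attrs2):
    """Outer union: pad attributes missing in a tuple with None (NULL)."""
    attrs = list(attrs1) + [a for a in attrs2 if a not in attrs1]
    return [{a: tu.get(a) for a in attrs} for tu in r1 + r2]


def left_outer_join(r1, attrs1, r2, attrs2, cond):
    """(r1 join r2) outer-union (r1 minus projection of the join onto att(r1))."""
    joined = theta_join(r1, r2, cond)
    unmatched = difference(r1, project(joined, attrs1))
    return outer_union(joined, attrs1 + attrs2, unmatched, attrs1)


STUDENT = [
    {"ID": 1, "Name": "Alice", "Field": "Computer Science"},
    {"ID": 2, "Name": "Bob", "Field": "Computer Science"},
]
GRADES = [{"StudentID": 1, "AverageGrade": 1.7}]

result = left_outer_join(
    STUDENT, ["ID", "Name", "Field"],
    GRADES, ["StudentID", "AverageGrade"],
    lambda tu: tu["ID"] == tu["StudentID"],
)
for row in result:
    print(row)
```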


2.6. Structured Query Language (SQL)

In order to query relational data the Structured Query Language (SQL) is used. In fact SQL is based on relational algebra. The translation of SQL to relational algebra is defined in [11]. In the context of this master thesis SQL is needed to query the virtualized RDF graph.

In order to query relational data so called SQL SELECT queries can be used. An example of such a query retrieving the name of the person with the id 1 from the relation depicted in table 1 is shown in listing 9.

SELECT NAME FROM STUDENT WHERE ID = 1

Listing 9: SQL query retrieving name.

This query would return Alice as NAME. Here, SELECT NAME corresponds to the projection $\pi_{NAME}(\varphi_1)$ of relational algebra, where $\varphi_1$ is defined by the rest of the query. The clause WHERE ID = 1 corresponds to the selection $\sigma_{ID=1}(\varphi_2)$, where $\varphi_2$ is defined by FROM STUDENT, which states that the selection is executed on the table STUDENT. Therefore, the complete relational algebra expression of the query depicted in listing 9 is $\pi_{NAME}(\sigma_{ID=1}(STUDENT))$.
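The query from listing 9 can be tried out with Python's built-in `sqlite3` module. The choice of an in-memory SQLite database and the column types are assumptions made for illustration; the rows are those of the STUDENT relation used in the examples.

```python
import sqlite3

# In-memory database holding the STUDENT relation (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ID INTEGER PRIMARY KEY, NAME TEXT, FIELD TEXT)")
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?)",
    [(1, "Alice", "Computer Science"), (2, "Bob", "Computer Science")],
)

# pi_NAME(sigma_{ID=1}(STUDENT)) expressed as a SELECT query.
rows = conn.execute("SELECT NAME FROM STUDENT WHERE ID = 1").fetchall()
print(rows)
```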

Besides SELECT queries, the creation of views is also needed in this work. A view is a stored SELECT query whose result is presented as a table. Besides normal views, materialized views exist. The difference is that in a materialized view the results of the SELECT query are physically stored in the database, whereas in a non-materialized view the SELECT query is evaluated every time the view is accessed. In listing 10 an example of a query that creates a materialized view is given.

CREATE MATERIALIZED VIEW nameView AS
SELECT NAME FROM STUDENT WHERE ID = 1


3. Ontology Based Data Access

In order to integrate a relational database into an RDF graph, Ontology Based Data Access (OBDA) is used. In this section a formal framework for OBDA is presented, which is used to formalize the OBDA system that has been implemented in this thesis.

3.1. Mapping

In ontology based data access a relational schema S is mapped onto an ontology O based on a mapping M such that SPARQL queries can be issued against an instance of S. Thereby, the ontology O serves as the global schema of the data. This means that once a mapping has been defined, a user of the OBDA system does not need any knowledge of the underlying relational data. The user can simply issue SPARQL queries against the ontology and retrieve the desired information. Hereinafter, mappings from relational data onto RDF data are defined.

Definition 47 (Mapping Templates)

An arbitrary string θ is a mapping template. In a mapping template, substrings of the form {A}, where A is an attribute of a relation, can occur and denote template variables.⁷ Furthermore, att(θ) denotes the set of attribute names in θ.

Example 34

The string http://www.uni.com/student/{ID} is a mapping template with one template variable in it, namely {ID}.

Definition 48 (Evaluation of Mapping Templates)

Let A be an attribute of a relation and let tu be a tuple in a relation. The evaluation of a single template variable {A} is defined as follows:

$$\llbracket \{A\} \rrbracket_{tu} := str(tu[A])$$

Here, str denotes the function that creates a string from a given input. The evaluation $\llbracket \theta \rrbracket_{tu}$ of a mapping template for a given tuple tu is the string obtained by replacing each template variable in θ with the evaluation of the template variable. Furthermore, let R be the relation schema of tu; then att(θ) ⊆ att(R).

Example 35

Consider the tuple tu = <1, Alice, Computer Science> from table 1. The evaluation is $\llbracket \text{http://www.uni.com/student/\{ID\}} \rrbracket_{tu}$ = http://www.uni.com/student/$\llbracket \{ID\} \rrbracket_{tu}$ = http://www.uni.com/student/1.

⁷ If { or } are used within θ without being used as markup for a template variable, they have to be escaped as \{ or \}. Consequently, \ also has to be escaped as \\ if it is not used as an escape character.
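The evaluation of a mapping template, including the escaping rules from the footnote, can be sketched in Python as follows. The function name `eval_template` and the representation of a tuple as a dictionary are assumptions made for illustration.

```python
def eval_template(template, tu):
    """Replace each template variable {A} by str(tu[A]); honour \\{, \\} and \\\\ escapes."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\":
            # Escaped character: emit the following character literally.
            if i + 1 >= len(template):
                raise ValueError("dangling escape character at end of template")
            out.append(template[i + 1])
            i += 2
        elif ch == "{":
            # Template variable: everything up to the next closing brace is the attribute name.
            end = template.index("}", i)
            attribute = template[i + 1:end]
            out.append(str(tu[attribute]))
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


tu = {"ID": 1, "Name": "Alice", "Field": "Computer Science"}
iri = eval_template("http://www.uni.com/student/{ID}", tu)
print(iri)
```

Scanning the template character by character keeps the handling of escaped braces explicit; a plain search-and-replace would treat \{ID\} as a template variable.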


Definition 49 (Mapping Rule)

Let $\varphi$ be a relational algebra expression, $\theta_1$ and $\theta_2$ mapping templates and $iri \in I$; then a mapping rule is:

$$\varphi \rightsquigarrow (\theta_1, iri, \theta_2)$$

Example 36

An example of a mapping rule, which defines that for each student in table 1 a triple should be created where the subject contains the ID, the predicate is always the IRI http://www.uni.com/name and the object is an IRI including the value stored in the NAME column of the relation STUDENT, is:

STUDENT ↝ (http://www.uni.com/student/{ID}, http://www.uni.com/name, http://www.uni.com/student/{NAME})

Definition 50 (Evaluation of Mapping Rule)

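The evaluation of a mapping rule over an instance s of a relational schema is a set of triples:

$$\llbracket \varphi \rightsquigarrow (\theta_1, iri, \theta_2) \rrbracket_s = \{(\llbracket \theta_1 \rrbracket_{tu}, iri, \llbracket \theta_2 \rrbracket_{tu}) \mid tu \in \llbracket \varphi \rrbracket_s\}$$

Example 37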

The evaluation of the mapping rule shown in example 36 results in the two triples depicted in listing 11.

<http://www.uni.com/student/1> <http://www.uni.com/name> <http://www.uni.com/student/Alice>.
<http://www.uni.com/student/2> <http://www.uni.com/name> <http://www.uni.com/student/Bob>.

Listing 11: RDF triples resulting from evaluation of mapping rules.

Definition 51 (Mapping)

A mapping M is a set of mapping rules.
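The evaluation of a mapping rule from definition 50 can be sketched in Python. The function names, the dictionary encoding of tuples and the simplified template evaluation (a regular expression without escape handling) are assumptions made for illustration.

```python
import re


def eval_template(template, tu):
    """Simplified template evaluation: replaces {A} by str(tu[A]); escapes are not handled."""
    return re.sub(r"\{(\w+)\}", lambda m: str(tu[m.group(1)]), template)


def eval_rule(relation, subject_template, iri, object_template):
    """Evaluate phi ~> (theta1, iri, theta2): one triple per tuple of the evaluated expression."""
    return {
        (eval_template(subject_template, tu), iri, eval_template(object_template, tu))
        for tu in relation
    }


STUDENT = [
    {"ID": 1, "NAME": "Alice", "FIELD": "Computer Science"},
    {"ID": 2, "NAME": "Bob", "FIELD": "Computer Science"},
]

triples = eval_rule(
    STUDENT,
    "http://www.uni.com/student/{ID}",
    "http://www.uni.com/name",
    "http://www.uni.com/student/{NAME}",
)
for triple in sorted(triples):
    print(triple)
```

A mapping is then simply a collection of such rules, and its evaluation is the union of the triple sets of its rules.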

3.2. Formal Framework for Ontology Based Data Access

Now that all inputs of an OBDA system have been defined, a formal framework for OBDA is introduced. The definitions of the formal framework are based on [12] and on [1].


Definition 52 (OBDA Specification)

An OBDA specification (S, M, O) specifies how the relational schema S can be mapped onto the ontology O based on the mapping M such that the evaluation of all mapping rules in M results in valid RDF triples.

With the help of an OBDA specification, an instance s of a relational schema and therefore, the relational data in s can be mapped onto the ontology O.

Definition 53 (OBDA instance)

An OBDA instance is the tuple ((S, M, O), s) where s is the instance of a relational schema S.

SPARQL queries can be issued against an OBDA instance such that sets of variable mappings are returned that correspond to the triples that are created by the evaluation of each mapping rule in a mapping M. In order to also obtain results that are not explicitly stored in the data, but can be inferred with the help of the ontology, two approaches can be used. In the first approach the input mapping is saturated with additional rules, such that the mapping also creates all implicit triples.

Definition 54 (Mapping Saturation)

For a given mapping M and an ontology O the function sat(M, O) produces a saturated mapping M′, where M ⊆ M′. Thereby, M′ is the set of mapping rules that produces all triples produced by M and all implicit triples that can be inferred based on O.

Example 38

Consider the ontology consisting of one triple:

{(http://www.uni.com/BachelorStudent, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.uni.com/Student)}

This ontology defines that each bachelor student is also a student. Consider the following mapping M.

{STUDENT ↝ (http://www.uni.com/student/{ID}, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.uni.com/BachelorStudent)}

Based on the ontology, a mapping rule that defines that each bachelor student is also a student has to be added to the mapping in order to create a saturated mapping.


The saturated mapping M′ is depicted below.

{STUDENT ↝ (http://www.uni.com/student/{ID}, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.uni.com/BachelorStudent),

STUDENT ↝ (http://www.uni.com/student/{ID}, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.uni.com/Student)}

The second approach to also consider implicit knowledge when querying an OBDA instance is to extend queries according to the given ontology.

Definition 55 (Query Extension)

For a SPARQL SELECT query Q and an ontology O the function extend(Q, O) extends the query Q based on the ontology O to the query Q′. Thereby, Q′ returns all variable bindings that would have been returned if Q had been executed on an RDF graph that includes all implicit triples based on O.

Example 39

The query depicted in listing 12 retrieves all vertices that are of the type student. Consider the ontology from example 38. The ontology defines that each bachelor student is also a student. Therefore, the query can be extended to the query depicted in listing 13. Thereby, the union of all students and bachelor students is created to obtain all implicit results.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX uni: <http://www.uni.com/>

SELECT ?s WHERE {
    ?s rdf:type uni:Student
}

Listing 12: Input query to an OBDA system.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX uni: <http://www.uni.com/>

SELECT ?s WHERE {
    {?s rdf:type uni:Student}
    UNION
    {?s rdf:type uni:BachelorStudent}
}

Listing 13: Ontology based extended query.
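For the special case of a single rdf:type triple pattern and subClassOf axioms, the extension step can be sketched in Python. This is only an illustration of the idea and not the extend function itself; the helper names `subclasses` and `extend_type_pattern` and the representation of the ontology as a list of (subclass, superclass) pairs are hypothetical.

```python
def subclasses(cls, subclass_of):
    """Return cls followed by all classes that are transitively subclasses of cls."""
    found = [cls]
    for current in found:  # the list grows while it is traversed
        for sub, sup in subclass_of:
            if sup == current and sub not in found:
                found.append(sub)
    return found


def extend_type_pattern(var, cls, subclass_of):
    """Build the UNION of rdf:type patterns for cls and all of its subclasses."""
    patterns = ["{ %s rdf:type %s }" % (var, c) for c in subclasses(cls, subclass_of)]
    return " UNION ".join(patterns)


ontology = [("uni:BachelorStudent", "uni:Student")]
extended = extend_type_pattern("?s", "uni:Student", ontology)
print(extended)
```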

In order to obtain results from the underlying relational database, the SPARQL query has to be rewritten to a SQL query that retrieves the desired results. The SPARQL query is rewritten based on the mapping. The SQL query is then issued against the underlying relational database and the result of the SQL query is transformed into the respective SPARQL results, which are then returned to the user of the system.

Definition 56 (Relation to Variable Binding Transformation)

Given a relation schema R and an instance r of this schema, the function transform(r) transforms the relation into a set of variable bindings:

$$transform(r) = \{\mu \mid tu \in r \text{ and } \mu = \{(toVar(A), tu[A]) \mid A \in att(R) \text{ and } tu[A] \neq NULL\}\}$$

Thereby, the static function toVar(A) creates a SPARQL variable from an attribute name.

Example 40

Consider the relation depicted in table 1. The result of transform(STUDENT) is:

transform(STUDENT) = {{(?ID, 1), (?NAME, Alice), (?FIELD, Computer Science)}, {(?ID, 2), (?NAME, Bob), (?FIELD, Computer Science)}}

Definition 57 (Query Rewriting)

Given a SPARQL query Q and an OBDA instance ((S, M, O), s), the function rewrite(Q, M) rewrites Q to a SQL query such that:

$$transform(\llbracket rewrite(extend(Q, O), sat(M, O)) \rrbracket_s) = \llbracket Q \rrbracket_{\llbracket sat(M, O) \rrbracket_s}$$

In figure 4 the dataflow in an OBDA system is depicted. The mapping saturation and query extension based on the ontology are the first steps in the figure. After that the extended SPARQL query is translated to SQL with the help of the mapping. The resulting SQL query is executed on the underlying relational database and the query results are transformed to variable bindings.
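The last step of this dataflow, the transformation of SQL results into variable bindings as given in definition 56, can be sketched in Python. The encoding of a relation as a list of dictionaries, the use of `None` for NULL and the function names are assumptions made for illustration; the rows are those of the ADDRESS relation from table 2.

```python
def to_var(attribute):
    """Create a SPARQL variable name from an attribute name."""
    return "?" + attribute


def transform(relation):
    """One variable binding per tuple; attributes that are NULL (None) stay unbound."""
    return [
        {to_var(attribute): value for attribute, value in tu.items() if value is not None}
        for tu in relation
    ]


ADDRESS = [
    {"Name": "Alice", "City": "Koblenz", "PostalCode": "56073"},
    {"Name": "Bob", "City": "Cologne", "PostalCode": None},
    {"Name": "Carol", "City": None, "PostalCode": None},
]

bindings = transform(ADDRESS)
for binding in bindings:
    print(binding)
```

Leaving NULL attributes unbound matches the condition tu[A] ≠ NULL in the definition.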

[Figure 4: Dataflow in an OBDA system. Nodes: Mapping, Ontology, Saturate Mapping, Saturated Mapping, SPARQL Query, Extend Query, Extended SPARQL Query, Translate Query, SQL Query, Execute SQL Query, Relational Database, SQL Results, Transform Results, SPARQL Results.]


Figure 5: Dataflow in Ultrawrap^OBDA.

4. Ultrawrap

Ultrawrap^OBDA is an OBDA system that was developed by Juan Sequeda [1]. The system allows for querying relational data stored in an Oracle database with SPARQL. Before Ultrawrap^OBDA can be extended, the OBDA system is presented in this section. Ultrawrap^OBDA differentiates between the compilation phase and the runtime phase. In figure 5 the dataflow in Ultrawrap^OBDA is depicted.

4.1. Compilation Phase

In the compilation phase Ultrawrap^OBDA is prepared to allow for querying relational data with SPARQL in the later runtime phase. The input of the compilation phase is an ontology O, a mapping M, a relational schema S and an instance s of S. Formally speaking, the input forms an OBDA instance ((S, M, O), s). The OBDA system supports implicit triples by saturating the input mapping with additional mapping rules. In Ultrawrap^OBDA the saturation of the mapping is achieved with inference rules of the form $(s, p, o) : \frac{\rho_1}{\rho_2}$, where, given a triple (s, p, o) in the ontology, a mapping rule ρ2 is returned if a mapping rule ρ1 exists in the mapping. The mapping rules that are used in Ultrawrap^OBDA only allow complete strings as mapping templates and no template variables in the mapping templates. The inference rules are listed in definition 58.

Definition 58 (Ultrawrap Inference Rules)

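Let A and B be IRIs, let θ, θ1 and θ2 be mapping templates and let ϕ be a relational algebra expression; then the inference rules are defined as:

$$(A, subClassOf, B) : \frac{\varphi \rightsquigarrow (\theta, type, A)}{\varphi \rightsquigarrow (\theta, type, B)}$$

$$(A, subProperty, B) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_1, B, \theta_2)}$$

$$(A, domain, B) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_1, type, B)}$$

$$(A, range, B) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_2, type, B)}$$

$$(A, equivalentClass, B) \text{ or } (B, equivalentClass, A) : \frac{\varphi \rightsquigarrow (\theta, type, A)}{\varphi \rightsquigarrow (\theta, type, B)}$$

$$(A, equivalentProperty, B) \text{ or } (B, equivalentProperty, A) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_1, B, \theta_2)}$$

$$(A, inverseProperty, B) \text{ or } (B, inverseProperty, A) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_2, B, \theta_1)}$$

$$(A, symmetricProperty, B) \text{ or } (B, symmetricProperty, A) : \frac{\varphi \rightsquigarrow (\theta_1, A, \theta_2)}{\varphi \rightsquigarrow (\theta_2, A, \theta_1)}$$

These inference rules are applied until a fixed point is reached and the set of rules does not change anymore. The saturation of mappings is defined in the following definition.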

Definition 59 (Saturation of Mappings)

Let M be a mapping and O an ontology; then a single saturation step is defined as:

$$sat'(M, O) = M \cup \left\{ m \;\middle|\; m = (s, p, o) : \frac{\rho_1}{\rho_2} \text{ where } (s, p, o) \in O \text{ and } \rho_1 \in M \right\} \quad (1)$$

Subsequently, the mapping saturation function sat is defined as:

$$sat(M, O) = \begin{cases} M, & \text{if } sat'(M, O) = M \\ sat(sat'(M, O), O), & \text{otherwise} \end{cases} \quad (2)$$
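The fixed-point computation of definition 59 can be sketched in Python. This is an illustrative reading of the inference rules of definition 58, not the actual Ultrawrap implementation: mapping rules are encoded as tuples (ϕ, subject template, predicate, object template), ontology axioms as triples with abbreviated predicate names, and the abbreviation `rdf:type` stands for the type IRI. All of these encodings are assumptions.

```python
TYPE = "rdf:type"


def infer(axiom, rule):
    """Apply one ontology triple (a, p, b) to one mapping rule; return the derived rules."""
    a, p, b = axiom
    phi, subj, pred, obj = rule
    derived = set()
    if p == "subClassOf" and pred == TYPE and obj == a:
        derived.add((phi, subj, TYPE, b))
    elif p == "subProperty" and pred == a:
        derived.add((phi, subj, b, obj))
    elif p == "domain" and pred == a:
        derived.add((phi, subj, TYPE, b))
    elif p == "range" and pred == a:
        derived.add((phi, obj, TYPE, b))
    elif p == "equivalentClass" and pred == TYPE and obj in (a, b):
        derived.add((phi, subj, TYPE, b if obj == a else a))
    elif p == "equivalentProperty" and pred in (a, b):
        derived.add((phi, subj, b if pred == a else a, obj))
    elif p == "inverseProperty" and pred in (a, b):
        derived.add((phi, obj, b if pred == a else a, subj))
    elif p == "symmetricProperty" and pred in (a, b):
        derived.add((phi, obj, pred, subj))
    return derived


def sat_step(mapping, ontology):
    """Single saturation step sat': the mapping plus every rule derivable in one step."""
    result = set(mapping)
    for axiom in ontology:
        for rule in mapping:
            result |= infer(axiom, rule)
    return result


def sat(mapping, ontology):
    """Iterate sat' until the set of mapping rules no longer changes (fixed point)."""
    current = set(mapping)
    while True:
        following = sat_step(current, ontology)
        if following == current:
            return current
        current = following


ontology = {("uni:BachelorStudent", "subClassOf", "uni:Student")}
mapping = {("STUDENT", "http://www.uni.com/student/{ID}", TYPE, "uni:BachelorStudent")}
saturated = sat(mapping, ontology)
for rule in sorted(saturated):
    print(rule)
```

The loop terminates because every derived rule only recombines templates and IRIs that already occur in the mapping or the ontology, so the set of possible rules is finite.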

The formally defined mapping rules in the saturated mapping are implemented as SQL queries that create SQL views in the underlying relational database during the view selection. These SQL views depict a virtualized RDF graph. Thereby, a single view for each property and for each RDF type is created. These views are called tripleviews.

Example 41

Consider the relation depicted in table 18. This relation stores information about courses at a university and the primary key of the relation is the ID. Furthermore, consider a mapping rule which defines that for each tuple in the relation a triple should be created in which the subject contains the ID of the tuple, the predicate is http://www.uni.com/description and the object is the value stored in DESCRIPTION. The mapping is shown in (3).


COURSES

ID NAME LECTURER DESCRIPTION

c1 Mathematics Dr. Strange Teaches the basics of mathematics.

c2 Physics Dr. Octavius Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves.

c3 Databases Dr. Acula Teaches relational algebra and SQL.

Table 18: Relation holding information about courses at a university.

COURSES ↝ (http://www.uni.com/course/{ID}, http://www.uni.com/description, "{DESCRIPTION}")    (3)

The corresponding SQL query that creates the tripleview is depicted in listing 14. Executing this query on the relation shown in table 18 would result in the view depicted in table 19. For each property and for each class such a tripleview is created.

CREATE VIEW descriptionView (S, P, O) AS
SELECT
    CONCAT('http://www.uni.com/course/', ID) AS S,
    'http://www.uni.com/description' AS P,
    CONCAT('"', DESCRIPTION, '"') AS O
FROM COURSES

Listing 14: SQL query creating SQL view based on mapping rule.

descriptionView

S P O

http://www.uni.com/course/c1 http://www.uni.com/description "Teaches the basics of mathematics."

http://www.uni.com/course/c2 http://www.uni.com/description "Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves."

http://www.uni.com/course/c3 http://www.uni.com/description "Teaches relational algebra and SQL."

Table 19: Triple view containing all triples with http://www.uni.com/description


There may exist more than one mapping rule creating triples with the same predicate. For instance, consider the following mapping rule.

COURSES ↝ (http://www.uni.com/course/{NAME}, http://www.uni.com/description, "{DESCRIPTION}")    (4)

Together with the mapping rule shown in (3), two subqueries are used in the view creation. The description view is created as the union of these subqueries. The respective SQL query is depicted in listing 15.

CREATE VIEW descriptionView (S, P, O) AS
SELECT
    CONCAT('http://www.uni.com/course/', ID) AS S,
    'http://www.uni.com/description' AS P,
    CONCAT('"', DESCRIPTION, '"') AS O
FROM COURSES
UNION
SELECT
    CONCAT('http://www.uni.com/course/', NAME) AS S,
    'http://www.uni.com/description' AS P,
    CONCAT('"', DESCRIPTION, '"') AS O
FROM COURSES

Listing 15: SQL query creating SQL view based on two mapping rules.

The SQL subqueries created by multiple mapping rules that define the same class are unioned analogously to create the respective view for the class.
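The creation of such a tripleview can be tried out with Python's built-in `sqlite3` module. Note the assumptions: Ultrawrap^OBDA targets an Oracle database, whereas this sketch uses an in-memory SQLite database, writes string concatenation with the `||` operator instead of CONCAT, and loads only two of the COURSES tuples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE COURSES (ID TEXT PRIMARY KEY, NAME TEXT, LECTURER TEXT, DESCRIPTION TEXT)"
)
conn.executemany(
    "INSERT INTO COURSES VALUES (?, ?, ?, ?)",
    [
        ("c1", "Mathematics", "Dr. Strange", "Teaches the basics of mathematics."),
        ("c3", "Databases", "Dr. Acula", "Teaches relational algebra and SQL."),
    ],
)

# Tripleview for the description property: one subquery per mapping rule, combined by UNION.
conn.execute("""
CREATE VIEW descriptionView AS
SELECT 'http://www.uni.com/course/' || ID AS S,
       'http://www.uni.com/description' AS P,
       '"' || DESCRIPTION || '"' AS O
FROM COURSES
UNION
SELECT 'http://www.uni.com/course/' || NAME AS S,
       'http://www.uni.com/description' AS P,
       '"' || DESCRIPTION || '"' AS O
FROM COURSES
""")

rows = conn.execute("SELECT S, O FROM descriptionView ORDER BY S").fetchall()
for row in rows:
    print(row)
```

Each source tuple contributes two triples to the view, one with the ID and one with the NAME in the subject.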

Definition 60 (View Function)

The function view(iri), with iri ∈ IRI, returns the respective view name for a property or a class.

The view function is needed to retrieve the view name for a given property or class in the later process of translating SPARQL queries into SQL queries.

Furthermore, a view is created that contains all triples independent of their property or class. This view is needed if there is a triple pattern in a SPARQL query where the predicate is a variable. If the predicate is a variable, the triple pattern cannot be mapped to any other tripleview and therefore it is mapped to the view with all triples in it. This view is called allTriplesView.

4.2. Tripleview Optimization

In order to reduce the execution time of SPARQL queries that are posed against the OBDA system, tripleviews may be optimized. Sequeda names three possible optimizations of the tripleviews:

1. Addition of primary key columns.

2. Creation of separate tripleviews for different data types.

3. Materialization of views.

1. Addition of primary key columns:

Indices optimize the performance of relational databases by minimizing the number of disk accesses required when a query is executed. An index stores a pointer with the physical address on a hard disk where information about a primary key is stored. Sequeda argues that, because the subject column S and the object column O in the tripleviews do not correspond to the primary keys of the source relation of the triple, SQL optimizers cannot leverage indexing to speed up query execution. Therefore, two additional columns can be added to a tripleview, namely S_pk, which denotes the primary key of the tuple from which the subject is taken, and O_pk, which does the same for the object. Thereby, O_pk is NULL if O is a literal and not an IRI.

Due to the fact that the views the system works with are actually queries that are executed whenever a view is accessed, the desired data is still stored in the source relations. Therefore, queries with these additional primary keys can exploit the indices and speed up queries because the joins are done on these values.

Example 42

Consider the description view from example 41. Adding primary keys from the source relation to the tripleview results in the view depicted in table 20. Hereby, the value in the primary key column for the object is NULL, because the objects are literals.

descriptionView

S S_pk P O O_pk

http://www.uni.com/course/c1 c1 http://www.uni.com/description "Teaches the basics of mathematics." NULL

http://www.uni.com/course/c2 c2 http://www.uni.com/description "Physics is one of the most fundamental scientific disciplines, and its main goal is to understand how the universe behaves." NULL

http://www.uni.com/course/c3 c3 http://www.uni.com/description "Teaches relational algebra and SQL." NULL

Table 20: Triple view having additional primary key columns.

2. Creation of separate tripleviews for different data types:

This optimization depends on the datatype of the object column in a tripleview. In the first step separate tripleviews were created depending on the predicate of a triple, or the class of an instance, as described above. These triples may have different source relations. All values in these tripleviews were cast to the datatype varchar. Sequeda argues that the size of the object column in a tripleview is the same as the biggest column from any of the source relations, where the column corresponds to the later object column of the tripleview. This leads to poor query performance. Therefore, separate tripleviews were created for the same property with different datatypes in the object column.

BOOKS

ID NAME DESCRIPTION

b1 Basics of Databases This book covers the basic topics of databases.

b2 Physics in a Nutshell A collection of physics formula.

Table 21: Relation holding information about literature used at a university.

Example 43

Consider the BOOKS relation depicted in table 21 that holds information about books used for teaching at a university. Furthermore, consider a mapping rule that creates triples from this relation, where the subject corresponds to the ID, the predicate is http://www.uni.com/description and the object is a literal that corresponds to the value stored in the DESCRIPTION column. The mapping looks like:

BOOKS ↝ (http://www.uni.com/books/{ID}, http://www.uni.com/description, "{DESCRIPTION}")    (5)

Now consider that the DESCRIPTION column in the BOOKS relation is of the type varchar(50) and that the DESCRIPTION column in the COURSES relation depicted in table 18 is of the type varchar(150). Even though the mapping rules define that from both relations triples should be created where the predicate is http://www.uni.com/description, the triples would not be stored in the same tripleview because the object columns are of different data types. Actually, two tripleviews for the property http://www.uni.com/description would be created: one for the triples where the object column has the datatype varchar(50) and one where the datatype is varchar(150).

3. Materialization of views:

In Ultrawrap^OBDA a distinction is drawn between tripleviews and materialized tripleviews. Hereby, tripleviews are stored as queries, which are executed whenever a view is accessed. Materialized tripleviews, on the other hand, are stored as actual relations. This means that the underlying query does not have to be executed when the materialized tripleview is accessed.
