In order to make use of labeled graphs G = (V, Σ, E) as a model for data, we have to be clear about what the nodes and edges mean. Thus, we specify what the objects that make up G represent.

Usually, graph databases capture entity-centric information: entities are represented as nodes, their properties/attributes as edges to actual data values, and their relations to other database objects also as edges. Remaining in the realm of labeled graphs, we have at least two types of nodes, one representing entities and one for data items such as string or number objects. One of the most general and stable data models stems from the impressive standardization efforts of the Semantic Web community and the World Wide Web Consortium (W3C), which aim to build an infrastructure of machine-readable semantics for the data on the Web [13].

2.2.1 The Resource Description Framework

The Resource Description Framework, RDF for short, provides a simple and extensible data model that comes with a formal semantics. It has been a W3C recommendation since 1999 and has since attracted much attention from researchers and practitioners. The current recommendation is RDF 1.1 [38, 67]. As the name suggests, RDF allows for expressing information about resources. A resource can be anything, from Web documents to physical objects or actual people [122].

Modeling information in RDF means to formulate statements about resources, following the simple structure of

subject predicate object.

Subject and object are resources related by the predicate. Because RDF statements consist of three components, they are commonly referred to as RDF triples. A set of RDF triples makes up an RDF graph. Three different types of data may occur in RDF triples, namely IRIs, literals, and blank nodes [122].

Every resource is uniquely identified by an Internationalized Resource Identifier [44] (IRI), a generalization of Uniform Resource Identifiers (URIs). IRIs may occur in the subject, object, as well as predicate position of an RDF triple. Technically speaking, predicates are resources, which makes sense as soon as we think of statements about relationship types. For example, we may want to express that "is child of" is the inverse relation of "is parent of". IRIs are thought of as global identifiers, i. e., if two different people talk about the same IRI, they refer to the same object. URLs are an essential subset of IRIs, referencing Web locations.

Literals are data values, not represented as IRIs. They come with a data type, such as string, int, or date (cf. [38] for a list of valid data types). Such data values are used to define attribute values of a resource, such as a date of birth or a person's address, or title, author, or publication year of a book. Therefore, literals solely occur in object position.

Finally, RDF provides us with the possibility of expressing anonymous resources, called blank nodes. According to [38], blank nodes have a local scope, i. e., they are not to be referenced outside an RDF graph. They can be used in subject and object positions and refer to some unnamed data objects.

Let I, L, B be disjoint universes of IRIs, literals, and blank nodes. An RDF triple is a triple (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L). A set of RDF triples G is an RDF graph.

Throughout this thesis, we consider so-called ground RDF graphs [62], which are subsets of I × I × (I ∪ L), i. e., they are free of blank nodes.
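The two membership conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names IRI, Literal, and BlankNode and the helpers is_rdf_triple and is_ground are hypothetical, and IRIs are written in abbreviated (prefixed) form for brevity.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the disjoint universes I, L, and B.
@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str

@dataclass(frozen=True)
class BlankNode:
    label: str

def is_rdf_triple(triple):
    """Check (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L)."""
    s, p, o = triple
    return (isinstance(s, (IRI, BlankNode))
            and isinstance(p, IRI)
            and isinstance(o, (IRI, BlankNode, Literal)))

def is_ground(graph):
    """Check that the graph is a subset of I × I × (I ∪ L)."""
    return all(isinstance(s, IRI)
               and isinstance(p, IRI)
               and isinstance(o, (IRI, Literal))
               for (s, p, o) in graph)

birth = (IRI("dbr:Albert_Einstein"), IRI("dbo:birthDate"),
         Literal("1879-03-14", "xsd:date"))
anonymous = (BlankNode("b0"), IRI("dbo:spouse"), IRI("dbr:Albert_Einstein"))

ground_graph = {birth}
mixed_graph = {birth, anonymous}
```

Here ground_graph passes both checks, whereas mixed_graph consists of valid RDF triples but is not ground because of the blank node in subject position.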

[Figure 2.5 (a), drawing not reproduced: nodes dbr:Albert_Einstein, dbr:Mileva_Marić, dbr:Physics, dbr:The_Evolution_of_Physics, dbr:Cambridge_University_Press, dbo:Person, dbo:Scientist, and the literal 1879-03-14 (xsd:date); edge labels dbo:birthDate, dbo:spouse (twice), dbo:field, dbo:nonFictionSubject, dbo:author, dbo:publisher, rdfs:type (twice), and rdfs:subClassOf.]

[Figure 2.5 (b), drawing not reproduced: node dbo:birthDate with edges rdf:type to rdf:Property, rdfs:domain to dbo:Person, and rdfs:range to xsd:date.]

[Figure 2.5 (c), drawing not reproduced: the graph of panel (a) with prefixes dropped, i. e., nodes Albert Einstein, 1879-03-14, Mileva Marić, Physics, The Evolution of Physics, Cambridge University Press, Person, and Scientist; edge labels birthDate, spouse, field, nonFictionSubject, author, publisher, type, and subClassOf.]

Figure 2.5: (a) Graph Representation of an Example RDF Graph from DBpedia [17] (b) An RDF Graph Describing the Predicate dbo:birthDate (c) A Graph Database Representation of Figure 2.5 (a)

Example 2.10 If an RDF graph G ⊆ (I ∪ B) × I × (I ∪ B ∪ L) does not contain statements about predicates, it may be represented as a labeled graph G(G), as defined in Definition 2.6. All subjects and objects occurring in G make up the set of nodes of G(G). All predicates form the labeling alphabet. The set of edges is the RDF graph G itself, i. e.,

G(G) = ({s, o | (s, p, o) ∈ G}, {p | (s, p, o) ∈ G}, G).

Thus, many RDF graphs can be graphically represented as labeled graphs. An example, manually extracted from DBpedia [17], is shown in Figure 2.5 (a). As the nodes' identities are essential for RDF, they appear as node labels in the center of the respective nodes. From now on, we solely rely on this kind of graphical notation of data modeled by graphs.
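The construction of G(G) can be sketched as follows. The function names labeled_graph and has_statements_about_predicates are hypothetical, resources are represented by plain strings, and the three sample triples are a reading of Figure 2.5 (a); the direction of the dbo:spouse edge in particular is an assumption.

```python
def labeled_graph(triples):
    """Build G(G) = (nodes, labels, edges) from a set of (s, p, o) triples."""
    edges = frozenset(triples)
    # Every subject and every object becomes a node.
    nodes = frozenset(x for (s, _, o) in edges for x in (s, o))
    # Every predicate becomes a symbol of the labeling alphabet.
    labels = frozenset(p for (_, p, _) in edges)
    return nodes, labels, edges

def has_statements_about_predicates(triples):
    """True if some predicate also occurs in subject or object position."""
    nodes, labels, _ = labeled_graph(triples)
    return bool(nodes & labels)

G = {
    ("dbr:Albert_Einstein", "dbo:birthDate", "1879-03-14"),
    ("dbr:Albert_Einstein", "dbo:spouse", "dbr:Mileva_Marić"),
    ("dbo:Scientist", "rdfs:subClassOf", "dbo:Person"),
}

nodes, labels, edges = labeled_graph(G)
```

For this sample, has_statements_about_predicates(G) is false, so the precondition of the example holds. Adding a triple such as ("dbo:birthDate", "rdfs:range", "xsd:date") would make it true, because dbo:birthDate would then be both a node and an edge label.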

This RDF graph contains information about dbr:Albert_Einstein, the resource to access information about the person Albert Einstein. DBpedia introduces prefixes to shorten IRIs, for representational purposes as well as to reduce the size of RDF dataset dump files. For instance, the prefix dbr: expands to the URL http://dbpedia.org/resource/.

Hence, dbr:Physics actually represents http://dbpedia.org/resource/Physics, the URL linking to a DBpedia page with information about the scientific field of physics. We have one literal, being the date of birth of Albert Einstein. The string in parentheses specifies the type of the literal, here xsd:date, a data type defined by the XML Schema Definition (XSD).
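Prefix expansion as described above amounts to a simple string substitution. The following is a minimal sketch; the table PREFIXES and the function expand are hypothetical names, and only the dbr: mapping stated in the text is included.

```python
# Prefix table; only the dbr: mapping given in the text is listed here.
PREFIXES = {"dbr": "http://dbpedia.org/resource/"}

def expand(name, prefixes=PREFIXES):
    """Expand an abbreviated name such as 'dbr:Physics' into a full IRI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local

physics_iri = expand("dbr:Physics")
```

Names with a prefix that is not in the table are rejected rather than guessed.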

Also included in this excerpt of DBpedia is some schema information prefixed by rdfs:. These triples state that Albert Einstein, represented by the resource dbr:Albert_Einstein, is of the types person and scientist, represented by the DBpedia ontology classes dbo:Person and dbo:Scientist. Every object of type scientist is also a person, stated by the triple (dbo:Scientist, rdfs:subClassOf, dbo:Person).

As suggested by the font used for the predicates, the edge labels are also resources and may, as such, be part of RDF statements. For instance, the predicate dbo:birthDate is itself described by an RDF graph, from which we draw an excerpt in Figure 2.5 (b). It specifies the domain and range of the predicate, which can be used as a constraint when inserting a concrete RDF triple with this predicate. In this example, only persons may have associated birth dates, which must be of type xsd:date. The graph in Figure 2.5 (a) conforms to these constraints. However, integrating both graphs into a single graphical representation leaves the realm of standard graphs [66], as not all information about dbo:birthDate is collected in a single place, that is, the node labeled dbo:birthDate.
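Reading domain and range as insertion constraints, as the paragraph above does, a check could look like the following sketch. Everything here is an illustrative assumption: the dictionaries SCHEMA and TYPES, the function conforms, and the encoding of literals as (value, datatype) pairs. The sketch only covers predicates whose range is a data type, and it uses explicitly stated types without any subclass reasoning.

```python
# Domain and range of dbo:birthDate, as given in Figure 2.5 (b).
SCHEMA = {"dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"}}

# Explicitly stated types of resources, as in Figure 2.5 (a).
TYPES = {"dbr:Albert_Einstein": {"dbo:Person", "dbo:Scientist"}}

def conforms(triple, schema=SCHEMA, types=TYPES):
    """Check a triple against the domain/range constraint of its predicate."""
    s, p, o = triple
    rule = schema.get(p)
    if rule is None:
        return True  # no constraint recorded for this predicate
    subject_ok = rule["domain"] in types.get(s, set())
    # Literals are modeled as (value, datatype) pairs in this sketch.
    object_ok = isinstance(o, tuple) and len(o) == 2 and o[1] == rule["range"]
    return subject_ok and object_ok

einstein_birth = ("dbr:Albert_Einstein", "dbo:birthDate",
                  ("1879-03-14", "xsd:date"))
physics_birth = ("dbr:Physics", "dbo:birthDate", ("1879-03-14", "xsd:date"))
```

With these definitions, einstein_birth conforms, while physics_birth is rejected because dbr:Physics is not typed as dbo:Person.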

As already mentioned, and enforced by the W3C, an IRI can refer to anything, making RDF highly extensible towards so-called vocabularies that capture the semantics of resources and statements [122]. RDF supports the definition of such vocabularies by incorporating RDF Schema (RDFS), which deals with typing of entities, building hierarchies of classes, and putting restrictions on domains/ranges of predicates. To cope with these and other extensions, RDF comes with a model-theoretic semantics [67] that formally captures all such features. However, our view on RDF shall be restricted to a basic representational level because our focus will be on querying explicit extensions of graph databases. We formally substantiate this representational level by the notion of graph databases, grounded in the principles of RDF graphs. We provide further information about the capabilities of RDF to express data schemas in Section 2.3.5.

2.2.2 Graph Databases

From an RDF perspective, we use the grounded model of graph data and ignore the entailment capabilities of RDFS vocabularies. We do acknowledge there are universes of objects U, to be used as graph nodes, and predicates P, used as edge labels. For ease of notation, U captures everything that can be in subject or object position, including predicates and literals. Note that this automatically implies non-disjoint universes U and P. Therefore, we work with a non-standard graph model G = (V, Σ, E) with a set of objects V and a set of predicates Σ, but V ∩ Σ = ∅ does not generally hold. Although a node's neighborhood does not exhaustively describe a single node [66], the following contents will not suffer from this inconvenience. Beyond Example 2.10, there will be no example that uses RDF (sub-)graphs dealing with predicates as resources explicitly.

As we are concerned with graph databases extensionally, there is also no need to include blank nodes. Even if we used RDFS vocabulary and blank nodes, Gutierrez et al. have shown that the maximal extension, called the closure, that can be derived from all the implicit information present in an RDF graph is unique [62]. Hence, we would always work with the closure of an RDF graph (cf. Theorem 3.6 [62]).

Definition 2.11 (Graph Database)

A graph database is a labeled graph DB = (O_DB, Σ, E_DB) where O_DB ⊊ U and Σ ⊊ P.

In contrast to alternative definitions, e. g., the one given by Hayes and Gutierrez [66], we omit auxiliary labeling functions of nodes and edges but assume database objects (O_DB) and predicates (Σ) to be identical with their respective labels.

Example 2.12 The graphs depicted in Figures 2.5 (a) and 2.5 (b) already are visualizations of graph databases. We will, however, make the notation easier. Every object will be represented as a box labeled by its identifier, written in typewriter font. We do not insist on using IRIs and make no distinction between resources and literals. Predicates will have an italicized font. Thus, a simplified graph database representation of our RDF graph sample on Albert Einstein (cf. Figure 2.5 (a)) is the one depicted in Figure 2.5 (c).

Note that graph morphisms (cf. Definition 2.8 on Page 11) serve a purely structural comparison purpose and are later used extensively for different querying tasks. Mapping different database objects to one another may account for structural similarity, but an object's identity carries information that gets lost under graph homomorphisms. Having removed blank nodes and RDFS vocabulary from our graph database model, deciding equality of two graph databases DB1 and DB2 boils down to actual equality of the databases' objects and edges, i. e., DB1 ⊆ DB2 and DB2 ⊆ DB1.
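Equality by mutual containment can be sketched as below. The names GraphDB, from_edges, contained_in, and equal are hypothetical, and reading DB1 ⊆ DB2 as componentwise containment of objects, predicates, and edges is an assumption of this sketch.

```python
from typing import NamedTuple

class GraphDB(NamedTuple):
    """A graph database (objects, predicates, edges) over plain identifiers."""
    objects: frozenset
    predicates: frozenset
    edges: frozenset

def from_edges(edges):
    """Derive objects and predicates from a set of (s, p, o) edges."""
    edges = frozenset(edges)
    objects = frozenset(x for (s, _, o) in edges for x in (s, o))
    predicates = frozenset(p for (_, p, _) in edges)
    return GraphDB(objects, predicates, edges)

def contained_in(db1, db2):
    """DB1 ⊆ DB2, read componentwise (an assumption of this sketch)."""
    return (db1.objects <= db2.objects
            and db1.predicates <= db2.predicates
            and db1.edges <= db2.edges)

def equal(db1, db2):
    """Equality as mutual containment."""
    return contained_in(db1, db2) and contained_in(db2, db1)

db1 = from_edges({("Albert Einstein", "spouse", "Mileva Marić")})
db2 = from_edges({("Albert Einstein", "spouse", "Mileva Marić")})
db3 = from_edges({("A", "spouse", "B")})
```

Here db1 and db2 are equal, while db3 has the same shape as db1 but different object identities, so equal(db1, db3) is false even though renaming the objects would map one onto the other.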