
Outline. Section 2.1 introduces the Semantic Web idea and its core technologies (RDF, RDF-S, OWL, and SPARQL). Furthermore, we give a definition of knowledge graphs. Section 2.2 gives an overview of representation heterogeneity and presents different classification systems for it. A short introduction to heterogeneity issues in relational databases is given in Section 2.3. Section 2.4 describes heterogeneity issues in ontologies. In Section 2.5, we give an overview of heterogeneity issues in knowledge graphs and how they differ from heterogeneity issues in databases and ontologies. Then, we go into the details of entity, relation, literal, and class heterogeneity issues. Several matching approaches for the different heterogeneity issues in knowledge graphs are surveyed in Section 2.6. Finally, in Section 2.7, we briefly summarize state-of-the-art matching systems and describe some open problems.

2.1 RDF, SPARQL, and Knowledge Graphs

To overcome the shortcomings of existing Web technologies for achieving a Semantic Web, Berners-Lee and other Semantic Web researchers have introduced various new technologies to annotate Web sites with semantic knowledge in a machine-readable format using knowledge representation technologies. The Semantic Web idea involves various technologies that can be presented as the Semantic Web stack (cf. Figure 2.1).

The stack involves technologies for encoding and representing knowledge, querying and reasoning, and some layers that have not been implemented (cryptography and trust features). In this work, we mainly focus on the essential parts for knowledge representation and reasoning: RDF, RDF-S, OWL, and SPARQL.

RDF. The core technology, which also became a W3C standard for expressing knowledge on the Web, is the Resource Description Framework (RDF) [24]. At its center is the idea that knowledge is expressed in subject, predicate, object facts. These facts are often also called triples.

Example. As an example, a triple expressing the fact that Albert Einstein was born in the city Ulm may be expressed as follows:

(Albert Einstein, bornIn, Ulm)

In this example, the scientist Albert Einstein is the subject. The verb from the natural language sentence becomes the predicate bornIn, and the city of Ulm is the object of the triple.

Due to the triple format, facts may also be represented as a graph where subjects and objects are nodes. The predicate may be represented as an edge connecting the respective nodes as presented in Figure 2.2.

[Albert Einstein] --bornIn--> [Ulm]

Figure 2.2: A simple graph representation of the triple (Albert Einstein, bornIn, Ulm).
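The graph view of triples can be sketched in a few lines of Python. This is only an illustration under simple assumptions: triples are plain tuples of labels, and the helper name triples_to_graph is hypothetical, not part of any RDF library.

```python
from collections import defaultdict


def triples_to_graph(triples):
    """Build an adjacency map: subject node -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)


# The single triple from the running example.
triples = [("Albert Einstein", "bornIn", "Ulm")]

graph = triples_to_graph(triples)

# Subjects and objects are the nodes; predicates label the edges.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
```

Here, graph maps the node Albert Einstein to one outgoing edge labeled bornIn that points to the node Ulm.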

Figure 2.1: An illustration of the Semantic Web stack, taken from [107], describing the different technologies involved in the Semantic Web. The lower levels contain representation technologies, the middle levels contain technologies for reasoning, and the higher levels contain ideas that have not yet been realized.

Since RDF is a Web technology that should connect knowledge between various Web sites, subjects, predicates, and objects are unique identifiers, Internationalized Resource Identifiers (IRI), instead of natural language names. As an example, an IRI for Albert Einstein found in the RDF dataset Wikidata is www.wikidata.org/entity/Q937, and the IRI for the bornIn relation is www.wikidata.org/prop/P19.

An IRI may consist of a prefix, www.wikidata.org/entity/, and an identifier that is unique within the dataset, Q937. Since IRIs usually impair readability, we solely work with natural language labels in this work.
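As a small illustration of the prefix/identifier structure, the following sketch splits an IRI at its last slash. The function name split_iri is hypothetical, and the splitting rule is an assumption that fits the Wikidata example above; other datasets may separate the local identifier differently (for instance with a # character).

```python
def split_iri(iri):
    """Split an IRI into (prefix, local identifier) at the last '/'."""
    prefix, sep, local_id = iri.rpartition("/")
    return prefix + sep, local_id


prefix, local_id = split_iri("www.wikidata.org/entity/Q937")
```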

More formally, an RDF triple may be defined as follows: (s, p, o) ∈ E × R × (E ∪ L).

Subjects are taken from a set of resources E. They usually represent entities or concepts (often from the real world). Predicates stem from a set of relations R. The object is either a resource, like the subject, or a literal from the set L. In contrast to resources and relations, literals are not represented by IRIs but may be strings, numbers, or dates. As an example, a triple with a literal object may state Einstein’s date of birth:

(Albert Einstein, birthDate, ’14 March 1879’)

In logic, a relation is seen as a binary predicate: bornIn(AlbertEinstein, Ulm).

In this work, we also use the mathematical term (binary) relation over subjects and objects to describe triples. Note that we focus on RDF without blank nodes and reification.
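The formal definition can be made concrete with a small membership check. The following is a minimal sketch, not a reference implementation: the sets E and R contain only the labels from the running example, and the Literal wrapper class is an assumption introduced here to keep literals distinguishable from resources that happen to be represented by strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper marking a value (string, number, date) as a literal."""
    value: object


# Toy sets of resources E and relations R from the running example.
E = frozenset({"Albert Einstein", "Ulm"})
R = frozenset({"bornIn", "birthDate"})


def is_valid_triple(triple):
    """Check (s, p, o) in E x R x (E u L): literals may appear only as objects."""
    s, p, o = triple
    return s in E and p in R and (o in E or isinstance(o, Literal))


facts = {
    ("Albert Einstein", "bornIn", "Ulm"),
    ("Albert Einstein", "birthDate", Literal("14 March 1879")),
}

all_valid = all(is_valid_triple(t) for t in facts)
```

A triple with a literal in subject position, or with a resource in predicate position, is rejected by this check.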


RDF-Schema. Schematic information in RDF may be expressed by the Resource Description Framework Schema (RDF-S) [15]. RDF-S provides a fixed vocabulary for expressing schematic information about RDF data by annotating resources and relations. Hence, it is possible to describe groups of resources or individual resources. Probably the most widely used RDF-S feature is rdfs:label, which assigns natural language labels to resources or relations.

(www.wikidata.org/entity/Q937, rdfs:label, ’Albert Einstein’)

Another critical feature of RDF-S is the construction of groups of resources, so-called classes, together with the idea of building class hierarchies for expressing complex knowledge taxonomies. Here, the vocabulary terms rdf:type (defined in the RDF vocabulary) and rdfs:subClassOf are used:

(Albert Einstein, rdf:type, Scientist)
(Scientist, rdfs:subClassOf, Person)
(Scientist, rdf:type, rdfs:Class)
(Person, rdf:type, rdfs:Class)

The first triple expresses that Einstein is of type Scientist. Scientist describes a group of entities, also called a class, as do the last two triples for Scientist and Person. Furthermore, the second triple expresses that the class Scientist is a subclass of the class Person, which builds a class hierarchy. Class hierarchies are frequently used in Semantic Web datasets to formalize conceptual knowledge.
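To illustrate what a class hierarchy makes possible, the sketch below derives all classes of an entity by following subclass links transitively, in the spirit of the RDFS entailment rule for subclasses. It uses the standard vocabulary terms rdf:type and rdfs:subClassOf as plain strings; the function names are hypothetical, and the code is a toy illustration rather than an RDFS reasoner.

```python
def superclasses(cls, triples):
    """All classes reachable from cls via rdfs:subClassOf (transitively)."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(entity, triples):
    """Direct types of an entity plus all superclasses of those types."""
    direct = {o for s, p, o in triples if s == entity and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred


triples = [
    ("Albert Einstein", "rdf:type", "Scientist"),
    ("Scientist", "rdfs:subClassOf", "Person"),
    ("Scientist", "rdf:type", "rdfs:Class"),
    ("Person", "rdf:type", "rdfs:Class"),
]

einstein_types = types_of("Albert Einstein", triples)
```

With the four example triples, Einstein is found to be a Scientist directly and a Person through the subclass link.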

OWL. Knowledge is often represented formally in an ontology. An ontology is a formal way to describe knowledge as concepts, categories, properties, and relations. To go from basic schema information to the more complex idea of ontologies, the Web Ontology Language (OWL) was introduced [70]. Like RDF-S, OWL provides a vocabulary for annotating RDF with schema information, but it allows much more complex semantic expressions. As an example, cardinalities and restrictions for relations or literals may be expressed. Also, complex relations between classes, such as intersections, unions, and complements, are possible. Furthermore, OWL offers a wide range of possibilities for reasoning. OWL reasoning is used to infer new knowledge from existing knowledge using logical entailment rules. However, in this work, we do not go into further details of OWL reasoning capabilities.

An essential feature of OWL for this work is expressing identity between resources, relations, and classes. Here, the OWL vocabulary offers the three relations owl:sameAs, owl:equivalentProperty, and owl:equivalentClass. The property owl:equivalentProperty is used to express that two relations have the same extension. That means the relations are used for precisely the same resources. Similarly, the property owl:equivalentClass is used for classes. However, neither of these two properties expresses that the relations or classes themselves are identical. The property for expressing identity between individuals is owl:sameAs. It may also express the equality between relations and between classes when they have the same real-world semantics.
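The notion of two relations having the same extension can be illustrated on a fixed set of triples. The sketch below is only an approximation of the idea: OWL equivalence is a logical statement, whereas this code merely compares the pairs observed in one dataset. The relation name birthPlace and both function names are hypothetical.

```python
def extension(relation, triples):
    """The set of (subject, object) pairs connected by the given relation."""
    return {(s, o) for s, p, o in triples if p == relation}


def same_extension(r1, r2, triples):
    """True if both relations connect exactly the same pairs in this dataset."""
    return extension(r1, triples) == extension(r2, triples)


triples = [
    ("Albert Einstein", "bornIn", "Ulm"),
    ("Albert Einstein", "birthPlace", "Ulm"),  # hypothetical second relation
    ("Albert Einstein", "birthDate", "14 March 1879"),
]

born_in_vs_birth_place = same_extension("bornIn", "birthPlace", triples)
born_in_vs_birth_date = same_extension("bornIn", "birthDate", triples)
```

In this toy dataset, bornIn and birthPlace have coinciding extensions, while bornIn and birthDate do not.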

SPARQL. The query language for RDF data is called SPARQL [92]. In this work, we restrict ourselves to the basic querying mechanism of SPARQL, basic graph patterns (BGP). As in the relational database query language SQL, each query in SPARQL consists of a SELECT and a WHERE clause. The SELECT clause defines the projection variables. The selection criteria for the query are defined in the WHERE clause as a set of triple patterns that may include variables, the BGP.

Example. As an example, we show a short SPARQL query asking for the birthplaces of all scientists born in Germany.

SELECT ?birthplace WHERE {
  ?person <bornIn> ?birthplace .
  ?birthplace <country> <Germany> .
  ?person <occupation> <Scientist> .
}

The BGP in the WHERE clause consists of three triple patterns containing variables, indicated by a leading question mark, and entities/relations, indicated by angle brackets. The first triple pattern asks for a ?person born in some ?birthplace. This ?birthplace should be in the country Germany. The ?person should have the occupation Scientist.

Note that the names of these variables do not carry any semantics. The BGP is matched against the knowledge graph by matching each triple pattern to a triple in the knowledge graph such that variables with the same name are mapped to the same entities.
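The matching procedure described above can be sketched as a naive nested-loop evaluation in Python. This is a minimal illustration under stated assumptions, not how SPARQL engines are implemented: terms are plain strings, variables are recognized by a leading question mark, the toy graph contains one assumed triple stating that Ulm lies in Germany, and the function names are hypothetical.

```python
def is_var(term):
    """Variables are marked by a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Try to extend a variable binding so that the pattern matches the triple."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must map to the same term everywhere it occurs.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate_bgp(patterns, kg):
    """Return all bindings that match every triple pattern of the BGP."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in kg:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Toy knowledge graph; the country triple is an assumption for illustration.
kg = [
    ("Albert Einstein", "bornIn", "Ulm"),
    ("Ulm", "country", "Germany"),
    ("Albert Einstein", "occupation", "Scientist"),
]

bgp = [
    ("?person", "bornIn", "?birthplace"),
    ("?birthplace", "country", "Germany"),
    ("?person", "occupation", "Scientist"),
]

# The SELECT clause projects each solution onto ?birthplace.
birthplaces = [b["?birthplace"] for b in evaluate_bgp(bgp, kg)]
```

Renaming ?person and ?birthplace consistently throughout the BGP leaves the result unchanged, which reflects that variable names carry no meaning of their own.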

2.1.1 Knowledge Graphs

The term knowledge graph has been shaped by the idea of the Google Knowledge Graph, first mentioned in a blog article by Google in 2012 [100]. However, a clear definition is still missing. The term knowledge graph is often used interchangeably with the term ontology since both often work with RDF and both use classes and class hierarchies. Ehrlinger and Wöß have reviewed several definitions of knowledge graphs to come up with a unifying definition [28]:

A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.

Generally speaking, they say that knowledge graphs are similar to classical ontologies. In particular, the idea of Linked Open Data is closely connected to the presented definition of knowledge graphs. Hence, several solutions from the field of ontology matching can be directly carried over to knowledge graphs.

An essential property of knowledge graphs that is not necessarily inherent in ontologies is that they are often very large and usually integrate knowledge from various data sources. Thus, the techniques for resolving heterogeneities need to be extremely scalable and able to deal with hundreds of millions of entities. Formally, a knowledge graph is a finite set of triples KG ⊆ E × R × (E ∪ L).