4 COMET: A Context-Aware Matching Technique

To provide a solution to the problem of contextually matching RDF entities, COMET – a context-aware RDF molecule matching technique – is proposed.

This technique builds on the semantic data integration techniques proposed by Collarana et al. [84], whose work matches and merges semantically similar RDF molecules using semantic similarity metrics and fusion policies. Our work reuses the concept of RDF molecules but contributes a new approach that takes the context of the system into consideration while matching entities. COMET is an entity matching framework designed to create, identify, and match contextually equivalent RDF entities. Grounded on the entity matching component of the data integration technique of Collarana et al. [84], it merges equivalent RDF entities based on context, thus providing a solution to the problem of contextually matching entities (Fig. 5).

4.1 Problem Definition

RDF Molecule [84] – If Φ(G) is a given RDF graph, we define an RDF molecule M as a subgraph of Φ(G) such that

$$M = \{t_1, \dots, t_n\}, \qquad \forall\, i, j \in \{1, \dots, n\}:\ \mathit{subject}(t_i) = \mathit{subject}(t_j)$$

where t_1, t_2, ..., t_n denote the triples in M. In other words, an RDF molecule M consists of triples that have the same subject. It can therefore be represented by a tuple M = (R, T), where R denotes the URI of the molecule's subject, and T denotes a set of property-value pairs p = (prop, val) such that the triple (R, prop, val) belongs to M. For example, the RDF molecule for Arnold Schwarzenegger is (dbr:Arnold-Schwarzenegger, {(dbo:occupation, Politician), (dbp:title, Governor)}).

Fig. 5. Problem Definition. (a) Φ(G), Φ(D) and context C; (b) homomorphism θ_C; (c) idealized F_c. The left side shows two RDF graphs and the system context. The right side shows the application of the homomorphism θ_C to the RDF graphs, resulting in the contextualized RDF graph F_c.

An RDF graph Φ(G) described in terms of RDF molecules is defined as follows:

$$\Phi(G) = \{\, M = (R, T) \mid t = (R, \mathit{prop}, \mathit{val}) \in G \wedge (\mathit{prop}, \mathit{val}) \in T \,\}$$
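As a minimal illustration of this definition, the following Python sketch groups triples by subject to obtain the molecule view of a graph. The function name `to_molecules`, the tuple-based triple encoding, and the identifier `dbr:Donald-Trump` are assumptions made for this example; they are not part of COMET.

```python
from collections import defaultdict

def to_molecules(triples):
    """Group (subject, prop, val) triples into molecules M = (R, T), keyed by R."""
    groups = defaultdict(set)
    for subject, prop, val in triples:
        groups[subject].add((prop, val))
    # Freeze each T so that molecules can be compared and stored in sets.
    return {r: frozenset(t) for r, t in groups.items()}

# Hypothetical triples of a graph G; the first two share a subject.
G = [
    ("dbr:Arnold-Schwarzenegger", "dbo:occupation", "Politician"),
    ("dbr:Arnold-Schwarzenegger", "dbp:title", "Governor"),
    ("dbr:Donald-Trump", "dbo:occupation", "Politician"),
]

phi_G = to_molecules(G)
print(len(phi_G))  # 2
```

Keying the result by subject URI mirrors the tuple M = (R, T): the dictionary key plays the role of R and the frozen set plays the role of T.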

Context – We define a context C as any Boolean expression that represents the criteria of a system. Two entities, such as RDF molecules M_1 and M_2, can be either similar or not similar with respect to a given context. That is, C is a Boolean function that takes as input two molecules M_1 and M_2 and returns true if they are similar according to the system context, and false otherwise. Below is an example of a context C, modeled after the example presented in Fig. 1, where two molecules are similar if they have the same occupation. If P = (p, v) is the property-value pair representing the occupation of a molecule, then the context is

$$C(M_1, M_2) = \begin{cases} \text{true}, & \text{if } P \in M_1 \wedge P \in M_2, \\ \text{false}, & \text{otherwise.} \end{cases}$$

Depending on the requirements of the integration scenario, this context can be any Boolean expression.
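The example context can be sketched as a Boolean function over molecules. This is only an illustration under stated assumptions: molecules are encoded as (R, T) tuples, `same_pair_context` is a hypothetical helper name, and the identifiers `dbr:Donald-Trump` and `ex:Arnold-Actor` are invented for the example.

```python
def same_pair_context(pair):
    """Build a context C that holds when both molecules contain the pair P."""
    def context(m1, m2):
        # m[1] is the set T of (property, value) pairs of molecule m.
        return pair in m1[1] and pair in m2[1]
    return context

arnold = ("dbr:Arnold-Schwarzenegger",
          frozenset({("dbo:occupation", "Politician"), ("dbp:title", "Governor")}))
donald = ("dbr:Donald-Trump", frozenset({("dbo:occupation", "Politician")}))
actor = ("ex:Arnold-Actor", frozenset({("dbo:occupation", "Actor")}))

C = same_pair_context(("dbo:occupation", "Politician"))
print(C(arnold, donald))  # True
print(C(arnold, actor))   # False
```

Because a context is just a callable returning a Boolean, more complex conditions can be composed from simpler ones with `and`, `or`, and `not`.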

Semantic Similarity Function – Let M_1 and M_2 be any two RDF molecules. Then the semantic similarity function Sim_f is a function that measures the semantic similarity between these two molecules and returns a value in the interval [0, 1]. A value of 0 expresses that the two molecules are completely dissimilar, and a value of 1 expresses that they are identical. Such a similarity function is defined in GADES [371].

Contextually Equivalent RDF Molecule – Let Φ(G) and Φ(D) be two sets of RDF molecules. Let M_G and M_D be two RDF molecules from Φ(G) and Φ(D), respectively. Then M_G and M_D are defined as contextually equivalent iff

1. They are in the same context, that is, C(M_G, M_D) = true.

2. They have the highest similarity value, i.e.,

$$\mathit{Sim}_f(M_G, M_D) = \max_{m \in \Phi(D)} \mathit{Sim}_f(M_G, m)$$

Let F_c be an idealized set of contextually integrated RDF molecules from Φ(G) and Φ(D). Let θ_C be a homomorphism such that θ_C: Φ(G) ∪ Φ(D) → F_c. Then there is an RDF molecule M_F from F_c such that θ_C(M_D) = θ_C(M_G) = M_F. In the motivating example, this means that the molecule of Arnold Schwarzenegger, the politician, is contextually equivalent to the molecule of Donald Trump, as they are similar and they satisfy the context condition of having the same occupation.
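The two conditions for contextual equivalence can be checked directly, as in the sketch below. It assumes molecules encoded as (R, T) tuples and uses a toy Jaccard similarity over property-value pairs purely as a stand-in for Sim_f; it is not GADES. All function names and the `g:`/`d:` identifiers are hypothetical.

```python
def jaccard(m1, m2):
    """Toy similarity in [0, 1] over property-value pairs (stand-in, NOT GADES)."""
    union = m1[1] | m2[1]
    return len(m1[1] & m2[1]) / len(union) if union else 1.0

def same_occupation(m1, m2):
    """Context C: the molecules share at least one dbo:occupation value."""
    occ1 = {v for p, v in m1[1] if p == "dbo:occupation"}
    occ2 = {v for p, v in m2[1] if p == "dbo:occupation"}
    return bool(occ1 & occ2)

def contextually_equivalent(m_g, m_d, phi_d, context, sim):
    """Condition 1: same context. Condition 2: m_d maximizes sim(m_g, .) over phi_d."""
    if not context(m_g, m_d):
        return False
    return sim(m_g, m_d) == max(sim(m_g, m) for m in phi_d)

m_g = ("g:Arnold", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "Governor")}))
d_donald = ("d:Donald", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "President")}))
d_arnold = ("d:Arnold", frozenset({("dbo:occupation", "Actor"), ("dbp:title", "Mr. Universe")}))
phi_d = [d_donald, d_arnold]

print(contextually_equivalent(m_g, d_donald, phi_d, same_occupation, jaccard))  # True
print(contextually_equivalent(m_g, d_arnold, phi_d, same_occupation, jaccard))  # False
```

As in the definition, the maximum is taken over all molecules of Φ(D), not only over those that satisfy the context.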

In this work, we tackle the problem of explicitly modeling the context and then matching RDF molecules from RDF graphs that are both highly similar and equivalent in terms of this context. This problem is defined as follows: given RDF graphs Φ(G) and Φ(D), let M_G and M_D be two RDF molecules such that M_G ∈ Φ(G) and M_D ∈ Φ(D). The system is supplied with a context parameter C, which is a Boolean function evaluating whether two molecules are in the same context. It is also supplied with a similarity function Sim_f, which evaluates the semantic similarity between M_G and M_D.

The problem of creating a contextualized graph F_c consists of building a homomorphism θ_C: Φ(G) ∪ Φ(D) → F_c such that no two RDF molecules belonging to F_c are contextually equivalent according to the system context C. If M_G and M_D are contextually equivalent molecules, then θ_C(M_G) = θ_C(M_D); otherwise θ_C(M_G) ≠ θ_C(M_D).

An example of this problem is illustrated in Figure X, which depicts a use case with two RDF graphs and a single context condition C. With respect to C, the RDF molecule Arnold.S from Φ(G) is in the same context as Donald.T from Φ(D), but not in the same context as the molecule Arnold.S from Φ(G). So the problem is to identify a homomorphism θ_C that evaluates the RDF molecules based on the system context and maps these RDF molecules in a way that they can be integrated into a contextualized graph.

4.2 The COMET Architecture

We propose COMET, an approach to match contextually equivalent RDF graphs according to a given context, thus providing a solution to the problem of contextually matching RDF graphs. Figure 6 depicts the main components of the COMET architecture. COMET follows a two-fold approach to solve the problem of entity matching in RDF graphs in a context-aware manner. First, COMET computes the similarity measures across RDF entities and resorts to the Formal Concept Analysis algorithm to map contextually equivalent RDF entities. Second, COMET combines the results of the first step and executes a 1-1 perfect matching algorithm for matching RDF entities based on the combined scores, synthesizing the matching into a contextualized RDF graph.

Fig. 6. The COMET Architecture. COMET receives two RDF datasets, e.g., G and D; a similarity function Sim_f; and a context C. The output is a set of contextually matching RDF entities. (Labeled components: Dataset Partitioner with threshold γ, 1-1 Perfect Matching Calculator; labeled stages: Scoring Contextually Equivalent Entities, 1-1 Context-Aware Matching.)

4.3 Identifying Contextually Equivalent Entities

Building a Bipartite Graph. The COMET pipeline receives two RDF graphs Φ(G) and Φ(D) as input, along with a context parameter C and a similarity function Sim_f. COMET first constructs a bipartite graph between the sets Φ(G) and Φ(D). The Dataset Partitioner employs the similarity function Sim_f and an ontology O to compute the similarity between RDF molecules in Φ(G) and Φ(D), assigning the similarity scores as edge weights in the bipartite graph. COMET allows for arbitrary, user-supplied similarity functions that leverage different algorithms to estimate the similarity between RDF molecules. Thus, COMET supports a variety of similarity functions, including simple string similarity. However, as shown in [84], semantic similarity measures are advocated (the implementation of this work uses GADES [371]) because they achieve better results by considering the semantics encoded in RDF graphs.

After the similarity of the RDF molecules has been computed, the result of the similarity function is tested against a threshold γ, the minimum acceptable similarity score, to determine entity similarity. Edges whose weights are lower than γ are discarded from the bipartite graph. A threshold of 0.0 does not impose any restriction on the similarity values; thus, the bipartite graph includes all the edges. High thresholds, e.g., 0.8, restrict the similarity values, resulting in a bipartite graph comprising just a few edges.
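The thresholding step can be sketched as follows. The function name `build_bipartite_edges`, the index-based edge encoding, and the lookup-table similarity are assumptions made for this illustration; any user-supplied Sim_f could be passed in its place.

```python
def build_bipartite_edges(phi_g, phi_d, sim, gamma):
    """Return {(i, j): weight} for pairs whose similarity is at least gamma."""
    edges = {}
    for i, m_g in enumerate(phi_g):
        for j, m_d in enumerate(phi_d):
            weight = sim(m_g, m_d)
            if weight >= gamma:  # edges with weight below gamma are discarded
                edges[(i, j)] = weight
    return edges

# Hypothetical molecules, identified by name only, with made-up similarity scores.
phi_g = ["g1", "g2"]
phi_d = ["d1", "d2"]
scores = {("g1", "d1"): 0.9, ("g1", "d2"): 0.4,
          ("g2", "d1"): 0.1, ("g2", "d2"): 0.7}

def sim(m_g, m_d):
    return scores[(m_g, m_d)]

print(sorted(build_bipartite_edges(phi_g, phi_d, sim, 0.5)))    # [(0, 0), (1, 1)]
print(len(build_bipartite_edges(phi_g, phi_d, sim, 0.0)))       # 4
```

With γ = 0.0 every pair survives, which matches the observation that a zero threshold imposes no restriction.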

Fig. 7. Context Validation. (a) Bipartite graph after applying threshold γ = 0.5; (b) context validation using FCA; (c) perfect 1-1 matches. The left side shows a bipartite graph after the application of the threshold. The remaining edges go through a special 1-1 matching algorithm that takes into account the system context using FCA. The result is a perfect match between contextually equivalent molecules.

Pruning RDF Entities According to Context. The main step in the COMET pipeline is to validate pairs of RDF molecules and prune those that do not comply with the input context C, making COMET a context-aware approach. For identifying contextually equivalent RDF entities, the Context Validator component employs the Formal Concept Analysis (FCA) algorithm. FCA is the study of binary data tables that describe the relationship between objects and their attributes. Applying this context validation step over the RDF molecules ensures that only contextually relevant tuples are kept. In COMET, context is modeled as any Boolean function. Two molecules are matched if they satisfy this condition; otherwise, they are not matched. The algorithm by V. Vychodil [451] is applied in COMET; it performs formal concept analysis to compute formal concepts within a set of objects and their attributes. This algorithm is extended in our approach to validate complex Boolean conditions. A typical formal concept analysis table is shown in Table 1.

Table 1. Object-attribute table for performing FCA.

Attribute 1 Attribute 2 Attribute 3

Object 1 X X

Object 2 X

Object 3 X X

Instead of using attributes in the columns of the FCA matrix, our approach replaces the attributes with a Boolean condition C, which is the same context condition C used throughout our approach. For example, the context C from the motivating example can be broken down into C = C1 ∧ C2, where C1 = "contains property dbo:occupation" and C2 = "has the same value for property dbo:occupation". The execution of the FCA algorithm remains unchanged by this adaptation, since the input to FCA is still a binary matrix.

When applied to RDF molecules, formal concept analysis returns a set of formal concepts ⟨M, C⟩, where M is the set of all molecules that satisfy all conditions contained in C. That is, by applying FCA, the set of molecules that satisfy a certain context condition can be obtained. Thus, the molecules that do not meet the context condition are pruned. Fig. 7 demonstrates an example of context validation. Edges in a bipartite graph are filtered according to a threshold value γ, as detailed in the previous section. Next, the remaining edges are validated by constructing an FCA matrix according to the context condition C. The FCA algorithm returns the edges satisfying the context conditions. The edges that do not satisfy the context condition are discarded.
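The sketch below illustrates only the part of this step that matters for pruning: building the binary matrix with conditions as columns, and computing the extent of a set of conditions, i.e., the rows that satisfy all of them. It is not the algorithm from [451], which enumerates all formal concepts. Rows are taken to be candidate molecule pairs (edges); this row encoding, the function names, and the `g:`/`d:` identifiers are assumptions of the example.

```python
def occupations(m):
    """Values of dbo:occupation in molecule m = (R, T)."""
    return {val for prop, val in m[1] if prop == "dbo:occupation"}

def c1(m_g, m_d):
    """C1: both molecules contain the property dbo:occupation."""
    return bool(occupations(m_g)) and bool(occupations(m_d))

def c2(m_g, m_d):
    """C2: the molecules share a value for dbo:occupation."""
    return bool(occupations(m_g) & occupations(m_d))

def build_matrix(pairs, conditions):
    """Binary matrix: one row per candidate pair, one column per condition."""
    return [[cond(m_g, m_d) for cond in conditions] for m_g, m_d in pairs]

def extent(matrix, attributes):
    """Indices of the rows that satisfy every condition in `attributes`."""
    return {row for row, bits in enumerate(matrix) if all(bits[a] for a in attributes)}

pol = ("g:Arnold", frozenset({("dbo:occupation", "Politician")}))
donald = ("d:Donald", frozenset({("dbo:occupation", "Politician")}))
actor = ("d:Arnold", frozenset({("dbo:occupation", "Actor")}))
untyped = ("d:Unknown", frozenset({("dbp:title", "Governor")}))

pairs = [(pol, donald), (pol, actor), (pol, untyped)]
matrix = build_matrix(pairs, [c1, c2])
print(sorted(extent(matrix, [0, 1])))  # [0]
print(sorted(extent(matrix, [0])))     # [0, 1]
```

Only the first pair satisfies C = C1 ∧ C2, so the other two edges would be discarded.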

4.4 The 1-1 Perfect Matching Calculator

COMET solves the problem of context-aware entity matching by computing a 1-1 weighted perfect matching between the sets of RDF molecules. The input of the 1-1 weighted perfect matching component is the weighted bipartite graph created in the previous step. Since the weight of each edge between two RDF molecules corresponds to a combined score of semantic similarity and context equivalence, we call this component a 1-1 context-aware matching calculator. The effect of this 1-1 context-aware matching calculator is demonstrated in Fig. 9. Finally, a combinatorial optimization algorithm such as the Hungarian algorithm [267] is utilized to compute the matching.
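To make the objective concrete, the sketch below finds a 1-1 perfect matching of maximum total weight by exhaustive search over permutations. This is deliberately not the Hungarian algorithm: it is a brute-force stand-in that is only feasible for very small inputs, and it assumes a square weight matrix in which pruned edges are marked with `None`. The function name and the example weights are hypothetical.

```python
from itertools import permutations

def best_one_to_one(weights):
    """Return (matching, total) maximizing total weight, or (None, None).

    weights[i][j] is the edge weight between molecule i of G and molecule j
    of D, or None if that edge was pruned.
    """
    n = len(weights)
    best_match, best_total = None, None
    for perm in permutations(range(n)):
        # Skip assignments that would use a pruned edge.
        if any(weights[i][perm[i]] is None for i in range(n)):
            continue
        total = sum(weights[i][perm[i]] for i in range(n))
        if best_total is None or total > best_total:
            best_match, best_total = list(enumerate(perm)), total
    return best_match, best_total

weights = [
    [0.9, 0.6, None],
    [0.8, 0.7, 0.2],
    [None, 0.5, 0.4],
]
matching, total = best_one_to_one(weights)
print(matching)  # [(0, 0), (1, 1), (2, 2)]
```

For realistic graph sizes, a polynomial-time assignment solver such as the Hungarian algorithm would replace the permutation loop while keeping the same input and output.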

4.5 Integration Use Case: Applying Fusion Policies

To apply this context-aware entity matching pipeline in a data integration scenario, we envision the use of the fusion policies defined by Collarana et al. [84]. To consolidate entities identified as contextually equivalent, COMET can make use of synthesis policies, i.e., user-supplied functions that define how the RDF molecules should be combined to form a connected whole. COMET can adopt the following synthesis policies:

1. The Union Policy, which includes all predicate-object pairs, removing those that are syntactically identical;

2. The Linking Policy, which produces owl:sameAs links between contextually equivalent RDF molecules;

3. The Authoritative Policy, which allows for defining one RDF graph as a prevalent source whose properties are selected in case of property conflicts, i.e., properties annotated as owl:FunctionalProperty, equivalent properties annotated with owl:equivalentProperty, and equivalent classes annotated with owl:sameAs or owl:equivalentClass.
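The first two policies can be sketched in a few lines. This is an illustration rather than the implementation from [84]: molecules are assumed to be (R, T) tuples, the merged molecule is arbitrarily given the subject of the first argument, and the function names and `g:`/`d:` identifiers are hypothetical.

```python
def union_policy(m1, m2):
    """Keep all predicate-object pairs; the set union drops identical pairs."""
    # Assumption: the merged molecule keeps the subject URI of m1.
    return (m1[0], m1[1] | m2[1])

def linking_policy(m1, m2):
    """Link two contextually equivalent molecules with an owl:sameAs triple."""
    return (m1[0], "owl:sameAs", m2[0])

arnold_g = ("g:Arnold", frozenset({("dbo:occupation", "Politician"),
                                   ("dbp:title", "Governor")}))
arnold_d = ("d:Arnold", frozenset({("dbo:occupation", "Politician"),
                                   ("dbo:birthPlace", "Austria")}))

merged = union_policy(arnold_g, arnold_d)
print(len(merged[1]))                      # 3
print(linking_policy(arnold_g, arnold_d))  # ('g:Arnold', 'owl:sameAs', 'd:Arnold')
```

The Authoritative Policy is omitted here because it additionally depends on ontology annotations to detect property conflicts.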

Algorithm 1: closure(B, y)

    for j ← 0 to n do
        D[j] ← 1;
    foreach i in rows[y] do
        match ← True;

Fig. 8. Implemented algorithms (extended from [451]).

By applying these policies, the end output is a synthesized graph with linked entities that are contextually equivalent. In the next chapter, we take a look at another use case of context-aware entity matching: the temporal summarization of knowledge graph entities.
