In the previous section, we showed that the end-to-end precision of instance matching results for more than two knowledge graphs is significantly affected by the quality of transitively added identity links. Therefore, the quality of instance matching systems in a realistic matching environment might be significantly worse than in a simple scenario with exactly two knowledge graphs.

The quality problem of transitive links can be overcome by identifying incorrect owl:sameAs relations that connect otherwise correct subcomponents of an identity graph. In the example presented in Figure 3.2, the original end-to-end precision was only around 42.9%. Removing incorrect links prevents a large proportion of the incorrect transitive identity links and thereby boosts the overall quality of the resulting owl:sameAs links significantly. Interestingly, removing even a correct identity link may improve the end-to-end precision if it prevents a large number of incorrect transitive links.

Example. In the example from Figure 3.2, removing the correct owl:sameAs link between the instances D and E would boost the overall end-to-end join quality to 66.7%, since several incorrect transitive links are removed along with it.
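The two precision values can be retraced with a short sketch. Figure 3.2 is not reproduced in this section, so the edge list and the ground-truth assignment below are assumptions reconstructed from the textual description (two triangles bridged by D, with C-D incorrect and D-E correct); the function names are illustrative only.

```python
from itertools import combinations

# Assumed reconstruction of the Figure 3.2 identity graph.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
# Assumed ground truth: the real-world entity each instance denotes.
TRUTH = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2, "G": 2}


def components(edges, nodes):
    """Connected components via union-find; each one is an equivalence class."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def end_to_end_precision(edges, truth):
    """Share of correct pairs among all pairs implied by the transitive closure."""
    pairs = [pair for comp in components(edges, truth)
             for pair in combinations(sorted(comp), 2)]
    correct = sum(truth[u] == truth[v] for u, v in pairs)
    return correct / len(pairs) if pairs else 1.0


print(round(end_to_end_precision(EDGES, TRUTH), 3))  # 0.429
without_de = [e for e in EDGES if e != ("D", "E")]
print(round(end_to_end_precision(without_de, TRUTH), 3))  # 0.667
```

Under these assumptions, the full equivalence class implies 21 pairs of which 9 are correct, whereas removing D-E leaves 9 implied pairs of which 6 are correct.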

To overcome this problem, we propose four methods to identify incorrect owl:sameAs relations and thereby boost the end-to-end precision of instance matching systems.

As input, each of our methods uses the output of an existing instance matching system, that is, similarity values between entities.

In contrast to existing matching systems, our techniques make use of the transitivity of identity links and work on top of equivalence classes to break them up with structural graph measures:

1. The weakest links approach works on the output weights of an instance matching system to remove weak matching links.

2. The edge betweenness approach is a clustering-based approach that builds on standard community detection techniques, commonly used in social network analysis.

3. The clique approach uses the clique concept from graph theory to identify fully connected subgraphs in the identity graphs.

4. Markov clustering is another graph clustering approach, which identifies dense regions using random walks.

3.3.1 Weakest Links

The first approach is a simple baseline, which does not consider the structure of the identity graphs but only the similarity values from the instance matching systems. Since the computed similarity values reflect the confidence of an instance matching system that two entities denote the same real-world entity, the weakest link approach removes links with low similarity values. Concretely, for each equivalence class with at least three entities, the identity links with the lowest similarity values are removed, lowest first, until the class is split into two or more subcomponents.

Example. In Figure 3.2, the link between D and E is removed, so that the equivalence class is split into two subcomponents. In this case, the edge with the lowest similarity value was a correct one, whereas the incorrect link C-D has a slightly higher value and is not detected by the algorithm. The end-to-end precision still improves to 0.67 because several incorrect transitive links are removed as well.
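The procedure can be sketched as follows for a single equivalence class. The similarity values are hypothetical, since the text only states that D-E has the lowest value and C-D a slightly higher one, and the function names are illustrative.

```python
# Hypothetical similarity values for the assumed Figure 3.2 graph; only their
# order matters here (D-E weakest, C-D slightly stronger).
WEIGHTS = {("A", "B"): 0.95, ("A", "C"): 0.90, ("B", "C"): 0.92,
           ("C", "D"): 0.72, ("D", "E"): 0.70,
           ("E", "F"): 0.93, ("E", "G"): 0.91, ("F", "G"): 0.94}


def components(nodes, edges):
    """Connected components of an undirected graph given as node pairs."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def split_by_weakest_links(weights):
    """Drop the lowest-weighted links of one equivalence class until it splits."""
    nodes = {n for edge in weights for n in edge}
    kept, removed = dict(weights), []
    if len(nodes) < 3:  # classes with fewer than three entities stay untouched
        return kept, removed
    while kept and len(components(nodes, kept)) < 2:
        weakest = min(kept, key=kept.get)
        removed.append(weakest)
        del kept[weakest]
    return kept, removed


kept, removed = split_by_weakest_links(WEIGHTS)
print(removed)  # [('D', 'E')]
```

With these assumed weights, a single removal suffices, and the incorrect link C-D survives in the component (A,B,C,D).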

3.3.2 Edge Betweenness

The edge betweenness approach is a graph clustering approach that takes into account the structure of the identity graphs together with the similarity values output by the instance matching systems. The key idea is that a high number of identity links within a group of entities is a strong indication that these entities represent the same real-world entity. Hence, identifying groups within an equivalence class that are better intra-connected than inter-connected, and removing the links between these groups, should improve the end-to-end precision.

The edge betweenness idea itself stems from the field of community detection in social network analysis, as studied extensively by Girvan and Newman [38, 77].

We chose the Girvan-Newman algorithm, which finds highly connected subcomponents in a graph by identifying the links between these subcomponents. It is based on edge betweenness, a measure that indicates how likely a link is to lie between two highly connected communities.

The edge betweenness of each identity link is computed by counting the number of shortest paths in the identity graph that go through this link. If two highly connected subcomponents are interconnected by a single link, the shortest paths between the nodes of these two communities all have to go through this single inter-connecting link. Hence, its edge betweenness value is high. By repeatedly removing the link with the highest edge betweenness in an equivalence class and then recomputing the edge betweenness values, the graph is partitioned into communities. We apply the edge betweenness approach to equivalence classes with low quality and remove links until the class is split into two or more components.

Example. In our example, the shortest paths between every pair of nodes are computed. Since the graph consists of two highly connected components (A,B,C) and (E,F,G), the shortest paths connecting these subcomponents have to pass through D. Hence, the two adjacent edges C-D and D-E have the highest edge betweenness values. In this case, the edge D-E is detected by the algorithm and removed from the identity graph, partitioning the equivalence class into the two subcomponents (A,B,C,D) and (E,F,G), as in the weakest link approach, and resulting in the same end-to-end join quality of 0.67.
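A minimal sketch of this splitting step is given below. It computes unweighted edge betweenness with a Brandes-style breadth-first search. Two points are assumptions rather than details taken from the text: the similarity values are hypothetical, and they are used only to break ties between equally central links (the least similar link is removed), which is one possible way to combine structure and similarity.

```python
from collections import deque

# Hypothetical similarity values for the assumed Figure 3.2 graph.
WEIGHTS = {("A", "B"): 0.95, ("A", "C"): 0.90, ("B", "C"): 0.92,
           ("C", "D"): 0.72, ("D", "E"): 0.70,
           ("E", "F"): 0.93, ("E", "G"): 0.91, ("F", "G"): 0.94}


def components(adj):
    """Connected components of an adjacency-set graph."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Number of shortest paths through each edge (unweighted, Brandes-style)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist, sigma, order = {s: 0}, {s: 1.0}, []
        preds = {n: [] for n in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w] = dist[v] + 1, 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path shares onto edges.
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered node pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def girvan_newman_split(weights):
    """Remove highest-betweenness links until the class falls into two parts."""
    sim = {frozenset(edge): w for edge, w in weights.items()}
    adj = {}
    for u, v in weights:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    removed = []
    while len(components(adj)) < 2:
        scores = edge_betweenness(adj)
        if not scores:
            break
        # Assumed tie-break: among equally central links, drop the least similar.
        target = max(scores, key=lambda e: (round(scores[e], 9), -sim[e]))
        u, v = tuple(target)
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append(tuple(sorted(target)))
    return removed, components(adj)


removed, parts = girvan_newman_split(WEIGHTS)
print(removed)  # [('D', 'E')]
```

In the assumed graph, C-D and D-E each carry 12 shortest paths, so the structural measure alone cannot separate them; the tie-break on similarity is what selects D-E in this sketch.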

3.3.3 Clique

The third approach aims to achieve high-precision instance matching results by being restrictive, keeping only fully connected subcomponents of the identity graphs.

Similar to the previous approach, the idea is also based on graph clustering, in this case on complete-link clustering. The clique approach identifies subcomponents in which every pair of entities is connected by a matching link, i.e., has a pairwise similarity above the threshold Γ. Hence, all resulting subcomponents are fully connected identity graphs. In contrast to the previous approach, the clique approach removes significantly more edges, particularly for larger equivalence classes.

Example. In the example graph, two owl:sameAs links are removed, such that the graph is decomposed into three subcomponents: (A,B,C), (D), and (E,F,G). Overall, one incorrect link between C and D, but also one correct link between D and E, was removed. This restrictive approach leaves only correct links in the two remaining identity graph components, which results in an end-to-end join precision of 100%.
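One possible way to obtain such a decomposition is sketched below. The text does not specify how cliques are selected, so the greedy largest-clique-first rule, the Bron-Kerbosch enumeration, the threshold value 0.5, and the similarity values are all assumptions made for illustration.

```python
# Hypothetical similarity values for the assumed Figure 3.2 graph.
WEIGHTS = {("A", "B"): 0.95, ("A", "C"): 0.90, ("B", "C"): 0.92,
           ("C", "D"): 0.72, ("D", "E"): 0.70,
           ("E", "F"): 0.93, ("E", "G"): 0.91, ("F", "G"): 0.94}


def maximal_cliques(adj, r=frozenset(), p=None, x=frozenset()):
    """Bron-Kerbosch enumeration of all maximal cliques (without pivoting)."""
    p = frozenset(adj) if p is None else p
    if not p and not x:
        yield r
        return
    for v in sorted(p):
        yield from maximal_cliques(adj, r | {v}, p & adj[v], x & adj[v])
        p, x = p - {v}, x | {v}


def clique_partition(weights, gamma):
    """Greedily split one equivalence class into disjoint cliques."""
    nodes = {n for edge in weights for n in edge}
    adj = {n: set() for n in nodes}
    for (u, v), w in weights.items():
        if w > gamma:  # only links above the threshold count as matches
            adj[u].add(v)
            adj[v].add(u)
    parts = []
    while adj:
        # Assumed selection rule: largest remaining clique first; ties are
        # broken by node names so that the result is deterministic.
        best = max(maximal_cliques(adj), key=lambda c: (len(c), sorted(c)))
        parts.append(best)
        adj = {n: nbrs - best for n, nbrs in adj.items() if n not in best}
    return sorted(sorted(c) for c in parts)


print(clique_partition(WEIGHTS, gamma=0.5))
# [['A', 'B', 'C'], ['D'], ['E', 'F', 'G']]
```

Because D is linked to C and E but to no complete triangle, it ends up as a singleton once both triangles have been picked, which removes exactly the links C-D and D-E.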

3.3.4 Markov Clustering

An approach that has already shown excellent results in the work of Hassanzadeh et al. for duplicate detection is Markov Clustering [43]. Markov Clustering (MCL) was proposed as a random walk-based clustering algorithm for weighted graphs [111].

Similar to the edge betweenness approach, the algorithm detects dense regions (clusters) in graphs without specifying the number of clusters as an input parameter.

MCL is based on simulating random walks on the identity graph. The intuition behind using random walks is that they tend to end in the same dense region of the graph in which they started. When multiple random walks are started in the identity graph, edges that are frequently used by random walks should lie within the same dense region or community.

More formally, the random walks are described by two operations on a stochastic matrix. This matrix is created from a weighted adjacency matrix (built from the similarity values of the identity graph) by adding self-loops to each node and normalizing the entries such that each column sums to 1. The random walks are then simulated by alternating an expansion and an inflation step. Expansion corresponds to squaring the matrix, which simulates a random walk step. Thus, expansion increases the probabilities of intra-cluster edges. Inflation corresponds to taking the entry-wise power of the matrix and normalizing the resulting matrix to be stochastic again. Hence, high entries in the matrix are further boosted in the inflation step, whereas low values are damped. After alternating both steps for several iterations, the matrix converges, such that clusters are formed. The inflation power is an input parameter, which is used to influence the granularity of the clusters. The higher the inflation, the larger the number of resulting clusters. In our experiments, we chose an inflation parameter of 8.0 since it has shown promising results.
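The alternation of expansion and inflation can be sketched in plain Python as follows. This is a simplified illustration rather than a reference implementation of MCL: the self-loop weight of 1.0, the convergence tolerance, the iteration cap, the way clusters are read off the final matrix, and the similarity values are assumptions.

```python
# Hypothetical similarity values for the assumed Figure 3.2 graph.
WEIGHTS = {("A", "B"): 0.95, ("A", "C"): 0.90, ("B", "C"): 0.92,
           ("C", "D"): 0.72, ("D", "E"): 0.70,
           ("E", "F"): 0.93, ("E", "G"): 0.91, ("F", "G"): 0.94}


def normalize(m):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    size = len(m)
    sums = [sum(m[i][j] for i in range(size)) for j in range(size)]
    return [[m[i][j] / sums[j] for j in range(size)] for i in range(size)]


def expand(m):
    """Expansion step: the matrix squared."""
    size = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mcl(weights, inflation=8.0, max_iter=100, tol=1e-9):
    """Cluster one identity graph by alternating expansion and inflation."""
    nodes = sorted({n for edge in weights for n in edge})
    idx = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    m = [[0.0] * size for _ in range(size)]
    for (u, v), w in weights.items():
        m[idx[u]][idx[v]] = m[idx[v]][idx[u]] = w
    for i in range(size):
        m[i][i] = 1.0  # self-loop; the weight 1.0 is an assumption
    m = normalize(m)
    for _ in range(max_iter):
        # Inflation: entry-wise power, followed by column re-normalization.
        nxt = normalize([[x ** inflation for x in row] for row in expand(m)])
        change = max(abs(a - b) for r1, r2 in zip(nxt, m) for a, b in zip(r1, r2))
        m = nxt
        if change < tol:
            break
    # Read clusters off the final matrix: node j is grouped with every row i
    # that still holds probability mass in column j.
    parent = list(range(size))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(size):
        for j in range(size):
            if m[i][j] > 1e-6:
                parent[find(i)] = find(j)
    clusters = {}
    for i, n in enumerate(nodes):
        clusters.setdefault(find(i), []).append(n)
    return sorted(clusters.values())


print(mcl(WEIGHTS, inflation=8.0))
```

Lower inflation values can be passed to the same function to observe coarser clusterings; the exact clusters depend on the similarity values, which are hypothetical here.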