4.2.3.3 Finding Synonyms in DBpedia with Manual Evaluation

As a last experiment, we want to show that our method also identifies existing synonyms in a large-scale and heterogeneous knowledge graph. Therefore, we evaluate our method with all embedding models and the baseline on a sample of DBpedia 2016-10. Due to its size, we again take a random sample following the procedure described before, resulting in a dataset with 12,664,192 triples and 15,654 distinct relations.
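The exact sampling procedure is described earlier in the section and is not repeated here; as a rough, hypothetical sketch, a uniform sample of triples can be drawn from an N-Triples dump in a single pass with reservoir sampling (the file name and function below are illustrative, not part of the original setup):

```python
import random

def sample_triples(path: str, sample_size: int, seed: int = 42) -> list[str]:
    """Reservoir sampling: uniform sample in one pass, without loading the dump."""
    random.seed(seed)
    reservoir: list[str] = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < sample_size:
                reservoir.append(line)
            else:
                j = random.randint(0, i)  # each line kept with prob. sample_size/(i+1)
                if j < sample_size:
                    reservoir[j] = line
    return reservoir

# Hypothetical usage; the dump file name is assumed.
sample = sample_triples("dbpedia-2016-10.nt", 12_664_192)
```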

For the manual evaluation on DBpedia, the annotators were asked to classify relation pairs as synonyms or non-synonyms. To measure the task's difficulty, we first measured the inter-annotator agreement on a small sample of our dataset. We achieved an agreement of over 0.90 for two independent raters, implying that they came to similar results. Due to this experiment and the dataset's size, we decided on only a single annotator for the manual evaluation. This manually built dataset stems from the top 500 results for each embedding model and the baseline, summing up to around 3,600 relation pairs, of which 1,100 have been classified as correct. The dataset is freely available online.³ With this dataset, we are able to obtain Precision@k values up to k = 500.

³ https://figshare.com/s/11d4af3169a0e6d2437b

The Precision@k results of our manual classification are presented in Figure 4.5.
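For reference, Precision@k over such a manually annotated ranking is simply the fraction of the top-k candidates labeled as synonymous. A minimal sketch (the labels below are toy data, not our actual annotations):

```python
def precision_at_k(ranked_labels: list[int], k: int) -> float:
    """Fraction of true synonym pairs among the top-k ranked candidates."""
    top_k = ranked_labels[:k]
    return sum(top_k) / len(top_k)

# Toy annotations, ordered by model confidence (1 = annotated as synonymous).
ranked_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
for k in (5, 10):
    print(f"Precision@{k} = {precision_at_k(ranked_labels, k):.2f}")
```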

For the baseline approach in this experiment, we chose a minimum support that returns around 500 results, to be comparable to the other results. Choosing a lower minimum support would increase the number of returned results but decrease the precision. Compared to the other models, the baseline starts with a low precision at k = 50, with a steadily increasing precision of up to 0.25 at k = 500. Note that the baseline never exceeds a precision of 0.3 with the chosen minimum support value. The unconventional behavior of the curve is due to the assumption that synonymous relations never co-occur for the same subject; this assumption does not hold for DBpedia. Our classification method on top of knowledge graph embeddings shows higher precision for almost all models. HolE, ComplEx, and ANALOGY all show comparably high precision values for high values of k. In contrast, the translation embedding models TransE, TransD, and TransH are considerably weaker than in the earlier experiments. HolE with the L1-metric in Figure 4.5 shows the best results, with a precision of 0.94 at k = 50 and still a precision of 0.7 at k = 500.
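Our method trains a classifier on top of the relation representations; as a simplified, hypothetical illustration of the underlying intuition only, candidate pairs can be ranked by the L1 distance between their relation vectors (the embedding matrix and relation names below are toy data):

```python
from itertools import combinations

import numpy as np

def rank_relation_pairs(rel_emb: np.ndarray, names: list[str], top_k: int = 500):
    """Return the top_k relation pairs with the smallest L1 distance."""
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        dist = np.abs(rel_emb[i] - rel_emb[j]).sum()  # L1 metric; .sum() of
        pairs.append((dist, names[i], names[j]))      # squares would give L2
    pairs.sort(key=lambda p: p[0])
    return pairs[:top_k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 100))  # toy: 5 relations, 100-dimensional vectors
rels = ["genre", "deathPlace", "birthPlace", "award", "almaMater"]
for dist, r1, r2 in rank_relation_pairs(emb, rels, top_k=3):
    print(f"{r1:12s} {r2:12s} L1={dist:.2f}")
```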

During the extensive manual evaluation of the models, we gained detailed insight into the advantages and disadvantages of the models on DBpedia. Widespread synonymous relations that can clearly be distinguished from others are reliably identified as synonyms. These are, for example, the relations genre, almaMater, deathPlace, birthPlace, and award. Problematic, at least in DBpedia, are rarely used relations (fuelSystem, drums), relations with spelling errors in their label (amaMater, birthPace), and relations that are similar to other existing relations (club, youthteam). Several other false positives stem from DBpedia containing relations that are automatically extracted from external data sources and should be integrated and reformulated. As an example, DBpedia imports an external baseball database by creating two relations for every row of a table: e.g., stat1label and stat1value for the first row, and stat2label and stat2value for the second. These false positives are not synonymous relations but problematic relations that should be reformulated.

In all three experiments, we have shown the advantages of our embedding-based classification method on various knowledge graphs. The baseline has been outperformed by almost all embedding techniques because it heavily relies on synonymous relations sharing object entities. In contrast, knowledge graph embedding-based approaches detect synonyms even if they share neither subject nor object entities. As an additional drawback, the baseline requires parameter tuning for the minimum support value.
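The baseline is characterized in this section only by these two properties. The following sketch is an assumed reconstruction from them (shared objects counted as support, pairs discarded once the relations co-occur for a subject), not the original implementation:

```python
from collections import defaultdict
from itertools import combinations

def baseline_synonyms(triples, min_support: int):
    """Assumed frequent-itemset baseline: pairs sharing objects >= min_support
    times, excluding pairs that ever co-occur for the same subject."""
    rels_per_object = defaultdict(set)
    rels_per_subject = defaultdict(set)
    for s, p, o in triples:
        rels_per_object[o].add(p)
        rels_per_subject[s].add(p)

    # Support: how many object entities the two relations share.
    support = defaultdict(int)
    for rels in rels_per_object.values():
        for pair in combinations(sorted(rels), 2):
            support[pair] += 1

    # Pairs that co-occur for a subject violate the baseline's assumption.
    co_occurring = set()
    for rels in rels_per_subject.values():
        co_occurring.update(combinations(sorted(rels), 2))

    return {pair: n for pair, n in support.items()
            if n >= min_support and pair not in co_occurring}

# Toy usage: "team" and "club" share two objects and never the same subject.
triples = [("s1", "team", "Arsenal"), ("s2", "club", "Arsenal"),
           ("s3", "team", "Chelsea"), ("s4", "club", "Chelsea")]
print(baseline_synonyms(triples, min_support=2))
```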

We have seen that many synonymous relations are detected in knowledge graphs if they are frequently used. The semantics of rare relations can hardly be mapped to the knowledge graph embedding, hindering the data-driven synonym detection mechanism. All embedding models show varying quality across the different datasets, with HolE showing consistently good, if not the best, results when choosing the L1-metric.

For most other models, the L1-metric also shows better results. Still, no model could identify all synonymous relations with high quality based only on the knowledge graph itself.

The fine-grained modeling of relations (as in Freebase and DBpedia) is often problematic since such relations can hardly be distinguished from real synonyms, even in our extensive manual evaluation. We observed that relation pairs counted as false positives are often pairs of relations that are extremely similar.

For example, /education/university/local_tuition./.../currency and /education/university/domestic_tuition./.../currency are highly similar in their extension but are, semantically speaking, slightly different. One is used for the currency of the tuition at universities for local students, the other one for domestic students. The semantics of local and domestic students might be very close but are not synonymous. We believe that these relations could be integrated. However, detecting such a difference with a purely data-driven approach seems to be almost impossible.
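To make the notion of "similar in their extension" concrete, one simple, hypothetical measure is the Jaccard overlap of the two relations' (subject, object) sets (toy data and abbreviated relation names below, not the actual Freebase triples):

```python
def extension(triples, relation):
    """All (subject, object) pairs observed for a relation."""
    return {(s, o) for s, p, o in triples if p == relation}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

triples = [
    ("StateU", "local_tuition_currency", "USD"),
    ("StateU", "domestic_tuition_currency", "USD"),
    ("TechU", "local_tuition_currency", "USD"),
    ("TechU", "domestic_tuition_currency", "USD"),
]
local = extension(triples, "local_tuition_currency")
domestic = extension(triples, "domestic_tuition_currency")
print(f"extension overlap: {jaccard(local, domestic):.2f}")
```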

4.2.4 Discussion

In this section, we have shown how relation representations from knowledge graph embeddings can be used to identify synonymous relations in real-world knowledge graphs.

We have performed experiments on Freebase, Wikidata, and DBpedia and showed that embeddings can be used to achieve high precision and high recall. While some embeddings performed significantly better than others, all of them outperformed the frequent-itemset baseline technique.

In the manual analysis of the results for the DBpedia experiments, we gained several insights into the performance and the problems of the embeddings. For some potential synonyms, deciding for or against a synonym was extremely difficult without looking into the data or having background knowledge of the relation's domain. In such complex cases, knowledge graph embedding-based synonym detection techniques are far from achieving perfect results. We believe that most of these complex cases may not be detected by any automatic method when additional background information is not available. Here, we think a human-in-the-loop process could help to further advance synonym detection in knowledge graphs.