
categories requires understanding of the language, training a topic model does not; language understanding is required only when aiming for a manual inspection and interpretation of the formed word clusters. If a specific version of Wikipedia does not provide enough textual content, we might acquire content by crawling news pages in the respective language. However, in that case we cannot reason about the performance of the model, since the topic model might then not reflect the articles in the Wikipedia version of interest.

For future work, it will be interesting to exploit more sophisticated variants of LDA. Some variants allow the incorporation of background knowledge to account for additional structures and priors over words and documents (Andrzejewski et al. [2009], Steyvers et al. [2011], Newman et al. [2011]). Polylingual topic models (Mimno et al. [2009]) might be useful for knowledge transfer among different Wikipedias.

Other variants of LDA, such as the method proposed in Wahabzada et al. [2011], allow faster learning over larger datasets, an asset that may be useful when handling more diverse reference contexts. As an alternative to LDA, we note that the continuous word representations recently proposed by Mikolov et al. [2013] should also be investigated. These word representations are computed from the hidden layers of a neural network and belong to the deep learning techniques that have recently attracted enormous attention in NLP and various other fields. For instance, they may be used to further enhance context representation but also to learn a new form of entity profiles.
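To make the last point concrete, the following is a minimal sketch of how such continuous word representations could be used as an alternative context representation: a mention context is embedded as the average of its word vectors and compared to candidate article texts via cosine similarity. The gensim library, the toy corpus and all parameter values are assumptions for illustration only; this is not the representation used in this thesis.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for tokenized Wikipedia article texts (assumption).
corpus = [
    ["lincoln", "won", "the", "presidential", "election"],
    ["lincoln", "is", "a", "city", "in", "ontario"],
    ["the", "football", "club", "plays", "in", "lincoln"],
]

# Train continuous word representations (skip-gram, in the spirit of Mikolov et al.).
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

def embed(tokens, wv):
    """Represent a context as the average of its word vectors."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

mention_context = ["election", "presidential", "campaign"]
candidate_texts = {
    "Abraham Lincoln": ["lincoln", "won", "the", "presidential", "election"],
    "Lincoln, Ontario": ["lincoln", "is", "a", "city", "in", "ontario"],
}

for entity, text in candidate_texts.items():
    score = cosine(embed(mention_context, model.wv), embed(text, model.wv))
    print(f"{entity}: {score:.3f}")
```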

Having evaluated and discussed our approach to person name linking, we will now give a brief overview of approaches to Named Entity Linking. Since we directly generalize to arbitrary entity types in the following chapter, this section serves as a bridge and highlights the major findings of the relevant approaches. Important aspects will be discussed again in the following chapter.

3.7 An Excursion into Named Entity Linking

In this chapter we have focussed mainly on the linking of person names. Although we have shown in our experiments that thematic context distance can achieve superior results on a dataset containing other types of entity names, i.e. WikiMiscE, we have not directly evaluated the performance for entity types other than persons. One natural next step would therefore be named entity linking. Named entity linking extends person name linking and usually includes locations and organizations. It has been widely studied in recent years and has also been one focus of the Knowledge Base Population shared tasks at the Text Analysis Conferences (TAC) since 2009 (McNamee and Dang [2009], Ji et al. [2010, 2011]).

Hachey et al. [2013] thoroughly compared the most successful approaches of 2009 (Varma et al. [2009]) and 2010 (Lehmann et al. [2010]) against those of Bunescu and Pasca [2006] and Cucerzan [2007]. Since we have no access to either of the employed datasets1, we here summarize the findings of Hachey et al.'s overview concerning the dataset from 2010. The TAC 2010 dataset is a collection of Reuters news articles and web pages containing named entity mentions that are to be linked against a snapshot of Wikipedia articles (from 2008) or to NIL if the underlying entity is not covered in this snapshot.

Varma et al. [2009] achieved the best results in the TAC 2009 challenge. They used a carefully constructed candidate selection method with in-document co-reference resolution for acronym expansion, in combination with a rather simple candidate consolidation method that maximizes the cosine similarity of mention context and candidate context. Their candidate selection method uses an inverted index over alias names against which mentions are matched both token- as well as phrase-wise. Lehmann et al. [2010] use a similar technique but achieve a higher candidate recall. The presumably most important difference is that Lehmann et al. [2010] also use alias information derived from links, which gives them not only more aliases but also enables the usage of priors such as the entity-mention probability. The candidate consolidation of Lehmann et al. [2010] is a heuristic ranking over features such as alias trustworthiness, the similarity between mention and candidate name, and the matching of mention and candidate type. It also includes a supervised binary logistic classifier used for NIL detection. Employing a separate classifier for NIL detection was also reported by Zheng et al. [2010] to slightly increase the results of Varma et al. [2009] on the TAC 2009 corpus.
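As an illustration of the consolidation step described above, the following is a minimal sketch of ranking candidates by the cosine similarity between a mention context and candidate article texts, using plain bag-of-words counts. The tokenization, the example candidates and the absence of the elaborate alias index are simplifications; the sketch is not a re-implementation of Varma et al.'s system.

```python
import math
from collections import Counter

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def rank_candidates(mention_context: str, candidates: dict[str, str]):
    """Rank candidate entities by context similarity (highest first)."""
    ctx = Counter(mention_context.lower().split())
    scored = [(cosine(ctx, Counter(text.lower().split())), name)
              for name, text in candidates.items()]
    return sorted(scored, reverse=True)

# Hypothetical example: two candidate articles for the mention "Lincoln".
candidates = {
    "Abraham Lincoln": "abraham lincoln was the 16th president of the united states",
    "Lincoln, Nebraska": "lincoln is the capital city of the state of nebraska",
}
print(rank_candidates("the president signed the proclamation", candidates))
```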

Interestingly, Hachey et al. found the re-implementation of Cucerzan [2007] superior to that of Varma et al. [2009] (81.6% vs. 84.5% accuracy), as the former achieved a much higher accuracy for covered entity mentions. We assume that this is due to the collective nature of Cucerzan's approach, which can be superior to the simpler contextual similarity method of Varma et al. Hachey et al. argue that this can also result from the nature of the dataset. Varma et al. specialised in organisation and acronym handling, but the number of respective mentions is far lower on the TAC 2010 dataset, i.e. 21% in the dataset from 2009 vs. 15% in the dataset from 2010. However, both methods gain from in-document co-reference resolution, both for acronyms as well as for other mentions. This may also explain why Bunescu and Pasca's method showed the weakest overall performance, with an accuracy of 80.8%.

Bunescu and Pasca use neither in-document co-reference resolution nor candidate selection methods as elaborate as those of Cucerzan or Varma et al. Furthermore, as described in this chapter, Bunescu and Pasca's approach was originally designed for person name linking, while the TAC dataset also includes other entity types. Hachey et al. tried to account for this by generalizing the category set employed in their re-implementation to different top-level category sets.

1The datasets are available only to the participants of the challenge. When we asked for the data, the consortium offered access only under the condition that we participate in the upcoming challenge, which at that time was unfortunately not possible.


Hachey et al. compared all methods against two baselines. The first is a title&redirect baseline that uses exact matches of mentions against Wikipedia titles and redirects. The second is a NIL baseline that assigns every mention to NIL. Interestingly, the title&redirect baseline was found to achieve an overall accuracy of 79.4%, and the NIL baseline arrived at an accuracy of 54.7%. The latter is due to the even distribution of covered and uncovered entity mentions in this dataset. For person mentions in news texts, the title&redirect baseline was found to achieve a nearly perfect accuracy of 97.0%. Hachey et al. attribute this to editorial standards in news, which lead to entity mentions in their most common form and thus to mentions close to Wikipedia titles. However, most of these mentions also truly referred to an entity in Wikipedia. In contrast, this baseline showed a far weaker accuracy of 82% for person mentions in web texts. Unfortunately, since we lack important statistics and also do not know the average ambiguity of these person mentions, we cannot further judge these results.
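For concreteness, the following is a minimal sketch of such a title&redirect baseline: a mention is linked to an article only if it exactly matches a title or a redirect, and to NIL otherwise. The tiny title and redirect dictionaries are hypothetical stand-ins for the Wikipedia snapshot used in the TAC evaluation.

```python
NIL = "NIL"

# Hypothetical excerpts of a Wikipedia snapshot: article titles and redirects.
titles = {"Abraham Lincoln", "Lincoln, Nebraska"}
redirects = {"Abe Lincoln": "Abraham Lincoln", "Lincoln NE": "Lincoln, Nebraska"}

def title_redirect_baseline(mention: str) -> str:
    """Link a mention by exact title/redirect match, otherwise return NIL."""
    if mention in titles:
        return mention
    if mention in redirects:
        return redirects[mention]
    return NIL

for m in ["Abraham Lincoln", "Abe Lincoln", "Lincoln"]:
    print(m, "->", title_redirect_baseline(m))
```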

Hachey et al.'s evaluation allows for several important insights that align with the experimental results obtained in this thesis. The performance of linking approaches need not generalize across datasets and may strongly depend on the number of uncovered entity mentions, but also on the distribution of entity types among mentions. This also concerns the number of examples of uncovered entity mentions in the training dataset, since this fraction usually influences model parameters and thus also the performance on test datasets.

In an analysis of the systems' performance broken down by entity type, Hachey et al. found that all systems perform best for persons, with remarkably lower results for organizations and geopolitical entities (about 20% lower accuracy). As no approach was found to perform consistently superior across document types (news or blog) and entity types, the authors suggest the combination of entity-specific models or voting combinations.

Locations and organisations as well as their mentions have different characteristics than persons. First off, locations may be mentioned off-topic as geographic anchors, e.g. as a reference in a news article reporting on some sports event. In such cases, the reference context may not provide enough evidence to distinguish among mentions of Lincoln, Ontario, Lincoln, Alabama or Lincoln, Kansas. Furthermore, the article texts of locations usually describe historical, geographical or political characteristics and in most cases do not thematically relate to reference contexts. Notably, a mention Lincoln may also refer to a person (Abraham Lincoln), an educational institution (University of Lincoln), an English football club (Lincoln City F.C.) or many more candidate entities. Approaches restricted to specific entity types (e.g. persons) may then further suffer from potential errors of NER models.

Similar characteristics apply to organisations; probably the most difficult are sports associations, for which we often find not only natural language text but also many tables in the article text. Tables are inherently relational, and we will handle them with a more relational approach that treats relations more explicitly than the LDA topic modelling technique.

Instead of focussing on named entities, we will investigate general entity linking in the next chapter. General entity linking covers named entity linking but goes a step further by treating all kinds of entities, i.e. concepts usually denoted by noun phrases. We aim for a collective approach that may gain from interactions among and across all kinds of entities.

The candidate retrieval methods of Varma et al. [2009] and Lehmann et al. [2010] have been an inspiration for the method we will present in the following chapter. For general entity linking, we will extend them with relational information derived from co-occurrences of entities, using ensemble queries that treat all mentions in a document.
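To give a rough intuition for this direction, the following is a minimal sketch of how co-occurrence information could be combined with a candidate's contextual score: each candidate receives a bonus for how often it co-occurs (e.g. via Wikipedia links) with candidates of the other mentions in the same document. All names, counts and the weighting are hypothetical; this illustrates the general idea only and is not the method developed in the next chapter.

```python
# Hypothetical co-occurrence counts between entity pairs (e.g. from Wikipedia links).
cooccurrence = {
    frozenset({"Abraham Lincoln", "American Civil War"}): 120,
    frozenset({"Lincoln, Nebraska", "Nebraska"}): 45,
}

def collective_score(candidate, context_score, other_candidates, weight=0.01):
    """Combine a contextual score with a co-occurrence bonus from other mentions."""
    bonus = sum(cooccurrence.get(frozenset({candidate, other}), 0)
                for other in other_candidates)
    return context_score + weight * bonus

# Candidates for the mention "Lincoln" in a document that also mentions the Civil War.
other = ["American Civil War"]
for cand, ctx in [("Abraham Lincoln", 0.41), ("Lincoln, Nebraska", 0.44)]:
    print(cand, round(collective_score(cand, ctx, other), 3))
```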
