
4.5 Mention Enrichment

Im Dokument Entity Linking to Wikipedia (Seite 134-137)

We assume the input to our linking model to be a natural language text document with a collection of mentions M = {m1, . . . , mk} to link. These mentions can either be given, as is the case for the benchmark corpora described in Section 4.2, or they can be provided by a chunker or a NER model. Note that in contrast to other



approaches such as Hoffart et al. [2011b], we do not restrict ourselves to mentions of named entities but also treat conceptual entities such as bank or tree. We may evaluate the available type information in candidate retrieval, but we do not strictly rely on it. Instead, we mainly use it in our heuristic name expansion algorithm.

4.5.1 Name Expansion

Entities, and persons in particular, are usually mentioned only once with their full name in a document. Subsequent mentions are often abbreviated and use, for instance, only the last name of a person. This can be misleading for candidate retrieval, as it can notably increase the number of candidates and may also distract from the correct candidate when a different entity has a notably higher entity-mention probability EMP. For instance, this would be the case for a mention Wilhelm Busch that is later in the document abbreviated to Busch.

To account for this, we propose name expansion with a simple, heuristic in-document co-reference resolution. For this, we first apply the publicly available Apache OpenNLP NER model1 to infer entity types for the mentions in a document. Then we collect all mentions from the document and use each mention's surface form along with its type information (if present) for name expansion. More specifically, we iterate over the mention sequence M = {m1, . . . , mk} and search for mentions that are partially or token-wise contained in a preceding mention, i.e. name(mi) ⊂ name(mi−j) for any i = 2, . . . , k and 0 < j < i. If such a match is found and the types of the two corresponding mentions are the same, e.g. type(mi) = type(mi−j) = person, the shorter mention is expanded to the longer one.
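The token-wise containment test name(mi) ⊂ name(mi−j) can be sketched as follows; this is an illustrative snippet, not the implementation used in the thesis:

```python
def tokens(name):
    """Split a surface form into lower-cased tokens."""
    return set(name.lower().split())

def is_contained(shorter, longer):
    """Token-wise proper containment: every token of the shorter
    surface form occurs in the longer one, and the two forms differ."""
    return tokens(shorter) < tokens(longer)

# "Gore" is token-wise contained in "Al Gore" ...
assert is_contained("Gore", "Al Gore")
# ... but "Gore Bay" is not, since "Bay" does not occur in "Al Gore".
assert not is_contained("Gore Bay", "Al Gore")
```

Note that set-based containment ignores token order and duplicates, which suffices for the short person and location names targeted here.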

Example 14 (Name expansion) For the mention collection

M = {(Al Gore, person), (Gore, person), (Gore Bay, location)}

name expansion yields

Mexp = {(Al Gore, person), (Al Gore, person), (Gore Bay, location)}.

This name expansion based on co-reference resolution is similar to Shen et al. [2012] but additionally incorporates the type information. Cucerzan [2007] used a similar method but required mentions to have the same type for expansion. We relax this assumption when encountering mentions of unknown entity type. Assume that the type of Gore in Example 14 were unknown because the NER model failed to recognize it as a person mention. Then, we still assume that it resolves to the person mention Al Gore, since the abbreviation of person names is presumably much more common than the abbreviation of location names. Thus, since this name expansion does not fully depend on type information, mentions without type assignment are also handled.

1 http://opennlp.apache.org/
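The complete heuristic, including the relaxation for mentions of unknown type, can be sketched as follows. This is our own illustrative rendering (function names and data representation are assumptions, not the original implementation); mentions are (surface form, type) pairs in document order, with None for an unknown type:

```python
def expand_mentions(mentions):
    """Heuristic in-document name expansion: a shorter mention whose
    tokens are properly contained in a preceding mention is expanded
    to that longer mention, provided the types match or the shorter
    mention's type is unknown."""
    expanded = []
    for surface, mtype in mentions:
        # search preceding (already expanded) mentions, nearest first
        for prev_surface, prev_type in reversed(expanded):
            shorter = set(surface.lower().split())
            longer = set(prev_surface.lower().split())
            if shorter < longer and (mtype is None or mtype == prev_type):
                surface, mtype = prev_surface, prev_type
                break
        expanded.append((surface, mtype))
    return expanded

M = [("Al Gore", "person"), ("Gore", "person"), ("Gore Bay", "location")]
expand_mentions(M)
# → [("Al Gore", "person"), ("Al Gore", "person"), ("Gore Bay", "location")]
```

Searching the already-expanded list means a chain of abbreviations (Gore after Al Gore after a still longer form) resolves to the longest preceding form; the untyped case, e.g. ("Gore", None), is expanded just as in the text above.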

Given that our matching is token-based and not character-based, we do not expand acronyms as done by Cucerzan [2007] or Varma et al. [2009]. This would certainly be a point worth investigating in future work, particularly because Hachey et al. [2013] report that the performance of Cucerzan [2007] drops by about 5% when this expansion is omitted. The authors report a similar decrease in performance when the acronym expander of Varma et al. [2009] is omitted. Here, we also experimentally evaluated the effect of name expansion and found that it has a positive impact on linking performance. We will detail our findings further in Section 4.8.2.

4.5.2 Context Representation

We use different context representations for candidate retrieval and candidate consolidation. For candidate retrieval, we emphasize the local, mention-specific context. For candidate consolidation, we will use document-level information and also latent topics; the details will be described in Section 4.7. Here we describe the mention context representation used in candidate retrieval.

Since the disambiguating quality of the mention context is important for entity linking, we extract context not only on the document level but also from local, mention-specific context features. To do so, we first extract all PoS tags in the document using the Apache OpenNLP PoS tagging tool. Then, the local context words are the two nouns to the left and to the right of the mention. This is motivated by the idea that noun phrases, named entities or conceptual entities that imply the sense of an ambiguous mention usually appear close to this mention.
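A minimal sketch of this local context extraction, assuming a PoS-tagged token sequence with Penn-Treebank-style tags (where noun tags start with "NN", as produced by common taggers); the function name and interface are our own:

```python
def local_context(tagged_tokens, start, end, window=2):
    """Return the `window` nouns left and right of a mention.
    tagged_tokens: list of (token, pos) pairs for the document;
    the mention spans token indices [start, end)."""
    nouns_left = [t for t, pos in tagged_tokens[:start] if pos.startswith("NN")]
    nouns_right = [t for t, pos in tagged_tokens[end:] if pos.startswith("NN")]
    return nouns_left[-window:] + nouns_right[:window]

tagged = [("The", "DT"), ("river", "NN"), ("bank", "NN"), ("near", "IN"),
          ("the", "DT"), ("old", "JJ"), ("bridge", "NN"), ("collapsed", "VBD")]
# for the ambiguous mention "bank" at token index 2:
local_context(tagged, 2, 3)  # → ["river", "bridge"]
```

Here the nouns "river" and "bridge" near the mention "bank" hint at the riverside sense rather than the financial institution, which is exactly the disambiguation signal the text describes.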

Additionally, we extract the top 20 TF-IDF ranked keywords from the document text as document-level keywords. For this, we use Wikipedia as background corpus for IDF computation. Since short documents may contain a lower number of important words, we refrain from using a threshold and simply use the 20 words with the highest TF-IDF score. The document-level keyword set is then localized for each mention. From the joint set of local context words and document keywords, we keep only those terms that relate to any of the mention's candidates. More specifically, the candidate-specific terms are those words that appear at least once in the text of any candidate entity whose title matches the surface form of the mention. This candidate-dependent keyword selection aims at using specifically those terms that are discriminative for entities. In the same way, we compute keywords from the headline of the input document, assuming that headline information is especially important. Since we refrain from tuning extraction methods to specific corpora, we use a simple headline extraction method that assumes the headline to be the first line in the document, followed by a line break.
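The two keyword steps above, TF-IDF ranking against a background corpus and candidate-dependent filtering, can be sketched as follows. This is an illustrative simplification under assumed interfaces: `df` stands in for precomputed Wikipedia document frequencies, and candidate texts are plain strings:

```python
import math
from collections import Counter

def top_tfidf_keywords(doc_tokens, df, n_docs, k=20):
    """Rank document tokens by TF-IDF and return the top k.
    df: term -> document frequency in the background corpus
    (Wikipedia in the text); n_docs: corpus size."""
    tf = Counter(doc_tokens)
    scores = {t: c * math.log(n_docs / (1 + df.get(t, 0)))
              for t, c in tf.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def candidate_specific(terms, candidate_texts):
    """Keep only terms that occur at least once in the text of
    some candidate entity (candidate-dependent selection)."""
    cand_tokens = set()
    for text in candidate_texts:
        cand_tokens.update(text.lower().split())
    return [t for t in terms if t.lower() in cand_tokens]

doc = ["the", "river", "bank", "the"]
keywords = top_tfidf_keywords(doc, {"the": 90, "bank": 5, "river": 3}, 100, k=2)
# frequent background words like "the" score low; "river", "bank" rank on top
candidate_specific(keywords, ["A river bank in Ontario"])
```

The no-threshold design mentioned in the text shows up here as the fixed k=20 cut-off: even a short document always contributes its 20 best-scoring words.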

