Entities in Wikipedia

In the document Entity Linking to Wikipedia (pages 35-39)

candidate retrieval model. These will be described next, together with the other main attributes of entities in Wikipedia, i.e. article texts and categories.

2.3 Entities in Wikipedia

2.3.1 Textual Descriptions

In Wikipedia, each entity has a natural language context in its article text. Article texts provide textual descriptions of entities that can be used to assess the contextual similarity of mentions and entities.

Notation (Article text)

For any entity e ∈ W we use text(e) to denote the entity’s context, which is derived from the respective Wikipedia article text.

By Wikipedia standards, an article is supposed to describe its entity in a concise but comprehensive way. In the following we consider the Wikipedia article text, i.e. the plain text without markup, tables, infoboxes or figures, as a natural language text definition of the described entity. Analogously, the document referencing a mention provides the (natural) language context text(m) of a mention m.

Notation (Mention context)

For a mention m we use text(m) to denote its context which is derived from the document in which the mention appears.

In general, we assume a context text(m) to comprise all words surrounding a mention m, meaning either the complete document or a restricted, localized context window, for example five words left and right of the mention. This context is assumed to disambiguate the mention so that its true underlying entity can be inferred. Note that the natural language text in Wikipedia is comparable to the natural language context of a mention, assuming overlap in the underlying vocabularies. This allows us to formulate entity linking based on a similarity function over the two contexts text(m) and text(e). The most prominent contextual similarity measure is cosine similarity, which compares the word vectors of the entity and mention contexts. We will give further details in Chapter 3 where we also propose new contextual measures for entity linking.
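The following is a minimal sketch of such a contextual comparison, assuming plain term-frequency vectors and a simple regular-expression tokenizer. The function names (`tokenize`, `context_window`, `cosine_similarity`) and the toy texts are hypothetical and only serve as illustration; they are not the measures proposed in Chapter 3.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def context_window(tokens, start, end, size=5):
    """Return up to `size` tokens on each side of the mention span tokens[start:end]."""
    left = tokens[max(0, start - size):start]
    right = tokens[end:end + size]
    return left + right


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty context shares nothing with any other context
    return dot / (norm_a * norm_b)


# Toy data, invented for illustration (not actual Wikipedia article text).
document = tokenize(
    "The band toured Europe and John Taylor played bass guitar on every rock song"
)
mention_start, mention_end = 5, 7  # token span of the mention "john taylor"
text_m = context_window(document, mention_start, mention_end, size=5)

text_e = {
    "John Taylor (bass guitarist)": tokenize(
        "English rock musician and bass guitar player in a band"
    ),
    "John Taylor (jazz)": tokenize("British jazz pianist and composer"),
}

scores = {title: cosine_similarity(text_m, tokens) for title, tokens in text_e.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Note that this sketch counts every token equally, so function words such as "and" contribute to the score; a weighting scheme or stop word filter would be a natural refinement.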

2.3.2 Entity Categorization

To group articles on similar subjects, Wikipedia employs a categorization system.

Below the top-level categories distinguishing persons from cultural or economic entities, many other categories exist that further describe the entity depicted in an article. Categories may be thematically related to the article content but also state the gender of a person or the founding year of an organization.

Chapter 2 Entity Linking: Preliminaries

Table 2.1: Entities and a selection of their assigned categories from the English Wikipedia (distinct categories are separated by a semicolon).

| title(e) | categories c(e) |
|---|---|
| John Taylor (bass guitarist) | Living people; English rock bass guitarists; Power Station (band) members; Duran Duran members; English Roman Catholics; Ivor Novello Award winners; . . . |
| John Taylor (jazz) | Living people; Post-bop pianists; ECM artists; Musicians from Manchester; British jazz pianists; . . . |
| John Taylor (athlete) | American sprinters; Athletes (track and field) at the 1908 Summer Olympics; Olympic medallists in athletics (track and field); Olympic track and field athletes of the United States |

By Wikipedia standards, every article is required to have at least one category that is manually assigned by a contributor using Wikipedia markup language. We use the following notation to refer to the categories of an entity.

Notation (Categories)

We denote the collection of all Wikipedia categories by $C_W = \{c_1, \ldots, c_{|C_W|}\}$. The subset of categories applying to a specific entity $e \in W$ is denoted by $c(e) = \{c_1(e), \ldots, c_{|c(e)|}(e)\} \subset C_W$.

Tab. 2.1 lists some exemplary categories from the English Wikipedia. For example, the categories assigned to John Taylor (bass guitarist) depict his profession as a musician and the genre of music he is involved with, i.e. rock. While this entity is grouped with the musician John Taylor (jazz) on a higher level, more specific categories distinguish the rock guitarist from the jazz pianist (e.g. English rock bass guitarists and British jazz pianists).

Originally a tree, the Wikipedia category system has evolved into a graph with many interconnections and loops. Due to these loops and other inconsistencies, we found the analysis of Wikipedia’s category system non-trivial in preliminary studies. Moreover, even though there exist guidelines on categorization [1], Wikipedia categories can be very general but also overly specific. Rather general categories such as Living people apply to very many entities, whereas overly specific categories such as Fictional elephants apply to only very few entities.

[1] http://en.wikipedia.org/wiki/Wikipedia:Categorization

Table 2.2: Examples of Wikipedia titles and associated redirects (distinct redirects are separated by a semicolon).

| title(e) | r(e) |
|---|---|
| Nick Rhodes | Nicholas James Bates |
| Stephen Duffy | Steven Tin Tin Duffy; Stephen TinTin Duffy; Stephen ’Tin Tin’ Duffy; Stephen Tin Tin Duffy; Duffy (group) |
| John Taylor (bass guitarist) | John Taylor (Duran Duran); Nigel John Taylor |

As categories group entities by subject, they can be used to measure semantic relatedness among entities and also to extend contextual information. The semantic relatedness expressed by categories (Strube and Ponzetto [2006]) has also been exploited in entity linking approaches. These approaches usually do not consider all available categories but use selected subsets, often chosen manually, either to avoid noise or to emphasize semantic relatedness in specific subgroups. For example, Cucerzan [2007] used filtered subsets for a named entity disambiguation model, while Bunescu and Pasca [2006] used the specific branch People by Occupation for their person name disambiguation model.
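As an illustration of how category sets c(e) might be compared, the sketch below uses the Jaccard coefficient over the categories listed in Table 2.1. The Jaccard coefficient is an assumption chosen for simplicity, not the relatedness measure of any of the cited approaches, and the `exclude` parameter is a hypothetical stand-in for the filtering of overly general categories mentioned above. The category sets are partial, since the table itself shows only a selection.

```python
def category_overlap(cats_a, cats_b, exclude=frozenset()):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two category sets.

    Categories listed in `exclude` are removed from both sets first.
    """
    a = set(cats_a) - exclude
    b = set(cats_b) - exclude
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# c(e), restricted to the categories shown in Table 2.1.
categories = {
    "John Taylor (bass guitarist)": {
        "Living people",
        "English rock bass guitarists",
        "Power Station (band) members",
        "Duran Duran members",
        "English Roman Catholics",
        "Ivor Novello Award winners",
    },
    "John Taylor (jazz)": {
        "Living people",
        "Post-bop pianists",
        "ECM artists",
        "Musicians from Manchester",
        "British jazz pianists",
    },
    "John Taylor (athlete)": {
        "American sprinters",
        "Athletes (track and field) at the 1908 Summer Olympics",
        "Olympic medallists in athletics (track and field)",
        "Olympic track and field athletes of the United States",
    },
}

bass = categories["John Taylor (bass guitarist)"]
jazz = categories["John Taylor (jazz)"]
athlete = categories["John Taylor (athlete)"]

print(category_overlap(bass, jazz))     # the two sets share only "Living people"
print(category_overlap(bass, athlete))  # the two sets share no listed category
print(category_overlap(bass, jazz, exclude=frozenset({"Living people"})))
```

With these partial sets, the only overlap between the two musicians comes from the very general category Living people, which shows why unfiltered categories can be a noisy relatedness signal.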

2.3.3 Alias Names for Entities

Wikipedia provides several means to collect alias names for its entities by which the synonymy and polysemy of entity names can be resolved. The first important means are redirects that can be used to collect alternative names for an entity and thus account for the synonymy of entity names. A redirect is a meta-page that contains only a forwarding link to an actual entity. The title of a redirect page is considered as an alias of the target entity the redirect page points to. An entity in Wikipedia may have several redirects, one for every alternative name that a Wikipedia contributor used to refer to it.

Notation (Redirects)

For any entity $e \in W$, we use $r(e) = \{r_1(e), \ldots, r_{|r(e)|}(e)\}$ to denote the collection of titles that redirect to $e$.

Tab. 2.2 lists examples of entity titles and their associated redirects. For instance, redirects may hold the full name of a person (e.g. Nicholas James Bates for Nick Rhodes), cover nickname variants (e.g. Stephen ’Tin Tin’ Duffy for Stephen Duffy) or provide more name variants.

Redirects provide a large resource of synonyms and have been exploited extensively in the literature. However, it is seldom considered that redirects can also be misleading, since they do not necessarily constitute equivalence relations. For instance, the German chancellor Angela Merkel has a redirect Ulrich Merkel. This is not an identity relation as Ulrich Merkel is a different person, namely the first husband of Angela Merkel. Thus, the usage of redirects may introduce errors in the disambiguation process. In the worst case scenario, we would link an uncovered mention to an entity in Wikipedia because of an erroneous redirect mapping in this alias dictionary. Alternatively, we may erroneously link a mention to a merely related entity, as would be the case for Ulrich Merkel. But then, assuming that the creation of such a redirect was somehow intentional, we argue that the predicted entity may at least provide useful information both for the reader of the linked context and for other link consuming systems.

[Figure 2.3 plot: "Frequency of redirect set sizes". x-axis: redirect set size (0 to 15); y-axis: frequency (log10). Bar labels in order of increasing set size: 6.26, 5.92, 5.51, 5.23, 4.95, 4.76, 4.54, 4.38, 4.23, 4.1, 3.95, 3.86, 3.73, 3.65, 3.55, 3.46]

Figure 2.3: To measure synonymy in terms of redirects, the figure shows the frequency of redirect set sizes. The plot is clamped to set sizes between 0 and 15 due to the long tail of entities with up to 900 distinct redirects. The number of distinct redirects of a Wikipedia entity may be used as an approximation of the number of its synonyms.
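As a rough reading aid for Figure 2.3, the log10 bar labels can be converted back into approximate entity counts. The sketch below assumes that the sixteen labels are ordered by redirect set size from 0 to 15; it ignores the long tail beyond 15 and inherits the rounding of the labels to two decimals, so the resulting numbers are only approximate.

```python
# log10 frequencies read off Figure 2.3, assumed to be ordered by set size 0..15.
LOG10_FREQUENCY = [
    6.26, 5.92, 5.51, 5.23, 4.95, 4.76, 4.54, 4.38,
    4.23, 4.10, 3.95, 3.86, 3.73, 3.65, 3.55, 3.46,
]


def approximate_counts(log_values):
    """Convert log10 frequencies back into approximate entity counts."""
    return [10 ** value for value in log_values]


def share_with_at_least(counts, min_size):
    """Fraction of entities whose redirect set has at least `min_size` elements."""
    return sum(counts[min_size:]) / sum(counts)


counts = approximate_counts(LOG10_FREQUENCY)
total = sum(counts)
share_two_or_more = share_with_at_least(counts, 2)

print(f"approximate number of entities covered by the plot: {total:,.0f}")
print(f"share with at least two redirects: {share_two_or_more:.1%}")
```

Under these assumptions the bars add up to roughly 3.4 million entities, and a little over one fifth of them have two or more redirects, which is in line with the figure of about 23% reported in the text once the long tail is taken into account.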

In general, we assume such errors to be rare and furthermore, a better defined redirect scheme would already require a disambiguation step. Therefore we consider all redirects as is without pre-processing and create alias dictionaries that map all redirects of an entity to its unique title (and vice versa).
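A minimal sketch of such alias dictionaries is given below, using the redirects from Table 2.2 and the Angela Merkel example. The function name `build_alias_dictionaries` and the plain dictionary layout are assumptions made for illustration. The sketch also assumes that each redirect title points to exactly one entity, and it takes all redirects as is, so the misleading alias Ulrich Merkel is kept.

```python
def build_alias_dictionaries(redirects):
    """Build both lookup directions from a mapping title -> list of redirect titles.

    Returns (alias_to_title, title_to_aliases). Redirects are used without
    any filtering or pre-processing.
    """
    alias_to_title = {}
    title_to_aliases = {}
    for title, redirect_titles in redirects.items():
        title_to_aliases[title] = set(redirect_titles)
        for alias in redirect_titles:
            # Assumption: a redirect page has exactly one target entity.
            alias_to_title[alias] = title
    return alias_to_title, title_to_aliases


# r(e) for the entities of Table 2.2, plus the Angela Merkel example from the text.
redirects = {
    "Nick Rhodes": ["Nicholas James Bates"],
    "Stephen Duffy": [
        "Steven Tin Tin Duffy",
        "Stephen TinTin Duffy",
        "Stephen 'Tin Tin' Duffy",
        "Stephen Tin Tin Duffy",
        "Duffy (group)",
    ],
    "John Taylor (bass guitarist)": [
        "John Taylor (Duran Duran)",
        "Nigel John Taylor",
    ],
    "Angela Merkel": ["Ulrich Merkel"],
}

alias_to_title, title_to_aliases = build_alias_dictionaries(redirects)

print(alias_to_title["Nigel John Taylor"])
print(sorted(title_to_aliases["Stephen Duffy"]))
print(alias_to_title["Ulrich Merkel"])  # kept as is, although not an identity relation
```

A lookup of a mention's surface form in `alias_to_title` then yields a candidate entity, while `title_to_aliases` gives the size of the redirect set for each entity.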

Notably, using redirects, we can also estimate the number of synonyms for an entity. As they provide name alternatives, redirects are comparable to synonyms and the number of redirects of an entity may serve as an approximation of the true number of potential synonyms of this entity. To illustrate synonymy in terms of redirect numbers, we counted the number of redirects for all articles in Wikipedia and depict in Fig. 2.3 how often a specific number appears. For visualization purposes, this is restricted to redirect set sizes between zero and 15 as there is a long tail of entities with more than 100 and up to 900 distinct redirects. From the figure we see that about 23% of the 3.4 million entities in the English Wikipedia have at least two redirects. This finding has two implications. First, a rather large number of entities can be assumed to have several synonyms and thus a high variation in
