Interlinkage of Entities in Wikipedia - Entity Linking to Wikipedia

Chapter 2 Entity Linking: Preliminaries

in the article text of another entity e, contributing authors are expected to link at least the first mention to the corresponding article of e′.

More specifically, a link l in Wikipedia is a triple of link source, link anchor text and link target.

Notation (Links in Wikipedia)

Let L denote the collection of all links in W. A link l ∈ L is a triple l = (l_s, l_t, l_a), where l_s denotes the link source, l_t the link target and l_a the link anchor text.

Link sources and link targets are entities in Wikipedia and thus we have l_s ∈ W and l_t ∈ W. Link anchor texts are part of an entity’s article text and formed of one or more strings from the Wikipedia vocabulary V_W, i.e. l_a ∈ V_W × ... × V_W. A link l is placed in the article text of its link source l_s using Wikipedia markup notation.

This notation couples link anchor text and link target, i.e. [[l_t|l_a]]. Alternatively, it may merely hold the link target, i.e. [[l_t]], if link anchor text and link target do not differ, i.e. l_a = title(e) for the target entity e = l_t.
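To make the markup notation concrete, the following minimal sketch extracts link triples from a piece of article markup. It is an illustration only and not the extraction pipeline used in this thesis: the names `Link`, `WIKILINK` and `extract_links` are hypothetical, and the regular expression deliberately ignores namespaces, section links, templates and other features of real Wikipedia markup.

```python
import re
from typing import List, NamedTuple


class Link(NamedTuple):
    """A link triple; field names are illustrative assumptions."""
    source: str  # l_s: entity whose article text contains the link
    target: str  # l_t: entity the link points to
    anchor: str  # l_a: visible anchor text


# Simplified pattern for [[target]] and [[target|anchor]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]*))?\]\]")


def extract_links(source: str, markup: str) -> List[Link]:
    """Return all links found in the markup of the article `source`."""
    links = []
    for match in WIKILINK.finditer(markup):
        target = match.group(1).strip()
        anchor = match.group(2)
        # Without an explicit anchor, the target title serves as anchor text.
        anchor = target if anchor is None else anchor.strip()
        links.append(Link(source, target, anchor))
    return links


markup = (
    "[[John Taylor (bass guitarist)|John Taylor]] and [[Nick Rhodes]] "
    "formed Duran Duran in [[Birmingham]], England in 1978."
)
links = extract_links("Duran Duran", markup)
for link in links:
    print(link)
```

The first link carries an anchor text that differs from its target title, whereas the other two use the short form in which both coincide.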

The collection of all links L in Wikipedia encodes a directed hyperlink graph, where nodes are entities and edges are the links between them. Using the following notation, we distinguish between inlinks and outlinks.

Notation (Inlinks and Outlinks)

The inlinks L_in(e) of an entity e are the links where e is the link target l_t: L_in(e) = {l ∈ L | (l_s = ·, l_t = e, l_a = ·)} ⊂ L.

The outlinks L_out(e) of an entity e are the links where e is the link source l_s: L_out(e) = {l ∈ L | (l_s = e, l_t = ·, l_a = ·)} ⊂ L.
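The two sets can be obtained in a single pass over the link collection. The sketch below is a hedged illustration under the assumption that links are stored as (source, target, anchor) triples; `Link` and `index_links` are hypothetical names, and the four sample links are the ones from the Duran Duran example in Fig. 2.4.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    source: str  # l_s
    target: str  # l_t
    anchor: str  # l_a


# The four links of the Duran Duran excerpt (Fig. 2.4).
LINKS = [
    Link("Duran Duran", "John Taylor (bass guitarist)", "John Taylor"),
    Link("Duran Duran", "Nick Rhodes", "Nick Rhodes"),
    Link("Duran Duran", "Birmingham", "Birmingham"),
    Link("Duran Duran", "Rum Runner (nightclub)", "Rum Runner"),
]


def index_links(links):
    """Group links by target (inlinks) and by source (outlinks)."""
    inlinks = defaultdict(set)   # e -> L_in(e)
    outlinks = defaultdict(set)  # e -> L_out(e)
    for link in links:
        inlinks[link.target].add(link)
        outlinks[link.source].add(link)
    return inlinks, outlinks


inlinks, outlinks = index_links(LINKS)
print(len(outlinks["Duran Duran"]))
print(inlinks["John Taylor (bass guitarist)"])
```

Every link appears exactly once in each index: as an outlink of its source and as an inlink of its target.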

For illustration, we give the following example using the links depicted in Fig. 2.4.

Example 3

As Fig. 2.4 shows, the article text of Duran Duran contains the link l¹ as an outlink, i.e. l¹ ∈ L_out(Duran Duran). This link is described by

link source: l_s = Duran Duran
link anchor text: l_a = John Taylor
link target: l_t = John Taylor (bass guitarist).

At the same time, the link l¹ is an inlink of the link target and we have l¹ ∈ L_in(John Taylor (bass guitarist)). Here, the link target is obfuscated by Wikipedia’s markup notation, i.e.

[[John Taylor (bass guitarist)|John Taylor]].

2.4 Interlinkage of Entities in Wikipedia

John Taylor and Nick Rhodes formed Duran Duran in Birmingham, England in 1978, where they became the resident band at the city’s Rum Runner nightclub.

text(Duran Duran)

l¹ = (l_s = Duran Duran, l_t = John Taylor (bass guitarist), l_a = John Taylor)
l² = (l_s = Duran Duran, l_t = Nick Rhodes, l_a = Nick Rhodes)
l³ = (l_s = Duran Duran, l_t = Birmingham, l_a = Birmingham)
l⁴ = (l_s = Duran Duran, l_t = Rum Runner (nightclub), l_a = Rum Runner)

l¹, ..., l⁴ ∈ L_out(Duran Duran)

Figure 2.4: Excerpt from the article text of the entity Duran Duran (top) together with the contained links (bottom). Each link l¹, ..., l⁴ is contained in the outlink set L_out(Duran Duran) and we have l_s = Duran Duran for each l¹, ..., l⁴.

The example above shows two properties of links. First, the link anchor text l_a can be considered as an alias for the target entity that is referenced through the link target l_t. Second, a link constitutes a disambiguated mention m. The link l¹ in this example provides a grounded mention of the entity John Taylor (bass guitarist), as the underlying entity for the ambiguous mention John Taylor is annotated in the link target.

This property renders Wikipedia a source of disambiguated data. Each inlink l ∈ L_in(e) constitutes a textual mention of the entity e in the article text of the link source e′. Through the link target l_t, the link anchor text l_a is annotated with the ground truth entity, i.e. e⁺(l_a) = l_t = e. For each Wikipedia entity e with at least one inlink, we can extract the textual contents of all referencing articles in L_in(e). The derived dataset can then be used both for training and for evaluating a disambiguation model, as done first by Bunescu and Pasca [2006] and subsequently in many other approaches (Pilz et al. [2009], Pilz and Paaß [2009], Pilz [2010], Ratinov et al. [2011]). This procedure allows the learning of entity linking models in a supervised setting not only for English but also for all other language versions of Wikipedia. For instance, in this thesis we learn linking models for German and French, two languages previously neglected in the entity linking literature (Chapter 3).

Furthermore, we may use the same data basis to learn a linking model that accounts for uncovered entities. One possible avenue for this is to use the links that point to a target that does not yet exist in Wikipedia¹. However, such links are rare and usually appear in listings with little natural language context, such as disambiguation pages. They may also result from a faulty annotation, for instance when an author does not realize that the relevant article already exists and erroneously chooses a different link target.

¹ The Wikipedia software colours such links in red.


A presumably better alternative was proposed in Bunescu and Pasca [2006]. The authors simulate uncovered mentions by removing the article for a fixed fraction of entities. All mentions previously linked to these entities are then re-assigned to NIL. We follow this strategy when learning models with Wikipedia data and will detail the extraction of such datasets in Chapter 3.
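The simulation strategy described above can be sketched as follows. This is a hedged illustration and not the dataset extraction detailed in Chapter 3: `Mention` and `simulate_uncovered` are hypothetical names, NIL is represented by `None`, and the fraction of removed entities as well as the random seed are free parameters chosen for the example.

```python
import random
from typing import NamedTuple, Optional


class Mention(NamedTuple):
    anchor: str            # link anchor text l_a, i.e. the mention
    source: str            # article in which the mention occurs (l_s)
    entity: Optional[str]  # ground truth entity l_t; None stands for NIL


def simulate_uncovered(mentions, fraction, seed=0):
    """Treat a fixed fraction of entities as uncovered.

    All mentions that were linked to a removed entity are re-assigned
    to NIL (None); all other mentions keep their ground truth entity.
    """
    rng = random.Random(seed)
    entities = sorted({m.entity for m in mentions if m.entity is not None})
    removed = set(rng.sample(entities, round(fraction * len(entities))))
    relabelled = [
        m._replace(entity=None) if m.entity in removed else m
        for m in mentions
    ]
    return relabelled, removed


mentions = [
    Mention("John Taylor", "Duran Duran", "John Taylor (bass guitarist)"),
    Mention("Nick Rhodes", "Duran Duran", "Nick Rhodes"),
    Mention("Birmingham", "Duran Duran", "Birmingham"),
    Mention("Rum Runner", "Duran Duran", "Rum Runner (nightclub)"),
]
relabelled, removed = simulate_uncovered(mentions, fraction=0.5)
for mention in relabelled:
    print(mention.anchor, "->", mention.entity)
```

Sampling at the level of entities rather than mentions ensures that a removed entity has no remaining labelled mention, which mimics an entity that is genuinely absent from the knowledge base.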

2.4.1 Link-Based Relatedness of Entities

A link in Wikipedia implies a semantic relation between the two Wikipedia entities it connects. Therefore, Wikipedia’s hyperlink graph allows the derivation of relatedness measures and is thus a key component in most research on entity linking.

The most commonly used measure was introduced by Milne and Witten [2008a], who presented a Wikipedia adaptation of the normalized Google distance (Cilibrasi and Vitanyi [2007]). Replacing Google search results with Wikipedia links, Milne and Witten [2008a] defined the semantic relatedness (SRL) of two entities e and e′ over their inlink sets L_in(e) and L_in(e′):

\[
\mathrm{SRL}(e, e') = \frac{\log\left(\max(|L_{in}(e)|, |L_{in}(e')|)\right) - \log\left(|L_{in}(e) \cap L_{in}(e')|\right)}{\log(|L|) - \log\left(\min(|L_{in}(e)|, |L_{in}(e')|)\right)} \in [0,1] \tag{2.2}
\]

Note that the range of [0,1] given above only holds if we take into account edge cases that are not explicitly covered in Milne and Witten [2008a]. These arise when an entity has no inlinks or the shared set of inlinks is empty. Thus, we define

\[
L_{in}(e) = \emptyset \;\lor\; L_{in}(e') = \emptyset \;\lor\; L_{in}(e) \cap L_{in}(e') = \emptyset \;\rightarrow\; \mathrm{SRL}(e, e') := 1. \tag{2.3}
\]

With the above definition, SRL may take values in the interval [0,1]. Low values of SRL are realized for similar sets of inlinks and high values for dissimilar inlink sets. The lower bound is obtained only if the sets L_in(e) and L_in(e′) are identical.

Apart from the edge cases in Eq. 2.3, the upper bound could only be obtained if all links in Wikipedia were targeted towards one entity, which is extremely unlikely. Thus, while Milne and Witten [2008a] somewhat counter-intuitively termed SRL a semantic relatedness measure, we argue that this behaviour is that of a dissimilarity measure. Since in this thesis we will use SRL as a similarity measure, we define for all implementations of SRL

\[
\mathrm{SRL}^* := 1 - \mathrm{SRL} \tag{2.4}
\]

where SRL is based on the definition in Eq. 2.2 and the adaptation in Eq. 2.3. Then, SRL^*(e_i, e_j) = 0 implies unrelated entities while SRL^*(e_i, e_j) = 1 states that two entities have identical inlink sets. Still, the upper bound is not likely to be obtained due to the magnitude of |L| in Eq. 2.2. Ratinov et al. [2011] also evaluated semantic relatedness over outlink sets L_out(e) and L_out(e′). Analogously to SRL,


the operating figure is given by:

\[
\mathrm{SRL}_{out}(e, e') = \frac{\log\left(\max(|L_{out}(e)|, |L_{out}(e')|)\right) - \log\left(|L_{out}(e) \cap L_{out}(e')|\right)}{\log(|L|) - \log\left(\min(|L_{out}(e)|, |L_{out}(e')|)\right)} \in [0,1]. \tag{2.5}
\]

Again, to ensure a range of [0,1] we need to account for edge cases and analogously to Eq. 2.3 define

\[
L_{out}(e) = \emptyset \;\lor\; L_{out}(e') = \emptyset \;\lor\; L_{out}(e) \cap L_{out}(e') = \emptyset \;\rightarrow\; \mathrm{SRL}_{out}(e, e') := 1 \tag{2.6}
\]

to arrive at SRL_out ∈ [0,1]. If not otherwise stated, we use SRL^* to refer to the measure computed over inlinks. The semantic relatedness measure over outlinks is denoted with SRL_out.
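The measures defined in Eqs. 2.2 to 2.6 can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names `srl` and `srl_star` are hypothetical, each link set is represented by the set of entities at the other end of the links (link sources for inlinks, link targets for outlinks) so that the intersection is meaningful, and the total number of links |L| is passed in as a plain integer. No clipping is applied, so the functions return whatever the formula yields for the given counts.

```python
import math


def srl(links_a: set, links_b: set, total_links: int) -> float:
    """Dissimilarity of two link sets following Eq. 2.2 with the
    edge cases of Eq. 2.3 (empty sets or empty intersection -> 1)."""
    shared = links_a & links_b
    if not links_a or not links_b or not shared:
        return 1.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_links) - math.log(smaller)
    if denominator <= 0:
        raise ValueError("total_links must exceed the size of the smaller set")
    return (math.log(larger) - math.log(len(shared))) / denominator


def srl_star(links_a: set, links_b: set, total_links: int) -> float:
    """Similarity variant of Eq. 2.4: SRL* = 1 - SRL."""
    return 1.0 - srl(links_a, links_b, total_links)


# Toy inlink sets: names of articles linking to each entity (made up).
inlinks_e1 = {"A", "B", "C", "D"}
inlinks_e2 = {"B", "C", "D", "E"}
inlinks_e3 = {"X", "Y"}

print(srl_star(inlinks_e1, inlinks_e1, total_links=1000))  # identical sets
print(srl_star(inlinks_e1, inlinks_e2, total_links=1000))  # large overlap
print(srl_star(inlinks_e1, inlinks_e3, total_links=1000))  # no overlap
```

The same functions cover SRL_out when they are called with outlink sets instead of inlink sets, since Eq. 2.5 and Eq. 2.6 differ from Eq. 2.2 and Eq. 2.3 only in the sets they operate on.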

Semantic relatedness computed over shared inlinks is basically a measure for the co-occurrence of Wikipedia entities. From this co-occurrence we may derive coherence among entities and for instance conclude that a document jointly mentioning Michael Jordan and NBA is more likely to refer to the basketball player and the basketball association than to the machine learning professor and the boxing association.

2.4.2 Priors Derived from Links

Wikipedia’s hyperlink graph also allows the derivation of priors. Ratinov et al. [2011] formulate entity-mention probability (EMP) as the prior probability that a link anchor text m refers to an entity e by analysing all pairs of link target and link anchor text in Wikipedia. Then, entity-mention probability is the ratio of the number of times an entity e is the target l_t for a link anchor text l_a = m to the overall number of targets referenced by m:

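\[
p(e \mid m) = p(l_t = e \mid l_a = m) \approx \frac{|\{l \in L \mid l = (l_s = \cdot,\, l_t = e,\, l_a = m)\}|}{\sum_{e' \in W} |\{l \in L \mid l = (l_s = \cdot,\, l_t = e',\, l_a = m)\}|}. \tag{2.7}
\]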

The numerator in Eq. 2.7 is the absolute frequency of e being the target l_t = e of l with anchor text l_a = m and the denominator is the sum over all possible entities e′ ∈ W that have been referenced by m through a link anchor text l_a. While the above formulation theoretically results in a true probability with range [0,1], we point out that in practice we may observe that ∑_m p(e|m) ≠ 1. Due to parsing errors, erroneous links or other pitfalls, we need to interpret EMP as an approximation.

The following example illustrates EMP with values computed using the link index proposed in Pilz and Paaß [2012] (this link index will be described thoroughly in Section 4.3.2).


Example 4 (Entity-Mention Probability (EMP))

Based upon Eq. 2.7, we obtain entity-mention probabilities such as:

p(e = Washington, D.C. | m = Washington) ≈ 0.9
p(e = Washington, D.C. | m = D.C.) ≈ 0.2
p(e = Washington (state) | m = Washington) ≈ 0.1
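Computationally, Eq. 2.7 amounts to counting pairs of anchor text and target. The following sketch illustrates this under simplifying assumptions: the function name `entity_mention_probabilities` is hypothetical, links are plain (source, target, anchor) tuples, and the ten toy links with made-up source names are chosen only so that the result mirrors the first and third value above. They are not real Wikipedia counts and do not reproduce the link index of Pilz and Paaß [2012].

```python
from collections import Counter, defaultdict


def entity_mention_probabilities(links):
    """Estimate p(e | m) from (source, target, anchor) triples.

    For each anchor text m, the count of links with target e is divided
    by the count of all links that carry the anchor text m.
    """
    targets_per_anchor = defaultdict(Counter)
    for _source, target, anchor in links:
        targets_per_anchor[anchor][target] += 1
    return {
        anchor: {
            target: count / sum(targets.values())
            for target, count in targets.items()
        }
        for anchor, targets in targets_per_anchor.items()
    }


# Toy data: nine links use the anchor "Washington" for the city,
# one link uses it for the state. Source names are placeholders.
toy_links = [
    (f"Article {i}", "Washington, D.C.", "Washington") for i in range(9)
] + [("Article 9", "Washington (state)", "Washington")]

emp = entity_mention_probabilities(toy_links)
for target, probability in emp["Washington"].items():
    print(target, probability)
```

Because the estimate is normalized per anchor text, the values for one mention m sum to one over its candidate entities in this idealized setting.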

Note that EMP is a value that is not stored in the article text itself and can only be extracted from a knowledge base similar to Wikipedia that provides this information through its internal link structure. It has proven to be a useful feature for disambiguation in many approaches (e.g. Milne and Witten [2008b], Fader et al. [2009], Ratinov et al. [2011], Hoffart et al. [2011b], Pilz and Paaß [2012]). However, note that this formulation of EMP introduces a bias towards Wikipedia entries: any mention of an uncovered entity that has a surface form matching a prominent Wikipedia entity is likely to be assigned to this entity when no additional information is used.

Another figure derivable from the link graph is the popularity prior of an entity in Wikipedia. The popularity prior of an entity e is the ratio of the number of links pointing to e to the total number of links in Wikipedia:

\[
p(e) \approx \frac{|L_{in}(e)|}{|L|}. \tag{2.8}
\]

This measure stands in analogy to the in-degree of a node in a graph but is normalized through the number of all links in Wikipedia. While defined over Wikipedia links, it may also serve as a prior for the popularity of an entity in other contexts, assuming that entities often interlinked in Wikipedia are also frequently mentioned, for instance, in news articles.
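As a brief illustration of Eq. 2.8, the sketch below derives both the raw inlink count |L_in(e)| and the normalized prior from a link collection. The function name `popularity_priors` is a hypothetical choice, links are assumed to be (source, target, anchor) tuples, and the sample consists of the four links from Fig. 2.4, so the resulting values only demonstrate the computation.

```python
from collections import Counter


def popularity_priors(links):
    """Return (inlink counts, priors) with prior p(e) = |L_in(e)| / |L|."""
    inlink_counts = Counter(target for _source, target, _anchor in links)
    total_links = len(links)
    priors = {
        entity: count / total_links
        for entity, count in inlink_counts.items()
    }
    return inlink_counts, priors


links = [
    ("Duran Duran", "John Taylor (bass guitarist)", "John Taylor"),
    ("Duran Duran", "Nick Rhodes", "Nick Rhodes"),
    ("Duran Duran", "Birmingham", "Birmingham"),
    ("Duran Duran", "Rum Runner (nightclub)", "Rum Runner"),
]
inlink_counts, priors = popularity_priors(links)
print(inlink_counts["Birmingham"], priors["Birmingham"])
```

Returning the unnormalized counts alongside the priors keeps both variants available, since the two differ only by the constant factor |L|.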

The popularity prior has been successfully used as a baseline attribute for entity linking (Ratinov et al. [2011]). However, especially in the English version of Wikipedia, the overall number of links is very large at 54 million¹. Therefore the popularity prior for most entities is very small or close to zero and only a handful of entities have priors greater than a few per mille. For instance, the highest popularity prior we observed in the context of this thesis was 0.006 for the entity United States. In Pilz and Paaß [2012] we therefore proposed to use the more effective absolute value of |L_in(e)| without normalization factor. In Chapter 4, we will give more details and show how we use this prior for an adaptive filtering of mention candidates.

¹ Version from September 1st, 2011.
