
Interlinkage of Entities in Wikipedia

From the document Entity Linking to Wikipedia (pages 39-45)

Chapter 2 Entity Linking: Preliminaries

in the article text of another entity e, contributing authors are expected to link at least the first mention to the corresponding article of e′.

More specifically, a link l in Wikipedia is a triple of link source, link anchor text and link target.

Notation (Links in Wikipedia)

Let L denote the collection of all links in W. A link l ∈ L is a triple l = (ls, lt, la), where ls denotes the link source, lt the link target and la the link anchor text.

Link sources and link targets are entities in Wikipedia and thus we have ls ∈ W and lt ∈ W. Link anchor texts are part of an entity’s article text and formed of one or more strings from the Wikipedia vocabulary VW, i.e. la ∈ VW × ... × VW. A link l is placed in the article text of its link source ls using Wikipedia markup notation.

This notation couples link anchor text and link target, i.e. [[lt|la]]. Alternatively, it may merely hold the link target, i.e. [[lt]], if link anchor text and link target do not differ, i.e. la = title(e).
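The markup convention can be illustrated with a minimal Python sketch that turns article text into (ls, lt, la) triples. The regular expression, the function name `extract_links` and the sample text are assumptions made for illustration; real Wikipedia markup also contains nested templates, section links and namespace prefixes, which this sketch deliberately ignores.

```python
import re

# Matches [[target]] and [[target|anchor]]. Nested markup, section links
# and namespaces such as File: or Category: are not handled here.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(source, article_text):
    """Return (ls, lt, la) triples for every link found in article_text."""
    links = []
    for match in WIKILINK.finditer(article_text):
        target = match.group(1).strip()
        # Without an explicit anchor, the anchor text equals the target title.
        anchor = (match.group(2) or target).strip()
        links.append((source, target, anchor))
    return links


# Illustrative markup for a fragment of the Duran Duran article.
text = ("[[John Taylor (bass guitarist)|John Taylor]] and [[Nick Rhodes]] "
        "formed Duran Duran in [[Birmingham]].")
links = extract_links("Duran Duran", text)
```

Here the first triple carries an anchor text that differs from its target, while the other two fall back to the target title.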

The collection of all links L in Wikipedia encodes a directed hyperlink graph, where nodes are entities and edges are the links between them. Using the following notation, we distinguish between inlinks and outlinks.

Notation (Inlinks and Outlinks)

The inlinks Lin(e) of an entity e are the links where e is the link target lt: Lin(e) = {l ∈ L | (ls = ·, lt = e, la = ·)} ⊂ L.

The outlinks Lout(e) of an entity e are the links where e is the source ls: Lout(e) = {l ∈ L | (ls = e, lt = ·, la = ·)} ⊂ L.

For illustration, we give the following example using the links depicted in Fig. 2.4.

Example 3

As Fig. 2.4 shows, the article text of Duran Duran contains the link l1 as outlink, i.e. l1 ∈Lout(Duran Duran). This link is described by

link source: ls = Duran Duran
link anchor text: la = John Taylor
link target: lt = John Taylor (bass guitarist).

At the same time, the link l1 is an inlink of the link target and we have l1 ∈ Lin(John Taylor (bass guitarist)). Here, the link target is obfuscated by Wikipedia’s markup notation, i.e.

[[John Taylor (bass guitarist)|John Taylor]].


2.4 Interlinkage of Entities in Wikipedia

John Taylor and Nick Rhodes formed Duran Duran in Birmingham, England in 1978, where they became the resident band at the city’s Rum Runner nightclub.

text(Duran Duran)

l1 = (ls = Duran Duran, lt = John Taylor (bass guitarist), la = John Taylor)
l2 = (ls = Duran Duran, lt = Nick Rhodes, la = Nick Rhodes)
l3 = (ls = Duran Duran, lt = Birmingham, la = Birmingham)
l4 = (ls = Duran Duran, lt = Rum Runner (nightclub), la = Rum Runner)

l1, ..., l4 ∈ Lout(Duran Duran)

Figure 2.4: Excerpt from the article text of the entity Duran Duran (top) together with the contained links (bottom). Each link l1, ..., l4 is contained in the outlink set Lout(Duran Duran) and we have ls = Duran Duran for each l1, ..., l4.

The example above shows two properties of links. First, the link anchor text la can be considered as an alias for the target entity that is referenced through the link target lt. Second, a link constitutes a disambiguated mention m. The link l1 in this example provides a grounded mention of the entity John Taylor (bass guitarist), as the underlying entity for the ambiguous mention John Taylor is annotated in the link target.
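As a minimal sketch, the links of Fig. 2.4 and the sets Lin(e) and Lout(e) can be represented in Python as follows. The class `Link` and the helper `index_links` are hypothetical names introduced only for this illustration and are not part of the thesis.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    source: str  # ls, the entity whose article contains the link
    target: str  # lt, the entity the link points to
    anchor: str  # la, the anchor text shown to the reader


def index_links(links):
    """Group links by target (inlinks) and by source (outlinks)."""
    inlinks, outlinks = defaultdict(list), defaultdict(list)
    for link in links:
        inlinks[link.target].append(link)
        outlinks[link.source].append(link)
    return inlinks, outlinks


# The four links l1, ..., l4 from Fig. 2.4.
L = [
    Link("Duran Duran", "John Taylor (bass guitarist)", "John Taylor"),
    Link("Duran Duran", "Nick Rhodes", "Nick Rhodes"),
    Link("Duran Duran", "Birmingham", "Birmingham"),
    Link("Duran Duran", "Rum Runner (nightclub)", "Rum Runner"),
]
L_in, L_out = index_links(L)
l1 = L[0]
```

With this index, l1 is found both in `L_out["Duran Duran"]` and in `L_in["John Taylor (bass guitarist)"]`, mirroring Example 3.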

This property renders Wikipedia a source of disambiguated data. Each inlink l ∈ Lin(e) constitutes a textual mention of the entity e in the article text of the link source e′. Through the link target lt, the link anchor text la is annotated with the ground truth entity, i.e. e+(la) = lt = e. For each Wikipedia entity e with at least one inlink, we can extract the textual contents of all referencing articles in Lin(e). The derived dataset can then be used both for training and for evaluating a disambiguation model, as done first by Bunescu and Pasca [2006] and subsequently in many other approaches (Pilz et al. [2009], Pilz and Paaß [2009], Pilz [2010], Ratinov et al. [2011]). This procedure allows the learning of entity linking models in a supervised setting not only for English but also for all other language versions of Wikipedia. For instance, in this thesis we learn linking models for German and French, two languages previously neglected in the entity linking literature (Chapter 3).

Furthermore, we may use the same data base to learn a linking model that accounts for uncovered entities. One possible avenue for this is using the links that point to a target that does not yet exist in Wikipedia¹. However, such links are rare and usually appear in listings with little natural language context, such as disambiguation pages. They may also result from a faulty annotation, for instance when an author does not realize that the relevant article already exists and erroneously chooses a different link target.

¹ The Wikipedia software colours such links in red.


A presumably better alternative was proposed in Bunescu and Pasca [2006]. The authors simulate uncovered mentions by removing the article for a fixed fraction of entities. All mentions previously linked to these entities are then re-assigned to NIL. We follow this strategy when learning models with Wikipedia data and will detail the extraction of such datasets in Chapter 3.
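The simulation strategy can be sketched in Python as follows. This is only an illustration under stated assumptions: the tuple layout of a mention, the string label "NIL", the fixed random seed and the function name `simulate_uncovered` are all hypothetical, and the actual dataset extraction is the one detailed in Chapter 3.

```python
import random


def simulate_uncovered(mentions, entities, fraction, seed=0):
    """Relabel the mentions of a random fraction of entities as NIL.

    mentions: list of (anchor_text, context, target_entity) tuples.
    Returns the relabelled mentions and the set of held-out entities.
    """
    rng = random.Random(seed)
    n_removed = round(fraction * len(entities))
    # Sort first so that the sample does not depend on set ordering.
    removed = set(rng.sample(sorted(entities), n_removed))
    relabelled = [
        (anchor, context, "NIL" if target in removed else target)
        for anchor, context, target in mentions
    ]
    return relabelled, removed


# Toy mentions derived from the links in Fig. 2.4.
mentions = [
    ("John Taylor", "... formed Duran Duran ...", "John Taylor (bass guitarist)"),
    ("Nick Rhodes", "... formed Duran Duran ...", "Nick Rhodes"),
    ("Birmingham", "... in Birmingham, England ...", "Birmingham"),
    ("Rum Runner", "... the city's Rum Runner nightclub", "Rum Runner (nightclub)"),
]
entities = {target for _anchor, _context, target in mentions}
relabelled, removed = simulate_uncovered(mentions, entities, fraction=0.5)
```

A model trained on `relabelled` then sees mentions whose ground truth is NIL, although the surrounding contexts are ordinary article text.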

2.4.1 Link-Based Relatedness of Entities

A link in Wikipedia implies a semantic relation between the two Wikipedia entities it connects. Therefore, Wikipedia’s hyperlink graph allows the derivation of relatedness measures and is thus a key component in most research on entity linking.

The most commonly used measure was introduced by Milne and Witten [2008a], who presented a Wikipedia adaptation of the normalized Google distance (Cilibrasi and Vitanyi [2007]). Replacing Google search results with Wikipedia links, Milne and Witten [2008a] defined the semantic relatedness (SRL) of two entities e and e′ over their inlink sets Lin(e) and Lin(e′):

\[
\mathrm{SRL}(e, e') = \frac{\log\bigl(\max(|L_{\mathrm{in}}(e)|, |L_{\mathrm{in}}(e')|)\bigr) - \log\bigl(|L_{\mathrm{in}}(e) \cap L_{\mathrm{in}}(e')|\bigr)}{\log(|L|) - \log\bigl(\min(|L_{\mathrm{in}}(e)|, |L_{\mathrm{in}}(e')|)\bigr)} \in [0,1] \tag{2.2}
\]

Note that the range of [0,1] given above only holds if we take into account edge cases that are not explicitly covered in Milne and Witten [2008a]. These arise when an entity has no inlinks or the shared set of inlinks is empty. Thus, we define

\[
L_{\mathrm{in}}(e) = \emptyset \;\lor\; L_{\mathrm{in}}(e') = \emptyset \;\lor\; L_{\mathrm{in}}(e) \cap L_{\mathrm{in}}(e') = \emptyset \;\rightarrow\; \mathrm{SRL}(e, e') := 1. \tag{2.3}
\]

With the above definition, SRL may take values in the interval [0,1]. Low values of SRL are realized for similar sets of inlinks and high values for dissimilar inlink sets. The lower bound is obtained only if the sets Lin(e) and Lin(e′) are identical.

The upper bound could only be obtained if all links in Wikipedia were targeted towards one entity, an extremely unlikely edge case. Thus, while Milne and Witten [2008a] somewhat counter-intuitively termed SRL a semantic relatedness measure, we argue that this behaviour is that of a dissimilarity measure. Since in this thesis we will use SRL as a similarity measure, we define for all implementations of SRL

\[
\mathrm{SRL}^{*} := 1 - \mathrm{SRL} \tag{2.4}
\]

where SRL is based on the definition in Eq. 2.2 and the adaptation in Eq. 2.3. Then, SRL*(ei, ej) = 0 implies unrelated entities while SRL*(ei, ej) = 1 states that two entities have identical inlink sets. Still, the upper bound is not likely to be obtained due to the magnitude of |L| in Eq. 2.2. Ratinov et al. [2011] also evaluated semantic relatedness over outlink sets Lout(e) and Lout(e′). Analogously to SRL, the corresponding measure is given by:

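\[
\mathrm{SRL}_{\mathrm{out}}(e, e') = \frac{\log\bigl(\max(|L_{\mathrm{out}}(e)|, |L_{\mathrm{out}}(e')|)\bigr) - \log\bigl(|L_{\mathrm{out}}(e) \cap L_{\mathrm{out}}(e')|\bigr)}{\log(|L|) - \log\bigl(\min(|L_{\mathrm{out}}(e)|, |L_{\mathrm{out}}(e')|)\bigr)} \in [0,1]. \tag{2.5}
\]

Again, to ensure a range of [0,1] we need to account for edge cases and, analogously to Eq. 2.3, define

\[
L_{\mathrm{out}}(e) = \emptyset \;\lor\; L_{\mathrm{out}}(e') = \emptyset \;\lor\; L_{\mathrm{out}}(e) \cap L_{\mathrm{out}}(e') = \emptyset \;\rightarrow\; \mathrm{SRL}_{\mathrm{out}}(e, e') := 1 \tag{2.6}
\]

to arrive at SRLout ∈ [0,1]. If not otherwise stated, we use SRL* to refer to the measure computed over inlinks. The semantic relatedness measure over outlinks is denoted with SRLout.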
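The following Python sketch shows one way Eqs. 2.2 to 2.6 could be evaluated. It rests on an assumption that is not spelled out above: each link set is represented by the identifiers of the articles at the other end of the links (the linking articles for inlinks, the linked articles for outlinks), so that the intersection of two sets is meaningful. The function names and the guard against a zero denominator are likewise additions for illustration.

```python
import math


def srl(links_e, links_f, total_links):
    """Dissimilarity of Eq. 2.2 / 2.5 with the edge cases of Eq. 2.3 / 2.6.

    links_e, links_f: sets of article identifiers (see assumption above).
    total_links: |L|, the overall number of links.
    """
    shared = links_e & links_f
    if not links_e or not links_f or not shared:
        return 1.0  # edge cases: an empty set or no overlap
    larger = max(len(links_e), len(links_f))
    smaller = min(len(links_e), len(links_f))
    if total_links <= smaller:
        raise ValueError("total_links must exceed the size of both link sets")
    numerator = math.log(larger) - math.log(len(shared))
    denominator = math.log(total_links) - math.log(smaller)
    return numerator / denominator


def srl_star(links_e, links_f, total_links):
    """Similarity of Eq. 2.4: SRL* = 1 - SRL."""
    return 1.0 - srl(links_e, links_f, total_links)


# Toy sets of linking articles, with |L| = 1000.
a = {"A", "B", "C", "D"}
b = {"C", "D", "E", "F"}
value = srl(a, b, 1000)
```

For the toy sets, the numerator is log 4 − log 2 and the denominator is log 1000 − log 4, so `value` equals log 2 / log 250; identical sets give SRL = 0 and hence SRL* = 1.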

Semantic relatedness computed over shared inlinks is basically a measure for the co-occurrence of Wikipedia entities. From this co-occurrence we may derive coherence among entities and for instance conclude that a document jointly mentioning Michael Jordan and NBA is more likely to refer to the basketball player and the basketball association instead of the machine learning professor and the boxing association.

2.4.2 Priors derived from Links

Wikipedia’s hyperlink graph also allows the derivation of priors. Ratinov et al. [2011] formulate entity-mention probability (EMP) as the prior probability that a link anchor text m refers to an entity e by analysing all pairs of link target and link anchor text in Wikipedia. Then, entity-mention probability is the ratio of times an entity e is the target lt for a link anchor text la = m to the overall number of targets referenced by m:

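\[
p(e \mid m) = p(l_t = e \mid l_a = m) \approx \frac{|\{l \in L \mid l = (l_s = \cdot,\, l_t = e,\, l_a = m)\}|}{\sum_{e' \in W} |\{l \in L \mid l = (l_s = \cdot,\, l_t = e',\, l_a = m)\}|}. \tag{2.7}
\]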

The numerator in Eq. 2.7 is the absolute frequency of e being the target lt = e of l with anchor text la = m and the denominator is the sum over all possible entities e′ ∈ W that have been referenced by m through a link anchor text la. While the above formulation results theoretically in a true probability with range [0,1], we point out that in practice we may observe that ∑_m p(e|m) ≠ 1. Due to parsing errors, erroneous links or other pitfalls, we need to interpret EMP as an approximation.

The following example illustrates EMP with values computed using the link index proposed in Pilz and Paaß [2012] (this link index will be thoroughly described in Section 4.3.2).


Example 4 (Entity-Mention Probability (EMP))

Based upon Eq. 2.7, we obtain entity-mention probabilities such as:

p(e = Washington, D.C. | m = Washington) ≈ 0.9
p(e = Washington, D.C. | m = D.C.) ≈ 0.2
p(e = Washington (state) | m = Washington) ≈ 0.1
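As a rough sketch, Eq. 2.7 amounts to counting targets per anchor text and normalizing. The Python code below assumes links are given as (ls, lt, la) tuples; the function name and the toy link counts are made up for illustration and are not the counts behind the values in Example 4.

```python
from collections import Counter, defaultdict


def entity_mention_probabilities(links):
    """Estimate p(e | m) from (source, target, anchor) triples (Eq. 2.7)."""
    counts = defaultdict(Counter)  # anchor text m -> Counter over targets e
    for _source, target, anchor in links:
        counts[anchor][target] += 1
    return {
        anchor: {e: n / sum(targets.values()) for e, n in targets.items()}
        for anchor, targets in counts.items()
    }


# Toy data: three links with anchor "Washington" point to the city,
# one points to the state. The source articles are placeholders.
toy_links = [
    ("Article 1", "Washington, D.C.", "Washington"),
    ("Article 2", "Washington, D.C.", "Washington"),
    ("Article 3", "Washington, D.C.", "Washington"),
    ("Article 4", "Washington (state)", "Washington"),
]
emp = entity_mention_probabilities(toy_links)
```

With these toy counts, the anchor text Washington yields 0.75 for Washington, D.C. and 0.25 for Washington (state).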

Note that EMP is a value that is not stored in the article text itself and can only be extracted from a knowledge base similar to Wikipedia that provides this information through its internal link structure. It has proven to be a useful feature for disambiguation in many approaches (e.g. Milne and Witten [2008b], Fader et al. [2009], Ratinov et al. [2011], Hoffart et al. [2011b], Pilz and Paaß [2012]). However, note that this formulation of EMP introduces a bias towards Wikipedia entries: any mention of an uncovered entity that has a surface form matching a prominent Wikipedia entity is likely to be assigned to this entity when no additional information is used.

Another figure derivable from the link graph is the popularity prior of an entity in Wikipedia. The popularity prior of an entity e is the ratio of the number of articles linking to e to the total number of links in Wikipedia:

\[
p(e) \approx \frac{|L_{\mathrm{in}}(e)|}{|L|}. \tag{2.8}
\]

This measure stands in analogy to the in-degree of a node in a graph but is normalized by the number of all links in Wikipedia. While defined over Wikipedia links, it may also serve as a prior for the popularity of an entity in other contexts, assuming that entities often interlinked in Wikipedia are also frequently mentioned, for instance, in news articles.
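A minimal Python sketch of Eq. 2.8 is given below. It returns the raw inlink count |Lin(e)| next to the normalized prior; the function name and the toy links are hypothetical and serve only to illustrate the computation.

```python
from collections import Counter


def popularity(links):
    """Return {e: (|Lin(e)|, |Lin(e)| / |L|)} for all link targets e."""
    inlink_counts = Counter(target for _source, target, _anchor in links)
    total = len(links)  # |L|
    return {e: (n, n / total) for e, n in inlink_counts.items()}


# Toy link collection with |L| = 4; source articles are placeholders.
toy_links = [
    ("Article 1", "United States", "United States"),
    ("Article 2", "United States", "USA"),
    ("Article 3", "Birmingham", "Birmingham"),
    ("Article 4", "Nick Rhodes", "Nick Rhodes"),
]
prior = popularity(toy_links)
```

In this toy collection, United States has two inlinks and therefore a prior of 0.5. Since every prior is divided by the same |L|, ranking entities by the raw count gives the same order as ranking by the prior.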

The popularity prior has been successfully used as a baseline attribute for entity linking (Ratinov et al. [2011]). However, especially in the English version of Wikipedia, the overall number of links is very large, at 54 million². Therefore the popularity prior for most entities is very small or close to zero and only a handful of entities have priors greater than a few per mille. For instance, the highest popularity prior we observed in the context of this thesis was 0.006 for the entity United States. In Pilz and Paaß [2012] we therefore proposed to use the more effective absolute value of |Lin(e)| without normalization factor. In Chapter 4, we will give more details and show how we use this prior for an adaptive filtering of mention candidates.

² Version from September 1st, 2011.

