
4.6 Candidate Retrieval

4.6.1 Collective Search

Collective search is motivated by the assumption that Wikipedia entities referencing many of the given mentions are likely to have a subject similar to that of the input document. Then, from the outlink target entities these entities provide, we can automatically generate intermediate but reliable candidate entity sets that will in many cases contain the correct underlying entities for the given mentions. To find the best source entities of these outlink targets, we create an ensemble query over all the given mentions. This ensemble query is then matched against the link information encoded in our index $I_W$ and thus also implicitly against the full hyperlink graph of Wikipedia.

Ensemble Query Generation

To exploit the co-occurrence of mentions as link anchor texts in Wikipedia, we create an ensemble query $q_M$ that jointly treats the names $name(m_i)$ of all mentions $m_i \in M = \{m_1, \ldots, m_k\}$. This query then contains one link anchor text query term $q_{la}(m_i)$ per mention $m_i \in M$ and, according to Eq. 4.5, is formed as a conjunction over these terms:

$$
q_M = q_{la}(name(m_1)) \wedge \ldots \wedge q_{la}(name(m_k)) \tag{4.8}
$$

Importantly, we use no mandatory terms to state that a specific mention must appear. First, this would require prior knowledge on the importance of mentions. Second, if a mandatory mention was never observed as a link anchor text in Wikipedia, a search in $I_W$ using this query would always return zero results. Also, we do not use weights on specific terms and thus treat all mentions equally. For future work, it would be worth investigating the existence of seed entities, i.e. entities that should be weighted higher in such a query because they are more influential for the document-level entity distribution.

Note that such an ensemble query can also be used to approximate the probability of joint occurrence of the mentions $M$: few hits indicate a low joint probability, many hits indicate a high joint probability.
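As an illustration, the following minimal sketch shows how such an ensemble query might be assembled as a Lucene query string. The field name linkText is taken from the index description below; the helper itself and the use of optional (SHOULD) clauses, which realizes the "no mandatory terms" requirement under Lucene's default OR semantics, are assumptions for illustration, not the thesis' exact implementation.

```python
def build_ensemble_query(mention_names, field="linkText"):
    """Assemble the ensemble query q_M of Eq. 4.8: one phrase clause per
    mention name, all clauses optional (no mandatory terms, no weights)."""
    clauses = []
    for name in mention_names:
        # escape embedded quotes and wrap each name as a phrase query
        clauses.append('{0}:"{1}"'.format(field, name.replace('"', '\\"')))
    # whitespace-separated clauses are optional (SHOULD) under Lucene's
    # default OR operator, so entities matching more mentions rank higher
    return " ".join(clauses)

print(build_ensemble_query(["Apollo 11", "Columbia", "Challenger"]))
# -> linkText:"Apollo 11" linkText:"Columbia" linkText:"Challenger"
```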

Candidates Retrieved from Ensemble Queries

Now, to retrieve the aforementioned source entities, we search $I_W$ using the ensemble query $q_M$ and obtain a ranked list of source entities $S_{q_M} \subseteq W$ that collectively contain a high number of the input mentions $m_i$ as values in their link text fields.

To avoid noise, we restrict the number of returned hits and use at most 30 source entities. As described in Section 4.3.1, Lucene ranks each source entity $e_{q_M} \in S_{q_M}$ with a score $s_{I_W}$ that is based on the number of matches of the mentions $m_i$ on the link text fields $(linkText, m_i)$ of $e_{q_M}$. According to Eq. 4.6, this score relates to the TF-IDF of a mention $m_i$ for a source entity $e_{q_M}$, where the underlying term statistics are computed over the index fields $la$. Consequently, the more mentions an entity $e_{q_M}$ contains as link anchor texts $la$, the higher the ranking score $s_{I_W}(q_M, e_{q_M})$.

[. . . ] Shepard, Glenn set first two milestones
May 5, 1961: Alan Shepard becomes first American in space.
Feb. 20, 1962: John Glenn becomes first American in orbit.
Jan. 27, 1967: Gus Grissom, Edward White II and Roger Chaffee die in Apollo 1 spacecraft fire on launch pad.
July 20, 1969: Apollo 11's Neil Armstrong and Buzz Aldrin land on moon.
July 17, 1975: American Apollo and Soviet Soyuz spacecraft link in orbit.
April 12, 1981: Columbia soars on first space shuttle flight.
June 18, 1983: Sally Ride becomes first American woman in space.
Jan. 28, 1986: Challenger explodes, killing all seven on board.
April 25, 1990: Hubble Space Telescope is released into orbit.
Dec. 2, 1993: First Hubble repair mission is launched.
March 14, 1995: Norman Thagard is first American to be launched on a Russian rocket. Two days later, he becomes first American to visit Mir.
June 29, 1995: Atlantis docks with Mir in first shuttle-station hookup.
Sept. 26, 1996: Shannon Lucid returns to Earth after 188-day Mir mission, a U.S. space endurance record and a world record for women.
Nov. 19, 1996: Story Musgrave, at age 61, becomes oldest man in space.
Oct. 29, 1998: Discovery is scheduled to blast off, carrying 77-year-old John Glenn back into orbit and making him oldest man in space. [. . . ]

Figure 4.1: Excerpt from a document in the AQUAINT corpus. The mentions to be linked to Wikipedia are highlighted; here, only the first mention of an entity in the document is annotated.

Now, since each $e_{q_M}$ also provides the link targets $lt \in W$ for the link anchor texts $la$ in the fields $linkTo$, we can extract all outlink targets $lt \in W$ from all source entities in $S_{q_M}$, i.e. all $e = lt \in L_{out}(S_{q_M})$. We endow each of these outlink targets $e = lt \in L_{out}(S_{q_M})$ with a relevance weight $w_r(e)$. This weight is the sum over the scores $s_{I_W}(q_M, e_{q_M})$ for the different source entities $e_{q_M} \in S_{q_M}$ that contain $e$ as an outlink target $e \in L_{out}(e_{q_M})$:

$$
w_r(e) = \sum_{e_{q_M} \in S_{q_M}} \delta_e \, s_{I_W}(q_M, e_{q_M}), \qquad
\delta_e = \begin{cases} 1 & \text{iff } e \in L_{out}(e_{q_M}), \\ 0 & \text{else.} \end{cases} \tag{4.9}
$$

These weights are interpreted as the relevance of a candidate, and we use them to remove less relevant candidates from the overall set $L_{out}(S_{q_M})$, which may easily contain more than a thousand different entities appearing only once as outlink target. Therefore, we keep only a reduced set of the top 100 candidate entities in $L_{out}(S_{q_M})$ that have the highest relevance weights $w_r(e)$.
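A minimal sketch of this weighting and pruning step, assuming the ranked hits are available as (score, outlink target set) pairs; the function name and data layout are illustrative, not the thesis' actual code.

```python
from collections import defaultdict

def weight_outlink_targets(ranked_sources, top_k=100):
    """Relevance weights w_r(e) of Eq. 4.9: each outlink target accumulates
    the Lucene scores s_IW of all source entities that link to it.
    `ranked_sources` holds (score, outlink_targets) pairs for the at most
    30 source entities in S_qM."""
    w_r = defaultdict(float)
    for score, outlinks in ranked_sources:
        for e in set(outlinks):  # delta_e = 1 iff e in L_out(e_qM)
            w_r[e] += score
    # keep only the reduced set of the top_k highest-weighted candidates
    ranked = sorted(w_r.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])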


Figure 4.2: Illustration of collective search results (Example 15). The figure shows a reduced link network for the top ranked source entities $S_{q_M}$ (middle) retrieved with an ensemble query for the mentions depicted in Fig. 4.1. The entities NASA, Space Shuttle and International Space Station have highest rank as they contain many link anchor texts matching the given mentions, e.g. Challenger, Columbia and Atlantis. For simplicity we depict only outlinks and each link only once. Indeed, each link may appear multiple times in the article of a source entity, e.g. we find five outlinks from NASA to Space Shuttle.

To illustrate the described process of collective search, we give the following example using a document from the AQUAINT corpus as depicted in Fig. 4.1. For convenience, we restrict ourselves to a subset of the mentions contained in the document and show only a selection of source entities and outlink targets in Fig. 4.2.

Example 15 (Collective Search)

Take a list of mentions contained in the document as depicted in Fig. 4.1:

$M = \{\text{Apollo 11}, \text{spacecraft}, \text{Columbia}, \text{Challenger}, \text{Atlantis}, \text{Discovery}, \ldots\}$.

Following Eq. 4.8, the ensemble query created for $M$ is

$$
q_M = q_{la}(\text{"Apollo 11"}) \wedge q_{la}(\text{"spacecraft"}) \wedge q_{la}(\text{"Columbia"}) \wedge q_{la}(\text{"Challenger"}) \wedge q_{la}(\text{"Atlantis"}) \wedge q_{la}(\text{"Discovery"}) \wedge \ldots
$$

A search in $I_W$ using this query $q_M$ returns 30 ranked source entities $S_{q_M}$. For illustration, we show here only the highlights:

1. NASA, $s_{I_W} = 143.76$
2. Space Shuttle, $s_{I_W} = 120.98$
3. International Space Station, $s_{I_W} = 119.56$
4. Apollo program, $s_{I_W} = 89.13$
   ...
8. Moon, $s_{I_W} = 56.51$
9. Astronaut, $s_{I_W} = 53.78$
   ...
20. Space Shuttle Atlantis, $s_{I_W} = 34.29$
21. Space Shuttle Columbia disaster, $s_{I_W} = 32.23$
   ...

Note that the scores $s_{I_W}(q_M, e_{q_M})$ are given here only for illustration and cannot be interpreted without the context of this example. As we see from this ranked list, the retrieved source entities are thematically very related to the content of the document (cf. Fig. 4.1). This is also illustrated through the dense linkage in Fig. 4.2: the entities NASA, International Space Station and Space Shuttle are ranked highest as they contain most of the given mentions, e.g. Challenger, Columbia and Atlantis, as link anchor texts, albeit with different link targets. The entity Apollo program is already more specific and, containing fewer mentions as link anchor text, receives a notably lower score $s_{I_W}$.

Since these entities contain the required link anchor texts, they consequently also contain many outlink targets that indeed correspond to the ground truth entities of the given mentions. As we see in Fig. 4.2, the set of outlink targets $L_{out}(S_{q_M})$ of the source entities NASA, Space Shuttle etc. contains

$L_{out}(S_{q_M}) = \{\text{Space Shuttle Columbia}, \ldots, \text{Space Shuttle Challenger}, \ldots, \text{Space Shuttle Atlantis}, \ldots, \text{Space Shuttle Discovery}, \ldots\}.$

Following Eq. 4.9, these outlink targets are weighted with relevance weights $w_r(e)$ based on the score of their respective source entities in $S_{q_M}$. The higher this weight, the more often the entity $e$ is an outlink target of a source entity $e_{q_M} \in S_{q_M}$. The highest weight would be obtained by an entity that appears in the outlink target sets of all source entities, where these sources at the same time contain many mentions as link anchor texts.

At this point, the retrieved candidates form a set of potential targets that is not yet related to the input mentions. To link the elements in the target entity set $L_{out}(S_{q_M})$ with the input mentions, we use their respective title and redirect index fields. More specifically, we analyse for each $e \in L_{out}(S_{q_M})$ whether either the title or a redirect of $e$ contains any $name(m_i)$. If so, we add $e$ to the candidate set $ec_i(m_i)$ for mention $m_i$. Note that one $e$ can then be contained in multiple candidate sets.

When this candidate-mention association yields no result, no collective search candidate can be assigned.

The result of the collective search and the above candidate assignment is the collection $ec_1(m_1), \ldots, ec_k(m_k)$, where each set $ec_i(m_i)$ is a set of candidate entities for mention $m_i$.
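A sketch of this candidate-mention association, assuming the title and redirect fields have been loaded into plain dictionaries; titles[e] and redirects[e] are hypothetical data structures standing in for the index fields.

```python
def assign_candidates(mention_names, candidates, titles, redirects):
    """Build the candidate sets ec_i(m_i): an entity e is added for every
    mention whose name is contained in e's title or one of its redirects."""
    candidate_sets = {m: set() for m in mention_names}
    for e in candidates:
        surface_forms = {titles[e]} | redirects.get(e, set())
        for m in mention_names:
            # substring containment, as described above; one entity may
            # end up in the candidate sets of several mentions
            if any(m.lower() in s.lower() for s in surface_forms):
                candidate_sets[m].add(e)
    return candidate_sets
```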

Cross Coherence among Candidate Entities

Our intuition is that entities mentioned jointly in a document should be related.

Following other approaches, we use the SRL* measure over inlinks (cf. Eq. 2.4) to compute the relatedness among Wikipedia entities. Now, SRL* is usually used to compute the pairwise relatedness of two entities. Here, we introduce cross coherence to account for the collective fitness of a set of entities.

Cross coherence states how well a specific candidate entity $e_{ij} \in ec_i(m_i)$ fits to the other candidate entities $\{ec_l\}_{l=1, l \neq i}^{|M|}$. More formally, we define the cross coherence $coh_\times$ of a candidate $e_{ij} \in ec_i$ towards a collection of other candidates $\{ec_l\}_{l=1, l \neq i}^{|M|}$ as:

$$
coh_\times\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \Delta \cdot \text{SRL*}(e_{ij}, e') \tag{4.10}
$$

Here, $|M|$ is the total number of mentions in the document, $i$ the index over these mentions and $j$ the index over the candidates for a mention $m_i$. The second sum in Eq. 4.10 computes the averaged pairwise SRL* of candidate $e_{ij}$ for mention $m_i$ and the candidates in another candidate set $ec_l$ for another mention $m_l$. This is weighted by the factor $\Delta$, a real-valued scalar that we use to account for contextual similarity and describe in more detail in the following. The weighted averaged relatedness is then again averaged over all candidate sets for all mentions by the first sum in Eq. 4.10.

With the definition above, cross coherence can be interpreted as the average relatedness of an entity to a collection of entities and has range $[0,1]$. This range is also preserved through the weighting factor $\Delta$ in Eq. 4.10. This factor serves as an additional weight of relatedness and may be the EMP of a candidate (Milne and Witten [2008b]) or a binary value indicating that the two candidates link to each other (Ratinov et al. [2011]). Since the first variant may erroneously prioritize high-popularity candidates and the second is somewhat restrictive, we here propose factors that constitute contextual similarity weights.
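The following sketch computes Eq. 4.10 for one candidate. Here srl and weight are caller-supplied functions: the default weight of 1.0 yields the unweighted baseline introduced below (Eq. 4.11), while the cosine and thematic weights of Eqs. 4.12-4.15 can be plugged in instead. This is an illustrative reading of the formula, not the original implementation.

```python
def cross_coherence(e_ij, i, candidate_sets, srl, weight=lambda a, b: 1.0):
    """Cross coherence coh_x of Eq. 4.10 for candidate e_ij of mention i.
    `candidate_sets` is the list [ec_1, ..., ec_|M|]; `srl(a, b)` returns
    SRL* in [0, 1]; `weight(a, b)` plays the role of the factor Delta."""
    M = len(candidate_sets)
    total = 0.0
    for l, ec_l in enumerate(candidate_sets):
        # skip the candidate's own mention and identical candidate sets
        if l == i or ec_l == candidate_sets[i] or not ec_l:
            continue
        # inner sum: averaged weighted relatedness towards the set ec_l
        inner = sum(weight(e_ij, e2) * srl(e_ij, e2)
                    for e2 in ec_l if e2 != e_ij)
        total += inner / len(ec_l)
    return total / M
```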

Cross Coherence Weight Factors

To evaluate the effect of the different weighting factors, we will compare against a baseline that uses no weight factor for semantic relatedness, obtained by omitting the term $\Delta$ in Eq. 4.10. For better distinction, we denote this baseline with $coh_{\text{SRL*}}$ and then have

$$
coh_{\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \text{SRL*}(e_{ij}, e') \tag{4.11}
$$

The first weight that we evaluate is based on the cosine similarity of candidate contexts $\cos(text(e), text(e'))$ (cf. Eq. 3.1). Analogously to Eq. 4.11, we replace the factor $\Delta$ in Eq. 4.10 and arrive at

$$
coh_{\cos\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \cos(text(e_{ij}), text(e')) \cdot \text{SRL*}(e_{ij}, e') \tag{4.12}
$$

We will also evaluate cross coherence using only cosine similarity. Then, cross coherence is purely context based and uses no semantic relatedness from links. This is achieved by omitting the SRL* term in Eq. 4.12, i.e.

$$
coh_{\cos}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \cos(text(e_{ij}), text(e')) \tag{4.13}
$$

Following the results obtained in the previous chapter, we also introduce a coherence weight based on topic distributions. In contrast to the previous chapter, topic distributions are not inferred on article texts but, to emphasise the co-occurrence of entities, on link anchor texts. More specifically, the documents used to train the topic model consist of the concatenation of all link anchor texts $la$ contained in the outlink collection $L_{out}(e)$ of an entity $e$. Based on this, we introduce as thematic weight the Hellinger distance of the topic probability distributions inferred over the concatenations of link anchor texts $la \in L_{out}(e)$ and $la \in L_{out}(e')$:

$$
H(\mathcal{T}^{la}_e, \mathcal{T}^{la}_{e'}) = \sum_{k=1}^{K} \sqrt{p_{L_{out}(e)}(\tau_k)\, p_{L_{out}(e')}(\tau_k)}, \tag{4.14}
$$

with $K$ the number of topics in the LDA model and $p_{L_{out}(e)}$ and $p_{L_{out}(e')}$ the topic probability distribution vectors for the concatenation of link texts $\{la\} \in L_{out}(e)$ resp. $\{la\} \in L_{out}(e')$ of the entities $e$ and $e'$. The formulation of the Hellinger distance in Eq. 4.14 is an alternative to that in Eq. 3.36, but it can be shown that they are equivalent. The thematic weight $H(\mathcal{T}^{la}_e, \mathcal{T}^{la}_{e'})$ then constitutes the thematic distance over two link text collections, and we use this weight as a replacement for the cosine similarity in Eq. 4.12, i.e.

$$
coh_{\tau\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} H(\mathcal{T}^{la}_{e_{ij}}, \mathcal{T}^{la}_{e'}) \cdot \text{SRL*}(e_{ij}, e') \tag{4.15}
$$

To train the required topic model, we randomly chose 90k entities that have at least 10 outlinks, extracted all their link anchor texts and then trained a topic model with 500 topics on the generated documents.

So far, we have defined the collective search procedure and the weighting of the retrieved candidates. Since the sets $ec_i$ still contain more than one candidate for each mention $m_i$, we now describe how we choose the best fitting candidate from this set.

Selection of Candidates from Collective Search

The selection of the best fitting candidate is the final result of the collective search procedure and determines prioritized candidate entities $e_{coh}(m_i)$. More specifically, from the collectively retrieved candidate entities $ec_i(m_i)$ we select one candidate entity $e_{coh}(m_i)$ for each mention $m_i$ based on the product of collective search relevance weight $w_r$ (Eq. 4.9) and cross coherence $coh_\times$ (Eq. 4.10):

$$
e_{coh}(m_i) := \arg\max_{e_i(m_i) \in ec_i(m_i)} \left( w_r(e_i(m_i)) \cdot coh_\times\!\left(e_i(m_i), \{ec_l(m_l)\}_{l=1, l \neq i}^{|M|}\right) \right) \tag{4.16}
$$

That is, among all candidates $ec_i(m_i)$ for a mention $m_i$, we choose the entity that has the maximum value for the product of collective search relevance weight $w_r(e_i)$ (Eq. 4.9) and cross coherence $coh_\times$ (Eq. 4.10). We use $coh_\times$ in a product with $w_r$ to reduce the dominating effect of $w_r$, as the latter is usually several orders of magnitude higher. Importantly, note that we can only assign such a candidate $e_{coh}(m_i)$ to a mention $m_i$ if the set $ec_i(m_i)$ is not empty. Otherwise, we have no such candidate.
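Combining the pieces above, a minimal sketch of the final selection of Eq. 4.16, reusing the cross_coherence helper and the weight convention from the earlier sketches; all names are illustrative.

```python
def select_candidates(candidate_sets, w_r, srl, weight=lambda a, b: 1.0):
    """Eq. 4.16: for each mention pick the candidate maximising
    w_r(e) * coh_x(e, other candidate sets); mentions with an empty
    candidate set receive no collective search candidate (None)."""
    sets = list(candidate_sets.values())
    selected = {}
    for i, (mention, ec_i) in enumerate(candidate_sets.items()):
        selected[mention] = max(
            ec_i,
            key=lambda e: w_r[e] * cross_coherence(e, i, sets, srl, weight),
            default=None,
        )
    return selected
```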

In an alternative formulation, we might incorporate the EMP of a candidate.

But then popular candidates would dominate in most cases, even if their coherence is low. For instance, Shen et al. [2012] propose a similar global coherence measure but, instead of computing the global coherence over all candidates as proposed here, the authors use a strong simplification and compute the global coherence only over those candidates that have the highest EMP, no matter how well they fit the context. Thus, less prominent entities are completely ignored. In contrast, we consider all candidates and investigate different weighting schemes as described above. When evaluating different cross coherence weights, we also determine the candidate $e_{coh}(m_i)$ using the specific weight for cross coherence computation. We then use either $coh_{\text{SRL*}}$ (Eq. 4.11), $coh_{\cos\text{SRL*}}$ (Eq. 4.12), $coh_{\tau\text{SRL*}}$ (Eq. 4.15) or the purely contextual form $coh_{\cos}$ (Eq. 4.13) that omits SRL*.
