
4.6 Candidate Retrieval

4.6.1 Collective Search

Collective search is motivated by the assumption that Wikipedia entities referencing many of the given mentions are likely to have a subject similar to that of the input document. Then, from the outlink target entities these entities provide, we can automatically generate intermediate but reliable candidate entity sets that will in many cases contain the correct underlying entities for the given mentions. To find the best source entities of these outlink targets, we create an ensemble query over all the given mentions. This ensemble query is then matched against the link information encoded in our index $I_W$ and thus also implicitly against the full hyperlink graph of Wikipedia.

Ensemble Query Generation

To exploit the co-occurrence of mentions as link anchor texts in Wikipedia, we create an ensemble query $q_M$ that jointly treats the names $name(m_i)$ of all mentions $m_i \in M = \{m_1, \ldots, m_k\}$. This query then contains one link anchor text query term $q_{la}(m_i)$ per mention $m_i \in M$ and, according to Eq. 4.5, is formed as a conjunction over these terms:

$$
q_M = q_{la}(name(m_1)) \wedge \ldots \wedge q_{la}(name(m_k)) \tag{4.8}
$$

Importantly, we use no mandatory terms to state that a specific mention must appear. First, this would require prior knowledge on the importance of mentions. Second, if a mandatory mention was never observed as a link anchor text in Wikipedia, a search in $I_W$ using this query would always return zero results. Also, we do not use weights on specific terms and thus treat all mentions equally. For future work, it would be worth investigating the existence of seed entities, i.e. entities that should be weighted higher in such a query because they are more influential for the document-level entity distribution.

Note that such an ensemble query can also be used to approximate the probability of joint occurrence of the mentions $M$: few hits indicate a low joint probability, many hits indicate a high joint probability.
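As an illustration, the following minimal sketch shows how such an ensemble query might be assembled as a Lucene query string. The field name linkText is taken from the index description below; the helper itself and the use of optional (SHOULD) clauses, which realizes the "no mandatory terms" requirement under Lucene's default OR semantics, are assumptions for illustration, not the thesis' exact implementation.

```python
def build_ensemble_query(mention_names, field="linkText"):
    """Assemble the ensemble query q_M of Eq. 4.8: one phrase clause per
    mention name, all clauses optional (no mandatory terms, no weights)."""
    clauses = []
    for name in mention_names:
        # escape embedded quotes and wrap each name as a phrase query
        clauses.append('{0}:"{1}"'.format(field, name.replace('"', '\\"')))
    # whitespace-separated clauses are optional (SHOULD) under Lucene's
    # default OR operator, so entities matching more mentions rank higher
    return " ".join(clauses)

print(build_ensemble_query(["Apollo 11", "Columbia", "Challenger"]))
# -> linkText:"Apollo 11" linkText:"Columbia" linkText:"Challenger"
```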

Candidates Retrieved from Ensemble Queries

Now, to retrieve the aforementioned source entities, we search $I_W$ using the ensemble query $q_M$ and obtain a ranked list of source entities $S_{q_M} \subseteq W$ that collectively contain a high number of the input mentions $m_i$ as values in their link text fields.

To avoid noise, we restrict the number of returned hits and use at most 30 source entities. As described in Section 4.3.1, Lucene ranks each source entity $e_{q_M} \in S_{q_M}$ with a score $s_{I_W}$ that is based on the number of matches of the mentions $m_i$ on the link text fields $(linkText, m_i)$ of $e_{q_M}$. According to Eq. 4.6, this score relates to the TF-IDF of a mention $m_i$ for a source entity $e_{q_M}$, where the underlying term statistics are computed over the index fields $la$. Consequently, the more mentions an entity $e_{q_M}$ contains as link anchor texts $la$, the higher the ranking score $s_{I_W}(q_M, e_{q_M})$.

[. . . ] Shepard, Glenn set first two milestones
May 5, 1961: Alan Shepard becomes first American in space.
Feb. 20, 1962: John Glenn becomes first American in orbit.
Jan. 27, 1967: Gus Grissom, Edward White II and Roger Chaffee die in Apollo 1 spacecraft fire on launch pad.
July 20, 1969: Apollo 11's Neil Armstrong and Buzz Aldrin land on moon.
July 17, 1975: American Apollo and Soviet Soyuz spacecraft link in orbit.
April 12, 1981: Columbia soars on first space shuttle flight.
June 18, 1983: Sally Ride becomes first American woman in space.
Jan. 28, 1986: Challenger explodes, killing all seven on board.
April 25, 1990: Hubble Space Telescope is released into orbit.
Dec. 2, 1993: First Hubble repair mission is launched.
March 14, 1995: Norman Thagard is first American to be launched on a Russian rocket. Two days later, he becomes first American to visit Mir.
June 29, 1995: Atlantis docks with Mir in first shuttle-station hookup.
Sept. 26, 1996: Shannon Lucid returns to Earth after 188-day Mir mission, a U.S. space endurance record and a world record for women.
Nov. 19, 1996: Story Musgrave, at age 61, becomes oldest man in space.
Oct. 29, 1998: Discovery is scheduled to blast off, carrying 77-year-old John Glenn back into orbit and making him oldest man in space. [. . . ]

Figure 4.1: Excerpt from a document in the AQUAINT corpus. The mentions to be linked to Wikipedia are highlighted; here, only the first mention of an entity in the document is annotated.

Now, since each $e_{q_M}$ also provides the link targets $lt \in W$ for the link anchor texts $la$ in the fields $linkTo$, we can extract all outlink targets $lt \in W$ from all source entities in $S_{q_M}$, i.e. all $e = lt \in L_{out}(S_{q_M})$. We endow each of these outlink targets $e = lt \in L_{out}(S_{q_M})$ with a relevance weight $w_r(e)$. This weight is the sum over the scores $s_{I_W}(q_M, e_{q_M})$ for the different source entities $e_{q_M} \in S_{q_M}$ that contain $e$ as an outlink target $e \in L_{out}(e_{q_M})$:

$$
w_r(e) = \sum_{e_{q_M} \in S_{q_M}} \delta_e \, s_{I_W}(q_M, e_{q_M}), \qquad
\delta_e = \begin{cases} 1 & \text{iff } e \in L_{out}(e_{q_M}), \\ 0 & \text{else.} \end{cases} \tag{4.9}
$$

These weights are interpreted as the relevance of a candidate, and we use them to remove less relevant candidates from the overall set $L_{out}(S_{q_M})$, which may easily contain more than a thousand different entities appearing only once as outlink target. Therefore, we keep only a reduced set of the top 100 candidate entities in $L_{out}(S_{q_M})$ that have the highest relevance weights $w_r(e)$.
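A minimal sketch of this weighting and pruning step, assuming the ranked hits are available as (score, outlink target set) pairs; the function name and data layout are illustrative, not the thesis' actual code.

```python
from collections import defaultdict

def weight_outlink_targets(ranked_sources, top_k=100):
    """Relevance weights w_r(e) of Eq. 4.9: each outlink target accumulates
    the Lucene scores s_IW of all source entities that link to it.
    `ranked_sources` holds (score, outlink_targets) pairs for the at most
    30 source entities in S_qM."""
    w_r = defaultdict(float)
    for score, outlinks in ranked_sources:
        for e in set(outlinks):  # delta_e = 1 iff e in L_out(e_qM)
            w_r[e] += score
    # keep only the reduced set of the top_k highest-weighted candidates
    ranked = sorted(w_r.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])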


Figure 4.2: Illustration of collective search results (Example 15). The figure shows a reduced link network for the top ranked source entities $S_{q_M}$ (middle) retrieved with an ensemble query for the mentions depicted in Fig. 4.1. The entities NASA, Space Shuttle and International Space Station have highest rank as they contain many link anchor texts matching the given mentions, e.g. Challenger, Columbia and Atlantis. For simplicity we depict only outlinks and each link only once. Indeed, each link may appear multiple times in the article of a source entity, e.g. we find five outlinks from NASA to Space Shuttle.

To illustrate the described process of collective search, we give the following example using a document from the AQUAINT corpus as depicted in Fig. 4.1. For convenience, we restrict ourselves to a subset of the mentions contained in the document and show only a selection of source entities and outlink targets in Fig. 4.2.

Example 15 (Collective Search)

Take a list of mentions contained in the document as depicted in Fig. 4.1:

$M = \{\text{Apollo 11}, \text{spacecraft}, \text{Columbia}, \text{Challenger}, \text{Atlantis}, \text{Discovery}, \ldots\}$.

Following Eq. 4.8, the ensemble query created for $M$ is

$$
q_M = q_{la}(\text{"Apollo 11"}) \wedge q_{la}(\text{"spacecraft"}) \wedge q_{la}(\text{"Columbia"}) \wedge q_{la}(\text{"Challenger"}) \wedge q_{la}(\text{"Atlantis"}) \wedge q_{la}(\text{"Discovery"}) \wedge \ldots
$$

A search in $I_W$ using this query $q_M$ returns 30 ranked source entities $S_{q_M}$. For illustration, we show here only the highlights:

1. NASA, $s_{I_W} = 143.76$
2. Space Shuttle, $s_{I_W} = 120.98$
3. International Space Station, $s_{I_W} = 119.56$
4. Apollo program, $s_{I_W} = 89.13$
   ...
8. Moon, $s_{I_W} = 56.51$
9. Astronaut, $s_{I_W} = 53.78$
   ...
20. Space Shuttle Atlantis, $s_{I_W} = 34.29$
21. Space Shuttle Columbia disaster, $s_{I_W} = 32.23$
   ...

Note that the scores $s_{I_W}(q_M, e_{q_M})$ are given here only for illustration and cannot be interpreted without the context of this example. As we see from this ranked list, the retrieved source entities are thematically very related to the content of the document (cf. Fig. 4.1). This is also illustrated through the dense linkage in Fig. 4.2: the entities NASA, International Space Station and Space Shuttle are ranked highest as they contain most of the given mentions, e.g. Challenger, Columbia and Atlantis, as link anchor texts, albeit with different link targets. The entity Apollo program is already more specific and, containing fewer mentions as link anchor text, receives a notably lower score $s_{I_W}$.

Since these entities contain the required link anchor texts, they consequently also contain many outlink targets that indeed correspond to the ground truth entities of the given mentions. As we see in Fig. 4.2, the set of outlink targets $L_{out}(S_{q_M})$ of the source entities NASA, Space Shuttle etc. contains

$L_{out}(S_{q_M}) = \{\text{Space Shuttle Columbia}, \ldots, \text{Space Shuttle Challenger}, \ldots, \text{Space Shuttle Atlantis}, \ldots, \text{Space Shuttle Discovery}, \ldots\}.$

Following Eq. 4.9, these outlink targets are weighted with relevance weights $w_r(e)$ based on the score of their respective source entities in $S_{q_M}$. The higher this weight, the more often the entity $e$ is an outlink target of a source entity $e_{q_M} \in S_{q_M}$. The highest weight would be obtained by an entity that appears in the outlink target sets of all source entities, where these sources at the same time contain many mentions as link anchor texts.

At this point, the retrieved candidates form a set of potential targets that is not yet related to the input mentions. To link the elements in the target entity set $L_{out}(S_{q_M})$ with the input mentions, we use their respective title and redirect index fields. More specifically, we analyse for each $e \in L_{out}(S_{q_M})$ whether either the title or a redirect of $e$ contains any $name(m_i)$. If so, we add $e$ to the candidate set $ec_i(m_i)$ for mention $m_i$. Note that one $e$ can then be contained in multiple candidate sets.

When this candidate-mention association yields no result, no collective search candidate can be assigned.

The result of the collective search and the above candidate assignment is the collection $ec_1(m_1), \ldots, ec_k(m_k)$, where each set $ec_i(m_i)$ is a set of candidate entities for mention $m_i$.
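A sketch of this candidate-mention association, assuming the title and redirect fields have been loaded into plain dictionaries; titles[e] and redirects[e] are hypothetical data structures standing in for the index fields.

```python
def assign_candidates(mention_names, candidates, titles, redirects):
    """Build the candidate sets ec_i(m_i): an entity e is added for every
    mention whose name is contained in e's title or one of its redirects."""
    candidate_sets = {m: set() for m in mention_names}
    for e in candidates:
        surface_forms = {titles[e]} | redirects.get(e, set())
        for m in mention_names:
            # substring containment, as described above; one entity may
            # end up in the candidate sets of several mentions
            if any(m.lower() in s.lower() for s in surface_forms):
                candidate_sets[m].add(e)
    return candidate_sets
```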

Cross Coherence among Candidate Entities

Our intuition is that entities mentioned jointly in a document should be related.

Following other approaches, we use the SRL* measure over inlinks (cf. Eq. 2.4) to compute the relatedness among Wikipedia entities. Now, SRL* is usually used to compute the pairwise relatedness of two entities. Here, we introduce cross coherence to account for the collective fitness of a set of entities.

Cross coherence states how well a specific candidate entity $e_{ij} \in ec_i(m_i)$ fits to the other candidate entities $\{ec_l\}_{l=1, l \neq i}^{|M|}$. More formally, we define the cross coherence $coh_\times$ of a candidate $e_{ij} \in ec_i$ towards a collection of other candidates $\{ec_l\}_{l=1, l \neq i}^{|M|}$ as:

$$
coh_\times\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \Delta \cdot \text{SRL*}(e_{ij}, e') \tag{4.10}
$$

Here, $|M|$ is the total number of mentions in the document, $i$ the index over these mentions and $j$ the index over the candidates for a mention $m_i$. The second sum in Eq. 4.10 computes the averaged pairwise SRL* of candidate $e_{ij}$ for mention $m_i$ and the candidates in another candidate set $ec_l$ for another mention $m_l$. This is weighted by the factor $\Delta$, a real-valued scalar that we use to account for contextual similarity and describe in more detail in the following. The weighted averaged relatedness is then again averaged over all candidate sets for all mentions by the first sum in Eq. 4.10.

With the definition above, cross coherence can be interpreted as the average relatedness of an entity to a collection of entities and has range $[0,1]$. This range is also preserved through the weighting factor $\Delta$ in Eq. 4.10. This factor serves as an additional weight of relatedness and may be the EMP of a candidate (Milne and Witten [2008b]) or a binary value indicating that the two candidates link to each other (Ratinov et al. [2011]). Since the first variant may erroneously prioritize high-popularity candidates and the second is somewhat restrictive, we here propose factors that constitute contextual similarity weights.
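The following sketch computes Eq. 4.10 for one candidate. Here srl and weight are caller-supplied functions: the default weight of 1.0 yields the unweighted baseline introduced below (Eq. 4.11), while the cosine and thematic weights of Eqs. 4.12-4.15 can be plugged in instead. This is an illustrative reading of the formula, not the original implementation.

```python
def cross_coherence(e_ij, i, candidate_sets, srl, weight=lambda a, b: 1.0):
    """Cross coherence coh_x of Eq. 4.10 for candidate e_ij of mention i.
    `candidate_sets` is the list [ec_1, ..., ec_|M|]; `srl(a, b)` returns
    SRL* in [0, 1]; `weight(a, b)` plays the role of the factor Delta."""
    M = len(candidate_sets)
    total = 0.0
    for l, ec_l in enumerate(candidate_sets):
        # skip the candidate's own mention and identical candidate sets
        if l == i or ec_l == candidate_sets[i] or not ec_l:
            continue
        # inner sum: averaged weighted relatedness towards the set ec_l
        inner = sum(weight(e_ij, e2) * srl(e_ij, e2)
                    for e2 in ec_l if e2 != e_ij)
        total += inner / len(ec_l)
    return total / M
```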

Cross Coherence Weight Factors

To evaluate the effect of the different weighting factors, we will compare against a baseline that uses no weight factor for semantic relatedness, obtained by omitting the term $\Delta$ in Eq. 4.10. For better distinction, we denote this baseline with $coh_{\text{SRL*}}$ and then have

$$
coh_{\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \text{SRL*}(e_{ij}, e') \tag{4.11}
$$

The first weight that we evaluate is based on the cosine similarity of candidate contexts $\cos(text(e), text(e'))$ (cf. Eq. 3.1). Analogously to Eq. 4.11, we replace the factor $\Delta$ in Eq. 4.10 and arrive at

$$
coh_{\cos\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \cos(text(e_{ij}), text(e')) \cdot \text{SRL*}(e_{ij}, e') \tag{4.12}
$$

We will also evaluate cross coherence using only cosine similarity. Then, cross coherence is purely context based and uses no semantic relatedness from links. This is achieved by omitting the SRL* term in Eq. 4.12, i.e.

$$
coh_{\cos}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} \cos(text(e_{ij}), text(e')) \tag{4.13}
$$

Following the results obtained in the previous chapter, we also introduce a coherence weight based on topic distributions. In contrast to the previous chapter, topic distributions are not inferred on article texts but, to emphasise the co-occurrence of entities, on link anchor texts. More specifically, the documents used to train the topic model consist of the concatenation of all link anchor texts $la$ contained in the outlink collection $L_{out}(e)$ of an entity $e$. Based on this, we introduce as thematic weight the Hellinger distance of the topic probability distributions inferred over the concatenations of link anchor texts $la \in L_{out}(e)$ and $la \in L_{out}(e')$:

$$
H(\mathcal{T}^{la}_e, \mathcal{T}^{la}_{e'}) = \sum_{k=1}^{K} \sqrt{p_{L_{out}(e)}(\tau_k)\, p_{L_{out}(e')}(\tau_k)}, \tag{4.14}
$$

with $K$ the number of topics in the LDA model and $p_{L_{out}(e)}$ and $p_{L_{out}(e')}$ the topic probability distribution vectors for the concatenation of link texts $\{la\} \in L_{out}(e)$ resp. $\{la\} \in L_{out}(e')$ of the entities $e$ and $e'$. The formulation of the Hellinger distance in Eq. 4.14 is an alternative to that in Eq. 3.36, but it can be shown that they are equivalent. The thematic weight $H(\mathcal{T}^{la}_e, \mathcal{T}^{la}_{e'})$ then constitutes the thematic distance over two link text collections, and we use this weight as a replacement for the cosine similarity in Eq. 4.12, i.e.

$$
coh_{\tau\text{SRL*}}\!\left(e_{ij}, \{ec_l\}_{l=1, l \neq i}^{|M|}\right) := \frac{1}{|M|} \sum_{\substack{l=1,\, l \neq i \\ ec_i \neq ec_l}}^{|M|} \frac{1}{|ec_l|} \sum_{\substack{e' \in ec_l \\ e_{ij} \neq e'}} H(\mathcal{T}^{la}_{e_{ij}}, \mathcal{T}^{la}_{e'}) \cdot \text{SRL*}(e_{ij}, e') \tag{4.15}
$$

To train the required topic model, we randomly chose 90k entities that have at least 10 outlinks, extracted all their link anchor texts and then trained a topic model with 500 topics on the generated documents.

So far, we have defined the collective search procedure and the weighting of the retrieved candidates. Since the sets $ec_i$ still contain more than one candidate for each mention $m_i$, we now describe how we choose the best fitting candidate from this set.

Selection of Candidates from Collective Search

The selection of the best fitting candidate is the final result of the collective search procedure and determines prioritized candidate entities $e_{coh}(m_i)$. More specifically, from the collectively retrieved candidate entities $ec_i(m_i)$ we select one candidate entity $e_{coh}(m_i)$ for each mention $m_i$ based on the product of collective search relevance weight $w_r$ (Eq. 4.9) and cross coherence $coh_\times$ (Eq. 4.10):

$$
e_{coh}(m_i) := \arg\max_{e_i(m_i) \in ec_i(m_i)} \left( w_r(e_i(m_i)) \cdot coh_\times\!\left(e_i(m_i), \{ec_l(m_l)\}_{l=1, l \neq i}^{|M|}\right) \right) \tag{4.16}
$$

That is, among all candidates $ec_i(m_i)$ for a mention $m_i$, we choose the entity that has the maximum value for the product of collective search relevance weight $w_r(e_i)$ (Eq. 4.9) and cross coherence $coh_\times$ (Eq. 4.10). We use $coh_\times$ in a product with $w_r$ to reduce the dominating effect of $w_r$, as the latter is usually several orders of magnitude higher. Importantly, note that we can only assign such a candidate $e_{coh}(m_i)$ to a mention $m_i$ if the set $ec_i(m_i)$ is not empty. Otherwise, we have no such candidate.
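Combining the pieces above, a minimal sketch of the final selection of Eq. 4.16, reusing the cross_coherence helper and the weight convention from the earlier sketches; all names are illustrative.

```python
def select_candidates(candidate_sets, w_r, srl, weight=lambda a, b: 1.0):
    """Eq. 4.16: for each mention pick the candidate maximising
    w_r(e) * coh_x(e, other candidate sets); mentions with an empty
    candidate set receive no collective search candidate (None)."""
    sets = list(candidate_sets.values())
    selected = {}
    for i, (mention, ec_i) in enumerate(candidate_sets.items()):
        selected[mention] = max(
            ec_i,
            key=lambda e: w_r[e] * cross_coherence(e, i, sets, srl, weight),
            default=None,
        )
    return selected
```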

In an alternative formulation, we might incorporate the EMP of a candidate.

But then popular candidates would dominate in most cases, even if their coherence is low. For instance, Shen et al. [2012] propose a similar global coherence measure but, instead of computing the global coherence over all candidates as proposed here, the authors use a strong simplification and compute the global coherence only over those candidates that have the highest EMP, no matter how well they fit the context. Thus, less prominent entities are completely ignored. In contrast, we consider all candidates and investigate different weighting schemes as described above. When evaluating different cross coherence weights, we also determine the candidate $e_{coh}(m_i)$ using the specific weight for cross coherence computation. We then use either $coh_{\text{SRL*}}$ (Eq. 4.11), $coh_{\cos\text{SRL*}}$ (Eq. 4.12), $coh_{\tau\text{SRL*}}$ (Eq. 4.15) or the purely contextual form $coh_{\cos}$ (Eq. 4.13) that omits SRL*.
