
3.5.3 Linking as a Ranking Problem


Further, assume the topic distribution $T_{e_1} = \{0.3, 0.7\}$ and a mention context $\text{text}(m) = \{w_1, w_2\}$. According to Eq. 3.8, the vector $x_{WTC}(m, e_1)$ representing the pair of candidate $e_1$ and mention $m$ is composed of:

$$x_{WTC}(m, e_1) = \begin{cases} p_{e_1}(\phi_1) = 0.3, & \forall (w, \phi_k) \in \{(w_1, \phi_1), (w_2, \phi_1)\} \\ p_{e_1}(\phi_2) = 0.7, & \forall (w, \phi_k) \in \{(w_1, \phi_2), (w_2, \phi_2)\} \\ 0, & \text{else.} \end{cases}$$

The full instantiation of this vector is given by

$$x_{WTC}(m, e_1) = [\underbrace{0.3}_{(w_1, \phi_1)}, \underbrace{0.3}_{(w_2, \phi_1)}, \underbrace{0}_{(w_3, \phi_1)}, \underbrace{0}_{(w_4, \phi_1)}, \underbrace{0.7}_{(w_1, \phi_2)}, \underbrace{0.7}_{(w_2, \phi_2)}, \underbrace{0}_{(w_3, \phi_2)}, \underbrace{0}_{(w_4, \phi_2)}] \in [0,1]^{K \cdot |V_W|}.$$

Since $\text{text}(m) \cap \text{text}(e_2) = \emptyset$, the vector $x_{WTC}(m, e_2)$ representing the pair $(m, e_2)$ has no word-topic correlation features and contains only a zero representing the cosine similarity of the contexts.
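To make the feature construction concrete, the following is a minimal sketch in Python with hypothetical helper names. The rule that an entry $(w, \phi_k)$ receives the value $p_e(\phi_k)$ whenever $w$ occurs in both contexts is inferred from the example above, not taken verbatim from Eq. 3.8:

```python
import numpy as np

def wtc_features(mention_words, entity_words, topic_probs, vocab):
    """Sketch of a word-topic correlation (WTC) feature vector.

    Entry (w, phi_k) is set to the entity's topic probability p_e(phi_k)
    if word w occurs in both the mention and the entity context,
    and to 0 otherwise (cf. the example instantiation above).
    """
    K = len(topic_probs)                   # number of topics
    x = np.zeros(K * len(vocab))           # lives in [0, 1]^(K * |V_W|)
    overlap = set(mention_words) & set(entity_words)
    for k, p_k in enumerate(topic_probs):  # topic phi_k with p_e(phi_k)
        for i, w in enumerate(vocab):
            if w in overlap:
                x[k * len(vocab) + i] = p_k
    return x

vocab = ["w1", "w2", "w3", "w4"]
x = wtc_features(["w1", "w2"], ["w1", "w2"], [0.3, 0.7], vocab)
print(x)  # [0.3 0.3 0.  0.  0.7 0.7 0.  0. ]
```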

Having detailed the feature design of our method WTC and that of its inspiration WCC, proposed by Bunescu and Pasca [2006], we now turn to the machine learning method that exploits these designs to learn a model for entity linking.

This model is based on ranking candidate entities with respect to a given mention and its context, and defines a feature-based threshold learning for the detection of uncovered entities. We then use this method in our experiments to compare WTC and WCC for person name disambiguation in German.

For all SVM models evaluated in this thesis, we use the SVMlight implementation by Thorsten Joachims¹ that provides both standard classification as well as an adaptation for ranking.

Now, a ranking approach for entity linking can be summarized as follows. For a mention $m$ and a set of $n$ candidates $e(m) = \{e_1(m), \ldots, e_n(m)\}$, the optimal result of a ranking algorithm is a ranking $r = \{r_1, \ldots, r_n\} \in \mathbb{R}^n$ that orders the $n$ candidate entities $e(m)$ according to their fitness to the mention (or the mention context). In our case, a ranking can be considered correct if the correct underlying entity $e^+(m)$ is ranked at the top position. To describe the underlying technique, we use the description from Pilz and Paaß [2009], which closely follows that in Joachims [2002], but adapt the notation.

As in Joachims [2002], we start with a collection of entities $e = \{e_1, \ldots, e_{|W|}\}$. For a mention $m$ we want to determine a list of relevant entities in $e$, where the most relevant entities appear first. This corresponds to a ranking relation $r(m) \subseteq e \times e$ that fulfills the properties of a weak ordering, i.e. it is asymmetric and transitive. If an entity $e_i$ is ranked higher than $e_j$ for an ordering $r$, i.e. $e_i <_r e_j$, then $(e_i, e_j) \in r$, otherwise $(e_i, e_j) \notin r$.

We have to measure the similarity of a proposed ranking $r(m)$ and the target ranking $r^*(m)$. Such a measure is Kendall's $\tau$ (Kendall [1955]), which is a function of the number $n_c$ of concordant pairs in relation to all pairs. A pair $e_i \neq e_j$ is concordant for two rankings $r_a$ and $r_b$ if either $(e_i, e_j) \in r_a \wedge (e_i, e_j) \in r_b$ or $(e_j, e_i) \in r_a \wedge (e_j, e_i) \in r_b$.
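As an illustration, the following small sketch computes Kendall's $\tau$ from concordant and discordant pair counts, assuming strict orderings without ties (scipy.stats.kendalltau provides an equivalent library routine):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau as (n_c - n_d) / (n_c + n_d) over all entity pairs.

    rank_a, rank_b map each entity to its position (lower = ranked higher).
    Assumes strict orderings, i.e. no ties within a ranking.
    """
    items = list(rank_a)
    n_c = n_d = 0
    for ei, ej in combinations(items, 2):
        # a pair is concordant if both rankings order it the same way
        if (rank_a[ei] < rank_a[ej]) == (rank_b[ei] < rank_b[ej]):
            n_c += 1
        else:
            n_d += 1
    return (n_c - n_d) / (n_c + n_d)

# identical rankings give tau = 1, fully reversed rankings give tau = -1
print(kendall_tau({"e1": 0, "e2": 1, "e3": 2}, {"e1": 0, "e2": 1, "e3": 2}))  # 1.0
print(kendall_tau({"e1": 0, "e2": 1, "e3": 2}, {"e1": 2, "e2": 1, "e3": 0}))  # -1.0
```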

Now assume we have a training set $D$ containing $n$ different i.i.d. mentions $m_i$ with target rankings

$$D = \{(m_1, r_1), (m_2, r_2), \ldots, (m_n, r_n)\}, \qquad (3.9)$$

where $r_i \subseteq e \times e$ is a ranking on the entities at hand. To achieve a ranking close to the ground truth $r^*$, a learner will select a ranking function $f(m)$ based on the training set $D$ that maximizes the empirical $\tau_D$ (Kendall [1955]), which measures the similarity of two rankings on the training sample, i.e.

$$\tau_D(f) = \frac{1}{n} \sum_{k=1}^{n} \tau\left(r_{f(x(m, e_k(m)))}, r_k\right), \qquad (3.10)$$

where $r_{f(x(m, e_k(m)))}$ is the ranking induced by the ranking function $f$ and $r_k$ the target ranking.

Maximizing Eq. 3.10 is analogous to classification by minimizing training error, with the difference that the target is not a class label, but a binary ordering relation.
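Reusing the kendall_tau sketch from above, the empirical $\tau_D$ of Eq. 3.10 is then simply the mean of the per-mention $\tau$ values:

```python
def empirical_tau(induced_rankings, target_rankings):
    """tau_D (Eq. 3.10): mean Kendall's tau between the rankings induced
    by f and the target rankings over the n training mentions."""
    taus = [kendall_tau(r_f, r_k)
            for r_f, r_k in zip(induced_rankings, target_rankings)]
    return sum(taus) / len(taus)
```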

Thus, whereas in standard SVMs constraints are formulated over the offset from a separating hyperplane, Ranking SVMs impose different constraints, since additionally the relative ordering of the examples has to be modelled. Consider the class of linear ranking functions

$$(e_i, e_j) \in f_w(m) \iff w \cdot x(m, e_i) > w \cdot x(m, e_j) \qquad (3.11)$$

¹The software is available at http://svmlight.joachims.org


where $x(m, e_i) \in \mathbb{R}^d$ is a vector of $d$ real-valued features that for instance describe the fitness between candidate and mention, and $w \in \mathbb{R}^d$ is a weight vector of matching dimension. For the class of linear ranking functions in Eq. 3.11, maximizing the number of concordant pairs, i.e. maximizing Eq. 3.10, is equivalent to finding the weight vector $w$ so that the maximum number of the following inequalities hold:

$$\forall (e_i, e_j) \in r_1 : w \cdot x(m_1, e_i) > w \cdot x(m_1, e_j) \qquad (3.12)$$
$$\vdots$$
$$\forall (e_i, e_j) \in r_n : w \cdot x(m_n, e_i) > w \cdot x(m_n, e_j)$$

The exact solution of this problem is NP-hard. As proposed in Joachims [2002], and just like in classification SVMs, the solution is approximated by introducing non-negative slack variables $\xi_{i,j,k}$ and minimizing the upper bound, i.e. the sum of slack variables $\sum \xi_{i,j,k}$. Regularizing the length of $w$ to maximize margins leads to the following optimization problem:

$$\text{minimize:} \quad V(w, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{|e|} \sum_{j=1}^{|e|} \sum_{k=1}^{n} \xi_{i,j,k} \qquad (3.13)$$

subject to:

$$\forall (e_i, e_j) \in r_1 : w \cdot x(m_1, e_i) \geq w \cdot x(m_1, e_j) + 1 - \xi_{i,j,1} \qquad (3.14)$$
$$\vdots$$
$$\forall (e_i, e_j) \in r_n : w \cdot x(m_n, e_i) \geq w \cdot x(m_n, e_j) + 1 - \xi_{i,j,n}$$
$$\forall i\, \forall j\, \forall k : \xi_{i,j,k} \geq 0$$

The parameter $C$ is the usual parameter capturing the trade-off between margin size and training error in terms of $n_c$. As noted in Joachims [2002], this optimization problem is comparable to the ordinal regression approach in Herbrich et al. [2000].
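For illustration only, the quadratic program of Eqs. 3.13 and 3.14 can be written down directly with a generic solver. The following sketch uses cvxpy and randomly generated preference pairs, both assumptions of this example; the thesis itself relies on SVMlight, whose decomposition algorithms scale far better than a generic QP solver:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs, C = 5, 40, 1.0
X_hi = rng.random((n_pairs, d))   # x(m_k, e_i) of the higher-ranked entity
X_lo = rng.random((n_pairs, d))   # x(m_k, e_j) of the lower-ranked entity

w = cp.Variable(d)
xi = cp.Variable(n_pairs, nonneg=True)           # slack variables xi_{i,j,k}
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [X_hi @ w >= X_lo @ w + 1 - xi]    # one constraint per pair (Eq. 3.14)
cp.Problem(objective, constraints).solve()
print(w.value)
```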

Further, the optimization problem is convex and has no local optima. By rearranging the constraints in Eq. 3.14 as

$$w \cdot (x(m_k, e_i) - x(m_k, e_j)) \geq 1 - \xi_{i,j,k} \qquad (3.15)$$

it becomes apparent that the optimization problem is equivalent to that of a classification SVM on pairwise difference vectors $x(m_k, e_i) - x(m_k, e_j)$. Due to this similarity, it can be solved using decomposition algorithms similar to those used for SVM classification.
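This reduction to classification can be sketched as follows: build the pairwise difference vectors, attach labels $\pm 1$ for the two orientations of each preference, and train a linear classifier without intercept. The sketch uses scikit-learn's LinearSVC as a stand-in for SVMlight, and all data and helper names are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X_by_mention, gold_index):
    """Difference vectors x(m_k, e+) - x(m_k, e-) with labels +/-1.

    X_by_mention: one (n_candidates x d) feature matrix per mention m_k.
    gold_index:   position of the correct candidate e+ in each matrix.
    """
    diffs, labels = [], []
    for X, gold in zip(X_by_mention, gold_index):
        for j in range(X.shape[0]):
            if j == gold:
                continue
            diffs.append(X[gold] - X[j]); labels.append(+1)  # e+ above e-
            diffs.append(X[j] - X[gold]); labels.append(-1)  # mirrored pair
    return np.array(diffs), np.array(labels)

rng = np.random.default_rng(1)
X_by_mention = [rng.random((4, 6)) for _ in range(30)]  # toy feature vectors
gold_index = [0] * 30                                   # e1 is correct throughout

diffs, labels = pairwise_transform(X_by_mention, gold_index)
ranker = LinearSVC(C=1.0, fit_intercept=False).fit(diffs, labels)
w = ranker.coef_.ravel()                                # learned weight vector
```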

To formulate inference using such a ranking function, we first note that it can be shown that a learned ranking function $f_w(m)$ can always be represented as a linear combination of the feature vectors:

$$(e_i, e_j) \in f_w(m) \iff w \cdot x(m, e_i) > w \cdot x(m, e_j)$$
$$\iff \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_i) > \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_j), \qquad (3.16)$$

where $w$ is the learned weight vector and the $a_{k,l}$ are derived from the values of the Lagrangian dual variables at the solution.



Figure 3.5: Example of two weight vectors $w_1$ and $w_2$ ranking four points (after Joachims [2002]). The margin $\delta$ is the distance between the closest two projections within all target rankings. For $w_1$ and $\delta_1$, these are the points 1 and 2; for $w_2$ and $\delta_2$, the points 1 and 4.

Further, we note that the learned ranking function $f_w(m)$ is here used to rank a set of candidates according to a mention $m$. Aiming at the candidate with the highest rank, it is then sufficient to sort these candidates by their value of

$$\text{rank}(x(m, e_i)) = w \cdot x(m, e_i) = \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_i). \qquad (3.17)$$

The final prediction $\hat{e}$ is then given by

$$\hat{e} = \underset{e_i \in e(m)}{\arg\max}\ \text{rank}(x(m, e_i)) = \underset{e_i \in e(m)}{\arg\max}\ w \cdot x(m, e_i). \qquad (3.18)$$

An exemplary ordering implied by a weight vector $w$ is illustrated in Fig. 3.5 (adapted from Joachims [2002]). The figure shows how a weight vector $w$ determines the ordering of four points in a two-dimensional example. For any weight vector $w$, the points are ordered by their projection onto $w$, which is equivalent to an ordering by the signed distance to a hyperplane with normal vector $w$. In the example in Fig. 3.5, this means that for $w_1$ the points are ordered (1, 2, 3, 4), while $w_2$ implies the ordering (2, 3, 1, 4) (Joachims [2002]).
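Continuing the hypothetical training sketch above (which produced the weight vector w and the toy candidate matrices X_by_mention), inference per Eqs. 3.17 and 3.18 reduces to a dot product and an argmax:

```python
import numpy as np

def rank_candidates(w, X_candidates):
    """Scores rank(x(m, e_i)) = w . x(m, e_i); the argmax is the prediction."""
    scores = X_candidates @ w
    return int(np.argmax(scores)), scores

best, scores = rank_candidates(w, X_by_mention[0])
print(best, scores)
```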

While Ranking SVMs, just like standard SVMs, may be used with all kinds of kernels, a linear kernel has the advantage that feature weights can be extracted directly without computational effort. Bunescu and Pasca make use of this to automatically learn the threshold for a decision on NIL candidates. They have demonstrated that, using a linear kernel in the Ranking SVM, this threshold can be learned automatically from the weight of an indicative feature:

$$x_{nil}(m, e) = \mathbb{1}(e = \text{NIL}). \qquad (3.19)$$

This binary feature is active only for the NIL candidate, which needs to be provided for each mention in order to learn the threshold from the available features. We may therefore create candidate sets $e(m) = \{e_i(m)\} \subset W \cup \{\text{NIL}\}$ that cover all candidates in Wikipedia $\{e_i(m)\} \subset W$ and add for each mention an artificial candidate NIL.
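Mechanically, this means the artificial NIL candidate simply competes in the argmax of Eq. 3.18: its feature vector activates only $x_{nil}$, so its score equals the learned weight of that feature, which thereby acts as the threshold. A sketch under the assumption that $x_{nil}$ occupies one dimension of the feature space:

```python
import numpy as np

def predict_with_nil(w, X_candidates, nil_index):
    """Argmax over Wikipedia candidates plus the artificial NIL candidate.

    The NIL vector has only the x_nil bit set, so its score is w[nil_index],
    the learned threshold below which all candidates are rejected.
    """
    nil_vector = np.zeros_like(w)
    nil_vector[nil_index] = 1.0            # x_nil(m, e) = 1 iff e = NIL
    X = np.vstack([X_candidates, nil_vector])
    best = int(np.argmax(X @ w))
    return "NIL" if best == len(X_candidates) else best
```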

To create training instances, we need to assign each training instance representing a mention-candidate pair a ranking. For our implementation of Bunescu and Pasca's method, we unsuccessfully tried to communicate with the authors on how these target rankings are created for the training data. Since the paper does not indicate otherwise, we assume that the ranking used in Bunescu and Pasca [2006] is a weak ordering where the correct candidate is assigned the top position and all other candidates that do not represent the ground truth entity share a place in the ordering. In practice, this ordering is realised through real-valued scalars $y \in \mathbb{R}$. These are assigned to each vector $x(m, e_i)$; a high value of $y$ indicates a leading position in the ranking, a low value of $y$ a late position. In our case, i.e. the case of a weak ordering, it suffices to choose a value $y \in \{-1, +1\}$.

Then, for instance in the case of three candidates $e_1$, $e_2$ and $e_3$ for a mention $m$, we have

$$e_1 = e^+(m) : y(x(m, e_1)) = +1$$
$$e_2 \neq e^+(m) : y(x(m, e_2)) = -1$$
$$e_3 \neq e^+(m) : y(x(m, e_3)) = -1$$

which puts $x(m, e_1)$ at the leading position and lets $x(m, e_2)$ and $x(m, e_3)$ share the same but lower position.
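A minimal sketch of this target assignment, including the NIL pseudo-candidate from Eq. 3.19 (candidate and gold names are hypothetical placeholders):

```python
def make_targets(candidates, gold):
    """Weak-ordering targets: y = +1 for the gold entity, y = -1 otherwise.

    The NIL pseudo-candidate is appended so the threshold feature x_nil
    (Eq. 3.19) is present in every mention's candidate set.
    """
    return [(e, +1 if e == gold else -1) for e in candidates + ["NIL"]]

print(make_targets(["e1", "e2", "e3"], gold="e1"))
# [('e1', 1), ('e2', -1), ('e3', -1), ('NIL', -1)]
```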

Having described the model designs of WTC and WCC and the learner used by Pilz and Paaß [2009] as well as by Bunescu and Pasca, we will now experimentally compare these approaches for person name disambiguation in German.
