
3.5.3 Linking as a Ranking Problem


Further, assume the topic distribution $T_{e_1} = \{0.3, 0.7\}$ and a mention context $\text{text}(m) = \{w_1, w_2\}$. According to Eq. 3.8, the vector $x_{WTC}(m, e_1)$ representing the pair of candidate $e_1$ and mention $m$ is composed of:

$$x_{WTC}(m, e_1) = \begin{cases} p_{e_1}(\phi_1) = 0.3, & \forall (w, \phi_k) \in \{(w_1, \phi_1), (w_2, \phi_1)\} \\ p_{e_1}(\phi_2) = 0.7, & \forall (w, \phi_k) \in \{(w_1, \phi_2), (w_2, \phi_2)\} \\ 0, & \text{else.} \end{cases}$$

The full instantiation of this vector is given by

$$x_{WTC}(m, e_1) = [\underbrace{0.3}_{(w_1, \phi_1)}, \underbrace{0.3}_{(w_2, \phi_1)}, \underbrace{0}_{(w_3, \phi_1)}, \underbrace{0}_{(w_4, \phi_1)}, \underbrace{0.7}_{(w_1, \phi_2)}, \underbrace{0.7}_{(w_2, \phi_2)}, \underbrace{0}_{(w_3, \phi_2)}, \underbrace{0}_{(w_4, \phi_2)}] \in [0,1]^{K \cdot |V_W|}.$$

Since $\text{text}(m) \cap \text{text}(e_2) = \emptyset$, the vector $x_{WTC}(m, e_2)$ representing the pair $(m, e_2)$ has no word-topic correlation features and contains only a zero representing the cosine similarity of the contexts.
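To make the feature construction concrete, the following is a minimal sketch in Python with hypothetical helper names. The rule that an entry $(w, \phi_k)$ receives the value $p_e(\phi_k)$ whenever $w$ occurs in both contexts is inferred from the example above, not taken verbatim from Eq. 3.8:

```python
import numpy as np

def wtc_features(mention_words, entity_words, topic_probs, vocab):
    """Sketch of a word-topic correlation (WTC) feature vector.

    Entry (w, phi_k) is set to the entity's topic probability p_e(phi_k)
    if word w occurs in both the mention and the entity context,
    and to 0 otherwise (cf. the example instantiation above).
    """
    K = len(topic_probs)                   # number of topics
    x = np.zeros(K * len(vocab))           # lives in [0, 1]^(K * |V_W|)
    overlap = set(mention_words) & set(entity_words)
    for k, p_k in enumerate(topic_probs):  # topic phi_k with p_e(phi_k)
        for i, w in enumerate(vocab):
            if w in overlap:
                x[k * len(vocab) + i] = p_k
    return x

vocab = ["w1", "w2", "w3", "w4"]
x = wtc_features(["w1", "w2"], ["w1", "w2"], [0.3, 0.7], vocab)
print(x)  # [0.3 0.3 0.  0.  0.7 0.7 0.  0. ]
```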

Having detailed the feature design of our method WTC and that of its inspiration WCC, proposed by Bunescu and Pasca [2006], we now turn to the machine learning method that exploits these designs to learn a model for entity linking.

This model is based on ranking candidate entities with respect to a given mention and its context, and defines a feature-based threshold learning for the detection of uncovered entities. We then use this method in our experiments to compare WTC and WCC for person name disambiguation in German.

For all SVM models evaluated in this thesis, we use the SVMlight implementation by Thorsten Joachims¹ that provides both standard classification as well as an adaptation for ranking.

Now, a ranking approach for entity linking can be summarized as follows. For a mention $m$ and a set of $n$ candidates $e(m) = \{e_1(m), \ldots, e_n(m)\}$, the optimal result of a ranking algorithm is a ranking $r = \{r_1, \ldots, r_n\} \in \mathbb{R}^n$ that orders the $n$ candidate entities $e(m)$ according to their fitness to the mention (or the mention context). In our case, a ranking can be considered correct if the correct underlying entity $e^+(m)$ is ranked at the top position. To describe the underlying technique, we use the description from Pilz and Paaß [2009], which closely follows that in Joachims [2002], but adapt the notation.

As in Joachims [2002], we start with a collection of entities $e = \{e_1, \ldots, e_{|W|}\}$. For a mention $m$ we want to determine a list of relevant entities in $e$, where the most relevant entities appear first. This corresponds to a ranking relation $r(m) \subseteq e \times e$ that fulfills the properties of a weak ordering, i.e. it is asymmetric and transitive. If an entity $e_i$ is ranked higher than $e_j$ for an ordering $r$, i.e. $e_i <_r e_j$, then $(e_i, e_j) \in r$, otherwise $(e_i, e_j) \notin r$.

We have to measure the similarity of a proposed ranking $r(m)$ and the target ranking $r^*(m)$. Such a measure is Kendall's $\tau$ (Kendall [1955]), which is a function of the number $n_c$ of concordant pairs in relation to all pairs. A pair $e_i \neq e_j$ is concordant for two rankings $r_a$ and $r_b$ if either $(e_i, e_j) \in r_a \wedge (e_i, e_j) \in r_b$ or $(e_j, e_i) \in r_a \wedge (e_j, e_i) \in r_b$.
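As an illustration, the following small sketch computes Kendall's $\tau$ from concordant and discordant pair counts, assuming strict orderings without ties (scipy.stats.kendalltau provides an equivalent library routine):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau as (n_c - n_d) / (n_c + n_d) over all entity pairs.

    rank_a, rank_b map each entity to its position (lower = ranked higher).
    Assumes strict orderings, i.e. no ties within a ranking.
    """
    items = list(rank_a)
    n_c = n_d = 0
    for ei, ej in combinations(items, 2):
        # a pair is concordant if both rankings order it the same way
        if (rank_a[ei] < rank_a[ej]) == (rank_b[ei] < rank_b[ej]):
            n_c += 1
        else:
            n_d += 1
    return (n_c - n_d) / (n_c + n_d)

# identical rankings give tau = 1, fully reversed rankings give tau = -1
print(kendall_tau({"e1": 0, "e2": 1, "e3": 2}, {"e1": 0, "e2": 1, "e3": 2}))  # 1.0
print(kendall_tau({"e1": 0, "e2": 1, "e3": 2}, {"e1": 2, "e2": 1, "e3": 0}))  # -1.0
```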

Now assume we have a training set $D$ containing $n$ different i.i.d. mentions $m_i$ with target rankings

$$D = \{(m_1, r_1), (m_2, r_2), \ldots, (m_n, r_n)\}, \qquad (3.9)$$

where $r_i \subseteq e \times e$ is a ranking on the entities at hand. To achieve a ranking close to the ground truth $r^*$, a learner will select a ranking function $f(m)$ based on the training set $D$ that maximizes the empirical $\tau_D$ (Kendall [1955]), which measures the similarity of two rankings on the training sample, i.e.

$$\tau_D(f) = \frac{1}{n} \sum_{k=1}^{n} \tau\left(r_{f(x(m, e_k(m)))}, r_k\right), \qquad (3.10)$$

where $r_{f(x(m, e_k(m)))}$ is the ranking induced by the ranking function $f$ and $r_k$ the target ranking.

Maximizing Eq. 3.10 is analogous to classification by minimizing training error, with the difference that the target is not a class label, but a binary ordering relation.
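Reusing the kendall_tau sketch from above, the empirical $\tau_D$ of Eq. 3.10 is then simply the mean of the per-mention $\tau$ values:

```python
def empirical_tau(induced_rankings, target_rankings):
    """tau_D (Eq. 3.10): mean Kendall's tau between the rankings induced
    by f and the target rankings over the n training mentions."""
    taus = [kendall_tau(r_f, r_k)
            for r_f, r_k in zip(induced_rankings, target_rankings)]
    return sum(taus) / len(taus)
```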

Thus, whereas in standard SVMs constraints are formulated over the offset from a separating hyperplane, Ranking SVMs impose different constraints, since additionally the relative ordering of the examples has to be modelled. Consider the class of linear ranking functions

$$(e_i, e_j) \in f_w(m) \iff w \cdot x(m, e_i) > w \cdot x(m, e_j) \qquad (3.11)$$

¹The software is available at http://svmlight.joachims.org


where $x(m, e_i) \in \mathbb{R}^d$ is a vector of $d$ real-valued features that for instance describe the fitness between candidate and mention, and $w \in \mathbb{R}^d$ is a weight vector of matching dimension. For the class of linear ranking functions in Eq. 3.11, maximizing the number of concordant pairs, i.e. maximizing Eq. 3.10, is equivalent to finding the weight vector $w$ so that the maximum number of the following inequalities hold:

$$\forall (e_i, e_j) \in r_1 : w \cdot x(m_1, e_i) > w \cdot x(m_1, e_j) \qquad (3.12)$$
$$\vdots$$
$$\forall (e_i, e_j) \in r_n : w \cdot x(m_n, e_i) > w \cdot x(m_n, e_j)$$

The exact solution of this problem is NP-hard. As proposed in Joachims [2002], and just like in classification SVMs, the solution is approximated by introducing non-negative slack variables $\xi_{i,j,k}$ and minimizing the upper bound, i.e. the sum of slack variables $\sum \xi_{i,j,k}$. Regularizing the length of $w$ to maximize margins leads to the following optimization problem:

$$\text{minimize:} \quad V(w, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{|e|} \sum_{j=1}^{|e|} \sum_{k=1}^{n} \xi_{i,j,k} \qquad (3.13)$$

subject to:

$$\forall (e_i, e_j) \in r_1 : w \cdot x(m_1, e_i) \geq w \cdot x(m_1, e_j) + 1 - \xi_{i,j,1} \qquad (3.14)$$
$$\vdots$$
$$\forall (e_i, e_j) \in r_n : w \cdot x(m_n, e_i) \geq w \cdot x(m_n, e_j) + 1 - \xi_{i,j,n}$$
$$\forall i\, \forall j\, \forall k : \xi_{i,j,k} \geq 0$$

The parameter $C$ is the usual parameter capturing the trade-off between margin size and training error in terms of $n_c$. As noted in Joachims [2002], this optimization problem is comparable to the ordinal regression approach in Herbrich et al. [2000].
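For illustration only, the quadratic program of Eqs. 3.13 and 3.14 can be written down directly with a generic solver. The following sketch uses cvxpy and randomly generated preference pairs, both assumptions of this example; the thesis itself relies on SVMlight, whose decomposition algorithms scale far better than a generic QP solver:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs, C = 5, 40, 1.0
X_hi = rng.random((n_pairs, d))   # x(m_k, e_i) of the higher-ranked entity
X_lo = rng.random((n_pairs, d))   # x(m_k, e_j) of the lower-ranked entity

w = cp.Variable(d)
xi = cp.Variable(n_pairs, nonneg=True)           # slack variables xi_{i,j,k}
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [X_hi @ w >= X_lo @ w + 1 - xi]    # one constraint per pair (Eq. 3.14)
cp.Problem(objective, constraints).solve()
print(w.value)
```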

Further, the optimization problem is convex and has no local optima. By rearranging the constraints in Eq. 3.14 as

$$w \cdot (x(m_k, e_i) - x(m_k, e_j)) \geq 1 - \xi_{i,j,k} \qquad (3.15)$$

it becomes apparent that the optimization problem is equivalent to that of a classification SVM on pairwise difference vectors $x(m_k, e_i) - x(m_k, e_j)$. Due to this similarity, it can be solved using decomposition algorithms similar to those used for SVM classification.
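This reduction to classification can be sketched as follows: build the pairwise difference vectors, attach labels $\pm 1$ for the two orientations of each preference, and train a linear classifier without intercept. The sketch uses scikit-learn's LinearSVC as a stand-in for SVMlight, and all data and helper names are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X_by_mention, gold_index):
    """Difference vectors x(m_k, e+) - x(m_k, e-) with labels +/-1.

    X_by_mention: one (n_candidates x d) feature matrix per mention m_k.
    gold_index:   position of the correct candidate e+ in each matrix.
    """
    diffs, labels = [], []
    for X, gold in zip(X_by_mention, gold_index):
        for j in range(X.shape[0]):
            if j == gold:
                continue
            diffs.append(X[gold] - X[j]); labels.append(+1)  # e+ above e-
            diffs.append(X[j] - X[gold]); labels.append(-1)  # mirrored pair
    return np.array(diffs), np.array(labels)

rng = np.random.default_rng(1)
X_by_mention = [rng.random((4, 6)) for _ in range(30)]  # toy feature vectors
gold_index = [0] * 30                                   # e1 is correct throughout

diffs, labels = pairwise_transform(X_by_mention, gold_index)
ranker = LinearSVC(C=1.0, fit_intercept=False).fit(diffs, labels)
w = ranker.coef_.ravel()                                # learned weight vector
```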

To formulate inference using such a ranking function, we first note that it can be shown that a learned ranking function $f_w(m)$ can always be represented as a linear combination of the feature vectors:

$$(e_i, e_j) \in f_w(m) \iff w \cdot x(m, e_i) > w \cdot x(m, e_j)$$
$$\iff \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_i) > \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_j), \qquad (3.16)$$

where $w$ is the learned weight vector and the $a_{k,l}$ are derived from the values of the Lagrangian dual variables at the solution.



Figure 3.5: Example of two weight vectors $w_1$ and $w_2$ ranking four points (after Joachims [2002]). The margin $\delta$ is the distance between the closest two projections within all target rankings. For $w_1$ and $\delta_1$, these are the points 1 and 2; for $w_2$ and $\delta_2$, the points 1 and 4.

Further, we note that the learned ranking function $f_w(m)$ is here used to rank a set of candidates according to a mention $m$. Aiming at the candidate with the highest rank, it is then sufficient to sort these candidates by their value of

$$\text{rank}(x(m, e_i)) = w \cdot x(m, e_i) = \sum_{k,l} a_{k,l}\, x(m_k, e_l) \cdot x(m, e_i). \qquad (3.17)$$

The final prediction $\hat{e}$ is then given by

$$\hat{e} = \underset{e_i \in e(m)}{\arg\max}\ \text{rank}(x(m, e_i)) = \underset{e_i \in e(m)}{\arg\max}\ w \cdot x(m, e_i). \qquad (3.18)$$

An exemplary ordering implied by a weight vector $w$ is illustrated in Fig. 3.5 (adapted from Joachims [2002]). The figure shows how a weight vector $w$ determines the ordering of four points in a two-dimensional example. For any weight vector $w$, the points are ordered by their projection onto $w$, which is equivalent to an ordering by the signed distance to a hyperplane with normal vector $w$. In the example in Fig. 3.5, this means that for $w_1$ the points are ordered (1, 2, 3, 4), while $w_2$ implies the ordering (2, 3, 1, 4) (Joachims [2002]).
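Continuing the hypothetical training sketch above (which produced the weight vector w and the toy candidate matrices X_by_mention), inference per Eqs. 3.17 and 3.18 reduces to a dot product and an argmax:

```python
import numpy as np

def rank_candidates(w, X_candidates):
    """Scores rank(x(m, e_i)) = w . x(m, e_i); the argmax is the prediction."""
    scores = X_candidates @ w
    return int(np.argmax(scores)), scores

best, scores = rank_candidates(w, X_by_mention[0])
print(best, scores)
```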

While Ranking SVMs, just like standard SVMs, may be used with all kinds of kernels, a linear kernel has the advantage that feature weights can be extracted directly without computational effort. Bunescu and Pasca make use of this to automatically learn the threshold for a decision on NIL candidates. They have demonstrated that, using a linear kernel in the Ranking SVM, this threshold can be learned automatically from the weight of an indicative feature:

$$x_{nil}(m, e) = \mathbb{1}(e = \text{NIL}). \qquad (3.19)$$

This binary feature is active only for the NIL candidate, which needs to be provided for each mention in order to learn the threshold from the available features. We may therefore create candidate sets $e(m) = \{e_i(m)\} \subset W \cup \{\text{NIL}\}$ that cover all candidates in Wikipedia $\{e_i(m)\} \subset W$ and add for each mention an artificial candidate NIL.
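Mechanically, this means the artificial NIL candidate simply competes in the argmax of Eq. 3.18: its feature vector activates only $x_{nil}$, so its score equals the learned weight of that feature, which thereby acts as the threshold. A sketch under the assumption that $x_{nil}$ occupies one dimension of the feature space:

```python
import numpy as np

def predict_with_nil(w, X_candidates, nil_index):
    """Argmax over Wikipedia candidates plus the artificial NIL candidate.

    The NIL vector has only the x_nil bit set, so its score is w[nil_index],
    the learned threshold below which all candidates are rejected.
    """
    nil_vector = np.zeros_like(w)
    nil_vector[nil_index] = 1.0            # x_nil(m, e) = 1 iff e = NIL
    X = np.vstack([X_candidates, nil_vector])
    best = int(np.argmax(X @ w))
    return "NIL" if best == len(X_candidates) else best
```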

To create training instances, we need to assign each training instance representing a mention-candidate pair a ranking. For our implementation of Bunescu and Pasca's method, we unsuccessfully tried to communicate with the authors on how these target rankings are created for the training data. Since the paper does not indicate otherwise, we assume that the ranking used in Bunescu and Pasca [2006] is a weak ordering where the correct candidate is assigned the top position and all other candidates that do not represent the ground truth entity share a place in the ordering. In practice, this ordering is realised through real-valued scalars $y \in \mathbb{R}$. These are assigned to each vector $x(m, e_i)$; a high value of $y$ indicates a leading position in the ranking, a low value of $y$ a late position. In our case, i.e. the case of a weak ordering, it suffices to choose a value $y \in \{-1, +1\}$.

Then, for instance in the case of three candidates $e_1$, $e_2$ and $e_3$ for a mention $m$, we have

$$e_1 = e^+(m) : y(x(m, e_1)) = +1$$
$$e_2 \neq e^+(m) : y(x(m, e_2)) = -1$$
$$e_3 \neq e^+(m) : y(x(m, e_3)) = -1$$

which puts $x(m, e_1)$ at the leading position and lets $x(m, e_2)$ and $x(m, e_3)$ share the same but lower position.
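A minimal sketch of this target assignment, including the NIL pseudo-candidate from Eq. 3.19 (candidate and gold names are hypothetical placeholders):

```python
def make_targets(candidates, gold):
    """Weak-ordering targets: y = +1 for the gold entity, y = -1 otherwise.

    The NIL pseudo-candidate is appended so the threshold feature x_nil
    (Eq. 3.19) is present in every mention's candidate set.
    """
    return [(e, +1 if e == gold else -1) for e in candidates + ["NIL"]]

print(make_targets(["e1", "e2", "e3"], gold="e1"))
# [('e1', 1), ('e2', -1), ('e3', -1), ('NIL', -1)]
```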

Having described the model designs of WTC and WCC and the learner used by Pilz and Paaß [2009] as well as by Bunescu and Pasca, we will now experimentally compare these approaches for person name disambiguation in German.
