
3.6 Thematic Context Distance

3.6.1 Measures for Thematic Context Distance

To generalize from word-based comparison, we propose to measure the thematic context distance between mention and entity context to identify the underlying entity of a mention. To measure the thematic context distance for a mention-entity pair (m, e_i), we need to compare the topic probability distribution T_m over the mention context text(m) with the topic distribution T_{e_i} over the candidate entity article text text(e_i) for all candidates e_i ∈ e(m).

The topic modelling literature has evaluated a number of divergence measures for probability distributions with different weighting schemes. Here, we focus on their formulations as distances, i.e. the symmetric versions of divergence measures.

We describe three popular distribution distance measures that, applied to topic probability distributions, are here interpreted as thematic distances with respect to a given topic model. We evaluate each of these distances for its individual performance on the task of entity disambiguation (see Fig. 3.8 and Fig. 3.9).

The first measure we describe, and a very popular one in the topic modelling literature, is the Kullback-Leibler divergence. It was introduced by Kullback and Leibler [1951] as a divergence measure or relative entropy that, for two probability distributions T_m and T_e, is given by

KL(T_e, T_m) = \sum_{k=1}^{K} p_e(\phi_k) \log \frac{p_e(\phi_k)}{p_m(\phi_k)} \;\in\; [0, \infty) \qquad (3.33)

where p_e(φ_k) is the probability of topic φ_k in the context of entity e and p_m(φ_k) the probability of topic φ_k in the context of mention m. The Kullback-Leibler divergence has a range of [0, ∞) and is not symmetric: for all cases where T_e ≠ T_m, we have KL(T_e, T_m) ≠ KL(T_m, T_e).

The symmetric version of the Kullback-Leibler divergence is given by

sKL(T_e, T_m) = \frac{1}{2} \sum_{k=1}^{K} \left( p_m(\phi_k) \log \frac{p_m(\phi_k)}{p_e(\phi_k)} + p_e(\phi_k) \log \frac{p_e(\phi_k)}{p_m(\phi_k)} \right) \;\in\; [0, \infty) \qquad (3.34)

and yields sKL(T_e, T_m) = sKL(T_m, T_e). Since in our case the contexts text(e) and text(m) are interchangeable with respect to similarity, we expect the symmetric formulation to be more interpretable.

Another distance measure is the Jensen-Shannon distance after Lin [1991]. This measure is similar to the symmetric Kullback-Leibler divergence in Eq. 3.34 but uses the average of the two probabilities, r_k = 0.5 · (p_m(φ_k) + p_e(φ_k)), as denominator:

JS(T_e, T_m) = \frac{1}{2} \sum_{k=1}^{K} \left( p_m(\phi_k) \log \frac{p_m(\phi_k)}{r_k} + p_e(\phi_k) \log \frac{p_e(\phi_k)}{r_k} \right) \;\in\; [0, 1] \qquad (3.35)

The bound 0 ≤ JS(T_e, T_m) ≤ 1 holds if the logarithm in Eq. 3.35 is taken to base 2. If it is replaced with the natural logarithm, the upper bound is reduced to ln(2).

Blei and Lafferty [2009] proposed an adapted form of the Hellinger distance as another measure for comparing probability distributions:

H(T_e, T_m) = \sum_{k=1}^{K} \left( \sqrt{p_m(\phi_k)} - \sqrt{p_e(\phi_k)} \right)^2 \;\in\; [0, 1] \qquad (3.36)

This variant of the Hellinger distance has bounds 0 ≤ H(T_e, T_m) ≤ 1, and the upper bound of 1 is obtained when p_m(φ_k) assigns a probability of zero to every event to which p_e(φ_k) assigns a positive probability, and vice versa.
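To make the computations concrete, the following Python sketch evaluates the scalar distances of Eqs. 3.33 to 3.36 for two topic distributions given as NumPy arrays of length K. The function names and the small smoothing constant eps are our own assumptions for illustration, not part of the thesis.

```python
import numpy as np

def kl(p_e, p_m, eps=1e-12):
    """Kullback-Leibler divergence KL(T_e, T_m), Eq. 3.33."""
    return float(np.sum(p_e * np.log((p_e + eps) / (p_m + eps))))

def skl(p_e, p_m, eps=1e-12):
    """Symmetric Kullback-Leibler divergence, Eq. 3.34."""
    return 0.5 * (kl(p_e, p_m, eps) + kl(p_m, p_e, eps))

def js(p_e, p_m, eps=1e-12):
    """Jensen-Shannon distance, Eq. 3.35; base-2 logs give the range [0, 1]."""
    r = 0.5 * (p_m + p_e)
    return float(0.5 * np.sum(p_m * np.log2((p_m + eps) / (r + eps))
                              + p_e * np.log2((p_e + eps) / (r + eps))))

def hellinger(p_e, p_m):
    """Adapted Hellinger distance after Blei and Lafferty [2009], Eq. 3.36."""
    return float(np.sum((np.sqrt(p_m) - np.sqrt(p_e)) ** 2))
```

For distributions that sum to one and have full support, these functions reproduce the equations above; the smoothing constant only guards against zero probabilities in the logarithms.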

All of the above distances and divergences may be used as a single scalar. However, the representation as a single scalar ignores a lot of information, and entities appearing in similar contexts can be difficult to distinguish using such an aggregated measure. Therefore, instead of summing differences in probability values to a single value, we propose to use each difference term separately as a distinct feature. This allows for a better separability of the resulting data points, and furthermore a classifier may evaluate correlations between these terms or learn weights individually for each thematic distance value. Thus, for the distances introduced above, we will create one distinct distance term per topic index k = 1, ..., K, i.e.

∀k = 1, ..., K:

KL(T_e, T_m)_k = p_e(\phi_k) \log \frac{p_e(\phi_k)}{p_m(\phi_k)} \qquad (3.37)

sKL(T_e, T_m)_k = \frac{1}{2} \left( p_m(\phi_k) \log \frac{p_m(\phi_k)}{p_e(\phi_k)} + p_e(\phi_k) \log \frac{p_e(\phi_k)}{p_m(\phi_k)} \right) \qquad (3.38)

H(T_e, T_m)_k = \left( \sqrt{p_m(\phi_k)} - \sqrt{p_e(\phi_k)} \right)^2 \qquad (3.39)

JS(T_e, T_m)_k = \frac{1}{2} \left( p_m(\phi_k) \log \frac{p_m(\phi_k)}{r_k} + p_e(\phi_k) \log \frac{p_e(\phi_k)}{r_k} \right) \qquad (3.40)

As in Eqs. 3.33 to 3.36, p_e(φ_k) is the probability of topic φ_k in the context of entity e, p_m(φ_k) is the probability of topic φ_k in the context of mention m, and r_k = 0.5 · (p_m(φ_k) + p_e(φ_k)) in Eq. 3.40.

Now, to experimentally find the best distance representation for entity linking, we evaluate all of the above distances with different kernels using an SVM classifier from SVMLight with standard parameters (the results are given in Section 3.6.4).

To do so, we create for each thematic distance a feature vector D_{(·)}(T_m, T_e). To distinguish among the employed distance measures, we use the name of the distance as subscript and obtain:

D_KL(T_m, T_e)  = [KL(T_e, T_m)_1, ..., KL(T_e, T_m)_K]    ∈ [0.01, 1]^K   (3.41)
D_sKL(T_m, T_e) = [sKL(T_e, T_m)_1, ..., sKL(T_e, T_m)_K]  ∈ [0.01, 1]^K   (3.42)
D_JS(T_m, T_e)  = [JS(T_e, T_m)_1, ..., JS(T_e, T_m)_K]    ∈ [0.01, 1]^K   (3.43)
D_H(T_m, T_e)   = [H(T_e, T_m)_1, ..., H(T_e, T_m)_K]      ∈ [0.01, 1]^K   (3.44)

The elements of the vectors D_{(·)}(T_m, T_e) in Eqs. 3.41 to 3.44 are computed according to Eqs. 3.37 to 3.40, respectively. As each of these feature vector representations computes distance terms between corresponding topic probabilities explicitly, the maximum dimension is K, the number of topics in the underlying topic model. Technically, the range of each D_{(·)}(T_m, T_e) is therefore [0, 1]^K. However, we ignore a feature if both p_m(φ_k) and p_e(φ_k) are less than 0.01 and thus clamp the range of each D_{(·)}(T_m, T_e) to [0.01, 1]^K. This form of feature selection is based on the assumption that we need not spend modelling effort on the long tail of unimportant topics. It also has the side effect that the overall number of non-sparse features is rather low, which speeds up the kernel computation in the SVM classifier.
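A minimal sketch of this per-topic feature construction, including the 0.01 threshold described above, could look as follows. The function name, the kind argument, and the representation of dropped features as zeros (so that they remain sparse for the classifier) are our own assumptions:

```python
import numpy as np

def per_topic_features(p_m, p_e, kind="js", threshold=0.01, eps=1e-12):
    """Per-topic distance terms according to Eqs. 3.37-3.40.

    Features where both topic probabilities fall below `threshold`
    are set to zero, i.e. they stay sparse and are ignored.
    """
    if kind == "kl":                                    # Eq. 3.37
        terms = p_e * np.log((p_e + eps) / (p_m + eps))
    elif kind == "skl":                                 # Eq. 3.38
        terms = 0.5 * (p_m * np.log((p_m + eps) / (p_e + eps))
                       + p_e * np.log((p_e + eps) / (p_m + eps)))
    elif kind == "hellinger":                           # Eq. 3.39
        terms = (np.sqrt(p_m) - np.sqrt(p_e)) ** 2
    else:                                               # Eq. 3.40 ("js")
        r = 0.5 * (p_m + p_e)
        terms = 0.5 * (p_m * np.log2((p_m + eps) / (r + eps))
                       + p_e * np.log2((p_e + eps) / (r + eps)))
    terms = np.asarray(terms, dtype=float)
    # feature selection: drop topics that are unimportant in both contexts
    terms[(p_m < threshold) & (p_e < threshold)] = 0.0
    return terms
```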

In Pilz and Paaß [2011] we also evaluated a linear concatenation of the two probability distributions T_m and T_e, i.e. D(m, e) = [T_m, T_e] ∈ [0, 1]^{2K}. In this representation, only the part in T_e varies over a given set of candidates. This formulation showed by far the weakest performance compared to the other distance measures, even with a quadratic kernel that can model the interactions between the topics for e and m. Therefore, we give no further attention to this formulation and omit the obtained results here.

In our experiments, we follow Bunescu and Pasca [2006] and use as baseline feature the cosine similarity cos(m, e) (cf. Eq. 3.1). This baseline feature is used to evaluate directly matching words in the contexts of e and m in all of the following experiments.
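Since Eq. 3.1 is not reproduced in this section, the following sketch uses plain term frequencies as a simplified stand-in for the baseline feature cos(m, e); the exact weighting (e.g. tf-idf) of the thesis may differ:

```python
import math
from collections import Counter

def cosine_similarity(mention_context, entity_text):
    """Word-overlap cosine similarity between two context strings,
    computed on raw term-frequency vectors."""
    tf_m = Counter(mention_context.lower().split())
    tf_e = Counter(entity_text.lower().split())
    dot = sum(tf_m[w] * tf_e[w] for w in tf_m.keys() & tf_e.keys())
    norm = (math.sqrt(sum(v * v for v in tf_m.values()))
            * math.sqrt(sum(v * v for v in tf_e.values())))
    return dot / norm if norm else 0.0
```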

Application in Multiple Languages

As the underlying algorithm of topic models does not depend on the language of the corpus, topic models can be trained on corpora of different languages without modifying the algorithm. Hence, we may use topic modelling for entity linking in multiple languages as long as unique entity descriptions are available. We will empirically show that our proposed method generalizes to other languages: its application to entity linking against the English, German, and French versions of Wikipedia, currently the three largest Wikipedia versions, yields quite similar performance figures for all of these languages.

We build distinct topic models with 200 topics for each of these languages. To create training corpora, we extract 100k random documents, mostly articles describing persons, from the respective versions of Wikipedia and use the resulting collections as training documents for LDA. More specifically, the English topic model is built on articles derived from the English Wikipedia, the German topic model on articles derived from the German version, and analogously the French topic model is trained on French Wikipedia articles. We are aware that the LDA training corpus may contain some of the entity descriptions in the candidate pools W_c or example references used for training and evaluation. However, we argue that this is not too harmful, since this overlap will be very small due to random sampling. Furthermore, a strict separation of these datasets would not produce significantly different results, since we can infer topic distributions even when a document contains previously unseen words. In the very unlikely case that no word is known to the model, no context-based similarity measure will work.
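A minimal training sketch under these assumptions might look as follows; gensim stands in for the Mallet LDA implementation used in the thesis, and the corpus variables in the usage comments are hypothetical:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_language_topic_model(tokenized_docs, num_topics=200):
    """Train one LDA topic model per language on ~100k tokenized Wikipedia
    articles. The thesis relies on the Mallet implementation of LDA; the
    gensim LdaModel is used here only as a readily available stand-in."""
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5)
    return dictionary, lda

# One model per language, e.g. (hypothetical corpus variables):
# dict_en, lda_en = train_language_topic_model(english_wikipedia_docs)
# dict_de, lda_de = train_language_topic_model(german_wikipedia_docs)
# dict_fr, lda_fr = train_language_topic_model(french_wikipedia_docs)
```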

Apart from some language-specific adaptations, we use the same pre-processing techniques for all languages. We extract the plain text and stem it using the appropriate language settings for the Porter stemmer, and we exchange the respective stop word lists. Having trained a model, we use it to infer the topic probability distribution T_m for mention contexts as well as the topic probability distribution T_{e_i} for candidate contexts.
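The following sketch illustrates this pre-processing and inference step, with dictionary and lda coming from the training sketch above. NLTK's SnowballStemmer and stop word lists (which support "english", "german" and "french") are assumed here as stand-ins for the exact Porter stemmer settings and stop word lists of the thesis:

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def infer_topic_distribution(text, language, dictionary, lda, num_topics=200):
    """Stem and filter a context, then infer its topic distribution
    with the trained language-specific LDA model."""
    stemmer = SnowballStemmer(language)
    stops = set(stopwords.words(language))
    tokens = [stemmer.stem(w) for w in text.lower().split() if w not in stops]
    bow = dictionary.doc2bow(tokens)
    dist = np.zeros(num_topics)
    for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[k] = p
    return dist
```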

In preliminary experiments, we evaluated topic models with different values of K and different training corpora for the task of entity linking. That is, we varied K from 50 to 500 and increased the number of documents in the training corpus.

We also evaluated different combinations of topic models and formulated concatenated topic distributions through the concatenation of distributions derived from different models, i.e. T_m = [T_m,LDA_1, ..., T_m,LDA_k]. However, we found no major difference in predictive performance when increasing the number of topics above 200 or when varying training corpora or the topic distribution representation. Considering the hyperparameters α and β, no additional evaluation is necessary since the Mallet implementation of LDA automatically optimizes these parameters. So even when explicitly using different initial values of α and β, the learned models are more or less the same and yield the same or very similar performance.

In Pilz and Paaß [2009] and Pilz and Paaß [2012], we used a Ranking SVM to learn a linking model. Here, to learn a linking model based on thematic distances, we instead use a standard SVM classifier. We will next detail how entity linking is formulated as a classification problem.
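As an illustrative sketch of this classification setup (not the exact SVMLight configuration of the thesis), each mention-entity pair is encoded by one of the feature vectors from Eqs. 3.41 to 3.44, optionally extended by the cosine baseline, and labeled positive only for the correct entity; scikit-learn's SVC is assumed here as a stand-in:

```python
import numpy as np
from sklearn.svm import SVC

def train_linking_classifier(feature_vectors, labels, kernel="rbf"):
    """Train a binary SVM on mention-entity pairs.

    feature_vectors: one row per pair, e.g. D_JS(T_m, T_e) from Eq. 3.43,
                     optionally extended with the cosine baseline feature.
    labels:          1 if the candidate is the correct entity, 0 otherwise.
    The thesis uses SVMLight; SVC is only an illustrative stand-in.
    """
    X = np.asarray(feature_vectors)
    y = np.asarray(labels)
    return SVC(kernel=kernel).fit(X, y)
```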
