
3.5 Semantic Labelling of Entities

3.5.2 Semantic Labels from Topics

Since they group articles by subject, Wikipedia categories can also be interpreted as document labels. However, they do not express relevance or any other weighting scheme comparable to a topic distribution. Moreover, categories are manually assigned by contributors and hence subject to the individual taste of an author. Even though there exist clear guidelines on the assignment of existing and the creation of new Wikipedia categories, these are not necessarily strictly followed. Analysing Wikipedia categories, we observed that categories can be very general but also overly specific. One example from the German Wikipedia is the category Träger des Bundesverdienstkreuzes (Großkreuz in besonderer Ausführung), which applies only to the two entities Konrad Adenauer and Helmut Kohl. Notably, the latter entity even has its own Wikipedia category. On the other hand, categorization may also be incomplete and thus not fully descriptive for an entity.

Assuming that topic distributions provide a more expressive summary of the article text, we propose a model that replaces Wikipedia categories with semantic labels derived from topic probability distributions, i.e. topic indices. This method was published in Pilz and Paaß [2009], the first kernel-based entity linking approach using the German Wikipedia.

For further motivation, we give an example that compares the information content of topic distributions and Wikipedia categorization for two entities from the German Wikipedia: the politician Willi Weyer (Politiker)¹ and the soccer player Willi Weyer (Fußballspieler). For this, we use a topic model with K = 200 topics.

¹ While the original title does not contain a disambiguation term, we use one here for better distinction.

Willi Weyer (Politiker)
  φ67 (p_e(φ67) = 12%): Vorsitz, Abgeordnet, SPD, FDP, CDU, Bundestag, Wahlkreis, . . .
  φ186 (p_e(φ186) = 10%): Prasident, Regier, Amtszeit, Minist, Ministerprasident, . . .
  c(e): Finanzminister (NRW), Bundestagsabgeordneter, FDP-Mitglied, Sportfunktionär

Willi Weyer (Fußballspieler)
  φ168 (p_e(φ168) = 17%): Saison, Tor, Fussballspiel, Mannschaft, Bundesliga, Sturm, . . .
  φ122 (p_e(φ122) = 9%): Koln, Dusseldorf, Kurt, Rot, Aach, Bernd, Wuppertal, Willi, . . .
  c(e): Fußballspieler (Deutschland)

Figure 3.4: The most important words of topics φ_k for the entities Willi Weyer (Politiker) and Willi Weyer (Fußballspieler), along with each entity's categories c(e). Topics are automatically inferred by LDA; categories are manually assigned by Wikipedia contributors.

The model was trained on 100K randomly selected articles from the German Wikipedia that describe persons. From these articles we extracted the full text, removed markup, and used words in stemmed form, with stems obtained from the German version of the Snowball algorithm (Porter [2001]). Using this model, we inferred the topic distributions summarized in Fig. 3.4 for our example entities. As depicted in the figure, we observe two topics for the politician that represent his occupation. For the soccer player, we also find one dominant topic describing his occupation, along with a second topic showing the names of cities that correspond to the football teams he was involved with, e.g. Cologne.
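To make this setup concrete, the following minimal sketch illustrates such a training pipeline. The tooling (gensim for LDA, NLTK's German Snowball stemmer) and the placeholder articles are our own assumptions for illustration, not the implementation used in the original experiments.

```python
# Minimal sketch of the pipeline described above (assumed tooling:
# gensim for LDA, NLTK's German Snowball stemmer; not the original code).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()

def preprocess(text: str) -> list[str]:
    # Markup removal is assumed to have happened already; we lowercase,
    # tokenize naively and stem. The Snowball stemmer maps umlauts to
    # base vowels (e.g. "Präsident" -> "prasident"), which is why the
    # stems shown in Fig. 3.4 lack umlauts.
    return [stemmer.stem(tok) for tok in text.lower().split()]

# person_articles: plain-text person articles (placeholder examples)
person_articles = [
    "Willi Weyer war ein deutscher Politiker der FDP ...",
    "Willi Weyer war ein deutscher Fussballspieler ...",
]
docs = [preprocess(a) for a in person_articles]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# K = 200 topics as in the experiment described above
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=200)

# Document-topic distribution T_e for one candidate entity context,
# i.e. the probabilities p_e(phi_k) used as semantic labels below
T_e = lda.get_document_topics(corpus[0], minimum_probability=0.0)
```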

In contrast, the information covered by the respective Wikipedia categories differs notably. The politician is assigned several categories related to his affiliation, for instance his appointment as finance minister (Finanzminister (NRW)²) or his political party (FDP-Mitglied). Comparing the manually assigned categories and the automatically generated LDA topics, we find a strong semantic overlap here. However, the soccer player Willi Weyer (Fußballspieler) is assigned only one category, Fußballspieler (Deutschland), which expresses his profession and nationality. The inferred topic distribution also expresses his profession, but furthermore relates him to the city of Cologne and thus hints at the soccer club he was involved with. While this information is expressed only latently rather than explicitly, the inferred topic distribution appears to carry considerably more information than the assigned category.

² NRW is the acronym for the German federal state Nordrhein-Westfalen.

Note that the indices of the topics in this example, i.e. φ67, φ186, φ168, and φ122, may serve as abstract labels for their specific distributions over words. Even though these semantic labels have limited interpretability for a human, at least without knowledge of the associated words, we argue that they can be used as a replacement for the manually assigned Wikipedia categories. We also assume that, since topic models rely on the article text and not on a contributor's intuition, they potentially yield a more representative assignment of labels.

Having motivated that topic distributions can be interpreted as semantic multi-label assignments, we will now describe the entity linking model built upon this idea. This model is inspired by Bunescu and Pasca's word-category correlation method, but replaces Wikipedia categories with topic assignments and is therefore independent of error-prone and costly manual document categorization.

We evaluate word-topic correlation (WTC) by correlating each common word w_i ∈ text(m) ∩ text(e) with the topic distribution T_e of the candidate context. This is analogous to the formulation of cWCC (Eq. 3.6), where we assumed that words shared by the two contexts are more influential and that entity-specific feature selection is also useful. So here we have a word-topic-correlation dictionary that pairs each common word w_i from the mention context with the probability p_e(φ_k) of a topic in the candidate context, i.e.

$$ V_W \times T_e = \{ (w_i, p_e(\phi_k)) \}, \quad w_i \in \text{text}(m) \cap \text{text}(e), \; k = 1, \ldots, K. \tag{3.7} $$

Building upon Eq. 3.6, we substitute categories with topics, and binary values with document-topic probability values p_e(φ_k), k = 1, . . . , K, i.e. the probability of each specific topic in the context of candidate entity e. More specifically, for each candidate e(m) ∈ E(m), where E(m) denotes the set of candidate entities for mention m, we create feature vectors according to

For w_i ∈ text(m) ∩ text(e) ⊆ V_W, φ_k with k = 1, . . . , K, and e(m) ∈ E(m):

$$ x_{\text{WTC}}(m, e) = \begin{cases} p_e(\phi_k), & \forall (w_i, \phi_k) \in \{ w_i \in \text{text}(m) \cap \text{text}(e) \} \times T_e \\ 0, & \text{else.} \end{cases} \tag{3.8} $$

With the formulation above, the vector x_WTC(m, e) representing the word-topic correlation for a mention-entity pair (m, e) contains K probability values for every common word w_i that appears in both the mention's and the candidate's context. The maximum dimension of such a vector is then K · |V_W|, where at most K · |text(m) ∩ text(e)| entries have non-zero values.
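To illustrate Eqs. 3.7 and 3.8, here is a minimal sketch of this feature construction. The function name, the dense-list representation of T_e, and the topic-major vector layout are our own illustrative choices, not the original implementation.

```python
from typing import List

def build_wtc_vector(mention_ctx: List[str],
                     entity_ctx: List[str],
                     vocab: List[str],
                     topic_dist: List[float]) -> List[float]:
    """Sketch of the sparse WTC features per Eq. 3.8.

    topic_dist[k] holds p_e(phi_k) for the candidate entity's context,
    so len(topic_dist) == K; vocab plays the role of V_W.
    """
    K = len(topic_dist)
    common = set(mention_ctx) & set(entity_ctx)   # text(m) ∩ text(e)
    x = [0.0] * (K * len(vocab))                  # dimension K * |V_W|
    for i, w in enumerate(vocab):
        if w in common:
            for k in range(K):
                # the entry for the pair (w_i, phi_k) gets p_e(phi_k),
                # i.e. the dictionary pairing of Eq. 3.7
                x[k * len(vocab) + i] = topic_dist[k]
    return x
```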

Analogously to the feature vector representations of WCC (Eq. 3.5) and cWCC (Eq. 3.6), the vector x_WTC(m, e) would be empty if the mention context and the candidate context share no common word. Therefore, we extend this feature vector with the cosine similarity (Eq. 3.1) as a baseline feature, which in such cases evaluates to cos(text(m), text(e)) = 0. As for WCC, we also give a small example here for better illustration.

Example 6 (Word-Topic-Correlation (WTC))

Assume a topic model with K = 2, built over the contexts of two entities e1 and e2, with text(e1) = {w1, w2} and text(e2) = {w3, w4}. This results in the word-topic dictionary

$$ V_W \times T_e = \{ (w_1, \phi_1), (w_1, \phi_2), (w_2, \phi_1), (w_2, \phi_2), (w_3, \phi_1), \ldots, (w_4, \phi_2) \}. $$


Further, assume the topic distribution T_e1 = {0.3, 0.7} and a mention context text(m) = {w1, w2}. According to Eq. 3.8, the vector x_WTC(m, e1) representing the pair of candidate e1 and mention m is composed of:

$$ x_{\text{WTC}}(m, e_1) = \begin{cases} p_{e_1}(\phi_1) = 0.3, & \forall (w, \phi_k) \in \{ (w_1, \phi_1), (w_2, \phi_1) \} \\ p_{e_1}(\phi_2) = 0.7, & \forall (w, \phi_k) \in \{ (w_1, \phi_2), (w_2, \phi_2) \} \\ 0, & \text{else.} \end{cases} $$

The full instantiation of this vector is given by

$$ x_{\text{WTC}}(m, e_1) = [\underbrace{0.3}_{(w_1, \phi_1)}, \underbrace{0.3}_{(w_2, \phi_1)}, \underbrace{0}_{(w_3, \phi_1)}, \underbrace{0}_{(w_4, \phi_1)}, \underbrace{0.7}_{(w_1, \phi_2)}, \underbrace{0.7}_{(w_2, \phi_2)}, \underbrace{0}_{(w_3, \phi_2)}, \underbrace{0}_{(w_4, \phi_2)}] \in [0, 1]^{K \cdot |V_W|}. $$

Since text(m) ∩ text(e2) = ∅, the vector x_WTC(m, e2) representing the pair (m, e2) has no word-topic correlation features and contains only a zero representing the cosine similarity of the contexts.
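Assuming the hypothetical build_wtc_vector sketch given earlier, Example 6 can be checked numerically as follows; the cosine baseline of Eq. 3.1 would be appended separately.

```python
vocab = ["w1", "w2", "w3", "w4"]            # V_W
T_e1 = [0.3, 0.7]                           # p_e1(phi_1), p_e1(phi_2)
text_m, text_e1 = ["w1", "w2"], ["w1", "w2"]

x = build_wtc_vector(text_m, text_e1, vocab, T_e1)
# x == [0.3, 0.3, 0.0, 0.0, 0.7, 0.7, 0.0, 0.0]
# ordering: (w1,phi1), (w2,phi1), (w3,phi1), (w4,phi1), (w1,phi2), ...

# For e2 there is no word overlap: the WTC part is all zeros, and only
# the appended cosine baseline (here cos = 0) remains.
```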

Having detailed the feature design of our method WTC and that of its inspiration WCC, proposed in Bunescu and Pasca [2006], we now come to the machine learning method that exploits these designs in order to learn a model for entity linking. This model is based on ranking candidate entities with respect to a given mention and its context, and defines a feature-based threshold learning for the detection of uncovered entities. We will then use this method in our experiments to compare WTC and WCC for person name disambiguation in German.
