
3.4 Latent Dirichlet Allocation


Since its introduction by Blei et al. [2003], topic modelling by Latent Dirichlet Allocation (LDA) achieved very much attention in a growing number of research fields, ranging from text analysis to computer vision. In the context of natural language processing, topics generated by LDA are clusters of words that often co-occur. These topics are used for example to find related documents, to summarize documents or create the input for other text categorization tasks (Blei et al. [2003], Griffiths and Steyvers [2004], Rubin et al. [2012] among many others).

LDA has at its core a Bayesian probabilistic model that describes document corpora in a fully generative way.



[Figure 3.1 shows the graphical model with nodes α (Dirichlet prior on the per-document topic distributions), θ (per-document topic distributions), z (word topic assignments), w (words), φ (topics) and β (Dirichlet prior on the per-topic word distributions), with plates over the Nd words of a document, the D documents and the K topics.]

Figure 3.1: Plate notation of smoothed LDA after Blei et al. [2003]. The plates (rectangles) represent repetitions of variables (circles) in the graphical model. The outer plate represents a collection of D documents, the inner plate represents the repeated choice of topics and words within a document. The observable variables, i.e. words, are shaded in grey.

LDA assumes a fixed number K of underlying topics in a document collection, where each document is a mixture of topics and is generated by first picking a distribution over the latent topics. Given this mixture, the topic of each word is chosen and, given their topics, the observable variables, i.e. the words, are generated. This process is depicted in Fig. 3.1, formally described in the following enumeration, and illustrated by the code sketch after it.

Assume we have a vocabulary V of |V| words and want to generate D documents of sizes N1, . . . , ND:

1. Randomly draw the overall topic distributions φk ∼ Dir(β), ∀ k = 1, . . . , K, with φk ∈ R^|V|, φk,i ≥ 0 and ∑_{i=1}^{|V|} φk,i = 1. K is a fixed number that specifies the number of latent topics assumed in the corpus. The parameter β ∈ (0, ∞)^|V| is the prior vector on the per-topic word distribution.

2. Randomly draw document-specific topic proportions θd ∼ Dir(α), ∀ d = 1, . . . , D, with θd ∈ R^K, θd,k ≥ 0 and ∑_{k=1}^{K} θd,k = 1. The probability vector θd describes the distribution of topics in document d. The parameter α, also a positive vector of dimension K, is the concentration parameter of the Dirichlet prior on the per-document topic distributions.

3. For each of the words wd,n, ∀ d = 1, . . . , D, ∀ n = 1, . . . , Nd:

a) Randomly draw a topic zd,n ∼ Multinomial(θd), zd,n ∈ {1, . . . , K}.

b) Finally, the observed word wd,n ∈ V is randomly drawn from the distribution of the selected topic: wd,n ∼ Multinomial(φzd,n).
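To make this sampling procedure concrete, the following sketch generates a toy corpus with NumPy. It illustrates the generative process only and is not the setup used in this thesis; the values of K, V, D, the document lengths and the priors α and β are invented for demonstration.

```python
# A toy sketch of the generative process above using NumPy. K, V, D,
# the document lengths N_d and the priors alpha and beta are invented
# illustration values, not the settings used in this thesis.
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 5, 1000, 10               # topics, vocabulary size, documents
N = rng.integers(50, 200, size=D)   # document lengths N_1, ..., N_D
beta = np.full(V, 0.01)             # prior on per-topic word distributions
alpha = np.full(K, 0.1)             # prior on per-document topic distributions

# Step 1: draw the K topic-word distributions phi_k ~ Dir(beta).
phi = rng.dirichlet(beta, size=K)   # shape (K, V)

documents = []
for d in range(D):
    # Step 2: draw the document-specific topic proportions theta_d ~ Dir(alpha).
    theta_d = rng.dirichlet(alpha)
    words = []
    for _ in range(N[d]):
        # Step 3a: draw a topic z_{d,n} ~ Multinomial(theta_d).
        z = rng.choice(K, p=theta_d)
        # Step 3b: draw the observed word w_{d,n} ~ Multinomial(phi_{z_{d,n}}).
        words.append(rng.choice(V, p=phi[z]))
    documents.append(words)
```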

In Fig. 3.1, the repeated draws of the latent and observable variables (circles) are depicted by the plates (rectangles).


The fundamental idea in probabilistic topic models is that the words of a document d are generated according to a mixture of topic distributions, where the mixture depends on the document-specific mixture weights θd. LDA introduces a Dirichlet prior on θ and in this way extends Probabilistic Latent Semantic Indexing (Hofmann [1999]), which makes no assumption on the prior distributions. The Dirichlet distribution, a conjugate prior for the Multinomial distribution, is a convenient choice as prior, simplifying the problem of statistical inference. Using a Dirichlet prior for the topic distribution θ results in a smoothed topic distribution, with the amount of smoothing determined by the parameter α = (α1, . . . , αK). Each parameter αk can be interpreted as a prior observation count for the number of times topic k is sampled in a document, before having observed any actual words from that document (Gelman et al. [2013]).
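This prior-count interpretation follows from the conjugacy of the Dirichlet and the Multinomial distribution: conditioned on the topic counts of a document, the posterior over θd is again a Dirichlet with parameters α plus the counts. A minimal numerical sketch, with invented toy counts, makes the resulting smoothing explicit.

```python
# Numerical illustration (toy counts, hypothetical values) of the
# prior-count reading of alpha: by Dirichlet-Multinomial conjugacy,
# observing topic counts n in a document turns the prior Dir(alpha)
# into the posterior Dir(alpha + n), whose mean smooths the raw counts.
import numpy as np

alpha = np.array([0.1, 0.1, 0.1])   # symmetric prior over K = 3 topics
n = np.array([8, 2, 0])             # topic counts observed in a document
posterior_mean = (alpha + n) / (alpha.sum() + n.sum())
print(posterior_mean)               # ~ [0.786, 0.204, 0.010]:
                                    # topic 3 keeps a small non-zero weight
```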

The nature and influence of the priors α and β were studied by Wallach et al. [2009]. The authors empirically found that an asymmetric prior α over document-topic distributions combined with a symmetric prior β over topic-word distributions gives the best results. In the symmetric prior, all β1, . . . , β|V| have the same value; in the asymmetric prior, the α1, . . . , αK are allowed to take different values. These findings are implemented in the toolkit Mallet (McCallum [2002]), which optimizes the prior α according to the underlying collection within a Markov Chain Monte Carlo method. Mallet uses an implementation of Gibbs sampling, namely SparseLDA (Yao et al. [2009]), a statistical technique for quickly constructing a sample distribution. It repeatedly samples a topic for each word in each document using the distributions defined by the model.

After a number of sampling iterations this process converges to a stationary state in which the topic probability distribution of each word in a document remains constant. All topic models used in this thesis are generated using this software1.
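To illustrate what this repeated resampling looks like, the following sketch implements a plain collapsed Gibbs sampler for LDA. It is a simplified stand-in for Mallet's SparseLDA, without the sparsity optimizations and without hyperparameter optimization; the symmetric priors, default values and variable names are assumptions made for the example.

```python
# A compact sketch of collapsed Gibbs sampling for LDA (the general
# technique behind Mallet's SparseLDA, shown here without its sparsity
# optimizations). `docs` is a list of word-id lists over a vocabulary
# of size V; alpha and beta are symmetric priors.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))      # topic counts per document
    n_kw = np.zeros((K, V))              # word counts per topic
    n_k = np.zeros(K)                    # total words per topic
    z = []                               # current topic of each word

    # random initialisation of the topic assignments
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k              # resample and restore the counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```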

To infer the topic distribution for a new document, the topic distribution is sampled in the same way as during training. Given the set of observed words, LDA estimates which topic configuration is most likely to have generated the data by sampling a distribution based on the word counts. The probability of topic φk for a document is then the average of the probabilities of topic φk over the words w in this document.
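The following toy illustration shows this averaging; the per-word topic probabilities are invented values of the kind produced by the sampler.

```python
# Toy illustration of the averaging described above: the topic
# probabilities of a document are the means of the per-word topic
# probabilities (all numbers invented for a 4-word document, K = 3).
import numpy as np

p_word_topic = np.array([[0.8, 0.1, 0.1],
                         [0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.6, 0.3, 0.1]])
doc_topics = p_word_topic.mean(axis=0)
print(doc_topics)   # -> [0.55, 0.35, 0.10]
```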

The properties described above allow topic models based on LDA to alleviate the problems of synonymy and polysemy. Synonyms such as car and automobile have the same meaning, usually co-occur in similar contexts and hence usually belong to the same topic. On the other hand, since LDA is a generative model, there is no notion of mutual exclusivity: words may belong to more than one topic. This allows LDA to capture polysemy: depending on the context at hand, different topics will be assigned to a word like plant. If a document is mostly about industry, LDA will assign a topic that hints at the industrial plant; if the document is mostly about biology, LDA will assign a topic that hints at the biological plant.

1 The newest version of this software is available at http://mallet.cs.umass.edu



[Figure 3.2 depicts example topics as lists of their top stemmed words, grouped under manually assigned labels:
music: album, releas, record, featur, produc, singl, rapper, rap, music, group | music, piano, compos, work, orchestra, symphoni, perform, record, composit, concert
nations: spanish, spain, portugues, portug, madrid, barcelona, lisbon, pedro, alfonso, jose | africa, african, south, parker, cape, ghana, nation, coloni, kenya, boer
education: school, univers, educ, year, studi, work, high, student, attend, colleg | univers, professor, studi, scienc, research, institut, work, receiv, colleg, degre
politics: senat, democrat, elect, vote, republican, campaign, state, hous, polit, support | state, governor, serv, elect, senat, term, repres, democrat, law, unit
film, stage: play, theatr, stage, role, perform, actor, product, appear, film, drama | seri, role, televis, appear, episod, star, play, season, born, charact
sport: game, season, nba, team, point, basketbal, play, player, career, coach | play, team, cup, club, footbal, goal, career, nation, season, score]

Figure 3.2: Topics from a topic model (K = 200) trained with 100k Wikipedia articles. Each topic is depicted by its associated words and a manually assigned label (grey box). The order of appearance reflects the importance of a word for a topic.

Another example is the music topic in Fig. 3.2. As the figure shows, the word music appears in two music-related topics, one more about classical music and the other more about modern music. Similarly, the word season appears both in the basketball and the soccer topic in Fig. 3.2.

3.4.1 Topic Distributions as Context Descriptions

Vanilla LDA is based on the bag-of-words assumption, as the only relevant information is the number of times words are generated in a document. However, it allows for a more general document representation. LDA effectively generates low-dimensional representations from sparse high-dimensional data and is thus a means to substitute high-dimensional bag-of-words vectors with low-dimensional topic mixture vectors. Accordingly, we may represent a document as a mixture of K topics that summarizes the main content of the document, relative to the topic model. At the same time, the representation via topic clusters provides a generalization to a wider context, as topic clusters potentially grasp more information than the implicitly expressed, and thus latent, information carried by the observed words in a text document.
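To sketch how such K-dimensional topic mixtures can serve as compact document representations, two documents can be compared directly on their topic vectors, for instance with the Hellinger distance, a common choice for probability distributions. The vectors below are invented toy examples, not distributions from the models used in this thesis.

```python
# Comparing documents via their K-dimensional topic mixtures using the
# Hellinger distance between probability vectors (toy K = 4 mixtures).
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions p and q
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

doc_a = np.array([0.70, 0.20, 0.05, 0.05])
doc_b = np.array([0.65, 0.25, 0.05, 0.05])
doc_c = np.array([0.05, 0.05, 0.10, 0.80])
print(hellinger(doc_a, doc_b))   # small distance: similar contexts
print(hellinger(doc_a, doc_c))   # large distance: different contexts
```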

To illustrate how topics describe entities, Fig. 3.3 summarizes the topic distributions for the contexts of three entities called John Taylor from the English Wikipedia.


[Figure 3.3 lists, for each entity, its most probable topics with their top words:
John Taylor (bass guitarist): φ18 (band, album, play, guitar, record, releas, tour, rock, music, solo, song, guitarist, bass, member, perform, . . . ) with pe(φ18) = 30%; φ70 (album, releas, song, music, singl, record, perform, chart, hit, tour, artist, produc, new, track, singer, . . . ) with pe(φ70) = 16%.
John Taylor (athlete): φ80 (olymp, world, medal, won, championship, athlet, gold, record, summer, game, metr, compet, second, win, . . . ) with pe(φ80) = 42%; φ135 (american, pennsylvania, washington, state, new, philadelphia, maryland, john, unit, connecticut, delawar, . . . ) with pe(φ135) = 10%.
John Taylor (jazz): φ141 (jazz, record, band, play, music, musician, perform, album, group, orchestra, trumpet, pianist, work, compos, blue, . . . ) with pe(φ141) = 58%.]

Figure 3.3: Topics for concrete entities in the English Wikipedia. The figure shows the most important words of the topics φk with the highest probability pe(φk) to have generated the article texts text(e) of the respective entities.

The depicted topics are generated from a topic model with K = 200 that was trained on a random selection of 100k Wikipedia articles describing persons and used to infer the topic probability distributions on the article texts of the entities John Taylor (athlete), John Taylor (bass guitarist) and John Taylor (jazz). For each of these entities, Fig. 3.3 shows the two topics φk that have the highest probability pe(φk) to have generated the article texts text(e) of the respective entities.

Each topic φk is depicted by a selection of words associated with it. The association of words and topics is based on the probability that a topic has generated a word.

Here, a selection of high-probability words is shown. These words can be interpreted as important for a topic and can also be understood as a summary of an entity’s article text. This example also illustrates the dimensionality reduction provided by LDA: the three entities here are well described by only one or two topic clusters, which also enable a distinction among these entities at first glance.

For example, the most prominent topic φ80 derived for John Taylor (athlete) describes his sporting success at the Olympic Games. The topic φ135 with lower probability can be interpreted as an indicator of his nationality. This entity is described by a rather short article text, and therefore less informative topics such as the nationality topic may get a higher weight. Note that we should generally take document length into account, as this length influences the total word count and hence also the inferred topic distribution of the document. In very short documents we often find one very prominent topic; longer documents usually show a higher variety, but nevertheless a large number of topics have a near-zero probability.

To distinguish among topic distributions for entity and mention context, we use the following notation.

Notation (Topic distribution over mention and entity context)

We denote the probability distribution of K topics in the mention context text(m) with

Tm = T(text(m)) = (pm(φ1), . . . , pm(φK)),

