

2.3 Sequence Prediction

2.3.2 Hidden Markov Models with Latent Annotations

Hidden Markov models with latent annotations (HMM-LA) were introduced by Huang et al. (2009). They are an adaptation of a similar algorithm developed for probabilistic context-free grammars (Petrov et al., 2006). We already discussed that the Markov assumption made during training has an important effect on the runtime of the Viterbi algorithm. HMM-LAs soften the Markov assumptions of traditional HMMs by increasing the number of states. The approach is thus similar to increasing the order of the HMM, but differs in that the states are trained in a way that maximizes the likelihood of the training data. The procedure consists of two iteratively executed steps. The first step is called “splitting” and splits every HMM state into two similar substates. Then the expectation–maximization (EM) algorithm is applied to optimize the parameters of the new model. In the second step, called “merging”, the improvement in likelihood that every single split provides is estimated, and a certain fraction of the splits with the lowest gains is reversed. The two steps are then iterated until a certain number of states is reached. As an example from part-of-speech tagging: it might be beneficial to split the PoS tag “Noun” into common nouns and proper nouns in the first iteration, and in the second iteration proper nouns could be split into company names (often followed by Corp. or Inc.) and person names (often preceded by Mr.). On the other hand, it makes little sense to have a high number of determiner tags, so we would expect the determiner splits to be reversed in the merging step.
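Before turning to the details, the following runnable toy sketch illustrates the split-merge bookkeeping on the level of state names only (no probabilities or EM): every state is split into two substates and, after the gains have been estimated, unprofitable splits such as the determiner split from the example are reversed. The naming scheme with the suffixes _0 and _1 and the helper names are illustrative assumptions, not part of Huang et al.'s (2009) implementation.

```python
# Schematic split-merge bookkeeping on state names only; probabilities, EM and
# the gain estimation are omitted here and discussed in the following parts.
def split_state_names(states):
    # "Splitting": every state y becomes two substates y_0 and y_1.
    return {y: (y + "_0", y + "_1") for y in states}

def merge_back(splits, splits_to_revert):
    # "Merging": reverse the splits that were judged not worthwhile.
    merged = []
    for parent, (y0, y1) in splits.items():
        merged.extend([parent] if parent in splits_to_revert else [y0, y1])
    return merged

states = ["Noun", "Det", "Verb"]
splits = split_state_names(states)
# Suppose the estimated gains show that splitting the determiner tag does not pay off:
print(merge_back(splits, {"Det"}))  # ['Noun_0', 'Noun_1', 'Det', 'Verb_0', 'Verb_1']
```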

EM

The EM algorithm estimates models over unobserved latent variables (Dempster et al., 1977). It was originally proposed as a method for handling incomplete data sets. The algorithm finds a local optimum of the marginal likelihood:

LL_\theta(D) = \sum_{x \in D} p(x|\theta) = \sum_{x \in D} \sum_{z} p(x, z|\theta)    (2.32)

where D is a data set of observed items x, the z are the values of the unobserved latent variables and θ is a model. EM iteratively improves θ: in the expectation step, it calculates estimated counts using the current model:

\hat{c}_{D,\theta_t}(x, z) = c_D(x) \cdot p(z|x, \theta_t) = c_D(x) \cdot \frac{p(x, z|\theta_t)}{\sum_{z'} p(x, z'|\theta_t)}    (2.33)

from which a new model can be learned in the maximization step:

\theta_{t+1} = \arg\max_{\theta} \sum_{x,z} \hat{c}_{D,\theta_t}(x, z) \cdot \log p(x, z|\theta)    (2.34)

It can be shown that this procedure produces models with monotonically increasing likelihood (Dempster et al., 1977). As we already discussed, EM is used in the training of HMM-LAs, where we observe input and output symbols and need to estimate the frequencies of the unobserved substates. In this case p(z|x, θ_t) cannot be computed by summation over all possible sequences z, as there are exponentially many. We thus use a more efficient dynamic program similar to the Viterbi algorithm.
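To make the two steps concrete, here is a minimal sketch of EM for a categorical mixture, i.e. a single observed symbol x with a latent class z: the E-step turns the observed counts c_D(x) into the expected counts of Eq. (2.33), and the M-step re-estimates p(z) and p(x|z) as relative frequencies of those counts, which maximizes Eq. (2.34) in closed form. The function name, the array layout and the toy counts are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def em_categorical_mixture(counts, num_z=2, iters=50, seed=0):
    """counts[v] = observed frequency c_D(v) of symbol v; num_z latent classes."""
    rng = np.random.default_rng(seed)
    num_x = len(counts)
    p_z = np.full(num_z, 1.0 / num_z)                    # p(z)
    p_x_given_z = rng.dirichlet(np.ones(num_x), num_z)   # p(x|z), shape (num_z, num_x)
    for _ in range(iters):
        # E-step: posterior p(z|x, theta_t) and expected counts c-hat(x, z), Eq. (2.33).
        joint = p_z[:, None] * p_x_given_z               # p(x, z | theta_t)
        post = joint / joint.sum(axis=0, keepdims=True)  # p(z | x, theta_t)
        c_hat = counts[None, :] * post                   # c-hat_{D,theta_t}(x, z)
        # M-step: relative frequencies of the expected counts, Eq. (2.34).
        p_z = c_hat.sum(axis=1) / c_hat.sum()
        p_x_given_z = c_hat / c_hat.sum(axis=1, keepdims=True)
    return p_z, p_x_given_z

# Toy usage: counts of four observed symbols.
counts = np.array([40.0, 10.0, 8.0, 42.0])
p_z, p_x_given_z = em_categorical_mixture(counts)
```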

Forward-Backward

The forward-backward algorithm allows us to calculate the posterior probabilities needed for the EM E-step. Given an input sequence, the forward-backward algorithm calculates the posterior probability of a transition from z to z' at position t. This posterior can be decomposed into a product of three probabilities: the probability of all output sequences ending in z at position t, the probability of a transition from z to z', and the probability of all output sequences starting with z' at position t + 1. While the transition probability is a simple model parameter, the other two probabilities have to be calculated using dynamic programs. We start with the probability of all sequences ending in z at position t; the corresponding program is known as the forward algorithm and is similar to the Viterbi v matrix, except that paths are combined by summation instead of maximization:

\alpha_{0,\text{start}} = 1
\alpha_{t,z} = \sum_{z'} \alpha_{t-1,z'} \cdot p(z|z') \cdot p(x_t|z)

where, for the sake of simplicity, we ignore the dependencies on θ, x and z in the notation. The probability of all sequences starting at position t in symbol z' can be calculated similarly:

\beta_{T+1,\text{stop}} = 1
\beta_{t,z} = \sum_{z'} \beta_{t+1,z'} \cdot p(z'|z) \cdot p(x_t|z)

The probability of a transition at position t is then given by:

p(x, z, z'|t) = \alpha_{t,z} \cdot p(z'|z) \cdot \beta_{t+1,z'}    (2.35)

Normalizing by the sum over all sequences, we obtain:

p(z, z'|x, t) = \frac{\alpha_{t,z} \cdot p(z'|z) \cdot \beta_{t+1,z'}}{\alpha_{T+1,\text{stop}}}    (2.36)

Just as for Viterbi, the time complexity of the FB algorithm is dominated by the summation over all possible transitions at a specific position and thus grows polynomially with the number of states and exponentially with the model order. The posterior probability then allows us to estimate the frequencies needed for HMM training: the frequency of a state following another state and of a state co-occurring with a specific input symbol:

\hat{c}_{D,\theta_t}(z, z') = \sum_{x,t} c_D(x) \cdot p(z, z'|x, t)

\hat{c}_{D,\theta_t}(z, x) = \sum_{x,z',t} c_D(x) \cdot p(z, z'|x, t)
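The following sketch implements the forward and backward recursions in the convention used above (β_{t,z} covers the emissions from position t onwards) and returns the transition posteriors of Eq. (2.36). As a simplification it uses an initial state distribution pi instead of the explicit start and stop states, so the normalizer is the sum of the final forward values rather than α_{T+1,stop}; the function name and the array layout are illustrative assumptions.

```python
import numpy as np

def forward_backward(pi, trans, emit, x):
    """pi: (Z,) initial distribution, trans[z, z2] = p(z2|z),
    emit[z, v] = p(v|z), x: sequence of observed symbol ids."""
    T, Z = len(x), len(pi)
    alpha = np.zeros((T, Z))
    beta = np.zeros((T, Z))
    # Forward: alpha[t, z] covers the emissions x_1..x_t and ends in state z.
    alpha[0] = pi * emit[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, x[t]]
    # Backward: beta[t, z] covers the emissions x_t..x_T starting in state z.
    beta[T - 1] = emit[:, x[T - 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = emit[:, x[t]] * (trans @ beta[t + 1])
    likelihood = alpha[T - 1].sum()   # plays the role of alpha_{T+1,stop}
    # Posterior of a transition z -> z' at position t, Eq. (2.36).
    xi = np.zeros((T - 1, Z, Z))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * trans * beta[t + 1][None, :] / likelihood
    return alpha, beta, xi
```

Summing the returned posteriors over positions, and over sequences weighted by c_D(x), yields the expected transition counts above; restricting the sum to positions where x_t equals a given symbol (and marginalizing z') yields the expected emission counts.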

Here we derived the forward-backward computation for an unrestricted HMM. In the case of HMM-LA training we already know the correct output symbols and just need to calculate the probabilities of the possible substates. We thus calculate probabilities of the form p(z_y, z_{y'}|x, t, y, y'). This can easily be achieved by updating the forward and backward definitions:

\alpha'_{t,z_y} = \sum_{z'_{y'} \in \Omega(y')} \alpha'_{t-1,z'_{y'}} \cdot p(z_y|z'_{y'}) \cdot p(x_t|z_y)    (2.37)

where Ω(y) is the set of all substates of y, and the sum runs over the substates z'_{y'} of the output symbol y' annotated at position t−1.

With EM as the method to adjust the latent substates to the training set, we can continue with our description of split-merge training for HMMs. The procedure starts by collecting frequency counts for a bigram HMM from an annotated corpus. We denote the transition frequency of state y' following y by c_{y,y'} and the emission frequency of symbol x occurring with state y by c_{y,x}. The training procedure consists of iteratively splitting every tag symbol into two latent subsymbols and adjusting the resulting latent model to the training set using expectation–maximization (EM) training. It then approximates the gain in likelihood (L) of every split and reverts splits that yield little increase and would needlessly increase the complexity of the model.

Splitting

In the split phase we split every state y into two subtags y_0 and y_1. We set

c_{y_0,x} = \frac{c_{y,x}}{2} + r \qquad c_{y_1,x} = \frac{c_{y,x}}{2} - r

where r is a random number r ∈ [−ρ c_{y,x}, ρ c_{y,x}] and ρ ∈ [0, 1] controls how much the statistics for y_0 and y_1 differ. The exact value of ρ is of secondary importance, as long as it is big enough to break the symmetry (the model could not learn anything if the parameters for y_0 and y_1 were identical). Analogously to the emission frequencies, we initialize the transition frequencies as follows:

c_{y_0,y'_0} = \frac{c_{y,y'}}{4} + r \qquad c_{y_0,y'_1} = \frac{c_{y,y'}}{4} - r

c_{y_1,y'_0} = \frac{c_{y,y'}}{4} + r' \qquad c_{y_1,y'_1} = \frac{c_{y,y'}}{4} - r'

We then run EM training to fit the new model to the training data.
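On the level of counts, the splitting step might be implemented as in the following sketch; storing the counts in dictionaries keyed by tag pairs and naming the substates by appending _0 and _1 are assumptions of the sketch, not the thesis' implementation.

```python
import numpy as np

def split_counts(c_emit, c_trans, rho=0.01, seed=0):
    """c_emit[(y, x)] and c_trans[(y, y2)] are frequency counts of a bigram HMM."""
    rng = np.random.default_rng(seed)
    new_emit, new_trans = {}, {}
    # Emission counts: half of c_{y,x} per substate, perturbed by r in [-rho*c, rho*c].
    for (y, x), c in c_emit.items():
        r = rng.uniform(-rho * c, rho * c)
        new_emit[(y + "_0", x)] = c / 2 + r
        new_emit[(y + "_1", x)] = c / 2 - r
    # Transition counts: a quarter of c_{y,y'} per substate pair, perturbed analogously.
    for (y, y2), c in c_trans.items():
        r0 = rng.uniform(-rho * c, rho * c)
        r1 = rng.uniform(-rho * c, rho * c)
        new_trans[(y + "_0", y2 + "_0")] = c / 4 + r0
        new_trans[(y + "_0", y2 + "_1")] = c / 4 - r0
        new_trans[(y + "_1", y2 + "_0")] = c / 4 + r1
        new_trans[(y + "_1", y2 + "_1")] = c / 4 - r1
    return new_emit, new_trans
```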

Merging

To prevent the model from wasting parameters on splits that only yield small increases in likelihood, we revert the splits with the lowest gains. The gain of a split can be calculated analogously to Petrov et al. (2006). To this end we calculate, for each split, what we would lose in terms of likelihood if we were to reverse this split. The likelihood contribution at a certain position t with state y can be recovered from the FB probabilities:

p(x, y|t) = \sum_{z_y \in \Omega(y)} \frac{\alpha_{t,z_y} \cdot \beta_{t,z_y}}{p(x_t|z_y)}

In order to calculate the likelihood of a model that does not split y into y_0 and y_1, we first have to derive the parameters of the new model from the parameters of the old model:

p(x|y) = \sum_{i \in \{0,1\}} p_i \cdot p(x|y_i) \qquad p(y'|y) = \sum_{i \in \{0,1\}} p_i \cdot p(y'|y_i) \qquad p(y|y') = \sum_{i \in \{0,1\}} p(y_i|y')

where p_0 and p_1 are the relative frequencies of y_0 and y_1 given their parent tag y. We can now derive the approximated forward and backward probabilities by substituting the corresponding parameters. Huang et al. (2009) do not provide the equations or a derivation for merging in their paper, but based on personal communication we believe that their approximation has the same form:

\alpha'_{t,y} \approx p(x_t|y) \sum_{i \in \{0,1\}} \alpha_{t,y_i} / p(x_t|y_i) \qquad \beta'_{t,y} \approx \sum_{i \in \{0,1\}} p_i \cdot \beta_{t,y_i}

where the α values can essentially be summed (after exchanging the substate emission probabilities for the merged one) because, as joint probabilities, they already contain the probability mass of each substate, while the β values are conditional on the substate and therefore have to be interpolated with the weights p_i. This calculation is approximate because it ignores any influence that the merge might have at other positions in the sequence. The likelihood of the new model can then be calculated from the new FB values, and a fraction of the splits with the lowest likelihood improvements is reversed.
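The following sketch computes this approximation for a single position: it builds the merged forward and backward values from the substate values and returns the factor by which the local likelihood contribution shrinks if the split y → y_0, y_1 is reversed; multiplying these factors over all positions annotated with y approximates the total loss of the merge, in the spirit of Petrov et al. (2006). The function name and the per-position array arguments are illustrative assumptions.

```python
import numpy as np

def merge_loss_factor(alpha_sub, beta_sub, emit_sub, emit_merged, p_sub):
    """alpha_sub, beta_sub, emit_sub: length-2 arrays with the values for y_0 and y_1
    at one position t; emit_merged = p(x_t|y) of the merged state; p_sub = (p_0, p_1)."""
    # Likelihood contribution of the split model at this position (cf. p(x, y|t)).
    split_contrib = np.sum(alpha_sub * beta_sub / emit_sub)
    # Approximate merged forward and backward values (cf. the equations above).
    alpha_merged = emit_merged * np.sum(alpha_sub / emit_sub)
    beta_merged = np.sum(p_sub * beta_sub)
    merged_contrib = alpha_merged * beta_merged / emit_merged
    # Factor by which the local contribution shrinks when the split is reversed;
    # the product of these factors over all relevant positions approximates the
    # total likelihood loss of the merge.
    return merged_contrib / split_contrib
```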