
2.3 Sequence Prediction

2.3.3 Conditional Random Fields

Merging

To prevent the model from wasting parameters on splits that only yield small increases in likelihood, we revert the splits with the lowest gains. The gain of a split can be calculated analogously to Petrov et al. (2006). To this end we calculate, for each split, what we would lose in terms of likelihood if we were to reverse it. The likelihood contribution at a certain position \(t\) with state \(y\) can be recovered from the FB probabilities:

\[
p(\mathbf{x}, y \mid t) = \sum_{z_y \in \Omega(y)} \frac{\alpha_{t,z_y} \cdot \beta_{t,z_y}}{p(x_t \mid z_y)}
\]

In order to calculate the likelihood of a model that does not split \(y\) into \(y_0\) and \(y_1\), we first have to derive the parameters of the new model from the parameters of the old model:

\[
p(x \mid y) = \sum_{i\in\{0,1\}} p_i \cdot p(x \mid y_i) \qquad
p(y' \mid y) = \sum_{i\in\{0,1\}} p_i \cdot p(y' \mid y_i) \qquad
p(y \mid y') = \sum_{i\in\{0,1\}} p(y_i \mid y')
\]

where \(p_0\) and \(p_1\) are the relative frequencies of \(y_0\) and \(y_1\) given their parent tag \(y\). We can now derive the approximated forward and backward probabilities by substituting the corresponding parameters. Huang et al. (2009) do not provide the equations or the derivation for merging in their paper, but based on personal communication we believe that their approximation has the same form:

\[
\alpha'_{t,y} \approx p(x_t \mid y) \sum_{i\in\{0,1\}} \alpha_{t,y_i} / p(x_t \mid y_i) \qquad
\beta'_{t,y} \approx \sum_{i\in\{0,1\}} p_i \cdot \beta_{t,y_i}
\]

where the \(\alpha\) values can be added once the emission terms have been divided out, while the \(\beta\) values have to be interpolated. This calculation is approximate because it ignores any influence that the merge might have at any other position in the sequence. The likelihood of the new model can then be calculated from the new FB values, and a fraction of the splits with the lowest likelihood improvements is reversed.
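The merged parameters and approximate FB values above can be sketched in a few lines. The function and variable names below are hypothetical, and the FB values are assumed to be given for a single, fixed position \(t\):

```python
def merge_state(p, alpha, beta, emis):
    """Approximate quantities for a merged state y = y0 + y1 at one position t.

    p[i]     : relative frequency p_i of split state y_i given parent y
    alpha[i] : forward value alpha_{t,y_i}
    beta[i]  : backward value beta_{t,y_i}
    emis[i]  : emission probability p(x_t | y_i)
    Returns (merged emission, approximate alpha', approximate beta').
    """
    # merged emission: p(x_t|y) = sum_i p_i * p(x_t|y_i)
    e = sum(p[i] * emis[i] for i in range(2))
    # alpha': divide out the old emissions, add, multiply the merged one back in
    a = e * sum(alpha[i] / emis[i] for i in range(2))
    # beta': interpolate with the relative frequencies
    b = sum(p[i] * beta[i] for i in range(2))
    return e, a, b
```

With equal split frequencies and equal emissions, the merged emission stays the same, the forward masses of the two split states simply add, and the backward values average.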

We now give a formal definition. We start with a discussion of maximum entropy (ME) models, of which conditional random fields are a special case. The general idea behind ME models is to find a principled way to model the distribution of a random variable \(X\) given a data set \(D\). Unlike in the derivations of n-gram models and HMMs, we do not want to make explicit independence assumptions. The only assumptions we make are introduced via feature functions \(\phi_i(x)\), which tell us which aspects of the problem are important. In the case of language modeling, for example, where \(X\) was the set of all sequences of words over a vocabulary \(V\), we could decide that the \(\phi_i\) count how often certain n-grams occur in the sentence \(x\). We require that the expected values of these \(\phi_i\) under the density function \(f\) of our model equal the expected values under the empirical distribution \(P_D\):

\[
E_f(\phi_i) = E_{P_D}(\phi_i), \quad 1 \le i \le N \tag{2.38}
\]

This essentially guarantees that the model memorizes the important aspects of our data set.

We now want to set the parameters of our model without making any further assumptions by maximizing the entropy of the model, where we define entropy as the expected Shannon information \(I(x)\) of a discrete random variable \(X\):¹

\[
H(X) = E_X(I) = \sum_{x \in X} P(x) I(x) = -\sum_{x \in X} P(x) \log P(x) \tag{2.39}
\]

A ME model without constraints corresponds to the uniform probability distribution. If we add a constraint, the model will change in order to satisfy it, but still stay as uniform as possible. Consider the following example: we are modeling a discrete random variable with the three values A, B, and C. If we know nothing more, we should assume a uniform distribution with \(P(A) = P(B) = P(C) = 1/3\). If we now add the constraint that A and B should make up 50% of our probability mass, then the natural change would be to assume \(P(A) = P(B) = 1/4\), where we set the probabilities of A and B equal because we do not want to assume anything further.
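The A/B/C example can be verified numerically. The sketch below assumes the exponential form derived later in this section, \(p(x) \propto \exp(\lambda \phi(x))\) with \(\phi(A) = \phi(B) = 1\) and \(\phi(C) = 0\), and finds \(\lambda\) by bisection; the function name is made up for illustration:

```python
import math

def maxent_abc(target=0.5):
    """ME distribution over {A, B, C} subject to P(A) + P(B) = target.

    The ME solution has the form p(x) ∝ exp(lam * phi(x)); since the
    constrained mass is monotone in lam, we can bisect on it.
    """
    lo, hi = -50.0, 50.0
    for _ in range(200):
        lam = (lo + hi) / 2
        z = 2 * math.exp(lam) + 1          # partition function
        mass_ab = 2 * math.exp(lam) / z    # P(A) + P(B)
        if mass_ab < target:
            lo = lam
        else:
            hi = lam
    z = 2 * math.exp(lam) + 1
    return math.exp(lam) / z, math.exp(lam) / z, 1 / z  # P(A), P(B), P(C)
```

For a target of 50% this recovers \(P(A) = P(B) = 1/4\) and \(P(C) = 1/2\), the "as uniform as possible" solution from the text.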

We now need to find the form of the density function \(f(x)\). We do so by solving an optimization problem: we want to define \(f(x)\) in a way that maximizes the entropy while satisfying the feature constraints and the constraint that \(f(x)\) is properly normalized. Using Lagrange multipliers we arrive at the following updated objective function:

\[
\begin{aligned}
H'(f, \lambda) &= H(f) + \lambda_0 \Big(\sum_{x' \in X} f(x') - 1\Big) + \sum_{i=1}^{N} \lambda_i \big[E_f(\phi_i) - E_{P_D}(\phi_i)\big] \\
&= -\sum_{x' \in X} f(x') \log f(x') + \lambda_0 \Big(\sum_{x' \in X} f(x') - 1\Big) + \sum_{i=1}^{N} \lambda_i \big[E_f(\phi_i) - E_{P_D}(\phi_i)\big]
\end{aligned}
\]

¹ By convention we have \(\lim_{x \to 0} x \log x = 0\).

\(H'\) has the following derivatives:

\[
\frac{\partial H'(f, \lambda)}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \sum_{i=1}^{N} \lambda_i \phi_i(x) \tag{2.40}
\]
\[
\frac{\partial H'(f, \lambda)}{\partial \lambda_0} = \sum_{x' \in X} f(x') - 1 \tag{2.41}
\]
\[
\frac{\partial H'(f, \lambda)}{\partial \lambda_i} = E_f(\phi_i) - E_{P_D}(\phi_i) \tag{2.42}
\]

Setting Eq. 2.40 and Eq. 2.41 to zero we obtain:

\[
f(x, \lambda) = \frac{\exp \sum_{i=1}^{N} \lambda_i \phi_i(x)}{Z(\lambda)} \tag{2.43}
\]
\[
Z(\lambda) = \sum_{x \in X} \exp \sum_{i=1}^{N} \lambda_i \phi_i(x) \tag{2.44}
\]

This is the general form of a ME model. The normalization constant \(Z\) is called the partition function. The remaining constraints of Eq. 2.42 are met during training, and it can be shown that this can also be done by maximizing the likelihood of \(D\) given the model (see Sudderth (2006) for details and a proof).

The models we are interested in for classification have the following conditional form:

\[
p_{ME}(y \mid x) = \frac{1}{Z_{ME}(\vec\lambda, x)} \exp \sum_i \lambda_i \phi_i(x, y) \tag{2.45}
\]
\[
Z_{ME}(\vec\lambda, x) = \sum_y \exp \sum_i \lambda_i \phi_i(x, y) \tag{2.46}
\]

As we already discussed in the last subsection, these models can be trained by optimizing the (conditional) likelihood of \(D\):

\[
\begin{aligned}
ll_D(p_{ME}(\vec\lambda)) &= \sum_{x,y \in D} \log p_{ME}(y \mid x, \vec\lambda) \\
&= \sum_{x,y \in D} \sum_i \lambda_i \cdot \phi_i(x, y) - \sum_{x,y \in D} \log Z_{ME}(\vec\lambda, x)
\end{aligned} \tag{2.47}
\]

\[
\frac{\partial\, ll_D(p_{ME}(\vec\lambda))}{\partial \lambda_i} = \sum_{x,y \in D} \phi_i(x, y) - \sum_{x,y \in D} \sum_{y'} \phi_i(x, y')\, p_{ME}(y' \mid x, \vec\lambda) \tag{2.48}
\]
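Equations 2.47 and 2.48 translate directly into code. The sketch below assumes a small label set and feature functions \(\phi_i(x, y)\) given as plain Python callables; all names are illustrative:

```python
import math

def me_probs(x, labels, phi, lam):
    """p_ME(y|x) = exp(sum_i lam_i * phi_i(x, y)) / Z_ME(lam, x)   (Eq. 2.45)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, phi)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def log_likelihood_and_gradient(data, labels, phi, lam):
    """Conditional log-likelihood (Eq. 2.47) and its gradient (Eq. 2.48)."""
    ll = 0.0
    grad = [0.0] * len(lam)
    for x, y in data:
        p = me_probs(x, labels, phi, lam)
        ll += math.log(p[y])
        for i, f in enumerate(phi):
            grad[i] += f(x, y)                                   # observed count
            grad[i] -= sum(f(x, yp) * p[yp] for yp in labels)    # expected count
    return ll, grad
```

At \(\vec\lambda = \vec 0\) the model is uniform over the labels, and the gradient is simply observed minus expected feature counts under that uniform distribution.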

Given \(ll_D(\vec\lambda)\) and the gradient \(\nabla ll_D(\vec\lambda)\) we can use general numeric optimization to calculate the optimal \(\hat\lambda\). Here we just give a simple stochastic gradient descent algorithm used by Tsuruoka et al. (2009). The algorithm receives an initial step width \(\eta_0\) and a maximal number of iterations \(N\). It then iterates over the data in an online fashion and moves the model parameters in the direction of the gradient. The step width \(\eta\) decays with the number of processed items in order to achieve convergence:

Algorithm 2.2 Stochastic Gradient Descent

    λ⃗ ← 0⃗
    i ← 0
    for epoch = 1 → N do
        shuffle D
        for x⃗, y⃗ ∈ D do
            η_i ← η_0 / (1 + i/|D|)
            λ⃗ ← λ⃗ + η_i · ∇_λ⃗ log p(y⃗ | x⃗)
            i ← i + 1
        end for
    end for
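Algorithm 2.2 can be transcribed directly for the ME classifier, using the per-example gradient of \(\log p(y \mid x)\) from Eq. 2.48. The function name, the feature encoding, and the default hyperparameters below are illustrative choices, not part of the original algorithm:

```python
import math
import random

def sgd_train(data, labels, phi, eta0=1.0, epochs=20, seed=0):
    """Online SGD with decaying step width (Algorithm 2.2) for a ME classifier."""
    rng = random.Random(seed)
    lam = [0.0] * len(phi)
    i = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            eta = eta0 / (1 + i / len(data))     # decaying step width
            # p(y'|x) under the current parameters
            s = {yp: math.exp(sum(l * f(x, yp) for l, f in zip(lam, phi)))
                 for yp in labels}
            z = sum(s.values())
            # gradient of log p(y|x): observed minus expected feature values
            for j, f in enumerate(phi):
                g = f(x, y) - sum(f(x, yp) * s[yp] / z for yp in labels)
                lam[j] += eta * g
            i += 1
    return lam
```

On a trivially separable toy set the weight of a label-matching feature grows quickly, while the decaying step width keeps the updates from oscillating.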

During training, a ME classifier might tend to use rare features to explain classes that cannot be explained otherwise. These rare features might then obtain a high weight \(\lambda_i\) even though we cannot be sure they will behave in the same way on unseen data. We thus add a penalty term to our likelihood objective which pushes the weight vector towards zero. Every increase in a feature weight now has to be justified by a sufficient increase in training data likelihood. The most common form of regularization is to impose constraints on the norm of the weight vector.

The so-called \(l_2\)-regularization has the following form:

\[
ll'_D(\lambda) = ll_D(\lambda) - \mu \sum_i |\lambda_i|^2 \tag{2.49}
\]
\[
\frac{\partial\, ll'_D(\lambda)}{\partial \lambda_i} = \frac{\partial\, ll_D(\lambda)}{\partial \lambda_i} - 2\mu\lambda_i \tag{2.50}
\]

where \(\mu\) is the strength of the regularizer, a hyperparameter that needs to be optimized on held-out data. Another common form is \(l_1\)-regularization:

\[
ll''_D(\lambda) = ll_D(\lambda) - \mu \sum_i |\lambda_i| \tag{2.51}
\]
\[
\frac{\partial\, ll''_D(\lambda)}{\partial \lambda_i} = \frac{\partial\, ll_D(\lambda)}{\partial \lambda_i} - \mu\, \mathrm{sign}(\lambda_i) \tag{2.52}
\]

The difference between the two forms of regularization is that \(l_2\) generates small \(\lambda\) weights, while \(l_1\) will set the less important features to zero. \(l_1\)-regularization is thus also useful as a feature selection method.
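The two penalty gradients (Eqs. 2.50 and 2.52) are simple element-wise corrections to the likelihood gradient. The helpers below are illustrative; \(\mathrm{sign}(0)\) is taken to be 0, a common convention:

```python
def l2_grad(grad, lam, mu):
    """Gradient of the l2-regularized objective (Eq. 2.50)."""
    return [g - 2 * mu * l for g, l in zip(grad, lam)]

def l1_grad(grad, lam, mu):
    """Gradient of the l1-regularized objective (Eq. 2.52), with sign(0) = 0."""
    sign = lambda v: (v > 0) - (v < 0)
    return [g - mu * sign(l) for g, l in zip(grad, lam)]
```

Note that the \(l_2\) correction shrinks proportionally to the weight and so never reaches zero exactly, while the constant-size \(l_1\) correction can push small weights all the way to zero.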

Conditional random fields are ME classifiers over sequences and can be defined in the following way:

\[
p_{CRF}(\mathbf{y} \mid \mathbf{x}) = \frac{\exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t)}{Z_{CRF}(\vec\lambda, \mathbf{x})} \tag{2.53}
\]
\[
Z_{CRF}(\vec\lambda, \mathbf{x}) = \sum_{\mathbf{y}} \exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) \tag{2.54}
\]

The only difference between MEs and CRFs is that we need to define some dynamic programs to make the calculations needed during parameter estimation efficient. We start with the partition function \(Z_{CRF}\), which involves a summation over all possible sequences \(\mathbf{y}\). As explained in the definition of the Viterbi algorithm, the number of sequences rises exponentially with the number of output symbols. However, we have already discussed that the sum over all paths through a sequence lattice can be efficiently calculated using the forward or backward algorithm:

\[
\begin{aligned}
\log Z_{CRF}(\vec\lambda, \mathbf{x}) &= \log \sum_{\mathbf{y}} \exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \log \sum_{\mathbf{y}} \exp \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \bigoplus_{\mathbf{y}} \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \alpha(T+1, \mathrm{stop}) = \beta(0, \mathrm{start})
\end{aligned}
\]

where \(\oplus\) denotes the addition of two numbers in log space, \(a \oplus b = \log(\exp a + \exp b)\), and \(\psi(y_t, y_{t-1}, \mathbf{x}, t) = \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t)\) is the score potential around position \(t\). Following our discussion of FB for HMMs, we define \(\alpha\) and \(\beta\) as:

\[
\begin{aligned}
\alpha(0, \mathrm{start}) &= 0 & \alpha(t, y) &= \bigoplus_{y'} \big[\alpha(t-1, y') + \psi(y, y', \mathbf{x}, t)\big] \\
\beta(T+1, \mathrm{stop}) &= 0 & \beta(t, y) &= \bigoplus_{y'} \big[\beta(t+1, y') + \psi(y', y, \mathbf{x}, t+1)\big]
\end{aligned}
\]
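The \(\alpha\) recursion can be implemented directly in log space. The sketch below numbers the states \(0, \dots, N-1\), folds \(\sum_i \lambda_i \phi_i\) into a precomputed potential `psi(t, y, y_prev)` (with `y_prev = None` standing in for the start state), and checks the result against brute-force enumeration; all names are illustrative:

```python
import math
from itertools import product

def logsumexp(vals):
    """Iterated log-space addition: a ⊕ b = log(exp a + exp b)."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_partition(psi, n_states, T):
    """log Z via the forward recursion; psi(t, y, y_prev) is the score
    potential at position t (y_prev is None at t = 0)."""
    alpha = [psi(0, y, None) for y in range(n_states)]
    for t in range(1, T):
        alpha = [logsumexp(alpha[yp] + psi(t, y, yp) for yp in range(n_states))
                 for y in range(n_states)]
    return logsumexp(alpha)

def brute_log_z(psi, n_states, T):
    """Exponential-time check: enumerate all n_states**T label sequences."""
    return logsumexp(
        sum(psi(t, seq[t], seq[t - 1] if t else None) for t in range(T))
        for seq in product(range(n_states), repeat=T))
```

The forward pass touches each lattice edge once, while the brute-force version enumerates every path, so agreement between the two on a small example is a useful sanity check for the recursion.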

The difference to our HMM definition is that we now do not calculate probabilities, but unnormalized log-probabilities. In order to estimate the parameters we just have to derive \(ll_D(\vec\lambda)\) and \(\nabla ll_D(\vec\lambda)\):

\[
\begin{aligned}
ll_D(\vec\lambda) &= \sum_{\mathbf{x},\mathbf{y} \in D} \log p_{CRF}(\mathbf{y} \mid \mathbf{x}, \vec\lambda) \\
&= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \log Z_{CRF}(\vec\lambda, \mathbf{x})
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial\, ll_D(\vec\lambda)}{\partial \lambda_i} &= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \sum_{\mathbf{y}'} \sum_t \phi_i(y'_t, y'_{t-1}, \mathbf{x}, t) \cdot p_{CRF}(\mathbf{y}' \mid \mathbf{x}, \vec\lambda) \\
&= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \sum_{t, y', y''} \phi_i(y', y'', \mathbf{x}, t) \cdot p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda)
\end{aligned} \tag{2.55}
\]

where \(p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda)\) denotes the posterior probability of a transition from \(y''\) at position \(t-1\) to \(y'\) at position \(t\). We already discussed the calculation of this probability for the HMM case and can derive a similar formula using FB:

\[
p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda) = \frac{\sum \big\{\exp \psi(\mathbf{y}) \mid \mathbf{y} \text{ is a path with a } y'' \to y' \text{ transition at } t\big\}}{\sum_{\mathbf{y}} \exp \psi(\mathbf{y})} = \frac{\exp\big(\alpha(t-1, y'') + \psi(y', y'', \mathbf{x}, t) + \beta(t, y')\big)}{Z_{CRF}(\vec\lambda, \mathbf{x})}
\]

where \(\psi(\mathbf{y}) = \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t)\) denotes the total score of a path.
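Combining the \(\alpha\) and \(\beta\) recursions gives these transition posteriors in code. The sketch below numbers the states \(0, \dots, N-1\), uses a potential `psi(t, y, y_prev)` with `y_prev = None` at \(t = 0\), returns the posterior of a transition between positions \(t\) and \(t+1\), and relies on the fact that the posteriors at a fixed position must sum to one; all names are illustrative:

```python
import math

def logsumexp(vals):
    """Iterated log-space addition: a ⊕ b = log(exp a + exp b)."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_backward(psi, n, T):
    """Log-space alpha/beta tables for a sequence of length T."""
    alpha = [[psi(0, y, None) for y in range(n)]]
    for t in range(1, T):
        alpha.append([logsumexp(alpha[-1][yp] + psi(t, y, yp) for yp in range(n))
                      for y in range(n)])
    beta = [[0.0] * n]                        # row for position T-1
    for t in range(T - 1, 0, -1):
        beta.insert(0, [logsumexp(beta[0][y2] + psi(t, y2, y) for y2 in range(n))
                        for y in range(n)])   # prepend row for position t-1
    return alpha, beta

def transition_posterior(psi, n, T, t, y, y2):
    """Posterior of the transition y (at position t) -> y2 (at position t+1)."""
    alpha, beta = forward_backward(psi, n, T)
    log_z = logsumexp(alpha[T - 1])
    return math.exp(alpha[t][y] + psi(t + 1, y2, y) + beta[t + 1][y2] - log_z)
```

Each edge posterior combines the total score of all prefixes ending in \(y\), the potential of the edge itself, and the total score of all suffixes starting in \(y_2\), normalized by the partition function.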

The log-likelihood can then be optimized using the SGD algorithm we discussed for ME classifiers. The runtime of the parameter estimation is dominated by the FB calculations above and is thus in \(O(N^n T)\), where \(N\) denotes the number of output states, \(T\) the length of the sequence, and \(n\) the order of the CRF. CRF training is thus slow when high model orders (\(> 1\)) or big tagsets (\(> 100\)) are used. In Chapter 5 we discuss pruning strategies that yield substantial speed-ups in these particular cases.