
2.3 Sequence Prediction

2.3.3 Conditional Random Fields

Merging

To prevent the model from wasting parameters on splits that only yield small increases in likelihood, we revert the splits with the lowest gains. The gain of a split can be calculated analogously to Petrov et al. (2006). To this end we calculate, for each split, what we would lose in terms of likelihood if we were to reverse it. The likelihood contribution at a certain position \(t\) with state \(y\) can be recovered from the FB probabilities:

\[
p(\mathbf{x}, y \mid t) = \sum_{z_y \in \Omega(y)} \frac{\alpha_{t,z_y} \cdot \beta_{t,z_y}}{p(x_t \mid z_y)}
\]

In order to calculate the likelihood of a model that does not split \(y\) into \(y_0\) and \(y_1\), we first have to derive the parameters of the new model from the parameters of the old model:

\[
p(x \mid y) = \sum_{i\in\{0,1\}} p_i \cdot p(x \mid y_i) \qquad
p(y' \mid y) = \sum_{i\in\{0,1\}} p_i \cdot p(y' \mid y_i) \qquad
p(y \mid y') = \sum_{i\in\{0,1\}} p(y_i \mid y')
\]

where \(p_0\) and \(p_1\) are the relative frequencies of \(y_0\) and \(y_1\) given their parent tag \(y\). We can now derive the approximated forward and backward probabilities by substituting the corresponding parameters. Huang et al. (2009) do not provide the equations or the derivation for merging in their paper, but based on personal communication we believe that their approximation has the same form:

\[
\alpha'_{t,y} \approx p(x_t \mid y) \sum_{i\in\{0,1\}} \alpha_{t,y_i} / p(x_t \mid y_i) \qquad
\beta'_{t,y} \approx \sum_{i\in\{0,1\}} p_i \cdot \beta_{t,y_i}
\]

where the \(\alpha\) values can be added once the emission terms have been divided out, while the \(\beta\) values have to be interpolated. This calculation is approximate because it ignores any influence that the merge might have at any other position in the sequence. The likelihood of the new model can then be calculated from the new FB values, and a fraction of the splits with the lowest likelihood improvements is reversed.
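The merged parameters and approximate FB values above can be sketched in a few lines. The function and variable names below are hypothetical, and the FB values are assumed to be given for a single, fixed position \(t\):

```python
def merge_state(p, alpha, beta, emis):
    """Approximate quantities for a merged state y = y0 + y1 at one position t.

    p[i]     : relative frequency p_i of split state y_i given parent y
    alpha[i] : forward value alpha_{t,y_i}
    beta[i]  : backward value beta_{t,y_i}
    emis[i]  : emission probability p(x_t | y_i)
    Returns (merged emission, approximate alpha', approximate beta').
    """
    # merged emission: p(x_t|y) = sum_i p_i * p(x_t|y_i)
    e = sum(p[i] * emis[i] for i in range(2))
    # alpha': divide out the old emissions, add, multiply the merged one back in
    a = e * sum(alpha[i] / emis[i] for i in range(2))
    # beta': interpolate with the relative frequencies
    b = sum(p[i] * beta[i] for i in range(2))
    return e, a, b
```

With equal split frequencies and equal emissions, the merged emission stays the same, the forward masses of the two split states simply add, and the backward values average.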

We now give a formal definition. We start with a discussion of maximum entropy (ME) models, of which conditional random fields are a special case. The general idea behind ME models is to find a principled way to model the distribution of a random variable \(X\) given a data set \(D\). Unlike in the derivations of n-gram models and HMMs, we do not want to make explicit independence assumptions. The only assumptions we make are introduced via feature functions \(\phi_i(x)\), which tell us which aspects of the problem are important. In the case of language modeling, for example, where \(X\) was the set of all sequences of words over a vocabulary \(V\), we could decide that the \(\phi_i\) count how often certain n-grams occur in the sentence \(x\). We require that the expected values of these \(\phi_i\) under the density function \(f\) of our model equal the expected values under the empirical distribution \(P_D\):

\[
E_f(\phi_i) = E_{P_D}(\phi_i), \quad 1 \le i \le N \tag{2.38}
\]

This essentially guarantees that the model memorizes the important aspects of our data set.

We now want to set the parameters of our model without making any further assumptions by maximizing the entropy of the model, where we define entropy as the expected Shannon information \(I(x)\) of a discrete random variable \(X\):¹

\[
H(X) = E_X(I) = \sum_{x \in X} P(x) I(x) = -\sum_{x \in X} P(x) \log P(x) \tag{2.39}
\]

A ME model without constraints corresponds to the uniform probability distribution. If we add a constraint, the model will change in order to satisfy it, but still stay as uniform as possible. Consider the following example: we are modeling a discrete random variable with the three values A, B, and C. If we know nothing more, we should assume a uniform distribution with \(P(A) = P(B) = P(C) = 1/3\). If we now add the constraint that A and B should make up 50% of our probability mass, then the natural change would be to assume \(P(A) = P(B) = 1/4\), where we set the probabilities of A and B equal because we do not want to assume anything further.
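The A/B/C example can be verified numerically. The sketch below assumes the exponential form derived later in this section, \(p(x) \propto \exp(\lambda \phi(x))\) with \(\phi(A) = \phi(B) = 1\) and \(\phi(C) = 0\), and finds \(\lambda\) by bisection; the function name is made up for illustration:

```python
import math

def maxent_abc(target=0.5):
    """ME distribution over {A, B, C} subject to P(A) + P(B) = target.

    The ME solution has the form p(x) ∝ exp(lam * phi(x)); since the
    constrained mass is monotone in lam, we can bisect on it.
    """
    lo, hi = -50.0, 50.0
    for _ in range(200):
        lam = (lo + hi) / 2
        z = 2 * math.exp(lam) + 1          # partition function
        mass_ab = 2 * math.exp(lam) / z    # P(A) + P(B)
        if mass_ab < target:
            lo = lam
        else:
            hi = lam
    z = 2 * math.exp(lam) + 1
    return math.exp(lam) / z, math.exp(lam) / z, 1 / z  # P(A), P(B), P(C)
```

For a target of 50% this recovers \(P(A) = P(B) = 1/4\) and \(P(C) = 1/2\), the "as uniform as possible" solution from the text.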

We now need to find the form of the density function \(f(x)\). We do so by solving an optimization problem: we want to define \(f(x)\) in a way that maximizes the entropy while satisfying the feature constraints and the constraint that \(f(x)\) is properly normalized. Using Lagrange multipliers we arrive at the following updated objective function:

\[
\begin{aligned}
H'(f, \lambda) &= H(f) + \lambda_0 \Big(\sum_{x' \in X} f(x') - 1\Big) + \sum_{i=1}^{N} \lambda_i \big[E_f(\phi_i) - E_{P_D}(\phi_i)\big] \\
&= -\sum_{x' \in X} f(x') \log f(x') + \lambda_0 \Big(\sum_{x' \in X} f(x') - 1\Big) + \sum_{i=1}^{N} \lambda_i \big[E_f(\phi_i) - E_{P_D}(\phi_i)\big]
\end{aligned}
\]

¹ By convention we have \(\lim_{x \to 0} x \log x = 0\).

\(H'\) has the following derivatives:

\[
\frac{\partial H'(f, \lambda)}{\partial f(x)} = -\log f(x) - 1 + \lambda_0 + \sum_{i=1}^{N} \lambda_i \phi_i(x) \tag{2.40}
\]
\[
\frac{\partial H'(f, \lambda)}{\partial \lambda_0} = \sum_{x' \in X} f(x') - 1 \tag{2.41}
\]
\[
\frac{\partial H'(f, \lambda)}{\partial \lambda_i} = E_f(\phi_i) - E_{P_D}(\phi_i) \tag{2.42}
\]

Setting Eq. 2.40 and Eq. 2.41 to zero we obtain:

\[
f(x, \lambda) = \frac{\exp \sum_{i=1}^{N} \lambda_i \phi_i(x)}{Z(\lambda)} \tag{2.43}
\]
\[
Z(\lambda) = \sum_{x \in X} \exp \sum_{i=1}^{N} \lambda_i \phi_i(x) \tag{2.44}
\]

This is the general form of a ME model. The normalization constant \(Z\) is called the partition function. The remaining constraints of Eq. 2.42 are met during training, and it can be shown that this can also be done by maximizing the likelihood of \(D\) given the model (see Sudderth (2006) for details and a proof).

The models we are interested in for classification have the following conditional form:

\[
p_{ME}(y \mid x) = \frac{1}{Z_{ME}(\vec\lambda, x)} \exp \sum_i \lambda_i \phi_i(x, y) \tag{2.45}
\]
\[
Z_{ME}(\vec\lambda, x) = \sum_y \exp \sum_i \lambda_i \phi_i(x, y) \tag{2.46}
\]

As we already discussed in the last subsection, these models can be trained by optimizing the (conditional) likelihood of \(D\):

\[
\begin{aligned}
ll_D(p_{ME}(\vec\lambda)) &= \sum_{x,y \in D} \log p_{ME}(y \mid x, \vec\lambda) \\
&= \sum_{x,y \in D} \sum_i \lambda_i \cdot \phi_i(x, y) - \sum_{x,y \in D} \log Z_{ME}(\vec\lambda, x)
\end{aligned} \tag{2.47}
\]

\[
\frac{\partial\, ll_D(p_{ME}(\vec\lambda))}{\partial \lambda_i} = \sum_{x,y \in D} \phi_i(x, y) - \sum_{x,y \in D} \sum_{y'} \phi_i(x, y')\, p_{ME}(y' \mid x, \vec\lambda) \tag{2.48}
\]
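Equations 2.47 and 2.48 translate directly into code. The sketch below assumes a small label set and feature functions \(\phi_i(x, y)\) given as plain Python callables; all names are illustrative:

```python
import math

def me_probs(x, labels, phi, lam):
    """p_ME(y|x) = exp(sum_i lam_i * phi_i(x, y)) / Z_ME(lam, x)   (Eq. 2.45)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, phi)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def log_likelihood_and_gradient(data, labels, phi, lam):
    """Conditional log-likelihood (Eq. 2.47) and its gradient (Eq. 2.48)."""
    ll = 0.0
    grad = [0.0] * len(lam)
    for x, y in data:
        p = me_probs(x, labels, phi, lam)
        ll += math.log(p[y])
        for i, f in enumerate(phi):
            grad[i] += f(x, y)                                   # observed count
            grad[i] -= sum(f(x, yp) * p[yp] for yp in labels)    # expected count
    return ll, grad
```

At \(\vec\lambda = \vec 0\) the model is uniform over the labels, and the gradient is simply observed minus expected feature counts under that uniform distribution.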

Given \(ll_D(\vec\lambda)\) and the gradient \(\nabla ll_D(\vec\lambda)\) we can use general numeric optimization to calculate the optimal \(\hat\lambda\). Here we just give a simple stochastic gradient descent algorithm used by Tsuruoka et al. (2009). The algorithm receives an initial step width \(\eta_0\) and a maximal number of iterations \(N\). It then iterates over the data in an online fashion and moves the model parameters in the direction of the gradient. The step width \(\eta\) decays with the number of processed items in order to achieve convergence:

Algorithm 2.2 Stochastic Gradient Descent

    λ⃗ ← 0⃗
    i ← 0
    for epoch = 1 → N do
        shuffle D
        for x⃗, y⃗ ∈ D do
            η_i ← η_0 / (1 + i/|D|)
            λ⃗ ← λ⃗ + η_i · ∇_λ⃗ log p(y⃗ | x⃗)
            i ← i + 1
        end for
    end for
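Algorithm 2.2 can be transcribed directly for the ME classifier, using the per-example gradient of \(\log p(y \mid x)\) from Eq. 2.48. The function name, the feature encoding, and the default hyperparameters below are illustrative choices, not part of the original algorithm:

```python
import math
import random

def sgd_train(data, labels, phi, eta0=1.0, epochs=20, seed=0):
    """Online SGD with decaying step width (Algorithm 2.2) for a ME classifier."""
    rng = random.Random(seed)
    lam = [0.0] * len(phi)
    i = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            eta = eta0 / (1 + i / len(data))     # decaying step width
            # p(y'|x) under the current parameters
            s = {yp: math.exp(sum(l * f(x, yp) for l, f in zip(lam, phi)))
                 for yp in labels}
            z = sum(s.values())
            # gradient of log p(y|x): observed minus expected feature values
            for j, f in enumerate(phi):
                g = f(x, y) - sum(f(x, yp) * s[yp] / z for yp in labels)
                lam[j] += eta * g
            i += 1
    return lam
```

On a trivially separable toy set the weight of a label-matching feature grows quickly, while the decaying step width keeps the updates from oscillating.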

During training, a ME classifier might tend to use rare features to explain classes that cannot be explained otherwise. These rare features might then obtain a high weight \(\lambda_i\) even though we cannot be sure they will behave in the same way on unseen data. We thus add a penalty term to our likelihood objective which pushes the weight vector towards zero. Every increase in a feature weight now has to be justified by a sufficient increase in training data likelihood. The most common form of regularization is to impose constraints on the norm of the weight vector.

The so-called \(l_2\)-regularization has the following form:

\[
ll'_D(\lambda) = ll_D(\lambda) - \mu \sum_i |\lambda_i|^2 \tag{2.49}
\]
\[
\frac{\partial\, ll'_D(\lambda)}{\partial \lambda_i} = \frac{\partial\, ll_D(\lambda)}{\partial \lambda_i} - 2\mu\lambda_i \tag{2.50}
\]

where \(\mu\) is the strength of the regularizer, a hyperparameter that needs to be optimized on held-out data. Another common form is \(l_1\)-regularization:

\[
ll''_D(\lambda) = ll_D(\lambda) - \mu \sum_i |\lambda_i| \tag{2.51}
\]
\[
\frac{\partial\, ll''_D(\lambda)}{\partial \lambda_i} = \frac{\partial\, ll_D(\lambda)}{\partial \lambda_i} - \mu\, \mathrm{sign}(\lambda_i) \tag{2.52}
\]

The difference between the two forms of regularization is that \(l_2\) generates small \(\lambda\) weights, while \(l_1\) will set the less important features to zero. \(l_1\)-regularization is thus also useful as a feature selection method.
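The two penalty gradients (Eqs. 2.50 and 2.52) are simple element-wise corrections to the likelihood gradient. The helpers below are illustrative; \(\mathrm{sign}(0)\) is taken to be 0, a common convention:

```python
def l2_grad(grad, lam, mu):
    """Gradient of the l2-regularized objective (Eq. 2.50)."""
    return [g - 2 * mu * l for g, l in zip(grad, lam)]

def l1_grad(grad, lam, mu):
    """Gradient of the l1-regularized objective (Eq. 2.52), with sign(0) = 0."""
    sign = lambda v: (v > 0) - (v < 0)
    return [g - mu * sign(l) for g, l in zip(grad, lam)]
```

Note that the \(l_2\) correction shrinks proportionally to the weight and so never reaches zero exactly, while the constant-size \(l_1\) correction can push small weights all the way to zero.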

Conditional random fields are ME classifiers over sequences and can be defined in the following way:

\[
p_{CRF}(\mathbf{y} \mid \mathbf{x}) = \frac{\exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t)}{Z_{CRF}(\vec\lambda, \mathbf{x})} \tag{2.53}
\]
\[
Z_{CRF}(\vec\lambda, \mathbf{x}) = \sum_{\mathbf{y}} \exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) \tag{2.54}
\]

The only difference between MEs and CRFs is that we need to define some dynamic programs to make the calculations needed during parameter estimation efficient. We start with the partition function \(Z_{CRF}\), which involves a summation over all possible sequences \(\mathbf{y}\). As explained in the definition of the Viterbi algorithm, the number of sequences rises exponentially with the number of output symbols. However, we have already discussed that the sum over all paths through a sequence lattice can be efficiently calculated using the forward or backward algorithm:

\[
\begin{aligned}
\log Z_{CRF}(\vec\lambda, \mathbf{x}) &= \log \sum_{\mathbf{y}} \exp \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \log \sum_{\mathbf{y}} \exp \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \bigoplus_{\mathbf{y}} \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t) \\
&= \alpha(T+1, \mathrm{stop}) = \beta(0, \mathrm{start})
\end{aligned}
\]

where \(\oplus\) denotes the addition of two numbers in log space, \(a \oplus b = \log(\exp a + \exp b)\), and \(\psi(y_t, y_{t-1}, \mathbf{x}, t) = \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t)\) is the score potential around position \(t\). Following our discussion of FB for HMMs, we define \(\alpha\) and \(\beta\) as:

\[
\begin{aligned}
\alpha(0, \mathrm{start}) &= 0 & \alpha(t, y) &= \bigoplus_{y'} \big[\alpha(t-1, y') + \psi(y, y', \mathbf{x}, t)\big] \\
\beta(T+1, \mathrm{stop}) &= 0 & \beta(t, y) &= \bigoplus_{y'} \big[\beta(t+1, y') + \psi(y', y, \mathbf{x}, t+1)\big]
\end{aligned}
\]
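The \(\alpha\) recursion can be implemented directly in log space. The sketch below numbers the states \(0, \dots, N-1\), folds \(\sum_i \lambda_i \phi_i\) into a precomputed potential `psi(t, y, y_prev)` (with `y_prev = None` standing in for the start state), and checks the result against brute-force enumeration; all names are illustrative:

```python
import math
from itertools import product

def logsumexp(vals):
    """Iterated log-space addition: a ⊕ b = log(exp a + exp b)."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_partition(psi, n_states, T):
    """log Z via the forward recursion; psi(t, y, y_prev) is the score
    potential at position t (y_prev is None at t = 0)."""
    alpha = [psi(0, y, None) for y in range(n_states)]
    for t in range(1, T):
        alpha = [logsumexp(alpha[yp] + psi(t, y, yp) for yp in range(n_states))
                 for y in range(n_states)]
    return logsumexp(alpha)

def brute_log_z(psi, n_states, T):
    """Exponential-time check: enumerate all n_states**T label sequences."""
    return logsumexp(
        sum(psi(t, seq[t], seq[t - 1] if t else None) for t in range(T))
        for seq in product(range(n_states), repeat=T))
```

The forward pass touches each lattice edge once, while the brute-force version enumerates every path, so agreement between the two on a small example is a useful sanity check for the recursion.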

The difference to our HMM definition is that we now do not calculate probabilities, but unnormalized log-probabilities. In order to estimate the parameters we just have to derive \(ll_D(\vec\lambda)\) and \(\nabla ll_D(\vec\lambda)\):

\[
\begin{aligned}
ll_D(\vec\lambda) &= \sum_{\mathbf{x},\mathbf{y} \in D} \log p_{CRF}(\mathbf{y} \mid \mathbf{x}, \vec\lambda) \\
&= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \sum_i \lambda_i \cdot \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \log Z_{CRF}(\vec\lambda, \mathbf{x})
\end{aligned}
\]

\[
\begin{aligned}
\frac{\partial\, ll_D(\vec\lambda)}{\partial \lambda_i} &= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \sum_{\mathbf{y}'} \sum_t \phi_i(y'_t, y'_{t-1}, \mathbf{x}, t) \cdot p_{CRF}(\mathbf{y}' \mid \mathbf{x}, \vec\lambda) \\
&= \sum_{\mathbf{x},\mathbf{y} \in D} \sum_t \phi_i(y_t, y_{t-1}, \mathbf{x}, t) - \sum_{\mathbf{x},\mathbf{y} \in D} \sum_{t, y', y''} \phi_i(y', y'', \mathbf{x}, t) \cdot p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda)
\end{aligned} \tag{2.55}
\]

where \(p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda)\) denotes the posterior probability of a transition from \(y''\) at position \(t-1\) to \(y'\) at position \(t\). We already discussed the calculation of this probability for the HMM case and can derive a similar formula using FB:

\[
p_{CRF}(y', y'' \mid \mathbf{x}, t, \vec\lambda) = \frac{\sum \big\{\exp \psi(\mathbf{y}) \mid \mathbf{y} \text{ is a path with a } y'' \to y' \text{ transition at } t\big\}}{\sum_{\mathbf{y}} \exp \psi(\mathbf{y})} = \frac{\exp\big(\alpha(t-1, y'') + \psi(y', y'', \mathbf{x}, t) + \beta(t, y')\big)}{Z_{CRF}(\vec\lambda, \mathbf{x})}
\]

where \(\psi(\mathbf{y}) = \sum_t \psi(y_t, y_{t-1}, \mathbf{x}, t)\) denotes the total score of a path.
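Combining the \(\alpha\) and \(\beta\) recursions gives these transition posteriors in code. The sketch below numbers the states \(0, \dots, N-1\), uses a potential `psi(t, y, y_prev)` with `y_prev = None` at \(t = 0\), returns the posterior of a transition between positions \(t\) and \(t+1\), and relies on the fact that the posteriors at a fixed position must sum to one; all names are illustrative:

```python
import math

def logsumexp(vals):
    """Iterated log-space addition: a ⊕ b = log(exp a + exp b)."""
    vals = list(vals)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_backward(psi, n, T):
    """Log-space alpha/beta tables for a sequence of length T."""
    alpha = [[psi(0, y, None) for y in range(n)]]
    for t in range(1, T):
        alpha.append([logsumexp(alpha[-1][yp] + psi(t, y, yp) for yp in range(n))
                      for y in range(n)])
    beta = [[0.0] * n]                        # row for position T-1
    for t in range(T - 1, 0, -1):
        beta.insert(0, [logsumexp(beta[0][y2] + psi(t, y2, y) for y2 in range(n))
                        for y in range(n)])   # prepend row for position t-1
    return alpha, beta

def transition_posterior(psi, n, T, t, y, y2):
    """Posterior of the transition y (at position t) -> y2 (at position t+1)."""
    alpha, beta = forward_backward(psi, n, T)
    log_z = logsumexp(alpha[T - 1])
    return math.exp(alpha[t][y] + psi(t + 1, y2, y) + beta[t + 1][y2] - log_z)
```

Each edge posterior combines the total score of all prefixes ending in \(y\), the potential of the edge itself, and the total score of all suffixes starting in \(y_2\), normalized by the partition function.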

The log-likelihood can then be optimized using the SGD algorithm we discussed for ME classifiers. The runtime of the parameter estimation is dominated by the FB calculations above and is thus in \(O(N^n T)\), where \(N\) denotes the number of output states, \(T\) the length of the sequence, and \(n\) the order of the CRF. CRF training is thus slow when high model orders (\(> 1\)) or big tagsets (\(> 100\)) are used. In Chapter 5 we discuss pruning strategies that yield substantial speed-ups in these particular cases.