
\[
I_{11} = -E\left[\frac{1}{\sigma^2}\left(-2\log(\sigma^2/\mu) - 2\psi(\mu^2/\sigma^2) + 2\log(D) + 3 - \frac{4\mu^2\psi'(\mu^2/\sigma^2)}{\sigma^2}\right)\right]
\]
\[
= \frac{1}{\sigma^2}\left(2\log(\sigma^2/\mu) + 2\psi(\mu^2/\sigma^2) - 2\psi(\mu^2/\sigma^2) + 2\log(\mu/\sigma^2) - 3 + \frac{4\mu^2\psi'(\mu^2/\sigma^2)}{\sigma^2}\right),
\]
\[
I_{12} = I_{21} = -E\left[\frac{1}{\sigma^3}\left(\frac{4\mu^3\psi'(\mu^2/\sigma^2)}{\sigma^2} + 4\mu\psi(\mu^2/\sigma^2) - 4\mu\log(D) + 4\mu\log(\sigma^2/\mu) - 6\mu + 2D\right)\right]
\]
\[
= \frac{1}{\sigma^3}\left(-\frac{4\mu^3\psi'(\mu^2/\sigma^2)}{\sigma^2} - 4\mu\log(\mu/\sigma^2) - 4\mu\log(\sigma^2/\mu) + 4\mu\right),
\]
\[
I_{22} = -E\left[\frac{1}{\sigma^4}\left(6\mu^2\log(D) - 6\mu^2\log(\sigma^2/\mu) + 10\mu^2 - 6\mu^2\psi(\mu^2/\sigma^2) - \frac{4\mu^4\psi'(\mu^2/\sigma^2)}{\sigma^2} - 6\mu D\right)\right]
\]
\[
= \frac{1}{\sigma^4}\left(-6\mu^2\left(\psi(\mu^2/\sigma^2) - \log(\mu/\sigma^2)\right) + 6\mu^2\log(\sigma^2/\mu) - 10\mu^2 + 6\mu^2\psi(\mu^2/\sigma^2) + \frac{4\mu^4\psi'(\mu^2/\sigma^2)}{\sigma^2} + 6\mu^2\right).
\]

We plugged in $E[\log(D)] = \psi(\mu^2/\sigma^2) - \log(\mu/\sigma^2)$, which can be obtained via elementary integral calculations.

We invert the Fisher information matrix and divide by $n$ to obtain for the asymptotic variances of the asymptotic normal distributions of the estimators $\hat\mu_{ML}$ and $\hat\sigma_{ML}$ the values

\[
\operatorname{Var}(\hat\mu_{ML}) \approx \frac{(I^{-1})_{11}(\mu,\sigma)}{n} = \frac{1}{n}\,\frac{I_{22}(\mu,\sigma^2)}{I_{11}(\mu,\sigma^2)\,I_{22}(\mu,\sigma^2) - I_{12}(\mu,\sigma^2)^2} = \frac{\sigma^2}{n},
\]

where we do not show the last equality in detail, and

\[
\operatorname{Var}(\hat\sigma_{ML}) \approx \frac{(I^{-1})_{22}(\mu,\sigma)}{n} = \frac{1}{n}\,\frac{I_{11}(\mu,\sigma^2)}{I_{11}(\mu,\sigma^2)\,I_{22}(\mu,\sigma^2) - I_{12}(\mu,\sigma^2)^2}.
\]

Note that as $\hat\mu_{ML}$ is the sample mean, even its exact variance is given by $\sigma^2/n$.
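The last equality can be checked numerically. The following sketch (in Python, purely as an illustration) evaluates the Fisher information entries with $E[\log(D)]$ and $E[D] = \mu$ plugged in, inverts the matrix, and confirms that $n\operatorname{Var}(\hat\mu_{ML}) = \sigma^2$; the digamma terms cancel within each entry and are therefore omitted, and `trigamma` is a hand-rolled approximation of $\psi'$.

```python
import math

# Numerical check of the asymptotic variance sigma^2/n of mu_hat_ML:
# evaluate the Fisher information entries (E[log D] and E[D] plugged in,
# digamma terms cancel within each entry) and invert the 2x2 matrix.

def trigamma(x):
    """psi'(x) via the recurrence psi'(x) = psi'(x+1) + 1/x^2 plus an
    asymptotic expansion for large arguments."""
    val = 0.0
    while x < 8.0:
        val += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return val + inv + 0.5 * inv2 + inv2 * inv * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def asymptotic_variances(mu, sigma):
    """Return (n*Var(mu_hat), n*Var(sigma_hat)) from the Fisher information."""
    s2 = sigma ** 2
    tg = trigamma(mu ** 2 / s2)
    i11 = (2 * math.log(s2 / mu) + 2 * math.log(mu / s2) - 3
           + 4 * mu ** 2 * tg / s2) / s2
    i12 = (-4 * mu ** 3 * tg / s2 - 4 * mu * math.log(mu / s2)
           - 4 * mu * math.log(s2 / mu) + 4 * mu) / sigma ** 3
    i22 = (6 * mu ** 2 * math.log(mu / s2) + 6 * mu ** 2 * math.log(s2 / mu)
           - 10 * mu ** 2 + 4 * mu ** 4 * tg / s2 + 6 * mu ** 2) / s2 ** 2
    det = i11 * i22 - i12 ** 2
    return i22 / det, i11 / det
```

For any admissible $(\mu, \sigma)$ the first returned value equals $\sigma^2$ up to numerical precision, in line with the display above.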

9.4 Parameter estimation: Intermittent presentation

The parameters of Hidden Markov Models are typically estimated via maximum likelihood.

Prominent approaches are the expectation maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 1977) and direct numerical maximization (MacDonald and Zucchini, 1997). In this study, we focus on the EM algorithm, which in the case of HMMs is called the Baum-Welch algorithm (BWA). In Section 9.4.1 we discuss the BWA and refer for more details to Baum et al. (1970); Dempster et al. (1977); Rabiner (1989); Bilmes (1998). Section 9.4.2 contains a short introduction to the direct numerical maximization idea.

9.4.1 Baum-Welch algorithm

9. A Hidden Markov Model

For the HMM with inverse Gaussian dominance times we aim at estimating the parameter set $\Theta_{HMM} := (\mu_S, \sigma_S, \mu_U, \sigma_U, p_{SS}, p_{UU}, \pi_{start,S})$, and for the HMM with Gamma distributions the parameter set $\Theta_{HMM} := (\mu_S, \sigma_S, \mu_U, p_{SS}, p_{UU}, \pi_{start,S})$ should be estimated. In both cases the likelihood $L(d|\Theta_{HMM})$ given by

\[
L(d|\Theta_{HMM}) = P(D_1^n = d_1^n|\Theta_{HMM}) = \pi_{start}\, E(d_1)\, P\, E(d_2)\, P \cdots P\, E(d_n)\, (1,1)^T \tag{9.10}
\]

is maximized, with $P$ as the transition matrix of the hidden Markov chain with diagonal entries $p_{SS}$ and $p_{UU}$, and $E(d_i)$ as diagonal matrices with the conditional densities $f_{\mu_S,\sigma_S}(d_i)$, $f_{\mu_U,\sigma_U}(d_i)$ on the diagonal (compare, e.g., Bulla (2006)).

The parameter set is estimated with the Baum-Welch algorithm (Baum et al., 1970), which is an iterative instance of the EM algorithm that maximizes the model likelihood locally.

Here, we explain its most important steps (for details see, e.g., Rabiner, 1989) and distinguish between IG and Gamma distributions when updating the emission parameters. A graphical summary of the algorithm can be found in Figure 9.11. In order to avoid computational problems when using very small numbers, we additionally present a scaling technique for the BWA. Moreover, we discuss starting values as well as constraints necessary to obtain reasonable estimates also for subjects with less clear distinction between stable and unstable dominance times.

In the first step of the BWA one applies the so-called forward and backward algorithm.

The forward variable $\alpha_j(i) := \alpha_j(i|\Theta_{HMM}) := P(D_1^i = d_1^i, Y_i = j|\Theta_{HMM})$ is defined as the probability of being in state $j$ at time $i$ and observing the sequence $d_1, d_2, \ldots, d_i$, given the model parameters. The backward variable $\beta_j(i) := \beta_j(i|\Theta_{HMM}) := P(D_{i+1}^n = d_{i+1}^n|Y_i = j, \Theta_{HMM})$ denotes the probability of observing the ending partial sequence $d_{i+1}, d_{i+2}, \ldots, d_n$ given state $j$ at time $i$. Both variables can be derived iteratively as follows (Lemma 9.12 a) + b))

\[
\alpha_j(1) = \pi_{start,j}\, f_{\mu_j,\sigma_j}(d_1) \quad\text{and}\quad \alpha_j(i+1) = f_{\mu_j,\sigma_j}(d_{i+1}) \sum_{k\in\{S,U\}} \alpha_k(i)\, p_{kj} \quad\text{for } i = 1,\ldots,n-1,
\]
\[
\beta_j(n) = 1 \quad\text{and}\quad \beta_j(i) = \sum_{k\in\{S,U\}} p_{jk}\, f_{\mu_k,\sigma_k}(d_{i+1})\, \beta_k(i+1) \quad\text{for } i = n-1,\ldots,1,
\]

where $p_{SU} = 1-p_{SS}$, $p_{US} = 1-p_{UU}$, and $f_{\mu,\sigma}(x)$ denotes the density of the IG or the Gamma distribution with expectation $\mu$ and standard deviation $\sigma$ evaluated at $x$. Note that we suppress the dependence of $\alpha_j(i)$ and $\beta_j(i)$ on the parameter set $\Theta_{HMM}$ for convenience.

The forward and backward variables are used to derive the probability $\gamma_j(i) := \gamma_j(i|\Theta_{HMM}) := P(Y_i = j|D_1^n = d_1^n, \Theta_{HMM})$ of being in state $j$ at time $i$, given the whole sequence $d := (d_1, \ldots, d_n)$ and the parameters $\Theta_{HMM}$ (Lemma 9.12 c)),

\[
\gamma_j(i|\Theta_{HMM}) = \frac{\alpha_j(i)\,\beta_j(i)}{\alpha_S(i)\,\beta_S(i) + \alpha_U(i)\,\beta_U(i)}.
\]

Moreover, we need the probability $\xi_{j,k}(i) := \xi_{j,k}(i|\Theta_{HMM}) := P(Y_i = j, Y_{i+1} = k|D_1^n = d_1^n, \Theta_{HMM})$ of being in state $j$ at time $i$ and in state $k$ at time $i+1$, given the whole data $d$ and the parameters $\Theta_{HMM}$,

\[
\xi_{j,k}(i|\Theta_{HMM}) = \frac{\alpha_j(i)\, p_{jk}\, \beta_k(i+1)\, f_{\mu_k,\sigma_k}(d_{i+1})}{\sum_{j\in\{S,U\}} \sum_{k\in\{S,U\}} \alpha_j(i)\, p_{jk}\, \beta_k(i+1)\, f_{\mu_k,\sigma_k}(d_{i+1})},
\]

which is proven in Lemma 9.12 d).
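The recursions for $\alpha_j(i)$ and $\beta_j(i)$ and the formula for $\gamma_j(i)$ can be sketched as follows. The exponential emission densities and all parameter values below are illustrative stand-ins (our own assumptions), not the IG/Gamma densities or estimates of the text.

```python
import math

# Minimal (unscaled) forward-backward pass for a two-state HMM with
# states S and U; exponential densities stand in for f_{mu,sigma}.

STATES = ("S", "U")

def forward(d, pi, P, f):
    """alpha[i][j] = P(D_1..i = d_1..i, Y_i = j)."""
    alpha = [{j: pi[j] * f[j](d[0]) for j in STATES}]
    for i in range(1, len(d)):
        alpha.append({j: f[j](d[i]) * sum(alpha[-1][k] * P[k][j] for k in STATES)
                      for j in STATES})
    return alpha

def backward(d, P, f):
    """beta[i][j] = P(D_i+1..n = d_i+1..n | Y_i = j)."""
    n = len(d)
    beta = [None] * n
    beta[n - 1] = {j: 1.0 for j in STATES}
    for i in range(n - 2, -1, -1):
        beta[i] = {j: sum(P[j][k] * f[k](d[i + 1]) * beta[i + 1][k] for k in STATES)
                   for j in STATES}
    return beta

def gamma(alpha, beta, i):
    """gamma[j] = P(Y_i = j | whole sequence), as in Lemma 9.12 c)."""
    denom = sum(alpha[i][j] * beta[i][j] for j in STATES)
    return {j: alpha[i][j] * beta[i][j] / denom for j in STATES}
```

A convenient consistency check: for every $i$, $\sum_j \alpha_j(i)\beta_j(i)$ equals the likelihood $\alpha_S(n) + \alpha_U(n)$, and the $\gamma_j(i)$ sum to one.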

To iteratively derive the parameter estimates, the BWA applies expectation maximization as follows. Let $\Theta^{(m)}_{HMM}$ denote the parameter estimates after the $m$-th iteration step, and let $\mathcal{Y}$ denote the set of all possible state sequences of the hidden Markov chain. Let $Y = (Y_1, \ldots, Y_n)$ denote a $\mathcal{Y}$-valued random variable and $y = (y_1, \ldots, y_n)$ a realization of $Y$. In the E-step (Figure 9.11) the Q-function (e.g., Ephraim and Merhav, 2002) over $\mathcal{Y}$,

\[
Q := Q(\Theta_{HMM}|\Theta^{(m)}_{HMM}) := E\left[\log L(d, y|\Theta_{HMM}) \,\middle|\, D = d, \Theta^{(m)}_{HMM}\right] = \sum_{y\in\mathcal{Y}} \log L(d, y|\Theta_{HMM})\, P(Y = y|d, \Theta^{(m)}_{HMM}),
\]

i.e., the expectation of the complete-data log-likelihood $L$ across all possible paths $y \in \mathcal{Y}$, is derived. In the M-step the updated parameter set $\Theta^{(m+1)}_{HMM}$ is chosen such that it maximizes $Q$.

These iterative steps are repeated until a desired level of convergence is reached. Here, we stop the algorithm if the improvement in the log-likelihood from the last iteration to the present one is smaller than $\delta_{stop} = 0.005$ or if 1000 iterations were computed (compare also page 98).

Figure 9.11: The Baum-Welch algorithm as Expectation-Maximization algorithm (flowchart: choose $\Theta^{(0)}$; E-step: derive the Q-function using the current parameter estimate $\Theta^{(m)}$; M-step: compute the parameter estimate $\Theta^{(m+1)}$ maximizing $Q$; set $m := m+1$ and repeat until converged; final estimate $\Theta^{(m)}$). The steps are explained in detail in the main text. Note that we set $\Theta := \Theta_{HMM}$ in the graph.

The following Lemma 9.11 – which shows that maximizing the Q-function is equivalent to maximizing the likelihood-function – is essential for the correctness of the Baum-Welch algorithm.


Lemma 9.11. Maximization of the Q-function. It holds

\[
Q\left(\Theta_{HMM}|\Theta^{(m)}_{HMM}\right) \ge Q\left(\Theta^{(m)}_{HMM}|\Theta^{(m)}_{HMM}\right) \;\Rightarrow\; L(D_1^n = d_1^n|\Theta_{HMM}) \ge L\left(D_1^n = d_1^n|\Theta^{(m)}_{HMM}\right).
\]

Moreover,

\[
Q\left(\Theta_{HMM}|\Theta^{(m)}_{HMM}\right) = Q\left(\Theta^{(m)}_{HMM}|\Theta^{(m)}_{HMM}\right) \;\Leftrightarrow\; L\left(D_1^n = d_1^n|\Theta^{(m)}_{HMM}\right) = L(D_1^n = d_1^n|\Theta_{HMM}).
\]

Proof: See Ephraim and Merhav (2002).

Next, we show in detail how to update the parameters in the $(m+1)$-st step. For a fixed state sequence $y = (y_1, \ldots, y_n)$ the log-likelihood of the data is

\[
\log L(d, y|\Theta_{HMM}) = \log \pi_{start,y_1} + \log f_{\mu_{y_1},\sigma_{y_1}}(d_1) + \sum_{i=2}^{n} \left(\log(p_{y_{i-1}y_i}) + \log(f_{\mu_{y_i},\sigma_{y_i}}(d_i))\right).
\]

Insertion into $Q$ yields (e.g., Ephraim and Merhav, 2002)

\begin{align*}
Q\left(\Theta_{HMM}|\Theta^{(m)}_{HMM}\right) ={}& \sum_{y_1\in\{S,U\}} \log \pi_{start,y_1}\, P(Y_1 = y_1|d, \Theta^{(m)}_{HMM}) \\
&+ \sum_{i=2}^{n} \sum_{y_{i-1}\in\{S,U\}} \sum_{y_i\in\{S,U\}} \log p_{y_{i-1}y_i}\, P(Y_{i-1} = y_{i-1}, Y_i = y_i|d, \Theta^{(m)}_{HMM}) \\
&+ \sum_{i=1}^{n} \sum_{y_i\in\{S,U\}} \log f_{\mu_{y_i},\sigma_{y_i}}(d_i)\, P(Y_i = y_i|d, \Theta^{(m)}_{HMM}). \tag{9.11}
\end{align*}

Note that the first line depends only on the initial distribution $\pi_{start}$, the second line depends on the transition probabilities, and the third line depends on the parameters of the IG or the Gamma distributions. Therefore, iterative parameter estimation separately maximizes these terms. Note further that we can rewrite $P(Y_i = y_i|d, \Theta^{(m)}_{HMM}) = \gamma_{y_i}(i|\Theta^{(m)}_{HMM})$ and $P(Y_{i-1} = y_{i-1}, Y_i = y_i|d, \Theta^{(m)}_{HMM}) = \xi_{y_{i-1}y_i}(i-1|\Theta^{(m)}_{HMM})$, which yields the following estimates in the $(m+1)$-st iteration step.

Using the Lagrange multiplier $\Gamma$ with the constraint $\sum_j \pi_{start,j} = 1$ and setting the derivative with respect to $\pi_{start,j}$ to zero, we obtain

\[
\frac{P(Y_1 = j|d, \Theta^{(m)}_{HMM})}{\pi_{start,j}} + \Gamma = 0.
\]

Multiplying with $\pi_{start,j}$, summing over $j$ to get $\Gamma$, and solving for $\pi_{start,j}$, we arrive at

\[
\hat\pi^{(m+1)}_{start,j} = P(Y_1 = j|d, \Theta^{(m)}_{HMM}) = \gamma_j(1).
\]

Alternatively to this procedure, and in order to reduce the number of parameters, we assume that the HMM starts in its stationary distribution $(\pi_S, 1-\pi_S) = (p_{UU}-1,\, p_{SS}-1)/(p_{SS}+p_{UU}-2)$ (Corollary 10.8). Under this assumption the initial distribution for the stable state is updated in the $(m+1)$-th step of the BWA by

\[
\hat\pi^{(m+1)}_{start,S} = \frac{\hat p^{(m+1)}_{UU} - 1}{\hat p^{(m+1)}_{SS} + \hat p^{(m+1)}_{UU} - 2} \tag{9.12}
\]

and for the unstable state we obtain $\hat\pi^{(m+1)}_{start,U} = 1 - \hat\pi^{(m+1)}_{start,S}$.
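A quick numerical check that the expression in (9.12) indeed yields the stationary distribution of the hidden chain; the concrete transition probabilities below are arbitrary illustrative values.

```python
# Check of the stationary distribution: for a two-state chain with
# diagonal entries p_SS and p_UU,
# pi = (p_UU - 1, p_SS - 1) / (p_SS + p_UU - 2) satisfies pi = pi * P.

def stationary(p_ss, p_uu):
    denom = p_ss + p_uu - 2.0  # negative for a proper two-state chain
    return ((p_uu - 1.0) / denom, (p_ss - 1.0) / denom)

def step(pi, p_ss, p_uu):
    # one multiplication pi * P with P = [[p_ss, 1-p_ss], [1-p_uu, p_uu]]
    return (pi[0] * p_ss + pi[1] * (1.0 - p_uu),
            pi[0] * (1.0 - p_ss) + pi[1] * p_uu)
```

For example, `stationary(0.9, 0.6)` gives `(0.8, 0.2)`, which is left unchanged by one transition step.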

For the entries of the transition matrix, Lagrange maximization of the second line of $Q$ with the constraints $p_{SS} + p_{SU} = 1$ and $p_{UU} + p_{US} = 1$ analogously yields

\[
\hat p^{(m+1)}_{jk} = \frac{\sum_{i=2}^{n} \xi_{j,k}(i-1|\Theta^{(m)}_{HMM})}{\sum_{i=2}^{n} \gamma_j(i-1|\Theta^{(m)}_{HMM})} \quad\text{for } j, k \in \{S, U\}.
\]

Now, we investigate the term in the last line of (9.11) of the Q-function (which we term $Q_{HMM}(\Theta_{HMM}|\Theta^{(m)}_{HMM})$) and distinguish between the assumption of inverse Gaussian and Gamma-distributed dominance times.

Parameter estimation for IG distributions

In the case of inverse Gaussian distributed observations we obtain $Q_{HMM}(\Theta_{HMM}|\Theta^{(m)}_{HMM})$ by inserting the IG densities into the last line of (9.11). We maximize the first and the second line of the resulting display (for the third and fourth line all calculations can be done similarly). Differentiating partially with respect to $\mu_S$ gives (9.13), and with respect to $\sigma_S$ gives (9.14). Setting (9.13) to zero yields the weighted mean

\[
\hat\mu^{(m+1)}_S = \frac{\sum_{i=1}^{n} \gamma_S(i)\, d_i}{\sum_{i=1}^{n} \gamma_S(i)}.
\]

We plug this estimator in (9.14) and obtain

\[
\hat\sigma^{(m+1)}_S = \sqrt{\frac{\hat\mu_S^3}{\sum_{i=1}^{n} \gamma_S(i)} \sum_{i=1}^{n} \gamma_S(i)\left(\frac{1}{d_i} - \frac{1}{\hat\mu_S}\right)}. \tag{9.15}
\]

For the unstable dominance times we obtain similar results,

\[
\hat\mu^{(m+1)}_U = \frac{\sum_{i=1}^{n} \gamma_U(i)\, d_i}{\sum_{i=1}^{n} \gamma_U(i)} \quad\text{and}\quad \hat\sigma^{(m+1)}_U = \sqrt{\frac{\hat\mu_U^3}{\sum_{i=1}^{n} \gamma_U(i)} \sum_{i=1}^{n} \gamma_U(i)\left(\frac{1}{d_i} - \frac{1}{\hat\mu_U}\right)}. \tag{9.16}
\]


Parameter estimation for Gamma distributions

Assuming Gamma-distributed stable dominance times and exponentially distributed dominance times in the unstable state and substituting $p_S = \mu_S^2/\sigma_S^2$, $\theta_S = \mu_S/\sigma_S^2$ (compare Remark 2.3), we obtain $Q_{HMM}(\Theta_{HMM}|\Theta^{(m)}_{HMM})$ by inserting the corresponding densities into the last line of (9.11).

In the latter display we made use of the weighted means

\[
\bar d^w_S := \frac{\sum_{i=1}^{n} \gamma_S(i)\, d_i}{\sum_{i=1}^{n} \gamma_S(i)} \quad\text{and}\quad \overline{\log d}^w_S := \frac{\sum_{i=1}^{n} \gamma_S(i) \log d_i}{\sum_{i=1}^{n} \gamma_S(i)}.
\]

Differentiating $Q_{HMM}(\Theta_{HMM}|\Theta^{(m)}_{HMM})$ partially with respect to $\mu_U$ and setting the derivative to zero yields the well-known ML estimator for the Exponential distribution,

\[
\hat\mu^{(m+1)}_U = \frac{\sum_{i=1}^{n} \gamma_U(i)\, d_i}{\sum_{i=1}^{n} \gamma_U(i)} =: \bar d^w_U.
\]

Partially differentiating $Q_{HMM}(\Theta_{HMM}|\Theta^{(m)}_{HMM})$ with respect to $\theta_S$ and setting the derivative to zero gives

\[
\hat\theta_S = \hat p_S / \bar d^w_S.
\]

Finding an approximate ML estimate of $p_S$ is more tricky. Again following Minka (2002) (compare Section 9.3.2) and applying a "generalized Newton" principle, we update $\hat p_S$ iteratively via

\[
\frac{1}{\hat p_S^{new}} = \frac{1}{\hat p_S} + \frac{\overline{\log d}^w_S - \log \bar d^w_S + \log \hat p_S - \psi(\hat p_S)}{\hat p_S^2\left(1/\hat p_S - \psi'(\hat p_S)\right)}
\]

until the change in $\hat p_S$ gets sufficiently small. As starting value,

\[
\hat p_S = \frac{1}{2\left(\log \bar d^w_S - \overline{\log d}^w_S\right)}
\]

is used (Minka, 2002).
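For the unweighted case (all weights $\gamma_S(i)$ equal to one, an assumption made purely to keep the sketch short), the generalized Newton update can be sketched as follows; the `digamma` and `trigamma` helpers are hand-rolled approximations and the update formula follows Minka (2002).

```python
import math

# Sketch of Minka's (2002) generalized Newton update for the Gamma shape
# parameter, unweighted case (all gamma_S(i) = 1).

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x plus an
    asymptotic expansion for large arguments."""
    val = 0.0
    while x < 8.0:
        val -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return val + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x) via recurrence plus asymptotic expansion."""
    val = 0.0
    while x < 8.0:
        val += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return val + inv + 0.5 * inv2 + inv2 * inv * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def gamma_shape(d, tol=1e-10, max_iter=100):
    """Approximate ML estimate of the Gamma shape p via generalized Newton."""
    mean_d = sum(d) / len(d)
    mean_log = sum(math.log(x) for x in d) / len(d)
    s = math.log(mean_d) - mean_log          # s > 0 by Jensen's inequality
    p = 0.5 / s                              # starting value as in the text
    for _ in range(max_iter):
        num = mean_log - math.log(mean_d) + math.log(p) - digamma(p)
        p_new = 1.0 / (1.0 / p + num / (p * p * (1.0 / p - trigamma(p))))
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p
```

At the ML solution, $\log \hat p - \psi(\hat p) = \log \bar d - \overline{\log d}$ holds, which makes for a simple check of convergence; the rate estimate then follows as $\hat\theta = \hat p/\bar d$.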

The ML estimators for $\mu_S$ and $\sigma_S$ are obtained by reparametrization,

\[
\hat\mu^{(m+1)}_S = \frac{\hat p_S}{\hat\theta_S} = \bar d^w_S \quad\text{and}\quad \hat\sigma^{(m+1)}_S = \frac{\sqrt{\hat p_S}}{\hat\theta_S}.
\]

The next lemma states that the derivations of the forward and backward variables as well as of $\gamma_j(i)$ and $\xi_{j,k}(i)$ are correct (recall page 91).

Lemma 9.12. Correctness of the BWA. It holds for $j \in \{S, U\}$

a) $\alpha_j(1) = \pi_{start,j}\, f_{\mu_j,\sigma_j}(d_1)$ and $\alpha_j(i+1) = f_{\mu_j,\sigma_j}(d_{i+1}) \sum_{k\in\{S,U\}} \alpha_k(i)\, p_{kj}$ for $i = 1,\ldots,n-1$,

b) $\beta_j(n) = 1$ and $\beta_j(i) = \sum_{k\in\{S,U\}} p_{jk}\, f_{\mu_k,\sigma_k}(d_{i+1})\, \beta_k(i+1)$ for $i = n-1,\ldots,1$,

c) $\gamma_j(i) = \dfrac{\alpha_j(i)\,\beta_j(i)}{\alpha_S(i)\,\beta_S(i) + \alpha_U(i)\,\beta_U(i)}$,

d) $\xi_{j,k}(i) = \dfrac{\alpha_j(i)\, p_{jk}\, \beta_k(i+1)\, f_{\mu_k,\sigma_k}(d_{i+1})}{\sum_{j\in\{S,U\}} \sum_{k\in\{S,U\}} \alpha_j(i)\, p_{jk}\, \beta_k(i+1)\, f_{\mu_k,\sigma_k}(d_{i+1})}$.

Proof: a) The claim is shown inductively, with the case $i = 1$ being trivial. For $i \to i+1$ it holds

\begin{align*}
\alpha_j(i+1) &= P(D_1^{i+1} = d_1^{i+1}, Y_{i+1} = j|\Theta_{HMM}) \\
&= P(D_{i+1} = d_{i+1}|D_1^i = d_1^i, Y_{i+1} = j, \Theta_{HMM})\, P(D_1^i = d_1^i, Y_{i+1} = j|\Theta_{HMM}) \\
&= f_{\mu_j,\sigma_j}(d_{i+1}) \sum_{k\in\{S,U\}} P(D_1^i = d_1^i, Y_i = k|\Theta_{HMM})\, P(Y_{i+1} = j|Y_i = k, \Theta_{HMM}) \\
&= f_{\mu_j,\sigma_j}(d_{i+1}) \sum_{k\in\{S,U\}} \alpha_k(i)\, p_{kj},
\end{align*}

where in the third line the conditional independence and Markov property have been applied.

In the fourth line, the definitions of $\alpha_k(i)$ and $p_{kj}$ have been plugged in.

b), c) and d) follow by similar elementary calculations using the Markov and independence properties of the HMM.


Computational issues of the BWA: Scaling

Note that $\alpha_j(i)$ essentially is the sum of terms, each being a product

\[
\left(\prod_{k=1}^{i-1} p_{y_k,y_{k+1}}\right)\left(\prod_{k=1}^{i-1} f_{\mu_{y_k},\sigma_{y_k}}(d_k)\right).
\]

All terms with $p$ are smaller than one and are often even close to zero. Moreover, the terms with $f$ are typically close to zero. Thus, with increasing $i$ the forward variable $\alpha_j(i)$ heads to zero, which leads to computational problems. Similar problems are observable for the backward variable $\beta_j(i)$. Scaling offers a solution here. We need to find a scaling coefficient $c(i)$ depending only on $i$ (and not on $j$) that is multiplied with $\alpha_j(i)$ and $\beta_j(i)$ in each step and cancels out at the end of the computation.
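The underflow problem can be illustrated with a short sketch (the density values are arbitrary illustrative numbers, not data from the text): the raw product of many small density values reaches exactly zero in double precision, while the corresponding sum of logarithms remains perfectly representable.

```python
import math

# Why scaling is needed: a product of many per-observation density values
# underflows in double precision, while the sum of logs does not.

density_values = [1e-3] * 200   # stand-ins for f(d_i), each well above zero

prod = 1.0
for v in density_values:
    prod *= v                   # underflows to exactly 0.0

log_sum = sum(math.log(v) for v in density_values)  # about -1381.55, no problem
```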

We follow the ideas of Rabiner (1989); Turner (2008) and set $c(i) := \alpha_S(i) + \alpha_U(i)$, and divide the unscaled values of $\alpha_j(i)$ and $\beta_j(i)$ in each step of the forward and backward algorithm by $c(i)$ to obtain normalized values $\tilde\alpha_j(i), \tilde\beta_j(i)$. Formally,

\[
\alpha_j(1) := \pi_{start,j}\, f_{\mu_j,\sigma_j}(d_1), \quad c(i) := \alpha_S(i) + \alpha_U(i), \quad \tilde\alpha_j(i) := \alpha_j(i)/c(i),
\]
\[
\alpha_j(i) := f_{\mu_j,\sigma_j}(d_i) \sum_{k\in\{S,U\}} \tilde\alpha_k(i-1)\, p_{kj} \quad\text{for } i = 2,\ldots,n,
\]
\[
\tilde\beta_j(n) := 1/c(n) \quad\text{and}\quad \tilde\beta_j(i) := \sum_{k\in\{S,U\}} p_{jk}\, f_{\mu_k,\sigma_k}(d_{i+1})\, \tilde\beta_k(i+1)/c(i) \quad\text{for } i = n-1,\ldots,1.
\]

To derive $\gamma_j(i)$ and $\xi_{j,k}(i)$, and consequently to update parameters, we always use the normalized values $\tilde\alpha_j(i), \tilde\beta_j(i)$ in the practical implementation (instead of $\alpha_j(i), \beta_j(i)$). Note, however, that, as it was intended, the scaling cancels out in the derivation of $\gamma_j(i)$ and $\xi_{j,k}(i)$, and therefore the resulting estimates in each iteration step are identical for the unscaled and the scaled version of the BWA. We show this for $\gamma_j(i)$ and refer for more details to Rabiner (1989).

It holds

\begin{align*}
\gamma_j(i) &= \frac{\tilde\alpha_j(i)\,\tilde\beta_j(i)}{\tilde\alpha_S(i)\,\tilde\beta_S(i) + \tilde\alpha_U(i)\,\tilde\beta_U(i)} \\
&= \frac{\prod_{k=1}^{i}(1/c_k)\,\alpha_j(i)\,\prod_{k=i+1}^{n}(1/c_k)\,\beta_j(i)}{\prod_{k=1}^{i}(1/c_k)\,\alpha_S(i)\,\prod_{k=i+1}^{n}(1/c_k)\,\beta_S(i) + \prod_{k=1}^{i}(1/c_k)\,\alpha_U(i)\,\prod_{k=i+1}^{n}(1/c_k)\,\beta_U(i)} \\
&= \frac{\left(\prod_{k=1}^{n}(1/c_k)\right)\alpha_j(i)\,\beta_j(i)}{\left(\prod_{k=1}^{n}(1/c_k)\right)\left(\alpha_S(i)\,\beta_S(i) + \alpha_U(i)\,\beta_U(i)\right)} = \frac{\alpha_j(i)\,\beta_j(i)}{\alpha_S(i)\,\beta_S(i) + \alpha_U(i)\,\beta_U(i)}.
\end{align*}

The likelihood then derives as (Turner, 2008)

\[
L(d|\Theta_{HMM}) = \alpha_S(n) + \alpha_U(n) = \prod_{i=1}^{n} c(i)\left(\tilde\alpha_S(n) + \tilde\alpha_U(n)\right) = \prod_{i=1}^{n} c(i),
\]

yielding

\[
\ell(d|\Theta_{HMM}) = \log L(d|\Theta_{HMM}) = \sum_{i=1}^{n} \log(c(i)).
\]
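The scaled forward pass and the identity $\ell(d|\Theta_{HMM}) = \sum_i \log(c(i))$ can be sketched as follows. Exponential emission densities and all parameter values are illustrative stand-ins (our own assumptions); for tiny $n$ the result can even be compared against brute-force summation over all $2^n$ state paths.

```python
import math
from itertools import product

# Scaled forward pass: log L(d|Theta) = sum_i log c(i).
# Exponential densities stand in for the IG/Gamma densities of the text.

STATES = ("S", "U")
MU = {"S": 20.0, "U": 3.0}
P = {"S": {"S": 0.8, "U": 0.2}, "U": {"S": 0.3, "U": 0.7}}
PI = {"S": 0.6, "U": 0.4}

def f(state, x):
    return math.exp(-x / MU[state]) / MU[state]

def scaled_loglik(d):
    alpha = {j: PI[j] * f(j, d[0]) for j in STATES}
    alpha_t = {}
    loglik = 0.0
    for i in range(len(d)):
        if i > 0:
            alpha = {j: f(j, d[i]) * sum(alpha_t[k] * P[k][j] for k in STATES)
                     for j in STATES}
        c = alpha["S"] + alpha["U"]           # scaling coefficient c(i)
        alpha_t = {j: alpha[j] / c for j in STATES}
        loglik += math.log(c)
    return loglik

def brute_force_loglik(d):
    """Sum the complete-data likelihood over all 2^n state paths (tiny n only)."""
    total = 0.0
    for path in product("SU", repeat=len(d)):
        p = PI[path[0]] * f(path[0], d[0])
        for i in range(1, len(d)):
            p *= P[path[i - 1]][path[i]] * f(path[i], d[i])
        total += p
    return math.log(total)
```

Both functions agree up to floating-point precision, mirroring the cancellation argument above.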

Recall that the stopping rule for the BWA we use here is defined as: stop the BWA if the improvement in the log-likelihood from the last iteration to the present one is smaller than $\delta_{stop} = 0.005$ or if 1000 iterations were computed.


Starting values and constraints

As starting values $\mu_S^{(s)}, \sigma_S^{(s)}, \mu_U^{(s)}, \sigma_U^{(s)}, p_{SS}^{(s)}, p_{UU}^{(s)}$ for the Baum-Welch algorithm we chose, in correspondence with the data set, $p_{SS}^{(s)} = p_{UU}^{(s)} = 0.5$, $\mu_U^{(s)} = 4$, $\sigma_U^{(s)} = 5$ (assuming inverse Gaussian distributed dominance times). In order to reduce the probability that the Baum-Welch algorithm is captured in a local extremum, we chose ten equidistant values for $\mu_S^{(s)}$ ranging between 60 and $0.95 \max_i d_i$, and for each value of $\mu_S^{(s)}$ ten equidistant values for $\sigma_S^{(s)}$ between 10 and $1.1\mu_S^{(s)}$. Very irregular stable distributions with a CV larger than 1.1 are not reasonable because then the stable and unstable dominance times are not separated clearly. Moreover, a mean of stable dominance times larger than the maximum length of dominance times is not reasonable. Out of the resulting one hundred sets of parameter estimates we chose the parameter set with the highest log-likelihood (satisfying also the constraints A)-C) below).

If the response pattern shows only dominance times larger than 30 seconds, we reduce the model to the stable phase. The parameters $\mu_S$ and $\sigma_S$ are derived by ML as described in Section 9.3.1.1, and we set $p_{SS} := 1$. If all dominance times are smaller than 30 seconds, we only estimate $\mu_U$ and $\sigma_U$ by ML and use $p_{UU} := 1$.

For subjects with relatively clear distinction between long and short dominance times this procedure yields reasonable estimates. For subjects with less clear distinction, we added the following constraints based on the idea that short dominance times should not affect estimation of the stable parameters and long dominance times should not affect estimation of unstable parameters. Note that in continuous presentation where only one state exists, about 90% of the dominance times are shorter than 15 seconds, while only about two percent are larger than 30 seconds. Therefore, we require the following conditions

A) $\hat\sigma_S > 1$, B) $\hat\mu_S \ge 0.98\,\hat\mu_{15}$, C) $\hat\mu_S < 1.02\,\hat\mu_{75}$,

with

\[
\hat\mu_k := \frac{1}{\sum_{i=1}^{n} \mathbf{1}_{d_i > k}} \sum_{i=1}^{n} d_i\, \mathbf{1}_{d_i > k}
\]

if any dominance time is larger than $k$ seconds, and $\hat\mu_k := k$ else. A) prevents that just the largest dominance time is estimated as stable and all others are categorized as unstable (which may increase the likelihood). B) prevents dominance times smaller than 15 seconds from being considered for the estimation of $\mu_S$. Third, we require C) such that rather stable dominance times longer than 75 seconds are not classified as unstable.
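The truncated means $\hat\mu_k$ used in constraints B) and C) can be computed with a small helper; the function name `mu_hat` is our own, hypothetical choice.

```python
# Hypothetical helper for constraints B) and C): the mean of all
# dominance times larger than k seconds, falling back to k if none exist.

def mu_hat(d, k):
    tail = [x for x in d if x > k]
    return sum(tail) / len(tail) if tail else float(k)
```

Constraint B) then reads `mu_S_est >= 0.98 * mu_hat(d, 15)` and C) reads `mu_S_est < 1.02 * mu_hat(d, 75)`.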

Instead of rejecting the result of the BWA for a given set of starting values when the conditions A)-C) are not fulfilled, one may also stop the updating procedure immediately when parameters not satisfying the constraints are estimated, and then take the last parameters inside the parameter range as the result of the BWA. This is slightly less robust but leads for the great majority of cases to the same results and has comparable estimation precision properties.

For the HMM with Gamma-distributed dominance times, we use as starting value for the Exponential distribution $\mu_U^{(s)} = 5$. All other starting values and the constraints are identical to the inverse Gaussian HMM.

To derive confidence intervals for the HMM parameters, (block) bootstrap approaches are conceivable (e.g., Efron and Tibshirani, 1994; Scholz, 2007).


IG distribution: UMVU inspired estimators

As explained in Section 9.3.1, the ML estimator of $\sigma$ for the IG distribution is biased. Hence, the BWA estimates of the standard deviations in the stable and the unstable state are also biased, as they are based on the ML principle. Applying UMVU estimators would lead to unbiased estimators of $\sigma_S$ and $\sigma_U$. However, note that the corresponding ML estimators are a kind of weighted means (equations (9.15) and (9.16)), as

\[
\hat\sigma^{(m+1)}_j = \sqrt{\frac{\hat\mu_j^3}{\sum_{i=1}^{n} \gamma_j(i)} \sum_{i=1}^{n} \gamma_j(i)\left(\frac{1}{d_i} - \frac{1}{\hat\mu_j}\right)}
\]

for $j \in \{S, U\}$ and with $\gamma_j(i)$ as the probability of being in state $j$ at time $i$ given the estimated parameters and all observations (resulting from the BWA). The derivation of UMVU estimators for $\hat\sigma_S$ and $\hat\sigma_U$ being weighted means is an open question. Here, we give a first idea how UMVU inspired estimators may be included in the BWA without claiming theoretical correctness. We just apply an intuitive idea.

The usual BWA estimators $\hat\Theta_{HMM}$ are used as initial points for the UMVU inspired estimation. Define

\[
v_j := \sum_{i=1}^{n} \gamma_j(i)\left(\frac{1}{d_i} - \frac{1}{\hat\mu_j}\right) \quad\text{and}\quad \tilde n_j := \frac{\left(\sum_{i=1}^{n} \gamma_j(i)\right)^2}{\sum_{i=1}^{n} \gamma_j(i)^2}.
\]

$v_j$ plays the role of $v$ in the traditional UMVU estimation (compare equation (9.5)). The weighting factor $\tilde n_j$ is motivated by the variance of a random variable $Z := \sum w_i X_i / \sum w_i$, where $w_i \ge 0$ are weights and the $X_i$ are i.i.d. with variance $\sigma^2$. It holds $\operatorname{Var}(Z) = \sum w_i^2 \sigma^2 / (\sum w_i)^2$, as can be shown by a short derivation. In our case the $X_i$ are the $1/d_i - 1/\hat\mu_j$ and $w_i$ is $\gamma_j(i)$. Inspired by the UMVU estimator for $\sigma$ (Corollary 9.10) we define the UMVU inspired estimator for $\sigma_j$, $j \in \{S, U\}$, as

\[
\hat\sigma^{UMVU}_j := \frac{\Gamma((\tilde n_j - 1)/2)}{2\,\Gamma(\tilde n_j/2)}\left(\hat\mu_j^3 v_j\right)^{1/2} F\!\left(\frac{1}{4}, \frac{3}{4}; \frac{\tilde n_j}{2}; -\frac{\hat\mu_j v_j}{\tilde n_j}\right). \tag{9.17}
\]

The estimates of $\mu_S, \mu_U, p_{SS}, p_{UU}$ remain unchanged. Thus, the only difference compared to the traditional Baum-Welch algorithm is that we re-estimate the standard deviations at the end of the estimation procedure using equation (9.17). In case of only stable or only unstable dominance times, we use the usual UMVU estimator of $\sigma$ (given in Corollary 9.10). In Section 9.6.3.2, we compare the bias of the traditional BWA and the UMVU inspired estimates of $\sigma_S$ and $\sigma_U$ empirically.

Remark on the estimation implementation

The estimation of model parameters speeds up remarkably when outsourcing parts of the code from the statistical package R to the widely used programming language C++. When estimating the Hidden Markov Model, the forward and backward algorithm are typically performed in loops, which are not recommended in R. Therefore, we suggest performing these algorithms in C++, using the weights of the inverse Gaussian or the Gamma distribution for each data point as input.