
DISSERTATIONES MATHEMATICAE UNIVERSITATIS TARTUENSIS
136

JOONAS SOVA

Pairwise Markov Models

Tartu 2021

ISSN 1024-4212


Institute of Mathematics and Statistics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia.

Dissertation has been accepted for the commencement of the degree of Doctor of Philosophy (PhD) in mathematical statistics on 18th of June, 2021, by the Council of the Institute of Mathematics and Statistics, Faculty of Science and Technology, University of Tartu.

Supervisor:

Prof. Jüri Lember
Institute of Mathematics and Statistics, University of Tartu, Estonia

Opponents:

Prof. Pavel Chigansky
Faculty of Social Sciences, Hebrew University of Jerusalem, Israel

Prof. Evgeny Verbitskiy
Mathematical Institute, Leiden University, Netherlands

The defense will take place on the 25th of August, 2021, at 14:15 in Narva mnt 18-1007, Tartu, Estonia.

ISSN 1024-4212

ISBN 978-9949-03-663-9 (print)
ISBN 978-9949-03-664-6 (pdf)
Copyright: Joonas Sova, 2021

University of Tartu Press
http://www.tyk.ee


Contents

Acknowledgments
Publications
Introduction
1 Pairwise Markov models
2 Theoretical framework and notation
3 Viterbi classifier
4 Infinite Viterbi path
5 Summary of Paper I
6 Summary of Paper II
7 Summary of Paper III
Concluding remarks
Appendix A Reversed Viterbi algorithm
References
Eestikeelne sisukokkuvõte
Publications
Curriculum Vitae (English)
Curriculum Vitae (eesti keeles)


Acknowledgments

My deepest gratitude goes to my supervisor Jüri Lember. Without his immense dedication and passion for science, tireless work ethic and endless patience as a teacher, this dissertation would never have been possible. I would also like to thank the Institute of Mathematics and Statistics of the University of Tartu for providing the funding that allowed me to complete this work. Finally, I would like to thank my family, and my colleagues both from academia and industry, for their support and encouragement.


Publications

This dissertation is based on the following articles:

[I] J. Lember and J. Sova. "Existence of infinite Viterbi path for pairwise Markov models". Stochastic Processes and their Applications 130.3 (2020), pp. 1388–1425.

[II] J. Lember and J. Sova. "Regenerativity of Viterbi process for pairwise Markov models". Journal of Theoretical Probability 34.1 (2021), pp. 1–33.

[III] J. Lember and J. Sova. "Exponential forgetting of smoothing distributions for pairwise Markov models". Electronic Journal of Probability (2021), to appear.

The author's contribution to all three papers consists in working jointly with the co-author to develop the theory and write the text for publication. Publication of this dissertation has been supported by the Estonian Research Council grant PRG865.


Introduction

Markovian latent variable models are a great success story of modern statistics.

Nowadays data for which the classic assumption of independence does not hold are increasingly prevalent, and so the classic statistical methodology fails to be effective. In contrast, Markovian latent variable models have been shown to provide efficient and highly adaptable methodologies for dealing with various types of statistical problems involving complex and inter-dependent data. Here we explore a wide class of such models, namely the "pairwise Markov models" (PMM's). A PMM is simply defined as a latent variable model in which the latent (or hidden) layer and the observed layer together constitute a Markov chain. As such, the PMM encompasses several well-known and widely applied models, like hidden Markov models, autoregressive switching models, hidden Markov models with dependent noise and many more.

The purpose of this thesis is to give an overview of the key results in the three papers listed above, all of which investigate certain aspects of the PMM.

A common danger in papers of a technical nature is that the main driving ideas are overshadowed by technical minutiae. Here my main goal is to present the key ideas and results of the three papers as accessibly as possible, while "hiding" the more technical aspects of the proofs, as well as some of the results which are less significant in terms of scientific contribution or novelty.

In Paper I the main subject of interest is the Viterbi path – the maximum likelihood estimate of the hidden layer of the PMM. The question of the stability of the Viterbi path is non-trivial, because adding a single observation to our sample can in theory change the whole path estimate. We study the asymptotic path-convergence of the Viterbi classifier on several levels of abstraction – starting from the general PMM up to examples of specific parametrized models. In particular, we prove that under general conditions it is possible to extend the Viterbi path estimate to infinity. This in turn ensures the existence of the infinite Viterbi encoding of the observation sequence, also termed the "Viterbi process".

Paper II continues the study of the stability of the Viterbi classifier beyond the question of path-convergence itself. More specifically, we show that based on concepts developed in Paper I it is possible to construct a sequence of regeneration times for the PMM such that the Viterbi process depends only on the observations up to each regeneration time. This in turn enables us to derive strong laws of large numbers and central limit theorems for the Viterbi classifier.

The subject matter of Paper III departs from the previous two papers and is no longer related to Viterbi estimation. Rather, we study certain forgetting properties of the smoothing probabilities of the PMM. The main novelty here is the condition which ensures the exponential forgetting rate for the smoothing probabilities. In fact, we demonstrate through several examples how this condition is much more lenient than several other known conditions for similar purposes in the setting of finite hidden state space. Interestingly, this same condition is also prominent in Paper I for ensuring the existence of the Viterbi process, even though its application there is completely different.

The structure of the thesis is as follows: Section 1 gives some background information on pairwise Markov models, including hidden Markov models; Section 2 introduces the notation and the theoretical framework that is used throughout the thesis; Section 3 gives some background information on Viterbi estimation and the Viterbi classifier; Section 4 introduces several necessary concepts and definitions that are used for the study of the stability of the Viterbi classifier in Papers I and II; and Sections 5–7 give the summaries of the key results in Papers I–III.

1 Pairwise Markov models

Hidden Markov models (HMM) have been called "one of the most successful statistical modeling ideas that have come up in the last fifty years" [1]. Indeed, the applications of such models in different fields have been so wide-ranging that I will not attempt to list them here. The appeal of HMM's for data researchers can be attributed to several factors. On the one hand, based on the overall Markovian structure of the HMM, several estimation methods have been developed that suit the needs of various types of statistical problems. Examples of such methods are the Baum-Welch algorithm (also known as the Expectation-Maximization or EM-algorithm), the forward-backward algorithm, the Viterbi algorithm, and so on. On the other hand, the observed process of the HMM (without the latent layer) is not generally Markovian and can have a highly complex and rich dependence structure. Therefore, in today's data-driven world HMM's have become increasingly relevant as a data inference tool in situations where the observed process cannot be assumed to follow the assumptions of classic statistics such as independence or even the Markovian structure.

Simply put, an HMM can be viewed as a Markov chain with some "added noise". More rigorously, we can consider a Markov chain $Y = \{Y_k\}$ taking values in a finite state space. For each $k$ the conditional distribution of the observation $X_k$ given $Y_k = i$ is determined by a density $f_i$, called the emission density. Requiring that, for each $k$, $X_k$ is conditionally independent of all other random variables given $Y_k$, the law of the observation process $X = \{X_k\}$ is now fully described by the distribution of $Y_1$, the transition matrix of $Y$ and the densities $f_i$. For some sample size $n$, $(X_1, \ldots, X_n)$ is the observed sample, while the sequence $(Y_1, \ldots, Y_n)$ is latent or in other words hidden from the observer, hence the term "hidden Markov model". Data inference is done based on the observed sample $(X_1, \ldots, X_n)$ only. As stated, the sample $(X_1, \ldots, X_n)$ does not generally follow an i.i.d. law nor is it a Markov chain, and can have a rather complex dependence structure. However, conditionally given $(Y_1, \ldots, Y_n)$ the sample elements $X_1, \ldots, X_n$ are independent of each other. Moreover, assuming that the hidden chain follows a stationary distribution $\pi(i)$, the distribution of each $X_k$ can be easily expressed as
$$P(X_k \in A) = \sum_i \pi(i) \int_A f_i(x)\, dx.$$
(Here we assumed that $X_k$ are continuous random variables, but in the discrete case the integral would simply be replaced by a sum.) Thus when $Y$ is stationary, the marginal distributions of the $X$-process are simply mixtures of the densities $f_i$ over the stationary distribution $\pi$.

It is also helpful to think of the HMM in terms of the following stochastic representation. For each state $i$ let $\xi_k(i)$ be random variables having density $f_i$, and assume that all $\xi_k(i)$ are independent of each other and independent of the hidden chain $Y$. Then $X_k$ can be defined through the stochastic equation
$$X_k = \xi_k(Y_k).$$


For comparison, we can consider a more general model
$$X_k = \alpha(Y_k) X_{k-1} + \xi_k(Y_k), \qquad (1)$$
where $\alpha(i)$ are constants. This type of model is called a "Markov switching model" or "autoregressive switching model" in the literature [1, 2, 3, 4]. Here we refer to this model as the "linear Markov switching model" owing to the linear link in (1). This model differs from the HMM in that it allows conditional dependence between $X_{k-1}$ and $X_k$ given $Y_k$, and it becomes an HMM in the special case where the $\alpha(i)$ are all zero. In both the HMM and the linear Markov switching model, the variables $\xi_k(i)$ can be considered as the "random noise" in the sense that they contain all the randomness of the $X$-process beyond the hidden chain $Y$. In particular, when the $\xi_k(i)$ are constants, the $X$-process simply becomes a (non-random) encoding of the discrete Markov chain $Y$.

To investigate the difference between the two models further, I have generated a sample from both of them. Figure 1a displays 200 simulated observations from an HMM with a two-state hidden chain. The transition matrix of the hidden Markov chain is taken to be symmetric: the probability of maintaining the same state is 0.95 and the probability of switching the state is 0.05. The emission densities are taken to be normal with both standard deviations equal to 1 and mean values for the hidden states 1 and 2 equal to 0 and 2, respectively. The hidden states 1 and 2 are respectively indicated by the light gray and darker gray background color. As can be seen from the figure, the observations from the HMM inherit their overall discrete step-wise structure from the hidden chain. Cutting out the dark gray areas would result in an i.i.d. standard normal sample, and similarly cutting out the lighter areas would leave an i.i.d. normal sample with mean 2.

Figure 1b presents 200 observations from a two-state linear Markov switching model. The hidden chain is generated just like in the HMM case, with probability 0.95 of maintaining the same state and 0.05 of switching the state. The random noise $\xi_k(i)$ is taken to be normal with mean 0 and standard deviation 1/4 for both states. The parameters $\alpha(i)$ are taken to be 1.01 and 0.5 for $i = 1$ and $i = 2$, respectively. (Here the model and its parameters are specified in a way that ensures the overall stability of the observations in a certain sense. We will touch more on the stability conditions of the linear Markov switching model later.) As we can see, the behavior of the linear switching model is very different from that of the HMM. When the hidden state is 1, the observations have a tendency to autoregressively move away from 0, either in the positive or negative direction. On the other hand, if the hidden state is 2, the observations start to move back towards 0.
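To make the two simulation setups concrete, the sketch below generates a hidden chain and then observations from both models. This is only an illustration of the stochastic representations above, assuming NumPy; the function and variable names are mine, and the parameter values are the ones described for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hidden_chain(n, trans, init):
    """Simulate a finite-state Markov chain Y with transition matrix `trans`."""
    y = np.empty(n, dtype=int)
    y[0] = rng.choice(len(init), p=init)
    for k in range(1, n):
        y[k] = rng.choice(trans.shape[1], p=trans[y[k - 1]])
    return y

trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
y = simulate_hidden_chain(200, trans, init=[0.5, 0.5])   # states coded 0 and 1

# HMM: X_k = xi_k(Y_k), with N(0,1) emissions for state 0 and N(2,1) for state 1
means = np.array([0.0, 2.0])
x_hmm = means[y] + rng.normal(size=200)

# Linear Markov switching model: X_k = alpha(Y_k) * X_{k-1} + xi_k(Y_k),
# with alpha = (1.01, 0.5) and normal noise with standard deviation 1/4 for both states
alpha = np.array([1.01, 0.5])
x_sw = np.empty(200)
x_sw[0] = rng.normal(scale=0.25)
for k in range(1, 200):
    x_sw[k] = alpha[y[k]] * x_sw[k - 1] + rng.normal(scale=0.25)
```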

As we have seen, two different types of Markovian latent variable models can exhibit very different behavior. However, these two models are only special cases of a much vaster class of latent variable models, namely the pairwise Markov models (PMM's). A PMM is simply any model such that the pairs of observations and hidden states constitute a Markov chain. If we follow the notation introduced above for the observation process $\{X_k\}$ and hidden process $\{Y_k\}$, then saying that the model is a PMM simply means that $\{(X_k, Y_k)\}$ is a Markov chain. A PMM thus retains the general Markovian structure of both the HMM and the linear Markov switching model. This is a very useful property, since it means that many of the estimation tools used for HMM's, such as the Viterbi algorithm or the EM-algorithm, can be implemented for this model as well.

[Figure 1: Simulations from an HMM (a) and a linear Markov switching model (b). Both panels plot $X_k$ against $k = 1, \ldots, 200$, with the hidden state $Y_k \in \{1, 2\}$ indicated.]

On the other hand, unlike the HMM and the linear Markov switching model, for a PMM the hidden process $\{Y_k\}$ is no longer necessarily a Markov chain. It can be shown, however, that conditionally given $\{X_k\}$, the process $\{Y_k\}$ is a non-homogeneous Markov chain, and vice versa.

We have adopted the term "pairwise Markov chain" from Pieczynski et al., who have used it in a series of papers to study such models [5, 6, 7, 8, 9]. We use this term to emphasize the much more general nature of this model when compared to a simple HMM. It should be noted, however, that this distinction is not always so clear in the scientific literature, and the term "hidden Markov model" is sometimes applied more generally than in the classic sense used here.

In the next section we will introduce the exact theoretical framework and notation that applies to all three papers, as well as some definitions commonly used in the study of general-state Markov chains.

2 Theoretical framework and notation

As mentioned, we consider a two-dimensional Markov chain
$$\{(X_k, Y_k)\}_{k\geq 1} = ((X_1, Y_1), (X_2, Y_2), \ldots).$$
The process $X = \{X_k\}_{k\geq 1}$ is called the observation process, and its elements take values in the observation space $\mathcal{X}$. We assume that $\mathcal{X}$ is Polish (separable and completely metrizable) and equipped with its Borel $\sigma$-field $\mathcal{B}(\mathcal{X})$. The process $Y = \{Y_k\}_{k\geq 1}$ is the hidden or latent process, and its elements take values in a finite state space $\mathcal{Y} = \{1, 2, \ldots, |\mathcal{Y}|\}$. Both $X$ and $Y$ are defined on the probability space $(\Omega, \mathcal{F}, P)$.

We denote $Z_k = (X_k, Y_k)$ for each $k$, $Z = \{Z_k\}_{k\geq 1}$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Thus the pairwise Markov chain $Z$ takes values in the product space $\mathcal{Z}$. We equip $\mathcal{Z}$ with the product topology $\tau \times 2^{\mathcal{Y}}$, where $\tau$ denotes the topology of $\mathcal{X}$ and $2^{\mathcal{Y}}$ denotes the discrete topology on $\mathcal{Y}$. Furthermore, $\mathcal{Z}$ is equipped with its Borel $\sigma$-field $\mathcal{B}(\mathcal{Z}) = \mathcal{B}(\mathcal{X}) \otimes 2^{\mathcal{Y}}$, which is the smallest $\sigma$-field containing sets of the form $A \times B$, where $A \in \mathcal{B}(\mathcal{X})$ and $B \in 2^{\mathcal{Y}}$.


Let now $\mu$ be a $\sigma$-finite measure on $\mathcal{B}(\mathcal{X})$, and let $c$ be the counting measure on $2^{\mathcal{Y}}$. We assume that the transition kernel of $Z$ admits a density $q(z|z')$ with respect to the measure $\mu \times c$. This means that the transition kernel of $Z$ can be expressed as
$$P(Z_2 \in C \mid Z_1 = z') = \int_C q(z|z')\, (\mu \times c)(dz), \qquad z' \in \mathcal{Z},\ C \in \mathcal{B}(\mathcal{Z}). \qquad (2)$$
Here, the mapping
$$q\colon \mathcal{Z}^2 \to [0, \infty), \qquad (z, z') \mapsto q(z|z')$$
is a measurable non-negative function such that for each $z' \in \mathcal{Z}$ the function $z \mapsto q(z|z')$ is a probability density function with respect to the product measure $\mu \times c$. Since every $C \in \mathcal{B}(\mathcal{Z})$ is of the form $C = \cup_{j\in\mathcal{Y}} A_j \times \{j\}$ for some $A_j \in \mathcal{B}(\mathcal{X})$, then taking $(x', i) = z'$, (2) can be rewritten as
$$P(Z_2 \in C \mid Z_1 = z') = \sum_{j\in\mathcal{Y}} P(X_2 \in A_j, Y_2 = j \mid X_1 = x', Y_1 = i) = \sum_{j\in\mathcal{Y}} \int_{A_j} q(x, j|x', i)\, \mu(dx).$$

We also assume that $Z_1$ has a density with respect to the product measure $\mu \times c$. Then, for every $n$, the random vector $Z_{1:n}$ has a density with respect to the measure $(\mu \times c)^n$. For every vector $(a_1, \ldots, a_n)$ we shall adopt the notation $a_{1:n}$. With a slight abuse of notation the letter $p$ will be used to denote the various joint and conditional densities. Thus $p(z_k) = p(x_k, y_k)$ is the density of $Z_k$ determined at $z_k = (x_k, y_k)$, $p(z_{1:n}) = p(z_1)\prod_{k=2}^n q(z_k|z_{k-1})$ is the density of $Z_{1:n}$ determined at $z_{1:n}$, $p(z_{2:n}|z_1) = \prod_{k=2}^n q(z_k|z_{k-1})$ stands for the conditional density, and so on. Sometimes it is convenient to use other symbols besides $x_k, y_k, z_k$ as the arguments of some density; in that case we indicate the corresponding probability law using the equality sign, for example
$$p(x_{2:n}, y_{2:n}|x_1 = x, y_1 = i) = q(x_2, y_2|x, i)\prod_{k=3}^n q(x_k, y_k|x_{k-1}, y_{k-1}), \qquad n \geq 3.$$
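As a small illustration of this notation, the factorization $p(z_{1:n}) = p(z_1)\prod_{k=2}^n q(z_k|z_{k-1})$ translates directly into code; the sketch below (a hypothetical helper, working on the log scale to avoid numerical underflow) assumes nothing beyond the factorization itself.

```python
def joint_log_density(z, log_p1, log_q):
    """log p(z_{1:n}) = log p(z_1) + sum_{k=2}^n log q(z_k | z_{k-1}),
    where z is a sequence of pairs z_k = (x_k, y_k) and the user supplies
    log_p1(z1) and log_q(z_next, z_prev) for the particular PMM."""
    total = log_p1(z[0])
    for k in range(1, len(z)):
        total += log_q(z[k], z[k - 1])
    return total
```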

The notation $P_z(\cdot)$ will represent the probability measure when the initial distribution of $Z$ is the Dirac measure on $z \in \mathcal{Z}$ (i.e. $P_z(A) = P(A|Z_1 = z)$). Likewise, if $T = g(Z)$ for some measurable mapping $g\colon \mathcal{Z}^{\infty} \to \mathbb{R}$, then $E_z[T]$ denotes the expectation of $T$ conditioned on $\{Z_1 = z\}$. See [10] for more details on the construction of the probability space, the above probability measures and the conditional expectation for general state Markov chains.

Classification of pairwise Markov models. Returning now to the HMM case, we can see that a PMM becomes an HMM under the following conditions: $p(y_2|x_1, y_1)$ does not depend on $x_1$, and $p(x_2|y_2, x_1, y_1)$ depends on neither $x_1$ nor $y_1$. Indeed, in that case, denoting
$$p_{ij} = p(y_2 = j|y_1 = i), \qquad f_j(x) = p(x_2 = x|y_2 = j),$$
the transition kernel density factorizes into
$$q(x, j|x', i) = p(x_2 = x|y_2 = j, x_1 = x', y_1 = i)\, p(y_2 = j|x_1 = x', y_1 = i) = p_{ij} f_j(x).$$

Here the density functions $f_j$ (with respect to $\mu$) are again the emission densities, as introduced in the first subsection, and $p_{ij}$ are the transition probabilities of the hidden Markov chain $Y$. The dependence structure of the HMM is represented in Figure 2a. The arrows in the figure represent the simplest possible scheme by which the HMM can be generated. We can imagine that we have simulated a pair $(X_k, Y_k)$ from our specified HMM, and we now need to simulate the next pair $(X_{k+1}, Y_{k+1})$. Since $Y$ is a Markov chain, $Y_{k+1}$ can be simulated based on the transition matrix $(p_{ij})$ and the value of $Y_k$ alone – this is represented in the graph by a single arrow from $Y_k$ to $Y_{k+1}$. Likewise, once we have simulated the value of $Y_{k+1}$, the value of $X_{k+1}$ can be simulated based on $Y_{k+1}$ only, which is represented by another arrow from $Y_{k+1}$ to $X_{k+1}$. Indeed, as we mentioned previously, the conditional distribution of $X_{k+1}$ given $Y_{k+1} = i$ has the density $f_i$.

If $p(y_2|x_1, y_1)$ does not depend on $x_1$, and $p(x_2|y_2, x_1, y_1)$ does not depend on $y_1$, then we call $Z$ a Markov switching model. Thus HMM's constitute a subclass of Markov switching models. In the case of a Markov switching model, denoting
$$f_j(x|x') = p(x_2 = x|y_2 = j, x_1 = x'),$$
the transition kernel density becomes
$$q(x, j|x', i) = p_{ij} f_j(x|x').$$

Again, it is not difficult to confirm that for the Markov switching model, just like in the case of the HMM, $Y$ is a homogeneous Markov chain with transition matrix $(p_{ij})$. The linear Markov switching model (1) from which we simulated observations earlier is a special case of this type of model with $f_j(x|x') = h_j(x - \alpha(j)x')$, where $h_j$ denote the densities of the random noise variables $\xi_k(j)$ and $\mu$ is the Lebesgue measure. In the case of the Markov switching model, it is no longer possible to simulate the variable $X_{k+1}$ based on the variable $Y_{k+1}$ only. Indeed, to simulate $X_{k+1}$, the value of the previous observation $X_k$ is also required. In particular, the conditional density of $X_{k+1}$ given $X_k = x'$ and $Y_{k+1} = j$ is given by $f_j(\cdot|x')$. This dynamic is represented in Figure 2b, where there is an additional arrow pointing from $X_k$ to $X_{k+1}$ when compared to the HMM case.

In the most general case of the PMM, however, even the simulation scheme of the Markov switching model may no longer apply. Figure 2c shows one of the possible ways in which $Y_{k+1}$ and $X_{k+1}$ can be simulated in this case. We can see that according to this scheme, $Y_{k+1}$ is simulated based on $X_k$ and $Y_k$. Then the next observation $X_{k+1}$ is simulated based on all three variables $Y_k$, $X_k$ and $Y_{k+1}$. An alternative approach would be to generate $X_{k+1}$ before $Y_{k+1}$, in which case the arrow between $X_{k+1}$ and $Y_{k+1}$ would be flipped. In this general case, $Y$ may no longer be a Markov chain. It is not difficult to see, however, that conditionally given $Y_{1:n}$, the vector $X_{1:n}$ is always a (generally non-homogeneous) Markov chain and vice versa.
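The general simulation scheme of Figure 2c is easy to phrase as code. The sketch below is only an illustration, not a construction used in the papers: `sample_y_next` and `sample_x_next` are hypothetical user-supplied sampling functions, and the concrete example afterwards is a made-up PMM in which the switching probability of the hidden chain depends on the current observation.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_pmm(n, sample_y_next, sample_x_next, z1):
    """Simulate a PMM following the scheme of Figure 2c:
    Y_{k+1} is drawn given (x_k, y_k), then X_{k+1} given (y_k, x_k, y_{k+1})."""
    x, y = [z1[0]], [z1[1]]
    for _ in range(n - 1):
        y_next = sample_y_next(x[-1], y[-1])
        x_next = sample_x_next(y[-1], x[-1], y_next)
        x.append(x_next)
        y.append(y_next)
    return np.array(x), np.array(y)

# Illustrative example: Y switches more easily when the observation is large,
# and X depends on both the previous observation and the new hidden state.
def sample_y_next(x, y):
    p_switch = 0.5 if abs(x) > 1 else 0.05
    return 1 - y if rng.random() < p_switch else y

def sample_x_next(y_prev, x, y_next):
    return 0.8 * x + y_next + rng.normal(scale=0.5)

x, y = simulate_pmm(200, sample_y_next, sample_x_next, z1=(0.0, 0))
```

This example is neither an HMM nor a Markov switching model, since $Y_{k+1}$ depends on $X_k$ and $X_{k+1}$ depends on $X_k$ as well.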

Harris chains. In all three papers we construct some set of vectors $B \subseteq \mathcal{Z}^r$, and for our proofs to work we need the Markov chain $Z$ to return to $B$ infinitely often a.s.

[Figure 2: Directed dependence graphs of different types of PMM's: (a) hidden Markov model (HMM), (b) Markov switching model, (c) pairwise Markov model (PMM).]

In other words, we want the particular constructed set $B$ to satisfy $P(Z \in B \text{ i.o.}) = 1$, where we denote
$$\{Z \in B \text{ i.o.}\} = \bigcap_{k=1}^{\infty} \bigcup_{l=k}^{\infty} \{Z_{l:l+r-1} \in B\}.$$

Of course, if $Z$ is ergodic in the sense of ergodic theory and $P(Z_{1:r} \in B) > 0$, then indeed $P(Z \in B \text{ i.o.}) = 1$ by Birkhoff's ergodic theorem. However, the notion of ergodicity in the case of general state space Markov chains is a rather abstract one, so we have relied on the theory of Harris recurrent Markov chains instead. The advantage of this theory is that it has a set of well-developed and powerful tools for deriving concrete yet general stability conditions for any specific model. We shall now introduce some key terms from the theory of Harris recurrent chains which will be used in the three papers.

For a much more comprehensive overview of Harris chains see [10].

The Markov chain $Z$ is called $\varphi$-irreducible for some $\sigma$-finite measure $\varphi$ on $\mathcal{B}(\mathcal{Z})$ if $\varphi(A) > 0$ implies $\sum_{k=2}^{\infty} P_z(Z_k \in A) > 0$ for all $z \in \mathcal{Z}$. If $Z$ is $\varphi$-irreducible, then there exists [10, Prop. 4.2.2] a maximal irreducibility measure $\psi$ in the sense that for any other irreducibility measure $\varphi$ the measure $\psi$ dominates $\varphi$, $\psi \succ \varphi$. The symbol $\psi$ will be reserved to denote the maximal irreducibility measure of $Z$. The chain $Z$ is called Harris recurrent when it is $\psi$-irreducible and $\psi(A) > 0$ implies $P_z(Z_k \in A \text{ i.o.}) = 1$ for all $z \in \mathcal{Z}$. Note that if $Z$ is Harris recurrent, then $Z$ returns infinitely often a.s. to any set $A \in \mathcal{B}(\mathcal{Z})$ satisfying $\psi(A) > 0$. However, our goal is to characterize the infinite recurrence of vector sets, not single-element sets. Indeed, there are some technical details that need to be worked out before one can deduce from Harris recurrence the recurrence of some vector set. This has essentially been done in the proof of [I, Prop. 2.2], which links the infinite recurrence of single-element sets to that of vector sets. Similarly, [III, Prop. 2.1] characterizes the $\psi$-irreducibility and Harris recurrence of the overlapping $r$-block Markov chain $(Z_{1:r}, Z_{2:r+1}, \ldots)$ through the $\psi$-irreducibility and Harris recurrence of $Z$ itself.


We have demonstrated in Figure 1a how the behavior of an HMM is largely governed by its discrete hidden chain $Y$. Thus, it is not surprising that any HMM with an irreducible hidden chain is $\psi$-irreducible and Harris recurrent. Here the maximal irreducibility measure $\psi$ is defined by
$$\psi(A \times \{i\}) = \mu(A \cap G_i), \qquad A \in \mathcal{B}(\mathcal{X}),\ i \in \mathcal{Y},$$
where we denote $G_i = \{x \in \mathcal{X} \mid f_i(x) > 0\}$. This is a rather trivial consequence of the theory of Harris recurrent Markov chains, but the formal proof is given in [I, Lem. A.2]. For more complex models proving Harris recurrence might be more difficult and the conditions need to be derived for each model separately.

For example, it can be shown that for the linear Markov switching model (1) with Gaussian noise, sufficient conditions for Harris recurrence are that $Y$ is irreducible and $\max_{i\in\mathcal{Y}} \sum_{j\in\mathcal{Y}} p_{ij}|\alpha(j)| < 1$ (see [I, Lem. 4.3]). Thus the model used for the simulations in Figure 1b is Harris recurrent, because in this case
$$\max_{i\in\{1,2\}} \sum_{j\in\{1,2\}} p_{ij}|\alpha(j)| = 0.95 \cdot 1.01 + 0.05 \cdot 0.5 = 0.9845.$$
The maximal irreducibility measure for model (1) with Gaussian noise is $\mu \times c$, where $\mu$ is the Lebesgue measure. Therefore, for the particular model in the example, $X$ will enter any interval infinitely many times with probability one.
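The criterion of [I, Lem. 4.3] is easy to check numerically; the following small sketch simply reproduces the computation above for the parameters of Figure 1b (NumPy assumed).

```python
import numpy as np

trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
alpha = np.array([1.01, 0.5])

# max_i sum_j p_ij * |alpha(j)| must be < 1 for the Harris recurrence condition
criterion = (trans @ np.abs(alpha)).max()
print(criterion)   # 0.9845
```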

In the next two sections we shall introduce the Viterbi classifier and all the related concepts that are used in Papers I and II. This content is not relevant to Paper III, which does not deal with Viterbi estimation.

3 Viterbi classifier

In many practical applications of an HMM or PMM, the goal of the data analysis is to estimate the hidden path $Y_{1:n}$ based on the observations $X_{1:n}(\omega) = x_{1:n}$. This is referred to as the segmentation problem. The most popular estimate is probably the path with maximum likelihood, defined by
$$v(x_{1:n}) = \arg\max_{y_{1:n}} p(y_{1:n}, x_{1:n}).$$
The mapping $v\colon \mathcal{X}^n \to \mathcal{Y}^n$ is called the Viterbi classifier and the estimate $v(X_{1:n})$ is referred to as the Viterbi path or alignment (also the maximum a posteriori path or alignment).

The Viterbi classifier maximizes the probability of estimating the whole hidden sequence correctly, that is,
$$P(Y_{1:n} = v(X_{1:n})) = \sup_g P(Y_{1:n} = g(X_{1:n})), \qquad (3)$$
where the supremum is taken over all measurable mappings of the form $g\colon \mathcal{X}^n \to \mathcal{Y}^n$. Indeed, for any same-length vectors $a$ and $b$, let $I(a = b)$ denote 1 if $a = b$ and 0 otherwise. For any classifier $g$ we have
$$P(Y_{1:n} = v(X_{1:n})) = \int_{\mathcal{X}^n} \sum_{y_{1:n}} I(y_{1:n} = v(x_{1:n}))\, p(y_{1:n}, x_{1:n})\, \mu^n(dx_{1:n}) \geq \int_{\mathcal{X}^n} \sum_{y_{1:n}} I(y_{1:n} = g(x_{1:n}))\, p(y_{1:n}, x_{1:n})\, \mu^n(dx_{1:n}) = P(Y_{1:n} = g(X_{1:n})).$$


Viterbi algorithm. The notion of the Viterbi path would not have any practical relevance if there were no way to calculate it from the observed data. Note that the number of possible hidden paths is $|\mathcal{Y}|^n$, a quantity that grows exponentially in $n$. Therefore it is not possible to apply a brute force algorithm to directly calculate the Viterbi path for any reasonably sized sample. Luckily, there is a well-known dynamic programming algorithm – called the Viterbi algorithm – which calculates the Viterbi path in linear time with respect to $n$. This algorithm utilizes the Markov property in a rather simple and straightforward way to obtain the path with maximum likelihood. In its standard form the algorithm moves from the beginning of the observed sequence to the end, calculating for each state $j$ and each position $k$ the maximum possible likelihood up to $k$ while also remembering the state that leads to the maximum likelihood value. Then the algorithm backtracks from the end to the beginning to construct the Viterbi path based on the memorized states.

More formally, at $k = 1$ the algorithm calculates the values $\delta_1(j) = p(x_1, y_1 = j)$, and then for each $k = 2, \ldots, n$ and $j \in \mathcal{Y}$ it calculates
$$\delta_k(j) = \max_{i\in\mathcal{Y}} \delta_{k-1}(i)\, q(x_k, j|x_{k-1}, i), \qquad (4)$$
$$\gamma_k(j) = \arg\max_{i\in\mathcal{Y}} \delta_{k-1}(i)\, q(x_k, j|x_{k-1}, i). \qquad (5)$$
Thus $\delta_k(j) = \delta_{k-1}(\gamma_k(j))\, q(x_k, j|x_{k-1}, \gamma_k(j))$. Once the final $\delta_n(j)$ and $\gamma_n(j)$ have been calculated, the maximum likelihood is given by $\max_{j\in\mathcal{Y}} \delta_n(j)$ and the Viterbi path $(v_1, \ldots, v_n)$ can be constructed by backtracking as follows:
$$v_n = \arg\max_{j\in\mathcal{Y}} \delta_n(j), \quad v_{n-1} = \gamma_n(v_n), \quad v_{n-2} = \gamma_{n-1}(v_{n-1}), \quad \ldots, \quad v_1 = \gamma_2(v_2).$$

Note that the algorithm only relies on the Markov property of $Z$, and therefore can be utilized with any PMM regardless of the specific model.

There might be many paths which achieve the maximal likelihood, so the Viterbi path is not necessarily unique. Note, however, that if each application of the arg max function is based on some fixed ordering on $\mathcal{Y}$, then the Viterbi algorithm chooses the colexicographically maximal path based on the same ordering. That is, if there are several paths achieving the maximum likelihood, the algorithm will choose the colexicographically maximal one. In the above algorithm, the procedure runs from the first index to the last, and then returns to the first one. Alternatively, one can reverse the algorithm, so that the procedure starts and ends with the last index $n$. In that case the natural tie-breaking scheme is no longer colexicographic, but lexicographic. The reversed algorithm is essentially symmetric to the one above, but its precise description is given for the sake of completeness in Appendix A.
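A direct translation of the recursions (4)-(5) and the backtracking step into code might look as follows. This is a minimal sketch rather than the implementation used in the papers: it assumes NumPy, works on the log scale to avoid numerical underflow, and takes the model as two user-supplied functions, `log_p1(x1, j)` for $\log p(x_1, y_1 = j)$ and `log_q(xk, j, xprev, i)` for $\log q(x_k, j|x_{k-1}, i)$. Ties are broken towards the smallest state index, which corresponds to one fixed ordering of $\mathcal{Y}$.

```python
import numpy as np

def viterbi_pmm(x, n_states, log_p1, log_q):
    """Viterbi path for a PMM with observation sequence x[0..n-1]."""
    n = len(x)
    delta = np.empty((n, n_states))              # delta[k, j]: max log-likelihood up to k, ending in j
    gamma = np.zeros((n, n_states), dtype=int)   # argmax pointers for backtracking
    delta[0] = [log_p1(x[0], j) for j in range(n_states)]
    for k in range(1, n):
        for j in range(n_states):
            scores = [delta[k - 1, i] + log_q(x[k], j, x[k - 1], i) for i in range(n_states)]
            gamma[k, j] = int(np.argmax(scores))
            delta[k, j] = scores[gamma[k, j]]
    path = np.empty(n, dtype=int)                # backtracking: v_n, v_{n-1} = gamma_n(v_n), ...
    path[-1] = int(np.argmax(delta[-1]))
    for k in range(n - 1, 0, -1):
        path[k - 1] = gamma[k, path[k]]
    return path
```

For an HMM, `log_q(xk, j, xprev, i)` reduces to $\log p_{ij} + \log f_j(x_k)$, so the same routine covers the classical case as well.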

Decision theoretic analysis. To gain a better understanding of the Viterbi classifier, it is also useful to investigate it from the viewpoint of decision theory. In that framework a loss function is assigned which penalizes the estimate based on how badly it misses the correct value of the estimand. The risk function is then defined as the expected value of the loss function, and the best classifier with respect to the specified loss function is the one that minimizes its risk. Thus, the choice of the loss function determines the optimal estimate in terms of its risk. For the Viterbi classifier, the loss of classifying the true path $y_{1:n}$ as $y'_{1:n}$ is simply defined as

$$L_v(y_{1:n}, y'_{1:n}) = I(y_{1:n} \neq y'_{1:n}),$$
where $I(a \neq b)$ denotes 1 if $a \neq b$ and 0 otherwise. In other words, the loss is 0 only if the estimate is entirely correct and 1 otherwise. Then, equivalently to (3), the Viterbi classifier achieves the smallest risk over all classifiers:
$$E L_v(Y_{1:n}, v(X_{1:n})) = \inf_g E L_v(Y_{1:n}, g(X_{1:n})) = 1 - \sup_g P(Y_{1:n} = g(X_{1:n})).$$

The loss function $L_v$ could be criticized for being overly absolutist. For example, an estimate which differs from the true path in only one position but is correct in all the others would be seen as a very good one by anyone, but for the loss function $L_v$ it is as bad as getting all positions wrong. From that perspective, a loss function that better conforms to practical reality is the pointwise loss defined by
$$L_p(y_{1:n}, y'_{1:n}) = \frac{1}{n}\sum_{k=1}^n I(y_k \neq y'_k). \qquad (6)$$
This function simply measures the proportion of misclassified positions, assigning loss 1 only when all positions are misclassified and loss 0 if none are.

The classifier that has the minimal expected pointwise loss (i.e. pointwise risk) is called the pointwise maximum a posteriori (PMAP) classifier. The PMAP classifier determines each position of the path individually, so that at each position the probability of having the correct state is maximal. However, because the PMAP classifier is only concerned with each position locally, it is liable to produce path estimates with very small overall probability, or indeed with zero probability. In terms of their loss functions, the Viterbi classifier is the polar opposite of the PMAP classifier: the former is concerned only with the whole path globally while the latter is concerned only with each state locally.

It is possible to compromise between these two estimators by maximizing the probabilities of correctly classifying all pairs, all triplets, etc. See [11, 12] for a description of the dynamic programming algorithms for different types of classifiers and their risk-based analysis.
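For the HMM case the PMAP classifier can be computed with the classical forward-backward (smoothing) recursions. The sketch below is a hedged illustration for comparing the two classifiers empirically, not an algorithm from the papers; it assumes NumPy, an HMM given by a transition matrix `trans`, an initial distribution `init` and a matrix of emission log-densities, and it also evaluates the pointwise loss (6).

```python
import numpy as np

def pmap_hmm(log_emis, trans, init):
    """PMAP path for an HMM: argmax_j P(Y_k = j | X_{1:n}) at every position k.
    log_emis[k, j] = log f_j(x_k); trans[i, j] = p_ij (NumPy array); init = distribution of Y_1."""
    n, m = log_emis.shape
    emis = np.exp(log_emis - log_emis.max(axis=1, keepdims=True))  # per-position scaling
    alpha = np.empty((n, m))
    beta = np.empty((n, m))
    alpha[0] = init * emis[0]
    alpha[0] /= alpha[0].sum()                   # normalise every step to avoid underflow
    for k in range(1, n):
        alpha[k] = (alpha[k - 1] @ trans) * emis[k]
        alpha[k] /= alpha[k].sum()
    beta[-1] = 1.0
    for k in range(n - 2, -1, -1):
        beta[k] = trans @ (emis[k + 1] * beta[k + 1])
        beta[k] /= beta[k].sum()
    post = alpha * beta                          # proportional to the smoothing probabilities
    post /= post.sum(axis=1, keepdims=True)
    return post.argmax(axis=1)

def pointwise_loss(y_true, y_est):
    """Pointwise loss (6): the proportion of misclassified positions."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_est)))
```

On a simulated sample one can then compare `pointwise_loss(y, viterbi_path)` with `pointwise_loss(y, pmap_path)`; by construction the latter has the smaller expected value.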

Despite their potential drawbacks, the Viterbi and PMAP classifiers remain the overwhelming favorites for estimating the hidden path among data analysts.

This popularity can largely be attributed to the simplicity, intuitive appeal and ease of implementation of both estimators.

4 Infinite Viterbi path

To get a better understanding of the Viterbi classifier, it is useful to study its behavior when the sample size $n$ goes to infinity. For example, in the previous section we criticized its loss function for assigning constant loss regardless of the number of misclassified states. However, this does not necessarily imply that the Viterbi classifier is bad at estimating states at single positions. In reality, its behavior will vary from model to model, and in some instances its pointwise risk may be close to the optimal one achieved by the PMAP classifier. To investigate this further, we would like to study the limit

$$\lim_n L_p(Y_{1:n}, v(X_{1:n})), \qquad (7)$$
where $L_p$ is the pointwise loss function defined in (6). If such a limit exists and is a constant, it quantifies the overall pointwise misclassification rate of the Viterbi classifier. This rate could then be compared to the analogous rate of the PMAP classifier – again, assuming that it exists – and would thereby give a sense of how far the Viterbi classifier is from the optimal PMAP classifier in terms of its ability to correctly classify states at individual positions. The limit (7) depends on the specific model, but could be estimated through simulations for each model. It turns out that such a limit does exist a.s. under general conditions (in particular, it is implied by [II, Th. 4]), but it takes quite a lot of preparatory work to arrive at that point.

Indeed, the asymptotic study of the Viterbi path is non-trivial due to the fact that adding a single observation to our sample can theoretically change the path estimate at any position. More formally, it is not necessarily the case that $v(x_{1:n})$ coincides with the first $n$ elements of $v(x_{1:n+1})$. Intuitively, adding a single element to the end of our observation sequence should not affect the front part of our path estimate in any significant way, and if it does, this would generally be viewed as pathological behavior of the model. Fortunately, in practice such pathological behavior usually does not occur, and the front part of the Viterbi path stabilizes rather quickly as $n$ grows. To illustrate this phenomenon, I have simulated 50 observations from an HMM. The transition matrix for the hidden Markov chain $Y$ was taken to be symmetric, with probability 0.6 of maintaining the same state and probability 0.4 of switching the state. The emission densities were taken to be normal with both standard deviations equal to 1 and mean values for the states 1 and 2 equal to 0 and 1, respectively. The observations from the HMM along with the hidden states are displayed in Figure 3a. Figure 3b displays the corresponding Viterbi paths $v(x_{1:n})$ over $n = 2, \ldots, 50$. We can see that while there are some fluctuations of the Viterbi estimates at some positions as $n$ increases, those are all localized to the end part of the Viterbi paths. The remaining front part quickly stabilizes into a fixed pattern.
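The stabilization seen in Figure 3b is easy to reproduce numerically. The sketch below is only an illustration under the stated parameter values: it reuses the `viterbi_pmm` sketch from Section 3 and records, for consecutive sample sizes, how long a prefix the paths $v(x_{1:n})$ and $v(x_{1:n+1})$ share.

```python
import numpy as np

rng = np.random.default_rng(1)
n_max = 50
trans = np.array([[0.6, 0.4], [0.4, 0.6]])
means = np.array([0.0, 1.0])

# simulate the HMM of Figure 3a (states coded 0 and 1)
y = [rng.choice(2, p=[0.5, 0.5])]
for _ in range(1, n_max):
    y.append(rng.choice(2, p=trans[y[-1]]))
x = means[np.array(y)] + rng.normal(size=n_max)

# log p(x_1, y_1 = j) and log q(x_k, j | x_{k-1}, i); additive constants are omitted,
# which does not change the argmax over same-length paths
log_p1 = lambda x1, j: np.log(0.5) - 0.5 * (x1 - means[j]) ** 2
log_q = lambda xk, j, xp, i: np.log(trans[i, j]) - 0.5 * (xk - means[j]) ** 2

def common_prefix(a, b):
    """Length of the longest common prefix of two state sequences."""
    m = min(len(a), len(b))
    diff = np.nonzero(a[:m] != b[:m])[0]
    return m if diff.size == 0 else int(diff[0])

# viterbi_pmm as in the sketch of Section 3
paths = [viterbi_pmm(x[:n], 2, log_p1, log_q) for n in range(2, n_max + 1)]
prefixes = [common_prefix(paths[i], paths[i + 1]) for i in range(len(paths) - 1)]
print(prefixes)   # the shared prefix typically grows quickly: the front of the path stabilizes
```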

Thus there is empirical evidence to support the idea that for some models the first $t$ elements of the Viterbi path stay the same for any sufficiently large sample size $n$. If this is true for any $t$, then the infinite Viterbi path $v_{1:\infty}$ can be defined as follows. In the definition below, $v(x_{1:n})_{1:t}$ denotes the first $t$ elements of the $n$-element vector $v(x_{1:n})$.

Definition 1. Let $x_{1:\infty}$ be a realization of $X$. The sequence $v_{1:\infty} \in \mathcal{Y}^{\infty}$ is called the infinite Viterbi path of $x_{1:\infty}$ if for any $t \geq 1$ there exists $m(t) \geq t$ such that
$$v(x_{1:n})_{1:t} = v_{1:t}, \qquad \forall n \geq m(t).$$

The goal of Paper I is to prove that under general conditions the infinite Viterbi path exists for almost every realization of $X$. Conversely, there are models for which no such path exists for almost any realization of $X$.

[Figure 3: Simulations from an HMM (a) and the corresponding Viterbi paths $v(x_{1:n})$ for increasing $n$ (b).]

Below is a simple example of such a 2-state HMM.

Example 1. Consider a two-state HMM with equal emission densities, $f_1 = f_2$. Let the transition matrix of the hidden chain be
$$\begin{pmatrix} \frac{1}{10} & \frac{9}{10} \\ \frac{8}{10} & \frac{2}{10} \end{pmatrix},$$
with rows and columns indexed by the states 1 and 2.

Note that because the emission densities are equal, $X$ and $Y$ are independent. Thus $p(x_{1:n}, y_{1:n}) = p(x_{1:n})p(y_{1:n})$ and the Viterbi path is the one that maximizes the probability $p(y_{1:n})$. Let the initial distribution of the hidden chain be $(\frac{49}{100}, \frac{51}{100})$, so that there is a slightly higher probability that the chain starts in state 2 rather than state 1. Observe now that the Viterbi path for any sample size $n$ will be either $1212\ldots$ or $2121\ldots$ Indeed, the probability of switching the state is always larger than that of maintaining it, so the Viterbi path must always alternate between 1 and 2. We can express the likelihoods of both possible paths for $n \geq 3$ as

$$p(y_{1:n} = 1212\ldots) = \begin{cases} \frac{49}{100}\cdot\frac{9}{10}\cdot\left(\frac{8}{10}\cdot\frac{9}{10}\right)^{(n-2)/2}, & \text{if } n \text{ is even},\\[4pt] \frac{49}{100}\cdot\left(\frac{8}{10}\cdot\frac{9}{10}\right)^{(n-1)/2}, & \text{if } n \text{ is odd}, \end{cases}$$
and
$$p(y_{1:n} = 2121\ldots) = \begin{cases} \frac{51}{100}\cdot\frac{8}{10}\cdot\left(\frac{8}{10}\cdot\frac{9}{10}\right)^{(n-2)/2}, & \text{if } n \text{ is even},\\[4pt] \frac{51}{100}\cdot\left(\frac{8}{10}\cdot\frac{9}{10}\right)^{(n-1)/2}, & \text{if } n \text{ is odd}. \end{cases}$$

Therefore, because $\frac{49}{100}\cdot\frac{9}{10} > \frac{51}{100}\cdot\frac{8}{10}$, the Viterbi path will be $1212\ldots$ for even $n$ and $2121\ldots$ for odd $n$. This shows that there is no infinite Viterbi path for any realization of $X$.
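The alternation in Example 1 can be verified directly by comparing the two displayed likelihoods for increasing $n$; the following small sketch just re-implements the formulas with exact fractions.

```python
from fractions import Fraction as F

def alternating_path_lik(init, trans_a, trans_b, n):
    """Likelihood of an alternating path: init * trans_a * trans_b * trans_a * ... (n-1 factors)."""
    p = init
    for k in range(n - 1):
        p *= trans_a if k % 2 == 0 else trans_b
    return p

for n in range(3, 9):
    p12 = alternating_path_lik(F(49, 100), F(9, 10), F(8, 10), n)   # path 1212...
    p21 = alternating_path_lik(F(51, 100), F(8, 10), F(9, 10), n)   # path 2121...
    print(n, "1212..." if p12 > p21 else "2121...")
# prints 1212... for even n and 2121... for odd n, so no prefix of the Viterbi path ever settles
```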


Note that if we had used the stationary distribution $\pi(i)$ as the initial distribution, the example would not have worked quite as well, because by the equality $\pi(1)\cdot\frac{9}{10} = \pi(2)\cdot\frac{8}{10}$ the likelihoods of the paths $1212\ldots$ and $2121\ldots$ would have been equal for even $n$. Further, because $\pi(2) > \pi(1)$, for odd $n$ the Viterbi path would always be $2121\ldots$, so in that case the existence of the infinite Viterbi path would have depended on the tie-breaking scheme of the Viterbi classifier. In particular, under the colexicographic scheme induced by the ordering 21 the infinite Viterbi path would not exist, but under the reverse ordering 12 it would.

In the above example $X$ was independent of $Y$, which is clearly not a realistic scenario for data analysis. Below is a different HMM example where $X$ does depend on $Y$ and, furthermore, the initial distribution can be chosen to be stationary regardless of the tie-breaking scheme. In this example the hidden chain is also irreducible and aperiodic. It is known that any HMM with irreducible, stationary and aperiodic hidden chain is ergodic (see e.g. [13, 14, 15]), hence this example demonstrates how the infinite Viterbi path may fail to exist even for models with very stable probabilistic behavior. This in turn further underlines the need for a special theory for dealing with the long-run behavior of the Viterbi classifier.

Example 2. Consider a 4-state HMM with observation space $\mathbb{R}$ and the following transition matrix for the hidden chain (rows and columns indexed by the states 1, 2, 3, 4):
$$\begin{pmatrix} \frac{3}{4} & 0 & \frac{1}{4} & 0\\ 0 & \frac{3}{4} & 0 & \frac{1}{4}\\ \frac{1}{2} & \frac{1}{2} & 0 & 0\\ \frac{1}{2} & \frac{1}{2} & 0 & 0 \end{pmatrix}.$$

Take the emission densities as follows: $f_1$ and $f_2$ are both uniform on the interval $[0,1]$, $f_3$ is uniform on the interval $[0,\frac{1}{4}]$ and $f_4$ is uniform on the interval $[\frac{3}{4},1]$. Hence $f_1 = f_2 = I_{[0,1]}$, $f_3 = 4\cdot I_{[0,\frac{1}{4}]}$ and $f_4 = 4\cdot I_{[\frac{3}{4},1]}$, where $I_A$ denotes the indicator function of the set $A$. For the sake of elegance, we set the initial distribution of $Y$ to be the stationary distribution, which can be calculated to be $(\frac{4}{10}, \frac{4}{10}, \frac{1}{10}, \frac{1}{10})$. Hence the hidden chain will spend most of the time in the state space $\{1,2\}$, but occasionally it will make a detour to $\{3,4\}$.

Note that moving from state 1 to state 2 is only possible through state 3, and moving from state 2 to state 1 is only possible through state 4. Also note that the Viterbi path will never go through states 3 and 4, because staying in either state 1 or state 2 always has a greater likelihood. Indeed, for all $x_{1:3} \in [0,1]^3$ and all $y_{1:3} \in \{(1,1,1),(2,2,2)\}$ we have $p(x_{2:3}, y_{2:3}|x_1, y_1) = \frac{9}{16}$, while on the other hand for all $x_{1:3} \in \mathbb{R}^3$ and for all $y_{1:3} \in \mathcal{Y}\times\{3,4\}\times\mathcal{Y}$ we have $p(x_{2:3}, y_{2:3}|x_1, y_1) \leq 4\cdot\frac{1}{4}\cdot\frac{1}{2} = \frac{1}{2} < \frac{9}{16}$.

It is possible, however, for the last element of the Viterbi path to enter the state space $\{3,4\}$. Indeed, note that the single-step likelihoods for transitioning from states 1 and 2 can be expressed as
$$p_{1j}f_j(x) = \begin{cases} 1, & \text{if } j = 3 \text{ and } x \in [0,\frac{1}{4}],\\ \frac{3}{4}, & \text{if } j = 1 \text{ and } x \in [0,1],\\ 0, & \text{otherwise}, \end{cases}$$
and
$$p_{2j}f_j(x) = \begin{cases} 1, & \text{if } j = 4 \text{ and } x \in [\frac{3}{4},1],\\ \frac{3}{4}, & \text{if } j = 2 \text{ and } x \in [0,1],\\ 0, & \text{otherwise}, \end{cases}$$
which implies that the last element of the Viterbi path $v(x_{1:n})$ will be 3 if $x_n \in [0,\frac{1}{4}]$, and 4 if $x_n \in [\frac{3}{4},1]$.

In conclusion, assuming for the sake of concreteness a colexicographic ordering scheme based on the ordering $12\ldots$, the whole Viterbi path can be expressed as
$$v(x_{1:n}) = \begin{cases} 11\ldots13, & \text{if } x_n \in [0,\frac{1}{4}],\\ 22\ldots24, & \text{if } x_n \in [\frac{3}{4},1],\\ 11\ldots11, & \text{otherwise}. \end{cases}$$

Because almost every observation sequence $x_{1:\infty}$ enters the intervals $[0,\frac{1}{4}]$ and $[\frac{3}{4},1]$ infinitely many times, it follows that for any fixed position $k$ and increasing $n = k, k+1, \ldots$ the $k$th element of the Viterbi path $v(x_{1:n})$ will alternate between 1 and 2 infinitely often. Thus the infinite Viterbi path does not exist for almost any realization of the observation process.

Nodes and barriers. The above examples show that the infinite Viterbi path may not exist for every model. We shall now turn our attention to the other direction and try to understand when it does exist. For every $n \geq 2$ and $i, j \in \mathcal{Y}$ denote

$$p_{ij}(x_{1:n}) = \max_{y_{1:n}:\, y_1 = i,\, y_n = j} p(x_{2:n}, y_{2:n}|x_1, y_1). \qquad (8)$$
Next, let us fix the observation sequence $x_{1:\infty}$ and denote for all $k \geq 1$ and $i \in \mathcal{Y}$
$$\delta_k(i) = \max_{y_{1:k}:\, y_k = i} p(x_{1:k}, y_{1:k}).$$

This notation is consistent with the notation used in the description of the Viterbi algorithm in (4). According to the definition, the existence of the infinite Viterbi path means that for every time $t$ there exists a time $m \geq t$ such that the first $t$ elements of $v(x_{1:n})$ are fixed as soon as $n \geq m$. The following is a sufficient condition for $m$ to be such a time: for every two states $j, s \in \mathcal{Y}$ and for some $i \in \mathcal{Y}$,

$$\delta_t(i)\, p_{ij}(x_{t:m}) \geq \delta_t(s)\, p_{sj}(x_{t:m}). \qquad (9)$$
Indeed, there might be several states besides $i$ satisfying (9), but the ties can always be broken in favor of the Viterbi path passing through state $i$ at position $t$, so that
$$v(x_{1:n})_{1:t} = \arg\max_{y_{1:t}:\, y_t = i} p(x_{1:t}, y_{1:t}), \qquad \forall n \geq m. \qquad (10)$$
In other words, the ties can always be broken so that the first $t$ elements of the Viterbi path remain constant for every sample size $n \geq m$. However, the tie-breaking scheme that achieves this is not generally (co)lexicographic¹ – the natural tie-breaking scheme of the Viterbi algorithm. Therefore it is more practical to consider the following slightly strengthened version of the above condition: the inequality (9) is strict for any $j$ and any $s \neq i$ for which the left side of the inequality is positive. This latter condition ensures that (10) always holds under any (co)lexicographic ordering scheme.

These observations are combined into the following definition.

Definition 2. Let $x_{1:m}$ be a vector of observations. If the inequalities (9) hold for any pair of states $j$ and $s$, then the time $t$ is called an $i$-node of order $r = m - t$. The time $t$ is called a strong $i$-node of order $r$ if it is an $i$-node of order $r$ and the inequality (9) is strict for any $j$ and $s \neq i$ for which the left side of the inequality is positive. We call $t$ a node of order $r$ if for some $i$ it is an $i$-node of order $r$.

Suppose now that there exists an infinite sequence of $i$-nodes $u_1 < u_2 < \cdots$. We call two nodes $u_{k-1}$ and $u_k$ separated if $u_k \geq u_{k-1} + r$. If the nodes $u_{k-1}$ and $u_k$ are not separated or not strong, then it might not be possible to break the ties in favor of $i$ at both nodes – see [II, Ex. 4]. However, since from an unseparated sequence of nodes it is always possible to pick a separated subsequence, there is no loss of generality in assuming that $u_1 < u_2 < \cdots$ are all separated, and in that case the infinite Viterbi path can be constructed piecewise as follows. Take

$$v_{1:u_1} = \arg\max_{y_{1:u_1}:\, y_{u_1} = i} p(x_{1:u_1}, y_{1:u_1})$$
and for all $k \geq 2$ take
$$v_{u_{k-1}:u_k} = \arg\max_{y_{u_{k-1}:u_k}:\, y_{u_{k-1}} = y_{u_k} = i} p(x_{u_{k-1}+1:u_k}, y_{u_{k-1}+1:u_k}|x_{u_{k-1}}, y_{u_{k-1}}).$$
Denote now $u(n) = \max\{u_k \leq n - r \mid k \geq 1\}$ and define the Viterbi path up to sample size $n$ as
$$\Big(v_{1:u(n)},\ \arg\max_{y_{u(n)+1:n}} p(x_{u(n)+1:n}, y_{u(n)+1:n}|x_{u(n)}, y_{u(n)} = i)\Big). \qquad (11)$$
By the assumption that the nodes $u_1 < u_2 < \cdots$ are all separated and by the definition of a node, this path is well-defined as the one with maximal likelihood. Since for any $n$ the first $u(n)$ elements of the Viterbi path are $v_{1:u(n)}$, and since $\lim_n u(n) = \infty$, it follows immediately that $v_{1:\infty}$ is the infinite Viterbi path of $x_{1:\infty}$.

If the nodes $u_1 < u_2 < \cdots$ are not strong, then the path (11) will not necessarily be (co)lexicographically first among all the paths with maximal likelihood. Therefore the piecewise construction will not in general align with the natural ordering of the Viterbi algorithm. On the other hand, if the nodes $u_1 < u_2 < \cdots$ are all strong (not necessarily separated) and all arg max functions are based on a (co)lexicographic ordering scheme, then it can be easily verified that (11) is the (co)lexicographically first one among all paths with maximal likelihood based on the same ordering scheme. For that reason we would like to work with strong nodes only. Fortunately, the requirement of strong nodes instead of simply nodes does not turn out to be restrictive. Indeed, this should not be surprising, considering that the difference between them lies merely in the strictness of the inequality (9).

¹ Here and henceforth we use the adjective "(co)lexicographic" for an ordering which is either lexicographic or colexicographic.

Whether a time $t$ is a node of order $r$ or not depends in general on the sequence $x_{1:t+r}$. Sometimes, however, there is some small block of observations that guarantees the existence of a node regardless of the other observations.

The following example, which is based on [I, Ex. 3], illustrates this.

Example 3. Suppose that there exists a state $i \in \mathcal{Y}$ such that for any triplet $u, j, s \in \mathcal{Y}$
$$q(x_t, i|x_{t-1}, u)\, q(x_{t+1}, j|x_t, i) \geq q(x_t, s|x_{t-1}, u)\, q(x_{t+1}, j|x_t, s). \qquad (12)$$
Then for all $j, s \in \mathcal{Y}$
$$\delta_t(i)\, q(x_{t+1}, j|x_t, i) = \max_u \delta_{t-1}(u)\, q(x_t, i|x_{t-1}, u)\, q(x_{t+1}, j|x_t, i) \geq \max_u \delta_{t-1}(u)\, q(x_t, s|x_{t-1}, u)\, q(x_{t+1}, j|x_t, s) = \delta_t(s)\, q(x_{t+1}, j|x_t, s),$$

and so $t$ is an $i$-node of order 1. Whether (12) holds or not depends on the triplet $(x_{t-1}, x_t, x_{t+1})$. In the case of an HMM, (12) is equivalent to

$$p_{ui}f_i(x_t)\cdot p_{ij} \geq p_{us}f_s(x_t)\cdot p_{sj}. \qquad (13)$$
For a more concrete example, consider a 2-state HMM with both emission densities $f_1$ and $f_2$ being continuous densities which are positive on the interval $[0,1]$ and zero everywhere else. Also, let the transition matrix of the hidden chain be symmetric, with the probability of maintaining the state equal to $p \in (\frac{1}{2}, 1)$ and the probability of switching the state equal to $1-p$. Taking for the sake of concreteness $i = 1$, the inequalities (13) hold for all $u, s, j \in \{1,2\}$ and some $x = x_t \in [0,1]$ if and only if

$$\max_{x\in[0,1]} \frac{f_1(x)}{f_2(x)} \geq \frac{p^2}{(1-p)^2}. \qquad (14)$$

The inequality (14) imposes a rather strict requirement on the maximal ratio of the emission densities. For example, if $p = \frac{9}{10}$, it requires that this ratio be at least 81, which is quite extreme. It is therefore evident that the inequalities (12) do not necessarily yield general or useful conditions for specific models.
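Conditions (13) and (14) can also be checked numerically for a concrete model. The sketch below is purely illustrative: it uses made-up truncated-exponential emission densities on $[0,1]$ (mirror images of each other) and a grid search to test whether some single observation already creates a 1-node of order 1 in the sense of (13) with $i = 1$.

```python
import numpy as np

def single_obs_creates_node(f, trans, grid):
    """True if some x on the grid satisfies (13) for i = 0 and all u, s, j
    (states coded 0 and 1), i.e. a single observation yields a node of order 1."""
    for x in grid:
        if all(trans[u, 0] * f[0](x) * trans[0, j] >= trans[u, s] * f[s](x) * trans[s, j]
               for u in range(2) for s in range(2) for j in range(2)):
            return True
    return False

p, a = 0.9, 4.5
trans = np.array([[p, 1 - p], [1 - p, p]])
# truncated exponential density on [0,1] and its mirror image
f = [lambda x: a * np.exp(a * x) / (np.exp(a) - 1),
     lambda x: a * np.exp(a * (1 - x)) / (np.exp(a) - 1)]

grid = np.linspace(0.0, 1.0, 1001)
print(p ** 2 / (1 - p) ** 2)                       # 81.0 -- the threshold in (14)
print(single_obs_creates_node(f, trans, grid))     # True, since max f1/f2 = e^a ≈ 90 exceeds 81
```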

While using a sequence of three observations to obtain a node might not be the most fruitful approach, the general concept of a sequence of observations generating a node is still a useful one. This concept is captured in the following definition.


Definition 3. Given $i \in \mathcal{Y}$, a vector $b_{1:M}$ is called a (strong) $i$-barrier of order $r$ and length $M$ if for any $x_{1:\infty} \in \mathcal{X}^{\infty}$ and $m \geq M$ satisfying $x_{m-M+1:m} = b_{1:M}$, the time $m - r$ is a (strong) $i$-node of order $r$.

Hence, if (12) holds, then the triplet $(x_{t-1}, x_t, x_{t+1})$ is an $i$-barrier of order 1 and length 3. The advantage of working with the concept of barriers rather than simply nodes is that it provides a straightforward mathematical avenue for ensuring the existence of infinitely many nodes. Indeed, if the observation sequence contains infinitely many $i$-barriers of order $r$, then there must exist an infinite sequence of $i$-nodes of order $r$, and so the infinite Viterbi path must exist by the piecewise construction.

Viterbi process. The notion of the infinite Viterbi path for a single realization of the observation process $X$ can be naturally extended to the infinite Viterbi path of the process $X$ itself. This extension is called the Viterbi process. Formally this process is defined as follows. Let $V = \{V_k\}_{k\geq 1}$ be a random process taking values in the state space $\mathcal{Y}$. We assume that $V$ is defined on the same probability space as $Z$, namely $(\Omega, \mathcal{F}, P)$.

Definition 4. The process $V$ is called the Viterbi process if there exists a set $\Omega_0 \in \mathcal{F}$ such that $P(\Omega_0) = 1$ and for all $\omega \in \Omega_0$ the sequence $V(\omega)$ is the infinite Viterbi path of $X(\omega)$.

For each $\omega \in \Omega$ the infinite Viterbi path of $X(\omega)$ is well-defined by Definition 1, if it exists. However, the set $\{\omega \in \Omega \mid V(\omega) \text{ is the infinite Viterbi path of } X(\omega)\}$ might not be measurable, so the above definition simply requires that the complement of this set be contained in some event with zero probability. Suppose now that there exists a set $\mathbb{X} \subseteq \mathcal{X}^M$ such that each of its elements is a strong $i$-barrier of order $r$, and $P(X \in \mathbb{X} \text{ i.o.}) = 1$. Let $\Omega_0 = \{X \in \mathbb{X} \text{ i.o.}\}$. Because the observation sequence $X(\omega)$ contains infinitely many strong $i$-nodes of order $r$ for all $\omega \in \Omega_0$, it is now straightforward to verify the existence of the Viterbi process $V$.

Indeed, for each $k \geq 1$ the random variable $V_k$ can formally be defined as follows. Take $T(k) = \min\{m - r \mid X_{m-M+1:m} \in \mathbb{X},\ m \geq k + r\}$. Thus $T(k) \geq k$, and by the definition of a barrier, $T(k)(\omega)$ is a strong $i$-node of order $r$ for all $\omega \in \Omega_0$. For each $k$, the random variable $V_k$ is then defined as the $k$th element of the random-length vector $v(X_{1:T(k)+r})$. We already know from the piecewise construction of the Viterbi path that for all $\omega \in \Omega_0$ the Viterbi path up to $T(k)(\omega)$ is fixed for all $n \geq T(k)(\omega) + r$, and so $V_k$ is well-defined as the $k$th element of the Viterbi process, provided that it is a measurable random variable.

To verify that $V_k$ can indeed be chosen to be measurable, note that, assuming a (co)lexicographic ordering scheme based on some fixed ordering on $\mathcal{Y}$, the mapping
$$x_{1:n} \mapsto \arg\max_{y_{1:n}} p(x_{1:n}, y_{1:n}) \qquad (15)$$
is measurable for all $n \geq 1$. It follows that for any $i \in \mathcal{Y}$,
$$\{V_k = i\} \cap \Omega_0 = \bigcup_{n \geq k+r} \{V_k = i,\ T(k) + r = n\} \cap \Omega_0 \in \mathcal{F}.$$

Finally, to make sure that $V_k$ is formally a mapping on the whole space $\Omega$, define $V_k(\omega) = 1$ for all $\omega \in \Omega_0^c$. Since then $\{V_k = i\} \cap \Omega_0^c$ equals $\Omega_0^c$ if $i = 1$ and $\emptyset$ otherwise, we have
$$\{V_k = i\} = \big(\{V_k = i\} \cap \Omega_0\big) \cup \big(\{V_k = i\} \cap \Omega_0^c\big) \in \mathcal{F}.$$
Thus $V_k$ is a well-defined random variable.

When the barriers in $\mathbb{X}$ are not strong, the construction of the Viterbi process is still possible, but slightly more complicated. In that case the tie-breaking for the mapping $X_{1:n} \mapsto v(X_{1:n})$ is no longer fixed and will depend on the positions of the nodes prior to $n$, and thereby also on the random sequence $X_{1:n}$. Even so, it is easy to show based on the piecewise construction of the infinite Viterbi path that the Viterbi process still exists. However, the corresponding ordering scheme will not be (co)lexicographic anymore and therefore will not align with the ordering of the Viterbi algorithm.

The goal of Paper I is to find practical and general conditions for the existence of the barrier set $\mathbb{X}$. In Paper II it is proved that under general conditions on the barrier set $\mathbb{X}$, there exist regeneration times for the Markov chain $Z$ which are also nodes of fixed order for almost every realization of $X$. These regeneration times break the $Z$-process into i.i.d. cycles, and because up to each regeneration time the Viterbi path is fixed for a sufficiently large sample size, then – as argued in Section 6 below – SLLN and CLT type results apply for the Viterbi classifier. Thus the Viterbi process not only ensures the overall path-stability of the Viterbi estimation, but is also a useful tool for obtaining related asymptotic convergence properties.

History of the problem. To the best of our knowledge, prior to Paper I the existence of the Viterbi process had been proven for HMM's only. The first such results were obtained in 2002 by Caliebe and Rösler [16] (see also [17]), who essentially define the concept of nodes and prove the existence of infinitely many nodes under rather restrictive assumptions like (13). A much more general treatment of the HMM case was given in 2010 by Lember and Koloydenko [18], who introduce the general definitions of nodes and barriers, as given above, and prove the existence of the Viterbi process under broad conditions. The basic ideas behind Paper I are the same as in [18], but applying these ideas to the general PMM is far from straightforward due to the much more complex nature of the model. Indeed, from our general barrier construction theorem in Paper I we were able to strengthen the HMM result in [18]. This strengthened result will be presented in the next section along with a discussion of its conditions.

In [19] Lember and Koloydenko also consider specifically the 2-state ergodic HMM, and prove that for such a model the Viterbi process always exists if there is an essential difference between the two emission densities.

In Papers I–III we consider only the case when the state space $\mathcal{Y}$ is finite, because that is the most suitable and fruitful framework for our ideas. When the state space $\mathcal{Y}$ is continuous, the existence of the infinite Viterbi path as defined here becomes too restrictive a requirement, so it is more generally defined in terms of convergence of the path estimate. In [20] Chigansky and Ritov study such convergence and prove the existence of the limiting Viterbi process under certain restrictive conditions, such as log-concavity of the emission and transition densities. More recently, Whiteley et al. [21] also study the existence of such a limiting process under a different set of conditions. The latter paper is more motivated by computational aspects and the scalability of the Viterbi path approximation via parallelization.
