- 1 - Digital Signal Processing and Pattern Recognition
Hidden Markov Models
Word models so far…
A word is given by a sequence of "states".
A state is given by an n-ary probability density function,
from which a distance measure is derived using the negative logarithm.
Assumption: independent features, normal distribution. A state is then given by n mean values and n variances.
…
Idea
Feature vector sequences are generated by an (unknown) system.
Training:
Use the reference feature vector sequences to build
a model of the generating system for each class (Hidden Markov Model).
Classification of a feature vector sequence:
For each model compute the likelihood that it generated the sequence.
Classification result: the model with the highest probability.
Example speech recognition
For each word of the vocabulary build a model of the speaker producing the word.
Classification of a feature vector sequence x:
Determine the likelihood for each model that it generates the given sequence x.
The classification result is the word
whose model has the highest likelihood.
Example weather prediction
Based on feature vectors:
e.g. air pressure, temperature, wind velocity, …
Based on a model with temporal context:
e.g. the weather tomorrow will probably be the same as today.
Hidden Markov models: a combination of both.
Outline
Markov Models
Hidden Markov Models
Concatenation of Hidden Markov Models
Does randomness exist?
What is a model?
Model
Description of a real system
Simplifying assumptions, abstraction from "irrelevant" details.
Random
"God doesn't play dice" (Einstein 1929)
Deterministic world view, Laplace's demon, free will?
Randomness and probability are a model for missing knowledge/understanding. The system itself is usually deterministic.
Stochastic models (of deterministic systems).
Conditional Probabilities
Probability that a die shows…
a number > 3,
an even number,
an even number under the assumption that it is > 3.
Example: Weather model
Markov Models
Observe a system at discrete points in time t = 1, 2, 3, …
At each point in time t the system is in one of n possible states j = 1, …, n.
Given:
Initial probabilities at time t = 1: P(S_1 = j) for j = 1, …, n.
Transition probabilities: P(S_t = j | S_{t-1} = i) for i, j = 1, …, n.
Wanted:
Probability that the system is in state j at time t.
Example: weather model with two states: good, bad.
Time t-1 = yesterday, t = today.
Yesterday the weather was good, hence
P(weather today good) = 0.7
P(weather today bad) = 0.3
(transition probabilities).
In general:
S_1, S_2, S_3, … is a sequence of discrete random variables (stochastic process).
We want to determine the distribution of these random variables.
Simplifying assumptions for Markov models
Transition probabilities are time constant, i.e. they do not depend on the time t.
Example: transition probabilities are different in autumn and in summer, because the weather changes more frequently in autumn.
This is neglected by the Markov model.
Markov models of technical systems ignore the aging of components.
The probability distribution of S_t depends only on the state of the system at time t-1, not on earlier states (Markov property).
Example: if we considered the weather from several days ago
and not just from yesterday, we could say more about the probabilities of the weather tomorrow.
However, this would lead to more complicated models!
Summary: A Markov model is given by:
Initial probabilities:
probability that the system is in state j at time t = 1.
Transition probabilities:
probability that the system is in state j
under the assumption that it was in state i in the previous cycle.
Independent of t!
Boundary conditions
The sum of the initial probabilities must be 1.
The sum of the transition probabilities leaving any state must be 1.
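As a quick numerical check, both conditions can be verified for a small model; all probabilities below are made-up values for a hypothetical 3-state model.

```python
# Boundary conditions of a Markov model, checked for a hypothetical
# 3-state example (all numbers are made up for illustration).
initial = [0.5, 0.3, 0.2]        # initial probabilities, must sum to 1
transitions = [                  # row i: transition probabilities from state i
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]

assert abs(sum(initial) - 1.0) < 1e-9
for row in transitions:
    assert abs(sum(row) - 1.0) < 1e-9
print("boundary conditions satisfied")
```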
Example
A system with 3 states.
Observed state sequence: 1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2
Wanted: the transition probability from state 3 to state 2.
Estimation of the transition probabilities from a random sample.
7 cases where S_{t-1} = 3,
3 cases where S_{t-1} = 3 and S_t = 2
in the observed sequence 1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2.
Estimation: P(S_t = 2 | S_{t-1} = 3) ≈ 3/7.
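The estimate can be reproduced by counting state pairs in the observed sequence; a minimal sketch:

```python
from collections import Counter

# Estimate transition probabilities from the observed state sequence
# of the example (states 1, 2, 3).
sequence = [1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2]

pair_counts = Counter(zip(sequence, sequence[1:]))   # (S_{t-1}, S_t) pairs
state_counts = Counter(sequence[:-1])                # occurrences of S_{t-1}

def a(i, j):
    """Relative-frequency estimate of P(S_t = j | S_{t-1} = i)."""
    return pair_counts[(i, j)] / state_counts[i]

print(state_counts[3])      # 7 cases with S_{t-1} = 3
print(pair_counts[(3, 2)])  # 3 cases with S_{t-1} = 3 and S_t = 2
print(a(3, 2))              # 3/7 ≈ 0.4286
```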
Conditional Probabilities
A, B: events with probabilities P(A), P(B).
e.g. a die:
A: even number, B: number greater than 3.
A = {2,4,6}, B = {4,5,6}
P(A ∩ B): probability that A and B both occur.
e.g. A ∩ B = {4,6}
P(A | B): probability that A occurs under the assumption that B is the case.
e.g. the die shows an even number under the assumption that it is greater than 3:
cases in which the number is greater than 3: {4,5,6};
in two of these cases the number is even: {4,6}.
P(A) = 3/6 = 1/2, P(B) = 3/6 = 1/2
P(A ∩ B) = 2/6 = 1/3 ≠ P(A) · P(B)!
P(A | B) = 2/3
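The dice example can be checked by enumerating the six equally likely outcomes:

```python
from fractions import Fraction

# Enumerate the outcomes of a die and compute the probabilities
# from the example: A = even number, B = number greater than 3.
outcomes = [1, 2, 3, 4, 5, 6]
A = {x for x in outcomes if x % 2 == 0}   # {2, 4, 6}
B = {x for x in outcomes if x > 3}        # {4, 5, 6}

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

print(P(A))             # 1/2
print(P(B))             # 1/2
print(P(A & B))         # 1/3
print(P(A & B) / P(B))  # P(A|B) = 2/3
```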
Def.: Events A and B are called independent if P(A ∩ B) = P(A) · P(B).
If A and B are independent, then P(A | B) = P(A).
Def.: Conditional probability: P(A | B) = P(A ∩ B) / P(B).
e.g. two dice, 36 possibilities:
A: first die shows an even number, B: second die shows a number > 3.
Example: n = 10 elementary events with equal probability.
A and B are dependent!
Application to Markov models
Transition probabilities: P(S_t = j | S_{t-1} = i).
Probability that the system is at time t in state j and at time t-1 in state i:
P(S_t = j, S_{t-1} = i) = P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i).
Probability that the system is at time t in state j:
P(S_t = j) = Σ_i P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i).
Computation of the distribution of S_t for any time t = 1, 2, 3, …:
start with the initial probabilities P(S_1 = j) and apply
P(S_t = j) = Σ_i P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i) for t = 2, 3, …
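This propagation can be sketched for the two-state weather model. Only the row for "good" weather (0.7, 0.3) appears in the slides; the row for "bad" weather below is a made-up value for illustration.

```python
# Propagate the state distribution P(S_t = j) = sum_i P(S_{t-1} = i) * a_ij
# for the two-state weather model (states: good = 0, bad = 1).
a = [[0.7, 0.3],   # from good: (good, bad) -- from the slides
     [0.4, 0.6]]   # from bad:  (good, bad) -- assumed for illustration
p = [1.0, 0.0]     # day 1: the weather is good for certain

for t in range(2, 5):
    p = [sum(p[i] * a[i][j] for i in range(2)) for j in range(2)]
    print(t, [round(x, 4) for x in p])
# 2 [0.7, 0.3]
# 3 [0.61, 0.39]
# 4 [0.583, 0.417]
```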
Examples of systems which can be described by Markov models
Queuing systems
States: length of the queue
Speaker uttering a word
Activity of the vocal cords, opening of the mouth, shape of the lips, tongue, …
States with loop, next, and skip transitions
Diffusion processes through a membrane
States: molecule is left / right of the membrane
Extension: continuous-time Markov models
Hidden Markov Models (HMM)
Extension of Markov models:
the Markov model produces in each step t a random vector.
The probability distribution of the produced vector depends on the state the Markov model is in, i.e. each state i has a density function f_i(x).
An external observer can see only the produced random vectors
but does not know the state of the system. The state is hidden from observation.
Hidden Markov model:
a system for the production of random vector sequences.
Examples where the state is "hidden" and only feature vectors can be observed:
warning lights in a car, machine noise, diagnostics, …
[Diagram: Hidden Markov Model (state not observable) produces a feature vector sequence (observable).]
Example:
Probability density of the feature vector which is produced in state i: emission density f_i(x).
Observed feature vector sequence: x = x_1, …, x_T.
Problems:
What is the likelihood that the HMM generates a given sequence x?
Through which states does the HMM step while producing x?
How is an HMM developed from given training sequences?
Classification of feature vector sequences with HMMs
Model assumption: feature vector sequences are generated by HMMs.
One HMM for each class.
Classification of a feature vector sequence x:
Compute for each HMM the likelihood that it produced x.
Classification result: the class whose HMM has the highest likelihood.
Theory: Classification with HMMs
Elementary events:
i-th HMM is active: prior probability P_i.
Feature vector sequence x is observed: density p(x).
The events are dependent, otherwise classification would be impossible!
Probability that x is produced under the assumption that the i-th HMM was active:
emission probability p(x | i).
Probability that the i-th HMM was active under the assumption that x was observed:
classification probability P(i | x).
Connection: Bayes' law.
Probability that the observed sequence x has been generated by the i-th HMM (Bayes' law):
P(i | x) = p(x | i) · P_i / p(x).
Classification result: the HMM with the highest probability P(i | x).
Remaining task:
compute the likelihood p(x | i) that a given HMM produces the observed sequence x.
Likelihood of the random vector which is produced in state i (emission density): f_i(x).
Observed feature vector sequence: x = x_1, …, x_T.
Thereby traversed states: s = s_1, …, s_T.
Likelihood that x is produced if the state sequence s is traversed:
p(x | s) = f_{s_1}(x_1) · f_{s_2}(x_2) · … · f_{s_T}(x_T).
Likelihood that s is traversed (transition probabilities of the Markov model):
p(s) = P(S_1 = s_1) · P(S_2 = s_2 | S_1 = s_1) · … · P(S_T = s_T | S_{T-1} = s_{T-1}).
Likelihood that x is produced and s is traversed (Bayes):
p(x, s) = p(x | s) · p(s).
Likelihood that x is produced:
p(x) = Σ_s p(x, s).
Problem: sum over all n^T possible state sequences of length T!
More efficient way: iterative computation (forward algorithm) for t = 1, 2, …, T:
α_1(j) = P(S_1 = j) · f_j(x_1)
α_t(j) = [ Σ_i α_{t-1}(i) · P(S_t = j | S_{t-1} = i) ] · f_j(x_t)
p(x) = Σ_j α_T(j)
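A sketch of this iterative computation in Python; all model numbers are made up, and for simplicity discrete emission probabilities stand in for the emission densities. A brute-force sum over all state sequences confirms the result.

```python
import itertools
import math

# Forward computation: alpha_t(j) = f_j(x_t) * sum_i alpha_{t-1}(i) * a[i][j],
# alpha_1(j) = pi_j * f_j(x_1), and p(x) = sum_j alpha_T(j).
pi = [0.6, 0.4]           # initial probabilities (made up)
a  = [[0.7, 0.3],         # transition probabilities (made up)
      [0.2, 0.8]]
f  = [[0.9, 0.1],         # f[i][x]: probability that state i emits symbol x
      [0.3, 0.7]]

x = [0, 1, 1, 0]          # observed sequence
n, T = len(pi), len(x)

# iterative (forward) computation: O(n^2 * T)
alpha = [pi[j] * f[j][x[0]] for j in range(n)]
for t in range(1, T):
    alpha = [f[j][x[t]] * sum(alpha[i] * a[i][j] for i in range(n))
             for j in range(n)]
p_forward = sum(alpha)

# brute force: sum over all n**T state sequences (exponential!)
p_brute = 0.0
for s in itertools.product(range(n), repeat=T):
    p = pi[s[0]] * f[s[0]][x[0]]
    for t in range(1, T):
        p *= a[s[t-1]][s[t]] * f[s[t]][x[t]]
    p_brute += p

assert math.isclose(p_forward, p_brute)
print(p_forward)
```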
Problem:
very small likelihood values, underflow with floating point arithmetic.
Solution:
take the logarithm of the likelihoods.
Distance:
the distance corresponds to the negative logarithm of the likelihood.
Notation: likelihood → distance.
Distances for t = 1: D_1(j) = -ln( P(S_1 = j) · f_j(x_1) ).
Maximum approximation: replace the sum over the predecessor states by its largest term.
Step from t-1 to t (in the distance domain):
D_t(j) = min_i [ D_{t-1}(i) - ln P(S_t = j | S_{t-1} = i) ] - ln f_j(x_t).
Most likely state sequence:
store for each state and time a backpointer to the best predecessor;
backtracking from the best final state yields the most likely state sequence (Viterbi algorithm).
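A sketch of this search in the distance (negative-log) domain with backpointers; all model numbers are made up, and discrete emission probabilities stand in for the emission densities.

```python
import math

# Viterbi search in the distance domain:
#   D_t(j) = min_i [ D_{t-1}(i) - ln a[i][j] ] - ln f_j(x_t),
# with backpointers to recover the most likely state sequence.
pi = [0.6, 0.4]           # initial probabilities (made up)
a  = [[0.7, 0.3],         # transition probabilities (made up)
      [0.2, 0.8]]
f  = [[0.9, 0.1],         # f[i][x]: probability that state i emits symbol x
      [0.3, 0.7]]
x  = [0, 0, 1, 1]         # observed sequence
n, T = len(pi), len(x)

D = [[0.0] * n for _ in range(T)]     # distances
back = [[0] * n for _ in range(T)]    # backpointers
for j in range(n):
    D[0][j] = -math.log(pi[j]) - math.log(f[j][x[0]])
for t in range(1, T):
    for j in range(n):
        i_best = min(range(n), key=lambda i: D[t-1][i] - math.log(a[i][j]))
        D[t][j] = D[t-1][i_best] - math.log(a[i_best][j]) - math.log(f[j][x[t]])
        back[t][j] = i_best

# backtracking from the best final state
j = min(range(n), key=lambda i: D[T-1][i])
path = [j]
for t in range(T - 1, 0, -1):
    j = back[t][j]
    path.append(j)
path.reverse()
print(path)   # [0, 0, 1, 1]
```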
Special case: HMM for speech recognition
State transitions only in the "time direction":
loop, next, and skip transitions.
Viterbi Training of HMMs
(Normal distribution for emissions, transition probabilities)
Reestimation of the emission densities in the states.
Reestimation of the transition probabilities,
e.g.: â_ij = (number of transitions i → j) / (number of times in state i).
Match the reference sequences with the new HMM (Viterbi algorithm).
Estimate new emission and transition probabilities from the new segmentations. Iterate.
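One reestimation step of this loop can be sketched as follows, assuming the Viterbi algorithm has already aligned a feature sequence to the states. The features are one-dimensional and the alignment is made up for illustration.

```python
from collections import Counter
import statistics

# One reestimation step of Viterbi training, given a Viterbi alignment
# of a (made-up, 1-D) feature sequence to the HMM states.
features = [0.1, 0.2, 0.1, 1.9, 2.1, 2.0, 0.9, 1.1]
states   = [0,   0,   0,   1,   1,   1,   2,   2  ]   # Viterbi alignment

# new emission parameters: mean and variance per state
params = {}
for s in set(states):
    vals = [x for x, st in zip(features, states) if st == s]
    params[s] = (statistics.mean(vals), statistics.pvariance(vals))

# new transition probabilities: relative frequencies,
#   a_ij = (# transitions i -> j) / (# times in state i)
pairs = Counter(zip(states, states[1:]))
visits = Counter(states[:-1])
a = {(i, j): pairs[(i, j)] / visits[i] for (i, j) in pairs}

print(params)
print(a)   # e.g. a[(0, 0)] = 2/3, a[(0, 1)] = 1/3
```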
Concatenation of Hidden Markov Models
Examples
Speech recognition:
one HMM for each word of the vocabulary (whole word models); recognition of arbitrary word sequences.
Long-term ECG:
one HMM for a healthy heart cycle, one HMM for each symptom.
Machine noise:
one HMM for a cycle in normal operation, one HMM for malfunction.
Example speech recognition
HMM for word 1
HMM for word 2
HMM for arbitrary sequences of word 1 and 2
Probabilities for word transitions: language model,
e.g. p(the, man) > p(the, and).
Hidden Markov model: HMM for word 1, HMM for word 2.
Viterbi algorithm (recognition).
Recognized word sequence is obtained from optimal path: 1, 1, 2, 1, 2.
HMM for word 1, HMM for word 2.
One start / end state for each word.
Hidden Markov model with silence state
HMM for word 1, HMM for word 2, silence state.
Viterbi algorithm.
Simplifications for a vocabulary of n words:
only one start and end state (silence),
only 2n model transitions instead of n².
Recognized word sequence from optimal path: 1, 2, 2.
HMM for word 1, HMM for word 2, silence.
Only one start and end state: silence.
Real-time recognition: output the recognized words as soon as possible.
Avoid computation and backtracking of the optimal path during recognition.
Bookkeeping of the recognized word sequence in each state: history.
HMM for word 1, HMM for word 2, silence.
At the transitions out of a word model, add word 1 or word 2 to the history.
Simultaneous training of word and silence HMM
Silence state, HMM for the word, silence state.
Shared emission density for all silence states (in all words).
No problem if the recordings begin or end with silence.
Linear segmentation, e.g.: 1/3 silence, 1/3 word, 1/3 silence.
Linear segmentation: silence, word, silence.
After some iterations: the segmentation adapts to the actual boundaries between silence and the word.