- 1 - Digital Signal Processing and Pattern Recognition
Hidden Markov Models
Word models so far…
A word is given by a sequence of "states".
A state is given by an n-ary probability density function,
from which a distance measure is derived using the negative logarithm.
Assumption: independent features, normal distribution. A state is then given by n mean values and n variances.
…
Idea
Feature vector sequences are generated by an (unknown) system.
Training:
Use the reference feature vector sequences to build
a model of the generating system for each class (Hidden Markov Model).
Classification of a feature vector sequence:
For each model compute the likelihood that it generated the sequence.
Classification result: the model with the highest probability.
Example speech recognition
For each word of the vocabulary build a model of the speaker producing the word.
Classification of a feature vector sequence x:
Determine the likelihood for each model that it generates the given sequence x.
The classification result is the word
whose model has the highest likelihood.
Example weather prediction
Based on feature vectors:
e.g. air pressure, temperature, wind velocity, …
Based on a model with temporal context:
e.g. the weather tomorrow will probably be the same as today.
Hidden Markov models: a combination of both.
Outline
Markov Models
Hidden Markov Models
Concatenation of Hidden Markov Models
Does randomness exist?
What is a model?
Model
Description of a real system
Simplifying assumptions, abstraction from "irrelevant" details.
Random
"God doesn't play dice" (Einstein 1929)
Deterministic world view, Laplace's demon, free will?
Randomness and probability are a model for missing knowledge/understanding. The system itself is usually deterministic.
Stochastic models (of deterministic systems).
Conditional Probabilities
Probability that a die shows…
a number > 3,
an even number,
an even number under the assumption that it is > 3.
Example: Weather model
Markov Models
Observe a system at discrete points in time t = 1, 2, 3, …
At each point in time t the system is in one of n possible states j = 1, …, n.
Given:
Initial probabilities at time t = 1: P(S_1 = j) for j = 1, …, n.
Transition probabilities: P(S_t = j | S_{t-1} = i) for i, j = 1, …, n.
Wanted:
Probability that the system is in state j at time t.
Example: weather model with two states: good, bad.
Time t-1 = yesterday, t = today.
Yesterday the weather was good, hence
P(weather today good) = 0.7
P(weather today bad) = 0.3
(transition probabilities).
In general:
S_1, S_2, S_3, … is a sequence of discrete random variables (stochastic process).
We want to determine the distribution of these random variables.
Simplifying assumptions for Markov models
Transition probabilities are time constant, i.e. they do not depend on the time t.
Example: transition probabilities are different in autumn and in summer, because the weather changes more frequently in autumn.
This is neglected by the Markov model.
Markov models of technical systems ignore the aging of components.
The probability distribution of S_t depends only on the state of the system at time t-1, not on earlier states (Markov property).
Example: if we considered the weather from several days ago
and not just from yesterday, we could say more about the probabilities of the weather tomorrow.
However, this would lead to more complicated models!
Summary: A Markov model is given by:
Initial probabilities:
probability that the system is in state j at time t = 1.
Transition probabilities:
probability that the system is in state j
under the assumption that it was in state i in the previous cycle.
Independent of t!
Boundary conditions
The sum of the initial probabilities must be 1.
The sum of the transition probabilities leaving any state must be 1.
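As a quick numerical check, both conditions can be verified for a small model; all probabilities below are made-up values for a hypothetical 3-state model.

```python
# Boundary conditions of a Markov model, checked for a hypothetical
# 3-state example (all numbers are made up for illustration).
initial = [0.5, 0.3, 0.2]        # initial probabilities, must sum to 1
transitions = [                  # row i: transition probabilities from state i
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]

assert abs(sum(initial) - 1.0) < 1e-9
for row in transitions:
    assert abs(sum(row) - 1.0) < 1e-9
print("boundary conditions satisfied")
```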
Example
A system with 3 states.
Observed state sequence: 1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2
Wanted: the transition probability from state 3 to state 2.
Estimation of the transition probabilities from a random sample.
7 cases where S_{t-1} = 3,
3 cases where S_{t-1} = 3 and S_t = 2
in the observed sequence 1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2.
Estimation: P(S_t = 2 | S_{t-1} = 3) ≈ 3/7.
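The estimate can be reproduced by counting state pairs in the observed sequence; a minimal sketch:

```python
from collections import Counter

# Estimate transition probabilities from the observed state sequence
# of the example (states 1, 2, 3).
sequence = [1,3,3,2,2,3,1,3,3,2,2,1,3,1,2,1,1,2,3,2]

pair_counts = Counter(zip(sequence, sequence[1:]))   # (S_{t-1}, S_t) pairs
state_counts = Counter(sequence[:-1])                # occurrences of S_{t-1}

def a(i, j):
    """Relative-frequency estimate of P(S_t = j | S_{t-1} = i)."""
    return pair_counts[(i, j)] / state_counts[i]

print(state_counts[3])      # 7 cases with S_{t-1} = 3
print(pair_counts[(3, 2)])  # 3 cases with S_{t-1} = 3 and S_t = 2
print(a(3, 2))              # 3/7 ≈ 0.4286
```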
Conditional Probabilities
A, B: events with probabilities P(A), P(B).
e.g. a die:
A: even number, B: number greater than 3.
A = {2,4,6}, B = {4,5,6}
P(A ∩ B): probability that A and B both occur.
e.g. A ∩ B = {4,6}
P(A | B): probability that A occurs under the assumption that B is the case.
e.g. the die shows an even number under the assumption that it is greater than 3:
cases in which the number is greater than 3: {4,5,6};
in two of these cases the number is even: {4,6}.
P(A) = 3/6 = 1/2, P(B) = 3/6 = 1/2
P(A ∩ B) = 2/6 = 1/3 ≠ P(A) · P(B)!
P(A | B) = 2/3
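The dice example can be checked by enumerating the six equally likely outcomes:

```python
from fractions import Fraction

# Enumerate the outcomes of a die and compute the probabilities
# from the example: A = even number, B = number greater than 3.
outcomes = [1, 2, 3, 4, 5, 6]
A = {x for x in outcomes if x % 2 == 0}   # {2, 4, 6}
B = {x for x in outcomes if x > 3}        # {4, 5, 6}

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

print(P(A))             # 1/2
print(P(B))             # 1/2
print(P(A & B))         # 1/3
print(P(A & B) / P(B))  # P(A|B) = 2/3
```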
Def.: Events A and B are called independent if P(A ∩ B) = P(A) · P(B).
If A and B are independent, then P(A | B) = P(A).
Def.: Conditional probability: P(A | B) = P(A ∩ B) / P(B).
e.g. two dice, 36 possibilities:
A: first die shows an even number, B: second die shows a number > 3.
Example: n = 10 elementary events with equal probability.
A and B are dependent!
Application to Markov models
Transition probabilities: P(S_t = j | S_{t-1} = i).
Probability that the system is at time t in state j and at time t-1 in state i:
P(S_t = j, S_{t-1} = i) = P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i).
Probability that the system is at time t in state j:
P(S_t = j) = Σ_i P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i).
Computation of the distribution of S_t for any time t = 1, 2, 3, …:
start with the initial probabilities P(S_1 = j) and apply
P(S_t = j) = Σ_i P(S_t = j | S_{t-1} = i) · P(S_{t-1} = i) for t = 2, 3, …
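This propagation can be sketched for the two-state weather model. Only the row for "good" weather (0.7, 0.3) appears in the slides; the row for "bad" weather below is a made-up value for illustration.

```python
# Propagate the state distribution P(S_t = j) = sum_i P(S_{t-1} = i) * a_ij
# for the two-state weather model (states: good = 0, bad = 1).
a = [[0.7, 0.3],   # from good: (good, bad) -- from the slides
     [0.4, 0.6]]   # from bad:  (good, bad) -- assumed for illustration
p = [1.0, 0.0]     # day 1: the weather is good for certain

for t in range(2, 5):
    p = [sum(p[i] * a[i][j] for i in range(2)) for j in range(2)]
    print(t, [round(x, 4) for x in p])
# 2 [0.7, 0.3]
# 3 [0.61, 0.39]
# 4 [0.583, 0.417]
```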
Examples of systems which can be described by Markov models
Queuing systems
States: length of the queue
Speaker uttering a word
Activity of the vocal cords, opening of the mouth, shape of the lips, tongue, …
States with loop, next, and skip transitions
Diffusion processes through a membrane
States: molecule is left / right of the membrane
Extension: continuous-time Markov models
Hidden Markov Models (HMM)
Extension of Markov models:
the Markov model produces in each step t a random vector.
The probability distribution of the produced vector depends on the state the Markov model is in, i.e. each state i has a density function f_i(x).
An external observer can see only the produced random vectors
but does not know the state of the system. The state is hidden from observation.
Hidden Markov model:
a system for the production of random vector sequences.
Examples where the state is "hidden" and only feature vectors can be observed:
warning lights in a car, machine noise, diagnostics, …
[Diagram: Hidden Markov Model (state not observable) produces a feature vector sequence (observable).]
Example:
Probability density of the feature vector which is produced in state i: emission density f_i(x).
Observed feature vector sequence: x = x_1, …, x_T.
Problems:
What is the likelihood that the HMM generates a given sequence x?
Through which states does the HMM step while producing x?
How is an HMM developed from given training sequences?
Classification of feature vector sequences with HMMs
Model assumption: feature vector sequences are generated by HMMs.
One HMM for each class.
Classification of a feature vector sequence x:
Compute for each HMM the likelihood that it produced x.
Classification result: the class whose HMM has the highest likelihood.
Theory: Classification with HMMs
Elementary events:
i-th HMM is active: prior probability P_i.
Feature vector sequence x is observed: density p(x).
The events are dependent, otherwise classification would be impossible!
Probability that x is produced under the assumption that the i-th HMM was active:
emission probability p(x | i).
Probability that the i-th HMM was active under the assumption that x was observed:
classification probability P(i | x).
Connection: Bayes' law.
Probability that the observed sequence x has been generated by the i-th HMM (Bayes' law):
P(i | x) = p(x | i) · P_i / p(x).
Classification result: the HMM with the highest probability P(i | x).
Remaining task:
compute the likelihood p(x | i) that a given HMM produces the observed sequence x.
Likelihood of the random vector which is produced in state i (emission density): f_i(x).
Observed feature vector sequence: x = x_1, …, x_T.
Thereby traversed states: s = s_1, …, s_T.
Likelihood that x is produced if the state sequence s is traversed:
p(x | s) = f_{s_1}(x_1) · f_{s_2}(x_2) · … · f_{s_T}(x_T).
Likelihood that s is traversed (transition probabilities of the Markov model):
p(s) = P(S_1 = s_1) · P(S_2 = s_2 | S_1 = s_1) · … · P(S_T = s_T | S_{T-1} = s_{T-1}).
Likelihood that x is produced and s is traversed (Bayes):
p(x, s) = p(x | s) · p(s).
Likelihood that x is produced:
p(x) = Σ_s p(x, s).
Problem: sum over all n^T possible state sequences of length T!
More efficient way: iterative computation (forward algorithm) for t = 1, 2, …, T:
α_1(j) = P(S_1 = j) · f_j(x_1)
α_t(j) = [ Σ_i α_{t-1}(i) · P(S_t = j | S_{t-1} = i) ] · f_j(x_t)
p(x) = Σ_j α_T(j)
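A sketch of this iterative computation in Python; all model numbers are made up, and for simplicity discrete emission probabilities stand in for the emission densities. A brute-force sum over all state sequences confirms the result.

```python
import itertools
import math

# Forward computation: alpha_t(j) = f_j(x_t) * sum_i alpha_{t-1}(i) * a[i][j],
# alpha_1(j) = pi_j * f_j(x_1), and p(x) = sum_j alpha_T(j).
pi = [0.6, 0.4]           # initial probabilities (made up)
a  = [[0.7, 0.3],         # transition probabilities (made up)
      [0.2, 0.8]]
f  = [[0.9, 0.1],         # f[i][x]: probability that state i emits symbol x
      [0.3, 0.7]]

x = [0, 1, 1, 0]          # observed sequence
n, T = len(pi), len(x)

# iterative (forward) computation: O(n^2 * T)
alpha = [pi[j] * f[j][x[0]] for j in range(n)]
for t in range(1, T):
    alpha = [f[j][x[t]] * sum(alpha[i] * a[i][j] for i in range(n))
             for j in range(n)]
p_forward = sum(alpha)

# brute force: sum over all n**T state sequences (exponential!)
p_brute = 0.0
for s in itertools.product(range(n), repeat=T):
    p = pi[s[0]] * f[s[0]][x[0]]
    for t in range(1, T):
        p *= a[s[t-1]][s[t]] * f[s[t]][x[t]]
    p_brute += p

assert math.isclose(p_forward, p_brute)
print(p_forward)
```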
Problem:
very small likelihood values, underflow with floating point arithmetic.
Solution:
take the logarithm of the likelihoods.
Distance:
the distance corresponds to the negative logarithm of the likelihood.
Notation: likelihood → distance.
Distances for t = 1: D_1(j) = -ln( P(S_1 = j) · f_j(x_1) ).
Maximum approximation: replace the sum over the predecessor states by its largest term.
Step from t-1 to t (in the distance domain):
D_t(j) = min_i [ D_{t-1}(i) - ln P(S_t = j | S_{t-1} = i) ] - ln f_j(x_t).
Most likely state sequence:
store for each state and time a backpointer to the best predecessor;
backtracking from the best final state yields the most likely state sequence (Viterbi algorithm).
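A sketch of this search in the distance (negative-log) domain with backpointers; all model numbers are made up, and discrete emission probabilities stand in for the emission densities.

```python
import math

# Viterbi search in the distance domain:
#   D_t(j) = min_i [ D_{t-1}(i) - ln a[i][j] ] - ln f_j(x_t),
# with backpointers to recover the most likely state sequence.
pi = [0.6, 0.4]           # initial probabilities (made up)
a  = [[0.7, 0.3],         # transition probabilities (made up)
      [0.2, 0.8]]
f  = [[0.9, 0.1],         # f[i][x]: probability that state i emits symbol x
      [0.3, 0.7]]
x  = [0, 0, 1, 1]         # observed sequence
n, T = len(pi), len(x)

D = [[0.0] * n for _ in range(T)]     # distances
back = [[0] * n for _ in range(T)]    # backpointers
for j in range(n):
    D[0][j] = -math.log(pi[j]) - math.log(f[j][x[0]])
for t in range(1, T):
    for j in range(n):
        i_best = min(range(n), key=lambda i: D[t-1][i] - math.log(a[i][j]))
        D[t][j] = D[t-1][i_best] - math.log(a[i_best][j]) - math.log(f[j][x[t]])
        back[t][j] = i_best

# backtracking from the best final state
j = min(range(n), key=lambda i: D[T-1][i])
path = [j]
for t in range(T - 1, 0, -1):
    j = back[t][j]
    path.append(j)
path.reverse()
print(path)   # [0, 0, 1, 1]
```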
Special case: HMM for speech recognition
State transitions only in the "time direction":
loop, next, and skip transitions.
Viterbi Training of HMMs
(Normal distribution for emissions, transition probabilities)
Reestimation of the emission densities in the states.
Reestimation of the transition probabilities,
e.g.: â_ij = (number of transitions i → j) / (number of times in state i).
Match the reference sequences with the new HMM (Viterbi algorithm).
Estimate new emission and transition probabilities from the new segmentations. Iterate.
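One reestimation step of this loop can be sketched as follows, assuming the Viterbi algorithm has already aligned a feature sequence to the states. The features are one-dimensional and the alignment is made up for illustration.

```python
from collections import Counter
import statistics

# One reestimation step of Viterbi training, given a Viterbi alignment
# of a (made-up, 1-D) feature sequence to the HMM states.
features = [0.1, 0.2, 0.1, 1.9, 2.1, 2.0, 0.9, 1.1]
states   = [0,   0,   0,   1,   1,   1,   2,   2  ]   # Viterbi alignment

# new emission parameters: mean and variance per state
params = {}
for s in set(states):
    vals = [x for x, st in zip(features, states) if st == s]
    params[s] = (statistics.mean(vals), statistics.pvariance(vals))

# new transition probabilities: relative frequencies,
#   a_ij = (# transitions i -> j) / (# times in state i)
pairs = Counter(zip(states, states[1:]))
visits = Counter(states[:-1])
a = {(i, j): pairs[(i, j)] / visits[i] for (i, j) in pairs}

print(params)
print(a)   # e.g. a[(0, 0)] = 2/3, a[(0, 1)] = 1/3
```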
Concatenation of Hidden Markov Models
Examples
Speech recognition:
one HMM for each word of the vocabulary (whole word models); recognition of arbitrary word sequences.
Long-term ECG:
one HMM for a healthy heart cycle, one HMM for each symptom.
Machine noise:
one HMM for a cycle in normal operation, one HMM for malfunction.
Example speech recognition
HMM for word 1
HMM for word 2
HMM for arbitrary sequences of word 1 and 2
Probabilities for word transitions: language model,
e.g. p(the, man) > p(the, and).
Hidden Markov model: HMM for word 1, HMM for word 2.
Viterbi algorithm (recognition).
Recognized word sequence is obtained from optimal path: 1, 1, 2, 1, 2.
HMM for word 1, HMM for word 2.
One start / end state for each word.
Hidden Markov model with silence state
HMM for word 1, HMM for word 2, silence state.
Viterbi algorithm.
Simplifications for a vocabulary of n words:
only one start and end state (silence),
only 2n model transitions instead of n².
Recognized word sequence from optimal path: 1, 2, 2.
HMM for word 1, HMM for word 2, silence.
Only one start and end state: silence.
Real-time recognition: output the recognized words as soon as possible.
Avoid computation and backtracking of the optimal path during recognition.
Bookkeeping of the recognized word sequence in each state: history.
HMM for word 1, HMM for word 2, silence.
At the transitions out of a word model, add word 1 or word 2 to the history.
Simultaneous training of word and silence HMM
Silence state, HMM for the word, silence state.
Shared emission density for all silence states (in all words).
No problem if the recordings begin or end with silence.
Linear segmentation, e.g.: 1/3 silence, 1/3 word, 1/3 silence.
Linear segmentation: silence, word, silence.
After some iterations: the segmentation adapts to the actual boundaries between silence and the word.