Intelligent Systems:
Directed Graphical models
Carsten Rother
Lernräume (study sessions)
• 1-2 sessions per week: 9.2. - 13.2.
• Teachers:
• Holger Heidrich: Holger.Heidrich@tu-dresden.de
• Oliver Groth: Oliver.Groth@tu-dresden.de
• Where? To be announced
• How many people would come?
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Notation
• Dmitrij Schlesinger has used the following notation:
• Input data (discrete, continuous): 𝑥
• Output class (discrete): 𝑘 ∈ K
• Parameters (derived during learning): 𝜃
• Posterior for recognition: 𝑝(𝑘|𝑥, 𝜃)
• I will use a slightly different notation:
• Input data (discrete, continuous): 𝑧 ∈ Z
• Output class (discrete): 𝑥 ∈ {0, … , 𝐾} = K
• Parameters (derived during learning): 𝜃
• Posterior for recognition: 𝑝(𝑥|𝑧, 𝜃)
Note: for images we have many variables, e.g. 1 million.
The random variables are then: 𝒙 = (𝑥₁, … , 𝑥ₙ) ∈ Kⁿ
Probabilities - Reminder
• A random variable is denoted with 𝑥 ∈ {0, … , 𝐾}
• Discrete probability distribution: 𝑝(𝑥) satisfies Σₓ 𝑝(𝑥) = 1
• Joint distribution of two random variables: 𝑝(𝑥, 𝑧)
• Conditional distribution: 𝑝(𝑥|𝑧)
• Sum rule (marginal distribution): 𝑝(𝑧) = Σₓ 𝑝(𝑥, 𝑧)
• Independent probability distribution: 𝑝(𝑥, 𝑧) = 𝑝(𝑧) 𝑝(𝑥)
• Product rule: 𝑝(𝑥, 𝑧) = 𝑝(𝑧|𝑥) 𝑝(𝑥)
• Bayes' rule: 𝑝(𝑥|𝑧) = 𝑝(𝑧|𝑥) 𝑝(𝑥) / 𝑝(𝑧)
Probabilities - Reminder
Fill in the missing entries correctly.

1) 𝑝(𝑥₁, 𝑥₂) where 𝑥ᵢ ∈ {0,1}:

         𝑥₂=0   𝑥₂=1
𝑥₁=0     0.1    0.7
𝑥₁=1      ?      ?

Answer: 0.1 and 0.1 (the joint must sum to 1 over all entries).

2) 𝑝(𝑥₁|𝑥₂) where 𝑥ᵢ ∈ {0,1}:

         𝑥₂=0   𝑥₂=1
𝑥₁=0     0.1    0.7
𝑥₁=1      ?      ?

Answer: 0.9 and 0.3 (each conditional, i.e. each column with fixed 𝑥₂, must sum to 1).
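The two normalization constraints used in this exercise can be checked mechanically. A minimal sketch (the tables are the ones from the exercise, with the filled-in answers):

```python
# Checking the filled-in tables: a joint distribution must sum to 1 overall,
# a conditional p(x1|x2) must sum to 1 for each fixed value of x2.
joint = {(0, 0): 0.1, (0, 1): 0.7,   # given row x1 = 0, keyed by (x1, x2)
         (1, 0): 0.1, (1, 1): 0.1}   # filled-in row x1 = 1
cond = {(0, 0): 0.1, (0, 1): 0.7,    # given row x1 = 0
        (1, 0): 0.9, (1, 1): 0.3}    # filled-in row x1 = 1

assert abs(sum(joint.values()) - 1.0) < 1e-9
for x2 in (0, 1):
    assert abs(cond[(0, x2)] + cond[(1, x2)] - 1.0) < 1e-9
print("both tables are valid distributions")
```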
Machine Learning: Structured versus Unstructured Models
Example from previous lecture:
𝑝: ℝ² → {1,2} (classification)
"Normal/Unstructured" Output Prediction:
• 𝑓/𝑝: Zᵐ → N (classification) (Z can be an arbitrary set)
Example: What is the topic? Input: text (Zᵐ), Output: category (N), e.g. {Economy(0), Sports(1), Literature(2), …}
Example: What language is it? Input: text (Zᵐ), Output: category (N), e.g. {German(0), English(1), Swedish(2), …}
Machine Learning: Structured versus Unstructured Models
"Normal/Unstructured" Output Prediction:
• 𝑓/𝑝: Zᵐ → ℝⁿ (regression)
Example: Input: image (Zᵐ), Output: object class (N) + localization (ℝ⁴), e.g. "bicycle" plus a bounding box
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Example: Image Labelling (Computer Vision)
K has a fixed vocabulary, e.g. K = {Wall, Picture, Person, Clutter, …}
Important: The labeling of neighboring pixels is highly correlated
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Text Processing
Input: text (Zᵐ), Output: Xⁿ (parse tree of a sentence), e.g. "The boy went home"
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Biology
Input: 3D microscopic image (Zᵐ, 3D data)
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Chemistry
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning – Big Picture
• Input data: 𝒛 ∈ Zᵐ (discrete or continuous information)
• Output label: 𝒙 ∈ Kⁿ (in our lectures mostly discrete)
• Free variables: Θ
• Loss function: 𝐶: 𝐷 × K → ℝ
• The model: 𝑝(𝒙, 𝒛, Θ) or 𝑓(𝒙, 𝒛, Θ)
• These three steps are done in order:
1. Modelling: Write down 𝑝(𝒙, 𝒛, Θ), 𝑓(𝒙, 𝒛, Θ) or 𝑝(𝒙, 𝒛), 𝑓(𝒙, 𝒛)
2. (Training/Learning): In case Θ is present, determine it using a training data set, e.g. in the supervised case {(𝒙₁, 𝒛₁), (𝒙₂, 𝒛₂), (𝒙₃, 𝒛₃), …}
3. (Testing/Prediction/Inference): Given test data 𝒛, make the optimal prediction wrt the loss function 𝐶, e.g. MAP prediction:
𝒙* = argmaxₓ 𝑝(𝒙, 𝒛, Θ)
Structured Models
Key Idea: Write down the joint probability distribution as a so-called Graphical Model:
• Directed graphical model (also called Bayesian Network) - DGM
• Undirected graphical model (also called Markov Random Field) - UGM
• Factor graph (a specific form of an undirected graphical model) - FG
• A graphical model is a visualization tool for a distribution
• We use the representation where the graph is as simple as possible (as few links as possible)
• You can convert between the three representations
(i.e. express the same joint distribution in three different ways)
• A visualization represents a family of distributions (but that is less relevant for us)
For us:
• We only consider DGM and FG
Structured Models - Example
Write down the joint probability distribution as a so-called Graphical model:
Output: 3 different variables, X = {0,1}³, i.e. 𝑥ᵢ ∈ {0,1}
Model: 𝑝(𝑥₁, 𝑥₂, 𝑥₃ | 𝒛)
• Directed graphical model (Bayesian Network) - DGM:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂) 𝑝(𝑥₃)
• Undirected graphical model (Markov Random Field) - UGM:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = (1/𝑓) 𝜓(𝑥₁, 𝑥₂, 𝑥₃) 𝜓(𝑥₂) 𝜓(𝑥₃)
• Factor graph (a specific form of an undirected graphical model) - FG:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = (1/𝑓) 𝜓(𝑥₁, 𝑥₂, 𝑥₃)
(The corresponding graph over the nodes 𝑥₁, 𝑥₂, 𝑥₃ is shown next to each representation.)
Structured Models - when to use what representation?
• Directed graphical model: The unknown variables have different "meanings"
Example: MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
• Factor graphs: the unknown variables all have the same "meaning"
Examples: Pixels in an image, nuclei in C. elegans (worm)
• Undirected graphical models are used, instead of factor graphs, when we are interested in studying "conditional independence" (not relevant for our context)
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Directed Graphical Model: Intro
• Since variables have a meaning it makes sense to look at conditional dependency
(Comment: in factor graphs and undirected graphical models this does not give any insights)
• All variables that are linked by an arrow have a causal, directed relationship
(Figure: example graphs with the nodes "Car is broken", "Eat sweets", "Hole in tooth", "Toothache")
(Comment: the nodes do not have to be discrete variables, they can also be continuous, as long as all distributions are well defined)
Directed Graphical Model: Intro
• Rewrite a distribution using the product rule:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁, 𝑥₂|𝑥₃) 𝑝(𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
• Any joint distribution can be written as a product of conditionals:
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ)
• Visualize the conditional 𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ) as a node 𝑥ᵢ with incoming arrows from 𝑥ᵢ₊₁, … , 𝑥ₙ
Directed Graphical Models: Examples
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)   "all links present"
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)   "one link is absent"
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂) 𝑝(𝑥₃)   "one link is absent"
(The corresponding graph over 𝑥₁, 𝑥₂, 𝑥₃ is shown next to each factorization.)
Comments:
1. If links are absent then (often) prediction in the model is simpler.
2. Our goal during model creation will be to find a model with as few links as possible.
Directed Graphical Model: Definition
• Given a directed graph 𝐺 = (𝑉, 𝐸), where 𝑉 is the set of nodes and 𝐸 the set of directed edges
• A directed graphical model defines the family of distributions:
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
• The set 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ) is a subset of the variables in {𝑥ᵢ₊₁, … , 𝑥ₙ}
• 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ)) is visualized as a node 𝑥ᵢ with incoming arrows from its parents
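This definition can be coded directly: store one conditional probability table (CPT) per node and multiply them. A minimal sketch, assuming binary variables and the chain x3 → x2 → x1 from the examples above; all probability values here are illustrative, not from the lecture:

```python
# A directed graphical model as a dict of CPTs; the joint is the
# product p(x1,...,xn) = prod_i p(x_i | parents(x_i)).
# CPT keys are tuples (own value, parent values...).
model = {
    "x3": {"parents": [], "cpt": {(0,): 0.6, (1,): 0.4}},
    "x2": {"parents": ["x3"], "cpt": {(0, 0): 0.9, (1, 0): 0.1,
                                      (0, 1): 0.3, (1, 1): 0.7}},
    "x1": {"parents": ["x2"], "cpt": {(0, 0): 0.5, (1, 0): 0.5,
                                      (0, 1): 0.2, (1, 1): 0.8}},
}

def joint(assignment):
    """Evaluate the joint as a product of conditionals."""
    p = 1.0
    for var, spec in model.items():
        key = (assignment[var],) + tuple(assignment[pa] for pa in spec["parents"])
        p *= spec["cpt"][key]
    return p

# Sanity check: the joint over all assignments sums to 1.
total = sum(joint({"x1": a, "x2": b, "x3": c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)
```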
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Running Example
• Situation
• I am at work
• John calls to say that the alarm in my house is on
• Mary, who is my neighbor, did not call me
• Both, John and Mary, usually call me when the alarm is on
• The alarm is usually set off by burglars, but sometimes also by a minor earthquake
• Can we construct the directed graphical model?
• First Step: Identify variables that can have different “states”:
• JohnCalls(J), MaryCalls(M), AlarmOn(A), BurglarInHouse(B),
EarthQuakeHappening(E): 𝐽, 𝑀, 𝐴, 𝐵, 𝐸 ∈ {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒}
How to construct a Directed Graphical Model
1) Choose an ordering of the variables: 𝑥ₙ, … , 𝑥₁
2) For 𝑖 = 𝑛 … 1:
   1) Add node 𝑥ᵢ to the network (model)
   2) Select, as parents, those variables already in the network that influence 𝑥ᵢ (i.e. that 𝑥ᵢ depends on):
      𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ) = 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M)
Joint (so far): 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J)
Joint (so far): 𝑝(𝑀)
𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call,
because of the alarm which may be on.
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J)
𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call because of the alarm which may be on.
Joint (so far): 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A)
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls there is a higher chance that the alarm is on.
Joint (so far): 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A)
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls there is a higher chance that the alarm is on.
Joint (so far): 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵)?
No. The chance that there is a burglar in the house is not independent of the alarm being on or John's calls or Mary's calls.
Joint (so far): 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵|𝐴)?
Yes. If we know that the alarm is on then the information that Mary or John called is not relevant anymore!
Joint (so far): 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸)?
No. If Mary calls or John calls or the alarm is on then the chances of an earthquake are higher.
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸|𝐴, 𝐵)?
Yes. If we know that the alarm is on then the information that Mary or John called is not relevant. The link from B to E is present because it adds additional information: assume the alarm is on. Then the probability for E=1 depends on the burglar. If the burglar is in the house then this could have caused the alarm being on, and hence 𝑝(𝐸 = 1) is lower.
Joint (so far): 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step. Final joint:
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example: Optimal Order
• Was that the best order to get a directed graphical model with as few links as possible?
• The optimal order is: 𝐵, 𝐸, 𝐴, 𝑀, 𝐽
(Graph: BurglarInHouse(B) and EarthQuakeHappening(E) point to AlarmOn(A); A points to MaryCalls(M) and JohnCalls(J).)
How to find that order? Temporal order, i.e. how things happened!
Joint: 𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
Running Example: Worst case scenario
• The worst possible ordering: 𝑀, 𝐽, 𝐸, 𝐵, 𝐴
(Graph over MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E))
This is a fully connected graph!
Running Example: Full Specification of Model
• Specify probabilities:

𝑷(𝑩=𝟏)   𝑷(𝑩=𝟎)
 0.001     0.999

𝑷(𝑬=𝟏)   𝑷(𝑬=𝟎)
 0.002     0.998

 𝑩      𝑬      𝑷(𝑨=𝟏|𝑩,𝑬)   𝑷(𝑨=𝟎|𝑩,𝑬)
True   True       0.95          0.05
True   False      0.94          0.06
False  True       0.29          0.71
False  False      0.001         0.999

 𝑨      𝑷(𝑴=𝟏|𝑨)   𝑷(𝑴=𝟎|𝑨)
True       0.7          0.3

 𝑨      𝑷(𝑱=𝟏|𝑨)   𝑷(𝑱=𝟎|𝑨)
True       0.9          0.1

(Graph: 𝐵 and 𝐸 point to 𝐴; 𝐴 points to 𝑀 and 𝐽.)
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
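The fully specified network can be coded directly from the tables above. A minimal sketch (variable names are ours; the CPT numbers are the ones from the slide — since the slide only gives 𝑃(𝑀|𝐴) and 𝑃(𝐽|𝐴) for 𝐴 = true, the sketch only evaluates states with 𝐴 = 1):

```python
# CPTs of the alarm network as specified on the slide.
p_B = {1: 0.001, 0: 0.999}
p_E = {1: 0.002, 0: 0.998}
p_A1_given_BE = {(1, 1): 0.95, (1, 0): 0.94,
                 (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
p_M1_given_A1 = 0.7   # P(M=1 | A=1)
p_J1_given_A1 = 0.9   # P(J=1 | A=1)

def joint(J, M, A, B, E):
    """p(M,J,A,B,E) = p(J|A) p(M|A) p(A|B,E) p(E) p(B), for A = 1."""
    assert A == 1, "the slide only specifies P(M|A) and P(J|A) for A = true"
    pJ = p_J1_given_A1 if J == 1 else 1 - p_J1_given_A1
    pM = p_M1_given_A1 if M == 1 else 1 - p_M1_given_A1
    return pJ * pM * p_A1_given_BE[(B, E)] * p_E[E] * p_B[B]

p = joint(J=1, M=1, A=1, B=0, E=0)
print(round(p, 6))  # the slide's first query below: ≈ 0.000628
```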
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
What actions can we do?
Make queries:
• Compute the probability of a model state, e.g. 𝑝(𝑥₁ = 1, 𝑥₂ = 0, 𝑥₃ = 1)
• Compute the marginal of a single variable: 𝑝(𝑥ᵢ)
• Compute the marginal of two variables: 𝑝(𝑥ᵢ, 𝑥ⱼ)
• Compute a conditional probability: 𝑝(𝑥ᵢ|𝑥ₖ = 0)
• Derive the most likely state: (𝑥₁*, 𝑥₂*, 𝑥₃*) = argmax 𝑝(𝑥₁, 𝑥₂, 𝑥₃)
Make decisions:
• Add a loss function to the network and then ask questions about the outcome
Other tasks:
• Give extra evidence to the model
• Value of information: What evidence should we give to the model in order to make all marginals as unique (peaked) as possible?
• Sensitivity analysis: How does the network "behave" if I give evidence for 𝑥₁, 𝑥₂, 𝑥₃?
Running Example: Queries
• What is the probability that John calls and Mary calls and the alarm is on, but there is no burglar and no earthquake?
𝑝(𝐽=1, 𝑀=1, 𝐴=1, 𝐵=0, 𝐸=0)
= 𝑝(𝐵=0) 𝑝(𝐸=0) 𝑝(𝐴=1|𝐵=0, 𝐸=0) 𝑝(𝐽=1|𝐴=1) 𝑝(𝑀=1|𝐴=1)
= 0.999 ∗ 0.998 ∗ 0.001 ∗ 0.7 ∗ 0.9 = 0.000628
• What is the probability that John calls and Mary calls and the alarm is on?
𝑝(𝐽=1, 𝑀=1, 𝐴=1) = Σ_{𝐵,𝐸} 𝑝(𝐽=1, 𝑀=1, 𝐴=1, 𝐵, 𝐸)
= 𝑝(𝐽=1|𝐴=1) 𝑝(𝑀=1|𝐴=1) Σ_{𝐵,𝐸} 𝑝(𝐵) 𝑝(𝐸) 𝑝(𝐴=1|𝐵, 𝐸)
= 0.7 ∗ 0.9 ∗ (0.999 ∗ 0.998 ∗ 0.001 + 0.999 ∗ 0.002 ∗ 0.29 + 0.001 ∗ 0.998 ∗ 0.94 + 0.001 ∗ 0.002 ∗ 0.95) ≈ 0.00159
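The second query, summing out 𝐵 and 𝐸, can be checked by brute-force enumeration with the CPT numbers from the slides:

```python
# p(J=1, M=1, A=1) = p(J=1|A=1) p(M=1|A=1) * sum_{B,E} p(B) p(E) p(A=1|B,E)
p_B = {1: 0.001, 0: 0.999}
p_E = {1: 0.002, 0: 0.998}
p_A1_given_BE = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

inner = sum(p_B[b] * p_E[e] * p_A1_given_BE[(b, e)]
            for b in (0, 1) for e in (0, 1))   # this is also p(A=1)
p_query = 0.9 * 0.7 * inner
print(round(inner, 6), round(p_query, 5))
```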
Running Example: Queries (done in exercise)
• What is the marginal probability that John calls: 𝑝(𝐽 = 1)?
• What is the conditional probability that John calls given that Mary calls:
𝑝(𝐽 = 1|𝑀 = 1)?
• What is the most likely state in the network?
(𝐽*, 𝑀*, 𝐴*, 𝐵*, 𝐸*) = argmax_{𝐽,𝑀,𝐴,𝐵,𝐸} 𝑝(𝐽, 𝑀, 𝐴, 𝐵, 𝐸)
How to do this in the best way?
Compute this also for different conditionals, e.g. 𝑝(𝐴, 𝐵, 𝐸 | 𝐽 = 1, 𝑀 = 1)
• Make decisions (example):
The decision is to call the police (𝑑 = 1) or not to call the police (𝑑 = 0).
Loss function: 𝐶(𝑑 = 0, 𝐵 = 1) = 10; 𝐶(𝑑 = 0, 𝐵 = 0) = 0;
𝐶(𝑑 = 1, 𝐵 = 1) = 0; 𝐶(𝑑 = 1, 𝐵 = 0) = 2
Bayesian risk minimization: 𝑅(𝑑) = Σ_{𝐽,𝑀,𝐴,𝐵,𝐸} 𝑝(𝐽, 𝑀, 𝐴, 𝐵, 𝐸) 𝐶(𝑑, 𝐵) → min over 𝑑
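Since the loss above depends only on 𝐵, the risk reduces to 𝑅(𝑑) = Σ_𝐵 𝑝(𝐵) 𝐶(𝑑, 𝐵). A minimal sketch, using the prior 𝑝(𝐵 = 1) = 0.001 from the slides (i.e. before any evidence such as 𝐽 = 1, 𝑀 = 1 is added; with evidence one would use the posterior over 𝐵 instead):

```python
# Bayesian risk minimization for the call-the-police decision.
p_B = {1: 0.001, 0: 0.999}                       # prior over burglar
C = {(0, 1): 10, (0, 0): 0, (1, 1): 0, (1, 0): 2}  # C[(d, B)]

def risk(d):
    # R(d) = sum_B p(B) C(d, B), since the loss only depends on B
    return sum(p_B[b] * C[(d, b)] for b in (0, 1))

best = min((0, 1), key=risk)
print(risk(0), risk(1), best)  # with the prior alone, not calling is cheaper
```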
Queries in Network: Definition
• We call the query (𝑥₁*, 𝑥₂*, 𝑥₃*) = argmax 𝑝(𝑥₁, 𝑥₂, 𝑥₃)
"Normal" Inference (Prediction) or MAP inference
• We call the query 𝑝(𝑥₁), 𝑝(𝑥₂), 𝑝(𝑥₃)
Probabilistic Inference
In the following we look at two different strategies:
• Exact computation via enumeration / message passing
• Approximate computation via sampling
(Probabilistic) inference by enumeration
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
Brute force - just add up all:
𝑝(𝑥₂) = Σ_{𝑥₁,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = Σ_{𝑥₁,𝑥₃} 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
Number of operations: 2² = 4
Number of accesses to a probability table: 2² ∗ 3 = 12
Brute force - just maximize over all:
𝑥₂* = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
(Probabilistic) inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
𝑝(𝑥₁) = Σ_{𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = Σ_{𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
     = Σ_{𝑥₂} 𝑝(𝑥₁|𝑥₂) [Σ_{𝑥₃} 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)]
The bracketed term is a function ("message" 𝑀(𝑥₂)) that depends on 𝑥₂ only. This is also the marginal 𝑝(𝑥₂).
Number of operations: 2 + 2 = 4
Number of accesses to a probability table: 2 ∗ 2 + 2 ∗ 2 = 8
(Probabilistic) inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
𝑥₁* = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
     = argmax_{𝑥₁,𝑥₂} 𝑝(𝑥₁|𝑥₂) [max_{𝑥₃} 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)]
The bracketed term is a function ("message" 𝑀(𝑥₂)) that depends on 𝑥₂ only.
Number of operations: 2 + 2² = 6
Algorithm for marginal computation ("Sum-Product Message Passing"):
1. Compute messages from right to left
2. Read out all marginals
Algorithm for optimal state computation ("Max-Product Message Passing"):
1. Compute messages from right to left
2. Backtrace the correct solution
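The message-passing trick above can be sketched for the chain x3 → x2 → x1; the CPT numbers below are illustrative, any valid tables work:

```python
# Sum-product message passing on the chain x3 -> x2 -> x1.
p_x3 = [0.6, 0.4]                      # p(x3)
p_x2_x3 = [[0.9, 0.3], [0.1, 0.7]]    # p_x2_x3[x2][x3] = p(x2|x3)
p_x1_x2 = [[0.5, 0.2], [0.5, 0.8]]    # p_x1_x2[x1][x2] = p(x1|x2)

# Message from x3 to x2: M(x2) = sum_x3 p(x2|x3) p(x3) -- this is also p(x2)
M = [sum(p_x2_x3[x2][x3] * p_x3[x3] for x3 in (0, 1)) for x2 in (0, 1)]
# Marginal: p(x1) = sum_x2 p(x1|x2) M(x2)
p_x1 = [sum(p_x1_x2[x1][x2] * M[x2] for x2 in (0, 1)) for x1 in (0, 1)]

# Brute-force enumeration over all 2^3 states gives the same marginal,
# but touches the tables more often -- that is the point of messages.
brute = [sum(p_x1_x2[x1][x2] * p_x2_x3[x2][x3] * p_x3[x3]
             for x2 in (0, 1) for x3 in (0, 1)) for x1 in (0, 1)]
print(M, p_x1, brute)
```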
(Probabilistic) inference in Chains
(Chain 𝑥₁ - 𝑥₂ - 𝑥₃ - 𝑥₄; done in exercise)
Comments:
• This algorithm can be extended to arbitrary graphs (Junction Tree Method)
• In undirected graphical models we run similar algorithms
Reminder: HMM from lecture 3 (Michael Schröder)
• The Viterbi Algorithm gives the optimal state 𝑠
• The Viterbi Algorithm is a different name for Max-Product Message Passing
(Probabilistic) inference by sampling
Goal: create one valid sample from the distribution
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
where 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ) is a subset of the variables in {𝑥ᵢ₊₁, … , 𝑥ₙ}
Example: 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
This procedure is called ancestor sampling or prior sampling:
For 𝑖 = 𝑛 … 1:
  Sample 𝑥ᵢ ∼ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Output (𝑥₁, … , 𝑥ₙ)
(Probabilistic) inference by sampling - Example
Example: 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
Procedure:
For 𝑖 = 𝑛 … 1:
  Sample 𝑥ᵢ ∼ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Output (𝑥₁, … , 𝑥ₙ)
(Figure, Steps 1-3: on the graph over 𝑥₁, 𝑥₂, 𝑥₃, first sample 𝑥₃, then 𝑥₂, then 𝑥₁.)
How to sample a single variable
How do we sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0, 1, … , 𝑛}?
1. Define "intervals" whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the composed interval
4. Check which interval the sampled value falls into
Example: 𝑝(𝑥) ∝ {1, 2, 3} (three values), i.e. intervals of lengths 1, 2 and 3.
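The four steps above can be sketched directly, using the example weights 𝑝(𝑥) ∝ {1, 2, 3} (function name and structure are ours):

```python
import random

def sample_discrete(weights, rng=random):
    """Sample an index with probability proportional to weights[i]."""
    total = sum(weights)             # steps 1-2: total length of the intervals
    u = rng.uniform(0.0, total)      # step 3: uniform sample into the interval
    acc = 0.0
    for i, w in enumerate(weights):  # step 4: find the containing sub-interval
        acc += w
        if u <= acc:
            return i
    return len(weights) - 1          # guard against float round-off

# Empirical check: frequencies should approach 1/6, 2/6, 3/6.
counts = [0, 0, 0]
rng = random.Random(0)
for _ in range(60000):
    counts[sample_discrete([1, 2, 3], rng)] += 1
print([c / 60000 for c in counts])
```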
(Probabilistic) inference by sampling
• By running ancestor sampling often enough we get (in the limit) the true joint distribution out (law of large numbers)
Procedure:
1) Let 𝑁(𝑥₁, … , 𝑥ₙ) denote the number of times we have seen a certain sample (𝑥₁, … , 𝑥ₙ)
2) 𝑃′(𝑥₁, … , 𝑥ₙ) = 𝑁(𝑥₁, … , 𝑥ₙ) / 𝑁, where 𝑁 is the total number of samples
• It holds: 𝑃(𝑥₁, … , 𝑥ₙ) = 𝑃′(𝑥₁, … , 𝑥ₙ) when 𝑁 → ∞
• The estimated joint distribution can now be used to compute local marginals:
𝑃(𝑥ᵢ = 𝑘) = Σ_{𝒙: 𝑥ᵢ=𝑘} 𝑃(𝑥₁, … , 𝑥ᵢ = 𝑘, … , 𝑥ₙ)
• It can also be used to compute the optimal state of 𝑃(𝑥₁, … , 𝑥ₙ)
• In practice this procedure only works well for smaller graphs, since the number of joint states (and hence the number of samples needed) grows exponentially with the number of variables
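Ancestor sampling plus frequency counting can be sketched on a small chain x3 → x2 → x1; the CPT numbers below are illustrative:

```python
import random

p_x3 = [0.6, 0.4]                    # p(x3)
p_x2_x3 = [[0.9, 0.3], [0.1, 0.7]]   # p(x2|x3), indexed [x2][x3]
p_x1_x2 = [[0.5, 0.2], [0.5, 0.8]]   # p(x1|x2), indexed [x1][x2]

def bernoulli(p1, rng):
    """Return 1 with probability p1."""
    return 1 if rng.random() < p1 else 0

def ancestor_sample(rng):
    # Sample parents before children, following the factorization.
    x3 = bernoulli(p_x3[1], rng)
    x2 = bernoulli(p_x2_x3[1][x3], rng)
    x1 = bernoulli(p_x1_x2[1][x2], rng)
    return x1, x2, x3

# Estimate the marginal p(x2=1) from sample frequencies.
rng = random.Random(0)
N = 100000
count_x2 = sum(ancestor_sample(rng)[1] for _ in range(N))
print(count_x2 / N)  # should approach p(x2=1) = 0.1*0.6 + 0.7*0.4 = 0.34
```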
Example: ancestor sampling
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Add Evidence to the Model
• New evidence: We observe that the alarm is on
• Question: How do we update the network?
• Idea 1:
Change 𝑝(𝐴|𝐽, 𝑀) in such a way that 𝑝(𝐴 = 1|𝐽, 𝑀) = 1 and 𝑝(𝐴 = 0|𝐽, 𝑀) = 0 for all 𝐽, 𝑀.
But this means that 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴), and the joint becomes
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴) 𝑝(𝐽|𝑀) 𝑝(𝑀)
• Problem: It decouples 𝑀, 𝐽 from the rest of the network, since the links to 𝐴 are removed.
(Graph: MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E) with the factors 𝑝(𝑀), 𝑝(𝐽|𝑀), 𝑝(𝐸|𝐴, 𝐵), 𝑝(𝐵|𝐴))
Add Evidence to the Model
• New evidence: We observe that the alarm is on
• Question: How do we update the network?
• Idea 2:
Compute the new conditional:
𝑝(𝑀, 𝐽, 𝐵, 𝐸 | 𝐴 = 1) = 𝑝(𝐸|𝐴 = 1, 𝐵) 𝑝(𝐵|𝐴 = 1) 𝑝(𝐴 = 1|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀) / 𝑝(𝐴 = 1)
• Advantage: all links in the network are kept, and this change does affect the rest of the network, e.g. the marginal 𝑝(𝑀)
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
Marginals: 𝑝(𝑠 = 0) = 0.5; 𝑝(𝑠 = 1) = 0.5
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.85
𝑝(𝑐 = 1) = 0.15
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

We know that the person smokes, i.e. 𝑠 = 1. What is the probability that he has cancer?
Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
New model: fix this variable (𝑠 = 1); 𝑝(𝑐|𝑠 = 1) can be directly taken from the model.
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐), with the evidence 𝑠 = 1 fixed:

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
   1          0

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False   irrelevant    irrelevant

We know that the person smokes, i.e. 𝑠 = 1. What is the probability that he has cancer?
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

We know that the person has cancer, 𝑐 = 1. What is the probability that he smokes? Fix this variable (𝑐 = 1):
𝑝(𝑠 = 0|𝑐 = 1) = 𝑝(𝑠 = 0, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.05/0.15 = 1/3
𝑝(𝑠 = 1|𝑐 = 1) = 𝑝(𝑠 = 1, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.1/0.15 = 2/3
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐), with the evidence 𝑐 = 1 folded in:

𝑷(𝒔 = 𝟏|𝒄 = 𝟏)   𝑷(𝒔 = 𝟎|𝒄 = 𝟏)
     2/3               1/3

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True        1             0
False       1             0

(The prior over 𝑠 is replaced by the posterior 𝑝(𝑠|𝑐 = 1), and 𝑝(𝑐 = 1|𝑠) is set to 1 since 𝑐 is observed.)
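The posterior in this example follows from Bayes' rule with the sum rule in the denominator; a minimal check using the slide's numbers:

```python
# Verifying the smoker/cancer posterior p(s | c=1) with Bayes' rule.
p_s = {1: 0.5, 0: 0.5}
p_c1_given_s = {1: 0.2, 0: 0.1}   # P(c=1 | s)

# p(c=1) via the sum rule, then p(s | c=1) via Bayes' rule
p_c1 = sum(p_c1_given_s[s] * p_s[s] for s in (0, 1))
posterior = {s: p_c1_given_s[s] * p_s[s] / p_c1 for s in (0, 1)}
print(p_c1, posterior)  # p(c=1) ≈ 0.15, posterior ≈ {0: 1/3, 1: 2/3}
```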