Intelligent Systems:
Probabilistic Inference in Undirected and Directed Graphical models
Carsten Rother
Reminder: Random Forests
1) Get a 𝑑-dimensional feature, depending on some parameters 𝜃.
2-dimensional example:
We want to classify the white pixel. Feature: color (e.g. red channel) of the green pixel and color (e.g. red channel) of the red pixel.
Parameters: 4D (2 offset vectors). We will visualize it in this way:
Reminder: Random Forests
(Figure: input depth image → body part labelling (label each pixel) → clustering → body joint hypotheses; shown in front, side and top views)
Reminder: Decision Tree – Train Time
Input: all training points; each point has a class label. Input data in feature space:
The set of all labelled (training data) points, here 35 red and 23 blue.
Split the training set at each node
Measure 𝑝(𝑐) at each leaf; it could be 3 red and 1 blue,
i.e. 𝑝(𝑟𝑒𝑑) = 0.75; 𝑝(𝑏𝑙𝑢𝑒) = 0.25 (remember, the feature space
is also optimized with 𝜃)
Random Forests – Training of features (illustration)
What does it mean to optimize over 𝜃?
• For each pixel the same feature test (at one split node) will be done.
• One has to define what happens with feature tests that reach outside the image.
Goal during training: separate red pixels (class 1) from blue pixels (class 2)
Feature:
Value 𝑥1: the value of the green color channel (could also be red or blue) if you look 𝜃1 pixels right and 𝜃2 pixels up
Value 𝑥2: the value of the green color channel (could also be red or blue) if you look 𝜃3 pixels right and 𝜃4 pixels down
Goal: find a 𝜃 that best separates the data
𝑝𝑜𝑠 + (𝜃1, 𝜃2)
𝑝𝑜𝑠 + (𝜃3, 𝜃4)
Image Labeling (2 classes, red and blue)
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Slide credits
• J. Fürnkranz, TU Darmstadt
• Dimitri Schlesinger
Reminder: Graphical models to capture structured problems
Write probability distribution as a Graphical model:
• Directed graphical model (also called Bayesian Networks)
• Undirected graphical model (also called Markov Random Field)
• Factor graphs (which we will use predominantly)
• A visualization to represent a family of distributions
• Key concept is conditional independence
• You can also convert between the representations
References:
- Pattern Recognition and Machine Learning [Bishop ‘08, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)
When to use what model?
• Undirected graphical model (also called Markov Random Field)
• The individual unknown variables have all the same “meaning”
• Factor graphs (which we will use predominantly)
• Same as an undirected graphical model but represents the distribution in more detail
• Directed graphical model (also called Bayesian Networks)
• The unknown variables have different “meanings”
Reminder: What to infer?
• MAP inference (Maximum a posterior state):
𝑥∗ = argmax_𝑥 𝑃(𝑥) = argmin_𝑥 𝐸(𝑥)
• Probabilistic Inference, so-called marginal:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙 | 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
This can be used to make a maximum marginal decision:
𝑥𝑖∗ = argmax_{𝑥𝑖} 𝑃(𝑥𝑖)
Reminder: MAP versus Marginal - visually
Input Image Ground Truth Labeling
MAP solution 𝒙∗ (each pixel has 0,1 label)
Marginals 𝑃 𝑥𝑖
(each pixel has a probability between 0 and 1)
Reminder: MAP versus Marginal – Making Decisions
Which solution 𝑥∗ would you choose?
Space of all solutions x (sorted by pixel difference) 𝑃(𝑥|𝑧)
Reminder: How to make a decision
Question: What solution 𝑥∗ should we give out?
Answer: Choose 𝑥∗ which minimizes the Bayesian risk:
𝑥∗ = argmin_{𝑥′} ∑_𝑥 𝑃(𝑥|𝑧) 𝐶(𝑥, 𝑥′)   (assume the model 𝑃(𝑥|𝑧) is known)
𝐶(𝑥1, 𝑥2) is called the loss function (or cost function) of comparing two results 𝑥1, 𝑥2
Reminder: MAP versus Marginals – Making Decisions
Space of all solutions x (sorted by pixel difference) 𝑃(𝑥|𝑧)
“guessed” – max marginal MAP solution
MAP: global 0-1 loss: 𝐶(𝑥, 𝑥∗) = 0 if 𝑥 = 𝑥∗, otherwise 1
Max-marginal: pixel-wise 0-1 loss: 𝐶(𝑥, 𝑥∗) = ∑𝑖 |𝑥𝑖 − 𝑥𝑖∗| (Hamming loss)
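A tiny numeric illustration of how the two losses can pick different solutions. The two-pixel distribution below is made up for this sketch, not taken from the lecture:

```python
# joint distribution over two binary pixels (x1, x2); numbers are made up
P = {(0, 0): 0.4, (1, 0): 0.3, (1, 1): 0.3}   # all other states have probability 0

# MAP: most probable joint state (optimal under the global 0-1 loss)
x_map = max(P, key=P.get)

# max-marginal: decide each pixel separately (optimal under the Hamming loss)
def marginal(i, k):
    return sum(p for x, p in P.items() if x[i] == k)

x_mm = tuple(max((0, 1), key=lambda k: marginal(i, k)) for i in range(2))
# x_map == (0, 0) but x_mm == (1, 0): P(x1=1) = 0.6 and P(x2=0) = 0.7
```

So the globally most probable state and the pixel-wise decisions can disagree, which is exactly why the choice of loss matters.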
Reminder: What to infer?
• MAP inference (Maximum a posterior state):
𝑥∗ = argmax_𝑥 𝑃(𝑥) = argmin_𝑥 𝐸(𝑥)
• Probabilistic Inference, so-called marginals:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙 | 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
This can be used to make a maximum marginal decision:
𝑥𝑖∗ = argmax_{𝑥𝑖} 𝑃(𝑥𝑖)
• So far we only did MAP estimation in factor graphs / undirected graphical models
• Today we do probabilistic inference in directed graphical models and in factor graphs / undirected graphical models
• Comment: MAP inference in directed graphical models can be done, but is rarely done
Directed Graphical Model: Intro
• All variables that are linked by an arrow have a causal, directed relationship
• Since the variables have a meaning it makes sense to look at conditional independence (in factor graphs / undirected models this does not give any insights)
(Example nodes: Car is broken; Eat sweets; Toothache; Hole in tooth)
• Comment: the nodes do not have to be discrete variables, they can also be continuous, as long as the distributions are well defined
Directed Graphical Models: Intro
• Rewrite a distribution using product rule:
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1, 𝑥2 | 𝑥3) 𝑝(𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)
• Any joint distribution can be written as product of conditionals:
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑥𝑖+1, …, 𝑥𝑛)
• Visualize the conditional 𝑝(𝑥𝑖|𝑥𝑖+1, … 𝑥𝑛) as follows:
𝑥𝑖
𝑥𝑖+1 … 𝑥𝑛
Directed Graphical Models: Examples
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)   (fully connected graph)
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)   (chain 𝑥3 → 𝑥2 → 𝑥1)
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3)   (𝑥2 and 𝑥3 are independent parents of 𝑥1)
As with undirected graphical models the absence of links is the interesting aspect!
Definition: Directed Graphical Model
• Given a directed graph 𝐺 = (𝑉, 𝐸), where 𝑉 is the set of nodes and 𝐸 the set of directed edges
• A directed Graphical Model defines the family of distributions:
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
• The set 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖) is a subset of the variables in {𝑥𝑖+1, …, 𝑥𝑛} and is defined via the graph; 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) is visualized as a node 𝑥𝑖 with incoming arrows from its parents
• Comment: compared to factor graphs / undirected graphical models there is no partition function 𝑓:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
Running Example
• Situation
• I am at work
• John calls to say that the alarm in my house went off
• Mary, who is my neighbor, did not call me
• The alarm is usually set off by burglars, but sometimes also by a minor earthquake
• Can we construct the directed graphical model?
• Step 1: Identify variables that can have different “state”:
• JohnCalls(J), MaryCalls(M), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
Step 2: Method to construct the Directed Graphical Model
1) Choose an ordering of the variables: 𝑥𝑛, …, 𝑥1
2) For 𝑖 = 𝑛 … 1
A. Add node 𝑥𝑖 to the network
B. Select the parents such that:
𝑝(𝑥𝑖 | 𝑥𝑖+1, …, 𝑥𝑛) = 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M) 𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J): 𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call
𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J): 𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call
𝑝(𝐽|𝑀) 𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A): 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)? 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls, the probability that the alarm is on is higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A): 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)? 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls, the probability that the alarm is on is higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵)?
No. The chance that there is a burglar in the house depends on AlarmOn, JohnCalls and MaryCalls.
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
BurglarInHouse(B)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵|𝐴)?
Yes. If we know that the alarm has gone off then the information that Mary or John called is not relevant!
BurglarInHouse(B) 𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
𝑝(𝐵|𝐴)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸)?
No. If Mary calls or John calls or the alarm is on then the chances of an earthquake are higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
earthquakeHappening(E) Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
BurglarInHouse(B) 𝑝(𝐵|𝐴)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸|𝐴, 𝐵)?
Yes. If we know that the alarm has gone off then the information that Mary or John called is not relevant.
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
earthquakeHappening(E) 𝑝(𝐸|𝐴, 𝐵)
BurglarInHouse(B)
Joint (final):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
𝑝(𝐵|𝐴)
Example: Optimal Order
• Was that the best order to get a directed Graphical Model with as few links as possible?
• The optimal order is:
MaryCalls(M) JohnCalls(J)
AlarmOn(A)
BurglarInHouse(B) earthquakeHappening(E)
How to find that order?
Temporal order in which things happened!
Joint:
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
Example: Optimal Order
• Associate probabilities:
MaryCalls(M)
JohnCalls(J) AlarmOn(A)
BurglarInHouse(B) earthquakeHappening(E)
𝑷(𝑩 = 𝟏) = 0.001, 𝑷(𝑩 = 𝟎) = 0.999
𝑷(𝑬 = 𝟏) = 0.002, 𝑷(𝑬 = 𝟎) = 0.998

𝐵      𝐸      𝑷(𝑨 = 𝟏|𝑩, 𝑬)   𝑷(𝑨 = 𝟎|𝑩, 𝑬)
True   True   0.95             0.05
True   False  0.94             0.06
False  True   0.29             0.71
False  False  0.001            0.999

𝐴      𝑷(𝑴 = 𝟏|𝑨)   𝑷(𝑴 = 𝟎|𝑨)
True   0.7           0.3
False  0.01          0.99

𝐴      𝑷(𝑱 = 𝟏|𝑨)   𝑷(𝑱 = 𝟎|𝑨)
True   0.9           0.1
False  0.05          0.95
Running Example
• What is the probability that John calls and Mary calls and the alarm is on, but there is no burglar and no earthquake?
𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1, 𝐵 = 0, 𝐸 = 0)
= 𝑝(𝐵 = 0) 𝑝(𝐸 = 0) 𝑝(𝐴 = 1|𝐵 = 0, 𝐸 = 0) 𝑝(𝐽 = 1|𝐴 = 1) 𝑝(𝑀 = 1|𝐴 = 1)
= 0.999 ∗ 0.998 ∗ 0.001 ∗ 0.9 ∗ 0.7 = 0.000628
• What is the probability that John calls and Mary calls and the alarm is on?
𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1) = ∑_{𝐵,𝐸} 𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1, 𝐵, 𝐸)
= ∑_{𝐵,𝐸} 𝑝(𝐵) 𝑝(𝐸) 𝑝(𝐴 = 1|𝐵, 𝐸) 𝑝(𝐽 = 1|𝐴 = 1) 𝑝(𝑀 = 1|𝐴 = 1)
= 0.9 ∗ 0.7 ∗ (0.999 ∗ 0.998 ∗ 0.001 + 0.999 ∗ 0.002 ∗ 0.29 + 0.001 ∗ 0.998 ∗ 0.94 + 0.001 ∗ 0.002 ∗ 0.95) = 0.00159
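Both numbers are easy to check by brute-force enumeration of the joint; a minimal sketch in Python (variable and function names are mine, the CPT values are the ones from the tables above):

```python
# CPTs of the alarm network from the tables above
pB = {1: 0.001, 0: 0.999}
pE = {1: 0.002, 0: 0.998}
pA1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
pJ1 = {1: 0.9, 0: 0.05}                                          # P(J=1 | A)
pM1 = {1: 0.7, 0: 0.01}                                          # P(M=1 | A)

def joint(j, m, a, b, e):
    # product of the conditionals along the directed model
    pa = pA1[(b, e)] if a else 1 - pA1[(b, e)]
    pj = pJ1[a] if j else 1 - pJ1[a]
    pm = pM1[a] if m else 1 - pM1[a]
    return pB[b] * pE[e] * pa * pj * pm

p_all = joint(1, 1, 1, 0, 0)                                       # ≈ 0.000628
p_jma = sum(joint(1, 1, 1, b, e) for b in (0, 1) for e in (0, 1))  # ≈ 0.00159
```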
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9

Marginal: 𝑝(𝑠 = 0) = 0.5; 𝑝(𝑠 = 1) = 0.5
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.85; 𝑝(𝑐 = 1) = 0.15
Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9
We know that the person has cancer; what is the probability that he smokes?
Wrong solution: just update the table for 𝑝(𝑐|𝑠) but leave 𝑝(𝑠) unchanged. We want to change 𝑝(𝑐) and not 𝑝(𝑐|𝑠)!
Correct solution: recompute all probabilities under the condition 𝑐 = 1:
𝑝′(𝑠 = 0|𝑐 = 1) = 𝑝(𝑠 = 0, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.05/0.15 = 1/3
𝑝′(𝑠 = 1|𝑐 = 1) = 𝑝(𝑠 = 1, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.1/0.15 = 2/3
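The Bayes-rule computation above can be sketched in a few lines (variable names are mine; the numbers come from the slide's tables):

```python
# prior and CPT from the tables above
p_s = {0: 0.5, 1: 0.5}            # P(s)
p_c1_given_s = {0: 0.1, 1: 0.2}   # P(c = 1 | s)

# evidence c = 1: recompute P(s | c = 1) with Bayes' rule
p_c1 = sum(p_c1_given_s[s] * p_s[s] for s in (0, 1))        # P(c = 1) = 0.15
post = {s: p_c1_given_s[s] * p_s[s] / p_c1 for s in (0, 1)}
# post[0] = 1/3 and post[1] = 2/3, matching the slide
```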
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 2/3, 𝑷(𝒔 = 𝟎) = 1/3

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   1             0
False  1             0

We know that the person has cancer; what is the probability that he smokes?
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9

We know that the person smokes; what is the probability that he has cancer?
Two solutions (same outcome):
1) Just update 𝑝(𝑠) to 𝑝(𝑠 = 0) = 0, 𝑝(𝑠 = 1) = 1, and then you get a new 𝑝(𝑠, 𝑐) = 𝑝(𝑐|𝑠) 𝑝(𝑠). Now the marginal:
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0, 𝑠 = 0) + 𝑝(𝑐 = 0, 𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.8
𝑝(𝑐 = 1) = 𝑝(𝑐 = 1, 𝑠 = 0) + 𝑝(𝑐 = 1, 𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) = 0.2
2) Update 𝑝(𝑐) under the condition that 𝑠 = 1:
𝑝′(𝑐 = 0) = 𝑝(𝑐 = 0, 𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 1) = 0.8
𝑝′(𝑐 = 1) = 𝑝(𝑐 = 1, 𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) = 0.2
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 1, 𝑷(𝒔 = 𝟎) = 0

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  irrelevant    irrelevant

We know that the person smokes; what is the probability that he has cancer?
Fixing Marginal – another example
Smoker(s)
Cancer(c)
Healthy (h) environment
Joint: 𝑝 𝑐, 𝑠, ℎ = 𝑝 𝑐|𝑠, ℎ 𝑝 𝑠 𝑝(ℎ)
Fixing Marginal – another example
Smoker(s)
Cancer(c)
Healthy (h) environment
Joint: 𝑝 𝑐, 𝑠, ℎ = 𝑝 𝑐|𝑠, ℎ 𝑝 𝑠 𝑝(ℎ)
𝑝(𝑐) changes; 𝑝(ℎ) does not change:
𝑝′(𝑐, ℎ|𝑠) = 𝑝(𝑐, 𝑠, ℎ) / 𝑝(𝑠) = 𝑝(𝑐|𝑠, ℎ) 𝑝(𝑠) 𝑝(ℎ) / 𝑝(𝑠) = 𝑝(𝑐|𝑠, ℎ) 𝑝(ℎ)
Examples: Medical Diagnosis
Local Marginal for each node (sometimes also called belief)
Car Diagnosis
Examples: Car Insurance
Example: Pigs Network
(“Stammbaum”, i.e. pedigree)
Reminder: Image segmentation
Image z
Joint Probability: 𝑃(𝒛, 𝒙) = 𝑃(𝒛|𝒙) 𝑃(𝒙)
Labeling x
Samples: True image:
Most likely:
Reminder: lecture on Computer Vision
Scene type Scene geometry Object classes Object position Object orientation Object shape Depth/occlusions Object appearance Illumination Shadows Motion blur Camera effects
Generative model for images
Image
Material Light Background
3D Objects
Light coming into camera
Hybrid networks
• Directed graphical models (as well as undirected ones) can contain discrete and continuous variables (as long as they define correct distributions)
Image (discrete) Material
(discrete)
Light (continuous)
Background (discrete) 3D Objects
(continuous)
Light coming into camera (continuous)
Real World Applications of Bayesian Networks
Real World Applications of Bayesian Networks
What questions can we ask?
• Marginal of a single variable 𝑝(𝑥𝑖)
(this is what we call probabilistic inference; the most common query)
• Marginal of two variables: 𝑝(𝑥𝑖, 𝑥𝑗)
• Conditional queries: 𝑝(𝑥𝑖 = 0, 𝑥𝑗 = 1 | 𝑥𝑘 = 0), 𝑝(𝑥𝑖, 𝑥𝑗 | 𝑥𝑘 = 0)
• Optimal decisions:
• Add a utility function to the network and then ask questions about the outcome
• Value of information:
• Which node to fix to make all marginals as unambiguous as possible
• Sensitivity Analysis:
• How does the network behave if I change one marginal?
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Probabilistic Inference in DGM
We consider two possible principles to compute marginals (probabilistic inference):
• Inference by enumeration
• Inference by sampling
Probabilistic inference by enumeration
Brute force: just add up everything.
Model: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
𝑝(𝑥2) = ∑_{𝑥1,𝑥3} 𝑝(𝑥1, 𝑥2, 𝑥3) = ∑_{𝑥1,𝑥3} 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Probabilistic inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
𝑝(𝑥1) = ∑_{𝑥2,𝑥3} 𝑝(𝑥1, 𝑥2, 𝑥3) = ∑_{𝑥2,𝑥3} 𝑝(𝑥1|𝑥2) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3) = ∑_{𝑥2} 𝑝(𝑥1|𝑥2) ∑_{𝑥3} 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
The inner sum is a function (message 𝑀(𝑥2)) that depends only on 𝑥2
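This factoring trick can be checked numerically; a sketch for the chain above, using randomly generated stand-in CPTs (all names and numbers below are mine):

```python
import itertools
import random

random.seed(0)
K = 3   # number of labels; the CPTs below are random stand-ins

def rand_dist():
    w = [random.random() for _ in range(K)]
    s = sum(w)
    return [v / s for v in w]

p3 = rand_dist()                            # p(x3)
p2g3 = [rand_dist() for _ in range(K)]      # p2g3[x3][x2] = p(x2 | x3)
p1g2 = [rand_dist() for _ in range(K)]      # p1g2[x2][x1] = p(x1 | x2)

# brute-force enumeration: sum over all (x2, x3) pairs for each value of x1
brute = [sum(p1g2[x2][x1] * p2g3[x3][x2] * p3[x3]
             for x2, x3 in itertools.product(range(K), repeat=2))
         for x1 in range(K)]

# with the message M(x2) = sum_x3 p(x2|x3) p(x3), computed once and reused
M = [sum(p2g3[x3][x2] * p3[x3] for x3 in range(K)) for x2 in range(K)]
fast = [sum(p1g2[x2][x1] * M[x2] for x2 in range(K)) for x1 in range(K)]
```

Both ways give the same marginal 𝑝(𝑥1); the message version touches fewer terms.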
Probabilistic inference by sampling
This procedure is called ancestor sampling or prior sampling:
Goal: create one valid sample from the distribution
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
where 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖) is a subset of the variables in {𝑥𝑖+1, …, 𝑥𝑛}
Example: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Procedure:
For 𝑖 = 𝑛 … 1
Sample 𝑥𝑖 ∼ 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Output (𝑥1, …, 𝑥𝑛)
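A minimal sketch of the procedure for the three-variable example; the CPT numbers are made up for illustration:

```python
import random

# p(x1, x2, x3) = p(x1 | x2, x3) p(x2 | x3) p(x3), all variables binary
p3 = [0.6, 0.4]                                   # p(x3)
p2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}               # p(x2 | x3)
p1 = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
      (1, 0): [0.3, 0.7], (1, 1): [0.1, 0.9]}     # p(x1 | x2, x3)

def sample(dist, rng):
    return 0 if rng.random() < dist[0] else 1

def ancestor_sample(rng):
    x3 = sample(p3, rng)             # sample the parents first ...
    x2 = sample(p2[x3], rng)
    x1 = sample(p1[(x2, x3)], rng)   # ... then the child
    return (x1, x2, x3)

rng = random.Random(0)
print(ancestor_sample(rng))   # one valid joint sample
```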
Probabilistic inference by sampling - Example
Example: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Procedure:
For 𝑖 = 𝑛 … 1
Sample 𝑥𝑖 ∼ 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Output (𝑥1, …, 𝑥𝑛)
(Figure: Steps 1-3 sample the graph in the order 𝑥3, 𝑥2, 𝑥1)
How to sample a single variable
How to sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0,1, … , 𝑛}?
1. Define “intervals” whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the concatenated interval
4. Check which interval the sampled value falls into.
Below is an example for 𝑝 𝑥 ∝ {1,2,3} (three values).
1 2 3
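The four steps above can be sketched directly (this trick is often called inverse-CDF or roulette-wheel sampling; function names are mine):

```python
import bisect
import itertools
import random

def sample_discrete(weights, rng):
    # steps 1-2: concatenate intervals proportional to the (unnormalized) weights
    cum = list(itertools.accumulate(weights))
    # step 3: sample uniformly into the concatenated interval
    u = rng.random() * cum[-1]
    # step 4: find which interval the sampled value falls into
    return bisect.bisect_right(cum, u)

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(60000):
    counts[sample_discrete([1, 2, 3], rng)] += 1
# the relative frequencies approach 1/6, 2/6, 3/6
```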
Probabilistic inference by sampling
• By running the ancestor sampling often enough we get (in the limit) the true joint distribution out (law of large numbers, “Gesetz der großen Zahlen”)
• It holds: 𝑃(𝑥1, …, 𝑥𝑛) = 𝑃′(𝑥1, …, 𝑥𝑛) when 𝑁 → ∞
• The joint distribution can now be used to compute local marginals:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙: 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
Procedure:
1) Let 𝑁(𝑥1, …, 𝑥𝑛) denote the number of times we have seen a certain sample (𝑥1, …, 𝑥𝑛)
2) 𝑃′(𝑥1, …, 𝑥𝑛) = 𝑁(𝑥1, …, 𝑥𝑛) / 𝑁 where 𝑁 is the total number of samples
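A sketch of estimating a marginal from sample counts, on a made-up two-node network A → B (all names and numbers below are mine):

```python
import random

# two-node network A -> B with made-up CPTs, both variables binary
p_a1 = 0.3                       # P(A = 1)
p_b1 = {0: 0.8, 1: 0.1}          # P(B = 1 | A)

random.seed(42)
N, count_b1 = 100000, 0
for _ in range(N):
    a = 1 if random.random() < p_a1 else 0     # ancestor sampling: parent first
    b = 1 if random.random() < p_b1[a] else 0  # then the child
    count_b1 += b

p_b1_hat = count_b1 / N                        # empirical marginal P'(B = 1)
exact = (1 - p_a1) * p_b1[0] + p_a1 * p_b1[1]  # exact marginal = 0.59
```

For large 𝑁 the empirical frequency approaches the exact marginal.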
Example
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Reminder: Definition: Factor Graph models
• Given an undirected graph 𝐺 = (𝑉, 𝐹, 𝐸), where 𝑉, 𝐹 are the sets of nodes and 𝐸 the set of edges
• A Factor Graph defines a family of distributions:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
𝑓: partition function
𝐹: factor
ℱ: set of all factors
𝑁(𝐹): neighbourhood of a factor
𝜓𝐹: function (not a distribution) depending on 𝒙_{𝑁(𝐹)} (𝜓𝐹: 𝐾^{|𝑁(𝐹)|} → ℝ where 𝑥𝑖 ∈ 𝐾)
Note: the definition of a factor is not linked to a property of the graph (as it is with cliques)
Reminder: Dynamic Programming on chains
q
p r
• Pass messages from left to right
• Message is a vector with 𝐾 entries (𝐾 is the number of labels)
• Read out the solution from final message and final unary term
• globally exact solution
• Other name: min-sum algorithm
𝐸(𝒙) = ∑𝑖 𝜃𝑖(𝑥𝑖) + ∑_{𝑖,𝑗∈𝑁} 𝜃𝑖𝑗(𝑥𝑖, 𝑥𝑗)
(unary terms and pairwise terms in a row)
Messages along the chain: 𝑀𝑜→𝑝(𝑥𝑝), 𝑀𝑝→𝑞(𝑥𝑞), 𝑀𝑞→𝑟(𝑥𝑟), 𝑀𝑟→𝑠(𝑥𝑠)
Comment: Dmitri Schlesinger called the messages Bellman functions.
Reminder: Dynamic Programming on chains
q
p r
𝑀𝑞→𝑟
Define the message:
𝑀𝑞→𝑟(𝑥𝑟) = min_{𝑥𝑞} { 𝑀𝑝→𝑞(𝑥𝑞) + 𝜃𝑞(𝑥𝑞) + 𝜃𝑞,𝑟(𝑥𝑞, 𝑥𝑟) }
(information from previous nodes + local information + connection to the next node)
The message stores the minimal energy up to this point for 𝑥𝑟 = 𝑘:
𝑀𝑞→𝑟(𝑥𝑟 = 𝑘) = min_{𝑥1,…,𝑥𝑞} 𝐸(𝑥1, …, 𝑥𝑞, 𝑥𝑟 = 𝑘)
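The min-sum recursion can be sketched on a small chain; unary terms below are made up and the pairwise term is Potts-like. The backtracked solution matches brute-force minimization, illustrating that dynamic programming on chains is globally exact:

```python
import itertools

# chain with n = 4 nodes and K = 3 labels; unary terms are made up
K, n = 3, 4
theta = [[0.0, 1.0, 2.0], [2.0, 0.5, 1.0], [1.0, 1.0, 0.0], [0.5, 2.0, 1.0]]

def pair(a, b):                      # Potts-like pairwise term
    return 0.0 if a == b else 1.0

def energy(x):
    return (sum(theta[i][x[i]] for i in range(n))
            + sum(pair(x[i], x[i + 1]) for i in range(n - 1)))

# forward pass: M[k] = min over x_1..x_{i-1} of the partial energy with x_i = k
M, back = [0.0] * K, []
for i in range(1, n):
    newM, arg = [], []
    for k in range(K):
        vals = [M[q] + theta[i - 1][q] + pair(q, k) for q in range(K)]
        newM.append(min(vals))
        arg.append(vals.index(min(vals)))
    M, back = newM, back + [arg]

# read out the solution from the final message and unary term, then backtrack
x = [0] * n
x[-1] = min(range(K), key=lambda k: M[k] + theta[-1][k])
for i in range(n - 2, -1, -1):
    x[i] = back[i][x[i + 1]]

best = min(itertools.product(range(K), repeat=n), key=energy)  # brute force
```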
Reminder: Dynamic Programming on chains - example
Sum-Prod Algorithm for probabilistic Inference
• Let us consider factor graphs:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
𝑀𝑞→𝑟(𝑥𝑟) = ∑_{𝑥𝑞} 𝑀𝑝→𝑞(𝑥𝑞) ∗ 𝜓𝑞(𝑥𝑞) ∗ 𝜓𝑞,𝑟(𝑥𝑞, 𝑥𝑟)
(information from previous nodes ∗ local information ∗ connection to the next node)
• The method is called Sum-Prod Algorithm since a message is computed by product and sum operations.
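A sketch of the sum-prod messages on a small chain factor graph (the factor values below are made up); the message-based marginals agree with brute-force enumeration:

```python
import itertools

# chain factor graph with n = 4 nodes and K = 2 labels; factor values are made up
K, n = 2, 4
psi_u = [[1.0, 2.0], [2.0, 1.0], [1.5, 0.5], [1.0, 1.0]]   # unary factors

def psi_p(a, b):                                           # pairwise factors
    return 2.0 if a == b else 1.0

def weight(x):   # unnormalized P(x): product of all factors
    w = 1.0
    for i in range(n):
        w *= psi_u[i][x[i]]
    for i in range(n - 1):
        w *= psi_p(x[i], x[i + 1])
    return w

# two passes of messages: left-to-right (fwd) and right-to-left (bwd)
fwd = [[1.0] * K for _ in range(n)]
for i in range(1, n):
    for k in range(K):
        fwd[i][k] = sum(fwd[i - 1][q] * psi_u[i - 1][q] * psi_p(q, k)
                        for q in range(K))
bwd = [[1.0] * K for _ in range(n)]
for i in range(n - 2, -1, -1):
    for k in range(K):
        bwd[i][k] = sum(bwd[i + 1][q] * psi_u[i + 1][q] * psi_p(k, q)
                        for q in range(K))

def marginal(i):   # P(x_i) from the two incoming messages and the local factor
    b = [fwd[i][k] * psi_u[i][k] * bwd[i][k] for k in range(K)]
    s = sum(b)
    return [v / s for v in b]
```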
Sum-Prod Algorithm for probabilistic Inference
Extensions
q
p r s
• The algorithm above can easily be extended to chains of arbitrary length
• The algorithm above can also be extended to compute marginal for arbitrary nodes (here 𝑞)
Message: 𝑀𝑝→𝑞(𝑥𝑞) Message: 𝑀𝑟→𝑞(𝑥𝑞)
Extensions
q
p r s
• To compute 𝑓 and all marginals we have to compute all messages that go into each node
• This is done by two passes over all nodes: from left to right and from right to left. (this is referred to as the full Sum-Prod Algorithm)
Compute all messages that go this way Compute all messages that go this way
Extensions
• Both MAP estimation and probabilistic inference can easily be extended to tree structures:
• Both MAP estimation and probabilistic inference can be extended to arbitrary factor graphs (with loops and higher-order potentials)
Loop: 𝑥2, 𝑥3, 𝑥4
Higher (3) order factor: 𝑥2, 𝑥1, 𝑥4
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Reminder: ICM - Iterated conditional mode
Gibbs Energy:
𝑥2
𝑥1
𝑥3 𝑥4
𝑥5
𝐸(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5) + ⋯
Reminder: ICM - Iterated conditional mode
𝑥2
𝑥1
𝑥3 𝑥4
𝑥5
Idea: fix all variables but one and optimize for this one
Select 𝑥1 and optimize:
Gibbs Energy:
• Can get stuck in local minima
• Depends on initialization
ICM Global min
𝐸(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5) + ⋯
𝐸′(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5)
Reminder: ICM - parallelization
• The schedule is a more complex task in graphs which are not 4-connected.
Normal procedure:
Step 1 Step 2 Step 3 Step 4
Parallel procedure:
Step 1 Step 2 Step 3 Step 4
Gibbs Sampling
• Task: draw a sample from a multivariate probability distribution 𝑝 𝒙 = 𝑝 𝑥1, 𝑥2, … , 𝑥𝑛 .
• This is not trivial, if the probability distribution is complex, e.g. an MRF (remember: it is usually given up to an unknown normalizing constant).
The following procedure is used:
1. Start from an arbitrary assignment 𝒙^0 = (𝑥1^0, …, 𝑥𝑛^0).
2. Repeat:
1) Pick (randomly) a variable 𝑥𝑖
2) Sample a new value 𝑥𝑖^{𝑡+1} according to the conditional probability distribution
𝑝(𝑥𝑖 | 𝑥1^𝑡, …, 𝑥𝑖−1^𝑡, 𝑥𝑖+1^𝑡, …, 𝑥𝑛^𝑡),
i.e. under the condition “all other variables are fixed”
Gibbs Sampling
• After each “elementary” sampling step 2) a new assignment 𝒙^{𝑡+1} is obtained that differs from the previous one only in the value of 𝑥𝑖.
• After many (in the limit, infinitely many) iterations of 1)-2) the assignment 𝒙^𝑡 follows the initial probability distribution 𝑝(𝒙) under some (mild) assumptions:
The probability to pick each particular variable 𝑥𝑖 in 1) is non-zero, i.e. over an infinite number of generation steps each variable is visited infinitely many times.
(In particular, in computer vision applications (each variable often corresponds to a pixel) scan-line order is widely used.)
For each pair of assignments 𝒙1, 𝒙2 the probability to arrive at 𝒙2 starting from 𝒙1 is non-zero, i.e. independently of the starting point there is a chance to sample each elementary event of the considered probability distribution.
Gibbs Sampling for MRFs
For MRFs the elementary sampling step 2) is feasible because
𝑝(𝑥𝑖 | 𝑥1, …, 𝑥𝑖−1, 𝑥𝑖+1, …, 𝑥𝑛) = 𝑝(𝑥𝑖 | 𝑥_{𝑁(𝑖)})
where 𝑁(𝑖) is the neighborhood of 𝑖 (the Markovian property).
In particular, for Gibbs distributions of second order:
𝑝(𝑥𝑖 = 𝑘 | 𝑥_{𝑁(𝑖)}) ∝ exp[𝜃𝑖(𝑘) + ∑_{𝑗: (𝑖,𝑗)∈𝑁4} 𝜃𝑖𝑗(𝑘, 𝑥𝑗)]
This is reminiscent of Iterated Conditional Mode, but now we do not decide for the best label; we sample one instead.
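A sketch of Gibbs sampling for a tiny chain MRF; the energies below are made up, and I use the convention 𝑝(𝒙) ∝ exp(−𝐸(𝒙)). The empirical marginal approaches the exact one computed by enumeration:

```python
import itertools
import math
import random

# tiny chain MRF with n = 3 binary variables; energies are made up
K, n = 2, 3
theta_u = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.0]]

def theta_p(a, b):
    return 0.0 if a == b else 1.0

def energy(x):
    return (sum(theta_u[i][x[i]] for i in range(n))
            + sum(theta_p(x[i], x[i + 1]) for i in range(n - 1)))

# exact marginal P(x_1 = 1) by enumeration, for comparison
w = {x: math.exp(-energy(x)) for x in itertools.product(range(K), repeat=n)}
Z = sum(w.values())
exact = sum(p for x, p in w.items() if x[1] == 1) / Z

# Gibbs sampling: resample one variable from its conditional given the rest
random.seed(0)
x, hits, sweeps = [0] * n, 0, 20000
for _ in range(sweeps):
    for i in range(n):                      # scan-line order over the variables
        cond = []
        for k in range(K):
            e = theta_u[i][k]
            if i > 0:
                e += theta_p(x[i - 1], k)
            if i < n - 1:
                e += theta_p(k, x[i + 1])
            cond.append(math.exp(-e))       # p(x_i = k | rest), unnormalized
        u, acc = random.random() * sum(cond), 0.0
        for k, c in enumerate(cond):
            acc += c
            if u < acc:
                break
        x[i] = k
    hits += (x[1] == 1)
est = hits / sweeps   # approaches `exact` as the number of sweeps grows
```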
Reminder: How to sample a single variable
How to sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0,1, … , 𝑛}?
1. Define “intervals” whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the concatenated interval
4. Check which interval the sampled value falls into.
Below is an example for 𝑝 𝑥 ∝ {1,2,3} (three values).
1 2 3
Other Samplers
• Rejection sampling
• Metropolis-Hastings sampling
• Comment: the Gibbs sampler and Metropolis-Hastings sampling belong to the class of MCMC samplers
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Probabilistic programming Languages
• Basic Idea: associate to variables a distribution:
Normal C++: bool coin = 0;
Probabilistic program: bool coin = Bernoulli(0.5);
% Bernoulli is a distribution with 2 states
• A programming language to easily program inference in factor graphs and other machine learning tasks
Another Example: Two coin example
• Draw coin1 (𝑥1) and coin2 (𝑥2) independently.
• Each coin has equal probability to be heads (1) or tails (0)
• A new random variable 𝑧 is True if and only if both coins are heads:
𝑧 = 𝑥1& 𝑥2
Write this as a directed graphical model
𝑥1, 𝑥2 with 𝑥𝑖 ∈ {0, 1}, 𝑝(𝑥𝑖 = 1) = 𝑝(𝑥𝑖 = 0) = 0.5
𝑧
𝑥1  𝑥2  𝑃(𝑧 = 1|𝑥1, 𝑥2)  𝑃(𝑧 = 0|𝑥1, 𝑥2)
0   0   0                 1
0   1   0                 1
1   0   0                 1
1   1   1                 0
Compute Marginal:
𝑝(𝑧) = ∑_{𝑥1,𝑥2} 𝑝(𝑧, 𝑥1, 𝑥2) = ∑_{𝑥1,𝑥2} 𝑝(𝑧|𝑥1, 𝑥2) 𝑝(𝑥1) 𝑝(𝑥2)
𝑝(𝑧 = 1) = 1 ∗ 0.5 ∗ 0.5 = 0.25
𝑝(𝑧 = 0) = 1 ∗ 0.5 ∗ 0.5 + 1 ∗ 0.5 ∗ 0.5 + 1 ∗ 0.5 ∗ 0.5 = 0.75
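The same marginalization in a few lines of Python (function names are mine):

```python
# p(z) by enumeration for the two-coin model
p_x = {0: 0.5, 1: 0.5}   # both coins are fair

def p_z_given(z, x1, x2):
    both_heads = 1 if (x1 == 1 and x2 == 1) else 0
    return 1.0 if z == both_heads else 0.0   # deterministic CPT: z = x1 & x2

p_z1 = sum(p_z_given(1, a, b) * p_x[a] * p_x[b] for a in (0, 1) for b in (0, 1))
p_z0 = sum(p_z_given(0, a, b) * p_x[a] * p_x[b] for a in (0, 1) for b in (0, 1))
print(p_z1, p_z0)   # 0.25 0.75
```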
Infer.net example: 2 coins
Program:
Run it:
Add to the program:
Conversion DGM, UGM, Factor graphs
• In Infer.Net all models are converted to factor graphs and then inference is done in factor graphs
• In practice you have one concrete distribution (not a family)
• Let us first convert a directed graphical model to an undirected one, and then from the undirected model to a factor graph
Directed Graphical model to undirected one
• Take the joint distribution, replace all conditional “symbols” with commas, and then normalize to get a correct distribution
Directed graphical model: 𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Gives the UGM: 𝑝(𝑥1, …, 𝑥𝑛) = (1/𝑓) ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
where 𝑓 = ∑_𝒙 ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Note: the function 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) is no longer a probability, so it should rather be called 𝜓(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) = 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Comment: going from an undirected graphical model to a directed one is difficult since we do not have the individual distributions, just a big product of factors
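A small check of this construction (the CPT numbers below are made up): since the product of the CPTs of a directed model is already a normalized distribution, the partition function 𝑓 of the resulting undirected model comes out as 1 here:

```python
import itertools

# CPTs of a small directed model p(x1 | x2, x3) p(x2) p(x3); numbers are made up
p2 = {0: 0.7, 1: 0.3}                                       # P(x2)
p3 = {0: 0.4, 1: 0.6}                                       # P(x3)
p1 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.1}   # P(x1 = 1 | x2, x3)

def psi(x1, x2, x3):
    # the conditional p(x1 | x2, x3) read as a plain factor psi(x1, x2, x3)
    q = p1[(x2, x3)]
    return q if x1 == 1 else 1 - q

f = sum(psi(a, b, c) * p2[b] * p3[c]
        for a, b, c in itertools.product((0, 1), repeat=3))
# f == 1: the product of CPTs is already normalized
```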
From Directed Graphical model to undirected
This step is called “moralization”.
1) All the parents of each node are connected with each other (“marrying the parents”)
2) All directed links are converted to undirected ones
𝑝(𝑥1, 𝑥2, 𝑥3, 𝑥4) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3|𝑥4) 𝑝(𝑥4)
becomes
𝑝(𝑥1, 𝑥2, 𝑥3, 𝑥4) = (1/𝑓) 𝑝(𝑥1, 𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3, 𝑥4) 𝑝(𝑥4)