Intelligent Systems:
Probabilistic Inference in Undirected and Directed Graphical models
Carsten Rother
Reminder: Random Forests
1) Get a 𝑑-dimensional feature, depending on some parameters 𝜃.
2-dimensional example:
We want to classify the white pixel. Feature: color (e.g. red channel) of the green pixel and color (e.g. red channel) of the red pixel.
Parameters: 4D (2 offset vectors). We will visualize it in this way:
Reminder: Random Forests
(Figure: input depth image → body part labelling (label each pixel) → clustering → body joint hypotheses; shown in front, side and top views)
Reminder: Decision Tree – Train Time
Input: all training points; each point has a class label. Input data in feature space:
The set of all labelled (training data) points, here 35 red and 23 blue.
Split the training set at each node
Measure 𝑝(𝑐) at each leaf; it could be 3 red and 1 blue,
i.e. 𝑝(𝑟𝑒𝑑) = 0.75; 𝑝(𝑏𝑙𝑢𝑒) = 0.25 (remember, the feature space
is also optimized with 𝜃)
Random Forests – Training of features (illustration)
What does it mean to optimize over 𝜃?
• For each pixel the same feature test (at one split node) will be done.
• One has to define what happens with feature tests that reach outside the image.
Goal during training: separate red pixels (class 1) from blue pixels (class 2)
Feature:
Value 𝑥1: the value of the green color channel (could also be red or blue) if you look 𝜃1 pixels right and 𝜃2 pixels up
Value 𝑥2: the value of the green color channel (could also be red or blue) if you look 𝜃3 pixels right and 𝜃4 pixels down
Goal: find a 𝜃 that best separates the data
𝑝𝑜𝑠 + (𝜃1, 𝜃2)
𝑝𝑜𝑠 + (𝜃3, 𝜃4)
Image Labeling (2 classes, red and blue)
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Slide credits
• J. Fürnkranz, TU Darmstadt
• Dimitri Schlesinger
Reminder: Graphical models to capture structured problems
Write probability distribution as a Graphical model:
• Directed graphical model (also called Bayesian Networks)
• Undirected graphical model (also called Markov Random Field)
• Factor graphs (which we will use predominantly)
• A visualization to represent a family of distributions
• Key concept is conditional independence
• You can also convert between the representations
References:
- Pattern Recognition and Machine Learning [Bishop ‘08, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)
When to use what model?
• Undirected graphical model (also called Markov Random Field)
• The individual unknown variables have all the same “meaning”
• Factor graphs (which we will use predominantly)
• Same as an undirected graphical model but represents the distribution in more detail
• Directed graphical model (also called Bayesian Networks)
• The unknown variables have different “meanings”
Reminder: What to infer?
• MAP inference (Maximum a posterior state):
𝑥∗ = argmax_𝑥 𝑃(𝑥) = argmin_𝑥 𝐸(𝑥)
• Probabilistic Inference, so-called marginal:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙 | 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
This can be used to make a maximum marginal decision:
𝑥𝑖∗ = argmax_{𝑥𝑖} 𝑃(𝑥𝑖)
Reminder: MAP versus Marginal - visually
Input Image Ground Truth Labeling
MAP solution 𝒙∗ (each pixel has 0,1 label)
Marginals 𝑃 𝑥𝑖
(each pixel has a probability between 0 and 1)
Reminder: MAP versus Marginal – Making Decisions
Which solution 𝑥∗ would you choose?
Space of all solutions x (sorted by pixel difference) 𝑃(𝑥|𝑧)
Reminder: How to make a decision
Question: What solution 𝑥∗ should we give out?
Answer: Choose 𝑥∗ which minimizes the Bayesian risk:
𝑥∗ = argmin_{𝑥′} ∑_𝑥 𝑃(𝑥|𝑧) 𝐶(𝑥, 𝑥′)   (assume the model 𝑃(𝑥|𝑧) is known)
𝐶(𝑥1, 𝑥2) is called the loss function (or cost function) of comparing two results 𝑥1, 𝑥2
Reminder: MAP versus Marginals – Making Decisions
Space of all solutions x (sorted by pixel difference) 𝑃(𝑥|𝑧)
“guessed” – max marginal MAP solution
MAP: global 0-1 loss: 𝐶(𝑥, 𝑥∗) = 0 if 𝑥 = 𝑥∗, otherwise 1
Max-marginal: pixel-wise 0-1 loss: 𝐶(𝑥, 𝑥∗) = ∑𝑖 |𝑥𝑖 − 𝑥𝑖∗| (Hamming loss)
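A tiny numeric illustration of how the two losses can pick different solutions. The two-pixel distribution below is made up for this sketch, not taken from the lecture:

```python
# joint distribution over two binary pixels (x1, x2); numbers are made up
P = {(0, 0): 0.4, (1, 0): 0.3, (1, 1): 0.3}   # all other states have probability 0

# MAP: most probable joint state (optimal under the global 0-1 loss)
x_map = max(P, key=P.get)

# max-marginal: decide each pixel separately (optimal under the Hamming loss)
def marginal(i, k):
    return sum(p for x, p in P.items() if x[i] == k)

x_mm = tuple(max((0, 1), key=lambda k: marginal(i, k)) for i in range(2))
# x_map == (0, 0) but x_mm == (1, 0): P(x1=1) = 0.6 and P(x2=0) = 0.7
```

So the globally most probable state and the pixel-wise decisions can disagree, which is exactly why the choice of loss matters.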
Reminder: What to infer?
• MAP inference (Maximum a posterior state):
𝑥∗ = argmax_𝑥 𝑃(𝑥) = argmin_𝑥 𝐸(𝑥)
• Probabilistic Inference, so-called marginals:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙 | 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
This can be used to make a maximum marginal decision:
𝑥𝑖∗ = argmax_{𝑥𝑖} 𝑃(𝑥𝑖)
• So far we only did MAP estimation in factor graphs / undirected graphical models
• Today we do probabilistic inference in directed graphical models and in factor graphs / undirected graphical models
• Comment: MAP inference in directed graphical models can be done, but is rarely done
Directed Graphical Model: Intro
• All variables that are linked by an arrow have a causal, directed relationship
• Since the variables have a meaning it makes sense to look at conditional independence (in factor graphs / undirected models this does not give any insights)
(Example nodes: Car is broken; Eat sweets; Toothache; Hole in tooth)
• Comment: the nodes do not have to be discrete variables, they can also be continuous, as long as the distributions are well defined
Directed Graphical Models: Intro
• Rewrite a distribution using product rule:
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1, 𝑥2 | 𝑥3) 𝑝(𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)
• Any joint distribution can be written as product of conditionals:
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑥𝑖+1, …, 𝑥𝑛)
• Visualize the conditional 𝑝(𝑥𝑖|𝑥𝑖+1, … 𝑥𝑛) as follows:
𝑥𝑖
𝑥𝑖+1 … 𝑥𝑛
Directed Graphical Models: Examples
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)   (fully connected graph)
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2) 𝑝(𝑥2 | 𝑥3) 𝑝(𝑥3)   (chain 𝑥3 → 𝑥2 → 𝑥1)
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1 | 𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3)   (𝑥2 and 𝑥3 are independent parents of 𝑥1)
As with undirected graphical models the absence of links is the interesting aspect!
Definition: Directed Graphical Model
• Given a directed graph 𝐺 = (𝑉, 𝐸), where 𝑉 is the set of nodes and 𝐸 the set of directed edges
• A directed Graphical Model defines the family of distributions:
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
• The set 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖) is a subset of the variables in {𝑥𝑖+1, …, 𝑥𝑛} and is defined via the graph; 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) is visualized as a node 𝑥𝑖 with incoming arrows from its parents
• Comment: compared to factor graphs / undirected graphical models there is no partition function 𝑓:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
Running Example
• Situation
• I am at work
• John calls to say that the alarm in my house went off
• Mary, who is my neighbor, did not call me
• The alarm is usually set off by burglars, but sometimes also by a minor earthquake
• Can we construct the directed graphical model?
• Step 1: Identify variables that can have different “state”:
• JohnCalls(J), MaryCalls(M), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
Step 2: Method to construct the Directed Graphical Model
1) Choose an ordering of the variables: 𝑥𝑛, …, 𝑥1
2) For 𝑖 = 𝑛 … 1
A. Add node 𝑥𝑖 to the network
B. Select the parents such that:
𝑝(𝑥𝑖 | 𝑥𝑖+1, …, 𝑥𝑛) = 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M) 𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J): 𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call
𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J): 𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call
𝑝(𝐽|𝑀) 𝑝(𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A): 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)? 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls, the probability that the alarm is on is higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A): 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)? 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls, the probability that the alarm is on is higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵)?
No. The chance that there is a burglar in the house depends on AlarmOn, JohnCalls and MaryCalls.
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
BurglarInHouse(B)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵|𝐴)?
Yes. If we know that the alarm has gone off then the information that Mary or John called is not relevant!
BurglarInHouse(B) 𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
𝑝(𝐵|𝐴)
Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸)?
No. If Mary calls or John calls or the alarm is on then the chances of an earthquake are higher
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
earthquakeHappening(E) Joint (so far):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
BurglarInHouse(B) 𝑝(𝐵|𝐴)
Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)
• Build the model step by step:
MaryCalls(M)
JohnCalls(J)
AlarmOn(A)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸|𝐴, 𝐵)?
Yes. If we know that the alarm has gone off then the information that Mary or John called is not relevant.
𝑝(𝑀)
𝑝(𝐽|𝑀)
𝑝(𝐴|𝐽, 𝑀)
earthquakeHappening(E) 𝑝(𝐸|𝐴, 𝐵)
BurglarInHouse(B)
Joint (final):
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
𝑝(𝐵|𝐴)
Example: Optimal Order
• Was that the best order to get a directed Graphical Model with as few links as possible?
• The optimal order is:
MaryCalls(M) JohnCalls(J)
AlarmOn(A)
BurglarInHouse(B) earthquakeHappening(E)
How to find that order?
Temporal order in which things happened!
Joint:
𝑃(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
Example: Optimal Order
• Associate probabilities:
MaryCalls(M)
JohnCalls(J) AlarmOn(A)
BurglarInHouse(B) earthquakeHappening(E)
𝑷(𝑩 = 𝟏) = 0.001, 𝑷(𝑩 = 𝟎) = 0.999
𝑷(𝑬 = 𝟏) = 0.002, 𝑷(𝑬 = 𝟎) = 0.998

𝐵      𝐸      𝑷(𝑨 = 𝟏|𝑩, 𝑬)   𝑷(𝑨 = 𝟎|𝑩, 𝑬)
True   True   0.95             0.05
True   False  0.94             0.06
False  True   0.29             0.71
False  False  0.001            0.999

𝐴      𝑷(𝑴 = 𝟏|𝑨)   𝑷(𝑴 = 𝟎|𝑨)
True   0.7           0.3
False  0.01          0.99

𝐴      𝑷(𝑱 = 𝟏|𝑨)   𝑷(𝑱 = 𝟎|𝑨)
True   0.9           0.1
False  0.05          0.95
Running Example
• What is the probability that John calls and Mary calls and the alarm is on, but there is no burglar and no earthquake?
𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1, 𝐵 = 0, 𝐸 = 0)
= 𝑝(𝐵 = 0) 𝑝(𝐸 = 0) 𝑝(𝐴 = 1|𝐵 = 0, 𝐸 = 0) 𝑝(𝐽 = 1|𝐴 = 1) 𝑝(𝑀 = 1|𝐴 = 1)
= 0.999 ∗ 0.998 ∗ 0.001 ∗ 0.9 ∗ 0.7 = 0.000628
• What is the probability that John calls and Mary calls and the alarm is on?
𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1) = ∑_{𝐵,𝐸} 𝑝(𝐽 = 1, 𝑀 = 1, 𝐴 = 1, 𝐵, 𝐸)
= ∑_{𝐵,𝐸} 𝑝(𝐵) 𝑝(𝐸) 𝑝(𝐴 = 1|𝐵, 𝐸) 𝑝(𝐽 = 1|𝐴 = 1) 𝑝(𝑀 = 1|𝐴 = 1)
= 0.9 ∗ 0.7 ∗ (0.999 ∗ 0.998 ∗ 0.001 + 0.999 ∗ 0.002 ∗ 0.29 + 0.001 ∗ 0.998 ∗ 0.94 + 0.001 ∗ 0.002 ∗ 0.95) = 0.00159
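Both numbers are easy to check by brute-force enumeration of the joint; a minimal sketch in Python (variable and function names are mine, the CPT values are the ones from the tables above):

```python
# CPTs of the alarm network from the tables above
pB = {1: 0.001, 0: 0.999}
pE = {1: 0.002, 0: 0.998}
pA1 = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
pJ1 = {1: 0.9, 0: 0.05}                                          # P(J=1 | A)
pM1 = {1: 0.7, 0: 0.01}                                          # P(M=1 | A)

def joint(j, m, a, b, e):
    # product of the conditionals along the directed model
    pa = pA1[(b, e)] if a else 1 - pA1[(b, e)]
    pj = pJ1[a] if j else 1 - pJ1[a]
    pm = pM1[a] if m else 1 - pM1[a]
    return pB[b] * pE[e] * pa * pj * pm

p_all = joint(1, 1, 1, 0, 0)                                       # ≈ 0.000628
p_jma = sum(joint(1, 1, 1, b, e) for b in (0, 1) for e in (0, 1))  # ≈ 0.00159
```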
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9

Marginal: 𝑝(𝑠 = 0) = 0.5; 𝑝(𝑠 = 1) = 0.5
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.85; 𝑝(𝑐 = 1) = 0.15
Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9
We know that the person has cancer; what is the probability that he smokes?
Wrong solution: just update the table for 𝑝(𝑐|𝑠) but leave 𝑝(𝑠) unchanged. We want to change 𝑝(𝑐) and not 𝑝(𝑐|𝑠)!
Correct solution: recompute all probabilities under the condition 𝑐 = 1:
𝑝′(𝑠 = 0|𝑐 = 1) = 𝑝(𝑠 = 0, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.05/0.15 = 1/3
𝑝′(𝑠 = 1|𝑐 = 1) = 𝑝(𝑠 = 1, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.1/0.15 = 2/3
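The Bayes-rule computation above can be sketched in a few lines (variable names are mine; the numbers come from the slide's tables):

```python
# prior and CPT from the tables above
p_s = {0: 0.5, 1: 0.5}            # P(s)
p_c1_given_s = {0: 0.1, 1: 0.2}   # P(c = 1 | s)

# evidence c = 1: recompute P(s | c = 1) with Bayes' rule
p_c1 = sum(p_c1_given_s[s] * p_s[s] for s in (0, 1))        # P(c = 1) = 0.15
post = {s: p_c1_given_s[s] * p_s[s] / p_c1 for s in (0, 1)}
# post[0] = 1/3 and post[1] = 2/3, matching the slide
```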
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 2/3, 𝑷(𝒔 = 𝟎) = 1/3

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   1             0
False  1             0

We know that the person has cancer; what is the probability that he smokes?
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 0.5, 𝑷(𝒔 = 𝟎) = 0.5

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  0.1           0.9

We know that the person smokes; what is the probability that he has cancer?
Two solutions (same outcome):
1) Just update 𝑝(𝑠) to 𝑝(𝑠 = 0) = 0, 𝑝(𝑠 = 1) = 1, and then you get a new 𝑝(𝑠, 𝑐) = 𝑝(𝑐|𝑠) 𝑝(𝑠). Now the marginal:
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0, 𝑠 = 0) + 𝑝(𝑐 = 0, 𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.8
𝑝(𝑐 = 1) = 𝑝(𝑐 = 1, 𝑠 = 0) + 𝑝(𝑐 = 1, 𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) = 0.2
2) Update 𝑝(𝑐) under the condition that 𝑠 = 1:
𝑝′(𝑐 = 0) = 𝑝(𝑐 = 0, 𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 0|𝑠 = 1) = 0.8
𝑝′(𝑐 = 1) = 𝑝(𝑐 = 1, 𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / 𝑝(𝑠 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) = 0.2
Fixing Marginal
Smoker(s)
Cancer(c)
𝑷(𝒔 = 𝟏) = 1, 𝑷(𝒔 = 𝟎) = 0

𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True   0.2           0.8
False  irrelevant    irrelevant

We know that the person smokes; what is the probability that he has cancer?
Fixing Marginal – another example
Smoker(s)
Cancer(c)
Healthy (h) environment
Joint: 𝑝 𝑐, 𝑠, ℎ = 𝑝 𝑐|𝑠, ℎ 𝑝 𝑠 𝑝(ℎ)
Fixing Marginal – another example
Smoker(s)
Cancer(c)
Healthy (h) environment
Joint: 𝑝 𝑐, 𝑠, ℎ = 𝑝 𝑐|𝑠, ℎ 𝑝 𝑠 𝑝(ℎ)
𝑝(𝑐) changes; 𝑝(ℎ) does not change:
𝑝′(𝑐, ℎ|𝑠) = 𝑝(𝑐, 𝑠, ℎ) / 𝑝(𝑠) = 𝑝(𝑐|𝑠, ℎ) 𝑝(𝑠) 𝑝(ℎ) / 𝑝(𝑠) = 𝑝(𝑐|𝑠, ℎ) 𝑝(ℎ)
Examples: Medical Diagnosis
Local Marginal for each node (sometimes also called belief)
Car Diagnosis
Examples: Car Insurance
Example: Pigs Network
(“Stammbaum”, i.e. pedigree)
Reminder: Image segmentation
Image z
Joint Probability: 𝑃(𝒛, 𝒙) = 𝑃(𝒛|𝒙) 𝑃(𝒙)
Labeling x
Samples: True image:
Most likely:
Reminder: lecture on Computer Vision
Scene type Scene geometry Object classes Object position Object orientation Object shape Depth/occlusions Object appearance Illumination Shadows Motion blur Camera effects
Generative model for images
Image
Material Light Background
3D Objects
Light coming into camera
Hybrid networks
• Directed graphical models (as well as undirected ones) can contain discrete and continuous variables (as long as they define correct distributions)
Image (discrete) Material
(discrete)
Light (continuous)
Background (discrete) 3D Objects
(continuous)
Light coming into camera (continuous)
Real World Applications of Bayesian Networks
Real World Applications of Bayesian Networks
What questions can we ask?
• Marginal of a single variable 𝑝(𝑥𝑖)
(this is what we call probabilistic inference; the most common query)
• Marginal of two variables: 𝑝(𝑥𝑖, 𝑥𝑗)
• Conditional queries: 𝑝(𝑥𝑖 = 0, 𝑥𝑗 = 1 | 𝑥𝑘 = 0), 𝑝(𝑥𝑖, 𝑥𝑗 | 𝑥𝑘 = 0)
• Optimal decisions:
• Add a utility function to the network and then ask questions about the outcome
• Value of information:
• Which node to fix to make all marginals as unambiguous as possible
• Sensitivity Analysis:
• How does the network behave if I change one marginal?
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Probabilistic Inference in DGM
We consider two possible principles to compute marginals (probabilistic inference):
• Inference by enumeration
• Inference by sampling
Probabilistic inference by enumeration
Brute force: just add up everything.
Model: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
𝑝(𝑥2) = ∑_{𝑥1,𝑥3} 𝑝(𝑥1, 𝑥2, 𝑥3) = ∑_{𝑥1,𝑥3} 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Probabilistic inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
𝑝(𝑥1) = ∑_{𝑥2,𝑥3} 𝑝(𝑥1, 𝑥2, 𝑥3) = ∑_{𝑥2,𝑥3} 𝑝(𝑥1|𝑥2) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3) = ∑_{𝑥2} 𝑝(𝑥1|𝑥2) ∑_{𝑥3} 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
The inner sum is a function (message 𝑀(𝑥2)) that depends only on 𝑥2
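This factoring trick can be checked numerically; a sketch for the chain above, using randomly generated stand-in CPTs (all names and numbers below are mine):

```python
import itertools
import random

random.seed(0)
K = 3   # number of labels; the CPTs below are random stand-ins

def rand_dist():
    w = [random.random() for _ in range(K)]
    s = sum(w)
    return [v / s for v in w]

p3 = rand_dist()                            # p(x3)
p2g3 = [rand_dist() for _ in range(K)]      # p2g3[x3][x2] = p(x2 | x3)
p1g2 = [rand_dist() for _ in range(K)]      # p1g2[x2][x1] = p(x1 | x2)

# brute-force enumeration: sum over all (x2, x3) pairs for each value of x1
brute = [sum(p1g2[x2][x1] * p2g3[x3][x2] * p3[x3]
             for x2, x3 in itertools.product(range(K), repeat=2))
         for x1 in range(K)]

# with the message M(x2) = sum_x3 p(x2|x3) p(x3), computed once and reused
M = [sum(p2g3[x3][x2] * p3[x3] for x3 in range(K)) for x2 in range(K)]
fast = [sum(p1g2[x2][x1] * M[x2] for x2 in range(K)) for x1 in range(K)]
```

Both ways give the same marginal 𝑝(𝑥1); the message version touches fewer terms.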
Probabilistic inference by sampling
This procedure is called ancestor sampling or prior sampling:
Goal: create one valid sample from the distribution
𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
where 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖) is a subset of the variables in {𝑥𝑖+1, …, 𝑥𝑛}
Example: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Procedure:
For 𝑖 = 𝑛 … 1
Sample 𝑥𝑖 ∼ 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Output (𝑥1, …, 𝑥𝑛)
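A minimal sketch of the procedure for the three-variable example; the CPT numbers are made up for illustration:

```python
import random

# p(x1, x2, x3) = p(x1 | x2, x3) p(x2 | x3) p(x3), all variables binary
p3 = [0.6, 0.4]                                   # p(x3)
p2 = {0: [0.7, 0.3], 1: [0.2, 0.8]}               # p(x2 | x3)
p1 = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
      (1, 0): [0.3, 0.7], (1, 1): [0.1, 0.9]}     # p(x1 | x2, x3)

def sample(dist, rng):
    return 0 if rng.random() < dist[0] else 1

def ancestor_sample(rng):
    x3 = sample(p3, rng)             # sample the parents first ...
    x2 = sample(p2[x3], rng)
    x1 = sample(p1[(x2, x3)], rng)   # ... then the child
    return (x1, x2, x3)

rng = random.Random(0)
print(ancestor_sample(rng))   # one valid joint sample
```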
Probabilistic inference by sampling - Example
Example: 𝑝(𝑥1, 𝑥2, 𝑥3) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2|𝑥3) 𝑝(𝑥3)
Procedure:
For 𝑖 = 𝑛 … 1
Sample 𝑥𝑖 ∼ 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Output (𝑥1, …, 𝑥𝑛)
(Figure: Steps 1-3 sample the graph in the order 𝑥3, 𝑥2, 𝑥1)
How to sample a single variable
How to sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0,1, … , 𝑛}?
1. Define “intervals” whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the concatenated interval
4. Check which interval the sampled value falls into.
Below is an example for 𝑝 𝑥 ∝ {1,2,3} (three values).
1 2 3
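The four steps above can be sketched directly (this trick is often called inverse-CDF or roulette-wheel sampling; function names are mine):

```python
import bisect
import itertools
import random

def sample_discrete(weights, rng):
    # steps 1-2: concatenate intervals proportional to the (unnormalized) weights
    cum = list(itertools.accumulate(weights))
    # step 3: sample uniformly into the concatenated interval
    u = rng.random() * cum[-1]
    # step 4: find which interval the sampled value falls into
    return bisect.bisect_right(cum, u)

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(60000):
    counts[sample_discrete([1, 2, 3], rng)] += 1
# the relative frequencies approach 1/6, 2/6, 3/6
```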
Probabilistic inference by sampling
• By running the ancestor sampling often enough we get (in the limit) the true joint distribution out (law of large numbers, “Gesetz der großen Zahlen”)
• It holds: 𝑃(𝑥1, …, 𝑥𝑛) = 𝑃′(𝑥1, …, 𝑥𝑛) when 𝑁 → ∞
• The joint distribution can now be used to compute local marginals:
𝑃(𝑥𝑖 = 𝑘) = ∑_{𝒙: 𝑥𝑖=𝑘} 𝑃(𝑥1, …, 𝑥𝑖 = 𝑘, …, 𝑥𝑛)
Procedure:
1) Let 𝑁(𝑥1, …, 𝑥𝑛) denote the number of times we have seen a certain sample (𝑥1, …, 𝑥𝑛)
2) 𝑃′(𝑥1, …, 𝑥𝑛) = 𝑁(𝑥1, …, 𝑥𝑛) / 𝑁 where 𝑁 is the total number of samples
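A sketch of estimating a marginal from sample counts, on a made-up two-node network A → B (all names and numbers below are mine):

```python
import random

# two-node network A -> B with made-up CPTs, both variables binary
p_a1 = 0.3                       # P(A = 1)
p_b1 = {0: 0.8, 1: 0.1}          # P(B = 1 | A)

random.seed(42)
N, count_b1 = 100000, 0
for _ in range(N):
    a = 1 if random.random() < p_a1 else 0     # ancestor sampling: parent first
    b = 1 if random.random() < p_b1[a] else 0  # then the child
    count_b1 += b

p_b1_hat = count_b1 / N                        # empirical marginal P'(B = 1)
exact = (1 - p_a1) * p_b1[0] + p_a1 * p_b1[1]  # exact marginal = 0.59
```

For large 𝑁 the empirical frequency approaches the exact marginal.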
Example
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Reminder: Definition: Factor Graph models
• Given an undirected graph 𝐺 = (𝑉, 𝐹, 𝐸), where 𝑉, 𝐹 are the sets of nodes and 𝐸 the set of edges
• A Factor Graph defines a family of distributions:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
𝑓: partition function
𝐹: factor
ℱ: set of all factors
𝑁(𝐹): neighbourhood of a factor
𝜓𝐹: function (not a distribution) depending on 𝒙_{𝑁(𝐹)} (𝜓𝐹: 𝐾^{|𝑁(𝐹)|} → ℝ where 𝑥𝑖 ∈ 𝐾)
Note: the definition of a factor is not linked to a property of the graph (as it is with cliques)
Reminder: Dynamic Programming on chains
q
p r
• Pass messages from left to right
• Message is a vector with 𝐾 entries (𝐾 is the number of labels)
• Read out the solution from final message and final unary term
• globally exact solution
• Other name: min-sum algorithm
𝐸(𝒙) = ∑𝑖 𝜃𝑖(𝑥𝑖) + ∑_{𝑖,𝑗∈𝑁} 𝜃𝑖𝑗(𝑥𝑖, 𝑥𝑗)
(unary terms and pairwise terms in a row)
Messages along the chain: 𝑀𝑜→𝑝(𝑥𝑝), 𝑀𝑝→𝑞(𝑥𝑞), 𝑀𝑞→𝑟(𝑥𝑟), 𝑀𝑟→𝑠(𝑥𝑠)
Comment: Dmitri Schlesinger called the messages Bellman functions.
Reminder: Dynamic Programming on chains
q
p r
𝑀𝑞→𝑟
Define the message:
𝑀𝑞→𝑟(𝑥𝑟) = min_{𝑥𝑞} { 𝑀𝑝→𝑞(𝑥𝑞) + 𝜃𝑞(𝑥𝑞) + 𝜃𝑞,𝑟(𝑥𝑞, 𝑥𝑟) }
(information from previous nodes + local information + connection to the next node)
The message stores the minimal energy up to this point for 𝑥𝑟 = 𝑘:
𝑀𝑞→𝑟(𝑥𝑟 = 𝑘) = min_{𝑥1,…,𝑥𝑞} 𝐸(𝑥1, …, 𝑥𝑞, 𝑥𝑟 = 𝑘)
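The min-sum recursion can be sketched on a small chain; unary terms below are made up and the pairwise term is Potts-like. The backtracked solution matches brute-force minimization, illustrating that dynamic programming on chains is globally exact:

```python
import itertools

# chain with n = 4 nodes and K = 3 labels; unary terms are made up
K, n = 3, 4
theta = [[0.0, 1.0, 2.0], [2.0, 0.5, 1.0], [1.0, 1.0, 0.0], [0.5, 2.0, 1.0]]

def pair(a, b):                      # Potts-like pairwise term
    return 0.0 if a == b else 1.0

def energy(x):
    return (sum(theta[i][x[i]] for i in range(n))
            + sum(pair(x[i], x[i + 1]) for i in range(n - 1)))

# forward pass: M[k] = min over x_1..x_{i-1} of the partial energy with x_i = k
M, back = [0.0] * K, []
for i in range(1, n):
    newM, arg = [], []
    for k in range(K):
        vals = [M[q] + theta[i - 1][q] + pair(q, k) for q in range(K)]
        newM.append(min(vals))
        arg.append(vals.index(min(vals)))
    M, back = newM, back + [arg]

# read out the solution from the final message and unary term, then backtrack
x = [0] * n
x[-1] = min(range(K), key=lambda k: M[k] + theta[-1][k])
for i in range(n - 2, -1, -1):
    x[i] = back[i][x[i + 1]]

best = min(itertools.product(range(K), repeat=n), key=energy)  # brute force
```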
Reminder: Dynamic Programming on chains - example
Sum-Prod Algorithm for probabilistic Inference
• Let us consider factor graphs:
𝑃(𝒙) = (1/𝑓) ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)}) where 𝑓 = ∑_𝒙 ∏_{𝐹∈ℱ} 𝜓𝐹(𝒙_{𝑁(𝐹)})
𝑀𝑞→𝑟(𝑥𝑟) = ∑_{𝑥𝑞} 𝑀𝑝→𝑞(𝑥𝑞) ∗ 𝜓𝑞(𝑥𝑞) ∗ 𝜓𝑞,𝑟(𝑥𝑞, 𝑥𝑟)
(information from previous nodes ∗ local information ∗ connection to the next node)
• The method is called Sum-Prod Algorithm since a message is computed by product and sum operations.
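A sketch of the sum-prod messages on a small chain factor graph (the factor values below are made up); the message-based marginals agree with brute-force enumeration:

```python
import itertools

# chain factor graph with n = 4 nodes and K = 2 labels; factor values are made up
K, n = 2, 4
psi_u = [[1.0, 2.0], [2.0, 1.0], [1.5, 0.5], [1.0, 1.0]]   # unary factors

def psi_p(a, b):                                           # pairwise factors
    return 2.0 if a == b else 1.0

def weight(x):   # unnormalized P(x): product of all factors
    w = 1.0
    for i in range(n):
        w *= psi_u[i][x[i]]
    for i in range(n - 1):
        w *= psi_p(x[i], x[i + 1])
    return w

# two passes of messages: left-to-right (fwd) and right-to-left (bwd)
fwd = [[1.0] * K for _ in range(n)]
for i in range(1, n):
    for k in range(K):
        fwd[i][k] = sum(fwd[i - 1][q] * psi_u[i - 1][q] * psi_p(q, k)
                        for q in range(K))
bwd = [[1.0] * K for _ in range(n)]
for i in range(n - 2, -1, -1):
    for k in range(K):
        bwd[i][k] = sum(bwd[i + 1][q] * psi_u[i + 1][q] * psi_p(k, q)
                        for q in range(K))

def marginal(i):   # P(x_i) from the two incoming messages and the local factor
    b = [fwd[i][k] * psi_u[i][k] * bwd[i][k] for k in range(K)]
    s = sum(b)
    return [v / s for v in b]
```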
Sum-Prod Algorithm for probabilistic Inference
Extensions
q
p r s
• The algorithm above can easily be extended to chains of arbitrary length
• The algorithm above can also be extended to compute marginal for arbitrary nodes (here 𝑞)
Message: 𝑀𝑝→𝑞(𝑥𝑞) Message: 𝑀𝑟→𝑞(𝑥𝑞)
Extensions
q
p r s
• To compute 𝑓 and all marginals we have to compute all messages that go into each node
• This is done by two passes over all nodes: from left to right and from right to left. (this is referred to as the full Sum-Prod Algorithm)
Compute all messages that go this way Compute all messages that go this way
Extensions
• Both MAP estimation and probabilistic inference can easily be extended to tree structures:
• Both MAP estimation and probabilistic inference can be extended to arbitrary factor graphs (with loops and higher-order potentials)
Loop: 𝑥2, 𝑥3, 𝑥4
Higher (3) order factor: 𝑥2, 𝑥1, 𝑥4
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Reminder: ICM - Iterated conditional mode
Gibbs Energy:
𝑥2
𝑥1
𝑥3 𝑥4
𝑥5
𝐸(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5) + ⋯
Reminder: ICM - Iterated conditional mode
𝑥2
𝑥1
𝑥3 𝑥4
𝑥5
Idea: fix all variables but one and optimize for this one
Select 𝑥1 and optimize:
Gibbs Energy:
• Can get stuck in local minima
• Depends on initialization
ICM Global min
𝐸(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5) + ⋯
𝐸′(𝒙) = 𝜃12(𝑥1, 𝑥2) + 𝜃13(𝑥1, 𝑥3) + 𝜃14(𝑥1, 𝑥4) + 𝜃15(𝑥1, 𝑥5)
Reminder: ICM - parallelization
• The schedule is a more complex task in graphs which are not 4-connected.
Normal procedure:
Step 1 Step 2 Step 3 Step 4
Parallel procedure:
Step 1 Step 2 Step 3 Step 4
Gibbs Sampling
• Task: draw a sample from a multivariate probability distribution 𝑝 𝒙 = 𝑝 𝑥1, 𝑥2, … , 𝑥𝑛 .
• This is not trivial, if the probability distribution is complex, e.g. an MRF (remember: it is usually given up to an unknown normalizing constant).
The following procedure is used:
1. Start from an arbitrary assignment 𝒙^0 = (𝑥1^0, …, 𝑥𝑛^0).
2. Repeat:
1) Pick (randomly) a variable 𝑥𝑖
2) Sample a new value 𝑥𝑖^{𝑡+1} according to the conditional probability distribution
𝑝(𝑥𝑖 | 𝑥1^𝑡, …, 𝑥𝑖−1^𝑡, 𝑥𝑖+1^𝑡, …, 𝑥𝑛^𝑡),
i.e. under the condition “all other variables are fixed”
Gibbs Sampling
• After each “elementary” sampling step 2) a new assignment 𝒙^{𝑡+1} is obtained that differs from the previous one only in the value of 𝑥𝑖.
• After many (in the limit, infinitely many) iterations of 1)-2) the assignment 𝒙^𝑡 follows the initial probability distribution 𝑝(𝒙) under some (mild) assumptions:
The probability to pick each particular variable 𝑥𝑖 in 1) is non-zero, i.e. over an infinite number of generation steps each variable is visited infinitely many times.
(In particular, in computer vision applications (each variable often corresponds to a pixel) scan-line order is widely used.)
For each pair of assignments 𝒙1, 𝒙2 the probability to arrive at 𝒙2 starting from 𝒙1 is non-zero, i.e. independently of the starting point there is a chance to sample each elementary event of the considered probability distribution.
Gibbs Sampling for MRFs
For MRFs the elementary sampling step 2) is feasible because
𝑝(𝑥𝑖 | 𝑥1, …, 𝑥𝑖−1, 𝑥𝑖+1, …, 𝑥𝑛) = 𝑝(𝑥𝑖 | 𝑥_{𝑁(𝑖)})
where 𝑁(𝑖) is the neighborhood of 𝑖 (the Markovian property).
In particular, for Gibbs distributions of second order:
𝑝(𝑥𝑖 = 𝑘 | 𝑥_{𝑁(𝑖)}) ∝ exp[𝜃𝑖(𝑘) + ∑_{𝑗: (𝑖,𝑗)∈𝑁4} 𝜃𝑖𝑗(𝑘, 𝑥𝑗)]
This is reminiscent of Iterated Conditional Mode, but now we do not decide for the best label; we sample one instead.
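A sketch of Gibbs sampling for a tiny chain MRF; the energies below are made up, and I use the convention 𝑝(𝒙) ∝ exp(−𝐸(𝒙)). The empirical marginal approaches the exact one computed by enumeration:

```python
import itertools
import math
import random

# tiny chain MRF with n = 3 binary variables; energies are made up
K, n = 2, 3
theta_u = [[0.0, 1.0], [0.5, 0.0], [1.0, 0.0]]

def theta_p(a, b):
    return 0.0 if a == b else 1.0

def energy(x):
    return (sum(theta_u[i][x[i]] for i in range(n))
            + sum(theta_p(x[i], x[i + 1]) for i in range(n - 1)))

# exact marginal P(x_1 = 1) by enumeration, for comparison
w = {x: math.exp(-energy(x)) for x in itertools.product(range(K), repeat=n)}
Z = sum(w.values())
exact = sum(p for x, p in w.items() if x[1] == 1) / Z

# Gibbs sampling: resample one variable from its conditional given the rest
random.seed(0)
x, hits, sweeps = [0] * n, 0, 20000
for _ in range(sweeps):
    for i in range(n):                      # scan-line order over the variables
        cond = []
        for k in range(K):
            e = theta_u[i][k]
            if i > 0:
                e += theta_p(x[i - 1], k)
            if i < n - 1:
                e += theta_p(k, x[i + 1])
            cond.append(math.exp(-e))       # p(x_i = k | rest), unnormalized
        u, acc = random.random() * sum(cond), 0.0
        for k, c in enumerate(cond):
            acc += c
            if u < acc:
                break
        x[i] = k
    hits += (x[1] == 1)
est = hits / sweeps   # approaches `exact` as the number of sweeps grows
```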
Reminder: How to sample a single variable
How to sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0,1, … , 𝑛}?
1. Define “intervals” whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the concatenated interval
4. Check which interval the sampled value falls into.
Below is an example for 𝑝 𝑥 ∝ {1,2,3} (three values).
1 2 3
Other Samplers
• Rejection sampling
• Metropolis-Hastings sampling
• Comment: the Gibbs sampler and Metropolis-Hastings sampling belong to the class of MCMC samplers
Roadmap for this lecture
• Directed Graphical Models:
• Definition and Example
• Probabilistic Inference
• Undirected Graphical Models:
• Probabilistic inference: message passing
• Probabilistic inference: sampling
• Probabilistic programming languages
• Wrap-Up
Probabilistic programming Languages
• Basic Idea: associate to variables a distribution:
Normal C++: bool coin = 0;
Probabilistic program: bool coin = Bernoulli(0.5);
% Bernoulli is a distribution with 2 states
• A programming language to easily program inference in factor graphs and other machine learning tasks
Another Example: Two coin example
• Draw coin1 (𝑥1) and coin2 (𝑥2) independently.
• Each coin has equal probability to be heads (1) or tails (0)
• A new random variable 𝑧 is True if and only if both coins are heads:
𝑧 = 𝑥1& 𝑥2
Write this as a directed graphical model
𝑥1, 𝑥2 with 𝑥𝑖 ∈ {0, 1}, 𝑝(𝑥𝑖 = 1) = 𝑝(𝑥𝑖 = 0) = 0.5
𝑧
𝑥1  𝑥2  𝑃(𝑧 = 1|𝑥1, 𝑥2)  𝑃(𝑧 = 0|𝑥1, 𝑥2)
0   0   0                 1
0   1   0                 1
1   0   0                 1
1   1   1                 0
Compute Marginal:
𝑝(𝑧) = ∑_{𝑥1,𝑥2} 𝑝(𝑧, 𝑥1, 𝑥2) = ∑_{𝑥1,𝑥2} 𝑝(𝑧|𝑥1, 𝑥2) 𝑝(𝑥1) 𝑝(𝑥2)
𝑝(𝑧 = 1) = 1 ∗ 0.5 ∗ 0.5 = 0.25
𝑝(𝑧 = 0) = 1 ∗ 0.5 ∗ 0.5 + 1 ∗ 0.5 ∗ 0.5 + 1 ∗ 0.5 ∗ 0.5 = 0.75
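The same marginalization in a few lines of Python (function names are mine):

```python
# p(z) by enumeration for the two-coin model
p_x = {0: 0.5, 1: 0.5}   # both coins are fair

def p_z_given(z, x1, x2):
    both_heads = 1 if (x1 == 1 and x2 == 1) else 0
    return 1.0 if z == both_heads else 0.0   # deterministic CPT: z = x1 & x2

p_z1 = sum(p_z_given(1, a, b) * p_x[a] * p_x[b] for a in (0, 1) for b in (0, 1))
p_z0 = sum(p_z_given(0, a, b) * p_x[a] * p_x[b] for a in (0, 1) for b in (0, 1))
print(p_z1, p_z0)   # 0.25 0.75
```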
Infer.net example: 2 coins
Program:
Run it:
Add to the program:
Conversion DGM, UGM, Factor graphs
• In Infer.Net all models are converted to factor graphs and then inference is done in factor graphs
• In practice you have one concrete distribution (not a family)
• Let us first convert a directed graphical model to an undirected one, and then from the undirected model to a factor graph
Directed Graphical model to undirected one
• Take the joint distribution, replace all conditional “symbols” with commas, and then normalize to get a correct distribution
Directed graphical model: 𝑝(𝑥1, …, 𝑥𝑛) = ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖 | 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Gives the UGM: 𝑝(𝑥1, …, 𝑥𝑛) = (1/𝑓) ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
where 𝑓 = ∑_𝒙 ∏_{𝑖=1}^{𝑛} 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Note: the function 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) is no longer a probability, so it should rather be called 𝜓(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖)) = 𝑝(𝑥𝑖, 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥𝑖))
Comment: going from an undirected graphical model to a directed one is difficult since we do not have the individual distributions, just a big product of factors
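A small check of this construction (the CPT numbers below are made up): since the product of the CPTs of a directed model is already a normalized distribution, the partition function 𝑓 of the resulting undirected model comes out as 1 here:

```python
import itertools

# CPTs of a small directed model p(x1 | x2, x3) p(x2) p(x3); numbers are made up
p2 = {0: 0.7, 1: 0.3}                                       # P(x2)
p3 = {0: 0.4, 1: 0.6}                                       # P(x3)
p1 = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.1}   # P(x1 = 1 | x2, x3)

def psi(x1, x2, x3):
    # the conditional p(x1 | x2, x3) read as a plain factor psi(x1, x2, x3)
    q = p1[(x2, x3)]
    return q if x1 == 1 else 1 - q

f = sum(psi(a, b, c) * p2[b] * p3[c]
        for a, b, c in itertools.product((0, 1), repeat=3))
# f == 1: the product of CPTs is already normalized
```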
From Directed Graphical model to undirected
This step is called “moralization”.
1) All the parents of each node are connected with each other (“marrying the parents”)
2) All directed links are converted to undirected ones
𝑝(𝑥1, 𝑥2, 𝑥3, 𝑥4) = 𝑝(𝑥1|𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3|𝑥4) 𝑝(𝑥4)
becomes
𝑝(𝑥1, 𝑥2, 𝑥3, 𝑥4) = (1/𝑓) 𝑝(𝑥1, 𝑥2, 𝑥3) 𝑝(𝑥2) 𝑝(𝑥3, 𝑥4) 𝑝(𝑥4)