(1)

Intelligent Systems:

Probabilistic Inference in Undirected and Directed Graphical models

Carsten Rother

(2)

Reminder: Random Forests

1) Get 𝑑-dimensional feature, depending on some parameters 𝜃:

2-dimensional:

We want to classify the white pixel. Feature: the color (e.g. red channel) of the green pixel and the color (e.g. red channel) of the red pixel.

Parameters: 4D (2 offset vectors). We will visualize it in this way:

(3)

Reminder: Random Forests

[Figure: Kinect body-tracking pipeline: input depth image, body part labelling (label each pixel), clustering, body joint hypotheses shown in front, side and top views]

(4)

Reminder: Decision Tree – Train Time

Input: all training points, shown as input data in feature space;

each point has a class label

The set of all labelled (training data) points, here 35 red and 23 blue.

Split the training set at each node

Measure $p(c)$ at each leaf; it could be, e.g., 3 red and 1 blue, i.e. $p(\text{red}) = 0.75$, $p(\text{blue}) = 0.25$ (remember, the feature space is also optimized via $\theta$).

(5)

Random Forests – Training of features (illustration)

What does it mean to optimize over $\theta$?

For each pixel the same feature test (at one split node) will be done.

One has to define what happens with feature tests that reach outside the image.

Goal during training: separate red pixels (class 1) from blue pixels (class 2).

Feature:

Value $x_1$: the value of the green color channel (could also be red or blue) if you look $\theta_1$ pixels to the right and $\theta_2$ pixels up

Value $x_2$: the value of the green color channel (could also be red or blue) if you look $\theta_3$ pixels to the right and $\theta_4$ pixels down

Goal: find a $\theta$ that best separates the data, i.e. the offset positions $pos + (\theta_1, \theta_2)$ and $pos + (\theta_3, \theta_4)$.

Image Labeling (2 classes, red and blue)

(6)

Roadmap for this lecture

• Directed Graphical Models:

• Definition and Example

• Probabilistic Inference

• Undirected Graphical Models:

• Probabilistic inference: message passing

• Probabilistic inference: sampling

• Probabilistic programming languages

• Wrap-Up

(7)

Slide credits

• J. Fürnkranz, TU Darmstadt

• Dimitri Schlesinger

(8)

Reminder: Graphical models to capture structured problems

Write the probability distribution as a graphical model:

• Directed graphical model (also called Bayesian Networks)

• Undirected graphical model (also called Markov Random Field)

• Factor graphs (which we will use predominantly)

Basic idea:

• A visualization to represent a family of distributions

• Key concept: conditional independence

• You can also convert between the different representations

References:

- Pattern Recognition and Machine Learning [Bishop ‘08, chapter 8]

- several lectures at the Machine Learning Summer School 2009 (see video lectures)

(9)

When to use what model?

Undirected graphical model (also called Markov Random Field)

The individual unknown variables all have the same “meaning”

Factor graphs (which we will use predominantly)

Same as an undirected graphical model, but represents the distribution in more detail

Directed graphical model (also called Bayesian Networks)

The unknown variables have different “meanings”

[Figure: example graphs, an image-labelling grid model (label only the left image) and a small directed model with variables $x_1$, $x_2$, $z$, $d_i$]

(10)

Reminder: What to infer?

• MAP inference (Maximum a posterior state):

$\hat{x} = \operatorname{argmax}_x P(x) = \operatorname{argmin}_x E(x)$

• Probabilistic Inference, so-called marginal:

$P(x_i = k) = \sum_{\boldsymbol{x}\,:\,x_i = k} P(x_1, \ldots, x_i = k, \ldots, x_n)$

This can be used to make a maximum marginal decision:

$\hat{x}_i = \operatorname{argmax}_{x_i} P(x_i)$

(11)

Reminder: MAP versus Marginal - visually

[Figure: input image, ground-truth labelling, MAP solution $\hat{x}$ (each pixel has a 0/1 label), and marginals $P(x_i)$ (each pixel has a probability between 0 and 1)]

(12)

Reminder: MAP versus Marginal – Making Decisions

Which solution $\hat{x}$ would you choose?

[Plot: $P(x|z)$ over the space of all solutions $x$, sorted by pixel difference]

(13)

Reminder: How to make a decision

Question: What solution $\hat{x}$ should we give out?

Answer: Choose the $\hat{x}$ which minimizes the Bayesian risk:

$\hat{x} = \operatorname{argmin}_{\hat{x}} \sum_{x} P(x \mid z)\, C(\hat{x}, x)$   (assume the model $P(x \mid z)$ is known)

$C(x_1, x_2)$ is called the loss function (or cost function) for comparing two results $x_1, x_2$.

(14)

Reminder: MAP versus Marginals – Making Decisions

[Plot: $P(x|z)$ over the space of all solutions $x$ (sorted by pixel difference), marking the MAP solution and the “guessed” max-marginal solution]

MAP, global 0-1 loss: $C(\hat{x}, x) = 0$ if $\hat{x} = x$, otherwise $1$

Max-marginal, pixel-wise 0-1 loss: $C(\hat{x}, x) = \sum_i |\hat{x}_i - x_i|$ (Hamming loss)

(15)

Reminder: What to infer?

• MAP inference (Maximum a posterior state):

$\hat{x} = \operatorname{argmax}_x P(x) = \operatorname{argmin}_x E(x)$

• Probabilistic Inference, so-called marginals:

$P(x_i = k) = \sum_{\boldsymbol{x}\,:\,x_i = k} P(x_1, \ldots, x_i = k, \ldots, x_n)$

This can be used to make a maximum marginal decision:

$\hat{x}_i = \operatorname{argmax}_{x_i} P(x_i)$

So far we have only done MAP estimation in factor graphs / undirected graphical models.

Today we do probabilistic inference in directed models and in factor graphs / undirected graphical models.

Comment: MAP inference in directed graphical models can be done, but is rarely done.

(16)

Directed Graphical Model: Intro

• All variables that are linked by an arrow have a causal, directed relationship

• Since the variables have a meaning, it makes sense to look at

conditional (in)dependence (in factor graphs / undirected models this does not give any insight)

[Example graph with nodes: Car is broken; Eat sweets; Toothache; Hole in tooth]

• Comment: the nodes do not have to be discrete variables; they can also be continuous, as long as the distribution is well defined

(17)

Directed Graphical Models: Intro

Rewrite a distribution using product rule:

$p(x_1, x_2, x_3) = p(x_1, x_2 \mid x_3)\, p(x_3) = p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$

Any joint distribution can be written as a product of conditionals:

$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{i+1}, \ldots, x_n)$

Visualize the conditional $p(x_i \mid x_{i+1}, \ldots, x_n)$ as a node $x_i$ with incoming arrows from $x_{i+1}, \ldots, x_n$.

(18)

Directed Graphical Models: Examples

$p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$   [fully connected graph]

$p(x_1, x_2, x_3) = p(x_1 \mid x_2)\, p(x_2 \mid x_3)\, p(x_3)$   [chain $x_3 \to x_2 \to x_1$]

$p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\, p(x_2)\, p(x_3)$   [$x_2$ and $x_3$ are both parents of $x_1$]

As with undirected graphical models the absence of links is the interesting aspect!

(19)

Definition: Directed Graphical Model

• Given a directed graph $G = (V, E)$, where $V$ is the set of nodes and $E$ the set of directed edges

• A directed graphical model defines the family of distributions:

$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{parents}(x_i))$

• The set $\mathrm{parents}(x_i)$ is a subset of the variables in $\{x_{i+1}, \ldots, x_n\}$ and is defined via the graph; the conditional $p(x_i \mid \mathrm{parents}(x_i))$ is visualized as on the previous slide.

Comment: compared to factor graphs / undirected graphical models there is no partition function $f$:

$P(\boldsymbol{x}) = \frac{1}{f} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$, where $f = \sum_{\boldsymbol{x}} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$

(20)

Running Example

• Situation

• I am at work

• John calls to say that the alarm in my house went off

• Mary, who is my neighbor, did not call me

• The alarm is usually set off by burglars, but sometimes also by a minor earthquake

• Can we construct the directed graphical model?

• Step 1: Identify the variables, which can take different “states”:

• JohnCalls(J), MaryCalls(M), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

(21)

Step 2: Method to construct Direct Graphical Model

1) Choose an ordering of the variables: $x_n, \ldots, x_1$

2) For $i = n \ldots 1$:

A. Add node $x_i$ to the network

B. Select the parents such that:

$p(x_i \mid x_{i+1}, \ldots, x_n) = p(x_i \mid \mathrm{parents}(x_i))$

(22)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

Joint (so far):

$P(M, J, A, B, E) = p(M)$

(23)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M) = p(J)$?

No. If Mary calls, then John is also likely to call.

Joint (so far):

$P(M, J, A, B, E) = p(M)$

(24)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

($p(J \mid M) = p(J)$? No. If Mary calls, then John is also likely to call.)

Joint (so far):

$P(M, J, A, B, E) = p(J \mid M)\, p(M)$

(25)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M) = p(A)$? $p(A \mid J, M) = p(A \mid J)$? $p(A \mid J, M) = p(A \mid M)$?

No. If Mary calls or John calls, it is more probable that the alarm is on.

Joint (so far):

$P(M, J, A, B, E) = p(J \mid M)\, p(M)$

(26)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M)$

($p(A \mid J, M) = p(A)$? $p(A \mid J, M) = p(A \mid J)$? $p(A \mid J, M) = p(A \mid M)$? No. If Mary calls or John calls, it is more probable that the alarm is on.)

Joint (so far):

$P(M, J, A, B, E) = p(A \mid J, M)\, p(J \mid M)\, p(M)$

(27)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M)$

BurglarInHouse(B): $p(B \mid A, J, M) = p(B)$?

No. The chance that there is a burglar in the house depends on AlarmOn and on whether John or Mary called.

Joint (so far):

$P(M, J, A, B, E) = p(A \mid J, M)\, p(J \mid M)\, p(M)$

(28)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M)$

BurglarInHouse(B): $p(B \mid A)$

$p(B \mid A, J, M) = p(B \mid A)$? Yes. If we know that the alarm has gone off, then the information that Mary or John called is not relevant!

Joint (so far):

$P(M, J, A, B, E) = p(B \mid A)\, p(A \mid J, M)\, p(J \mid M)\, p(M)$

(29)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M)$

BurglarInHouse(B): $p(B \mid A)$

earthquakeHappening(E): $p(E \mid A, J, M, B) = p(E)$?

No. If Mary calls or John calls or the alarm is on, then the chances of an earthquake are higher.

Joint (so far):

$P(M, J, A, B, E) = p(B \mid A)\, p(A \mid J, M)\, p(J \mid M)\, p(M)$

(30)

Example

• Let us select the ordering:

MaryCalls(M), JohnCalls(J), AlarmOn(A), burglarInHouse(B), earthquakeHappening(E)

• Build the model step by step:

MaryCalls(M): $p(M)$

JohnCalls(J): $p(J \mid M)$

AlarmOn(A): $p(A \mid J, M)$

BurglarInHouse(B): $p(B \mid A)$

earthquakeHappening(E): $p(E \mid A, B)$

$p(E \mid A, J, M, B) = p(E \mid A, B)$? Yes. If we know that the alarm has gone off, then the information that Mary or John called is not relevant.

Joint (final):

$P(M, J, A, B, E) = p(E \mid A, B)\, p(B \mid A)\, p(A \mid J, M)\, p(J \mid M)\, p(M)$

(31)

Example: Optimal Order

• Was that the best order to get a directed Graphical Model with as few links as possible?

• The optimal order is:

[Graph: BurglarInHouse(B) and earthquakeHappening(E) → AlarmOn(A) → JohnCalls(J) and MaryCalls(M)]

How to find that order?

Temporal order in which things happened!

Joint:

$P(M, J, A, B, E) = p(J \mid A)\, p(M \mid A)\, p(A \mid B, E)\, p(E)\, p(B)$

(32)

Example: Optimal Order

• Associate probabilities:

MaryCalls(M)

JohnCalls(J) AlarmOn(A)

BurglarInHouse(B) earthquakeHappening(E)

P(B=1) = 0.001, P(B=0) = 0.999

P(E=1) = 0.002, P(E=0) = 0.998

B     | E     | P(A=1|B,E) | P(A=0|B,E)
True  | True  | 0.95       | 0.05
True  | False | 0.94       | 0.06
False | True  | 0.29       | 0.71
False | False | 0.001      | 0.999

A     | P(M=1|A) | P(M=0|A)
True  | 0.7      | 0.3
False | 0.01     | 0.99

A     | P(J=1|A) | P(J=0|A)
True  | 0.9      | 0.1
False | 0.05     | 0.95

(33)

Worst case scenario

(34)

Running Example

• What is the probability that John calls, Mary calls and the alarm is on, but there is no burglar and no earthquake?

$p(J{=}1, M{=}1, A{=}1, B{=}0, E{=}0) = p(B{=}0)\, p(E{=}0)\, p(A{=}1 \mid B{=}0, E{=}0)\, p(J{=}1 \mid A{=}1)\, p(M{=}1 \mid A{=}1)$
$= 0.999 \cdot 0.998 \cdot 0.001 \cdot 0.9 \cdot 0.7 \approx 0.000628$

• What is the probability that John calls, Mary calls and the alarm is on?

$p(J{=}1, M{=}1, A{=}1) = \sum_{B,E} p(J{=}1, M{=}1, A{=}1, B, E) = \sum_{B,E} p(B)\, p(E)\, p(A{=}1 \mid B, E)\, p(J{=}1 \mid A{=}1)\, p(M{=}1 \mid A{=}1)$
$= 0.9 \cdot 0.7 \cdot (0.999 \cdot 0.998 \cdot 0.001 + 0.999 \cdot 0.002 \cdot 0.29 + 0.001 \cdot 0.998 \cdot 0.94 + 0.001 \cdot 0.002 \cdot 0.95) \approx 0.00159$
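To make the enumeration concrete, here is a minimal Python sketch (not part of the lecture) that encodes the CPTs from the earlier slide as plain dictionaries, an encoding chosen only for illustration, and reproduces both numbers by brute-force summation:

```python
import itertools

# CPTs of the alarm network (values from the slides)
p_B = {1: 0.001, 0: 0.999}
p_E = {1: 0.002, 0: 0.998}
p_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
p_J = {1: 0.90, 0: 0.05}   # P(J=1 | A)
p_M = {1: 0.70, 0: 0.01}   # P(M=1 | A)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) as a product of the conditionals."""
    pa = p_A[(b, e)] if a == 1 else 1.0 - p_A[(b, e)]
    pj = p_J[a] if j == 1 else 1.0 - p_J[a]
    pm = p_M[a] if m == 1 else 1.0 - p_M[a]
    return p_B[b] * p_E[e] * pa * pj * pm

# Single full assignment: P(J=1, M=1, A=1, B=0, E=0)
print(joint(1, 1, 1, 0, 0))                                   # ~0.000628

# Marginal by enumeration: P(J=1, M=1, A=1), summing over B and E
print(sum(joint(1, 1, 1, b, e)
          for b, e in itertools.product([0, 1], repeat=2)))   # ~0.00159
```

The same joint() function can answer any other query on this network by summing over the unobserved variables.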

(35)

Fixing Marginal

Smoker(s)

Cancer(c)

P(s=1) = 0.5, P(s=0) = 0.5

s     | P(c=1|s) | P(c=0|s)
True  | 0.2      | 0.8
False | 0.1      | 0.9

Marginal: $p(s{=}0) = 0.5$; $p(s{=}1) = 0.5$

$p(c{=}0) = p(c{=}0 \mid s{=}0)\, p(s{=}0) + p(c{=}0 \mid s{=}1)\, p(s{=}1) = 0.85$, $\quad p(c{=}1) = 0.15$

Joint: $p(c, s) = p(c \mid s)\, p(s)$

(36)

Fixing Marginal

Smoker(s)

Cancer(c)

P(s=1) = 0.5, P(s=0) = 0.5

s     | P(c=1|s) | P(c=0|s)
True  | 0.2      | 0.8
False | 0.1      | 0.9

We know that the person has cancer. What is the probability that he smokes?

Wrong solution: just update the table for $p(c \mid s)$ but leave $p(s)$ unchanged. We want to change $p(c)$, not $p(c \mid s)$!

Correct solution: recompute all probabilities under the condition $c = 1$:

$p'(s{=}0) = p(s{=}0 \mid c{=}1) = \dfrac{p(c{=}1 \mid s{=}0)\, p(s{=}0)}{p(c{=}1 \mid s{=}0)\, p(s{=}0) + p(c{=}1 \mid s{=}1)\, p(s{=}1)} = 0.05 / 0.15 = 1/3$

$p'(s{=}1) = p(s{=}1 \mid c{=}1) = \dfrac{p(c{=}1 \mid s{=}1)\, p(s{=}1)}{p(c{=}1 \mid s{=}0)\, p(s{=}0) + p(c{=}1 \mid s{=}1)\, p(s{=}1)} = 0.1 / 0.15 = 2/3$

Fix the marginal accordingly (next slide).
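This correction is just Bayes' rule; a tiny sketch (my own illustration, not lecture code) with the table values from this slide:

```python
# Prior and conditional from the slide
p_s = {1: 0.5, 0: 0.5}            # P(s)
p_c1_given_s = {1: 0.2, 0: 0.1}   # P(c=1 | s)

# Observe c=1 and recompute the marginal of s (Bayes' rule)
p_c1 = sum(p_c1_given_s[s] * p_s[s] for s in (0, 1))              # P(c=1) = 0.15
posterior_s = {s: p_c1_given_s[s] * p_s[s] / p_c1 for s in (0, 1)}
print(posterior_s)                 # {0: 0.333..., 1: 0.666...}  i.e. 1/3 and 2/3
```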

(37)

Fixing Marginal

Smoker(s)

Cancer(c)

P(s=1) = 2/3, P(s=0) = 1/3

s     | P(c=1|s) | P(c=0|s)
True  | 1        | 0
False | 1        | 0

We know that the person has cancer, what is the probability that he smokes:

(38)

Fixing Marginal

Smoker(s)

Cancer(c)

P(s=1) = 0.5, P(s=0) = 0.5

s     | P(c=1|s) | P(c=0|s)
True  | 0.2      | 0.8
False | 0.1      | 0.9

We know that the person smokes. What is the probability that he has cancer?

Two solutions (same outcome):

1) Just update $p(s)$ to $p(s{=}0) = 0$, $p(s{=}1) = 1$, and you get a new joint $p(s, c) = p(c \mid s)\, p(s)$. Now the marginal:

$p(c{=}0) = p(c{=}0, s{=}0) + p(c{=}0, s{=}1) = p(c{=}0 \mid s{=}0)\, p(s{=}0) + p(c{=}0 \mid s{=}1)\, p(s{=}1) = 0.8$
$p(c{=}1) = p(c{=}1, s{=}0) + p(c{=}1, s{=}1) = p(c{=}1 \mid s{=}0)\, p(s{=}0) + p(c{=}1 \mid s{=}1)\, p(s{=}1) = 0.2$

2) Update $p(c)$ under the condition that $s = 1$:

$p'(c{=}0) = p(c{=}0, s{=}1) / p(s{=}1) = p(c{=}0 \mid s{=}1)\, p(s{=}1) / p(s{=}1) = p(c{=}0 \mid s{=}1) = 0.8$
$p'(c{=}1) = p(c{=}1, s{=}1) / p(s{=}1) = p(c{=}1 \mid s{=}1)\, p(s{=}1) / p(s{=}1) = p(c{=}1 \mid s{=}1) = 0.2$

Fix the marginal accordingly (next slide).

(39)

Fixing Marginal

Smoker(s)

Cancer(c)

P(s=1) = 1, P(s=0) = 0

s     | P(c=1|s)   | P(c=0|s)
True  | 0.2        | 0.8
False | irrelevant | irrelevant

We know that the person smokes, what is the probability that he has cancer:

(40)

Fixing Marginal – another example

Smoker(s)

Cancer(c)

Healthy (h) environment

Joint: $p(c, s, h) = p(c \mid s, h)\, p(s)\, p(h)$

(41)

Fixing Marginal – another example

Smoker(s)

Cancer(c)

Healthy (h) environment

Joint: $p(c, s, h) = p(c \mid s, h)\, p(s)\, p(h)$

If we condition on $s$: $p(c)$ changes, $p(h)$ does not change:

$p(c, h \mid s) = \dfrac{p(c, s, h)}{p(s)} = \dfrac{p(c \mid s, h)\, p(s)\, p(h)}{p(s)} = p(c \mid s, h)\, p(h)$

(42)

Examples: Medical Diagnosis

(43)

Examples: Medical Diagnosis

(44)

Examples: Medical Diagnosis

Local Marginal for each node (sometimes also called belief)

(45)

Examples: Medical Diagnosis

(46)

Examples: Medical Diagnosis

(47)

Examples: Medical Diagnosis

(48)

Examples: Medical Diagnosis

(49)

Car Diagnosis

(50)

Examples: Car Insurance

(51)

Example: Pigs Network

(“Stammbaum”, i.e. a pedigree)

(52)

Reminder: Image segmentation

Image z

Joint probability: $P(\boldsymbol{z}, \boldsymbol{x}) = P(\boldsymbol{z} \mid \boldsymbol{x})\, P(\boldsymbol{x})$

Labeling x

[Figure: the true image, the most likely pair, and samples $(\boldsymbol{x}, \boldsymbol{z})$ from the model]

(53)

Reminder: lecture on Computer Vision

Scene type, scene geometry, object classes, object position, object orientation, object shape, depth/occlusions, object appearance, illumination, shadows, motion blur, camera effects

(54)

Generative model for images

[Graph nodes: Material, Light, Background, 3D Objects, Light coming into camera, Image]

(55)

Hybrid networks

• Directed graphical models (as well as undirected ones) can contain both discrete and continuous variables (as long as they define correct distributions)

[Graph nodes: Image (discrete), Material (discrete), Light (continuous), Background (discrete), 3D Objects (continuous), Light coming into camera (continuous)]

(56)

Real World Applications of Bayesian Networks

(57)

Real World Applications of Bayesian Networks

(58)

What questions can we ask?

Marginal of a single variable, $p(x_i)$:

“this is what we call probabilistic inference; the most common query”

Marginal of two variables: $p(x_i, x_j)$

Conditional queries: $p(x_i{=}0, x_j{=}1 \mid x_k{=}0)$, $p(x_i, x_j \mid x_k{=}0)$

Optimal decisions:

Add a utility function to the network and then ask questions about the outcome

Value of information:

Which node should be fixed (observed) to make all marginals as certain as possible?

Sensitivity analysis:

How does the network behave if one marginal is changed?

(59)

Roadmap for this lecture

• Directed Graphical Models:

• Definition and Example

• Probabilistic Inference

• Undirected Graphical Models:

• Probabilistic inference: message passing

• Probabilistic inference: sampling

• Probabilistic programming languages

• Wrap-Up

(60)

Probabilistic Inference in DGM

We consider two possible principles for computing marginals (probabilistic inference):

• Inference by enumeration

• Inference by sampling

(61)

Probabilistic inference by enumeration

Brute force - just sum over all other variables:

$p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$

$p(x_2) = \sum_{x_1, x_3} p(x_1, x_2, x_3) = \sum_{x_1, x_3} p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$

(62)

Probabilistic inference by enumeration

In some cases you can reduce the cost of the computation:

$p(x_1, x_2, x_3) = p(x_1 \mid x_2)\, p(x_2 \mid x_3)\, p(x_3)$   [chain $x_3 \to x_2 \to x_1$]

$p(x_1) = \sum_{x_2, x_3} p(x_1, x_2, x_3) = \sum_{x_2, x_3} p(x_1 \mid x_2)\, p(x_2 \mid x_3)\, p(x_3) = \sum_{x_2} p(x_1 \mid x_2) \sum_{x_3} p(x_2 \mid x_3)\, p(x_3)$

The inner sum is a function (a message $M(x_2)$) that depends only on $x_2$.
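A small numeric sketch of this saving (the chain size and CPT values are random, invented purely for illustration): the inner sum over $x_3$ is computed once as a message $M(x_2)$ and reused, instead of being recomputed for every value of $x_1$.

```python
import numpy as np

K = 3  # number of states per variable (arbitrary example size)
rng = np.random.default_rng(0)

# Example CPTs for the chain x3 -> x2 -> x1 (columns normalized to be valid conditionals)
p_x3 = rng.random(K); p_x3 /= p_x3.sum()                                        # p(x3)
p_x2_given_x3 = rng.random((K, K)); p_x2_given_x3 /= p_x2_given_x3.sum(axis=0)  # [x2, x3]
p_x1_given_x2 = rng.random((K, K)); p_x1_given_x2 /= p_x1_given_x2.sum(axis=0)  # [x1, x2]

# Brute force: sum over all (x2, x3) pairs for every value of x1
brute = np.array([
    sum(p_x1_given_x2[x1, x2] * p_x2_given_x3[x2, x3] * p_x3[x3]
        for x2 in range(K) for x3 in range(K))
    for x1 in range(K)
])

# With a message: M(x2) = sum_x3 p(x2|x3) p(x3), then p(x1) = sum_x2 p(x1|x2) M(x2)
M = p_x2_given_x3 @ p_x3
with_message = p_x1_given_x2 @ M

print(np.allclose(brute, with_message))  # True
```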

(63)

Probabilistic inference by sampling

This procedure is called ancestral sampling (also: ancestor or prior sampling).

Goal: create one valid sample from the distribution

$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{parents}(x_i))$,

where $\mathrm{parents}(x_i)$ is a subset of the variables in $\{x_{i+1}, \ldots, x_n\}$.

Example: $p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$

Procedure:

For $i = n \ldots 1$:

sample $x_i \sim p(x_i \mid \mathrm{parents}(x_i))$

Output $(x_1, \ldots, x_n)$.
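A minimal ancestral-sampling sketch for the three-variable example above; the concrete CPT values are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2  # binary variables

# Example CPTs for p(x1|x2,x3) p(x2|x3) p(x3)
p_x3 = np.array([0.6, 0.4])
p_x2_given_x3 = np.array([[0.7, 0.2],     # row x2=0, one column per value of x3
                          [0.3, 0.8]])    # row x2=1
p_x1_given_x2x3 = rng.random((K, K, K))
p_x1_given_x2x3 /= p_x1_given_x2x3.sum(axis=0)   # normalize over x1

def ancestral_sample():
    # Sample parents before children (topological order x3, x2, x1)
    x3 = rng.choice(K, p=p_x3)
    x2 = rng.choice(K, p=p_x2_given_x3[:, x3])
    x1 = rng.choice(K, p=p_x1_given_x2x3[:, x2, x3])
    return x1, x2, x3

print([ancestral_sample() for _ in range(5)])
```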

(64)

Probabilistic inference by sampling - Example

Example: $p(x_1, x_2, x_3) = p(x_1 \mid x_2, x_3)\, p(x_2 \mid x_3)\, p(x_3)$

Procedure:

For $i = n \ldots 1$: sample $x_i \sim p(x_i \mid \mathrm{parents}(x_i))$; output $(x_1, \ldots, x_n)$.

[Figure: the graph over $x_1, x_2, x_3$ sampled in three steps. Step 1: sample $x_3$; Step 2: sample $x_2$ given $x_3$; Step 3: sample $x_1$ given $x_2, x_3$]

(65)

How to sample a single variable

How to sample from a general discrete probability distribution of one variable $p(x)$, $x \in \{0, 1, \ldots, n\}$?

1. Define “intervals” whose lengths are proportional to $p(x)$
2. Concatenate these intervals
3. Sample uniformly within the composed interval
4. Check which interval the sampled value falls into

Example: $p(x) \propto \{1, 2, 3\}$ (three values), i.e. concatenated intervals of relative lengths 1, 2 and 3.
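A direct sketch of this interval construction (my own illustration): accumulate the unnormalized weights and see where a uniform draw lands.

```python
import random

def sample_discrete(weights, rng=random):
    """Sample an index proportional to the (unnormalized) weights,
    using the concatenated-interval construction described above."""
    total = sum(weights)
    u = rng.uniform(0, total)          # uniform point in the composed interval
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if u < cumulative:             # the interval the point falls into
            return i
    return len(weights) - 1            # guard against rounding at the boundary

# p(x) proportional to {1, 2, 3}: expect roughly 1/6, 2/6, 3/6
counts = [0, 0, 0]
for _ in range(60000):
    counts[sample_discrete([1, 2, 3])] += 1
print([c / 60000 for c in counts])
```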

(66)

Probabilistic inference by sampling

• By running ancestral sampling often enough we obtain (in the limit) the true joint distribution (law of large numbers).

Procedure:

1) Let $N(x_1, \ldots, x_n)$ denote the number of times we have seen a certain sample $(x_1, \ldots, x_n)$

2) $P'(x_1, \ldots, x_n) = \dfrac{N(x_1, \ldots, x_n)}{N}$, where $N$ is the total number of samples

It holds that $P(x_1, \ldots, x_n) = P'(x_1, \ldots, x_n)$ as $N \to \infty$.

The estimated joint distribution can then be used to compute local marginals:

$P(x_i = k) = \sum_{\boldsymbol{x}\,:\,x_i = k} P(x_1, \ldots, x_i = k, \ldots, x_n)$
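Continuing the sampling sketches above (again only an illustration, reusing the hypothetical ancestral_sample function from the sketch two slides back): counting the samples gives the empirical joint $P'$, and counting only one coordinate gives an estimate of a local marginal.

```python
from collections import Counter

N = 100000
samples = [ancestral_sample() for _ in range(N)]   # from the earlier ancestral-sampling sketch

# Empirical joint P'(x1, x2, x3) = N(x1, x2, x3) / N
joint_counts = Counter(samples)
P_prime = {assignment: c / N for assignment, c in joint_counts.items()}

# Local marginal of x2: P(x2 = k) estimated by the fraction of samples with x2 = k
marginal_x2 = Counter(x2 for (_, x2, _) in samples)
print({k: c / N for k, c in marginal_x2.items()})
```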

(67)

Example

(68)

Example

(69)

Example

(70)

Example

(71)

Example

(72)

Example

(73)

Roadmap for this lecture

• Directed Graphical Models:

• Definition and Example

• Probabilistic Inference

• Undirected Graphical Models:

• Probabilistic inference: message passing

• Probabilistic inference: sampling

• Probabilistic programming languages

• Wrap-Up

(74)

Reminder: Definition: Factor Graph models

• Given a graph $G = (V, F, E)$, where $V$ and $F$ are the sets of variable and factor nodes and $E$ is the set of edges

• A factor graph defines a family of distributions:

$P(\boldsymbol{x}) = \frac{1}{f} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$, where $f = \sum_{\boldsymbol{x}} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$

$f$: partition function; $F$: factor; $\mathcal{F}$: set of all factors; $N(F)$: neighbourhood of a factor;

$\psi_F$: a function (not a distribution) depending on $\boldsymbol{x}_{N(F)}$  ($\psi_F : K^{|N(F)|} \to \mathbb{R}$, where $x_i \in K$)

Note: the definition of a factor is not linked to a property of the graph (as it is with cliques).

(75)

Reminder: Dynamic Programming on chains

[Figure: a chain $\ldots \to o \to p \to q \to r \to s \to \ldots$ with messages $M_{o\to p}(x_p)$, $M_{p\to q}(x_q)$, $M_{q\to r}(x_r)$, $M_{r\to s}(x_s)$]

Pass messages from left to right.

A message is a vector with $K$ entries ($K$ is the number of labels).

Read out the solution from the final message and the final unary term; this gives a globally exact solution.

Other name: min-sum algorithm.

$E(\boldsymbol{x}) = \sum_i \theta_i(x_i) + \sum_{(i,j) \in N} \theta_{ij}(x_i, x_j)$

(unary terms and pairwise terms along the chain)

Comment: Dmitri Schlesinger called the messages Bellman functions.

(76)

Reminder: Dynamic Programming on chains

[Figure: chain $\ldots \to p \to q \to r \to \ldots$ with message $M_{q\to r}$]

Define the message:

$M_{q\to r}(x_r) = \min_{x_q} \{ M_{p\to q}(x_q) + \theta_q(x_q) + \theta_{q,r}(x_q, x_r) \}$

(information from the previous nodes + local information + connection to the next node)

The message stores the best energy up to this point for $x_r = k$:

$M_{q\to r}(x_r = k) = \min_{x_1, \ldots, x_q,\, x_r = k} E(x_1, \ldots, x_q, x_r = k)$
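A compact sketch of the min-sum recursion on a chain (random unary and pairwise costs, my own illustration rather than lecture code), including the backtracking step that reads out the globally optimal labelling:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 4                                  # chain length and number of labels
unary = rng.random((n, K))                   # theta_i(x_i)
pairwise = rng.random((n - 1, K, K))         # theta_{i,i+1}(x_i, x_{i+1})

# Forward pass: M[i, k] = min over x_0..x_{i-1} of the energy so far, ending in x_i = k
M = np.zeros((n, K))
argmin_prev = np.zeros((n, K), dtype=int)
for i in range(1, n):
    # cost of coming from label l at node i-1 to label k at node i
    cand = (M[i - 1] + unary[i - 1])[:, None] + pairwise[i - 1]   # shape (K_prev, K)
    M[i] = cand.min(axis=0)
    argmin_prev[i] = cand.argmin(axis=0)

# Read out the solution from the final message and final unary term, then backtrack
x = np.zeros(n, dtype=int)
x[-1] = int(np.argmin(M[-1] + unary[-1]))
for i in range(n - 1, 0, -1):
    x[i - 1] = argmin_prev[i, x[i]]

print("MAP labelling:", x, "energy:", (M[-1] + unary[-1]).min())
```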

(77)

Reminder: Dynamic Programming on chains - example

(78)

Sum-Prod Algorithm for probabilistic Inference

Let us consider factor graphs:

$P(\boldsymbol{x}) = \frac{1}{f} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$, where $f = \sum_{\boldsymbol{x}} \prod_{F \in \mathcal{F}} \psi_F(\boldsymbol{x}_{N(F)})$

[Figure: chain $\ldots \to p \to q \to r \to \ldots$ with unary factor $\psi_q(x_q)$, pairwise factor $\psi_{q,r}(x_q, x_r)$ and message $M_{q\to r}(x_r)$]

$M_{q\to r}(x_r) = \sum_{x_q} \{ M_{p\to q}(x_q) \cdot \psi_q(x_q) \cdot \psi_{q,r}(x_q, x_r) \}$

(information from the previous nodes · local information · connection to the next node)

The method is called the Sum-Prod algorithm since a message is computed by product and sum operations.
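A corresponding sum-product sketch on a small chain (random positive factors, my own illustration): the same recursion with min replaced by a sum and + replaced by a product, which yields the partition function $f$ and an exact marginal; the result is checked against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 3
psi_unary = rng.random((n, K)) + 0.1               # psi_i(x_i) > 0
psi_pair = rng.random((n - 1, K, K)) + 0.1         # psi_{i,i+1}(x_i, x_{i+1})

# Forward messages: M[i, k] = sum over x_0..x_{i-1} of the product of factors, with x_i = k
M = np.ones((n, K))
for i in range(1, n):
    M[i] = (M[i - 1] * psi_unary[i - 1]) @ psi_pair[i - 1]

f = float(np.sum(M[-1] * psi_unary[-1]))           # partition function
marginal_last = M[-1] * psi_unary[-1] / f          # P(x_{n-1})

# Brute-force check by enumerating all K^n assignments
def weight(x):
    w = np.prod([psi_unary[i, x[i]] for i in range(n)])
    return w * np.prod([psi_pair[i, x[i], x[i + 1]] for i in range(n - 1)])

brute = np.zeros(K)
for x in itertools.product(range(K), repeat=n):
    brute[x[-1]] += weight(x)
print(np.allclose(marginal_last, brute / brute.sum()))   # True
```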

(79)

Sum-Prod Algorithm for probabilistic Inference

(80)

Extensions

q

p r s

The algorithm above can easily be extended to chains of arbitrary length.

The algorithm above can also be extended to compute the marginal of an arbitrary node (here $q$).

Messages: $M_{p\to q}(x_q)$ and $M_{r\to q}(x_q)$

(81)

Extensions

q

p r s

To compute $f$ and all marginals we have to compute all messages that go into each node.

This is done by two passes over all nodes: from left to right and from right to left (this is referred to as the full Sum-Prod algorithm).

[Figure: messages computed in the left-to-right pass and in the right-to-left pass]

(82)

Extensions

• Both MAP estimation and probabilistic inference can easily be extended to tree structures.

• Both MAP estimation and probabilistic inference can be extended to arbitrary factor graphs (with loops and higher-order potentials).

Loop: $x_2, x_3, x_4$

Higher-order (order 3) factor: $x_2, x_1, x_4$

(83)

Roadmap for this lecture

• Directed Graphical Models:

• Definition and Example

• Probabilistic Inference

• Undirected Graphical Models:

• Probabilistic inference: message passing

• Probabilistic inference: sampling

• Probabilistic programming languages

• Wrap-Up

(84)

Reminder: ICM - Iterated conditional mode

Gibbs energy (for a star graph in which $x_1$ is connected to $x_2, x_3, x_4, x_5$):

$E(\boldsymbol{x}) = \theta_{12}(x_1, x_2) + \theta_{13}(x_1, x_3) + \theta_{14}(x_1, x_4) + \theta_{15}(x_1, x_5) + \cdots$

(85)

Reminder: ICM - Iterated conditional mode

Idea: fix all variables but one and optimize over that one.

Select $x_1$ and optimize:

Gibbs energy: $E(\boldsymbol{x}) = \theta_{12}(x_1, x_2) + \theta_{13}(x_1, x_3) + \theta_{14}(x_1, x_4) + \theta_{15}(x_1, x_5) + \cdots$

Restricted to the terms containing $x_1$: $E'(\boldsymbol{x}) = \theta_{12}(x_1, x_2) + \theta_{13}(x_1, x_3) + \theta_{14}(x_1, x_4) + \theta_{15}(x_1, x_5)$

Can get stuck in local minima; depends on the initialization.

[Figure: energy plot comparing the ICM result with the global minimum]
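A tiny ICM sketch for a pairwise energy on a star graph like the one above (random potentials plus an added unary term, purely illustrative and not lecture code): fix all variables except one, pick the label that minimizes the energy terms touching it, and sweep repeatedly.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]          # the star graph from the slide
theta_unary = rng.random((5, K))                  # illustrative unary terms
theta_pair = {e: rng.random((K, K)) for e in edges}

def energy(x):
    return (sum(theta_unary[i, x[i]] for i in range(len(x)))
            + sum(theta_pair[(i, j)][x[i], x[j]] for i, j in edges))

def icm(x, sweeps=10):
    x = list(x)
    for _ in range(sweeps):
        for i in range(len(x)):
            # local energy of node i as a function of its own label, all others fixed
            local = theta_unary[i].copy()
            for (a, b), th in theta_pair.items():
                if a == i:
                    local += th[:, x[b]]
                elif b == i:
                    local += th[x[a], :]
            x[i] = int(np.argmin(local))          # conditional mode
    return x

x0 = rng.integers(0, K, size=5)                   # result depends on this initialization
print("initial energy:", energy(x0), "-> ICM energy:", energy(icm(x0)))
```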

(86)

Reminder: ICM - parallelization

The scheduling is a more complex task in graphs which are not 4-connected.

[Figure: normal (sequential) procedure vs. parallel procedure, each shown as Steps 1-4]

(87)

Gibbs Sampling

• Task: draw a sample from a multivariate probability distribution $p(\boldsymbol{x}) = p(x_1, x_2, \ldots, x_n)$.

• This is not trivial if the probability distribution is complex, e.g. an MRF (remember: it is usually given only up to an unknown normalizing constant).

The following procedure is used:

1. Start from an arbitrary assignment $\boldsymbol{x}^0 = (x_1^0, \ldots, x_n^0)$.

2. Repeat:

1) Pick (randomly) a variable $x_i$

2) Sample a new value $x_i^{t+1}$ according to the conditional probability distribution

$p(x_i \mid x_1^t, \ldots, x_{i-1}^t, x_{i+1}^t, \ldots, x_n^t)$,

i.e. under the condition “all other variables are fixed”.
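A Gibbs-sampling sketch for the same kind of pairwise model as in the ICM sketch above (again illustrative, not lecture code). The structure is identical, but instead of taking the argmin of the local energy we sample the label from the local conditional, here taken proportional to exp(-local energy), i.e. assuming the usual energy convention.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 5
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
theta_unary = rng.random((n, K))
theta_pair = {e: rng.random((K, K)) for e in edges}

def gibbs_sweep(x):
    """One sweep in scan-line order: resample every variable from its local conditional."""
    for i in range(n):
        local = theta_unary[i].copy()
        for (a, b), th in theta_pair.items():
            if a == i:
                local += th[:, x[b]]
            elif b == i:
                local += th[x[a], :]
        # p(x_i = k | rest) proportional to exp(-local energy)
        p = np.exp(-local)
        p /= p.sum()
        x[i] = rng.choice(K, p=p)
    return x

x = list(rng.integers(0, K, size=n))      # arbitrary starting assignment
samples = []
for t in range(2000):
    x = gibbs_sweep(x)
    samples.append(tuple(x))              # after burn-in these follow p(x)
print(samples[-3:])
```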

(88)

Gibbs Sampling

• After each “elementary” sampling step 2) a new assignment $\boldsymbol{x}^{t+1}$ is obtained that differs from the previous one only in the value of $x_i$.

• After many (in the limit, infinitely many) iterations of 1) and 2) the assignment $\boldsymbol{x}^t$ follows the target probability distribution $p(\boldsymbol{x})$, under some (mild) assumptions:

- The probability to pick each particular variable $x_i$ in 1) is non-zero, i.e. over an infinite number of generation steps each variable is visited infinitely many times. (In computer vision applications, where each variable often corresponds to a pixel, scan-line order is widely used.)

- For each pair of assignments $\boldsymbol{x}_1, \boldsymbol{x}_2$ the probability to arrive at $\boldsymbol{x}_2$ starting from $\boldsymbol{x}_1$ is non-zero, i.e. independently of the starting point there is a chance to sample every elementary event of the considered probability distribution.

(89)

Gibbs Sampling for MRFs

For MRFs the elementary sampling step 2) is feasible because

$p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = p(x_i \mid x_{N(i)})$,

where $N(i)$ is the neighborhood of $i$ (the Markov property).

In particular, for Gibbs distributions of second order:

$p(x_i = k \mid x_{N(i)}) \propto \exp\big[\theta_i(k) + \sum_{j : (i,j) \in N_4} \theta_{ij}(k, x_j)\big]$

It is reminiscent of Iterated Conditional Modes, but now we do not decide for the best label; we sample one instead.

(90)

Reminder: How to sample a single variable

How to sample from a general discrete probability distribution of one variable $p(x)$, $x \in \{0, 1, \ldots, n\}$?

1. Define “intervals” whose lengths are proportional to $p(x)$
2. Concatenate these intervals
3. Sample uniformly within the composed interval
4. Check which interval the sampled value falls into

Example: $p(x) \propto \{1, 2, 3\}$ (three values), i.e. concatenated intervals of relative lengths 1, 2 and 3.

(91)

Other Samplers

• Rejection sampling

• Metropolis-Hastings sampling

• Comment: the Gibbs sampler and Metropolis-Hastings sampling belong to the class of MCMC samplers

(92)

Roadmap for this lecture

• Directed Graphical Models:

• Definition and Example

• Probabilistic Inference

• Undirected Graphical Models:

• Probabilistic inference: message passing

• Probabilistic inference: sampling

• Probabilistic programming languages

• Wrap-Up

(93)

Probabilistic programming Languages

Basic idea: associate a distribution with each variable:

Normal C++: bool coin = 0;

Probabilistic program: bool coin = Bernoulli(0.5);

(Bernoulli is a distribution with 2 states)

A programming language to easily program inference in factor graphs and other machine learning tasks.

(94)

Another Example: Two coin example

• Draw coin1 (𝑥1) and coin2 (𝑥2) independently.

• Each coin has equal probability of being heads (1) or tails (0)

• A new random variable $z$ is True if and only if both coins are heads:

$z = x_1 \wedge x_2$

(95)

Write this as a directed graphical model

$x_1, x_2$ with $x_i \in \{0, 1\}$ and $p(x_i{=}1) = p(x_i{=}0) = 0.5$; both are parents of $z$.

Value x1 | Value x2 | P(z=1|x1,x2) | P(z=0|x1,x2)
0        | 0        | 0            | 1
0        | 1        | 0            | 1
1        | 0        | 0            | 1
1        | 1        | 1            | 0

Compute the marginal:

$p(z) = \sum_{x_1, x_2} p(z, x_1, x_2) = \sum_{x_1, x_2} p(z \mid x_1, x_2)\, p(x_1)\, p(x_2)$

$p(z{=}1) = 1 \cdot 0.5 \cdot 0.5 = 0.25$

$p(z{=}0) = 1 \cdot 0.5 \cdot 0.5 + 1 \cdot 0.5 \cdot 0.5 + 1 \cdot 0.5 \cdot 0.5 = 0.75$
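The same numbers fall out of a few lines of plain-Python enumeration, a stand-in for what a probabilistic programming language automates; the conditional query at the end is an extra illustration (not computed on this slide) of the kind of query discussed earlier.

```python
from itertools import product

p_x = {0: 0.5, 1: 0.5}                      # each coin: heads (1) or tails (0)

def p_z_given(x1, x2, z):
    return 1.0 if z == (x1 & x2) else 0.0   # z is deterministic given the coins

# Marginal p(z) by enumeration
p_z = {z: sum(p_z_given(x1, x2, z) * p_x[x1] * p_x[x2]
              for x1, x2 in product([0, 1], repeat=2))
       for z in (0, 1)}
print(p_z)                                   # {0: 0.75, 1: 0.25}

# Conditional query: p(x1 = 1 | z = 0)
joint_x1_z0 = sum(p_z_given(1, x2, 0) * p_x[1] * p_x[x2] for x2 in (0, 1))
print(joint_x1_z0 / p_z[0])                  # 1/3
```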

(96)

Infer.net example: 2 coins

[Screenshots: the Infer.NET program for the two-coin model, the result of running it, and an addition to the program]

(97)

Conversion DGM, UGM, Factor graphs

• In Infer.NET all models are converted to factor graphs, and inference is then done on the factor graph

• In practice you have one concrete distribution (not a family)

• Let us first convert a directed graphical model to an undirected one, and then from the undirected model to a factor graph

(98)

Directed Graphical model to undirected one

• Take the joint distribution, replace all conditioning “symbols” with commas, and then normalize to get a correct distribution.

Directed graphical model: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{parents}(x_i))$

This gives the UGM: $p(x_1, \ldots, x_n) = \frac{1}{f} \prod_{i=1}^{n} p(x_i, \mathrm{parents}(x_i))$,

where $f = \sum_{\boldsymbol{x}} \prod_{i=1}^{n} p(x_i, \mathrm{parents}(x_i))$

Note: the function $p(x_i, \mathrm{parents}(x_i))$ is no longer a probability, so it should rather be called a factor: $\psi(x_i, \mathrm{parents}(x_i)) = p(x_i, \mathrm{parents}(x_i))$

Comment: going from an undirected graphical model to a directed one is difficult, since we do not have the individual distributions, just a big product of factors.

(99)

From Directed Graphical model to undirected

This step is called “moralization”:

1) All parents of a common child node are connected (“married”)

2) All directed links are converted to undirected ones

$p(x_1, x_2, x_3, x_4) = p(x_1 \mid x_2, x_3)\, p(x_2)\, p(x_3 \mid x_4)\, p(x_4)$

[Directed graph: $x_2 \to x_1$, $x_3 \to x_1$, $x_4 \to x_3$]

becomes

$p(x_1, x_2, x_3, x_4) = \frac{1}{f} \big( p(x_1, x_2, x_3)\, p(x_2)\, p(x_3, x_4)\, p(x_4) \big)$

[Undirected graph: the same links without directions, plus an added edge between the co-parents $x_2$ and $x_3$]
