Intelligent Systems:
Directed Graphical models
Carsten Rother
Lernräume (study sessions)
• 1-2 sessions per week: 9.2. - 13.2.
• Teachers:
• Holger Heidrich: Holger.Heidrich@tu-dresden.de
• Oliver Groth: Oliver.Groth@tu-dresden.de
• Where? To be announced
• How many people would come?
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Notation
• Dmitrij Schlesinger has used the following notation:
• Input data (discrete, continuous): 𝑥
• Output class (discrete): 𝑘 ∈ K
• Parameters (derived during learning): 𝜃
• Posterior for recognition: 𝑝(𝑘|𝑥, 𝜃)
• I will use a slightly different notation:
• Input data (discrete, continuous): 𝑧 ∈ Z
• Output class (discrete): 𝑥 ∈ {0, … , 𝐾} = K
• Parameters (derived during learning): 𝜃
• Posterior for recognition: 𝑝(𝑥|𝑧, 𝜃)
Note: for images we have many variables, e.g. 1 million.
The random variables are then: 𝒙 = (𝑥₁, … , 𝑥ₙ) ∈ Kⁿ
Probabilities - Reminder
• A random variable is denoted with 𝑥 ∈ {0, … , 𝐾}
• Discrete probability distribution: 𝑝(𝑥) satisfies Σₓ 𝑝(𝑥) = 1
• Joint distribution of two random variables: 𝑝(𝑥, 𝑧)
• Conditional distribution: 𝑝(𝑥|𝑧)
• Sum rule (marginal distribution): 𝑝(𝑧) = Σₓ 𝑝(𝑥, 𝑧)
• Independent probability distribution: 𝑝(𝑥, 𝑧) = 𝑝(𝑧) 𝑝(𝑥)
• Product rule: 𝑝(𝑥, 𝑧) = 𝑝(𝑧|𝑥) 𝑝(𝑥)
• Bayes' rule: 𝑝(𝑥|𝑧) = 𝑝(𝑧|𝑥) 𝑝(𝑥) / 𝑝(𝑧)
Probabilities - Reminder
Fill in the missing entries correctly.

1) 𝑝(𝑥₁, 𝑥₂) where 𝑥ᵢ ∈ {0,1}:

         𝑥₂=0   𝑥₂=1
𝑥₁=0     0.1    0.7
𝑥₁=1      ?      ?

Answer: 0.1 and 0.1 (the joint must sum to 1 over all entries).

2) 𝑝(𝑥₁|𝑥₂) where 𝑥ᵢ ∈ {0,1}:

         𝑥₂=0   𝑥₂=1
𝑥₁=0     0.1    0.7
𝑥₁=1      ?      ?

Answer: 0.9 and 0.3 (each conditional, i.e. each column with fixed 𝑥₂, must sum to 1).
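The two normalization constraints used in this exercise can be checked mechanically. A minimal sketch (the tables are the ones from the exercise, with the filled-in answers):

```python
# Checking the filled-in tables: a joint distribution must sum to 1 overall,
# a conditional p(x1|x2) must sum to 1 for each fixed value of x2.
joint = {(0, 0): 0.1, (0, 1): 0.7,   # given row x1 = 0, keyed by (x1, x2)
         (1, 0): 0.1, (1, 1): 0.1}   # filled-in row x1 = 1
cond = {(0, 0): 0.1, (0, 1): 0.7,    # given row x1 = 0
        (1, 0): 0.9, (1, 1): 0.3}    # filled-in row x1 = 1

assert abs(sum(joint.values()) - 1.0) < 1e-9
for x2 in (0, 1):
    assert abs(cond[(0, x2)] + cond[(1, x2)] - 1.0) < 1e-9
print("both tables are valid distributions")
```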
Machine Learning: Structured versus Unstructured Models
Example from previous lecture:
𝑝: ℝ² → {1,2} (classification)
"Normal/Unstructured" Output Prediction:
• 𝑓/𝑝: Zᵐ → N (classification) (Z can be an arbitrary set)
Example: What is the topic? Input: text (Zᵐ), Output: category (N), e.g. {Economy(0), Sports(1), Literature(2), …}
Example: What language is it? Input: text (Zᵐ), Output: category (N), e.g. {German(0), English(1), Swedish(2), …}
Machine Learning: Structured versus Unstructured Models
"Normal/Unstructured" Output Prediction:
• 𝑓/𝑝: Zᵐ → ℝⁿ (regression)
Example: Input: image (Zᵐ), Output: object class (N) + localization (ℝ⁴), e.g. "bicycle" plus a bounding box
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Example: Image Labelling (Computer Vision)
K has a fixed vocabulary, e.g. K = {Wall, Picture, Person, Clutter, …}
Important: The labeling of neighboring pixels is highly correlated
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Text Processing
Input: text (Zᵐ), Output: Xⁿ (parse tree of a sentence), e.g. "The boy went home"
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Biology
Input: 3D microscopic image (Zᵐ, 3D data)
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning: Structured versus Unstructured Models
Structured Output Prediction:
• 𝑓/𝑝: Zᵐ → Xⁿ (for example: X = ℝ or X = N)
Important: the elements in X do not make independent decisions
Example: Chemistry
Definition (not formal):
The output consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Machine Learning – Big Picture
• Input data: 𝒛 ∈ Zᵐ (discrete or continuous information)
• Output label: 𝒙 ∈ Kⁿ (in our lectures mostly discrete)
• Free variables: Θ
• Loss function: 𝐶: 𝐷 × K → ℝ
• The model: 𝑝(𝒙, 𝒛, Θ) or 𝑓(𝒙, 𝒛, Θ)
• These three steps are done in order:
1. Modelling: Write down 𝑝(𝒙, 𝒛, Θ), 𝑓(𝒙, 𝒛, Θ) or 𝑝(𝒙, 𝒛), 𝑓(𝒙, 𝒛)
2. (Training/Learning): In case Θ is present, determine it using a training data set, e.g. in the supervised case {(𝒙₁, 𝒛₁), (𝒙₂, 𝒛₂), (𝒙₃, 𝒛₃), …}
3. (Testing/Prediction/Inference): Given test data 𝒛, make the optimal prediction wrt the loss function 𝐶, e.g. MAP prediction:
𝒙* = argmaxₓ 𝑝(𝒙, 𝒛, Θ)
Structured Models
Key Idea: Write down the joint probability distribution as a so-called Graphical Model:
• Directed graphical model (also called Bayesian Network) - DGM
• Undirected graphical model (also called Markov Random Field) - UGM
• Factor graph (a specific form of an undirected graphical model) - FG
• A graphical model is a visualization tool for a distribution
• We use the representation where the graph is as simple as possible (as few links as possible)
• You can convert between the three representations
(i.e. express the same joint distribution in three different ways)
• A visualization represents a family of distributions (but that is less relevant for us)
For us:
• We only consider DGM and FG
Structured Models - Example
Write down the joint probability distribution as a so-called Graphical model:
Output: 3 different variables, X = {0,1}³, i.e. 𝑥ᵢ ∈ {0,1}
Model: 𝑝(𝑥₁, 𝑥₂, 𝑥₃ | 𝒛)
• Directed graphical model (Bayesian Network) - DGM:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂) 𝑝(𝑥₃)
• Undirected graphical model (Markov Random Field) - UGM:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = (1/𝑓) 𝜓(𝑥₁, 𝑥₂, 𝑥₃) 𝜓(𝑥₂) 𝜓(𝑥₃)
• Factor graph (a specific form of an undirected graphical model) - FG:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = (1/𝑓) 𝜓(𝑥₁, 𝑥₂, 𝑥₃)
(The corresponding graph over the nodes 𝑥₁, 𝑥₂, 𝑥₃ is shown next to each representation.)
Structured Models - when to use what representation?
• Directed graphical model: The unknown variables have different "meanings"
Example: MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
• Factor graphs: the unknown variables all have the same "meaning"
Examples: Pixels in an image, nuclei in C. elegans (worm)
• Undirected graphical models are used, instead of factor graphs, when we are interested in studying "conditional independence" (not relevant for our context)
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Directed Graphical Model: Intro
• Since variables have a meaning it makes sense to look at conditional dependency
(Comment: in factor graphs and undirected graphical models this does not give any insights)
• All variables that are linked by an arrow have a causal, directed relationship
(Figure: example graphs with the nodes "Car is broken", "Eat sweets", "Hole in tooth", "Toothache")
(Comment: the nodes do not have to be discrete variables, they can also be continuous, as long as all distributions are well defined)
Directed Graphical Model: Intro
• Rewrite a distribution using the product rule:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁, 𝑥₂|𝑥₃) 𝑝(𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
• Any joint distribution can be written as a product of conditionals:
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ)
• Visualize the conditional 𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ) as a node 𝑥ᵢ with incoming arrows from 𝑥ᵢ₊₁, … , 𝑥ₙ
Directed Graphical Models: Examples
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)   "all links present"
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)   "one link is absent"
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂) 𝑝(𝑥₃)   "one link is absent"
(The corresponding graph over 𝑥₁, 𝑥₂, 𝑥₃ is shown next to each factorization.)
Comments:
1. If links are absent then (often) prediction in the model is simpler.
2. Our goal during model creation will be to find a model with as few links as possible.
Directed Graphical Model: Definition
• Given a directed graph 𝐺 = (𝑉, 𝐸), where 𝑉 is the set of nodes and 𝐸 the set of directed edges
• A directed graphical model defines the family of distributions:
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
• The set 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ) is a subset of the variables in {𝑥ᵢ₊₁, … , 𝑥ₙ}
• 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ)) is visualized as a node 𝑥ᵢ with incoming arrows from its parents
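This definition can be coded directly: store one conditional probability table (CPT) per node and multiply them. A minimal sketch, assuming binary variables and the chain x3 → x2 → x1 from the examples above; all probability values here are illustrative, not from the lecture:

```python
# A directed graphical model as a dict of CPTs; the joint is the
# product p(x1,...,xn) = prod_i p(x_i | parents(x_i)).
# CPT keys are tuples (own value, parent values...).
model = {
    "x3": {"parents": [], "cpt": {(0,): 0.6, (1,): 0.4}},
    "x2": {"parents": ["x3"], "cpt": {(0, 0): 0.9, (1, 0): 0.1,
                                      (0, 1): 0.3, (1, 1): 0.7}},
    "x1": {"parents": ["x2"], "cpt": {(0, 0): 0.5, (1, 0): 0.5,
                                      (0, 1): 0.2, (1, 1): 0.8}},
}

def joint(assignment):
    """Evaluate the joint as a product of conditionals."""
    p = 1.0
    for var, spec in model.items():
        key = (assignment[var],) + tuple(assignment[pa] for pa in spec["parents"])
        p *= spec["cpt"][key]
    return p

# Sanity check: the joint over all assignments sums to 1.
total = sum(joint({"x1": a, "x2": b, "x3": c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)
```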
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Running Example
• Situation
• I am at work
• John calls to say that the alarm in my house is on
• Mary, who is my neighbor, did not call me
• Both, John and Mary, usually call me when the alarm is on
• The alarm is usually set off by burglars, but sometimes also by a minor earthquake
• Can we construct the directed graphical model?
• First Step: Identify variables that can have different “states”:
• JohnCalls(J), MaryCalls(M), AlarmOn(A), BurglarInHouse(B),
EarthQuakeHappening(E): 𝐽, 𝑀, 𝐴, 𝐵, 𝐸 ∈ {𝑡𝑟𝑢𝑒, 𝑓𝑎𝑙𝑠𝑒}
How to construct a Directed Graphical Model
1) Choose an ordering of the variables: 𝑥ₙ, … , 𝑥₁
2) For 𝑖 = 𝑛 … 1:
   1) Add node 𝑥ᵢ to the network (model)
   2) Select, as parents, those variables already in the network that influence 𝑥ᵢ (i.e. that 𝑥ᵢ depends on):
      𝑝(𝑥ᵢ|𝑥ᵢ₊₁, … , 𝑥ₙ) = 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M)
Joint (so far): 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J)
Joint (so far): 𝑝(𝑀)
𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call,
because of the alarm which may be on.
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J)
𝑝(𝐽|𝑀) = 𝑝(𝐽)?
No. If Mary calls then John is also likely to call because of the alarm which may be on.
Joint (so far): 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A)
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls there is a higher chance that the alarm is on.
Joint (so far): 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A)
𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝐽)?  𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴|𝑀)?
No. If Mary calls or John calls there is a higher chance that the alarm is on.
Joint (so far): 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵)?
No. The chance that there is a burglar in the house is not independent of the alarm being on or John's calls or Mary's calls.
Joint (so far): 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B)
𝑝(𝐵|𝐴, 𝐽, 𝑀) = 𝑝(𝐵|𝐴)?
Yes. If we know that the alarm is on then the information that Mary or John called is not relevant anymore!
Joint (so far): 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸)?
No. If Mary calls or John calls or the alarm is on then the chances of an earthquake are higher.
𝑝(𝐸|𝐴, 𝐽, 𝑀, 𝐵) = 𝑝(𝐸|𝐴, 𝐵)?
Yes. If we know that the alarm is on then the information that Mary or John called is not relevant. The link from B to E is present because it adds additional information: assume the alarm is on. Then the probability for E=1 depends on the burglar. If the burglar is in the house then this could have caused the alarm being on, and hence 𝑝(𝐸 = 1) is lower.
Joint (so far): 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example
• Let us select the ordering:
MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E)
• Build the model step by step. Final joint:
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀)
Running Example: Optimal Order
• Was that the best order to get a directed graphical model with as few links as possible?
• The optimal order is: 𝐵, 𝐸, 𝐴, 𝑀, 𝐽
(Graph: BurglarInHouse(B) and EarthQuakeHappening(E) point to AlarmOn(A); A points to MaryCalls(M) and JohnCalls(J).)
How to find that order? Temporal order, i.e. how things happened!
Joint: 𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
Running Example: Worst case scenario
• The worst possible ordering: 𝑀, 𝐽, 𝐸, 𝐵, 𝐴
(Graph over MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E))
This is a fully connected graph!
Running Example: Full Specification of Model
• Specify probabilities:

𝑷(𝑩=𝟏)   𝑷(𝑩=𝟎)
 0.001     0.999

𝑷(𝑬=𝟏)   𝑷(𝑬=𝟎)
 0.002     0.998

 𝑩      𝑬      𝑷(𝑨=𝟏|𝑩,𝑬)   𝑷(𝑨=𝟎|𝑩,𝑬)
True   True       0.95          0.05
True   False      0.94          0.06
False  True       0.29          0.71
False  False      0.001         0.999

 𝑨      𝑷(𝑴=𝟏|𝑨)   𝑷(𝑴=𝟎|𝑨)
True       0.7          0.3

 𝑨      𝑷(𝑱=𝟏|𝑨)   𝑷(𝑱=𝟎|𝑨)
True       0.9          0.1

(Graph: 𝐵 and 𝐸 point to 𝐴; 𝐴 points to 𝑀 and 𝐽.)
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐽|𝐴) 𝑝(𝑀|𝐴) 𝑝(𝐴|𝐵, 𝐸) 𝑝(𝐸) 𝑝(𝐵)
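The fully specified network can be coded directly from the tables above. A minimal sketch (variable names are ours; the CPT numbers are the ones from the slide — since the slide only gives 𝑃(𝑀|𝐴) and 𝑃(𝐽|𝐴) for 𝐴 = true, the sketch only evaluates states with 𝐴 = 1):

```python
# CPTs of the alarm network as specified on the slide.
p_B = {1: 0.001, 0: 0.999}
p_E = {1: 0.002, 0: 0.998}
p_A1_given_BE = {(1, 1): 0.95, (1, 0): 0.94,
                 (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
p_M1_given_A1 = 0.7   # P(M=1 | A=1)
p_J1_given_A1 = 0.9   # P(J=1 | A=1)

def joint(J, M, A, B, E):
    """p(M,J,A,B,E) = p(J|A) p(M|A) p(A|B,E) p(E) p(B), for A = 1."""
    assert A == 1, "the slide only specifies P(M|A) and P(J|A) for A = true"
    pJ = p_J1_given_A1 if J == 1 else 1 - p_J1_given_A1
    pM = p_M1_given_A1 if M == 1 else 1 - p_M1_given_A1
    return pJ * pM * p_A1_given_BE[(B, E)] * p_E[E] * p_B[B]

p = joint(J=1, M=1, A=1, B=0, E=0)
print(round(p, 6))  # the slide's first query below: ≈ 0.000628
```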
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
What actions can we do?
Make queries:
• Compute the probability of a model state, e.g. 𝑝(𝑥₁ = 1, 𝑥₂ = 0, 𝑥₃ = 1)
• Compute the marginal of a single variable: 𝑝(𝑥ᵢ)
• Compute the marginal of two variables: 𝑝(𝑥ᵢ, 𝑥ⱼ)
• Compute a conditional probability: 𝑝(𝑥ᵢ|𝑥ₖ = 0)
• Derive the most likely state: (𝑥₁*, 𝑥₂*, 𝑥₃*) = argmax 𝑝(𝑥₁, 𝑥₂, 𝑥₃)
Make decisions:
• Add a loss function to the network and then ask questions about the outcome
Other tasks:
• Give extra evidence to the model
• Value of information: What evidence should we give to the model in order to make all marginals as unique (peaked) as possible?
• Sensitivity analysis: How does the network "behave" if I give evidence for 𝑥₁, 𝑥₂, 𝑥₃?
Running Example: Queries
• What is the probability that John calls and Mary calls and the alarm is on, but there is no burglar and no earthquake?
𝑝(𝐽=1, 𝑀=1, 𝐴=1, 𝐵=0, 𝐸=0)
= 𝑝(𝐵=0) 𝑝(𝐸=0) 𝑝(𝐴=1|𝐵=0, 𝐸=0) 𝑝(𝐽=1|𝐴=1) 𝑝(𝑀=1|𝐴=1)
= 0.999 ∗ 0.998 ∗ 0.001 ∗ 0.7 ∗ 0.9 = 0.000628
• What is the probability that John calls and Mary calls and the alarm is on?
𝑝(𝐽=1, 𝑀=1, 𝐴=1) = Σ_{𝐵,𝐸} 𝑝(𝐽=1, 𝑀=1, 𝐴=1, 𝐵, 𝐸)
= 𝑝(𝐽=1|𝐴=1) 𝑝(𝑀=1|𝐴=1) Σ_{𝐵,𝐸} 𝑝(𝐵) 𝑝(𝐸) 𝑝(𝐴=1|𝐵, 𝐸)
= 0.7 ∗ 0.9 ∗ (0.999 ∗ 0.998 ∗ 0.001 + 0.999 ∗ 0.002 ∗ 0.29 + 0.001 ∗ 0.998 ∗ 0.94 + 0.001 ∗ 0.002 ∗ 0.95) ≈ 0.00159
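The second query, summing out 𝐵 and 𝐸, can be checked by brute-force enumeration with the CPT numbers from the slides:

```python
# p(J=1, M=1, A=1) = p(J=1|A=1) p(M=1|A=1) * sum_{B,E} p(B) p(E) p(A=1|B,E)
p_B = {1: 0.001, 0: 0.999}
p_E = {1: 0.002, 0: 0.998}
p_A1_given_BE = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

inner = sum(p_B[b] * p_E[e] * p_A1_given_BE[(b, e)]
            for b in (0, 1) for e in (0, 1))   # this is also p(A=1)
p_query = 0.9 * 0.7 * inner
print(round(inner, 6), round(p_query, 5))
```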
Running Example: Queries (done in exercise)
• What is the marginal probability that John calls: 𝑝(𝐽 = 1)?
• What is the conditional probability that John calls given that Mary calls:
𝑝(𝐽 = 1|𝑀 = 1)?
• What is the most likely state in the network?
(𝐽*, 𝑀*, 𝐴*, 𝐵*, 𝐸*) = argmax_{𝐽,𝑀,𝐴,𝐵,𝐸} 𝑝(𝐽, 𝑀, 𝐴, 𝐵, 𝐸)
How to do this in the best way?
Compute this also for different conditionals, e.g. 𝑝(𝐴, 𝐵, 𝐸 | 𝐽 = 1, 𝑀 = 1)
• Make decisions (example):
The decision is to call the police (𝑑 = 1) or not to call the police (𝑑 = 0).
Loss function: 𝐶(𝑑 = 0, 𝐵 = 1) = 10; 𝐶(𝑑 = 0, 𝐵 = 0) = 0;
𝐶(𝑑 = 1, 𝐵 = 1) = 0; 𝐶(𝑑 = 1, 𝐵 = 0) = 2
Bayesian risk minimization: 𝑅(𝑑) = Σ_{𝐽,𝑀,𝐴,𝐵,𝐸} 𝑝(𝐽, 𝑀, 𝐴, 𝐵, 𝐸) 𝐶(𝑑, 𝐵) → min over 𝑑
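Since the loss above depends only on 𝐵, the risk reduces to 𝑅(𝑑) = Σ_𝐵 𝑝(𝐵) 𝐶(𝑑, 𝐵). A minimal sketch, using the prior 𝑝(𝐵 = 1) = 0.001 from the slides (i.e. before any evidence such as 𝐽 = 1, 𝑀 = 1 is added; with evidence one would use the posterior over 𝐵 instead):

```python
# Bayesian risk minimization for the call-the-police decision.
p_B = {1: 0.001, 0: 0.999}                       # prior over burglar
C = {(0, 1): 10, (0, 0): 0, (1, 1): 0, (1, 0): 2}  # C[(d, B)]

def risk(d):
    # R(d) = sum_B p(B) C(d, B), since the loss only depends on B
    return sum(p_B[b] * C[(d, b)] for b in (0, 1))

best = min((0, 1), key=risk)
print(risk(0), risk(1), best)  # with the prior alone, not calling is cheaper
```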
Queries in Network: Definition
• We call the query (𝑥₁*, 𝑥₂*, 𝑥₃*) = argmax 𝑝(𝑥₁, 𝑥₂, 𝑥₃)
"Normal" Inference (Prediction) or MAP inference
• We call the query 𝑝(𝑥₁), 𝑝(𝑥₂), 𝑝(𝑥₃)
Probabilistic Inference
In the following we look at two different strategies:
• Exact computation via enumeration / message passing
• Approximate computation via sampling
(Probabilistic) inference by enumeration
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
Brute force - just add up all:
𝑝(𝑥₂) = Σ_{𝑥₁,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = Σ_{𝑥₁,𝑥₃} 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
Number of operations: 2² = 4
Number of accesses to a probability table: 2² ∗ 3 = 12
Brute force - just maximize over all:
𝑥₂* = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
(Probabilistic) inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
𝑝(𝑥₁) = Σ_{𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = Σ_{𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
     = Σ_{𝑥₂} 𝑝(𝑥₁|𝑥₂) [Σ_{𝑥₃} 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)]
The bracketed term is a function ("message" 𝑀(𝑥₂)) that depends on 𝑥₂ only. This is also the marginal 𝑝(𝑥₂).
Number of operations: 2 + 2 = 4
Number of accesses to a probability table: 2 ∗ 2 + 2 ∗ 2 = 8
(Probabilistic) inference by enumeration
In some cases you can reduce the cost of computations:
𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃) where 𝑥ᵢ ∈ {0,1}
𝑥₁* = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = argmax_{𝑥₁,𝑥₂,𝑥₃} 𝑝(𝑥₁|𝑥₂) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
     = argmax_{𝑥₁,𝑥₂} 𝑝(𝑥₁|𝑥₂) [max_{𝑥₃} 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)]
The bracketed term is a function ("message" 𝑀(𝑥₂)) that depends on 𝑥₂ only.
Number of operations: 2 + 2² = 6
Algorithm for marginal computation ("Sum-Product Message Passing"):
1. Compute messages from right to left
2. Read out all marginals
Algorithm for optimal state computation ("Max-Product Message Passing"):
1. Compute messages from right to left
2. Backtrace the correct solution
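The message-passing trick above can be sketched for the chain x3 → x2 → x1; the CPT numbers below are illustrative, any valid tables work:

```python
# Sum-product message passing on the chain x3 -> x2 -> x1.
p_x3 = [0.6, 0.4]                      # p(x3)
p_x2_x3 = [[0.9, 0.3], [0.1, 0.7]]    # p_x2_x3[x2][x3] = p(x2|x3)
p_x1_x2 = [[0.5, 0.2], [0.5, 0.8]]    # p_x1_x2[x1][x2] = p(x1|x2)

# Message from x3 to x2: M(x2) = sum_x3 p(x2|x3) p(x3) -- this is also p(x2)
M = [sum(p_x2_x3[x2][x3] * p_x3[x3] for x3 in (0, 1)) for x2 in (0, 1)]
# Marginal: p(x1) = sum_x2 p(x1|x2) M(x2)
p_x1 = [sum(p_x1_x2[x1][x2] * M[x2] for x2 in (0, 1)) for x1 in (0, 1)]

# Brute-force enumeration over all 2^3 states gives the same marginal,
# but touches the tables more often -- that is the point of messages.
brute = [sum(p_x1_x2[x1][x2] * p_x2_x3[x2][x3] * p_x3[x3]
             for x2 in (0, 1) for x3 in (0, 1)) for x1 in (0, 1)]
print(M, p_x1, brute)
```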
(Probabilistic) inference in Chains
(Chain 𝑥₁ - 𝑥₂ - 𝑥₃ - 𝑥₄; done in exercise)
Comments:
• This algorithm can be extended to arbitrary graphs (Junction Tree Method)
• In undirected graphical models we run similar algorithms
Reminder: HMM from lecture 3 (Michael Schröder)
• The Viterbi Algorithm gives the optimal state 𝑠
• The Viterbi Algorithm is a different name for Max-Product Message Passing
(Probabilistic) inference by sampling
Goal: create one valid sample from the distribution
𝑝(𝑥₁, … , 𝑥ₙ) = ∏ᵢ₌₁ⁿ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
where 𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ) is a subset of the variables in {𝑥ᵢ₊₁, … , 𝑥ₙ}
Example: 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
This procedure is called ancestor sampling or prior sampling:
For 𝑖 = 𝑛 … 1:
  Sample 𝑥ᵢ ∼ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Output (𝑥₁, … , 𝑥ₙ)
(Probabilistic) inference by sampling - Example
Example: 𝑝(𝑥₁, 𝑥₂, 𝑥₃) = 𝑝(𝑥₁|𝑥₂, 𝑥₃) 𝑝(𝑥₂|𝑥₃) 𝑝(𝑥₃)
Procedure:
For 𝑖 = 𝑛 … 1:
  Sample 𝑥ᵢ ∼ 𝑝(𝑥ᵢ|𝑝𝑎𝑟𝑒𝑛𝑡𝑠(𝑥ᵢ))
Output (𝑥₁, … , 𝑥ₙ)
(Figure, Steps 1-3: on the graph over 𝑥₁, 𝑥₂, 𝑥₃, first sample 𝑥₃, then 𝑥₂, then 𝑥₁.)
How to sample a single variable
How do we sample from a general discrete probability distribution of one variable 𝑝(𝑥), 𝑥 ∈ {0, 1, … , 𝑛}?
1. Define "intervals" whose lengths are proportional to 𝑝(𝑥)
2. Concatenate these intervals
3. Sample uniformly from the composed interval
4. Check which interval the sampled value falls into
Example: 𝑝(𝑥) ∝ {1, 2, 3} (three values), i.e. intervals of lengths 1, 2 and 3.
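The four steps above can be sketched directly, using the example weights 𝑝(𝑥) ∝ {1, 2, 3} (function name and structure are ours):

```python
import random

def sample_discrete(weights, rng=random):
    """Sample an index with probability proportional to weights[i]."""
    total = sum(weights)             # steps 1-2: total length of the intervals
    u = rng.uniform(0.0, total)      # step 3: uniform sample into the interval
    acc = 0.0
    for i, w in enumerate(weights):  # step 4: find the containing sub-interval
        acc += w
        if u <= acc:
            return i
    return len(weights) - 1          # guard against float round-off

# Empirical check: frequencies should approach 1/6, 2/6, 3/6.
counts = [0, 0, 0]
rng = random.Random(0)
for _ in range(60000):
    counts[sample_discrete([1, 2, 3], rng)] += 1
print([c / 60000 for c in counts])
```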
(Probabilistic) inference by sampling
• By running ancestor sampling often enough we get (in the limit) the true joint distribution out (law of large numbers)
Procedure:
1) Let 𝑁(𝑥₁, … , 𝑥ₙ) denote the number of times we have seen a certain sample (𝑥₁, … , 𝑥ₙ)
2) 𝑃′(𝑥₁, … , 𝑥ₙ) = 𝑁(𝑥₁, … , 𝑥ₙ) / 𝑁, where 𝑁 is the total number of samples
• It holds: 𝑃(𝑥₁, … , 𝑥ₙ) = 𝑃′(𝑥₁, … , 𝑥ₙ) when 𝑁 → ∞
• The estimated joint distribution can now be used to compute local marginals:
𝑃(𝑥ᵢ = 𝑘) = Σ_{𝒙: 𝑥ᵢ=𝑘} 𝑃(𝑥₁, … , 𝑥ᵢ = 𝑘, … , 𝑥ₙ)
• It can also be used to compute the optimal state of 𝑃(𝑥₁, … , 𝑥ₙ)
• In practice this procedure only works well for smaller graphs, since the number of joint states (and hence the number of samples needed) grows exponentially with the number of variables
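Ancestor sampling plus frequency counting can be sketched on a small chain x3 → x2 → x1; the CPT numbers below are illustrative:

```python
import random

p_x3 = [0.6, 0.4]                    # p(x3)
p_x2_x3 = [[0.9, 0.3], [0.1, 0.7]]   # p(x2|x3), indexed [x2][x3]
p_x1_x2 = [[0.5, 0.2], [0.5, 0.8]]   # p(x1|x2), indexed [x1][x2]

def bernoulli(p1, rng):
    """Return 1 with probability p1."""
    return 1 if rng.random() < p1 else 0

def ancestor_sample(rng):
    # Sample parents before children, following the factorization.
    x3 = bernoulli(p_x3[1], rng)
    x2 = bernoulli(p_x2_x3[1][x3], rng)
    x1 = bernoulli(p_x1_x2[1][x2], rng)
    return x1, x2, x3

# Estimate the marginal p(x2=1) from sample frequencies.
rng = random.Random(0)
N = 100000
count_x2 = sum(ancestor_sample(rng)[1] for _ in range(N))
print(count_x2 / N)  # should approach p(x2=1) = 0.1*0.6 + 0.7*0.4 = 0.34
```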
Example: ancestor sampling
Roadmap for this lecture
• Introduction and Recap
• Directed Graphical models
• Definition
• Creating a model from textual description
• Perform queries and make decisions
• Add evidence to the model
• A few Real-world examples
• Probabilistic Programming
Add Evidence to the Model
• New evidence: We observe that the alarm is on
• Question: How do we update the network?
• Idea 1:
Change 𝑝(𝐴|𝐽, 𝑀) in such a way that 𝑝(𝐴 = 1|𝐽, 𝑀) = 1 and 𝑝(𝐴 = 0|𝐽, 𝑀) = 0 for all 𝐽, 𝑀.
But this means that 𝑝(𝐴|𝐽, 𝑀) = 𝑝(𝐴), and the joint becomes
𝑝(𝑀, 𝐽, 𝐴, 𝐵, 𝐸) = 𝑝(𝐸|𝐴, 𝐵) 𝑝(𝐵|𝐴) 𝑝(𝐴) 𝑝(𝐽|𝑀) 𝑝(𝑀)
• Problem: It decouples 𝑀, 𝐽 from the rest of the network, since the links to 𝐴 are removed.
(Graph: MaryCalls(M), JohnCalls(J), AlarmOn(A), BurglarInHouse(B), EarthQuakeHappening(E) with the factors 𝑝(𝑀), 𝑝(𝐽|𝑀), 𝑝(𝐸|𝐴, 𝐵), 𝑝(𝐵|𝐴))
Add Evidence to the Model
• New evidence: We observe that the alarm is on
• Question: How do we update the network?
• Idea 2:
Compute the new conditional:
𝑝(𝑀, 𝐽, 𝐵, 𝐸 | 𝐴 = 1) = 𝑝(𝐸|𝐴 = 1, 𝐵) 𝑝(𝐵|𝐴 = 1) 𝑝(𝐴 = 1|𝐽, 𝑀) 𝑝(𝐽|𝑀) 𝑝(𝑀) / 𝑝(𝐴 = 1)
• Advantage: all links in the network are kept, and this change does affect the rest of the network, e.g. the marginal 𝑝(𝑀)
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
Marginals: 𝑝(𝑠 = 0) = 0.5; 𝑝(𝑠 = 1) = 0.5
𝑝(𝑐 = 0) = 𝑝(𝑐 = 0|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 0|𝑠 = 1) 𝑝(𝑠 = 1) = 0.85
𝑝(𝑐 = 1) = 0.15
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

We know that the person smokes, i.e. 𝑠 = 1. What is the probability that he has cancer?
Joint: 𝑝(𝑐, 𝑠) = 𝑝(𝑐|𝑠) 𝑝(𝑠)
New model: fix this variable (𝑠 = 1); 𝑝(𝑐|𝑠 = 1) can be directly taken from the model.
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐), with the evidence 𝑠 = 1 fixed:

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
   1          0

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False   irrelevant    irrelevant

We know that the person smokes, i.e. 𝑠 = 1. What is the probability that he has cancer?
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐)

𝑷(𝒔 = 𝟏)   𝑷(𝒔 = 𝟎)
  0.5        0.5

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True       0.2           0.8
False      0.1           0.9

We know that the person has cancer, 𝑐 = 1. What is the probability that he smokes? Fix this variable (𝑐 = 1):
𝑝(𝑠 = 0|𝑐 = 1) = 𝑝(𝑠 = 0, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.05/0.15 = 1/3
𝑝(𝑠 = 1|𝑐 = 1) = 𝑝(𝑠 = 1, 𝑐 = 1) / 𝑝(𝑐 = 1) = 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1) / [𝑝(𝑐 = 1|𝑠 = 0) 𝑝(𝑠 = 0) + 𝑝(𝑐 = 1|𝑠 = 1) 𝑝(𝑠 = 1)] = 0.1/0.15 = 2/3
Add Evidence to the Model – simple Example
Smoker(𝑠) → Cancer(𝑐), with the evidence 𝑐 = 1 folded in:

𝑷(𝒔 = 𝟏|𝒄 = 𝟏)   𝑷(𝒔 = 𝟎|𝒄 = 𝟏)
     2/3               1/3

 𝒔      𝑷(𝒄 = 𝟏|𝒔)   𝑷(𝒄 = 𝟎|𝒔)
True        1             0
False       1             0

(The prior over 𝑠 is replaced by the posterior 𝑝(𝑠|𝑐 = 1), and 𝑝(𝑐 = 1|𝑠) is set to 1 since 𝑐 is observed.)
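The posterior in this example follows from Bayes' rule with the sum rule in the denominator; a minimal check using the slide's numbers:

```python
# Verifying the smoker/cancer posterior p(s | c=1) with Bayes' rule.
p_s = {1: 0.5, 0: 0.5}
p_c1_given_s = {1: 0.2, 0: 0.1}   # P(c=1 | s)

# p(c=1) via the sum rule, then p(s | c=1) via Bayes' rule
p_c1 = sum(p_c1_given_s[s] * p_s[s] for s in (0, 1))
posterior = {s: p_c1_given_s[s] * p_s[s] / p_c1 for s in (0, 1)}
print(p_c1, posterior)  # p(c=1) ≈ 0.15, posterior ≈ {0: 1/3, 1: 2/3}
```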