Intelligent Systems:
Undirected Graphical models – Inference and Applications
Carsten Rother
Roadmap for remaining lectures
• 11.12 (1): Computer Vision – a hard case for AI
• 11.12 (2): Probability Theory
• 18.12 (1): Exercise: probability theory
• 18.12 (2): Decision Making (Unstructured models)
• 8.1 (1): Maximum Likelihood Principle (Unstructured models)
• 8.1 (2): Discriminative Learning (Unstructured models)
• 15.1 (1): Exercise: Learning
• 15.1 (2): Discriminative (unstructured) Models
Lecturers: Carsten Rother and Dimitri Schlesinger
Roadmap for remaining lectures
• 22.1 (1): Undirected Graphical models: Inference and Applications
• 22.1 (2): Undirected Graphical models: Inference and Applications
• 29.1 (1): Exercise: Undirected Graphical models
• 29.1 (2): Recognition in Practice
• 5.2 (1): Probabilistic Inference in Undirected and Directed Graphical models
• 5.2 (2): Wrap up; Robot localization and Learning Interactive Systems
Lecturers: Carsten Rother and Dimitri Schlesinger
Roadmap next two lectures
• Define: Structured Models
• Formulate applications as discrete labeling problems
• Discrete Inference:
• Pixel-based: Iterative Conditional Mode (ICM)
• Line-based: Dynamic Programming (DP)
• Field-based: Graph Cut and Alpha-Expansion
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative function
Machine Learning: Big Picture
"Normal" Machine Learning:
f: Z → N (classification),   f: Z → R (regression)
Input: image, text
Output: real number(s)
Structured Output Prediction:
f: Z → X
Input: image, text
Output: complex structured object (labelling, parse tree)
Examples: parse tree of a sentence, image labelling, chemical structure
Structured Output Prediction
Ad hoc definition (from [Nowozin et al. 2011]): data that consists of several parts, where not only the parts themselves contain information, but also the way in which the parts belong together.
Graphical models to capture structured problems
Basic idea: write the probability distribution as a graphical model:
• Directed graphical model (also called Bayesian Network)
• Undirected graphical model (also called Markov Random Field)
• Factor graph (which we will use predominantly)
A graphical model is:
• A visualization that represents a family of distributions
• Key concept: conditional independence
• You can also convert between the representations
References:
- Pattern Recognition and Machine Learning [Bishop ‘08, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)
Notation
• Dimitri Schlesinger has used the following notation:
  • Input data (discrete, continuous): x
  • Output class (discrete): k ∈ K
  • Parameters (derived during learning): θ
  • Posterior for recognition: p(k | x, θ)
• I will (consistently) use a different notation:
  • Input data (discrete, continuous): z
  • Output class (discrete): x ∈ K (or L)
  • Parameters (derived during learning): θ
  • Posterior for recognition: p(x | z, θ)
Note: for images we have many variables, e.g. 1 million. The random variables are then x = (x_1, …, x_i, x_j, …, x_n).
Probabilities - Reminder
• A random variable is denoted by x ∈ {0, …, L}
• Discrete probability distribution: P(x) satisfies Σ_x P(x) = 1, where the random variable x ∈ {0, …, L}
• Joint distribution of two random variables: P(x, z)
• Conditional distribution: P(z | x)
• Sum rule (marginal distribution): P(z) = Σ_x P(x, z)
• Independence: P(x, z) = P(z) P(x)
• Product rule: P(x, z) = P(z | x) P(x)
• Bayes' rule: P(x | z) = P(z | x) P(x) / P(z)
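To make the reminder concrete, here is a minimal sketch (my own illustration, not from the slides) that checks the sum rule, product rule and Bayes' rule on a small hand-made joint table P(x, z); the numbers are made up:

# Minimal sketch: sum rule, product rule and Bayes' rule on a hand-made joint
# distribution P(x, z), with x in {0, 1, 2} and z in {0, 1}.
import numpy as np

P_xz = np.array([[0.10, 0.20],   # P(x=0, z=0), P(x=0, z=1)
                 [0.30, 0.05],   # P(x=1, z=0), P(x=1, z=1)
                 [0.25, 0.10]])  # P(x=2, z=0), P(x=2, z=1)
assert np.isclose(P_xz.sum(), 1.0)          # a valid joint distribution

P_z = P_xz.sum(axis=0)                       # sum rule: P(z) = sum_x P(x, z)
P_x = P_xz.sum(axis=1)                       # sum rule: P(x) = sum_z P(x, z)
P_z_given_x = P_xz / P_x[:, None]            # product rule: P(z|x) = P(x, z) / P(x)

# Bayes' rule: P(x|z) = P(z|x) P(x) / P(z)
P_x_given_z = (P_z_given_x * P_x[:, None]) / P_z[None, :]
print(P_x_given_z[:, 1])                     # posterior P(x | z = 1)
assert np.allclose(P_x_given_z.sum(axis=0), 1.0)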
Defining families of distributions
• P(x_1, x_2): general distribution
• P_2(x_1, x_2) = P(x_1 | x_2) P(x_2) = P(x_1) P(x_2): restricted family
• P_3(x_1, x_2) = N(x_1 | μ_1, σ_1) N(x_2 | μ_2, σ_2): concrete realization
Undirected Graphical models - example
[Figure: undirected graph with unobserved variables x_1, …, x_5 and an observed variable z; a circle denotes an unobserved variable, a shaded circle an observed variable]
Clique: a set of nodes where ALL nodes are connected, for example {x_1, x_2, x_4}
Undirected Graphical models - example
[Figure: the same undirected graph with unobserved variables x_1, …, x_5 and observed variable z]
Clique: a set of nodes where ALL nodes are connected, for example {x_1, x_2, x_4}
Maximum cliques (no node can be added): {x_1, x_2, x_4}, {x_4, x_2, x_3}, {x_4, x_5}
P(x_1, x_2, x_3, x_4, x_5) = 1/f · ψ(x_1, x_2, x_4) ψ(x_2, x_3, x_4) ψ(x_1, x_2) ψ(x_1, x_4) ψ(x_4, x_2) ψ(x_3, x_2) ψ(x_3, x_4) ψ(x_4, x_5) ψ(x_1) ψ(x_2) ψ(x_3) ψ(x_4) ψ(x_5)
Definition: Undirected Graphical models
• Given an undirected graph G = (V, E), where V is the set of nodes and E the set of edges
• An undirected graphical model defines a family of distributions:
P(x) = 1/f ∏_{C ∈ C(G)} ψ_C(x_C)   where   f = Σ_x ∏_{C ∈ C(G)} ψ_C(x_C)
f: partition function
C(G): set of all cliques
C: a clique, i.e. a subset of variable indices
ψ_C: factor (not a distribution) depending on x_C (ψ_C: K^|C| → R, where x_i ∈ K)
Definition: a clique is a set of nodes where all nodes are linked with an edge
Comment on definition
In some books the set C(G) is defined as the set of all maximum cliques only. The set of families of distributions is equivalent. For instance, a factor ψ(x_1, x_2) = (x_1 + x_2) x_1 can also be written as two factors ψ'(x_1, x_2) ψ'(x_1), where ψ'(x_1, x_2) = x_1 + x_2 and ψ'(x_1) = x_1.
Using all cliques:
P(x_1, x_2, x_3, x_4, x_5) = 1/f · ψ(x_1, x_2, x_4) ψ(x_2, x_3, x_4) ψ(x_1, x_2) ψ(x_1, x_4) ψ(x_4, x_2) ψ(x_3, x_2) ψ(x_3, x_4) ψ(x_4, x_5) ψ(x_1) ψ(x_2) ψ(x_3) ψ(x_4) ψ(x_5)
Using maximum cliques only:
P(x_1, x_2, x_3, x_4, x_5) = 1/f · ψ(x_1, x_2, x_4) ψ(x_2, x_3, x_4) ψ(x_4, x_5)
Filter View of Undirected Graphical Models
[Figure: the undirected graph over x_1, …, x_5 acts as a filter]
P(x_1, x_2, x_3, x_4, x_5): an arbitrary probability distribution. It passes the filter only if it can be written as
P(x_1, x_2, x_3, x_4, x_5) = 1/f · ψ(x_1, x_2, x_4) ψ(x_2, x_3, x_4) ψ(x_1, x_2) ψ(x_1, x_4) ψ(x_4, x_2) ψ(x_3, x_2) ψ(x_3, x_4) ψ(x_4, x_5) ψ(x_1) ψ(x_2) ψ(x_3) ψ(x_4) ψ(x_5)
The distributions that pass form a smaller family.
When would the filter let through all distributions?
Conditional Independence
[Figure: the undirected graph over x_1, …, x_5, with the nodes partitioned into sets A, B and C]
Does it hold that P(A, C | B) = P(A | B) P(C | B)?
- Yes, if all paths from A to C go through B
- Otherwise no
This is also written A ⊥ C | B.
Hammersley-Clifford Theorem
• Let UF be the family of distributions defined by P(x) = 1/f ∏_{C ∈ C(G)} ψ_C(x_C)
• Let UI be the set of distributions that are consistent with the set of conditional independence statements that can be read from the graph
• The theorem states that UI and UF are identical
Definition: Factor Graph models
• Given an undirected graph G = (V, F, E), where V and F are the sets of variable and factor nodes and E is the set of edges
• A factor graph defines a family of distributions:
P(x) = 1/f ∏_{F ∈ 𝔽} ψ_F(x_{N(F)})   where   f = Σ_x ∏_{F ∈ 𝔽} ψ_F(x_{N(F)})
f: partition function
F: a factor
𝔽: set of all factors
N(F): neighbourhood of a factor, i.e. the variables connected to it
ψ_F: function (not a distribution) depending on x_{N(F)} (ψ_F: K^|N(F)| → R, where x_i ∈ K)
Note: the definition of a factor is not linked to a property of the graph (as it is with cliques).
Factor Graphs - example
[Figure: factor graph with variable nodes x_1, …, x_5 and square factor nodes; a circle denotes an unobserved variable, a shaded circle an observed variable, a square a factor node, and an edge means that the variable takes part in that factor]
P(x_1, x_2, x_3, x_4, x_5) = 1/f · ψ(x_1, x_2, x_4) ψ(x_2, x_3) ψ(x_3, x_4) ψ(x_4, x_5) ψ(x_4)
Introducing energies
P(x) = 1/f ∏_{F ∈ 𝔽} ψ_F(x_{N(F)}) = 1/f ∏_{F ∈ 𝔽} exp{−θ_F(x_{N(F)})} = 1/f exp{−Σ_{F ∈ 𝔽} θ_F(x_{N(F)})} = 1/f exp{−E(x)}
The energy E(x) is just a sum of factors: E(x) = Σ_{F ∈ 𝔽} θ_F(x_{N(F)})
The most likely solution x* is reached by minimizing the energy:
x* = argmax_x P(x)   ⇔   x* = argmin_x E(x)
(since −log P(x) = log f + E(x) = constant + E(x))
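As a sanity check of this relation, here is a small brute-force sketch (my own illustration, not part of the lecture): it builds a tiny factor graph over three binary variables, turns the factors ψ_F into energies θ_F = −log ψ_F, and verifies that maximizing P and minimizing E pick the same labeling. The factor tables are made up.

# Brute-force sketch: P(x) = (1/f) * prod_F psi_F(x_N(F)) = (1/f) * exp(-E(x)),
# with E(x) = sum_F theta_F(x_N(F)) and theta_F = -log psi_F.
import itertools, math

# three binary variables x1, x2, x3; factors given as (variable indices, table)
psi = {
    (0, 1): {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},  # likes x1 == x2
    (1, 2): {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},  # likes x2 = x3 = 1
    (0,):   {(0,): 1.5, (1,): 1.0},                                # slight preference for x1 = 0
}
theta = {scope: {cfg: -math.log(v) for cfg, v in tab.items()} for scope, tab in psi.items()}

def prob_unnorm(x):   # product of factors psi_F
    return math.prod(tab[tuple(x[i] for i in scope)] for scope, tab in psi.items())

def energy(x):        # sum of energies theta_F
    return sum(tab[tuple(x[i] for i in scope)] for scope, tab in theta.items())

states = list(itertools.product([0, 1], repeat=3))
f = sum(prob_unnorm(x) for x in states)                  # partition function
map_by_prob = max(states, key=lambda x: prob_unnorm(x) / f)
map_by_energy = min(states, key=energy)
print("MAP:", map_by_prob, "P =", prob_unnorm(map_by_prob) / f)
assert map_by_prob == map_by_energy                      # argmax P == argmin E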
Gibbs Distribution
P(x) = 1/f exp{−E(x)}   with   E(x) = Σ_{F ∈ 𝔽} θ_F(x_{N(F)})
is a so-called Gibbs distribution (or Boltzmann distribution) with energy E.
Definition: Order
Definitions:
• Order: the arity (number of variables) of the largest factor
• Markov Random Field: random field with low-order factors
[Figure: factor graph over x_1, …, x_5 with factors of arity 3, 2 and 1]
Factor graph with order 3:
E(x) = θ(x_1, x_2, x_4) + θ(x_2, x_3) + θ(x_3, x_4) + θ(x_5, x_4) + θ(x_4)
Two examples
P_1(x_1, x_2, x_3) = 1/f exp{x_1 x_2 + x_2 x_3 + x_1 x_3}
P_2(x_1, x_2, x_3) = 1/f exp{x_1 x_2 x_3}
We always try to write distributions in as "factorized" a form as possible.
Note: P_2 cannot be written as a sum of pairwise energy terms.
The family view of distributions
• Family of all distributions
• Family of all distributions with the same undirected graphical model:
  P_12^U(x_1, x_2, x_3) = 1/f ψ(x_1, x_2, x_3)   ("written in maximum clique form")
• Families of all distributions with the same factor graph:
  P_1^F(x_1, x_2, x_3) = 1/f ψ(x_1, x_2) ψ(x_2, x_3) ψ(x_1, x_3)
  P_2^F(x_1, x_2, x_3) = 1/f ψ(x_1, x_2, x_3)
• Realizations of a distribution:
  P_1(x_1, x_2, x_3) = 1/f exp{x_1 x_2 + x_2 x_3 + x_1 x_3}
  P_2(x_1, x_2, x_3) = 1/f exp{x_1 x_2 x_3}
[Diagram: P_1 lies inside the factor-graph family P_1^F, P_2 inside P_2^F, and both factor-graph families lie inside the undirected-graphical-model family P_12^U]
Undirected Graphical Models are less precise
P_12^U(x_1, x_2, x_3) = 1/f ψ(x_1, x_2, x_3)   (undirected graphical model, maximum clique form)
P_1^F(x_1, x_2, x_3) = 1/f ψ(x_1, x_2) ψ(x_2, x_3) ψ(x_1, x_3)   (factor graph)
P_2^F(x_1, x_2, x_3) = 1/f ψ(x_1, x_2, x_3)   (factor graph)
The same undirected graphical model corresponds to both factor graphs, so it cannot distinguish between them.
Easy to convert between the two representations
[Figure: a factor graph over x_1, …, x_5 and the corresponding undirected graphical model]
Convert a factor graph in such a way that the family of distributions of the undirected graphical model covers all possible distributions of this factor graph: make sure that every factor is represented by a clique.
The family of distributions with this factor graph is contained in the family of distributions with this undirected graphical model.
Easy to convert between the two representations
[Figure: an undirected graphical model over x_1, …, x_5 and the corresponding factor graph]
Convert an undirected graphical model in such a way that the family of distributions of the factor graph covers all possible distributions of this undirected graphical model: make sure that every clique has an associated factor.
The family of distributions of this undirected graphical model and of this factor graph is the same.
Easy to convert between the two representations
[Figure: the same undirected graphical model over x_1, …, x_5 and a factor graph with additional factors]
Convert an undirected graphical model in such a way that the family of distributions of the factor graph covers all possible distributions of this undirected graphical model: make sure that every clique has an associated factor.
The family of distributions of this undirected graphical model and of this factor graph is the same.
Comment: this conversion is also correct, but it is not a minimal representation.
Easy to convert between the two representations
[Figure: the same undirected graphical model over x_1, …, x_5 and another candidate factor graph]
Convert an undirected graphical model in such a way that the family of distributions of the factor graph covers all possible distributions of this undirected graphical model: make sure that every clique has an associated factor.
The family of distributions of this undirected graphical model and of this factor graph is the same.
Comment: this conversion is not correct.

What to infer?
• MAP inference (maximum a posteriori state):
  x* = argmax_x P(x) = argmin_x E(x)
• Probabilistic inference, so-called marginals:
  P(x_i = k) = Σ_{x | x_i = k} P(x_1, …, x_i = k, …, x_n)
  This can be used to make a maximum marginal decision:
  x_i* = argmax_{x_i} P(x_i)
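Both inference tasks can be written as brute-force maxima/sums over all labelings, which is feasible only for tiny models but makes the definitions concrete. The following sketch (my own, with a made-up energy) computes the MAP state and the marginals P(x_i = k) for a small binary chain; note that the max-marginal decision need not coincide with the MAP state.

# Brute-force MAP and marginals for a tiny binary pairwise model (illustration only).
import itertools, math

n = 4                                                           # chain x1 - x2 - x3 - x4
unary = [[0.0, 0.4], [0.3, 0.0], [0.0, 0.3], [0.4, 0.0]]        # theta_i(x_i), made up
pairwise = lambda a, b: 0.6 * (a != b)                          # Potts term theta_ij

def E(x):
    return (sum(unary[i][x[i]] for i in range(n))
            + sum(pairwise(x[i], x[i + 1]) for i in range(n - 1)))

states = list(itertools.product([0, 1], repeat=n))
f = sum(math.exp(-E(x)) for x in states)                        # partition function
P = {x: math.exp(-E(x)) / f for x in states}                    # P(x) = exp(-E(x)) / f

x_map = min(states, key=E)                                      # MAP: argmin_x E(x)

# marginals P(x_i = k) = sum over all labelings x with x_i = k
marg = [[sum(p for x, p in P.items() if x[i] == k) for k in (0, 1)] for i in range(n)]
x_maxmarg = tuple(max((0, 1), key=lambda k: marg[i][k]) for i in range(n))

print("MAP state          :", x_map)
print("max-marginal state :", x_maxmarg)
print("marginals P(x_i=1) :", [round(marg[i][1], 3) for i in range(n)])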
MAP versus Marginals - visually
[Figure: input image; ground truth labeling; MAP solution x* (each pixel has a 0/1 label); marginals P(x_i) (each pixel has a probability between 0 and 1)]
MAP versus Marginals – Making Decisions
[Figure: plot of P(x|z) over the space of all solutions x (sorted by pixel difference) for an input image z]
Which solution x* would you choose?
Reminder: How to make a decision
Question: what solution x* should we give out?
Answer: choose the x* which minimizes the Bayesian risk (assuming the model P(x|z) is known):
x* = argmin_x̂ Σ_x P(x|z) C(x, x̂)
C(x_1, x_2) is called the loss function (or cost function) for comparing two results x_1, x_2.
Maximum A-Posteriori Solution (MAP)
[Figure: plot of P(x|z) over the space of all solutions x (sorted by pixel difference)]
The MAP solution takes the globally optimal solution.
The Cost Function behind MAP
Choose C(x, x̂) = 0 if x = x̂, and 1 otherwise. Then
x* = argmin_x̂ Σ_x P(x|z) C(x, x̂) = argmin_x̂ (1 − P(x = x̂ | z)) = argmax_x̂ P(x̂ | z)
The MAP estimate optimizes a "global 0-1 loss".
The Cost Function behind Max Marginals
Probabilistic inference gives marginals. We can take the max-marginal solution:
x_i* = argmax_{x_i} P(x_i)
(where P(x_i = k) = Σ_{x | x_i = k} P(x_1, …, x_i = k, …, x_n))
This represents the decision with minimum Bayesian risk
x* = argmin_x̂ Σ_x P(x|z) C(x, x̂)   where   C(x, x̂) = Σ_i |x_i − x̂_i|
This is a pixel-wise error, called the "Hamming loss".
Maximum A-Posteriori Solution (MAP)
[Figure: plot of P(x|z) over the space of all solutions x (sorted by pixel difference); the maximum marginal solution x_i* = argmax_{x_i} P(x_i) is sketched ("guessed") on the plot]
This lecture: Discrete Inference in Order-two Models
Gibbs distribution: P(x) = 1/f exp{−E(x)}
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j) + Σ_{i,j,k} θ_ijk(x_i, x_j, x_k) + …
(unary terms, pairwise terms, higher-order terms)
MAP inference: x* = argmax_x P(x) = argmin_x E(x)
Label space: binary x_i ∈ {0, 1} or multi-label x_i ∈ {0, …, K}
We only look at energies with unary and pairwise factors.
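All applications below use exactly this kind of unary-plus-pairwise energy on a 4-connected pixel grid. As a reference point, here is a small sketch (my own, with hypothetical cost arrays) that evaluates E(x) = Σ_i θ_i(x_i) + Σ_{i,j∈N4} θ_ij(x_i, x_j) for a given labeling; the ICM and DP snippets later minimize energies of this form.

# Sketch (illustration only): evaluate a unary + pairwise energy on a 4-connected grid.
import numpy as np

def grid_energy(labels, unary, pairwise):
    """labels: (H, W) int array; unary: (H, W, K) costs theta_i(x_i);
    pairwise(a, b): elementwise cost theta_ij for neighbouring labels a, b."""
    H, W = labels.shape
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()  # unary terms
    for (di, dj) in [(0, 1), (1, 0)]:                                      # right and down neighbours
        a = labels[:H - di, :W - dj]
        b = labels[di:, dj:]
        e += pairwise(a, b).sum()
    return e

# tiny example: 3x3 image, binary labels, Potts pairwise term
rng = np.random.default_rng(0)
unary = rng.random((3, 3, 2))
potts = lambda a, b: 0.5 * (a != b)
x = np.zeros((3, 3), dtype=int)
print("E(all zeros) =", grid_energy(x, unary, potts))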
Roadmap next two lectures
• Define: Structured Models
• Formulate applications as discrete labeling problems
• Discrete Inference:
• Pixel-based: Iterative Conditional Mode (ICM)
• Line-based: Dynamic Programming (DP)
• Field-based: Graph Cut and Alpha-Expansion
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative function
Example in Biology
Conditional graphical models for protein structural motif recognition.
Liu Y, Carbonell J, Gopalakrishnan V, Weigele P.
Example in Biology
A Discrete Chain Graph Model for 3d+t Cell Tracking with High
Misdetection Robustness. Bernhard X. Kausler, Martin Schiegg, Bjoern Andres,
Examples: Order
4-connected, pairwise MRF (order 2), "pairwise energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j)
Higher (8)-connected, pairwise MRF (order 2):
E(x) = Σ_{i,j ∈ N_8} θ_ij(x_i, x_j)
Higher-order RF (order n), "higher-order energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j) + θ(x_1, …, x_n)
Stereo Vision (all details will come in CV 1)
• Gray images are given
• Rectification: transform to the yellow images; corresponding points are now on the same scanline
• Correspondence search is now 1D: what is the right match?
Stereo Camera - Geometry
[Figure: two-camera geometry and the resulting disparity]
• Disparity zero means that the 3D point is at infinity
• Large disparity means that the 3D point is close to the camera
• From the disparities you can compute a depth map
Stereo Matching – Formulate as MRF
[Figure: left image (a), right image (b), ground-truth depth; example disparities d = 4 and d = 0]
• Images are rectified
• Ignore occlusion
Labels: d_i is the disparity (shift) of pixel i; only the left image is labelled
Energy: E(d): {0, …, D}^n → R
Stereo Matching - Energy
E(d) = Σ_i θ_i(d_i) + Σ_{i,j} θ_ij(d_i, d_j)
• Unary terms (many options). The patch cost for a pixel i with disparity d_i is
θ_i(d_i) = Σ_{j ∈ N_i} ( I^l_j − I^r_{j − d_i} )²
i.e. the sum of squared differences in a window (SSD cost) between the left image I^l and the shifted right image I^r.
[Figure: left and right images; factor graph with unary terms θ_i(d_i) and pairwise terms θ_ij(d_i, d_j)]
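A minimal sketch of this SSD patch cost (my own illustration; it assumes grayscale numpy images and handles the image border only crudely, ignoring occlusion): for every pixel and every candidate disparity it sums squared differences over a small window between the left image and the shifted right image.

# Sketch (illustration only): SSD unary costs theta_i(d_i) for stereo matching.
import numpy as np

def ssd_unary_costs(left, right, max_disp, radius=2):
    """left, right: (H, W) float grayscale images; returns costs of shape (H, W, max_disp + 1)."""
    H, W = left.shape
    big = 1e6                                  # cost where the shifted window leaves the image
    costs = np.empty((H, W, max_disp + 1))
    k = 2 * radius + 1
    for d in range(max_disp + 1):
        diff2 = np.full((H, W), big)
        diff2[:, d:] = (left[:, d:] - right[:, :W - d]) ** 2    # (I_l(j) - I_r(j - d))^2
        padded = np.pad(diff2, radius, mode='edge')
        window_sum = np.zeros((H, W))
        for dy in range(k):                                     # box filter: sum over the window N_i
            for dx in range(k):
                window_sum += padded[dy:dy + H, dx:dx + W]
        costs[:, :, d] = window_sum
    return costs

# usage with random images standing in for a rectified stereo pair
rng = np.random.default_rng(1)
L_img, R_img = rng.random((40, 60)), rng.random((40, 60))
theta = ssd_unary_costs(L_img, R_img, max_disp=8)
disparity_wta = theta.argmin(axis=2)           # winner-takes-all: unary terms only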
Stereo Matching Energy - Smoothness
[Olga Veksler PhD thesis, Daniel Cremers et al.]
Pairwise term without truncation (global minimum computable):
θ_ij(d_i, d_j) = |d_i − d_j|
[Figure: cost |d_i − d_j| as a function of the disparity difference]
Stereo Matching Energy - Smoothness
Discontinuity-preserving potentials [Blake & Zisserman '83, '87]
With truncation (NP-hard optimization):
θ_ij(d_i, d_j) = min(|d_i − d_j|, τ)
[Figure: the untruncated cost |d_i − d_j| compared with the truncated cost]
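For discrete inference it is convenient to tabulate the pairwise term once as a (D+1)×(D+1) matrix over all label pairs. A small sketch (my own) for the untruncated and truncated linear potentials above:

# Sketch: pairwise cost tables theta_ij(d_i, d_j) over all disparity pairs.
import numpy as np

def linear_potential(num_labels, trunc=None):
    d = np.arange(num_labels)
    cost = np.abs(d[:, None] - d[None, :]).astype(float)   # |d_i - d_j|
    if trunc is not None:
        cost = np.minimum(cost, trunc)                      # min(|d_i - d_j|, tau)
    return cost

theta_plain = linear_potential(9)              # convex; global optimum computable
theta_truncated = linear_potential(9, trunc=2) # discontinuity preserving; NP-hard in general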
Stereo Matching: Simplified Random Fields
[Figure: results for different models, compared to the ground truth]
• No MRF; block matching: each pixel independent (winner-takes-all, WTA)
• No horizontal links: efficient, since the chains are independent
• Pairwise MRF [Boykov et al. '01]
Image Segmentation
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j), binary label x_i ∈ {0, 1}
[Figure: 4-connected factor graph with unary terms θ_i(x_i) and pairwise terms θ_ij(x_i, x_j)]
Unary term: θ_i(x_i = 0) and θ_i(x_i = 1) encode that red/purple colors are more likely foreground and yellow/dark colors are more likely background (derivation next lecture).
[Figure: the two unary cost images θ_i(x_i = 0) and θ_i(x_i = 1), where dark means likely background / likely foreground respectively, and the optimum with unary terms only]
Pairwise term - Reminder
"Ising prior": θ_ij(x_i, x_j) = |x_i − x_j|
This models the assumption that the object is spatially coherent.
When is θ_ij(x_i, x_j) small, i.e. which configurations are likely?
[Figure: neighbouring label configurations, from most likely (equal labels) to most unlikely (different labels)]
Texture Synthesis
[Kwatra et al., SIGGRAPH '03]
[Figure: input texture and output; two shifted copies a and b of the input overlap on the output canvas O]
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j), binary label x_i ∈ {0, 1}
θ_i(x_i) = ∞ if the chosen image does not exist at pixel i, otherwise 0
θ_ij(x_i, x_j) = 0 if x_i = x_j
Texture Synthesis
[Figure: input and output; overlap of images a and b on the canvas O; neighbouring pixels i, j illustrate a good case and a bad case for the seam]
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j), binary label x_i ∈ {0, 1}
Pairwise term: θ_ij(x_i, x_j) = |x_i − x_j| ( |a_i − b_i| + |a_j − b_j| ), so the seam cost is
E(x) = Σ_{i,j} |x_i − x_j| ( |a_i − b_i| + |a_j − b_j| )
Good case: the two images agree where the label changes (low cost). Bad case: they disagree (high cost).
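A minimal sketch of this seam cost (my own illustration): given the two overlapping source images a and b and a binary labeling x on the overlap region, it sums |x_i − x_j|(|a_i − b_i| + |a_j − b_j|) over 4-connected neighbour pairs, so switching the source is cheap exactly where a and b agree.

# Sketch (illustration only): texture-synthesis seam cost on the overlap region.
import numpy as np

def seam_cost(x, a, b):
    """x: (H, W) binary labeling (0 = take image a, 1 = take image b);
    a, b: (H, W) grayscale images defined on the overlap region."""
    agree = np.abs(a - b)                                      # |a_i - b_i| per pixel
    H, W = x.shape
    cost = 0.0
    for (di, dj) in [(0, 1), (1, 0)]:                          # 4-connected neighbours
        xi, xj = x[:H - di, :W - dj], x[di:, dj:]
        ai, aj = agree[:H - di, :W - dj], agree[di:, dj:]
        cost += (np.abs(xi - xj) * (ai + aj)).sum()            # |x_i - x_j| (|a_i-b_i| + |a_j-b_j|)
    return cost

rng = np.random.default_rng(2)
a_img, b_img = rng.random((8, 8)), rng.random((8, 8))
x_lbl = (np.arange(8)[None, :] >= 4).astype(int) * np.ones((8, 8), dtype=int)  # left half a, right half b
print(seam_cost(x_lbl, a_img, b_img))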
Panoramic Stitching
Use identical energy
Panoramic Stitching
Use identical energy
Interactive Digital Photomontage
[Agarwala et al., Siggraph 2004]
Interactive Digital Photomontage
[Agarwala et al., Siggraph 2004]
Image Quilting
[A. Efros and W. T. Freeman, Image quilting for texture synthesis and transfer, SIGGRAPH 2001]
[Figure: source image (rice texture) and output canvas showing a face]
• The unary term matches "dark" rice pixels to dark face pixels
• Place the source image at random positions on the output canvas
• You can also use dynamic programming (see the article)
Video Synthesis
[Figure: input video and output video (duplicated in time)]
A 3D labeling problem: the same pairwise terms, but now in the x-, y-, and t (time)-direction.
Image Retargeting
http://swieskowski.net/carve/
Image Retargeting
Image Retargeting
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j), binary label x_i ∈ {0, 1}
[Figure: unary terms θ_i(x_i) force label 0 at the left border and label 1 at the right border; the cut between the label-0 and label-1 regions is sketched]
Goal: from each scan-line take out exactly one pixel. This gives the new image (one pixel less in x-direction).
The cut should go through places with low image gradient.
First Idea
[Figure: unary terms force label 0 at the left border and label 1 at the right border; a labeling with a 1,0 transition inside a scanline is shown between pixels i and j]
All pairwise terms θ_ij(x_i, x_j) may look like this:
x_i  x_j  value
0    0    0
0    1    |I_i − I_j|
1    0    |I_i − I_j|
1    1    0
This violates our constraint to take exactly one pixel per scanline (a 0→1 transition may be followed by a 1→0 transition).
The correct graph
[Figure: unary terms force label 0 at the left border and label 1 at the right border; no 1,0 transition is possible within a scanline]
This does not violate our constraint.
All horizontal pairwise terms look like this:
x_i  x_j  value
0    0    0
0    1    |I_i − I_j|
1    0    ∞
1    1    0
All vertical pairwise terms look like this:
x_i  x_j  value
0    0    0
0    1    |I_i − I_j|
1    0    |I_i − I_j|
1    1    0
(∞ means a very large number.)
[Improved Seam Carving for Video Retargeting, Rubinstein et al., SIGGRAPH '08]
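A tiny sketch (my own) of these two pairwise tables: the horizontal terms forbid the 1→0 transition with an (effectively) infinite cost, so each scanline can switch from label 0 to label 1 at most once, while the vertical terms stay symmetric.

# Sketch: pairwise tables for seam-style image retargeting (binary labels 0/1).
import numpy as np

INF = 1e9  # "infinity": a very large number

def horizontal_table(grad):
    """theta_ij for horizontally neighbouring pixels i (left) and j (right);
    grad = |I_i - I_j| is the local image difference."""
    return np.array([[0.0, grad],    # (0,0) -> 0,   (0,1) -> |I_i - I_j|
                     [INF, 0.0]])    # (1,0) -> inf (forbidden),  (1,1) -> 0

def vertical_table(grad):
    return np.array([[0.0, grad],
                     [grad, 0.0]])

print(horizontal_table(grad=0.3))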
Extension – Scene Carving
[Figure: image, hand-drawn depth-ordering, normal seam carving result, scene carving result]
Examples: Order
4-connected, pairwise MRF (order 2), "pairwise energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j)
Higher (8)-connected, pairwise MRF (order 2):
E(x) = Σ_{i,j ∈ N_8} θ_ij(x_i, x_j)
Higher-order RF (order n), "higher-order energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j) + θ(x_1, …, x_n)
Avoid Discretization Artefacts
Larger connectivity can model the true Euclidean length (other metrics are also possible).
[Figure: lengths of two example paths measured with the Euclidean metric and with 4-connected and 8-connected graph metrics; the 8-connected lengths approximate the Euclidean lengths much better than the 4-connected ones]
Question: can you choose edge weights in such a way that the dark yellow and the blue segmentation have different lengths?
Avoid Discretization Artefacts
[Figure: segmentation results with a 4-connected Euclidean model, an 8-connected Euclidean model (MRF), and an 8-connected geodesic model (CRF)]
Examples: Order
4-connected, pairwise MRF (order 2), "pairwise energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j)
Higher (8)-connected, pairwise MRF (order 2):
E(x) = Σ_{i,j ∈ N_8} θ_ij(x_i, x_j)
Higher-order RF (order n), "higher-order energy":
E(x) = Σ_{i,j ∈ N_4} θ_ij(x_i, x_j) + θ(x_1, …, x_n)
Advanced Object Recognition
• Many other examples: ObjCut [Kumar et al. '05]; Deformable Part Model [Felzenszwalb et al., CVPR '08]; PoseCut [Bray et al. '06]; LayoutCRF [Winn et al. '06]
• Maximizing / marginalizing over hidden variables
[Figure: "instance", "instance label" and "parts" layers, from LayoutCRF, Winn et al. '06]
Roadmap next two lectures
• Define: Structured Models
• Formulate applications as discrete labeling problems
• Discrete Inference:
• Pixel-based: Iterative Conditional Mode (ICM)
• Line-based: Dynamic Programming (DP)
• Field-based: Graph Cut and Alpha-Expansion
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative function
Inference – Big Picture (this will be done in Computer Vision 2)
• Combinatorial Optimization
• Binary, pairwise MRF: Graph cut, BHS (QPBO)
• Multiple label, pairwise: move-making; transformation
• Binary, higher-order factors: transformation
• Multi-label, higher-order factors:
move-making + transformation
• Dual/Problem Decomposition
• Decompose the (NP-)hard problem into tractable ones; solve with e.g. a sub-gradient technique
• Local search / Genetic algorithms
• ICM, simulated annealing
Inference – Big Picture (this will be done in Computer Vision 2)
• Message Passing Techniques
• Methods can be applied to any model in theory (higher order, multi-label, etc.)
• DP, BP, TRW, TRW-S
• LP-relaxation
• Relax original problem (e.g. {0,1} to [0,1])
and solve with existing techniques (e.g. sub-gradient)
• Can be applied to any model (depending on the solver used)
• Connections to message passing (TRW) and combinatorial
optimization (QPBO)
Function Minimization: The Problems
• Which functions are exactly solvable?
• Approximate solutions of NP-hard problems
Function Minimization: The Problems
• Which functions are exactly solvable?
Boros, Hammer [1965]; Kolmogorov, Zabih [ECCV 2002, PAMI 2004]; Ishikawa [PAMI 2003]; Schlesinger [EMMCVPR 2007]; Kohli, Kumar, Torr [CVPR 2007, PAMI 2008]; Ramalingam, Kohli, Alahari, Torr [CVPR 2008]; Kohli, Ladicky, Torr [CVPR 2008, IJCV 2009]; Zivny, Jeavons [CP 2008]
• Approximate solutions of NP-hard problems
Schlesinger [1976]; Kleinberg, Tardos [FOCS 99]; Chekuri et al. [2001]; Boykov et al. [PAMI 2001]; Wainwright et al. [NIPS 2001]; Werner [PAMI 2007]; Komodakis [PAMI 2005]; Lempitsky et al. [ICCV 2007]; Kumar et al. [NIPS 2007]; Kumar et al. [ICML 2008]; Sontag, Jaakkola [NIPS 2007]; Kohli et al. [ICML 2008]; Kohli et al. [CVPR 2008, IJCV 2009]; Rother et al. [2009]
ICM - Iterated Conditional Mode
[Figure: graph with x_1 connected to x_2, x_3, x_4 and x_5]
Gibbs energy:
E(x) = θ_12(x_1, x_2) + θ_13(x_1, x_3) + θ_14(x_1, x_4) + θ_15(x_1, x_5) + …
ICM - Iterated Conditional Mode
Idea: fix all variables but one and optimize over that one.
Gibbs energy:
E(x) = θ_12(x_1, x_2) + θ_13(x_1, x_3) + θ_14(x_1, x_4) + θ_15(x_1, x_5) + …
Select x_1 and optimize:
E'(x) = θ_12(x_1, x_2) + θ_13(x_1, x_3) + θ_14(x_1, x_4) + θ_15(x_1, x_5)
• Can get stuck in local minima
• Depends on the initialization
[Figure: an ICM result compared to the global minimum]
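A compact sketch of ICM on a 4-connected grid (my own illustration, assuming a unary cost array and a Potts-style pairwise function; not the lecture's reference code): each pixel in turn is set to the label that minimizes the energy given its fixed neighbours, and sweeps are repeated until nothing changes.

# Sketch (illustration only): ICM for a unary + pairwise energy on a 4-connected grid.
import numpy as np

def icm(unary, pairwise, labels, max_sweeps=20):
    """unary: (H, W, K) costs theta_i(x_i); pairwise(a, b): scalar cost theta_ij(a, b);
    labels: (H, W) initial labeling (e.g. argmin of the unaries). Returns a local minimum."""
    H, W, K = unary.shape
    labels = labels.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(H):
            for j in range(W):
                best_k, best_cost = labels[i, j], np.inf
                for k in range(K):
                    cost = unary[i, j, k]
                    for ni, nj in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]:
                        if 0 <= ni < H and 0 <= nj < W:
                            cost += pairwise(k, labels[ni, nj])   # neighbours are kept fixed
                    if cost < best_cost:
                        best_k, best_cost = k, cost
                if best_k != labels[i, j]:
                    labels[i, j] = best_k
                    changed = True
        if not changed:          # converged to a local minimum (depends on the initialization)
            break
    return labels

# usage: binary segmentation with random unaries and a Potts smoothness term
rng = np.random.default_rng(3)
unary = rng.random((10, 10, 2))
potts = lambda a, b: 0.8 * (a != b)
x0 = unary.argmin(axis=2)
x_icm = icm(unary, potts, x0)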
5ICM - parallelization
• The schedule is a more complex task in graphs which are not 4-connected
[Figure: normal procedure (steps 1-4) and parallel procedure (steps 1-4)]
ICM - Iterated Conditional Mode
Extensions / related techniques:
• Simulated annealing
• Block ICM (see exercise)
• Gibbs sampling (see later)
• Lazy Flipper: MAP inference in higher-order graphical models by depth-limited exhaustive search [Bjoern Andres, Joerg H. Kappes, Ullrich Koethe, Fred A. Hamprecht]

Roadmap next two lectures
• Define: Structured Models
• Formulate applications as discrete labeling problems
• Discrete Inference:
• Pixel-based: Iterative Conditional Mode (ICM)
• Line-based: Dynamic Programming (DP)
• Field-based: Graph Cut and Alpha-Expansion
• Interactive Image Segmentation
• From Generative models to
• Discriminative models to
• Discriminative function
What is dynamic programming?
Dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It examines all possible ways to solve the problem and finds the optimal solution.
Dynamic Programming on chains
[Figure: stereo results for different models, compared to the ground truth]
• No MRF; block matching: each pixel independent (winner-takes-all, WTA)
• No horizontal links: efficient, since the chains are independent
• Pairwise MRF [Boykov et al. '01]
Dynamic Programming on chains
E(x) = Σ_i θ_i(x_i) + Σ_{i,j ∈ N} θ_ij(x_i, x_j)   (unary terms plus pairwise terms along a row)
[Figure: chain of nodes …, o, p, q, r, s, …; messages M_{o→p}(x_p), M_{p→q}(x_q), M_{q→r}(x_r), M_{r→s}(x_s) are passed from left to right]
• Pass messages from left to right
• A message is a vector with K entries (K is the number of labels)
• Read out the solution from the final message and the final unary term
• Gives the globally exact solution
• Other name: min-sum algorithm
Comment: Dmitri Schlesinger called the messages Bellman functions.
Dynamic Programming on chains
Define the message
M_{q→r}(x_r) = min_{x_q} { M_{p→q}(x_q) + θ_q(x_q) + θ_{q,r}(x_q, x_r) }
(information from the previous nodes + local information + connection to the next node)
The message stores the minimal energy up to this point for x_r = k:
M_{q→r}(x_r = k) = min_{x_1, …, x_q} E(x_1, …, x_q, x_r = k)
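A minimal sketch of the min-sum recursion above (my own illustration, not the lecture's code): messages are passed left to right along a chain with K labels per node, and backpointers recover the globally optimal labeling. Here M[i, k] also includes the unary term of node i, a slight variant of the message above that gives the same optimum.

# Sketch (illustration only): dynamic programming (min-sum) on a chain.
import numpy as np

def dp_chain(unary, pairwise):
    """unary: (n, K) costs theta_i(x_i); pairwise: (K, K) costs theta_{i,i+1}(x_i, x_{i+1}).
    Returns the MAP labeling and its energy (globally optimal on a chain)."""
    n, K = unary.shape
    M = np.zeros((n, K))                 # M[i, k] = min energy of x_1..x_i with x_i = k
    back = np.zeros((n, K), dtype=int)
    M[0] = unary[0]
    for i in range(1, n):
        # cand[k_prev, k] = M[i-1, k_prev] + theta_{i-1,i}(k_prev, k)
        cand = M[i - 1][:, None] + pairwise
        back[i] = cand.argmin(axis=0)
        M[i] = cand.min(axis=0) + unary[i]        # message plus the local unary term
    x = np.zeros(n, dtype=int)                    # backtrack from the best final state
    x[-1] = M[-1].argmin()
    for i in range(n - 1, 0, -1):
        x[i - 1] = back[i, x[i]]
    return x, M[-1].min()

# usage: a short "scanline" with 4 labels and a truncated linear smoothness term
rng = np.random.default_rng(4)
theta_i = rng.random((6, 4))
d = np.arange(4)
theta_ij = np.minimum(np.abs(d[:, None] - d[None, :]), 2.0)
x_star, e_star = dp_chain(theta_i, theta_ij)
print(x_star, e_star)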
Dynamic Programming on chains - example
Dynamic Programming on trees
[Figure: tree-structured graph with a root; messages are passed from the leaves towards the root]
Example: part-based object recognition [Felzenszwalb, Huttenlocher '01]
Extensions
• Can be done with marginals: sum-product algorithm (see later lecture)
• Can be done in fields: Belief Propagation
• Can be done in higher-order factor graphs
• Speed up trick with distance transforms
• Shortest path in a graph can also be done with dynamic programming
[Figure: general shortest-path problem drawn on a grid with the variables along one axis and the labels along the other]
Dynamic Programming in vision – Two scenarios
• The two dimensions are pixels (on a chain) and labels:
  • Stereo
  • Many other applications
• The two dimensions are pixels (x-direction) and pixels (y-direction):
  • Segmentation with Intelligent Scissors [Mortenson et al., SIGGRAPH 95], in GIMP, Adobe Photoshop, etc.
  • Image retargeting
  • Image stitching (also possible, but rarely done)
  • Border matting [Rother et al., SIGGRAPH '04]

Image Retargeting
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j), binary label x_i ∈ {0, 1}
[Figure: unary terms θ_i(x_i) force label 0 at the left border and label 1 at the right border; the sketched labeling is split into a label-0 and a label-1 region by a path from top to bottom]
In this case the problem can be represented in two ways: as a labeling problem or as path finding.
You can do this as an exercise; please see the details in:
http://www.merl.com/reports/docs/TR2008-064.pdf
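The path-finding view can be implemented directly with the same dynamic programming idea: a minimal-cost 8-connected path from top to bottom through a per-pixel cost map (for instance a gradient magnitude) gives the seam of pixels to remove. A small sketch (my own illustration with a hypothetical gradient-based cost, not the exact formulation of the report linked above):

# Sketch (illustration only): vertical seam carving via dynamic programming.
import numpy as np

def find_vertical_seam(cost):
    """cost: (H, W) per-pixel cost (e.g. gradient magnitude).
    Returns, for every row, the column index of the minimal-cost top-to-bottom path."""
    H, W = cost.shape
    M = cost.copy()                                  # M[i, j] = min path cost ending at (i, j)
    back = np.zeros((H, W), dtype=int)
    for i in range(1, H):
        padded = np.pad(M[i - 1], 1, constant_values=np.inf)
        prev = np.stack([padded[0:W], padded[1:W + 1], padded[2:W + 2]])  # upper-left, up, upper-right
        back[i] = prev.argmin(axis=0) - 1            # offset in {-1, 0, +1}
        M[i] += prev.min(axis=0)
    seam = np.zeros(H, dtype=int)
    seam[-1] = M[-1].argmin()
    for i in range(H - 1, 0, -1):                    # backtrack the optimal path
        seam[i - 1] = seam[i] + back[i, seam[i]]
    return seam

def remove_seam(img, seam):
    H, W = img.shape
    keep = np.ones((H, W), dtype=bool)
    keep[np.arange(H), seam] = False
    return img[keep].reshape(H, W - 1)               # new image, one pixel less per scanline

rng = np.random.default_rng(5)
image = rng.random((20, 30))
gy, gx = np.gradient(image)
seam = find_vertical_seam(np.abs(gx) + np.abs(gy))   # cut through low-gradient regions
smaller = remove_seam(image, seam)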