Machine Learning II: Markov Trees, Labeling Problems
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 16.05.2014
Markov Trees
A “usual” parametrization (similar to chains):
Discrete variables $y_i \in K$ are enumerated; each one (except the “root”) has a predecessor $j(i) < i$.
The probability of a label configuration $y = (y_1, y_2, \ldots, y_n)$ is
$$p(y) = p(y_1) \cdot \prod_{i=2}^{n} p(y_i \mid y_{j(i)})$$
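To make this parametrization concrete, here is a minimal Python sketch (the tree, the probability tables, and all names are illustrative assumptions, not from the lecture) that evaluates $p(y)$ and draws samples by ancestral sampling, which works because every predecessor $j(i) < i$ is drawn first:

```python
import numpy as np

# Toy tree with root 1 and predecessor function j(i) < i (all numbers invented).
K = 2
pred = {2: 1, 3: 1, 4: 2}                        # j(2)=1, j(3)=1, j(4)=2
p_root = np.array([0.6, 0.4])                    # p(y1)
p_cond = {i: np.array([[0.7, 0.3],               # p(yi | yj(i)); row = yj(i)
                       [0.2, 0.8]]) for i in pred}

def prob(y):
    """p(y) = p(y1) * prod_{i>1} p(yi | yj(i)) for a labeling y = {node: label}."""
    p = p_root[y[1]]
    for i, j in pred.items():
        p *= p_cond[i][y[j], y[i]]
    return p

def sample(rng):
    """Ancestral sampling: since j(i) < i, predecessors are always drawn first."""
    y = {1: rng.choice(K, p=p_root)}
    for i, j in pred.items():
        y[i] = rng.choice(K, p=p_cond[i][y[j]])
    return y

y = sample(np.random.default_rng(0))
print(y, prob(y))
```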
Markov Trees
Parametrization by marginal probabilities (also similar to chains):
$$p(y) = p(y_1) \cdot \prod_{i=2}^{n} p(y_i \mid y_{j(i)}) = p(y_1) \cdot \prod_{i=2}^{n} \frac{p(y_i, y_{j(i)})}{p(y_{j(i)})} = \frac{\prod_{i,j(i)} p(y_i, y_{j(i)})}{\prod_i p(y_i)^{n(i)-1}} = \prod_{i \in V} p(y_i)^{1-n(i)} \cdot \prod_{ij \in E} p(y_i, y_j),$$
where $E$ is the set of edges of a cycle-free graph $G = (V, E)$ and $n(i)$ is the degree of node $i$ (its number of incident edges).
There is no “root” anymore, the parametrization is symmetric, and the nodes need not be explicitly enumerated.
Dynamic Programming for trees
In short, it is the same as for chains: the variables are eliminated sequentially (replaced by Bellman functions).
It works both for SumProd problems (partition function, marginal probabilities) and for MinSum (MAP).
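As a minimal sketch (the tree, the random costs, and all names below are assumptions of this example), here is leaf elimination for MinSum on a small tree; replacing min/+ by sum/× on exponentiated tables would give the SumProd variant:

```python
import numpy as np

# Toy MinSum dynamic programming on a tree by repeatedly eliminating a leaf.
K = 3
edges = [(0, 1), (1, 2), (1, 3)]                  # a small example tree
rng = np.random.default_rng(1)
theta = {i: rng.random(K) for i in range(4)}      # unary costs psi_i
theta_p = {e: rng.random((K, K)) for e in edges}  # pairwise costs psi_ij

def min_sum(edges, theta, theta_p):
    pw = {}
    for (i, j), t in theta_p.items():             # store both orientations
        pw[(i, j)], pw[(j, i)] = t, t.T
    nbrs = {i: set() for i in theta}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    q = {i: theta[i].copy() for i in theta}       # accumulated costs
    alive = set(theta)
    while len(alive) > 1:
        i = next(v for v in alive if len(nbrs[v]) == 1)   # pick a leaf
        (j,) = nbrs[i]
        # Bellman function: best cost of the eliminated subtree per label of j
        q[j] += (q[i][:, None] + pw[(i, j)]).min(axis=0)
        alive.remove(i); nbrs[j].remove(i)
    return q[alive.pop()].min()                   # the MinSum (MAP) value

print(min_sum(edges, theta, theta_p))
```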
Learning (supervised)
In addition to the “usual” learning, the tree structure has to be estimated as well.
Given a training set $L = (y^1, y^2, \ldots, y^l)$ of labelings, estimate:
– the graph, i.e. a cycle-free set of edges $E$,
– the numerical parameters $\psi$ for this graph.
The parameter set to be estimated depends on the graph, which is also unknown.
The maximum likelihood reads:
$$\ln p(L; E, \psi(E)) = \sum_l \ln p(y^l; E, \psi(E)) \to \max_{E,\, \psi(E)}$$
Learning
$$\Bigl[ \sum_l \ln p(y^l; E, \psi(E)) \to \max_{\psi(E)} \Bigr] \to \max_E$$
“Nested optimization”:
a) Consider a fixed set of edges $E$ and solve the “inner” optimization task with respect to $\psi$.
b) Substitute the “inner” task by the solution of a) (the reached value of the likelihood) and solve with respect to $E$.
The solution of a) is obvious in the “usual” parametrization by conditional probabilities:
– just set $\psi_{i,j(i)}(k, k')$ to the corresponding normalized statistics $p^*_{i,j(i)}(k \mid k')$ (analogously to Markov chains), as in the sketch below.
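A small sketch of this counting step, assuming a toy training set and a fixed tree given as a predecessor map (all data and names are placeholders):

```python
import numpy as np

# Toy "inner" optimization: count label pairs along the fixed tree and normalize.
K = 2
pred = {1: 0, 2: 0, 3: 1}                         # a fixed tree, j(i) < i
L = [np.array(y) for y in [(0, 0, 1, 0), (1, 1, 0, 1), (0, 0, 0, 0)]]

p_pair, p_cond = {}, {}
for i, j in pred.items():
    counts = np.zeros((K, K))                     # counts[k, k'] for (yi, yj(i))
    for y in L:
        counts[y[i], y[j]] += 1
    p_pair[(i, j)] = counts / counts.sum()        # p*_{i,j(i)}(k, k')
    # p*(k | k'); zero columns would need smoothing in practice
    p_cond[(i, j)] = counts / counts.sum(axis=0, keepdims=True)

print(p_cond[(1, 0)])                             # the optimal psi_{1,j(1)}
```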
Learning
The likelihood reached by the “inner” optimization
– for a fixed tree $E$ and
– with the optimal parameters $\psi_{i,j(i)}(k, k') = p^*_{i,j(i)}(k \mid k')$
is:
$$\sum_k p^*_1(k) \ln p^*_1(k) + \sum_{i=2}^{n} \sum_{k,k'} p^*_{i,j(i)}(k, k') \ln p^*_{i,j(i)}(k \mid k') =$$
$$= \sum_i (1 - n(i)) \sum_k p^*_i(k) \ln p^*_i(k) + \sum_{ij \in E} \sum_{k,k'} p^*_{ij}(k, k') \ln p^*_{ij}(k, k') =$$
$$= \sum_i (n(i) - 1) H(i) - \sum_{ij \in E} H(i, j) =$$
$$= -\sum_i H(i) + \sum_{ij \in E} \bigl[ H(i) + H(j) - H(i, j) \bigr] \to \max_E$$
H is the entropy of the corresponding probability distribution
Learning
$$-\sum_i H(i) + \sum_{ij \in E} \bigl[ H(i) + H(j) - H(i, j) \bigr] \to \max_E$$
The first sum does not depend on the graph structure and can be neglected.
The second sum is an “additive quality”, i.e. there is a “cost” for each edge if it is included into the tree.
→ find the maximal spanning tree with edge costs $c(i, j) = H(i) + H(j) - H(i, j)$ (the mutual information); see the sketch below.
http://en.wikipedia.org/wiki/Minimum_spanning_tree
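This is the classical Chow–Liu procedure. Below is a compact sketch, assuming a toy training set; the mutual-information weights are computed from empirical marginals and the maximum spanning tree is built Kruskal-style (all names and data are illustrative):

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def chow_liu(L, K):
    """Maximum spanning tree with edge costs c(i,j) = H(i) + H(j) - H(i,j)."""
    n = L.shape[1]
    H1 = []
    for i in range(n):                             # unary entropies H(i)
        p = np.bincount(L[:, i], minlength=K) / len(L)
        H1.append(entropy(p))
    costs = []
    for i, j in combinations(range(n), 2):         # pairwise statistics
        pij = np.zeros((K, K))
        for y in L:
            pij[y[i], y[j]] += 1
        pij /= len(L)
        mi = H1[i] + H1[j] - entropy(pij.ravel())  # mutual information
        costs.append((mi, i, j))
    # Kruskal: greedily add the heaviest edge that does not close a cycle
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]; x = parent[x]
        return x
    tree = []
    for mi, i, j in sorted(costs, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# toy training set: 4 variables, K = 2 labels
L = np.array([[0,0,1,0], [1,1,0,1], [0,0,0,0], [1,1,1,1], [0,1,0,0]])
print(chow_liu(L, K=2))
```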
Dynamic Programming for general graphs
Consider the following procedure to build a graph:
– Nodes are added to the graph consecutively.
– Each new node is linked to a fully connected subgraph (of the current graph) that consists of at most $w$ nodes.
– After all nodes are added, some edges are removed.
Let a graph be given. Its tree-width is the smallest $w$ such that the graph can be built by the procedure described above.
Examples: chains, trees – $w = 1$; cycles, simple networks – $w = 2$; $n \times m$ 4-connected grids – $w = \min(n, m)$.
For a fixed $w$, the question “does a given graph have tree-width $w$?” can be decided in $O(n^w)$, i.e. with polynomial time complexity. The task “find $w$ for a given graph” is NP-hard, because $w$ may be as large as $n$.
Dynamic Programming for general graphs
The idea: if the order of the nodes (in which they were added during the procedure described above) is known, they can be eliminated in the opposite order. The Bellman functions have order $w$ at most, i.e. $B: K^w \to \mathbb{R}$, and the dynamic programming can be performed in $O(nK^{w+1})$:
– Chain: eliminate the “first” node; $w = 1$, $B: K \to \mathbb{R}$, $O(nK^2)$.
– Tree: eliminate a leaf; all the same as for chains.
– Cycle: eliminate an arbitrary node; $w = 2$, $B: K^2 \to \mathbb{R}$, $O(nK^3)$ (see the sketch below).
– The above example: $w = 3$, $B: K^3 \to \mathbb{R}$, $O(nK^4)$.
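For the cycle case, one concrete way to see the $O(nK^3)$ bound: fixing the label of one node leaves a chain, the chain DP carries a Bellman function over a single node's labels, and the outer loop over the fixed label contributes the extra factor $K$. A toy sketch with random placeholder costs:

```python
import numpy as np

# Toy MinSum on a cycle 0-1-...-(n-1)-0 in O(n K^3) by fixing y_0.
n, K = 5, 3
rng = np.random.default_rng(2)
un = rng.random((n, K))                       # unary costs
pw = rng.random((n, K, K))                    # pw[i]: costs of edge (i, i+1 mod n)

best = np.inf
for k0 in range(K):                           # fix the label of node 0
    B = un[1] + pw[0][k0]                     # Bellman function over y_1
    for i in range(2, n):                     # chain DP along the remaining nodes
        B = un[i] + (B[:, None] + pw[i - 1]).min(axis=0)
    # close the cycle with the edge (n-1, 0) and the unary cost of node 0
    best = min(best, un[0][k0] + (B + pw[n - 1][:, k0]).min())
print(best)
```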
Labeling Problems
A graph $G = (V, E)$ with nodes $i \in V$ and edges $\{i, j\} \in E$. There is a variable $y_i$ for each $i$ that takes values from a finite discrete set $K$ (label set, states, range ...).
A labeling $y: V \to K$ is a mapping that assigns a label from $K$ to each variable $y_i$.
Extensions/variants: hyper-graphs instead of graphs, $K \subset \mathbb{R}$ (continuous range) ...
Constraint Satisfaction Problems
In each node some labels are “disabled”, and likewise there are “disabled” label pairs on the edges → there are local constraints, given in the form of Boolean functions $\psi_i: K \to \{0, 1\}$ and $\psi_{ij}: K^2 \to \{0, 1\}$.
A labeling is “globally consistent” if it fulfills all constraints:
$$Q(y) = \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j)$$
The task is to check whether there exists at least one consistent labeling ($\vee$ – “or”, $\wedge$ – “and”):
$$Q = \bigvee_y Q(y) = \bigvee_y \Bigl[ \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j) \Bigr]$$
CSP, Examples
nSAT problems (e.g. the 2SAT problem)
Given: a Boolean expression in conjunctive normal form, e.g.
$$F(a, b, c) = (a \vee \bar{b}) \wedge (\bar{a} \vee c) \wedge (b \vee c) \quad \text{with } a, b, c \in \{0, 1\}.$$
Find values for the variables so that $F(a, b, c) = 1$ holds.
Nodes – variables; label set – $\{0, 1\}$; edges – the $\vee$-terms (clauses); $\psi_i \equiv 1$,
$$\psi_{12}(a, b) = a \vee \bar{b}, \qquad \psi_{13}(a, c) = \bar{a} \vee c, \qquad \psi_{23}(b, c) = b \vee c$$
Compute
$$\bigvee_{a,b,c} F(a, b, c) = \bigvee_{a,b,c} \bigl[ \psi_{12}(a, b) \wedge \psi_{13}(a, c) \wedge \psi_{23}(b, c) \bigr]$$
CSP, Examples
n-queens puzzle: place n chess queens on an n×n chessboard so that no two queens threaten each other.
Nodes: $\{a, b, c, d, e, f, g, h\}$ – the columns/queens (one queen per column).
Label set: $\{1, \ldots, n\}$ – the vertical position of the corresponding queen.
The graph is fully connected, with
$$\psi_{ij}(k, k') = \begin{cases} 0 & \text{if the queens “threaten” each other} \\ 1 & \text{otherwise} \end{cases}$$
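A direct sketch of this constraint in Python (the indexing convention and helper names are my own; the last line checks a known 8-queens solution):

```python
# Pairwise constraint for the n-queens CSP: nodes are columns, labels are rows;
# queens in columns i != j threaten each other iff they share a row or diagonal.
def psi(i, k, j, kk):
    """1 if queens at (column i, row k) and (column j, row kk) do NOT threaten."""
    return 0 if k == kk or abs(k - kk) == abs(i - j) else 1

def consistent(y):
    """Q(y): conjunction of psi over all column pairs (fully connected graph)."""
    n = len(y)
    return all(psi(i, y[i], j, y[j]) for i in range(n) for j in range(i + 1, n))

print(consistent([2, 4, 6, 8, 3, 1, 7, 5]))   # a known 8-queens solution -> True
```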
CSP, Relaxation Labeling Algorithm
Look at the vicinity of a node and disable configurations that obviously cannot belong to a solution, i.e. repeat
$$\psi_i(k) := \psi_i(k) \wedge \bigwedge_{j:\, ij \in E} \bigvee_{k'} \psi_{ij}(k, k')$$
$$\psi_{ij}(k, k') := \psi_{ij}(k, k') \wedge \psi_i(k) \wedge \psi_j(k')$$
as long as something changes.
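A minimal sketch of the iteration, with Boolean tables stored as Python lists (the toy instance and all names are illustrative); processing each edge separately reaches the same fixed point as taking the conjunction over all incident edges at once:

```python
def relax(K, edges, psi_n, psi_e):
    """Prune labels and label pairs until nothing changes (arc consistency)."""
    changed = True
    while changed:
        changed = False
        for (i, j) in edges:
            t = psi_e[(i, j)]                 # t[k][kk] = psi_ij(k, kk)
            for k in range(K):
                # psi_i(k) := psi_i(k) AND (OR over k' of psi_ij(k, k'))
                if psi_n[i][k] and not any(t[k]):
                    psi_n[i][k] = 0; changed = True
                if psi_n[j][k] and not any(row[k] for row in t):
                    psi_n[j][k] = 0; changed = True
                for kk in range(K):
                    # psi_ij(k,k') := psi_ij(k,k') AND psi_i(k) AND psi_j(k')
                    if t[k][kk] and not (psi_n[i][k] and psi_n[j][kk]):
                        t[k][kk] = 0; changed = True
    return psi_n, psi_e

# toy instance: two nodes, K = 2, one edge that only allows the pair (0, 1)
psi_n = {0: [1, 1], 1: [1, 1]}
psi_e = {(0, 1): [[0, 1], [0, 0]]}
print(relax(2, [(0, 1)], psi_n, psi_e))   # only y0 = 0, y1 = 1 survives
```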
After the algorithm terminates there are three possible cases:
1) There is a node with no allowed label left
→ there is no consistent labeling.
2) There is exactly one allowed label in each node
→ there is a consistent labeling.
3) In some nodes more than one label remains allowed
→ the task is not solved (example on the board).
CSP is NP-complete in general
Energy Minimization
$\psi_i: K \to \mathbb{R}$ and $\psi_{ij}: K^2 \to \mathbb{R}$ do not “disable” but “penalize”.
The quality of a labeling is
$$Q(y) = \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j)$$
Find a labeling of minimal quality:
$$Q = \min_y Q(y) = \min_y \Bigl[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Example: maximum a-posteriori decisions in MRFs.
CSP is a special case of energy minimization – all local qualities (values of the $\psi$ functions) are either
0 – corresponds to the Boolean 1, i.e. “allowed”, or
$\infty$ – corresponds to the Boolean 0, i.e. “disabled”
(see the sketch below).
Synonyms for energy minimization: “SoftCSP”, “ValuedCSP” ...
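A tiny brute-force sketch of this embedding, assuming a toy “not equal” CSP on a 3-node chain (all names are made up); a consistent labeling exists iff the minimal energy is finite:

```python
from itertools import product
from math import inf

# Boolean psi in {0, 1} becomes an energy in {inf, 0}.
K, n, edges = 2, 3, [(0, 1), (1, 2)]
psi = {e: (lambda k, kk: int(k != kk)) for e in edges}   # toy "not equal" CSP
energy = {e: (lambda k, kk, p=p: 0.0 if p(k, kk) else inf)
          for e, p in psi.items()}

best = min(sum(energy[i, j](y[i], y[j]) for (i, j) in edges)
           for y in product(range(K), repeat=n))
print("consistent labeling exists:", best < inf)         # True, e.g. (0, 1, 0)
```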
Partition Function
The probability distribution resulting from an energy $E(y)$ is
$$p(y) = \frac{1}{Z} \exp\Bigl[ -\Bigl( \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr) \Bigr] = \frac{1}{Z} \prod_i \tilde\psi_i(y_i) \cdot \prod_{ij} \tilde\psi_{ij}(y_i, y_j)$$
with $\tilde\psi = \exp(-\psi)$ and
$$Z = \sum_y \prod_i \tilde\psi_i(y_i) \cdot \prod_{ij} \tilde\psi_{ij}(y_i, y_j)$$
Energy minimization is a “special case” of the partition function (the zero-temperature limit):
$$\min_y E(y) = \lim_{t \to 0} \Bigl[ -t \cdot \ln \sum_y \exp\Bigl( -\frac{E(y)}{t} \Bigr) \Bigr]$$
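A quick numeric check of this limit on made-up energies; the evaluation subtracts the minimum before exponentiating (the log-sum-exp trick) to stay stable for small $t$:

```python
import numpy as np

E = np.array([1.3, 0.4, 2.2, 0.9])            # toy energies over all labelings
for t in [1.0, 0.1, 0.01]:
    m = E.min()
    # -t * ln sum_y exp(-E(y)/t), computed stably
    val = m - t * np.log(np.exp(-(E - m) / t).sum())
    print(t, val)                              # approaches min_y E(y) = 0.4
```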
General formulation
CSP:
$$\bigvee_y \Bigl[ \bigwedge_i \psi_i(y_i) \wedge \bigwedge_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Energy minimization:
$$\min_y \Bigl[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Partition function:
$$\sum_y \Bigl[ \prod_i \psi_i(y_i) \cdot \prod_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
General formulation:
$$\bigoplus_y \Bigl[ \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
i.e. the same task in different semirings $(W, \oplus, \otimes)$, with $\psi_i: K \to W$ and $\psi_{ij}: K^2 \to W$.
Special cases: OrAnd, MinSum, SumProd ... (a generic sketch follows below).
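A generic sketch on a chain: one DP routine parametrized by the semiring operations $(\oplus, \otimes)$, instantiated for MinSum, SumProd, and OrAnd (the toy $\psi$ tables and all names are assumptions of this example):

```python
from functools import reduce
from math import exp, inf

def chain_dp(K, n, psi_n, psi_e, plus, times):
    """(+)_y [ (x)_i psi_i(yi) (x)_i psi_{i,i+1}(yi, y_{i+1}) ] on a chain."""
    B = [psi_n(0, k) for k in range(K)]        # Bellman function over y_0
    for i in range(1, n):                      # eliminate y_{i-1}
        B = [reduce(plus, (times(B[kk], times(psi_e(i - 1, kk, k), psi_n(i, k)))
                           for kk in range(K)))
             for k in range(K)]
    return reduce(plus, B)

# toy tables: unaries prefer label i mod 2, pairwise terms forbid equal labels
un = lambda i, k: 1 if k == i % 2 else 3
pe = lambda i, k, kk: inf if k == kk else 0
K, n = 2, 4

print(chain_dp(K, n, un, pe, min, lambda a, b: a + b))            # MinSum -> 4
print(chain_dp(K, n, lambda i, k: exp(-un(i, k)),                 # SumProd (Z)
               lambda i, k, kk: exp(-pe(i, k, kk)),
               lambda a, b: a + b, lambda a, b: a * b))
print(chain_dp(K, n, lambda i, k: True,                           # OrAnd
               lambda i, k, kk: pe(i, k, kk) < inf,
               lambda a, b: a or b, lambda a, b: a and b))
```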
State-of-the-art
All labeling problems are NP-hard in general.
All labeling problems can be solved by dynamic programming for “simple” graphs (partial $w$-trees of low tree-width).
There is a dichotomy (with respect to the properties of $\psi$) for OrAnd on general graphs (P ↔ NP).
Submodular MinSum problems on general graphs are solvable in polynomial time.
There are many efficient approximate algorithms for MinSum on general graphs.
There are (less efficient) approximate algorithms for SumProd on general graphs.
There are dichotomies for MinSum and SumProd as well (?)