Machine Learning II: Markov Trees, Labeling Problems
Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug
SS2014, 16.05.2014
Markov Trees
A “usual” parametrization (similar to chains):
Discrete variables $y_i \in K$ are enumerated; each one (except the “root”) has a predecessor $j(i) < i$.
The probability of a label configuration $y = (y_1, y_2, \ldots, y_n)$ is
$$p(y) = p(y_1) \cdot \prod_{i=2}^{n} p(y_i \mid y_{j(i)})$$
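To make this parametrization concrete, here is a minimal Python sketch (the tree, the probability tables, and all names are illustrative assumptions, not from the lecture) that evaluates $p(y)$ and draws samples by ancestral sampling, which works because every predecessor $j(i) < i$ is drawn first:

```python
import numpy as np

# Toy tree with root 1 and predecessor function j(i) < i (all numbers invented).
K = 2
pred = {2: 1, 3: 1, 4: 2}                        # j(2)=1, j(3)=1, j(4)=2
p_root = np.array([0.6, 0.4])                    # p(y1)
p_cond = {i: np.array([[0.7, 0.3],               # p(yi | yj(i)); row = yj(i)
                       [0.2, 0.8]]) for i in pred}

def prob(y):
    """p(y) = p(y1) * prod_{i>1} p(yi | yj(i)) for a labeling y = {node: label}."""
    p = p_root[y[1]]
    for i, j in pred.items():
        p *= p_cond[i][y[j], y[i]]
    return p

def sample(rng):
    """Ancestral sampling: since j(i) < i, predecessors are always drawn first."""
    y = {1: rng.choice(K, p=p_root)}
    for i, j in pred.items():
        y[i] = rng.choice(K, p=p_cond[i][y[j]])
    return y

y = sample(np.random.default_rng(0))
print(y, prob(y))
```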
Markov Trees
Parametrization by marginal probabilities (also similar to chains):
$$p(y) = p(y_1) \cdot \prod_{i=2}^{n} p(y_i \mid y_{j(i)}) = p(y_1) \cdot \prod_{i=2}^{n} \frac{p(y_i, y_{j(i)})}{p(y_{j(i)})} = \frac{\prod_{i,j(i)} p(y_i, y_{j(i)})}{\prod_i p(y_i)^{n(i)-1}} = \prod_{i \in V} p(y_i)^{1-n(i)} \cdot \prod_{ij \in E} p(y_i, y_j),$$
where $E$ is the set of edges of a cycle-free graph $G = (V, E)$ and $n(i)$ is the degree of node $i$ (its number of incident edges).
There is no “root” anymore, the parametrization is symmetric, and the nodes need not be explicitly enumerated.
Dynamic Programming for trees
In short, it is the same as for chains: the variables are eliminated sequentially (replaced by Bellman functions).
It works both for SumProd problems (partition function, marginal probabilities) and for MinSum (MAP).
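As a minimal sketch (the tree, the random costs, and all names below are assumptions of this example), here is leaf elimination for MinSum on a small tree; replacing min/+ by sum/× on exponentiated tables would give the SumProd variant:

```python
import numpy as np

# Toy MinSum dynamic programming on a tree by repeatedly eliminating a leaf.
K = 3
edges = [(0, 1), (1, 2), (1, 3)]                  # a small example tree
rng = np.random.default_rng(1)
theta = {i: rng.random(K) for i in range(4)}      # unary costs psi_i
theta_p = {e: rng.random((K, K)) for e in edges}  # pairwise costs psi_ij

def min_sum(edges, theta, theta_p):
    pw = {}
    for (i, j), t in theta_p.items():             # store both orientations
        pw[(i, j)], pw[(j, i)] = t, t.T
    nbrs = {i: set() for i in theta}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    q = {i: theta[i].copy() for i in theta}       # accumulated costs
    alive = set(theta)
    while len(alive) > 1:
        i = next(v for v in alive if len(nbrs[v]) == 1)   # pick a leaf
        (j,) = nbrs[i]
        # Bellman function: best cost of the eliminated subtree per label of j
        q[j] += (q[i][:, None] + pw[(i, j)]).min(axis=0)
        alive.remove(i); nbrs[j].remove(i)
    return q[alive.pop()].min()                   # the MinSum (MAP) value

print(min_sum(edges, theta, theta_p))
```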
Learning (supervised)
In addition to the “usual” learning, the tree structure has to be estimated as well.
Given a training set $L = (y^1, y^2, \ldots, y^l)$ of labelings, estimate:
– the graph, i.e. a cycle-free set of edges $E$,
– the numerical parameters $\psi$ for this graph.
The parameter set to be estimated depends on the graph, which is also unknown.
The maximum likelihood reads:
$$\ln p(L; E, \psi(E)) = \sum_l \ln p(y^l; E, \psi(E)) \to \max_{E,\, \psi(E)}$$
Learning
$$\Bigl[ \sum_l \ln p(y^l; E, \psi(E)) \to \max_{\psi(E)} \Bigr] \to \max_E$$
“Nested optimization”:
a) Consider a fixed set of edges $E$ and solve the “inner” optimization task with respect to $\psi$.
b) Substitute the “inner” task by the solution of a) (the reached value of the likelihood) and solve with respect to $E$.
The solution of a) is obvious in the “usual” parametrization by conditional probabilities:
– just set $\psi_{i,j(i)}(k, k')$ to the corresponding normalized statistics $p^*_{i,j(i)}(k \mid k')$ (analogously to Markov chains), as in the sketch below.
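A small sketch of this counting step, assuming a toy training set and a fixed tree given as a predecessor map (all data and names are placeholders):

```python
import numpy as np

# Toy "inner" optimization: count label pairs along the fixed tree and normalize.
K = 2
pred = {1: 0, 2: 0, 3: 1}                         # a fixed tree, j(i) < i
L = [np.array(y) for y in [(0, 0, 1, 0), (1, 1, 0, 1), (0, 0, 0, 0)]]

p_pair, p_cond = {}, {}
for i, j in pred.items():
    counts = np.zeros((K, K))                     # counts[k, k'] for (yi, yj(i))
    for y in L:
        counts[y[i], y[j]] += 1
    p_pair[(i, j)] = counts / counts.sum()        # p*_{i,j(i)}(k, k')
    # p*(k | k'); zero columns would need smoothing in practice
    p_cond[(i, j)] = counts / counts.sum(axis=0, keepdims=True)

print(p_cond[(1, 0)])                             # the optimal psi_{1,j(1)}
```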
Learning
The likelihood reached by the “inner” optimization
– for a fixed tree $E$ and
– with the optimal parameters $\psi_{i,j(i)}(k, k') = p^*_{i,j(i)}(k \mid k')$
is:
$$\sum_k p^*_1(k) \ln p^*_1(k) + \sum_{i=2}^{n} \sum_{k,k'} p^*_{i,j(i)}(k, k') \ln p^*_{i,j(i)}(k \mid k') =$$
$$= \sum_i (1 - n(i)) \sum_k p^*_i(k) \ln p^*_i(k) + \sum_{ij \in E} \sum_{k,k'} p^*_{ij}(k, k') \ln p^*_{ij}(k, k') =$$
$$= \sum_i (n(i) - 1) H(i) - \sum_{ij \in E} H(i, j) =$$
$$= -\sum_i H(i) + \sum_{ij \in E} \bigl[ H(i) + H(j) - H(i, j) \bigr] \to \max_E$$
H is the entropy of the corresponding probability distribution
Learning
$$-\sum_i H(i) + \sum_{ij \in E} \bigl[ H(i) + H(j) - H(i, j) \bigr] \to \max_E$$
The first sum does not depend on the graph structure and can be neglected.
The second sum is an “additive quality”, i.e. there is a “cost” for each edge if it is included into the tree.
→ find the maximal spanning tree with edge costs $c(i, j) = H(i) + H(j) - H(i, j)$ (the mutual information); see the sketch below.
http://en.wikipedia.org/wiki/Minimum_spanning_tree
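This is the classical Chow–Liu procedure. Below is a compact sketch, assuming a toy training set; the mutual-information weights are computed from empirical marginals and the maximum spanning tree is built Kruskal-style (all names and data are illustrative):

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def chow_liu(L, K):
    """Maximum spanning tree with edge costs c(i,j) = H(i) + H(j) - H(i,j)."""
    n = L.shape[1]
    H1 = []
    for i in range(n):                             # unary entropies H(i)
        p = np.bincount(L[:, i], minlength=K) / len(L)
        H1.append(entropy(p))
    costs = []
    for i, j in combinations(range(n), 2):         # pairwise statistics
        pij = np.zeros((K, K))
        for y in L:
            pij[y[i], y[j]] += 1
        pij /= len(L)
        mi = H1[i] + H1[j] - entropy(pij.ravel())  # mutual information
        costs.append((mi, i, j))
    # Kruskal: greedily add the heaviest edge that does not close a cycle
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]; x = parent[x]
        return x
    tree = []
    for mi, i, j in sorted(costs, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# toy training set: 4 variables, K = 2 labels
L = np.array([[0,0,1,0], [1,1,0,1], [0,0,0,0], [1,1,1,1], [0,1,0,0]])
print(chow_liu(L, K=2))
```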
Dynamic Programming for general graphs
Consider the following procedure to build a graph:
– Nodes are added to the graph consecutively.
– Each new node is linked to a fully connected subgraph (of the current graph) that consists of at most $w$ nodes.
– After all nodes are added, some edges are removed.
Let a graph be given. Its tree-width is the smallest $w$ such that the graph can be built by the procedure described above.
Examples: chains, trees – $w = 1$; cycles, simple networks – $w = 2$; $n \times m$ 4-connected grids – $w = \min(n, m)$.
For a fixed $w$, the question “does a given graph have tree-width $w$?” can be decided in $O(n^w)$, i.e. with polynomial time complexity. The task “find $w$ for a given graph” is NP-hard, because $w$ may be as large as $n$.
Dynamic Programming for general graphs
The idea: if the order of the nodes (in which they were added during the procedure described above) is known, they can be eliminated in the opposite order. The Bellman functions have order $w$ at most, i.e. $B: K^w \to \mathbb{R}$, and the dynamic programming can be performed in $O(nK^{w+1})$:
– Chain: eliminate the “first” node; $w = 1$, $B: K \to \mathbb{R}$, $O(nK^2)$.
– Tree: eliminate a leaf; all the same as for chains.
– Cycle: eliminate an arbitrary node; $w = 2$, $B: K^2 \to \mathbb{R}$, $O(nK^3)$ (see the sketch below).
– The above example: $w = 3$, $B: K^3 \to \mathbb{R}$, $O(nK^4)$.
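For the cycle case, one concrete way to see the $O(nK^3)$ bound: fixing the label of one node leaves a chain, the chain DP carries a Bellman function over a single node's labels, and the outer loop over the fixed label contributes the extra factor $K$. A toy sketch with random placeholder costs:

```python
import numpy as np

# Toy MinSum on a cycle 0-1-...-(n-1)-0 in O(n K^3) by fixing y_0.
n, K = 5, 3
rng = np.random.default_rng(2)
un = rng.random((n, K))                       # unary costs
pw = rng.random((n, K, K))                    # pw[i]: costs of edge (i, i+1 mod n)

best = np.inf
for k0 in range(K):                           # fix the label of node 0
    B = un[1] + pw[0][k0]                     # Bellman function over y_1
    for i in range(2, n):                     # chain DP along the remaining nodes
        B = un[i] + (B[:, None] + pw[i - 1]).min(axis=0)
    # close the cycle with the edge (n-1, 0) and the unary cost of node 0
    best = min(best, un[0][k0] + (B + pw[n - 1][:, k0]).min())
print(best)
```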
Labeling Problems
A graph $G = (V, E)$ with nodes $i \in V$ and edges $\{i, j\} \in E$. There is a variable $y_i$ for each $i$ that takes values from a finite discrete set $K$ (label set, states, range ...).
A labeling $y: V \to K$ is a mapping that assigns a label from $K$ to each variable $y_i$.
Extensions/variants: hyper-graphs instead of graphs, $K \subset \mathbb{R}$ (continuous range) ...
Constraint Satisfaction Problems
In each node some labels are “disabled”, and likewise there are “disabled” label pairs on the edges → there are local constraints, given in the form of Boolean functions $\psi_i: K \to \{0, 1\}$ and $\psi_{ij}: K^2 \to \{0, 1\}$.
A labeling is “globally consistent” if it fulfills all constraints:
$$Q(y) = \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j)$$
The task is to check whether there exists at least one consistent labeling ($\vee$ – “or”, $\wedge$ – “and”):
$$Q = \bigvee_y Q(y) = \bigvee_y \Bigl[ \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j) \Bigr]$$
CSP, Examples
nSAT problems (e.g. the 2SAT problem)
Given: a Boolean expression in conjunctive normal form, e.g.
$$F(a, b, c) = (a \vee \bar{b}) \wedge (\bar{a} \vee c) \wedge (b \vee c) \quad \text{with } a, b, c \in \{0, 1\}.$$
Find values for the variables so that $F(a, b, c) = 1$ holds.
Nodes – variables; label set – $\{0, 1\}$; edges – the $\vee$-terms (clauses); $\psi_i \equiv 1$,
$$\psi_{12}(a, b) = a \vee \bar{b}, \qquad \psi_{13}(a, c) = \bar{a} \vee c, \qquad \psi_{23}(b, c) = b \vee c$$
Compute
$$\bigvee_{a,b,c} F(a, b, c) = \bigvee_{a,b,c} \bigl[ \psi_{12}(a, b) \wedge \psi_{13}(a, c) \wedge \psi_{23}(b, c) \bigr]$$
CSP, Examples
n-queens puzzle: place n chess queens on an n×n chessboard so that no two queens threaten each other.
Nodes: $\{a, b, c, d, e, f, g, h\}$ – the columns/queens (one queen per column).
Label set: $\{1, \ldots, n\}$ – the vertical position of the corresponding queen.
The graph is fully connected, with
$$\psi_{ij}(k, k') = \begin{cases} 0 & \text{if the queens “threaten” each other} \\ 1 & \text{otherwise} \end{cases}$$
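A direct sketch of this constraint in Python (the indexing convention and helper names are my own; the last line checks a known 8-queens solution):

```python
# Pairwise constraint for the n-queens CSP: nodes are columns, labels are rows;
# queens in columns i != j threaten each other iff they share a row or diagonal.
def psi(i, k, j, kk):
    """1 if queens at (column i, row k) and (column j, row kk) do NOT threaten."""
    return 0 if k == kk or abs(k - kk) == abs(i - j) else 1

def consistent(y):
    """Q(y): conjunction of psi over all column pairs (fully connected graph)."""
    n = len(y)
    return all(psi(i, y[i], j, y[j]) for i in range(n) for j in range(i + 1, n))

print(consistent([2, 4, 6, 8, 3, 1, 7, 5]))   # a known 8-queens solution -> True
```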
CSP, Relaxation Labeling Algorithm
Look at the vicinity of a node and disable configurations that obviously cannot belong to a solution, i.e. repeat
$$\psi_i(k) := \psi_i(k) \wedge \bigwedge_{j:\, ij \in E} \bigvee_{k'} \psi_{ij}(k, k')$$
$$\psi_{ij}(k, k') := \psi_{ij}(k, k') \wedge \psi_i(k) \wedge \psi_j(k')$$
as long as something changes.
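A minimal sketch of the iteration, with Boolean tables stored as Python lists (the toy instance and all names are illustrative); processing each edge separately reaches the same fixed point as taking the conjunction over all incident edges at once:

```python
def relax(K, edges, psi_n, psi_e):
    """Prune labels and label pairs until nothing changes (arc consistency)."""
    changed = True
    while changed:
        changed = False
        for (i, j) in edges:
            t = psi_e[(i, j)]                 # t[k][kk] = psi_ij(k, kk)
            for k in range(K):
                # psi_i(k) := psi_i(k) AND (OR over k' of psi_ij(k, k'))
                if psi_n[i][k] and not any(t[k]):
                    psi_n[i][k] = 0; changed = True
                if psi_n[j][k] and not any(row[k] for row in t):
                    psi_n[j][k] = 0; changed = True
                for kk in range(K):
                    # psi_ij(k,k') := psi_ij(k,k') AND psi_i(k) AND psi_j(k')
                    if t[k][kk] and not (psi_n[i][k] and psi_n[j][kk]):
                        t[k][kk] = 0; changed = True
    return psi_n, psi_e

# toy instance: two nodes, K = 2, one edge that only allows the pair (0, 1)
psi_n = {0: [1, 1], 1: [1, 1]}
psi_e = {(0, 1): [[0, 1], [0, 0]]}
print(relax(2, [(0, 1)], psi_n, psi_e))   # only y0 = 0, y1 = 1 survives
```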
After the algorithm terminates there are three possible cases:
1) There is a node with no allowed label left
→ there is no consistent labeling.
2) There is exactly one allowed label in each node
→ there is a consistent labeling.
3) In some nodes more than one label remains allowed
→ the task is not solved (example on the board).
CSP is NP-complete in general
Energy Minimization
$\psi_i: K \to \mathbb{R}$ and $\psi_{ij}: K^2 \to \mathbb{R}$ do not “disable” but “penalize”.
The quality of a labeling is
$$Q(y) = \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j)$$
Find a labeling of minimal quality:
$$Q = \min_y Q(y) = \min_y \Bigl[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Example: maximum a-posteriori decisions in MRFs.
CSP is a special case of energy minimization – all local qualities (values of the $\psi$ functions) are either
0 – corresponds to the Boolean 1, i.e. “allowed”, or
$\infty$ – corresponds to the Boolean 0, i.e. “disabled”
(see the sketch below).
Synonyms for energy minimization: “SoftCSP”, “ValuedCSP” ...
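A tiny brute-force sketch of this embedding, assuming a toy “not equal” CSP on a 3-node chain (all names are made up); a consistent labeling exists iff the minimal energy is finite:

```python
from itertools import product
from math import inf

# Boolean psi in {0, 1} becomes an energy in {inf, 0}.
K, n, edges = 2, 3, [(0, 1), (1, 2)]
psi = {e: (lambda k, kk: int(k != kk)) for e in edges}   # toy "not equal" CSP
energy = {e: (lambda k, kk, p=p: 0.0 if p(k, kk) else inf)
          for e, p in psi.items()}

best = min(sum(energy[i, j](y[i], y[j]) for (i, j) in edges)
           for y in product(range(K), repeat=n))
print("consistent labeling exists:", best < inf)         # True, e.g. (0, 1, 0)
```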
Partition Function
The probability distribution resulting from an energy $E(y)$ is
$$p(y) = \frac{1}{Z} \exp\Bigl[ -\Bigl( \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr) \Bigr] = \frac{1}{Z} \prod_i \tilde\psi_i(y_i) \cdot \prod_{ij} \tilde\psi_{ij}(y_i, y_j)$$
with $\tilde\psi = \exp(-\psi)$ and
$$Z = \sum_y \prod_i \tilde\psi_i(y_i) \cdot \prod_{ij} \tilde\psi_{ij}(y_i, y_j)$$
Energy minimization is a “special case” of the partition function (the zero-temperature limit):
$$\min_y E(y) = \lim_{t \to 0} \Bigl[ -t \cdot \ln \sum_y \exp\Bigl( -\frac{E(y)}{t} \Bigr) \Bigr]$$
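A quick numeric check of this limit on made-up energies; the evaluation subtracts the minimum before exponentiating (the log-sum-exp trick) to stay stable for small $t$:

```python
import numpy as np

E = np.array([1.3, 0.4, 2.2, 0.9])            # toy energies over all labelings
for t in [1.0, 0.1, 0.01]:
    m = E.min()
    # -t * ln sum_y exp(-E(y)/t), computed stably
    val = m - t * np.log(np.exp(-(E - m) / t).sum())
    print(t, val)                              # approaches min_y E(y) = 0.4
```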
General formulation
CSP:
$$\bigvee_y \Bigl[ \bigwedge_i \psi_i(y_i) \wedge \bigwedge_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Energy minimization:
$$\min_y \Bigl[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
Partition function:
$$\sum_y \Bigl[ \prod_i \psi_i(y_i) \cdot \prod_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
General formulation:
$$\bigoplus_y \Bigl[ \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j) \Bigr]$$
i.e. the same task in different semirings $(W, \oplus, \otimes)$, with $\psi_i: K \to W$ and $\psi_{ij}: K^2 \to W$.
Special cases: OrAnd, MinSum, SumProd ... (a generic sketch follows below).
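A generic sketch on a chain: one DP routine parametrized by the semiring operations $(\oplus, \otimes)$, instantiated for MinSum, SumProd, and OrAnd (the toy $\psi$ tables and all names are assumptions of this example):

```python
from functools import reduce
from math import exp, inf

def chain_dp(K, n, psi_n, psi_e, plus, times):
    """(+)_y [ (x)_i psi_i(yi) (x)_i psi_{i,i+1}(yi, y_{i+1}) ] on a chain."""
    B = [psi_n(0, k) for k in range(K)]        # Bellman function over y_0
    for i in range(1, n):                      # eliminate y_{i-1}
        B = [reduce(plus, (times(B[kk], times(psi_e(i - 1, kk, k), psi_n(i, k)))
                           for kk in range(K)))
             for k in range(K)]
    return reduce(plus, B)

# toy tables: unaries prefer label i mod 2, pairwise terms forbid equal labels
un = lambda i, k: 1 if k == i % 2 else 3
pe = lambda i, k, kk: inf if k == kk else 0
K, n = 2, 4

print(chain_dp(K, n, un, pe, min, lambda a, b: a + b))            # MinSum -> 4
print(chain_dp(K, n, lambda i, k: exp(-un(i, k)),                 # SumProd (Z)
               lambda i, k, kk: exp(-pe(i, k, kk)),
               lambda a, b: a + b, lambda a, b: a * b))
print(chain_dp(K, n, lambda i, k: True,                           # OrAnd
               lambda i, k, kk: pe(i, k, kk) < inf,
               lambda a, b: a or b, lambda a, b: a and b))
```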
State-of-the-art
All labeling problems are NP-hard in general.
All labeling problems can be solved by dynamic programming for “simple” graphs (partial $w$-trees of low tree-width).
There is a dichotomy (with respect to the properties of $\psi$) for OrAnd on general graphs (P ↔ NP).
Submodular MinSum problems on general graphs are solvable in polynomial time.
There are many efficient approximate algorithms for MinSum on general graphs.
There are (less efficient) approximate algorithms for SumProd on general graphs.
There are dichotomies for MinSum and SumProd as well (?)