(1)

Machine Learning II Markov Trees, Labeling Problems

Dmitrij Schlesinger, Carsten Rother, Dagmar Kainmueller, Florian Jug

SS2014, 16.05.2014

(2)

Markov Trees

A "usual" parametrization (similar to chains):

Discrete variables $y_i \in K$ are enumerated; each one (except the "root") has a predecessor $j(i) < i$.

The probability of a label configuration $y = (y_1, y_2, \ldots, y_n)$ is

$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{j(i)})$$

(3)

Markov Trees

Parametrization by marginal probabilities (also similar to chains):

$$p(y) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_{j(i)}) = p(y_1) \prod_{i=2}^{n} \frac{p(y_i, y_{j(i)})}{p(y_{j(i)})} = \frac{\prod_{i,j(i)} p(y_i, y_{j(i)})}{\prod_i p(y_i)^{n(i)-1}} = \prod_{i \in V} p(y_i)^{1-n(i)} \cdot \prod_{ij \in E} p(y_i, y_j),$$

where $E$ is the set of edges of a cycle-free graph $G = (V, E)$ and $n(i)$ is the degree of node $i$ (its number of incident edges).

There is no "root" anymore, the parametrization is symmetric, and nodes need not be explicitly enumerated.
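As a concrete illustration, here is a minimal Python sketch that evaluates this symmetric parametrization on a toy tree with made-up uniform marginals (the names `edges`, `node_marg`, `pair_marg` are illustrative, not from the slides):

```python
# A minimal sketch of the symmetric parametrization on a toy tree.
edges = [(0, 1), (1, 2), (1, 3)]          # cycle-free graph G = (V, E)
K = [0, 1]                                 # label set

# toy marginals: p_i(k) and p_ij(k, k'), uniform here for simplicity
node_marg = {i: {0: 0.5, 1: 0.5} for i in range(4)}
pair_marg = {e: {(a, b): 0.25 for a in K for b in K} for e in edges}

def degree(i):
    return sum(i in e for e in edges)      # n(i): number of incident edges

def p_of_labeling(y):
    """p(y) = prod_i p_i(y_i)^(1 - n(i)) * prod_{ij in E} p_ij(y_i, y_j)"""
    p = 1.0
    for i in range(4):
        p *= node_marg[i][y[i]] ** (1 - degree(i))
    for (i, j) in edges:
        p *= pair_marg[(i, j)][(y[i], y[j])]
    return p

print(p_of_labeling((0, 1, 0, 1)))         # 0.0625 for these uniform toy numbers
```

For these uniform toy marginals every one of the 16 labelings gets probability 0.0625, so the distribution sums to 1, as the factorization requires.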

(4)

Dynamic Programming for trees

In short, it is the same as for chains: variables are sequentially eliminated (replaced by Bellman functions).

It works both for SumProd problems (partition function, marginal probabilities) and for MinSum (MAP).
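A possible sketch of the MinSum variant in Python, assuming the model is given as dictionaries of unary and pairwise costs (`minsum_on_tree` and its arguments are illustrative names, not from the lecture); it returns the optimal energy only, backtracking for the optimal labeling is omitted:

```python
# MinSum (MAP) on a tree: leaves are eliminated one by one, each folding
# its Bellman function into the cost table of its remaining neighbor.
def minsum_on_tree(nodes, edges, psi_node, psi_edge, K):
    """nodes: list; edges: set of (i, j); psi_node[i][k], psi_edge[(i,j)][(k,kp)]."""
    bellman = {i: dict(psi_node[i]) for i in nodes}   # start with unary costs
    remaining = set(nodes)
    active = set(edges)
    while len(remaining) > 1:
        # pick a leaf: a node with exactly one incident active edge
        leaf = next(i for i in remaining
                    if sum(i in e for e in active) == 1)
        (i, j) = next(e for e in active if leaf in e)
        parent = j if leaf == i else i
        # eliminate the leaf: minimize over its label for each parent label
        for kp in K:
            bellman[parent][kp] += min(
                bellman[leaf][k] +
                psi_edge[(i, j)][(k, kp) if leaf == i else (kp, k)]
                for k in K)
        remaining.discard(leaf)
        active.discard((i, j))
    root = remaining.pop()
    return min(bellman[root].values())                # minimal total energy
```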

(5)

Learning (supervised)

In addition to the "usual" learning, the tree structure has to be estimated as well.

Given a training set $L = (y^1, y^2, \ldots, y^l)$ of labelings, estimate:
– the graph, i.e. a cycle-free set of edges $E$,
– the numerical parameters $\psi$ for this graph.

The parameter set to be estimated depends on the graph, which is also unknown.

The maximum likelihood reads:

$$\ln p(L; E, \psi(E)) = \sum_l \ln p(y^l; E, \psi(E)) \to \max_{E, \psi(E)}$$

(6)

Learning

"

X

l

lnp(yl;E, ψ(E))→max

ψ(E)

#

→max

E

“Nested optimization”:

a) Consider a fixed set of edgesE, solve the “inner”

optimization task with respect to ψ

b) Substitute the “inner” task by the solution of a) (the reached value of the likelihood), solve it with respect to E The solution of a) is obvious in the “usual” parametrization by the conditional probabilities:

– just setψi,j(i)(k, k0) to the corresponding normalized statisticspi,j(i)(k, k0) (analogously to the Markov chains)

(7)

Learning

The likelihood reached by the "inner" optimization
– for a fixed tree $E$ and
– with the optimal parameters $\psi_{i,j(i)}(k, k') = p_{i,j(i)}(k \mid k')$
is:

$$\sum_k p_1(k) \ln p_1(k) + \sum_{i=2}^{n} \sum_{k,k'} p_{i,j(i)}(k, k') \ln p_{i,j(i)}(k \mid k') =$$
$$= \sum_i (1 - n(i)) \sum_k p_i(k) \ln p_i(k) + \sum_{ij \in E} \sum_{k,k'} p_{ij}(k, k') \ln p_{ij}(k, k') =$$
$$= \sum_i (n(i) - 1) H(i) - \sum_{ij \in E} H(i, j) =$$
$$= -\sum_i H(i) + \sum_{ij \in E} \left[ H(i) + H(j) - H(i, j) \right] \to \max_E$$

$H$ is the entropy of the corresponding probability distribution.

(8)

Learning

$$-\sum_i H(i) + \sum_{ij \in E} \left[ H(i) + H(j) - H(i, j) \right] \to \max_E$$

The first sum does not depend on the graph structure and can be neglected.

The second sum is an "additive quality", i.e. there is a "cost" for each edge if it is included into the tree

→ find the maximal spanning tree with edge costs $c(i, j) = H(i) + H(j) - H(i, j)$ (the mutual information): http://en.wikipedia.org/wiki/Minimum_spanning_tree
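A minimal Python sketch of this structure-learning step (widely known as the Chow-Liu algorithm), assuming the training labelings are given as tuples; the maximum spanning tree is built with a simple Kruskal-style union-find, and all names are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_edges(samples, n_vars):
    """samples: list of labelings (tuples). Returns a maximum spanning tree
    over the mutual-information edge costs c(i, j) = H(i) + H(j) - H(i, j)."""
    N = len(samples)

    def entropy(counts):
        return -sum(c / N * math.log(c / N) for c in counts.values())

    H = [entropy(Counter(y[i] for y in samples)) for i in range(n_vars)]
    costs = {}
    for i, j in combinations(range(n_vars), 2):
        Hij = entropy(Counter((y[i], y[j]) for y in samples))
        costs[(i, j)] = H[i] + H[j] - Hij      # mutual information >= 0

    # Kruskal in decreasing cost order = maximum spanning tree
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i
    tree = []
    for (i, j) in sorted(costs, key=costs.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```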

(9)

Dynamic Programming for general graphs

Consider the following procedure to build a graph:
– nodes are added to the graph consecutively,
– each new node is linked to a fully connected subgraph (of the current graph) that consists of at most $w$ nodes,
– after all nodes are added, some edges are removed.

Let a graph be given. Its tree-width is the smallest $w$ such that the graph can be built by the procedure described above.

Examples: chains, trees – $w = 1$; cycles, simple networks – $w = 2$; $n \times m$ 4-connected grids – $w = \min(n, m)$.

For a fixed $w$, the task "does a given graph have tree-width $w$?" can be solved in $O(n^w)$, i.e. with polynomial time complexity. The task "find $w$ for a given graph" is NP-hard, because $w$ may be as large as $n$.

(10)

Dynamic Programming for general graphs

The idea: if the order of nodes (in which they were added during the procedure described above) is known, it is possible to eliminate them in the opposite order. The Bellman functions have order $w$ at most, i.e. $B: K^w \to \mathbb{R}$.

The dynamic programming can be performed in $O(nK^{w+1})$:
– chain: eliminate the "first" node, $w = 1$, $B: K \to \mathbb{R}$, $O(nK^2)$;
– tree: eliminate a leaf, all the same as for chains;
– cycle: eliminate an arbitrary node, $w = 2$, $B: K^2 \to \mathbb{R}$, $O(nK^3)$;
– the above example: $w = 3$, $B: K^3 \to \mathbb{R}$, $O(nK^4)$.

(11)

Labeling Problems

A graph $G = (V, E)$ with nodes $i \in V$ and edges $\{i, j\} \in E$. There is a variable $y_i$ for each $i$ that can take values from a finite discrete set $K$ (label set, states, range ...).

A labeling $y: V \to K$ is a mapping that assigns a label from $K$ to each variable $y_i$.

Extensions/variants: hyper-graphs instead of graphs, $K \subset \mathbb{R}$ (continuous range) ...

(12)

Constraint Satisfaction Problems

In each node some labels are "disabled", and likewise there are "disabled" label pairs for edges → there are local constraints that are given in the form of boolean functions $\psi_i: K \to \{0, 1\}$ and $\psi_{ij}: K^2 \to \{0, 1\}$.

A labeling is "globally consistent" if it fulfills all constraints:

$$Q(y) = \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j)$$

The task is to check whether there exists at least one consistent labeling ($\vee$ – "or", $\wedge$ – "and"):

$$Q = \bigvee_y Q(y) = \bigvee_y \left[ \bigwedge_{i \in V} \psi_i(y_i) \wedge \bigwedge_{ij \in E} \psi_{ij}(y_i, y_j) \right]$$

(13)

CSP, Examples

nSAT problems (e.g. the 2SAT problem)

Given: a boolean expression in conjunctive normal form, e.g.

$$F(a, b, c) = (a \vee \bar{b}) \wedge (\bar{a} \vee c) \wedge (b \vee c) \quad \text{with } a, b, c \in \{0, 1\}.$$

Find values for all variables so that $F(a, b, c) = 1$ holds.

Nodes – variables, label set – $\{0, 1\}$, edges – clauses (the $\vee$-terms):

$$\psi_i \equiv 1, \quad \psi_{12}(a, b) = a \vee \bar{b}, \quad \psi_{13}(a, c) = \bar{a} \vee c, \quad \psi_{23}(b, c) = b \vee c$$

Compute

$$\bigvee_{abc} F(a, b, c) = \bigvee_{abc} \left[ \psi_{12}(a, b) \wedge \psi_{13}(a, c) \wedge \psi_{23}(b, c) \right]$$
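A tiny brute-force evaluation of this existence check in Python, using the clause reading reconstructed above (the lambda names are illustrative):

```python
from itertools import product

# The 2SAT example as a CSP: one boolean function per clause (edge).
psi12 = lambda a, b: a or not b          # (a OR NOT b)
psi13 = lambda a, c: (not a) or c        # (NOT a OR c)
psi23 = lambda b, c: b or c              # (b OR c)

# Q = OR over all labelings of the AND of all constraints
Q = any(psi12(a, b) and psi13(a, c) and psi23(b, c)
        for a, b, c in product([False, True], repeat=3))
print(Q)   # True: e.g. a = b = c = 1 satisfies all clauses
```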

(14)

CSP, Examples

n-queens puzzle: place $n$ chess queens on an $n \times n$ chessboard so that no two queens threaten each other.

Nodes: $\{a, b, c, d, e, f, g, h\}$ – columns/queens (one queen per column).
Label set: $\{1, \ldots, n\}$ – the vertical position of the corresponding queen.
The graph is fully connected,

$$\psi_{ij}(k, k') = \begin{cases} 0 & \text{if they "threaten" each other} \\ 1 & \text{otherwise} \end{cases}$$
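A possible Python sketch of this pairwise constraint: two queens in columns $i$ and $j$ with rows $k$ and $k'$ threaten each other if they share a row or a diagonal. The solution vector is one classical 8-queens solution, included only for illustration:

```python
from itertools import combinations

# Pairwise constraint for queens in columns i and j with rows k and kp.
def psi(i, j, k, kp):
    threaten = (k == kp) or (abs(k - kp) == abs(i - j))
    return 0 if threaten else 1

# check a known 8-queens solution (rows per column a..h)
solution = [1, 5, 8, 6, 3, 7, 2, 4]
ok = all(psi(i, j, solution[i], solution[j])
         for i, j in combinations(range(8), 2))
print(ok)  # True: no constraint evaluates to 0
```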

(15)

CSP, Relaxation Labeling Algorithm

Look at the vicinity of a node and disable configurations that obviously cannot belong to the solution, i.e. repeat

$$\psi_i(k) := \psi_i(k) \wedge \bigwedge_{j: ij \in E} \bigvee_{k'} \psi_{ij}(k, k')$$
$$\psi_{ij}(k, k') := \psi_{ij}(k, k') \wedge \psi_i(k) \wedge \psi_j(k')$$

as long as something changes.

After the algorithm finishes there are three possible cases:

1) There is a node for which there is no allowed label
→ no consistent labeling.

2) There is exactly one allowed label in each node
→ there is a consistent labeling.

3) In some nodes more than one label is allowed
→ the task is not solved (example on the board).
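A minimal Python sketch of this propagation loop, assuming the boolean functions are stored as dictionaries (all names are illustrative):

```python
# Relaxation labeling: repeatedly discard labels without a partner on some
# incident edge, and label pairs whose endpoints were discarded.
# psi_n[i] is a dict k -> 0/1, psi_e[(i, j)] is a dict (k, kp) -> 0/1.
def relax(nodes, edges, psi_n, psi_e, K):
    changed = True
    while changed:
        changed = False
        for i in nodes:
            for k in K:
                if not psi_n[i][k]:
                    continue
                # k survives only if every incident edge offers a partner label
                for (a, b) in edges:
                    if i not in (a, b):
                        continue
                    pairs = ((k, kp) if i == a else (kp, k) for kp in K)
                    if not any(psi_e[(a, b)][p] for p in pairs):
                        psi_n[i][k] = 0
                        changed = True
                        break
        for (a, b) in edges:
            for (k, kp), v in psi_e[(a, b)].items():
                if v and not (psi_n[a][k] and psi_n[b][kp]):
                    psi_e[(a, b)][(k, kp)] = 0
                    changed = True
    return psi_n, psi_e
```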

CSP is NP-complete in general

(16)

Energy Minimization

$\psi_i: K \to \mathbb{R}$ and $\psi_{ij}: K^2 \to \mathbb{R}$ do not "disable" but "penalize".

The quality of a labeling is

$$Q(y) = \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j)$$

Find the labeling of minimal quality:

$$Q = \min_y Q(y) = \min_y \left[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \right]$$

Example: maximum a-posteriori decisions in MRFs.

CSP is a special case of energy minimization – all local qualities (values of the $\psi$ functions) are
$0$ – corresponds to the boolean $1$, i.e. "allowed", or
$\infty$ – corresponds to the boolean $0$, i.e. "disabled".

Synonyms for energy minimization: "SoftCSP", "ValuedCSP" ...

(17)

Partition Function

The probability distribution resulting from an energy $E(y)$ is

$$p(y) = \frac{1}{Z} \exp\left( -\sum_i \psi_i(y_i) - \sum_{ij} \psi_{ij}(y_i, y_j) \right) = \frac{1}{Z} \prod_i \tilde{\psi}_i(y_i) \cdot \prod_{ij} \tilde{\psi}_{ij}(y_i, y_j)$$

with

$$Z = \sum_y \prod_i \tilde{\psi}_i(y_i) \cdot \prod_{ij} \tilde{\psi}_{ij}(y_i, y_j)$$

Energy minimization is a "special case" of the partition function (the zero-temperature limit):

$$\min_y E(y) = -\lim_{t \to 0} \; t \cdot \ln \sum_y \exp\left[ -\frac{E(y)}{t} \right]$$
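A small numeric illustration of the zero-temperature limit in Python (sign conventions as reconstructed above; the toy energies are made up):

```python
import math

# -t * ln(sum_y exp(-E(y)/t)) approaches min_y E(y) as t -> 0.
energies = [3.0, 1.0, 2.5, 1.2]            # toy values of E(y)
for t in [1.0, 0.1, 0.01]:
    soft_min = -t * math.log(sum(math.exp(-e / t) for e in energies))
    print(t, soft_min)                      # approaches min(energies) = 1.0
```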

(18)

General formulation

CSP: $\displaystyle \bigvee_y \bigwedge_i \psi_i(y_i) \wedge \bigwedge_{ij} \psi_{ij}(y_i, y_j)$

Energy minimization: $\displaystyle \min_y \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j)$

Partition function: $\displaystyle \sum_y \prod_i \psi_i(y_i) \cdot \prod_{ij} \psi_{ij}(y_i, y_j)$

General formulation: $\displaystyle \bigoplus_y \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j)$

i.e. the same task in different semirings $(W, \oplus, \otimes)$ with $\psi_i: K \to W$ and $\psi_{ij}: K^2 \to W$.

Special cases: OrAnd, MinSum, SumProd ...
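A minimal Python sketch of the semiring view on a chain, where the same elimination code is instantiated with different $(\oplus, \otimes)$ operations (all names are illustrative):

```python
from functools import reduce

# The general semiring task on a chain y_1 ... y_n: one elimination routine,
# parametrized by the semiring operations plus (⊕) and times (⊗).
def chain_semiring(psi_n, psi_e, K, plus, times):
    """psi_n[t][k] and psi_e[t][(k, kp)] are the local functions K -> W, K^2 -> W."""
    B = {k: psi_n[0][k] for k in K}                  # Bellman function, order 1
    for t in range(1, len(psi_n)):
        B = {kp: reduce(plus, (times(times(B[k], psi_e[t - 1][(k, kp)]),
                                     psi_n[t][kp]) for k in K))
             for kp in K}
    return reduce(plus, B.values())

# The special cases named on the slide:
# OrAnd:   plus=lambda a, b: a or b,  times=lambda a, b: a and b   (CSP)
# MinSum:  plus=min,                  times=lambda a, b: a + b     (energy minimization)
# SumProd: plus=lambda a, b: a + b,   times=lambda a, b: a * b     (partition function)

# e.g. a 2-node OrAnd instance (CSP on a chain of length 2):
K = [0, 1]
psi_n = [{0: True, 1: True}, {0: True, 1: False}]
psi_e = [{(k, kp): k != kp for k in K for kp in K}]
print(chain_semiring(psi_n, psi_e, K,
                     lambda a, b: a or b, lambda a, b: a and b))
# True: y = (1, 0) is consistent
```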

(19)

State-of-the-art

All labeling problems are NP-hard in general.

All labeling problems can be solved by dynamic programming for "simple" graphs (partial $w$-trees of low tree-width).

There is a dichotomy (with respect to the properties of $\psi$) for OrAnd on general graphs (P ↔ NP).

Submodular MinSum on general graphs is P-solvable.

There are many efficient approximate algorithms for MinSum on general graphs.

There are (less efficient) approximate algorithms for SumProd on general graphs.

There are dichotomies for MinSum and SumProd as well (?)
