(1)

Machine Learning

Introduction to Structural Models

Dmitrij Schlesinger

WS2014/2015, 02.02.2015

(2)

The Dream ...

Color, shape, spatial relations, the relation “consists of” → a complete scene interpretation

Structural Models:

Data that consists of several parts, where not only the parts themselves carry information, but also the way in which the parts belong together.

(3)

A bit of reality

– The set of parts is given (e.g. the set of pixels in low-level vision)

– An interpretation (label) should be assigned to each part

– There are only relations between parts; the description is not hierarchical – no relation “consists of” (at least not explicitly)

Labeling Problems

These problems can at least be formulated :-), exact solutions are, however, very rare :-(

A special case – the set of “pixels” is a chain

Markov Chains

Both formulations and algorithms are relatively simple

(4)

Markov Random Fields (simplified)

(5)

Markov Random Fields (simplified)

Graph G = (V, E) with nodes i ∈ V and edges ij ∈ E; label set K
F – the set of “elementary” observations (e.g. colors)
y : V → K – labeling, y ∈ Y
x : V → F – observation (coloring), x ∈ X

An elementary event is a pair (x, y), its (negative) energy:

E(x, y) = \sum_{i \in V} \psi_i(x_i, y_i) + \sum_{ij \in E} \psi_{ij}(y_i, y_j)

The joint probability:

p(x, y) = \frac{1}{Z} \exp\big[-E(x, y)\big]

Partition Function:

Z = \sum_{x \in X,\, y \in Y} \exp\big[-E(x, y)\big]
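To make the definitions concrete, here is a minimal brute-force sketch for a toy model (illustrative potentials and a tiny chain graph, not from the lecture); for brevity the observation x is kept fixed, so the normalizer computed below is effectively Z(x) rather than the full Z over X and Y:

```python
import itertools
import math

# Toy pairwise MRF: a chain with 3 nodes and 2 labels.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
K = [0, 1]                         # label set
x = [0, 1, 0]                      # fixed observation (one "color" per node)

def psi_unary(xi, yi):
    return 0.0 if xi == yi else 1.0    # penalty if label disagrees with observation

def psi_pair(yi, yj):
    return 0.0 if yi == yj else 0.5    # Potts-like smoothness term

def energy(x, y):
    # E(x, y) = sum_i psi_i(x_i, y_i) + sum_ij psi_ij(y_i, y_j)
    return (sum(psi_unary(x[i], y[i]) for i in V)
            + sum(psi_pair(y[i], y[j]) for i, j in E))

# Normalizer over all labelings for the fixed observation
Z = sum(math.exp(-energy(x, y)) for y in itertools.product(K, repeat=len(V)))

for y in itertools.product(K, repeat=len(V)):
    print(y, round(math.exp(-energy(x, y)) / Z, 3))
```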

(6)

Inference – Bayesian Decision Theory

The principle is the same as for unstructured models – minimize the Bayesian Risk, i.e. the expected loss:

R(e) = \sum_y p(x, y) \cdot C\big(y, e(x)\big) \to \min_e

Remember that the y here are labelings → more complex algorithms.

Special case D = Y, C(y, d) = δ(y ≠ d) → Maximum A-posteriori decision:

y^* = \arg\max_y p(x, y) = \arg\min_y E(x, y) = \arg\min_y \Big[ \sum_i \psi_i(x_i, y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Big]

Such tasks are known as Energy Minimization problems.

Additive loss is widely used as well (even more often in some particular domains) → it leads to marginal-based decisions.
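For chain-structured graphs (the Markov-chain special case from slide (3)) this arg min can be computed exactly by min-sum dynamic programming. A minimal Viterbi-style sketch with illustrative potentials, not taken from the lecture:

```python
# Min-sum dynamic programming (Viterbi) for MAP inference on a chain.
# unary[i][k] plays the role of psi_i(x_i, k), pair[k][kp] of psi_ij(k, kp).
def map_on_chain(unary, pair):
    n, K = len(unary), len(unary[0])
    cost = [list(unary[0])] + [[0.0] * K for _ in range(n - 1)]
    back = [[0] * K for _ in range(n)]
    for i in range(1, n):
        for k in range(K):
            prev = min(range(K), key=lambda kp: cost[i - 1][kp] + pair[kp][k])
            back[i][k] = prev
            cost[i][k] = unary[i][k] + cost[i - 1][prev] + pair[prev][k]
    # backtrack from the best final label
    y = [0] * n
    y[-1] = min(range(K), key=lambda k: cost[-1][k])
    for i in range(n - 1, 0, -1):
        y[i - 1] = back[i][y[i]]
    return y, min(cost[-1])

unary = [[0.0, 1.0], [1.0, 0.2], [0.0, 1.0]]   # illustrative psi_i(x_i, k)
pair = [[0.0, 0.5], [0.5, 0.0]]                # Potts-like psi_ij
print(map_on_chain(unary, pair))               # MAP labeling and its energy
```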

(7)

Learning

Again, the basic principles are the same as for unstructured models:

– Statistical Learning – Maximum (Conditional) Likelihood
  - Supervised – gradient ascent (usually convex functions), stochastic gradient ...
  - Unsupervised – Expectation-Maximization algorithm, gradient ascent ...

– Discriminative Learning – Empirical Risk minimization (sub-gradient algorithms), Large Margin learning (quadratic optimization) etc.

The difference from unstructured models – more complex algorithms, because for structured models practically nothing can be computed exactly :-(

(8)

MRF-s are members of the Exponential Family

The energy can be written in the form of a scalar product: E(x, y; θ) = E(x, y; w) = ⟨φ(x, y), w⟩. This is sometimes called the “overcomplete” representation (see the board for an example).

→ For almost any kind of learning, the sufficient statistics φ(x, y) are crucial.
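As an illustration of what such an overcomplete feature vector can look like (a sketch with an assumed discrete observation alphabet, not the board example): one indicator entry per (node, observed value, label) and one per (edge, label pair), so that the scalar product with a weight vector laid out the same way reproduces the pairwise energy.

```python
import numpy as np

# Overcomplete sufficient statistics phi(x, y) for a pairwise model:
# indicator features for every (node i, observed value x_i, label y_i)
# and for every (edge e, label pair (y_i, y_j)).  With w holding the values
# psi_i(x_i, k) and psi_ij(k, kp) in the same layout, <phi(x, y), w>
# equals the energy E(x, y) defined on slide (5).
def phi(x, y, edges, n_obs, n_labels):
    n = len(x)
    f_unary = np.zeros((n, n_obs, n_labels))
    f_pair = np.zeros((len(edges), n_labels, n_labels))
    for i in range(n):
        f_unary[i, x[i], y[i]] = 1.0
    for e, (i, j) in enumerate(edges):
        f_pair[e, y[i], y[j]] = 1.0
    return np.concatenate([f_unary.ravel(), f_pair.ravel()])
```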

Interesting: the Perceptron Algorithm (which is indeed very simple) is applicable to some discriminative learning tasks.

Large Margin, SVM, Kernels etc. are possible as well
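A minimal structured-perceptron sketch for this linear parameterization; the feature map phi and the exact energy minimizer are placeholders supplied by the user (for chains, the min-sum dynamic program above can serve as the minimizer):

```python
import numpy as np

# Structured perceptron for E(x, y; w) = <phi(x, y), w> (lower energy = better).
# phi and argmin_energy are assumed to be provided by the caller.
def structured_perceptron(train, phi, argmin_energy, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_true in train:
            y_hat = argmin_energy(x, w)            # current lowest-energy labeling
            if list(y_hat) != list(y_true):
                # shift w so that the true labeling gets a lower energy
                # relative to the wrongly predicted one
                w += phi(x, y_hat) - phi(x, y_true)
    return w
```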

(9)

Conditional Random Fields

Conditional Random Fields (CRF) – MRF-s that model the posterior distribution of labelings instead of the joint one. The energy is (almost) as before:

E(x, y) = \sum_i \psi_i(y_i, x) + \sum_{ij \in E} \psi_{ij}(y_i, y_j, x)

But now – the posterior:

p(y \mid x) = \frac{1}{Z(x)} \exp\big[-E(x, y)\big]

with the image-specific Partition Function:

Z(x) = \sum_{y \in Y} \exp\big[-E(x, y)\big]

Energy Minimization → Hopfield Networks

(Hopfield) Networks with stochastic neurons → MRF-s, also known as Boltzmann Machines (Feed-Forward as well)
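The link to Hopfield-style dynamics can be illustrated with iterated conditional modes (ICM): asynchronous, greedy per-node updates that never increase the energy and stop at a local optimum. A minimal sketch with illustrative data structures, not from the lecture:

```python
# Iterated conditional modes (ICM) for a pairwise energy:
# unary[i][k] = psi_i(k, x), neighbors[i] = list of (j, pair) with
# pair[k][kp] = psi_ij(k, kp, x).  Only a local minimum is guaranteed.
def icm(unary, neighbors, sweeps=20):
    n, K = len(unary), len(unary[0])
    y = [min(range(K), key=lambda k: unary[i][k]) for i in range(n)]  # greedy init
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            local = lambda k: unary[i][k] + sum(p[k][y[j]] for j, p in neighbors[i])
            best = min(range(K), key=local)
            if best != y[i]:
                y[i], changed = best, True
        if not changed:
            break
    return y
```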

(10)

Some popular MRF-s

... of second order over the pixel grid, 4-neighborhood (because simple) – segmentation, denoising, deconvolution, stereo, motion fields etc.

... with continuous label spaces – denoising, stereo

... with dense neighborhood structure – shape modeling (e.g. curvature), segmentation

... of higher order:

E(y) = \sum_{i \in V} \psi_i(y_i) + \sum_{ij \in E} \psi_{ij}(y_i, y_j) + \sum_{ijk \in C} \psi_{ijk}(y_i, y_j, y_k) + \ldots

– all the stuff above

(11)

Labeling Problems – a generalization

Constraint Satisfaction Problems (CSP) – OrAnd:

\bigvee_y \Big[ \bigwedge_i \psi_i(y_i) \wedge \bigwedge_{ij} \psi_{ij}(y_i, y_j) \Big]

Energy Minimization – MinSum:

\min_y \Big[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Big]

Partition Function – SumProd:

\sum_y \Big[ \prod_i \psi_i(y_i) \cdot \prod_{ij} \psi_{ij}(y_i, y_j) \Big]

Generalized formulation:

\bigoplus_y \Big[ \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j) \Big]
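For a chain, the same dynamic program answers all of these, depending on which pair (⊕, ⊗) is plugged in. A minimal sketch with illustrative potentials (MinSum: (min, +); SumProd: (+, ·) on the factors exp(−ψ); OrAnd: (or, and) on boolean ψ):

```python
import math

def reduce_oplus(oplus, values):
    acc = values[0]
    for v in values[1:]:
        acc = oplus(acc, v)
    return acc

# Generic (oplus, otimes) dynamic program on a chain:
# unary[i][k] and pair[k][kp] play the role of psi_i and psi_ij.
def chain_solve(unary, pair, oplus, otimes):
    n, K = len(unary), len(unary[0])
    msg = list(unary[0])                       # aggregated value of the prefix per label
    for i in range(1, n):
        msg = [otimes(unary[i][k],
                      reduce_oplus(oplus, [otimes(msg[kp], pair[kp][k]) for kp in range(K)]))
               for k in range(K)]
    return reduce_oplus(oplus, msg)

unary = [[0.0, 1.0], [1.0, 0.2], [0.0, 1.0]]
pair = [[0.0, 0.5], [0.5, 0.0]]

# MinSum (oplus=min, otimes=+): the optimal energy
print(chain_solve(unary, pair, min, lambda a, b: a + b))
# SumProd (oplus=+, otimes=*) on exp(-psi): the partition function of the chain
eu = [[math.exp(-v) for v in row] for row in unary]
ep = [[math.exp(-v) for v in row] for row in pair]
print(chain_solve(eu, ep, lambda a, b: a + b, lambda a, b: a * b))
```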

(12)

Labeling Problems – state-of-the-art

All labeling problems are NP-complete in general

All labeling problems can be solved exactly and efficiently by Dynamic Programming if the graph is simple (chains, trees, cycles, partial k-trees of low tree-width)

For OrAnd problems over general graphs there is a dichotomy (polynomial ↔ NP) with respect to the properties of the ψ-functions

Submodular MinSum Problems are exactly solvable

There are many (good) approximate algorithms for MinSum over general graphs – relaxation based, search based, partial optimality etc.

There are also approximations for SumProd

There is also a dichotomy for MinSum and SumProd (?)

(13)

So what?

... see you in the summer semester in “Computer Vision II”,

“Machine Learning II” ...

(14)

Machine Learning II

Lecturer: Bogdan Savchynskyy (joined CVLD this year)

The “old” homepage: hci.iwr.uni-heidelberg.de/Staff/bsavchyn/

(the new one should be on the way ...)

Content: Background for expert use of graphical models (inference+learning) and development of new algorithms.

(15)

Machine Learning II

VL1. Introduction, motivation, relation to other courses.

Graphical models.

VL2. Inference in graphical models as integer linear program.

Linear programming relaxation.

VL3. Convex Optimization. Lagrange duality, complementary slackness.

VL4. Convex Optimization. First order smooth and non-smooth optimization (gradient, coordinate descent, sub-gradient, Newton method).

VL5. Convex Optimization. Lagrangian (dual) decomposition.

VL6. Tree-structured graphical models. Dynamic programming. Reparametrization.

(16)

Machine Learning II

VL7. Dual of LP relaxation. Sub-gradient and coordinate ascent.

VL8. Dual decomposition for inference with graphical models.

VL9. Sub-modularity. Dual LP as max-flow problem.

Sub-modularity as a sufficient condition for non-negativity (solvability).

VL10. Graph cut based algorithms. Roof duality. Partial optimality.

VL11. Generative learning. Marginal probabilities vs. Gibbs potentials.

VL12. Discriminative learning. Structural SVM. Cutting plane algorithm.

(17)

Machine Learning II

VL13. Structural SVM. Bundle, sub-gradient algorithms.

Latent SVM.

VL14. Outlook. What we have learned and advanced topics.

Exercises: Blackboard + OpenGM

Other courses:

– Computer Vision II will be offered

– Proseminar, Hauptseminar, Komplexpraktikum ...
