(1)

Machine Learning

Introduction to Structural Models

Dmitrij Schlesinger

WS2014/2015, 02.02.2015

(2)

The Dream ...

Color, shape, spatial relations, the relation “consists of” → a complete scene interpretation

Structural Models:

Data that consists of several parts, where not only the parts themselves carry information, but also the way in which the parts belong together.

(3)

A bit of reality

– The set of parts is given (e.g. the set of pixels in low-level vision)

– An interpretation (label) should be assigned to each part

– There are only relations between parts; the description is not hierarchical – no relation “consists of” (at least not explicitly)

Labeling Problems

These problems can at least be formulated :-), exact solutions are, however, very rare :-(

A special case – the set of “pixels” is a chain

Markov Chains

Both formulations and algorithms are relatively simple

(4)

Markov Random Fields (simplified)

(5)

Markov Random Fields (simplified)

Graph G = (V, E) with nodes i ∈ V and edges ij ∈ E; label set K
F – the set of “elementary” observations (e.g. colors)
y : V → K – labeling, y ∈ Y
x : V → F – observation (coloring), x ∈ X

An elementary event is a pair (x, y), its (negative) energy:

E(x, y) = \sum_{i \in V} \psi_i(x_i, y_i) + \sum_{ij \in E} \psi_{ij}(y_i, y_j)

The joint probability:

p(x, y) = \frac{1}{Z} \exp\big[-E(x, y)\big]

Partition Function:

Z = \sum_{x \in X,\, y \in Y} \exp\big[-E(x, y)\big]
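To make the definitions concrete, here is a minimal brute-force sketch for a toy model (illustrative potentials and a tiny chain graph, not from the lecture); for brevity the observation x is kept fixed, so the normalizer computed below is effectively Z(x) rather than the full Z over X and Y:

```python
import itertools
import math

# Toy pairwise MRF: a chain with 3 nodes and 2 labels.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
K = [0, 1]                         # label set
x = [0, 1, 0]                      # fixed observation (one "color" per node)

def psi_unary(xi, yi):
    return 0.0 if xi == yi else 1.0    # penalty if label disagrees with observation

def psi_pair(yi, yj):
    return 0.0 if yi == yj else 0.5    # Potts-like smoothness term

def energy(x, y):
    # E(x, y) = sum_i psi_i(x_i, y_i) + sum_ij psi_ij(y_i, y_j)
    return (sum(psi_unary(x[i], y[i]) for i in V)
            + sum(psi_pair(y[i], y[j]) for i, j in E))

# Normalizer over all labelings for the fixed observation
Z = sum(math.exp(-energy(x, y)) for y in itertools.product(K, repeat=len(V)))

for y in itertools.product(K, repeat=len(V)):
    print(y, round(math.exp(-energy(x, y)) / Z, 3))
```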

(6)

Inference – Bayesian Decision Theory

The principle is the same as for unstructured models – minimize the Bayesian Risk, i.e. the expected loss:

R(e) = \sum_y p(x, y) \cdot C\big(y, e(x)\big) \to \min_e

Remember that the y here are labelings → more complex algorithms.

Special case D = Y, C(y, d) = δ(y ≠ d) → Maximum A-posteriori decision:

y^* = \arg\max_y p(x, y) = \arg\min_y E(x, y) = \arg\min_y \Big[ \sum_i \psi_i(x_i, y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Big]

Such tasks are known as Energy Minimization problems.

Additive loss is widely used as well (even more often in some particular domains) → it leads to marginal-based decisions.
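For chain-structured graphs (the Markov-chain special case from slide (3)) this arg min can be computed exactly by min-sum dynamic programming. A minimal Viterbi-style sketch with illustrative potentials, not taken from the lecture:

```python
# Min-sum dynamic programming (Viterbi) for MAP inference on a chain.
# unary[i][k] plays the role of psi_i(x_i, k), pair[k][kp] of psi_ij(k, kp).
def map_on_chain(unary, pair):
    n, K = len(unary), len(unary[0])
    cost = [list(unary[0])] + [[0.0] * K for _ in range(n - 1)]
    back = [[0] * K for _ in range(n)]
    for i in range(1, n):
        for k in range(K):
            prev = min(range(K), key=lambda kp: cost[i - 1][kp] + pair[kp][k])
            back[i][k] = prev
            cost[i][k] = unary[i][k] + cost[i - 1][prev] + pair[prev][k]
    # backtrack from the best final label
    y = [0] * n
    y[-1] = min(range(K), key=lambda k: cost[-1][k])
    for i in range(n - 1, 0, -1):
        y[i - 1] = back[i][y[i]]
    return y, min(cost[-1])

unary = [[0.0, 1.0], [1.0, 0.2], [0.0, 1.0]]   # illustrative psi_i(x_i, k)
pair = [[0.0, 0.5], [0.5, 0.0]]                # Potts-like psi_ij
print(map_on_chain(unary, pair))               # MAP labeling and its energy
```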

(7)

Learning

Again, the basic principles are the same as for unstructured models:

– Statistical Learning – Maximum (Conditional) Likelihood
  - Supervised – gradient ascent (usually convex functions), stochastic gradient ...
  - Unsupervised – Expectation-Maximization algorithm, gradient ascent ...

– Discriminative Learning – Empirical Risk minimization (sub-gradient algorithms), Large Margin learning (quadratic optimization) etc.

The difference from unstructured models – more complex algorithms, because for structured models practically nothing can be computed exactly :-(

(8)

MRF-s are members of the Exponential Family

The energy can be written in the form of a scalar product: E(x, y; θ) = E(x, y; w) = ⟨φ(x, y), w⟩. This is sometimes called the “overcomplete” representation (see the board for an example).

→ For almost any kind of learning, the sufficient statistics φ(x, y) are crucial.
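As an illustration of what such an overcomplete feature vector can look like (a sketch with an assumed discrete observation alphabet, not the board example): one indicator entry per (node, observed value, label) and one per (edge, label pair), so that the scalar product with a weight vector laid out the same way reproduces the pairwise energy.

```python
import numpy as np

# Overcomplete sufficient statistics phi(x, y) for a pairwise model:
# indicator features for every (node i, observed value x_i, label y_i)
# and for every (edge e, label pair (y_i, y_j)).  With w holding the values
# psi_i(x_i, k) and psi_ij(k, kp) in the same layout, <phi(x, y), w>
# equals the energy E(x, y) defined on slide (5).
def phi(x, y, edges, n_obs, n_labels):
    n = len(x)
    f_unary = np.zeros((n, n_obs, n_labels))
    f_pair = np.zeros((len(edges), n_labels, n_labels))
    for i in range(n):
        f_unary[i, x[i], y[i]] = 1.0
    for e, (i, j) in enumerate(edges):
        f_pair[e, y[i], y[j]] = 1.0
    return np.concatenate([f_unary.ravel(), f_pair.ravel()])
```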

Interesting: the Perceptron Algorithm (which is indeed very simple) is applicable to some discriminative learning tasks.

Large Margin, SVM, Kernels etc. are possible as well
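A minimal structured-perceptron sketch for this linear parameterization; the feature map phi and the exact energy minimizer are placeholders supplied by the user (for chains, the min-sum dynamic program above can serve as the minimizer):

```python
import numpy as np

# Structured perceptron for E(x, y; w) = <phi(x, y), w> (lower energy = better).
# phi and argmin_energy are assumed to be provided by the caller.
def structured_perceptron(train, phi, argmin_energy, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_true in train:
            y_hat = argmin_energy(x, w)            # current lowest-energy labeling
            if list(y_hat) != list(y_true):
                # shift w so that the true labeling gets a lower energy
                # relative to the wrongly predicted one
                w += phi(x, y_hat) - phi(x, y_true)
    return w
```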

(9)

Conditional Random Fields

Conditional Random Fields (CRF) – MRF-s that model the posterior distribution of labelings instead of the joint one. The energy is (almost) as before:

E(x, y) = \sum_i \psi_i(y_i, x) + \sum_{ij \in E} \psi_{ij}(y_i, y_j, x)

But now – the posterior:

p(y \mid x) = \frac{1}{Z(x)} \exp\big[-E(x, y)\big]

with the image-specific Partition Function:

Z(x) = \sum_{y \in Y} \exp\big[-E(x, y)\big]

Energy Minimization → Hopfield Networks

(Hopfield) Networks with stochastic neurons → MRF-s, also known as Boltzmann Machines (Feed-Forward as well)
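The link to Hopfield-style dynamics can be illustrated with iterated conditional modes (ICM): asynchronous, greedy per-node updates that never increase the energy and stop at a local optimum. A minimal sketch with illustrative data structures, not from the lecture:

```python
# Iterated conditional modes (ICM) for a pairwise energy:
# unary[i][k] = psi_i(k, x), neighbors[i] = list of (j, pair) with
# pair[k][kp] = psi_ij(k, kp, x).  Only a local minimum is guaranteed.
def icm(unary, neighbors, sweeps=20):
    n, K = len(unary), len(unary[0])
    y = [min(range(K), key=lambda k: unary[i][k]) for i in range(n)]  # greedy init
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            local = lambda k: unary[i][k] + sum(p[k][y[j]] for j, p in neighbors[i])
            best = min(range(K), key=local)
            if best != y[i]:
                y[i], changed = best, True
        if not changed:
            break
    return y
```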

(10)

Some popular MRF-s

... of second order over the pixel grid, 4-neighborhood (because simple) – segmentation, denoising, deconvolution, stereo, motion fields etc.

... with continuous label spaces – denoising, stereo

... with dense neighborhood structure – shape modeling (e.g. curvature), segmentation

... of higher order:

E(y) = \sum_{i \in V} \psi_i(y_i) + \sum_{ij \in E} \psi_{ij}(y_i, y_j) + \sum_{ijk \in C} \psi_{ijk}(y_i, y_j, y_k) + \ldots

– all the stuff above

(11)

Labeling Problems – a generalization

Constraint Satisfaction Problems (CSP) – OrAnd:

\bigvee_y \Big[ \bigwedge_i \psi_i(y_i) \wedge \bigwedge_{ij} \psi_{ij}(y_i, y_j) \Big]

Energy Minimization – MinSum:

\min_y \Big[ \sum_i \psi_i(y_i) + \sum_{ij} \psi_{ij}(y_i, y_j) \Big]

Partition Function – SumProd:

\sum_y \Big[ \prod_i \psi_i(y_i) \cdot \prod_{ij} \psi_{ij}(y_i, y_j) \Big]

Generalized formulation:

\bigoplus_y \Big[ \bigotimes_i \psi_i(y_i) \otimes \bigotimes_{ij} \psi_{ij}(y_i, y_j) \Big]
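For a chain, the same dynamic program answers all of these, depending on which pair (⊕, ⊗) is plugged in. A minimal sketch with illustrative potentials (MinSum: (min, +); SumProd: (+, ·) on the factors exp(−ψ); OrAnd: (or, and) on boolean ψ):

```python
import math

def reduce_oplus(oplus, values):
    acc = values[0]
    for v in values[1:]:
        acc = oplus(acc, v)
    return acc

# Generic (oplus, otimes) dynamic program on a chain:
# unary[i][k] and pair[k][kp] play the role of psi_i and psi_ij.
def chain_solve(unary, pair, oplus, otimes):
    n, K = len(unary), len(unary[0])
    msg = list(unary[0])                       # aggregated value of the prefix per label
    for i in range(1, n):
        msg = [otimes(unary[i][k],
                      reduce_oplus(oplus, [otimes(msg[kp], pair[kp][k]) for kp in range(K)]))
               for k in range(K)]
    return reduce_oplus(oplus, msg)

unary = [[0.0, 1.0], [1.0, 0.2], [0.0, 1.0]]
pair = [[0.0, 0.5], [0.5, 0.0]]

# MinSum (oplus=min, otimes=+): the optimal energy
print(chain_solve(unary, pair, min, lambda a, b: a + b))
# SumProd (oplus=+, otimes=*) on exp(-psi): the partition function of the chain
eu = [[math.exp(-v) for v in row] for row in unary]
ep = [[math.exp(-v) for v in row] for row in pair]
print(chain_solve(eu, ep, lambda a, b: a + b, lambda a, b: a * b))
```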

(12)

Labeling Problems – state-of-the-art

All labeling problems are NP-complete in general

All labeling problems can be solved exactly and efficiently by Dynamic Programming if the graph is simple (chains, trees, cycles, partial k-trees of low tree-width)

For OrAnd problems over general graphs there is a dichotomy (polynomial ↔ NP) with respect to the properties of the ψ-functions

Submodular MinSum Problems are exactly solvable

There are many (good) approximate algorithms for MinSum over general graphs – relaxation based, search based, partial optimality etc.

There are also approximations for SumProd

There is also a dichotomy for MinSum and SumProd (?)

(13)

So what?

... see you in the summer semester in “Computer Vision II”,

“Machine Learning II” ...

(14)

Machine Learning II

Lecturer: Bogdan Savchynskyy (joined CVLD this year)

The “old” homepage: hci.iwr.uni-heidelberg.de/Staff/bsavchyn/

(the new one should be on the way ...)

Content: Background for expert use of graphical models (inference+learning) and development of new algorithms.

(15)

Machine Learning II

VL1. Introduction, motivation, relation to other courses.

Graphical models.

VL2. Inference in graphical models as integer linear program.

Linear programming relaxation.

VL3. Convex Optimization. Lagrange duality, complementary slackness.

VL4. Convex Optimization. First order smooth and non-smooth optimization (gradient, coordinate descent, sub-gradient, Newton method).

VL5. Convex Optimization. Lagrangian (dual) decomposition.

VL6. Tree-structured graphical models. Dynamic programming. Reparametrization.

(16)

Machine Learning II

VL7. Dual of LP relaxation. Sub-gradient and coordinate ascent.

VL8. Dual decomposition for inference with graphical models.

VL9. Sub-modularity. Dual LP as max-flow problem.

Sub-modularity as a sufficient condition for non-negativity (solvability).

VL10. Graph cut based algorithms. Roof duality. Partial optimality.

VL11. Generative learning. Marginal probabilities vs. Gibbs potentials.

VL12. Discriminative learning. Structural SVM. Cutting plane algorithm.

(17)

Machine Learning II

VL13. Structural SVM. Bundle, sub-gradient algorithms.

Latent SVM.

VL14. Outlook. What we have learned and advanced topics.

Exercises: Blackboard + OpenGM

Other courses:

– Computer Vision II will be offered

– Proseminar, Hauptseminar, Komplexpraktikum ...
