Machine Learning

Empirical Risk Minimization

Dmitrij Schlesinger

WS2014/2015, 08.12.2014

Recap – tasks considered before

For a training dataset L = {(x_l, y_l)} with x_l ∈ ℝⁿ (data) and y_l ∈ {+1, −1} (classes), find a separating hyperplane:

    y_l · [⟨w, x_l⟩ + b] ≥ 0  ∀l

→ Perceptron algorithm

The goal is to find a "corridor" (stripe) of maximal width that separates the data:

    (1/2)‖w‖² → min_w
    s.t.  y_l · [⟨w, x_l⟩ + b] ≥ 1  ∀l

→ Large Margin learning, linear SVM

In both cases the data is assumed to be separable. What if not?
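The Perceptron algorithm mentioned above can be sketched as follows (a minimal NumPy illustration, not the lecture's own code; the function name and the fixed epoch cap are choices of this sketch):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Find w, b with y_l * (<w, x_l> + b) > 0 for all l (separable data).

    X: (m, n) array of data points, y: array of labels in {+1, -1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_l, y_l in zip(X, y):
            # a point on the wrong side (or on the hyperplane) triggers an update
            if y_l * (np.dot(w, x_l) + b) <= 0:
                w += y_l * x_l
                b += y_l
                errors += 1
        if errors == 0:  # all constraints satisfied: stop
            break
    return w, b
```

For separable data the loop terminates with a valid separating hyperplane; for non-separable data it simply stops after the epoch cap, which is exactly the limitation this lecture addresses.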

Empirical Risk Minimization

Let a loss function C(y, y′) be given that penalizes deviations between the true class and the estimated one (like the loss in Bayesian decision theory). The Empirical Risk of a decision strategy e is the total loss over the training set:

    R(e) = Σ_l C(y_l, e(x_l)) → min_e

It should be minimized with respect to the decision strategy e.

Special case (today):

– the set of decisions is {+1, −1}
– the loss is the delta-function C(y, y′) = δ(y ≠ y′)
– the decision strategy can be expressed in the form e(x) = sign f(x) with an evaluation function f : X → ℝ

Example: f(x) = ⟨w, x⟩ − b is a linear classifier.
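For this special case the Empirical Risk is simply the number of misclassified training points. A small sketch (NumPy; the function name is hypothetical):

```python
import numpy as np

def empirical_risk(f, X, y):
    """R(e) = sum_l C(y_l, e(x_l)) with the delta-loss C(y, y') = delta(y != y')
    and the decision strategy e(x) = sign(f(x))."""
    decisions = np.sign(f(X))
    return int(np.sum(decisions != y))

# Example evaluation function: a linear classifier f(x) = <w, x> - b
f = lambda X: X @ np.array([1.0]) - 1.5
```

With the delta-loss, `empirical_risk` counts errors on the training set; minimizing it over f directly is the (non-convex) problem discussed next.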

Hinge Loss

The problem: the objective is not convex.

The way out: replace the real loss by its convex upper bound:

    δ(y ≠ sign f(x)) ≤ max(0, 1 − y · f(x))

(The plot shows the bound for y = 1; for y = −1 it is mirrored.)

This upper bound is called the Hinge Loss.
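The bound is one line of code. A sketch (NumPy; the function names are hypothetical) that also makes the upper-bound property checkable:

```python
import numpy as np

def hinge_loss(y, f_x):
    """Convex upper bound max(0, 1 - y*f(x)) on the delta-loss."""
    return np.maximum(0.0, 1.0 - y * f_x)

def delta_loss(y, f_x):
    """The original non-convex 0/1 loss delta(y != sign f(x))."""
    return float(np.sign(f_x) != y)
```

Note that the hinge loss already penalizes correctly classified points with margin y·f(x) < 1, which is what later produces the margin-maximizing behaviour.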

Sub-gradient algorithm

Let the evaluation function be parameterized, i.e. f(x; θ), for example θ = (w, b) for linear classifiers.

The optimization problem reads:

    H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ

It is convex with respect to f but not differentiable. Solution by the sub-gradient (descent) algorithm:

1. Compute a sub-gradient (see below).
2. Apply it with a step size γ(t) that decreases over time:

    θ(t+1) = θ(t) − γ(t) · ∂H/∂θ

with lim_{t→∞} γ(t) = 0 and Σ_{t=1}^∞ γ(t) = ∞ (e.g. γ(t) = 1/t).

The sub-gradient for the Hinge loss

    H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ

1. Determine the data points with Hinge loss greater than zero:

    L0 = {l : y_l · f(x_l; θ) < 1}

2. The sub-gradient is

    ∂H/∂θ = − Σ_{l∈L0} y_l · ∂f(x_l; θ)/∂θ

In particular, for linear classifiers

    ∂H/∂w = − Σ_{l∈L0} y_l · x_l

i.e. some data points are added to the parameter vector → it is reminiscent of the Perceptron algorithm.
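Putting steps 1 and 2 together for the linear case gives the following sketch (NumPy; function name, epoch count, and the schedule γ(t) = 1/t are choices of this illustration):

```python
import numpy as np

def hinge_subgradient_descent(X, y, epochs=200):
    """Minimize H(w, b) = sum_l max(0, 1 - y_l*(<w, x_l> + b)) by sub-gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for t in range(1, epochs + 1):
        gamma = 1.0 / t                       # gamma -> 0 and sum_t gamma = inf
        margins = y * (X @ w + b)
        active = margins < 1                  # L0: points with positive Hinge loss
        # sub-gradient: dH/dw = -sum_{l in L0} y_l * x_l, dH/db = -sum_{l in L0} y_l
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= gamma * grad_w
        b -= gamma * grad_b
    return w, b
```

Each update adds (scaled) misclassified or low-margin points to w, which is the Perceptron-like behaviour noted above.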

Kernelization

Let us optimize the Hinge loss in a feature space Φ : X → H,
→ the (linear) evaluation function is no longer f(x; w) = ⟨x, w⟩ but f(x; w) = ⟨Φ(x), w⟩.

Remember (see the previous lecture) that the evaluation function can be expressed as a linear combination:

    f(x; α) = ⟨Φ(x), w⟩ = ⟨Φ(x), Σ_i α_i y_i Φ(x_i)⟩
            = Σ_i α_i y_i ⟨Φ(x), Φ(x_i)⟩ = Σ_i α_i y_i κ(x, x_i)

The objective to be minimized now reads:

    H(α) = Σ_l max(0, 1 − y_l Σ_i α_i y_i κ(x_l, x_i)) → min_α

Kernelization

The task:

    H(α) = Σ_l max(0, 1 − y_l Σ_i α_i y_i κ(x_l, x_i)) → min_α

The sub-gradient:

    ∂H/∂α_i = −y_i Σ_{l∈L0} y_l κ(x_l, x_i)

As usual for kernels, neither the feature space H nor the mapping Φ(x) is needed in order to estimate the α-s: the kernel κ(x, x′) alone is enough!
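A sketch of the kernelized sub-gradient updates (NumPy; the RBF kernel, all names, and the parameters are choices of this illustration, not fixed by the lecture):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian RBF kernel kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_hinge_subgradient(X, y, epochs=200, sigma=1.0):
    """Minimize H(alpha) = sum_l max(0, 1 - y_l * sum_i alpha_i y_i kappa(x_l, x_i))."""
    K = rbf_kernel(X, X, sigma)               # only the kernel matrix is needed
    alpha = np.zeros(len(X))
    for t in range(1, epochs + 1):
        gamma = 1.0 / t
        f = K @ (alpha * y)                   # f(x_l) = sum_i alpha_i y_i kappa(x_l, x_i)
        active = y * f < 1                    # L0
        # dH/dalpha_i = -y_i * sum_{l in L0} y_l kappa(x_l, x_i)
        grad = -y * (y[active][:, None] * K[active]).sum(axis=0)
        alpha -= gamma * grad
    return alpha
```

Note that Φ never appears: the updates touch only the precomputed kernel matrix K.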

Maximum margin vs. minimum loss

– Linear SVM: maximum margin, separable data
– Non-separable data: minimize the Empirical Risk with the Hinge loss
– "Kernelization" can be done for both variants

Does it make sense to minimize a loss (especially the Hinge loss, which has a "geometrical nature") defined in the feature space? It is indeed always possible to make the training set separable by choosing a suitable kernel.

Interestingly, both formulations are equivalent under certain circumstances.

A "unified" formulation

In Machine Learning it is a common technique to augment an objective function (e.g. the average loss) with a regularizer:

    R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w)

with

– parameter vector w
– loss ℓ(x_l, y_l, w), e.g. delta, Hinge, metric, additive etc.
– regularizer R(w), e.g. ‖w‖², |w|₁ etc.
– balancing factor C

Now: we start from the unified formulation, make some assumptions and end up with maximum margin learning.
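As a concrete instance of the unified objective, here is a sketch assuming R(w) = (1/2)‖w‖², the Hinge loss, and a linear classifier without bias (NumPy; the function name is hypothetical):

```python
import numpy as np

def regularized_objective(w, X, y, C):
    """Unified formulation R(w) + (C/|L|) * sum_l loss(x_l, y_l, w)
    with R(w) = 0.5 * ||w||^2 and the Hinge loss for f(x) = <w, x>."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C / len(X) * hinge.sum()
```

Varying C trades the regularizer against the average loss, which is exactly the knob discussed in the summary below.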

Hinge loss → maximum margin

    R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w

Assumption: the training set is separable, i.e. there exists a w such that the loss is zero. Set C to a very high value and obtain:

    R(w) → min_w
    s.t.  ℓ(x_l, y_l, w) = 0  ∀l

Set R(w) = (1/2)‖w‖² and ℓ to the Hinge loss (for linear classifiers):

    ℓ(x_l, y_l, w) = max(0, 1 − y_l⟨w, x_l⟩) = 0

This results in maximum margin learning:

    (1/2)‖w‖² → min_w
    s.t.  y_l⟨w, x_l⟩ ≥ 1  ∀l

Maximum margin → Hinge loss

Let us "weaken" maximum margin learning by introducing so-called slack variables ξ_l ≥ 0 that represent "errors":

    (1/2)‖w‖² + (C/|L|) · Σ_l ξ_l → min_w
    s.t.  y_l⟨w, x_l⟩ ≥ 1 − ξ_l  ∀l

Note that we want to minimize the total error.

On the other hand, ξ_l ≥ 1 − y_l⟨w, x_l⟩ and ξ_l ≥ 0, so at the optimum each ξ_l takes its smallest feasible value:

    ξ_l = max(0, 1 − y_l⟨w, x_l⟩)

Substituting this gives

    (1/2)‖w‖² + (C/|L|) · Σ_l max(0, 1 − y_l⟨w, x_l⟩) → min_w

i.e. the Hinge loss minimization.

Summary

    R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w

Two extremes:

– Big C → the loss is more important → better recognition rate but smaller margin (worse generalization)
– Small C → the generalization is more important → larger margin (more robust classifier) but worse recognition rate

Recommended reading:
Sebastian Nowozin and Christopher H. Lampert, "Structured Prediction and Learning in Computer Vision", Foundations and Trends in Computer Graphics and Vision, Vol. 6, Nr. 3-4.
http://www.nowozin.net/sebastian/cvpr2012tutorial
