Machine Learning
Empirical Risk Minimization
Dmitrij Schlesinger
WS2014/2015, 08.12.2014
Recap – tasks considered before
For a training set L = {(x_l, y_l)} with x_l ∈ ℝⁿ (data) and y_l ∈ {+1, −1} (classes), find a separating hyperplane:

y_l · [⟨w, x_l⟩ + b] ≥ 0  ∀l
→Perceptron algorithm
The goal is to find a "corridor" (stripe) of maximal width that separates the data
→ Large Margin learning, linear SVM:

½‖w‖² → min_w
s.t. y_l · [⟨w, x_l⟩ + b] ≥ 1  ∀l

In both cases the data is assumed to be separable.
What if not?
Empirical Risk Minimization
Let a loss function C(y, y′) be given that penalizes deviations between the true class and the estimated one (like the loss in Bayesian decision theory). The Empirical Risk of a decision strategy is the total loss over the training set:
R(e) = Σ_l C(y_l, e(x_l)) → min_e
It should be minimized with respect to the decision strategy e.
Special case (today):
– the set of decisions is {+1, −1}
– the loss is the delta function C(y, y′) = δ(y ≠ y′)
– the decision strategy can be expressed in the form e(x) = sign f(x) with an evaluation function f : X → ℝ

Example: f(x) = ⟨w, x⟩ − b is a linear classifier.
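As an illustration, a minimal Python (NumPy) sketch of this special case; the function name, data and parameters are made-up toy values:

```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Empirical risk of the linear classifier e(x) = sign(<w, x> - b)
    under the delta (0/1) loss, summed over the training set."""
    f = X @ w - b              # evaluation function f(x) = <w, x> - b
    e = np.sign(f)             # decision strategy e(x) = sign f(x)
    return np.sum(e != y)      # delta loss: 1 for each misclassified example

# toy data: two points per class
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
print(empirical_risk(np.array([1.0, 1.0]), 0.0, X, y))  # -> 0 (separable toy data)
```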
Hinge Loss
The problem: the objective is not convex
The way out: replace the real loss by its convex upper bound
[Figure: the 0/1 loss and its convex upper bound as functions of f(x), shown for y = 1; for y = −1 the picture is flipped.]

δ(y ≠ sign f(x)) ≤ max(0, 1 − y · f(x))
The right-hand side is called the Hinge loss.
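A quick numerical sanity check of the bound, as a Python sketch (the function names are just for illustration):

```python
import numpy as np

def zero_one_loss(y, f):
    return (np.sign(f) != y).astype(float)   # delta(y != sign f(x))

def hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f)      # max(0, 1 - y * f(x))

f = np.linspace(-3.0, 3.0, 13)
assert np.all(hinge_loss(+1, f) >= zero_one_loss(+1, f))   # upper bound for y = +1
assert np.all(hinge_loss(-1, f) >= zero_one_loss(-1, f))   # ... and for y = -1
```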
Sub-gradient algorithm
Let the evaluation function be parameterized, i.e. f(x; θ), for example θ = (w, b) for linear classifiers.
The optimization problem reads:
H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ
It is convex with respect to f, but not differentiable. Solution: the sub-gradient (descent) algorithm:
1. Compute the sub-gradient (later)
2. Apply it with a step size that decreases over time:

θ^(t+1) = θ^(t) − γ(t) · ∂H/∂θ

with lim_{t→∞} γ(t) = 0 and Σ_{t=1}^∞ γ(t) = ∞ (e.g. γ(t) = 1/t)
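As a sketch, the generic scheme might look as follows in Python; `subgradient_of_H` is a hypothetical user-supplied oracle returning a sub-gradient ∂H/∂θ at the current point:

```python
def subgradient_descent(theta0, subgradient_of_H, T=1000):
    """Generic sub-gradient descent: gamma(t) = 1/t satisfies both
    gamma(t) -> 0 and sum_t gamma(t) = infinity."""
    theta = theta0
    for t in range(1, T + 1):
        # theta^(t+1) = theta^(t) - gamma(t) * dH/dtheta
        theta = theta - (1.0 / t) * subgradient_of_H(theta)
    return theta
```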
The sub-gradient for the Hinge loss
H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ
1. Find the data points whose Hinge loss is greater than zero: L₀ = {l : y_l · f(x_l) < 1}
2. The sub-gradient is

∂H/∂θ = −Σ_{l∈L₀} y_l · ∂f(x_l; θ)/∂θ

In particular, for linear classifiers
∂H/∂w = −Σ_{l∈L₀} y_l · x_l
i.e. some data points are added to the parameter vector
→ this is reminiscent of the Perceptron algorithm
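Putting the pieces together, a minimal sketch of the resulting algorithm for linear classifiers (bias omitted for brevity; the names are illustrative):

```python
import numpy as np

def hinge_subgradient_train(X, y, T=1000):
    """Minimize H(w) = sum_l max(0, 1 - y_l * <w, x_l>) by sub-gradient descent."""
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        L0 = y * (X @ w) < 1                          # L0 = {l : y_l * f(x_l) < 1}
        subgrad = -(y[L0, None] * X[L0]).sum(axis=0)  # -sum_{l in L0} y_l * x_l
        w -= (1.0 / t) * subgrad                      # i.e. violating points are added to w
    return w
```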
Kernelization
Let us optimize the Hinge loss in a feature space Φ : X → H,
→ the (linear) evaluation function is not f(x; w) = ⟨x, w⟩ but f(x; w) = ⟨Φ(x), w⟩
Remember (see the previous lecture) that the evaluation function can be expressed as a linear combination
f(x; α) = ⟨Φ(x), w⟩ = ⟨Φ(x), Σ_i α_i y_i Φ(x_i)⟩ = Σ_i α_i y_i ⟨Φ(x), Φ(x_i)⟩ = Σ_i α_i y_i κ(x, x_i)
The objective to be minimized now reads:

H(α) = Σ_l max(0, 1 − Σ_i α_i y_i κ(x_l, x_i)) → min_α
Kernelization
The task:
H(α) = Σ_l max(0, 1 − Σ_i α_i y_i κ(x_l, x_i)) → min_α
The sub-gradient:
∂H/∂α_i = −y_i · Σ_{l∈L₀} y_l · κ(x_l, x_i)
As usual with kernels, neither the feature space H nor the mapping Φ(x) is needed in order to estimate the α's, as long as the kernel κ(x, x′) is given!
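A sketch of the kernelized variant, assuming a precomputed Gram matrix K; the RBF kernel in the usage example is just one possible choice:

```python
import numpy as np

def kernel_hinge_train(K, y, T=1000):
    """Minimize H(alpha) = sum_l max(0, 1 - sum_i alpha_i y_i K[l, i]).
    K is the precomputed kernel (Gram) matrix, K[l, i] = kappa(x_l, x_i)."""
    alpha = np.zeros(len(y))
    for t in range(1, T + 1):
        f = K @ (alpha * y)               # f(x_l) = sum_i alpha_i y_i kappa(x_l, x_i)
        L0 = y * f < 1                    # points with nonzero Hinge loss
        subgrad = -y * (K[L0].T @ y[L0])  # dH/dalpha_i = -y_i sum_{l in L0} y_l kappa(x_l, x_i)
        alpha -= (1.0 / t) * subgrad
    return alpha

# usage example: RBF kernel on 1-D toy data
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([+1, +1, -1, -1])
K = np.exp(-np.square(X - X.T))           # kappa(x, x') = exp(-(x - x')^2)
alpha = kernel_hinge_train(K, y)
```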
Maximum margin vs. minimum loss
– Linear SVM: maximum margin, separable data
– Non-separable data: minimize the Empirical Risk, Hinge loss
– "Kernelization" can be done for both variants
Does it make sense to minimize a loss (especially the Hinge loss, which has a "geometric" nature) defined in the feature space?
It is indeed always possible to make the training set separable by choosing a suitable kernel.
Interestingly, both formulations are equivalent under certain circumstances.
A "unified" formulation
In Machine Learning it is a common technique to augment an objective function (e.g. the average loss) with a regularizer:
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w)
with
– parameter vector w
– loss ℓ(x_l, y_l, w), e.g. delta, Hinge, metric, additive etc.
– regularizer R(w), e.g. ‖w‖², ‖w‖₁ etc.
– balancing factor C
Now we start from the unified formulation, make some assumptions, and end up with maximum margin learning.
Hinge loss → maximum margin
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w
Assumption: the training set is separable, i.e. there exists a w such that the loss is zero. Set C to a very high value and obtain
R(w) → min_w
s.t. ℓ(x_l, y_l, w) = 0  ∀l
Set R(w) = ½‖w‖² and let ℓ be the Hinge loss (for linear classifiers):

ℓ(x_l, y_l, w) = max(0, 1 − y_l · ⟨w, x_l⟩) = 0

This results in maximum margin learning:
½‖w‖² → min_w
s.t. y_l · ⟨w, x_l⟩ ≥ 1  ∀l
Maximum margin → Hinge loss
Let us "weaken" maximum margin learning by introducing so-called slack variables ξ_l ≥ 0 that represent "errors":
½‖w‖² + (C/|L|) · Σ_l ξ_l → min_{w,ξ}
s.t. y_l · ⟨w, x_l⟩ ≥ 1 − ξ_l  ∀l

Note that we want to minimize the total error.
On the other hand,

ξ_l ≥ 1 − y_l · ⟨w, x_l⟩ and ξ_l ≥ 0 ⇒ ξ_l = max(0, 1 − y_l · ⟨w, x_l⟩)

since at the optimum each ξ_l takes its smallest feasible value. Substituting this gives
½‖w‖² + (C/|L|) · Σ_l max(0, 1 − y_l · ⟨w, x_l⟩) → min_w
i.e. the Hinge loss minimization
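A small numerical check of this substitution in Python, with made-up data, w and C: setting each ξ_l to its smallest feasible value makes the two objectives coincide for any w.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # made-up training data
y = rng.choice([-1, 1], size=20)
w = rng.normal(size=3)                # an arbitrary parameter vector
C = 10.0

# smallest feasible slacks: xi_l >= 1 - y_l <w, x_l> and xi_l >= 0
xi = np.maximum(0.0, 1.0 - y * (X @ w))

soft_margin = 0.5 * w @ w + C / len(y) * xi.sum()
hinge       = 0.5 * w @ w + C / len(y) * np.maximum(0.0, 1.0 - y * (X @ w)).sum()
assert np.isclose(soft_margin, hinge)  # identical objectives
```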
Summary
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w
Two extremes:
– Big C → the loss is more important → better recognition rate, but smaller margin (worse generalization)
– Small C → the generalization is more important → larger margin (more robust classifier), but worse recognition rate

Recommended reading:
Sebastian Nowozin and Christopher H. Lampert,
"Structured Prediction and Learning in Computer Vision", Foundations and Trends in Computer Graphics and Vision, Vol. 6, Nr. 3-4
http://www.nowozin.net/sebastian/cvpr2012tutorial