Machine Learning
Empirical Risk Minimization
Dmitrij Schlesinger
WS2014/2015, 08.12.2014
Recap – tasks considered before
For a training set L = {(x_l, y_l)} with x_l ∈ ℝⁿ (data) and y_l ∈ {+1, −1} (classes), find a separating hyperplane:

y_l · [⟨w, x_l⟩ + b] ≥ 0  ∀l
→Perceptron algorithm
The goal is to find a "corridor" (stripe) of maximal width that separates the data
→ Large Margin learning, linear SVM:

½‖w‖² → min_w
s.t. y_l · [⟨w, x_l⟩ + b] ≥ 1  ∀l

In both cases the data is assumed to be separable.
What if not?
Empirical Risk Minimization
Let a loss function C(y, y′) be given that penalizes deviations between the true class and the estimated one (like the loss in Bayesian decision theory). The Empirical Risk of a decision strategy is the total loss over the training set:
R(e) = Σ_l C(y_l, e(x_l)) → min_e
It should be minimized with respect to the decision strategy e.
Special case (today):
– the set of decisions is {+1, −1}
– the loss is the delta function C(y, y′) = δ(y ≠ y′)
– the decision strategy can be expressed in the form e(x) = sign f(x) with an evaluation function f : X → ℝ

Example: f(x) = ⟨w, x⟩ − b is a linear classifier.
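As an illustration, a minimal Python (NumPy) sketch of this special case; the function name, data and parameters are made-up toy values:

```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Empirical risk of the linear classifier e(x) = sign(<w, x> - b)
    under the delta (0/1) loss, summed over the training set."""
    f = X @ w - b              # evaluation function f(x) = <w, x> - b
    e = np.sign(f)             # decision strategy e(x) = sign f(x)
    return np.sum(e != y)      # delta loss: 1 for each misclassified example

# toy data: two points per class
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
print(empirical_risk(np.array([1.0, 1.0]), 0.0, X, y))  # -> 0 (separable toy data)
```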
Hinge Loss
The problem: the objective is not convex
The way out: replace the real loss by its convex upper bound
[Figure: the 0/1 loss and its convex upper bound as functions of f(x), shown for y = 1; for y = −1 the picture is flipped.]

δ(y ≠ sign f(x)) ≤ max(0, 1 − y · f(x))
The right-hand side is called the Hinge loss.
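A quick numerical sanity check of the bound, as a Python sketch (the function names are just for illustration):

```python
import numpy as np

def zero_one_loss(y, f):
    return (np.sign(f) != y).astype(float)   # delta(y != sign f(x))

def hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f)      # max(0, 1 - y * f(x))

f = np.linspace(-3.0, 3.0, 13)
assert np.all(hinge_loss(+1, f) >= zero_one_loss(+1, f))   # upper bound for y = +1
assert np.all(hinge_loss(-1, f) >= zero_one_loss(-1, f))   # ... and for y = -1
```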
Sub-gradient algorithm
Let the evaluation function be parameterized, i.e. f(x; θ), for example θ = (w, b) for linear classifiers.
The optimization problem reads:
H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ
It is convex with respect to f, but not differentiable. Solution: the sub-gradient (descent) algorithm:
1. Compute the sub-gradient (later)
2. Apply it with a step size that decreases over time:

θ^(t+1) = θ^(t) − γ(t) · ∂H/∂θ

with lim_{t→∞} γ(t) = 0 and Σ_{t=1}^∞ γ(t) = ∞ (e.g. γ(t) = 1/t)
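As a sketch, the generic scheme might look as follows in Python; `subgradient_of_H` is a hypothetical user-supplied oracle returning a sub-gradient ∂H/∂θ at the current point:

```python
def subgradient_descent(theta0, subgradient_of_H, T=1000):
    """Generic sub-gradient descent: gamma(t) = 1/t satisfies both
    gamma(t) -> 0 and sum_t gamma(t) = infinity."""
    theta = theta0
    for t in range(1, T + 1):
        # theta^(t+1) = theta^(t) - gamma(t) * dH/dtheta
        theta = theta - (1.0 / t) * subgradient_of_H(theta)
    return theta
```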
The sub-gradient for the Hinge loss
H(θ) = Σ_l max(0, 1 − y_l · f(x_l; θ)) → min_θ
1. Find the data points whose Hinge loss is greater than zero: L₀ = {l : y_l · f(x_l) < 1}
2. The sub-gradient is

∂H/∂θ = −Σ_{l∈L₀} y_l · ∂f(x_l; θ)/∂θ

In particular, for linear classifiers
∂H/∂w = −Σ_{l∈L₀} y_l · x_l
i.e. some data points are added to the parameter vector
→ this is reminiscent of the Perceptron algorithm
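Putting the pieces together, a minimal sketch of the resulting algorithm for linear classifiers (bias omitted for brevity; the names are illustrative):

```python
import numpy as np

def hinge_subgradient_train(X, y, T=1000):
    """Minimize H(w) = sum_l max(0, 1 - y_l * <w, x_l>) by sub-gradient descent."""
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        L0 = y * (X @ w) < 1                          # L0 = {l : y_l * f(x_l) < 1}
        subgrad = -(y[L0, None] * X[L0]).sum(axis=0)  # -sum_{l in L0} y_l * x_l
        w -= (1.0 / t) * subgrad                      # i.e. violating points are added to w
    return w
```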
Kernelization
Let us optimize the Hinge loss in a feature space Φ : X → H,
→ the (linear) evaluation function is not f(x; w) = ⟨x, w⟩ but f(x; w) = ⟨Φ(x), w⟩
Remember (see the previous lecture) that the evaluation function can be expressed as a linear combination
f(x; α) = ⟨Φ(x), w⟩ = ⟨Φ(x), Σ_i α_i y_i Φ(x_i)⟩ = Σ_i α_i y_i ⟨Φ(x), Φ(x_i)⟩ = Σ_i α_i y_i κ(x, x_i)
The objective to be minimized now reads:

H(α) = Σ_l max(0, 1 − Σ_i α_i y_i κ(x_l, x_i)) → min_α
Kernelization
The task:
H(α) = Σ_l max(0, 1 − Σ_i α_i y_i κ(x_l, x_i)) → min_α
The sub-gradient:
∂H/∂α_i = −y_i · Σ_{l∈L₀} y_l · κ(x_l, x_i)
As usual with kernels, neither the feature space H nor the mapping Φ(x) is needed in order to estimate the α's, as long as the kernel κ(x, x′) is given!
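A sketch of the kernelized variant, assuming a precomputed Gram matrix K; the RBF kernel in the usage example is just one possible choice:

```python
import numpy as np

def kernel_hinge_train(K, y, T=1000):
    """Minimize H(alpha) = sum_l max(0, 1 - sum_i alpha_i y_i K[l, i]).
    K is the precomputed kernel (Gram) matrix, K[l, i] = kappa(x_l, x_i)."""
    alpha = np.zeros(len(y))
    for t in range(1, T + 1):
        f = K @ (alpha * y)               # f(x_l) = sum_i alpha_i y_i kappa(x_l, x_i)
        L0 = y * f < 1                    # points with nonzero Hinge loss
        subgrad = -y * (K[L0].T @ y[L0])  # dH/dalpha_i = -y_i sum_{l in L0} y_l kappa(x_l, x_i)
        alpha -= (1.0 / t) * subgrad
    return alpha

# usage example: RBF kernel on 1-D toy data
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([+1, +1, -1, -1])
K = np.exp(-np.square(X - X.T))           # kappa(x, x') = exp(-(x - x')^2)
alpha = kernel_hinge_train(K, y)
```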
Maximum margin vs. minimum loss
– Linear SVM: maximum margin, separable data
– Non-separable data: minimize the Empirical Risk, Hinge loss
– "Kernelization" can be done for both variants
Does it make sense to minimize a loss (especially the Hinge loss, which has a "geometric" nature) defined in the feature space?
It is indeed always possible to make the training set separable by choosing a suitable kernel.
Interestingly, both formulations are equivalent under certain circumstances.
A "unified" formulation
In Machine Learning it is a common technique to augment an objective function (e.g. the average loss) with a regularizer:
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w)
with
– parameter vector w
– loss ℓ(x_l, y_l, w), e.g. delta, Hinge, metric, additive etc.
– regularizer R(w), e.g. ‖w‖², ‖w‖₁ etc.
– balancing factor C
Now we start from the unified formulation, make some assumptions, and end up with maximum margin learning.
Hinge loss → maximum margin
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w
Assumption: the training set is separable, i.e. there exists a w such that the loss is zero. Set C to a very high value and obtain
R(w) → min_w
s.t. ℓ(x_l, y_l, w) = 0  ∀l
Set R(w) = ½‖w‖² and let ℓ be the Hinge loss (for linear classifiers):

ℓ(x_l, y_l, w) = max(0, 1 − y_l · ⟨w, x_l⟩) = 0

This results in maximum margin learning:
½‖w‖² → min_w
s.t. y_l · ⟨w, x_l⟩ ≥ 1  ∀l
Maximum margin → Hinge loss
Let us "weaken" maximum margin learning by introducing so-called slack variables ξ_l ≥ 0 that represent "errors":
½‖w‖² + (C/|L|) · Σ_l ξ_l → min_{w,ξ}
s.t. y_l · ⟨w, x_l⟩ ≥ 1 − ξ_l  ∀l

Note that we want to minimize the total error.
On the other hand,

ξ_l ≥ 1 − y_l · ⟨w, x_l⟩ and ξ_l ≥ 0 ⇒ ξ_l = max(0, 1 − y_l · ⟨w, x_l⟩)

since at the optimum each ξ_l takes its smallest feasible value. Substituting this gives
½‖w‖² + (C/|L|) · Σ_l max(0, 1 − y_l · ⟨w, x_l⟩) → min_w
i.e. the Hinge loss minimization
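A small numerical check of this substitution in Python, with made-up data, w and C: setting each ξ_l to its smallest feasible value makes the two objectives coincide for any w.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # made-up training data
y = rng.choice([-1, 1], size=20)
w = rng.normal(size=3)                # an arbitrary parameter vector
C = 10.0

# smallest feasible slacks: xi_l >= 1 - y_l <w, x_l> and xi_l >= 0
xi = np.maximum(0.0, 1.0 - y * (X @ w))

soft_margin = 0.5 * w @ w + C / len(y) * xi.sum()
hinge       = 0.5 * w @ w + C / len(y) * np.maximum(0.0, 1.0 - y * (X @ w)).sum()
assert np.isclose(soft_margin, hinge)  # identical objectives
```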
Summary
R(w) + (C/|L|) · Σ_l ℓ(x_l, y_l, w) → min_w
Two extremes:
– Big C → the loss is more important → better recognition rate, but smaller margin (worse generalization)
– Small C → the generalization is more important → larger margin (more robust classifier), but worse recognition rate

Recommended reading:
Sebastian Nowozin and Christopher H. Lampert,
"Structured Prediction and Learning in Computer Vision", Foundations and Trends in Computer Graphics and Vision, Vol. 6, Nr. 3-4
http://www.nowozin.net/sebastian/cvpr2012tutorial