(1)

Pattern Recognition

Hinge Loss

(2)

Recap – tasks considered before

Let a training dataset be given with (i) data points and (ii) their class labels

The goal is to find a hyperplane that separates the data

→ Perceptron algorithm
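
As a brief reminder, a minimal sketch of the Perceptron update in Python; the labels in {+1, -1} and the fixed number of passes are assumptions, not taken from the slide:

```python
import numpy as np

def perceptron(X, y, n_epochs=100):
    """Classical Perceptron: add misclassified points y_n * x_n to the weights."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:   # point on the wrong side (or on the hyperplane)
                w += y_n * x_n         # Perceptron update
    return w
```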

________________________________________________________

The goal is to find a “corridor” (stripe) of maximal width that separates the data

→ Large margin learning, linear SVM

In both cases the training set is assumed to be separable.

What if not?

(3)

Empirical risk minimization

Let (in addition to the training data) a loss function C(y, y′) be given that penalizes deviations between the true class and the estimated one (the same as the cost function in Bayesian decision theory).

The Empirical Risk of a decision strategy is the total loss accumulated over the training data.

It should be minimized with respect to the decision strategy.

Special case (today):

• the set of decisions is the set of classes

• the loss is the (simplest) delta-function, i.e. 1 if the estimated class differs from the true one and 0 otherwise

• the decision strategy can be expressed in terms of an evaluation function

Example: a linear evaluation function yields a linear classifier.
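
A hedged formalization of this slide; the symbols x_n, y_n for the training pairs, e for the decision strategy and f for the evaluation function are notational assumptions, while the loss C(y, y′) follows the wording above:

```latex
% Empirical Risk of a decision strategy e: total loss over the training data
R(e) \;=\; \sum_{n=1}^{N} C\bigl(y_n,\, e(x_n)\bigr) \;\longrightarrow\; \min_{e}

% Special case considered today: delta-loss, decisions = classes,
% strategy given by the sign of an evaluation function f, e.g. a linear one
C(y, y') \;=\; \delta(y \neq y'), \qquad
e(x) \;=\; \operatorname{sign} f(x), \qquad
f(x) \;=\; \langle w, x \rangle
```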

(4)

Hinge Loss

The problem: the subject to be minimized (the empirical risk with the delta-loss) is not convex

The way out: replace the real loss by its convex upper bound

[Figure: the Hinge Loss as a function of the evaluation value, shown for one of the two classes; for the other class the plot is flipped.]

This convex upper bound is called the Hinge Loss.
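
A minimal sketch of the bound, assuming classes y in {+1, -1} and the standard Hinge Loss with unit margin (the exact constant is not visible on the slide):

```latex
% The delta-loss of the decision sign f(x) is bounded from above by the Hinge Loss
\delta\bigl(y \neq \operatorname{sign} f(x)\bigr)
\;\le\;
\max\bigl(0,\; 1 - y\, f(x)\bigr)

% For y = +1 this is the hinge shown in the figure; for y = -1 it is flipped.
```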

(5)

Sub-gradient algorithm

Let the evaluation function be parameterized by a parameter vector, for example by the weight vector of a linear classifier.

The optimization problem is to minimize the total Hinge Loss over the training data with respect to the parameters.

It is convex with respect to the parameters but non-differentiable.

Solution by the sub-gradient (descent) algorithm:

1. Compute the sub-gradient (later)

2. Apply it with a step size that is decreasing in time

with $\sum_t \gamma^{(t)} = \infty$ and $\sum_t \bigl(\gamma^{(t)}\bigr)^2 < \infty$ (e.g. $\gamma^{(t)} = 1/t$), as in the sketch below
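
A minimal Python sketch of the generic scheme; the 1/t step size is one admissible choice satisfying the conditions above, and the objective with its sub-gradient is a placeholder supplied by the caller:

```python
import numpy as np

def subgradient_descent(subgradient, w0, n_steps=1000):
    """Generic sub-gradient descent with a decreasing step size gamma_t = 1/t."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, n_steps + 1):
        g = subgradient(w)   # any sub-gradient of the (convex) objective at w
        gamma = 1.0 / t      # sum of gamma_t diverges, sum of gamma_t^2 converges
        w = w - gamma * g
    return w
```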

(6)

Sub-gradient algorithm

Recall the task of interest: minimize the total Hinge Loss with respect to the parameters.

Computation of the sub-gradient for the Hinge Loss:

1. Determine the data points for which the Hinge Loss is greater than zero

2. The sub-gradient is the sum of the loss gradients over these points

In particular, for linear classifiers the sub-gradient step means that some data points are added (weighted) to the parameter vector

→ this is reminiscent of the Perceptron algorithm
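
A hedged sketch for the linear case, assuming classes y_n in {+1, -1}, the evaluation function f(x; w) = <w, x> and the unit-margin Hinge Loss max(0, 1 - y <w, x>); the last line of the loop makes the resemblance to the Perceptron update visible:

```python
import numpy as np

def train_linear_hinge(X, y, n_steps=1000):
    """Sub-gradient descent on the total Hinge Loss of a linear classifier.

    X: (N, d) data matrix, y: (N,) class labels in {+1, -1} (assumed).
    """
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        margins = y * (X @ w)        # y_n * <w, x_n> for all points
        violated = margins < 1.0     # points with Hinge Loss > 0 (step 1)
        g = -(y[violated, None] * X[violated]).sum(axis=0)   # sub-gradient (step 2)
        w -= (1.0 / t) * g           # violating points y_n * x_n are added
                                     # (weighted by 1/t) to w, as in the Perceptron
    return w
```

Unlike the Perceptron, all violating points of one pass are accumulated into a single step, and the step is scaled by the decreasing factor 1/t.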

(7)

Kernelization

Remember: the evaluation function can be expressed as a weighted sum of kernel values between the input and the training points.

The subject to be minimized is the total Hinge Loss, now viewed as a function of the expansion coefficients,

and the sub-gradient is taken with respect to these coefficients.

As usual for kernels, neither the feature space nor the mapping is needed in order to estimate the evaluation function, if the kernel is given.
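
A possible form of the kernelized quantities; the expansion coefficients alpha_n, the unit margin and the labels in {+1, -1} are assumptions, the slide only states that everything can be expressed through the kernel k:

```latex
% Evaluation function as a kernel expansion over the training points
f(x) \;=\; \sum_{n} \alpha_n\, k(x_n, x)

% Objective (total Hinge Loss) as a function of the coefficients alpha
\sum_{m} \max\Bigl(0,\; 1 - y_m \sum_{n} \alpha_n\, k(x_n, x_m)\Bigr) \;\longrightarrow\; \min_{\alpha}

% Sub-gradient with respect to alpha_n: sum over the points m with positive loss
\frac{\partial}{\partial \alpha_n}: \quad -\sum_{m:\ \text{loss}\,>\,0} y_m\, k(x_n, x_m)
```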

(8)

Maximum margin vs. minimum loss

• Linear SVM – maximum margin learning, separable data

• Non-separable data – Empirical Risk Minimization, Hinge Loss

• “Kernelization” can be performed for both variants

Does it make sense to minimize the loss defined in the feature space?

It is indeed always possible to make the training set separable by choosing a suitable kernel.

Interestingly, both formulations are equivalent under certain circumstances.

(9)

Maximum margin vs. minimum loss

In Machine Learning it is a common technique to augment an objective function (e.g. the average loss) with a regularizer.

A “unified” formulation (spelled out in the sketch after the list):

with

• parameter vector

• loss – e.g. delta, hinge, metric, additive etc.

• regularizer – e.g. the squared norm of the parameter vector, etc.

• balancing factor
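
A sketch of the unified objective under assumed symbols: w for the parameter vector, C for the loss, Omega for the regularizer and lambda for the balancing factor; placing lambda on the loss term is an assumption chosen to match the limit argument on the next slide:

```latex
% Regularized formulation: regularizer plus the average loss weighted by lambda
\Omega(w) \;+\; \lambda\,\frac{1}{N}\sum_{n=1}^{N} C\bigl(y_n,\, f(x_n; w)\bigr)
\;\longrightarrow\; \min_{w}

% e.g. C = delta- or Hinge Loss, \Omega(w) = \|w\|^2 or \|w\|_1, \lambda > 0
```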

(10)

Maximum margin vs. minimum loss

Assumption: the training set is separable, i.e. the average loss can be made zero. Setting the balancing factor of the loss term to a very high value, the above formulation reduces to minimizing the regularizer subject to zero loss on the training data.

Set the regularizer to the squared norm of the parameter vector and the loss to the Hinge Loss for linear classifiers.

We obtain exactly the maximum margin learning (the linear SVM), as sketched below.
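
A hedged reconstruction of the argument, assuming the unit-margin Hinge Loss, the regularizer ||w||^2 and a separable training set:

```latex
% A very large lambda forces the Hinge Loss to be exactly zero on every point,
% so the regularizer is minimized subject to the zero-loss constraints:
\min_{w}\ \|w\|^2
\quad \text{s.t.} \quad
\max\bigl(0,\; 1 - y_n \langle w, x_n\rangle\bigr) = 0 \ \ \forall n
\;\;\Longleftrightarrow\;\;
\min_{w}\ \|w\|^2
\quad \text{s.t.} \quad
y_n \langle w, x_n\rangle \;\ge\; 1 \ \ \forall n

% The right-hand problem is exactly hard-margin (maximum margin) learning for a
% linear classifier, i.e. the linear SVM.
```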

(11)

The last slide 

Recommended reading:

Sebastian Nowozin and Christoph H. Lampert,

"Structured Prediction and Learning in Computer Vision“, Foundations and Trends in Computer Graphics and Vision, Volume 6, Number 3-4

http://www.nowozin.net/sebastian/cvpr2012tutorial/

_______________________________________________________

Next time: AdaBoost – how to combine simple (bad, weak) classifiers in order to obtain a complex (good, strong) one.
