(1)

Machine Learning AdaBoost

Dmitrij Schlesinger

WS2013/2014, 31.01.2014

(2)

Idea

Compose a “strong” classifier from “weak” ones.

Compare with SVM – complex feature spaces, one classifier.

Given:

– a set of weak classifiers $H$; example: linear classifiers for two classes, $h \in H: X \to \{-1, +1\}$, $h(x) = \operatorname{sign}(\langle x, w \rangle + b)$

– labeled training data $(x_1, k_1), (x_2, k_2), \ldots, (x_m, k_m)$, $x_i \in X$, $k_i \in \{-1, +1\}$

Find a strong classifier

$$f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

with $h_t \in H$, $\alpha_t \in \mathbb{R}$ that separates the training data as well as possible.
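The strong classifier is simply a weighted vote of the weak ones. As a small illustration (my own sketch, not code from the lecture; the two linear weak classifiers and their weights below are made up), evaluating $f(x)$ in Python:

```python
import numpy as np

# Sketch: evaluate f(x) = sign(sum_t alpha_t * h_t(x)) for a given x.
# The weak classifiers h(x) = sign(<x, w> + b) and the weights are invented.

def strong_classify(x, weak_classifiers, alphas):
    """Weighted majority vote in {-1, +1}."""
    vote = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return 1 if vote >= 0 else -1

weak = [
    lambda x: 1 if np.dot(x, [1.0, 0.0]) - 0.5 >= 0 else -1,
    lambda x: 1 if np.dot(x, [0.0, 1.0]) + 0.2 >= 0 else -1,
]
alphas = [0.8, 0.3]

print(strong_classify(np.array([1.0, -1.0]), weak, alphas))   # -> 1
```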

(3)

Power of the set of decision strategies

Is it possible to classify any training data without errors?

If not all of them, which ones can be?

Yes, it is (if the number of classifiers used is not restricted).

Example for $x \in \mathbb{R}$: $h(x) = \operatorname{sign}(\langle x, w \rangle + b) = \pm\operatorname{sign}(x - b)$

Key idea: it is possible to build an “elementary” classifier for each particular data point that is “neutral” for all others.
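To make this concrete for $x \in \mathbb{R}$ (my own sketch, not from the slides): two opposite stumps placed just left and right of a chosen point $x_i$ produce a combined vote that is nonzero only in a small interval around $x_i$ and cancels to zero everywhere else.

```python
# Two opposite stumps h1(x) = sign(x - (xi - d)) and h2(x) = -sign(x - (xi + d))
# add up to 2 inside the interval (xi - d, xi + d) and to 0 outside it,
# i.e. the pair is "neutral" for all other data points.

def sign(v):
    return 1 if v >= 0 else -1

xi, d = 3.0, 0.1                      # target point and a small margin (made up)
h1 = lambda x: sign(x - (xi - d))     # +1 to the right of xi - d
h2 = lambda x: -sign(x - (xi + d))    # +1 to the left of xi + d

for x in [0.0, 2.0, 3.0, 4.0, 10.0]:
    print(x, h1(x) + h2(x))           # vote is 2 only at x = 3.0, else 0
```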

(4)

Power of the set of decision strategies

Examples in $x \in \mathbb{R}^2$

(5)

Algorithm

Given: $(x_1, k_1), (x_2, k_2), \ldots, (x_m, k_m)$, $x_i \in X$, $k_i \in \{-1, +1\}$

Initialize weights for all samples as $D^{(1)}(i) = 1/m$.

For $t = 1, \ldots, T$:

(1) Choose (learn) a weak classifier $h_t \in H$ taking into account the current weights $D^{(t)}$

(2) Choose $\alpha_t$

(3) Update weights:

$$D^{(t+1)}(i) = \frac{D^{(t)}(i) \cdot \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$

with a normalizing factor $Z_t$ so that $\sum_i D^{(t+1)}(i) = 1$.

The final strong classifier is:

$$f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$
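A minimal sketch of this loop in Python (my own illustration, not the lecturer's code). It assumes a callback `weak_learner(X, y, D)` that trains on the weighted samples and returns a function $h: X \to \{-1, +1\}$; a possible decision-stump learner is sketched after the next slide.

```python
import numpy as np

def adaboost(X, y, weak_learner, T=10):
    """y: numpy array with entries in {-1, +1}; X: sequence of samples."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D^(1)(i) = 1/m
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)              # (1) learn h_t with the current weights
        pred = np.array([h(x) for x in X])
        eps = D[pred != y].sum()               # weighted error of h_t
        if eps >= 0.5:                         # no better than chance -> stop
            break
        eps = max(eps, 1e-12)                  # guard against a perfect h_t
        alpha = 0.5 * np.log((1.0 - eps) / eps)        # (2) choose alpha_t
        D = D * np.exp(-alpha * y * pred)              # (3) update weights ...
        D /= D.sum()                           # ... and normalize: sum_i D(i) = 1
        hs.append(h)
        alphas.append(alpha)

    def f(x):                                  # the final strong classifier
        s = sum(a * h(x) for h, a in zip(hs, alphas))
        return 1 if s >= 0 else -1
    return f
```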

(6)

Algorithm (1)

Choose (learn) a weak classifier $h_t \in H$:

$$h_t = \arg\min_{h \in H} \epsilon(D^{(t)}, h) = \arg\min_{h \in H} \sum_i D^{(t)}(i) \cdot \delta\bigl(y_i \neq h(x_i)\bigr)$$

i.e. choose the best one from a pre-defined family $H$ – it can be an SVM, hinge loss, whatever ...

Note: AdaBoost is a meta-algorithm; it can work with practically any classifier family $H$.

The specialty here: the data points are weighted with $D(i)$.

Prerequisite: the error $\epsilon(D^{(t)}, h) < 1/2$

– the best $h_t$ should be no worse than a random choice.

Otherwise – stop.
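A possible weak learner for the sketch above (again my own illustration): a decision stump chosen by brute force to minimize the weighted error $\sum_i D(i)\,\delta(y_i \neq h(x_i))$.

```python
import numpy as np

def stump_learner(X, y, D):
    """Return h(x) = s * sign(x[j] - theta) with minimal weighted error."""
    X = np.asarray(X, dtype=float)
    best_err, best = np.inf, None
    for j in range(X.shape[1]):                    # every feature dimension
        for theta in np.unique(X[:, j]):           # every observed threshold
            pred = np.where(X[:, j] >= theta, 1, -1)
            for s in (1, -1):                      # both orientations
                err = D[s * pred != y].sum()       # weighted misclassification
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    return lambda x: s * (1 if x[j] >= theta else -1)
```

Plugged into the earlier sketch as `adaboost(X, y, stump_learner)`, this gives a complete runnable toy version of the algorithm.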

(7)

Algorithm (2)

Choose $\alpha_t$.

The goal is to build $f(x)$ so that its error

$$\epsilon(f) = \frac{1}{m} \sum_i \delta\bigl(y_i \neq f(x_i)\bigr)$$

is minimal.

The upper bound for the overall error is $\epsilon(f) \le \prod_{t=1}^{T} Z_t$.

→ choose $\alpha_t$ greedily so that the current $Z_t$ is minimal:

$$Z_t = \sum_i D^{(t)}(i) \cdot \exp(-\alpha_t y_i h_t(x_i)) \to \min_{\alpha_t}$$

The task is convex and differentiable → analytical solution:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon(D^{(t)}, h_t)}{\epsilon(D^{(t)}, h_t)}\right)$$
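The closed form follows from a short calculation that the slide skips: since $y_i h_t(x_i) \in \{-1, +1\}$, $Z_t$ splits into the correctly classified samples (total weight $1 - \epsilon_t$) and the misclassified ones (weight $\epsilon_t$), writing $\epsilon_t = \epsilon(D^{(t)}, h_t)$:

$$Z_t(\alpha_t) = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}$$

$$\frac{\partial Z_t}{\partial \alpha_t} = -(1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 0 \;\Rightarrow\; e^{2\alpha_t} = \frac{1 - \epsilon_t}{\epsilon_t} \;\Rightarrow\; \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$$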

(8)

Algorithm (3)

Update weights.

$$D^{(t+1)}(i) \propto D^{(t)}(i) \cdot \exp(-\alpha_t y_i h_t(x_i))$$

Note: $\alpha_t > 0$

$y_i h_t(x_i) > 0$ (correct) $\Rightarrow$ $\exp(-\alpha_t y_i h_t(x_i)) < 1$

$y_i h_t(x_i) < 0$ (error) $\Rightarrow$ $\exp(-\alpha_t y_i h_t(x_i)) > 1$

The samples that are currently misclassified get more weight

⇒ the classifier $h_{t+1}$ in the next round will focus on classifying exactly these correctly.
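A tiny numeric illustration of this effect (made-up numbers, my own sketch):

```python
import numpy as np

# Four samples with uniform weights; only the last one is misclassified by h_t.
D = np.array([0.25, 0.25, 0.25, 0.25])   # current weights D^(t)
margins = np.array([1, 1, 1, -1])        # y_i * h_t(x_i)
eps = D[margins < 0].sum()               # weighted error = 0.25
alpha = 0.5 * np.log((1 - eps) / eps)    # ~0.549

D_new = D * np.exp(-alpha * margins)
D_new /= D_new.sum()                     # normalize
print(D_new)                             # ~[0.167 0.167 0.167 0.5]
```

After the update, the misclassified sample carries half of the total weight.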

Examples by Sochman, Tippetts, Freund

http://cseweb.ucsd.edu/~yfreund/adaboost/

(9)

Summary

History:

1990 – Boost-by-majority algorithm (Freund)

1995 – AdaBoost (Freund & Schapire)

1997 – Generalized version of AdaBoost (Schapire & Singer) – the version presented today

2001 – AdaBoost in Face Detection (Viola & Jones)

Some interesting properties:

– AB is a simple linear combination of (linear) classifiers

– AB converges to the logarithm of the likelihood ratio

– AB has good generalization capabilities (?)

– AB is a feature selector

See also: http://www.boosting.org/

(10)

Viola & Jones’s Face Detection

Rapid Object Detection using a Boosted Cascade of Simple Features (CVPR 2001)

Haar features can be computed very fast

24×24 window × ... → 180,000 feature values per position!

Weak classifier – convolution with a Haar kernel ≶ threshold

AdaBoost for learning – feature selector

The two best features

Best features – 0.1 to 0.3 error, the other ones – 0.4 to 0.5
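The "very fast" part relies on the integral image used in the Viola & Jones paper: after one cumulative-sum pass over the image, the sum over any rectangle costs four lookups, independent of its size, so a two-rectangle Haar feature needs eight lookups. A rough sketch (function names and the example feature are my own):

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img[0..r, 0..c]."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] via four lookups in the integral image."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.random.rand(24, 24)              # one 24x24 window
ii = integral_image(img)
# A two-rectangle ("edge") Haar feature: left half minus right half of a patch
feature = rect_sum(ii, 0, 0, 12, 6) - rect_sum(ii, 0, 6, 12, 12)
```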

(11)

Viola & Jones’s Face Detection

Database: 130 images, 507 faces

Overall error rate – about 7%
