
Algorithms and Uncertainty, Winter 2018/19 Lecture 24 (4 pages)

Boosting

Instructor: Thomas Kesselheim

Often, you may face the situation that you have a way to classify data points in a reasonably good way but not accurately enough. For example, the reason might be that you are using hypotheses that are not expressive enough. Today, we will get to know a powerful technique that allows us to “boost” the accuracy of classification on the training set.

1 A Motivating Example

Let us start with a motivating example. While this example itself feels a little trivial, it hopefully does not take too much imagination to see that such problems also arise in more complex scenarios.

Suppose you have to classify points on the real line ℝ. All you start from is the set of decision stumps: these are the hypotheses of the form

h(x) = { −1 if x < a,  1 if x ≥ a }    or    h(x) = { −1 if x > a,  1 if x ≤ a } .

[Figure: the two types of decision stumps, drawn as thresholds at a on the real line]
Note that we can write such a hypothesis very succinctly by two parameters w_1 ∈ ℝ, w_2 ∈ {−1, 1} such that h_{w_1,w_2}(x) = w_2 · sign(x − w_1).¹
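As a quick illustration, this two-parameter representation can be sketched in Python (the function name `stump` is ours, not from the notes):

```python
import numpy as np

def stump(w1, w2):
    """Decision stump h_{w1,w2}(x) = w2 * sign(x - w1)."""
    def h(x):
        # sign with the convention sign(0) := +1 from the footnote
        return w2 * np.where(np.asarray(x, dtype=float) >= w1, 1, -1)
    return h

h = stump(0.0, 1)
print(h([-2, 0, 3]))   # [-1  1  1]
```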

Suppose your ground truth is actually that all x ∈ [a, b] are positive and all x ∉ [a, b] are negative. Even in this very simple example, you cannot get a training error below 1/3. For example, if S = {(−1, −1), (0, +1), (1, −1)}, one of the points will always be classified incorrectly.

However, a linear combination of classifiers will perform well. Given any training set S = {(x_1, y_1), . . . , (x_m, y_m)}, define h for any z < min_i x_i by

h(x) = sign( h_{a,1}(x) + h_{b,−1}(x) + h_{z,−1}(x) ) .

For each i, this ensures that h(x_i) = y_i, so err_S(h) = 0. Note that this is still not the ground truth: below z, our classification is incorrect.

[Figure: the real line with marks at z, a, and b]

Our goal today will be exactly to extend this idea to general hypothesis classes.
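For concreteness, the three-stump construction can be checked in a few lines of code (the concrete values a = −0.5, b = 0.5, z = −2 are ours, chosen to fit the three-point example; they are not from the notes):

```python
import numpy as np

def sgn(v):
    # sign with the convention sign(0) := +1, as in the footnote
    return np.where(np.asarray(v, dtype=float) >= 0, 1, -1)

def h(x, a=-0.5, b=0.5, z=-2.0):
    # h(x) = sign( h_{a,1}(x) + h_{b,-1}(x) + h_{z,-1}(x) )
    return sgn(sgn(x - a) - sgn(x - b) - sgn(x - z))

S = [(-1, -1), (0, +1), (1, -1)]
print(all(h(x) == y for x, y in S))  # True: err_S(h) = 0
print(h(-3))                         # 1: incorrect below z
```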

¹ Redefine the sign function to be +1 at 0 for this purpose.


2 Problem Statement

Suppose we are given a training set S = {(x_1, y_1), . . . , (x_m, y_m)} and we would like to come up with a hypothesis h such that err_S(h) is close to 0. All we have is a weak learning algorithm: given any weights p_1, . . . , p_m with Σ_i p_i = 1 (so the weights define a probability distribution over S), it computes a hypothesis h_p such that

err_p(h_p) := Σ_{i: h_p(x_i) ≠ y_i} p_i ≤ 1/2 − γ

for some fixed γ > 0.² Note that for p_i = 1/m, this is exactly the definition of the training error.

That is, the hypothesis hp makes sure that a weighted version of the training error is small.
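The weighted error err_p(h) is simple enough to express directly; a minimal sketch (the helper name `weighted_error` and the toy data are ours):

```python
def weighted_error(h, S, p):
    """err_p(h): total weight of the training points that h misclassifies."""
    return sum(p_i for (x_i, y_i), p_i in zip(S, p) if h(x_i) != y_i)

S = [(0, +1), (1, -1), (2, -1)]
always_plus = lambda x: +1
# With uniform weights p_i = 1/m this is exactly the training error (here 2/3).
print(weighted_error(always_plus, S, [1/3, 1/3, 1/3]))
```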

Example 24.1. We come back to the example of decision stumps, with the ground truth being defined by an interval [a, b]. We can show that a decision stump h that minimizes err_p(h) fulfills the above guarantee with γ = 1/6. That is, the empirical risk minimizer over decision stumps is a weak learner if the ground truth is defined by an interval.

To show that γ = 1/6, observe that for every vector p and every a < b,

Σ_{i: x_i < a} p_i ≤ 1/3   or   Σ_{i: a ≤ x_i ≤ b} p_i ≤ 1/3   or   Σ_{i: x_i > b} p_i ≤ 1/3 ,

because the three sums add up to 1, so the smallest of them is at most 1/3. A decision stump can correctly classify two of the three kinds of data points (below a, between a and b, above b), erring only on the remaining kind. Choosing the stump that errs on the lightest of the three groups, its error will be at most 1/3 = 1/2 − 1/6.

We will show that this is enough to design a strong learning algorithm: for any ε > 0 and any training set S, it will use the weak learner to compute hypotheses h_1, . . . , h_T and coefficients α_1, . . . , α_T such that h with h(x) = sign(α_1 h_1(x) + . . . + α_T h_T(x)) fulfills err_S(h) ≤ ε.

3 Boosting via Experts Algorithms

Given a weak learning algorithm, we can come up with a strong learning algorithm using no-regret experts algorithms in a very surprising way. To this end, we define an expert for every i ∈ {1, . . . , m}. That is, each expert corresponds to a data point in our training set.

It remains to define the cost vectors. We will compute hypotheses h_1, h_2, . . . as follows. In each step t, the experts algorithm uses a probability distribution p^(t) over its experts. Input this to the weak learner and call the output h_t. Let ℓ_i^(t) = 1 if h_t(x_i) = y_i and 0 otherwise.

That is, an expert has a cost of 1 if its data point was correctly classified. This sounds counter-intuitive at first but makes a lot of sense: the experts algorithm moves the probability distribution towards experts of low cost. Consequently, the weak learner should classify these points better in the next round.

By the weak-learner property, we have for all t

Σ_{i=1}^m p_i^(t) ℓ_i^(t) ≥ 1/2 + γ   because   err_{p^(t)}(h_t) ≤ 1/2 − γ.

² The boosting framework can be extended such that this bound only holds with a certain probability.

Furthermore, by the regret definition, for all i₀

Σ_{t=1}^T Σ_{i=1}^m p_i^(t) ℓ_i^(t) ≤ Σ_{t=1}^T ℓ_{i₀}^(t) + Regret(T) .

In combination, this implies that for all i₀

Σ_{t=1}^T ℓ_{i₀}^(t) ≥ T (1/2 + γ) − Regret(T) .

That is, in at least T/2 + γT − Regret(T) of the T steps, i₀ is classified correctly. If Regret(T) < γT, this is more than T/2 steps, so the majority vote of h_1, . . . , h_T is the correct classification for all of S.

Using our regret bound for Multiplicative Weights, Regret(T) ≤ 2√(T ln m), we would need T > (4 ln m)/γ².

4 AdaBoost

The algorithm AdaBoost (for adaptive boosting) uses exactly the ideas described in the previous section. It is an adapted version of the Multiplicative Weights algorithm. A more careful and tailored analysis gives us a much better guarantee. In particular, we will get a bound for every T, independent of m, and we do not have to know γ.

The algorithm works as follows. The most striking difference to the no-regret algorithm is that there is no global learning rate η. Instead, the weight update in step t uses a factor ηt, which depends on the current error.

• Initially set w_i^(1) = 1 for all i.

• In step t = 1, . . . , T:

  – Compute W^(t) = Σ_{i=1}^m w_i^(t) and p_i^(t) = w_i^(t) / W^(t).
  – Let h_t be the outcome of the weak learner on p^(t).
  – Compute ε_t = Σ_{i: h_t(x_i) ≠ y_i} p_i^(t) (the error of h_t on p^(t)).
  – Let η_t = (1/2) ln(1/ε_t − 1).
  – Update w_i^(t+1) = w_i^(t) e^{−η_t y_i h_t(x_i)}.

• Return h defined by h(x) = sign( Σ_{t=1}^T η_t h_t(x) ).
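The pseudocode above translates almost line by line into a short program. This is a sketch, again assuming the 1-d decision-stump weak learner from the motivating example; the function names are ours, and the update assumes every ε_t lies strictly in (0, 1/2) so that η_t is finite:

```python
import numpy as np

def stump_learner(S, p):
    """ERM over decision stumps h(x) = w2 * sign(x - w1), used as weak learner."""
    xs = sorted({x for x, _ in S})
    best, best_err = None, float("inf")
    for w1 in [xs[0] - 1] + xs:
        for w2 in (1, -1):
            h = lambda x, w1=w1, w2=w2: w2 * (1 if x >= w1 else -1)
            err = sum(p_i for (x_i, y_i), p_i in zip(S, p) if h(x_i) != y_i)
            if err < best_err:
                best, best_err = h, err
    return best

def adaboost(S, weak_learner, T):
    """AdaBoost as in the pseudocode above."""
    w = np.ones(len(S))                      # w_i^(1) = 1
    hs, etas = [], []
    for _ in range(T):
        p = w / w.sum()                      # p_i^(t) = w_i^(t) / W^(t)
        h_t = weak_learner(S, p)
        eps = sum(p_i for (x_i, y_i), p_i in zip(S, p) if h_t(x_i) != y_i)
        eta = 0.5 * np.log(1.0 / eps - 1.0)  # eta_t = (1/2) ln(1/eps_t - 1)
        hs.append(h_t)
        etas.append(eta)
        w = w * np.exp([-eta * y_i * h_t(x_i) for x_i, y_i in S])
    return lambda x: 1 if sum(e * h(x) for e, h in zip(etas, hs)) >= 0 else -1

S = [(-1, -1), (0, +1), (1, -1)]
clf = adaboost(S, stump_learner, T=3)
print(sum(clf(x) != y for x, y in S))  # 0 training errors after three rounds
```

Note how three rounds already suffice on the three-point set from Section 1, even though no single stump gets below error 1/3.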

Theorem 24.2. The AdaBoost algorithm fulfills err_S(h) ≤ exp(−2γ²T).

Proof. Let g_t(x) = Σ_{t′=1}^t η_{t′} h_{t′}(x). By this definition, w_i^(t) = e^{−y_i g_{t−1}(x_i)} and h(x) = sign(g_T(x)). Like in the analysis of the Multiplicative Weights algorithm, we consider the change of the sum of weights, W^(t) = Σ_{i=1}^m w_i^(t). We will show that

W^(t+1) ≤ e^{−2γ²} W^(t) .    (1)

This implies that W^(T+1) ≤ e^{−2γ²T} W^(1) = e^{−2γ²T} m.

Furthermore, for every i with h(x_i) ≠ y_i, we have to have y_i g_T(x_i) ≤ 0 because the product of two reals of different signs is always non-positive. This means that for such i also w_i^(T+1) = e^{−y_i g_T(x_i)} ≥ 1. For all other i, we use that w_i^(T+1) ≥ 0 to get that

W^(T+1) ≥ |{i | h(x_i) ≠ y_i}|

and so

err_S(h) ≤ (1/m) W^(T+1) ≤ e^{−2γ²T} .

So, it remains to show Equation (1). The weight after step t is

W^(t+1) = Σ_{i=1}^m w_i^(t+1) = Σ_{i=1}^m w_i^(t) e^{−y_i η_t h_t(x_i)} .

So, the weight changes as

W^(t+1)/W^(t) = Σ_{i=1}^m (w_i^(t)/W^(t)) e^{−y_i η_t h_t(x_i)} = Σ_{i=1}^m p_i^(t) e^{−y_i η_t h_t(x_i)} = Σ_{i: h_t(x_i)=y_i} p_i^(t) e^{−η_t} + Σ_{i: h_t(x_i)≠y_i} p_i^(t) e^{η_t} .

By definition, Σ_{i: h_t(x_i)≠y_i} p_i^(t) = ε_t and e^{η_t} = √(1/ε_t − 1), so

W^(t+1)/W^(t) = (1 − ε_t) e^{−η_t} + ε_t e^{η_t} = (1 − ε_t) / √(1/ε_t − 1) + ε_t √(1/ε_t − 1)
             = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t) = 2 √(ε_t (1 − ε_t)) .

By the property of a weak learner, we have ε_t ≤ 1/2 − γ, so

W^(t+1)/W^(t) = 2 √(ε_t (1 − ε_t)) ≤ 2 √((1/2 − γ)(1/2 + γ)) = √(1 − 4γ²) ≤ √(e^{−4γ²}) = e^{−2γ²} .

This shows Equation (1) and completes our proof.
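The chain of inequalities at the end of the proof is easy to sanity-check numerically (the values of γ below are arbitrary choices of ours):

```python
import math

# Check: 2*sqrt(eps*(1-eps)) = sqrt(1-4*gamma^2) <= exp(-2*gamma^2)
# for eps = 1/2 - gamma, using 1 - x <= e^{-x} with x = 4*gamma^2.
for gamma in (0.05, 0.1, 0.25, 0.4):
    eps = 0.5 - gamma
    lhs = 2 * math.sqrt(eps * (1 - eps))
    assert abs(lhs - math.sqrt(1 - 4 * gamma ** 2)) < 1e-12
    assert lhs <= math.exp(-2 * gamma ** 2)
print("all inequalities hold")
```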

5 The Downside of Boosting: Increased VC Dimension

We derived that for any training set S the training error will be at most exp(−2γ²T) when using T iterations of AdaBoost. One may be tempted to set T as high as possible because this makes the error smaller and smaller. It is important to remember that this improved accuracy is bought at the price of overfitting.

In more formal terms: the VC dimension of the class of hypotheses that can be produced by AdaBoost in T iterations grows in T. If we set T ≥ ln(2d)/(2γ²), then err_S(h) = 0 on any set S of size at most d. This also means that a set of size d is shattered.

Consequently, one needs to be cautious when applying boosting: it is a reasonable tool to derive better classifiers, but there is the usual trade-off between training error and overfitting.

References

Freund, Yoav; Schapire, Robert E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1): 119–139. (Original AdaBoost paper)
