Algorithms and Uncertainty, Winter 2018/19 Lecture 24 (4 pages)
Boosting
Instructor: Thomas Kesselheim
Often, you may face the situation that you have a way to classify data points in a reasonably good way but not accurately enough. For example, the reason might be that you are using hypotheses that are not expressive enough. Today, we will get to know a powerful technique that allows us to “boost” the accuracy of classification on the training set.
1 A Motivating Example
Let us start with a motivating example. While this example itself feels a little trivial, it hopefully does not take too much imagination to see that such problems also arise in more complex scenarios.
Suppose you have to classify points on the real line R. All you start from is the set of decision stumps: These are the hypotheses of the form
\[
h(x) = \begin{cases} -1 & \text{if } x < a \\ +1 & \text{if } x \ge a \end{cases}
\qquad\text{or}\qquad
h(x) = \begin{cases} -1 & \text{if } x > a \\ +1 & \text{if } x \le a \end{cases} .
\]
Note that we can write such a hypothesis very succinctly by two parameters $w_1 \in \mathbb{R}$, $w_2 \in \{-1, 1\}$ such that $h_{w_1,w_2}(x) = w_2 \cdot \operatorname{sign}(x - w_1)$.¹
Suppose your ground truth is actually that all $x \in [a, b]$ are positive and all $x \notin [a, b]$ are negative. Even in this very simple example, you cannot get a training error below $\frac{1}{3}$. For example, if $S = \{(-1, -1), (0, +1), (1, -1)\}$, one of the points will always be classified incorrectly.
However, a linear combination of classifiers will perform well. Given any training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, define $h^*$ for any $z < \min_i x_i$ by
\[
h^*(x) = \operatorname{sign}\left( h_{a,1}(x) + h_{b,-1}(x) + h_{z,-1}(x) \right) .
\]
For each $i$, this ensures that $h^*(x_i) = y_i$, so $\mathrm{err}_S(h^*) = 0$. Note that this is still not the ground truth: Below $z$ our classification is incorrect.
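As a sanity check, here is a minimal Python sketch of this construction (the function names and the concrete values $a = -0.5$, $b = 0.5$, $z = -2$ are my own choices, not from the text) on the three-point training set above:

```python
# A decision stump h_{w1,w2}(x) = w2 * sign(x - w1),
# with sign redefined to be +1 at 0 (see the footnote).
def stump(w1, w2):
    return lambda x: w2 * (1 if x >= w1 else -1)

# Training set S = {(-1,-1), (0,+1), (1,-1)}; ground truth +1 on [a, b].
a, b, z = -0.5, 0.5, -2.0          # z < min_i x_i
S = [(-1, -1), (0, +1), (1, -1)]

h_a = stump(a, 1)                  # h_{a,1}
h_b = stump(b, -1)                 # h_{b,-1}
h_z = stump(z, -1)                 # h_{z,-1}

def h_star(x):
    # h*(x) = sign(h_{a,1}(x) + h_{b,-1}(x) + h_{z,-1}(x)); the sum is odd,
    # so it is never 0 and the sign is well defined.
    return 1 if h_a(x) + h_b(x) + h_z(x) >= 0 else -1

err = sum(1 for x, y in S if h_star(x) != y) / len(S)
print(err)        # 0.0 -- zero training error on S
print(h_star(-3)) # 1   -- below z the classification is (incorrectly) positive
```

Note that no single stump achieves training error 0 on this set, while the linear combination does.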
Our goal today will be exactly to extend this idea to general hypothesis classes.
¹ Redefine the sign function to be +1 at 0 for this purpose.
2 Problem Statement
Suppose we are given a training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ and we would like to come up with a hypothesis $h$ such that $\mathrm{err}_S(h)$ is close to 0. All we have is a weak learning algorithm: Given any weights $p_1, \ldots, p_m$ with $\sum_i p_i = 1$ (so the weights define a probability distribution over $S$), it computes a hypothesis $h_p$ such that
\[
\mathrm{err}_p(h_p) := \sum_{i \colon h_p(x_i) \ne y_i} p_i \le \frac{1}{2} - \gamma
\]
for some fixed $\gamma > 0$.² Note that for $p_i = \frac{1}{m}$, this is exactly the definition of the training error.
That is, the hypothesis hp makes sure that a weighted version of the training error is small.
Example 24.1. We come back to the example of decision stumps, but with the ground truth defined by an interval $[a, b]$. We can show that a decision stump $h$ that minimizes $\mathrm{err}_p(h)$ fulfills the above guarantee with $\gamma = \frac{1}{6}$. That is, the empirical risk minimizer using decision stumps is a weak learner if the ground truth is defined by an interval.
To show that $\gamma = \frac{1}{6}$ works, observe that for every vector $p$ and every $a < b$,
\[
\sum_{i \colon x_i < a} p_i \le \frac{1}{3}
\quad\text{or}\quad
\sum_{i \colon a \le x_i \le b} p_i \le \frac{1}{3}
\quad\text{or}\quad
\sum_{i \colon x_i > b} p_i \le \frac{1}{3} .
\]
Classifying two of the three kinds of data points (below $a$, between $a$ and $b$, above $b$) correctly is easy by a decision stump, so we can always pick the stump that is correct on the two kinds of total weight at least $\frac{2}{3}$. Its error will be at most $\frac{1}{3} = \frac{1}{2} - \frac{1}{6}$.
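The empirical risk minimizer from Example 24.1 can be sketched in a few lines of Python. This is my own minimal implementation, not code from the lecture: it searches all relevant thresholds and both orientations for the stump minimizing the $p$-weighted error.

```python
def best_stump(xs, ys, p):
    """ERM over decision stumps h(x) = w2 * sign(x - w1):
    return ((w1, w2), err) minimizing the p-weighted error err_p(h)."""
    best, best_err = None, float("inf")
    # It suffices to try thresholds at the data points plus one above the max.
    for w1 in sorted(set(xs)) + [max(xs) + 1.0]:
        for w2 in (-1, 1):
            err = sum(pi for xi, yi, pi in zip(xs, ys, p)
                      if w2 * (1 if xi >= w1 else -1) != yi)
            if err < best_err:
                best_err, best = err, (w1, w2)
    return best, best_err

# The three-point example with uniform weights: the best stump has
# weighted error 1/3 = 1/2 - 1/6, matching the weak-learner guarantee.
xs, ys = [-1.0, 0.0, 1.0], [-1, 1, -1]
(w1, w2), err = best_stump(xs, ys, [1/3, 1/3, 1/3])
print(err)  # 0.3333333333333333
```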
We will show that this is enough to design a strong learning algorithm: For any $\epsilon > 0$ and any training set $S$, it will use the weak learner to compute hypotheses $h_1, \ldots, h_T$ and $\alpha_1, \ldots, \alpha_T$ such that $h^*$ with $h^*(x) = \operatorname{sign}(\alpha_1 h_1(x) + \ldots + \alpha_T h_T(x))$ fulfills $\mathrm{err}_S(h^*) \le \epsilon$.
3 Boosting via Experts Algorithms
Given a weak learning algorithm, we can come up with a strong learning algorithm using no-regret experts algorithms in a very surprising way. To this end, we define an expert for every $i \in \{1, \ldots, m\}$. That is, each expert corresponds to a data point in our training set.

It remains to define the cost vectors. We will compute hypotheses $h_1, h_2, \ldots$ as follows. In each step $t$, the experts algorithm uses a probability distribution $p^{(t)}$ over its experts. Input this to the weak learner and call the output $h_t$. Let $\ell^{(t)}_i = 1$ if $h_t(x_i) = y_i$ and $0$ otherwise. That is, an expert has a cost of 1 if it was correctly classified. This sounds counter-intuitive at first but makes a lot of sense: The experts algorithm moves the probability distribution towards experts of low cost, that is, towards misclassified points. Consequently, the weak learner should classify these points better in the next round.
By the weak-learner property, we have for all $t$
\[
\sum_{i=1}^m p^{(t)}_i \ell^{(t)}_i \ge \frac{1}{2} + \gamma
\qquad\text{because}\qquad
\mathrm{err}_{p^{(t)}}(h_t) \le \frac{1}{2} - \gamma .
\]
² The boosting framework can be extended such that this bound only holds with a certain probability.
Furthermore, by the regret definition, for all $i_0$
\[
\sum_{t=1}^T \sum_{i=1}^m p^{(t)}_i \ell^{(t)}_i \le \sum_{t=1}^T \ell^{(t)}_{i_0} + \mathrm{Regret}(T) .
\]
In combination, this implies that for all $i_0$
\[
\sum_{t=1}^T \ell^{(t)}_{i_0} \ge T \left( \frac{1}{2} + \gamma \right) - \mathrm{Regret}(T) .
\]
That is, in at least $\frac{T}{2} + \gamma T - \mathrm{Regret}(T)$ of the $T$ steps, $i_0$ is classified correctly. If $\mathrm{Regret}(T) < \gamma T$, this is more than half of the steps, so the majority vote of $h_1, \ldots, h_T$ is the correct classification for all of $S$.
Using our regret bound for Multiplicative Weights, $\mathrm{Regret}(T) \le 2\sqrt{T \ln m}$, we would need $T > \frac{4 \ln m}{\gamma^2}$.
4 AdaBoost
The algorithm AdaBoost (for adaptive boosting) uses exactly the ideas described in the previous section. It is an adapted version of the Multiplicative Weights algorithm. A more careful and tailored analysis gives us a much better guarantee. In particular, we will get a bound for every $T$, independent of $m$, and we do not have to know $\gamma$.
The algorithm works as follows. The most striking difference to the no-regret algorithm is that there is no global learning rate η. Instead, the weight update in step t uses a factor ηt, which depends on the current error.
• Initially set $w^{(1)}_i = 1$ for all $i$.
• In step $t = 1, \ldots, T$:
  – Compute $W^{(t)} = \sum_{i=1}^m w^{(t)}_i$ and $p^{(t)}_i = w^{(t)}_i / W^{(t)}$.
  – Let $h_t$ be the outcome of the weak learner on $p^{(t)}$.
  – Compute $\epsilon_t = \sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i$ (the error of $h_t$ on $p^{(t)}$).
  – Let $\eta_t = \frac{1}{2} \ln\left( \frac{1}{\epsilon_t} - 1 \right)$.
  – Update $w^{(t+1)}_i = w^{(t)}_i e^{-\eta_t y_i h_t(x_i)}$.
• Return $h^*$ defined by $h^*(x) = \operatorname{sign}\left( \sum_{t=1}^T \eta_t h_t(x) \right)$.
Theorem 24.2. The AdaBoost algorithm fulfills $\mathrm{err}_S(h^*) \le \exp(-2\gamma^2 T)$.
Proof. Let $g_t(x) = \sum_{t'=1}^t \eta_{t'} h_{t'}(x)$. By this definition, $w^{(t)}_i = e^{-y_i g_{t-1}(x_i)}$ and $h^*(x) = \operatorname{sign}(g_T(x))$.

Like in the analysis of the Multiplicative Weights algorithm, we consider the change of the sum of weights, $W^{(t)} = \sum_{i=1}^m w^{(t)}_i$. We will show that
\[
W^{(t+1)} \le e^{-2\gamma^2} W^{(t)} . \tag{1}
\]
This implies that $W^{(T+1)} \le e^{-2\gamma^2 T} W^{(1)} = e^{-2\gamma^2 T} m$.
Furthermore, for every $i$ with $h^*(x_i) \ne y_i$, we have to have $y_i g_T(x_i) \le 0$ because the product of two reals of different signs is always non-positive. This means that for such $i$ also $w^{(T+1)}_i = e^{-y_i g_T(x_i)} \ge 1$. For all other $i$, we use that $w^{(T+1)}_i \ge 0$ to get that
\[
W^{(T+1)} \ge \left| \{ i \mid h^*(x_i) \ne y_i \} \right|
\]
and so
\[
\mathrm{err}_S(h^*) \le \frac{1}{m} W^{(T+1)} \le e^{-2\gamma^2 T} .
\]
So, it remains to show Equation (1). The weight after step $t$ is
\[
W^{(t+1)} = \sum_{i=1}^m w^{(t+1)}_i = \sum_{i=1}^m w^{(t)}_i e^{-y_i \eta_t h_t(x_i)} .
\]
So, the weight changes as
\[
\frac{W^{(t+1)}}{W^{(t)}}
= \sum_{i=1}^m \frac{w^{(t)}_i}{W^{(t)}} e^{-y_i \eta_t h_t(x_i)}
= \sum_{i=1}^m p^{(t)}_i e^{-y_i \eta_t h_t(x_i)}
= \sum_{i \colon h_t(x_i) = y_i} p^{(t)}_i e^{-\eta_t} + \sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i e^{\eta_t} .
\]
By definition, $\sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i = \epsilon_t$ and $e^{\eta_t} = \sqrt{1/\epsilon_t - 1}$, so
\[
\frac{W^{(t+1)}}{W^{(t)}}
= (1 - \epsilon_t) e^{-\eta_t} + \epsilon_t e^{\eta_t}
= (1 - \epsilon_t) \frac{1}{\sqrt{1/\epsilon_t - 1}} + \epsilon_t \sqrt{1/\epsilon_t - 1}
= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)} .
\]
By the property of a weak learner, we have $\epsilon_t \le \frac{1}{2} - \gamma$, so
\[
\frac{W^{(t+1)}}{W^{(t)}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
\le 2 \sqrt{\left( \frac{1}{2} - \gamma \right) \left( \frac{1}{2} + \gamma \right)}
= \sqrt{1 - 4\gamma^2}
\le \sqrt{e^{-4\gamma^2}} = e^{-2\gamma^2} .
\]
This shows Equation (1) and completes our proof.
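The last chain of inequalities is easy to spot-check numerically (this check is my own illustration, not part of the proof; the final step uses $1 - x \le e^{-x}$):

```python
import math

# Verify 2*sqrt(eps(1-eps)) <= sqrt(1-4*gamma^2) <= exp(-2*gamma^2)
# at the extreme point eps = 1/2 - gamma, for a few values of gamma.
for gamma in (0.05, 0.1, 0.25):
    eps = 0.5 - gamma
    lhs = 2 * math.sqrt(eps * (1 - eps))
    mid = math.sqrt(1 - 4 * gamma ** 2)
    rhs = math.exp(-2 * gamma ** 2)
    assert lhs <= mid + 1e-12   # equality up to rounding at eps = 1/2 - gamma
    assert mid <= rhs           # from 1 - x <= e^{-x} with x = 4*gamma^2
    print(gamma, lhs, rhs)
```

For $\gamma = 0.1$ the two sides are already close ($\approx 0.9798$ vs. $\approx 0.9802$), showing the bound is nearly tight for small $\gamma$.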
5 The Downside of Boosting: Increased VC Dimension
We derived that, given any training set $S$, the training error will be at most $\exp(-2\gamma^2 T)$ when using $T$ iterations of AdaBoost. One may be tempted to set $T$ as high as possible because this makes the bound smaller and smaller. It is important to remember that this improved accuracy is bought at the price of overfitting.
In more formal terms: The VC dimension of the class of hypotheses that can be produced by AdaBoost in $T$ iterations grows in $T$. If we set $T \ge \frac{1}{2\gamma^2} \ln(2d)$, then $\mathrm{err}_S(h^*) = 0$ on any set $S$ of size at most $d$. This also means that a set of size $d$ is shattered.
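To see why this choice of $T$ forces zero training error, plug it into Theorem 24.2:

```latex
\mathrm{err}_S(h^*)
\le \exp(-2\gamma^2 T)
\le \exp\!\left( -2\gamma^2 \cdot \frac{1}{2\gamma^2} \ln(2d) \right)
= \frac{1}{2d} < \frac{1}{d} .
```

On a set of size at most $d$, the training error is a multiple of $\frac{1}{|S|} \ge \frac{1}{d}$, so an error below $\frac{1}{d}$ must be exactly 0.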
Consequently, one needs to be cautious when applying boosting: It is a reasonable tool to derive better classification, but there is the usual trade-off between training error and overfitting.
References
Freund, Yoav; Schapire, Robert E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119–139. (Original AdaBoost paper)