Algorithms and Uncertainty, Winter 2018/19 Lecture 24 (4 pages)
Boosting
Instructor: Thomas Kesselheim
Often, you may face the situation that you have a way to classify data points in a reasonably good way but not accurately enough. For example, the reason might be that you are using hypotheses that are not expressive enough. Today, we will get to know a powerful technique that allows us to “boost” the accuracy of classification on the training set.
1 A Motivating Example
Let us start with a motivating example. While this example itself feels a little trivial, it hopefully does not take too much imagination to see that such problems also arise in more complex scenarios.
Suppose you have to classify points on the real line R. All you start from is the set of decision stumps: These are the hypotheses of the form
\[
h(x) = \begin{cases} -1 & \text{if } x < a \\ +1 & \text{if } x \ge a \end{cases}
\qquad\text{or}\qquad
h(x) = \begin{cases} -1 & \text{if } x > a \\ +1 & \text{if } x \le a \end{cases} .
\]
Note that we can write such a hypothesis very succinctly by two parameters $w_1 \in \mathbb{R}$, $w_2 \in \{-1, 1\}$ such that $h_{w_1,w_2}(x) = w_2 \cdot \operatorname{sign}(x - w_1)$.¹
Suppose your ground truth is actually that all $x \in [a, b]$ are positive and all $x \notin [a, b]$ are negative. Even in this very simple example, you cannot get a training error below $\frac{1}{3}$. For example, if $S = \{(-1, -1), (0, +1), (1, -1)\}$, one of the points will always be classified incorrectly.
However, a linear combination of classifiers will perform well. Given any training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, define $h^*$ for any $z < \min_i x_i$ by
\[
h^*(x) = \operatorname{sign}\left( h_{a,1}(x) + h_{b,-1}(x) + h_{z,-1}(x) \right) .
\]
For each $i$, this ensures that $h^*(x_i) = y_i$, so $\mathrm{err}_S(h^*) = 0$. Note that this is still not the ground truth: Below $z$ our classification is incorrect.
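As a sanity check, here is a minimal Python sketch of this construction (the function names and the concrete values $a = -0.5$, $b = 0.5$, $z = -2$ are my own choices, not from the text) on the three-point training set above:

```python
# A decision stump h_{w1,w2}(x) = w2 * sign(x - w1),
# with sign redefined to be +1 at 0 (see the footnote).
def stump(w1, w2):
    return lambda x: w2 * (1 if x >= w1 else -1)

# Training set S = {(-1,-1), (0,+1), (1,-1)}; ground truth +1 on [a, b].
a, b, z = -0.5, 0.5, -2.0          # z < min_i x_i
S = [(-1, -1), (0, +1), (1, -1)]

h_a = stump(a, 1)                  # h_{a,1}
h_b = stump(b, -1)                 # h_{b,-1}
h_z = stump(z, -1)                 # h_{z,-1}

def h_star(x):
    # h*(x) = sign(h_{a,1}(x) + h_{b,-1}(x) + h_{z,-1}(x)); the sum is odd,
    # so it is never 0 and the sign is well defined.
    return 1 if h_a(x) + h_b(x) + h_z(x) >= 0 else -1

err = sum(1 for x, y in S if h_star(x) != y) / len(S)
print(err)        # 0.0 -- zero training error on S
print(h_star(-3)) # 1   -- below z the classification is (incorrectly) positive
```

Note that no single stump achieves training error 0 on this set, while the linear combination does.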
Our goal today will be exactly to extend this idea to general hypothesis classes.
¹ Redefine the sign function to be +1 at 0 for this purpose.
2 Problem Statement
Suppose we are given a training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ and we would like to come up with a hypothesis $h$ such that $\mathrm{err}_S(h)$ is close to 0. All we have is a weak learning algorithm: Given any weights $p_1, \ldots, p_m$ with $\sum_i p_i = 1$ (so the weights define a probability distribution over $S$), it computes a hypothesis $h_p$ such that
\[
\mathrm{err}_p(h_p) := \sum_{i \colon h_p(x_i) \ne y_i} p_i \le \frac{1}{2} - \gamma
\]
for some fixed $\gamma > 0$.² Note that for $p_i = \frac{1}{m}$, this is exactly the definition of the training error.
That is, the hypothesis hp makes sure that a weighted version of the training error is small.
Example 24.1. We come back to the example of decision stumps, but with the ground truth defined by an interval $[a, b]$. We can show that a decision stump $h$ that minimizes $\mathrm{err}_p(h)$ fulfills the above guarantee with $\gamma = \frac{1}{6}$. That is, the empirical risk minimizer using decision stumps is a weak learner if the ground truth is defined by an interval.
To show that $\gamma = \frac{1}{6}$ works, observe that for every vector $p$ and every $a < b$,
\[
\sum_{i \colon x_i < a} p_i \le \frac{1}{3}
\quad\text{or}\quad
\sum_{i \colon a \le x_i \le b} p_i \le \frac{1}{3}
\quad\text{or}\quad
\sum_{i \colon x_i > b} p_i \le \frac{1}{3} .
\]
Classifying two of the three kinds of data points (below $a$, between $a$ and $b$, above $b$) correctly is easy by a decision stump, so we can always pick the stump that is correct on the two kinds of total weight at least $\frac{2}{3}$. Its error will be at most $\frac{1}{3} = \frac{1}{2} - \frac{1}{6}$.
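The empirical risk minimizer from Example 24.1 can be sketched in a few lines of Python. This is my own minimal implementation, not code from the lecture: it searches all relevant thresholds and both orientations for the stump minimizing the $p$-weighted error.

```python
def best_stump(xs, ys, p):
    """ERM over decision stumps h(x) = w2 * sign(x - w1):
    return ((w1, w2), err) minimizing the p-weighted error err_p(h)."""
    best, best_err = None, float("inf")
    # It suffices to try thresholds at the data points plus one above the max.
    for w1 in sorted(set(xs)) + [max(xs) + 1.0]:
        for w2 in (-1, 1):
            err = sum(pi for xi, yi, pi in zip(xs, ys, p)
                      if w2 * (1 if xi >= w1 else -1) != yi)
            if err < best_err:
                best_err, best = err, (w1, w2)
    return best, best_err

# The three-point example with uniform weights: the best stump has
# weighted error 1/3 = 1/2 - 1/6, matching the weak-learner guarantee.
xs, ys = [-1.0, 0.0, 1.0], [-1, 1, -1]
(w1, w2), err = best_stump(xs, ys, [1/3, 1/3, 1/3])
print(err)  # 0.3333333333333333
```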
We will show that this is enough to design a strong learning algorithm: For any $\epsilon > 0$ and any training set $S$, it will use the weak learner to compute hypotheses $h_1, \ldots, h_T$ and $\alpha_1, \ldots, \alpha_T$ such that $h^*$ with $h^*(x) = \operatorname{sign}(\alpha_1 h_1(x) + \ldots + \alpha_T h_T(x))$ fulfills $\mathrm{err}_S(h^*) \le \epsilon$.
3 Boosting via Experts Algorithms
Given a weak learning algorithm, we can come up with a strong learning algorithm using no-regret experts algorithms in a very surprising way. To this end, we define an expert for every $i \in \{1, \ldots, m\}$. That is, each expert corresponds to a data point in our training set.

It remains to define the cost vectors. We will compute hypotheses $h_1, h_2, \ldots$ as follows. In each step $t$, the experts algorithm uses a probability distribution $p^{(t)}$ over its experts. Input this to the weak learner and call the output $h_t$. Let $\ell^{(t)}_i = 1$ if $h_t(x_i) = y_i$ and $0$ otherwise. That is, an expert has a cost of 1 if it was correctly classified. This sounds counter-intuitive at first but makes a lot of sense: The experts algorithm moves the probability distribution towards experts of low cost, that is, towards misclassified points. Consequently, the weak learner should classify these points better in the next round.
By the weak-learner property, we have for all $t$
\[
\sum_{i=1}^m p^{(t)}_i \ell^{(t)}_i \ge \frac{1}{2} + \gamma
\qquad\text{because}\qquad
\mathrm{err}_{p^{(t)}}(h_t) \le \frac{1}{2} - \gamma .
\]
² The boosting framework can be extended such that this bound only holds with a certain probability.
Furthermore, by the regret definition, for all $i_0$
\[
\sum_{t=1}^T \sum_{i=1}^m p^{(t)}_i \ell^{(t)}_i \le \sum_{t=1}^T \ell^{(t)}_{i_0} + \mathrm{Regret}(T) .
\]
In combination, this implies that for all $i_0$
\[
\sum_{t=1}^T \ell^{(t)}_{i_0} \ge T \left( \frac{1}{2} + \gamma \right) - \mathrm{Regret}(T) .
\]
That is, in at least $\frac{T}{2} + \gamma T - \mathrm{Regret}(T)$ of the $T$ steps, $i_0$ is classified correctly. If $\mathrm{Regret}(T) < \gamma T$, this is more than half of the steps, so the majority vote of $h_1, \ldots, h_T$ is the correct classification for all of $S$.
Using our regret bound for Multiplicative Weights, $\mathrm{Regret}(T) \le 2\sqrt{T \ln m}$, we would need $T > \frac{4 \ln m}{\gamma^2}$.
4 AdaBoost
The algorithm AdaBoost (for adaptive boosting) uses exactly the ideas described in the previous section. It is an adapted version of the Multiplicative Weights algorithm. A more careful and tailored analysis gives us a much better guarantee. In particular, we will get a bound for every $T$, independent of $m$, and we do not have to know $\gamma$.
The algorithm works as follows. The most striking difference to the no-regret algorithm is that there is no global learning rate η. Instead, the weight update in step t uses a factor ηt, which depends on the current error.
• Initially set $w^{(1)}_i = 1$ for all $i$.
• In step $t = 1, \ldots, T$:
  – Compute $W^{(t)} = \sum_{i=1}^m w^{(t)}_i$ and $p^{(t)}_i = w^{(t)}_i / W^{(t)}$.
  – Let $h_t$ be the outcome of the weak learner on $p^{(t)}$.
  – Compute $\epsilon_t = \sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i$ (the error of $h_t$ on $p^{(t)}$).
  – Let $\eta_t = \frac{1}{2} \ln\left( \frac{1}{\epsilon_t} - 1 \right)$.
  – Update $w^{(t+1)}_i = w^{(t)}_i e^{-\eta_t y_i h_t(x_i)}$.
• Return $h^*$ defined by $h^*(x) = \operatorname{sign}\left( \sum_{t=1}^T \eta_t h_t(x) \right)$.
Theorem 24.2. The AdaBoost algorithm fulfills $\mathrm{err}_S(h^*) \le \exp(-2\gamma^2 T)$.
Proof. Let $g_t(x) = \sum_{t'=1}^t \eta_{t'} h_{t'}(x)$. By this definition, $w^{(t)}_i = e^{-y_i g_{t-1}(x_i)}$ and $h^*(x) = \operatorname{sign}(g_T(x))$.

Like in the analysis of the Multiplicative Weights algorithm, we consider the change of the sum of weights, $W^{(t)} = \sum_{i=1}^m w^{(t)}_i$. We will show that
\[
W^{(t+1)} \le e^{-2\gamma^2} W^{(t)} . \tag{1}
\]
This implies that $W^{(T+1)} \le e^{-2\gamma^2 T} W^{(1)} = e^{-2\gamma^2 T} m$.
Furthermore, for every $i$ with $h^*(x_i) \ne y_i$, we have to have $y_i g_T(x_i) \le 0$ because the product of two reals of different signs is always non-positive. This means that for such $i$ also $w^{(T+1)}_i = e^{-y_i g_T(x_i)} \ge 1$. For all other $i$, we use that $w^{(T+1)}_i \ge 0$ to get that
\[
W^{(T+1)} \ge \left| \{ i \mid h^*(x_i) \ne y_i \} \right|
\]
and so
\[
\mathrm{err}_S(h^*) \le \frac{1}{m} W^{(T+1)} \le e^{-2\gamma^2 T} .
\]
So, it remains to show Equation (1). The weight after step $t$ is
\[
W^{(t+1)} = \sum_{i=1}^m w^{(t+1)}_i = \sum_{i=1}^m w^{(t)}_i e^{-y_i \eta_t h_t(x_i)} .
\]
So, the weight changes as
\[
\frac{W^{(t+1)}}{W^{(t)}}
= \sum_{i=1}^m \frac{w^{(t)}_i}{W^{(t)}} e^{-y_i \eta_t h_t(x_i)}
= \sum_{i=1}^m p^{(t)}_i e^{-y_i \eta_t h_t(x_i)}
= \sum_{i \colon h_t(x_i) = y_i} p^{(t)}_i e^{-\eta_t} + \sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i e^{\eta_t} .
\]
By definition, $\sum_{i \colon h_t(x_i) \ne y_i} p^{(t)}_i = \epsilon_t$ and $e^{\eta_t} = \sqrt{1/\epsilon_t - 1}$, so
\[
\frac{W^{(t+1)}}{W^{(t)}}
= (1 - \epsilon_t) e^{-\eta_t} + \epsilon_t e^{\eta_t}
= (1 - \epsilon_t) \frac{1}{\sqrt{1/\epsilon_t - 1}} + \epsilon_t \sqrt{1/\epsilon_t - 1}
= (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)} .
\]
By the property of a weak learner, we have $\epsilon_t \le \frac{1}{2} - \gamma$, so
\[
\frac{W^{(t+1)}}{W^{(t)}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
\le 2 \sqrt{\left( \frac{1}{2} - \gamma \right) \left( \frac{1}{2} + \gamma \right)}
= \sqrt{1 - 4\gamma^2}
\le \sqrt{e^{-4\gamma^2}} = e^{-2\gamma^2} .
\]
This shows Equation (1) and completes our proof.
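The last chain of inequalities is easy to spot-check numerically (this check is my own illustration, not part of the proof; the final step uses $1 - x \le e^{-x}$):

```python
import math

# Verify 2*sqrt(eps(1-eps)) <= sqrt(1-4*gamma^2) <= exp(-2*gamma^2)
# at the extreme point eps = 1/2 - gamma, for a few values of gamma.
for gamma in (0.05, 0.1, 0.25):
    eps = 0.5 - gamma
    lhs = 2 * math.sqrt(eps * (1 - eps))
    mid = math.sqrt(1 - 4 * gamma ** 2)
    rhs = math.exp(-2 * gamma ** 2)
    assert lhs <= mid + 1e-12   # equality up to rounding at eps = 1/2 - gamma
    assert mid <= rhs           # from 1 - x <= e^{-x} with x = 4*gamma^2
    print(gamma, lhs, rhs)
```

For $\gamma = 0.1$ the two sides are already close ($\approx 0.9798$ vs. $\approx 0.9802$), showing the bound is nearly tight for small $\gamma$.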
5 The Downside of Boosting: Increased VC Dimension
We derived that, given any training set $S$, the training error will be at most $\exp(-2\gamma^2 T)$ when using $T$ iterations of AdaBoost. One may be tempted to set $T$ as high as possible because this makes the bound smaller and smaller. It is important to remember that this improved accuracy is bought at the price of overfitting.
In more formal terms: The VC dimension of the class of hypotheses that can be produced by AdaBoost in $T$ iterations grows in $T$. If we set $T \ge \frac{1}{2\gamma^2} \ln(2d)$, then $\mathrm{err}_S(h^*) = 0$ on any set $S$ of size at most $d$. This also means that a set of size $d$ is shattered.
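To see why this choice of $T$ forces zero training error, plug it into Theorem 24.2:

```latex
\mathrm{err}_S(h^*)
\le \exp(-2\gamma^2 T)
\le \exp\!\left( -2\gamma^2 \cdot \frac{1}{2\gamma^2} \ln(2d) \right)
= \frac{1}{2d} < \frac{1}{d} .
```

On a set of size at most $d$, the training error is a multiple of $\frac{1}{|S|} \ge \frac{1}{d}$, so an error below $\frac{1}{d}$ must be exactly 0.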
Consequently, one needs to be cautious when applying boosting: It is a reasonable tool to derive better classification, but there is the usual trade-off between training error and overfitting.
References
Freund, Yoav; Schapire, Robert E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119–139. (Original AdaBoost paper)