Algorithms and Uncertainty, Winter 2018/19 Lecture 22 (4 pages)
VC Dimension
Instructor: Thomas Kesselheim
Recall our setting from last time. We have to classify data points from a set X using hypotheses h: X → {−1, 1}. The class of all hypotheses is called H. There is a ground truth f: X → {−1, 1} and we are in the realizable case, which means that f ∈ H.
By H[m] we denote the maximum number of distinct ways to label m data points from X using different functions in H. A trivial upper bound is H[m] ≤ 2^m, but the growth function can be much smaller.
Given m sample points x_1, ..., x_m with labels y_1, ..., y_m, the training error of a hypothesis h is

err_S(h) := (1/m) · |{i | h(x_i) ≠ y_i}| .

The true error err_D(h) of a hypothesis h with respect to a distribution D is

err_D(h) := Pr_{X∼D}[h(X) ≠ f(X)] .
For all choices of ε > 0, δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε, (2/ε) · log2(2 H[2m]/δ) } ,    (1)

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.
Today, we would like to better understand Condition (1). Note that it is equivalent to requiring that

ε ≥ max{ 8/m, (2/m) · log2(2 H[2m]/δ) } .
The question that we are interested in is whether the true error err_D(h) vanishes if we choose larger and larger m. This indeed works out if log2(H[2m])/m converges to 0.
For the trivial bound H[m] ≤ 2^m, this is not true. For threshold classifiers on a line, we could show that H[m] ≤ m + 1, which is sufficient. More generally, we ask: Is there a point after which H[m] grows subexponentially?
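The claim H[m] ≤ m + 1 for threshold classifiers can be verified by direct enumeration. A small Python sketch (the helper name is mine, not from the notes):

```python
def threshold_labelings(points):
    """Distinct labelings of `points` by thresholds h_t(x) = -1 if x <= t else 1.
    Only thresholds at the points themselves (plus one below all of them)
    can produce different labelings, so these candidates suffice."""
    points = sorted(points)
    candidates = [points[0] - 1] + points
    return {tuple(-1 if x <= t else 1 for x in points) for t in candidates}

# For m distinct points on the line, exactly m + 1 labelings arise.
for m in range(1, 9):
    assert len(threshold_labelings(list(range(m)))) == m + 1
```

The labeling is determined solely by how many points lie below the threshold, which is why only m + 1 outcomes are possible.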
1 VC Dimension
Today, we will get to know the central notion of VC dimension. It was introduced by Vapnik and Chervonenkis in 1968. The VC dimension of a set of hypotheses H is roughly the point from which on H[m] is smaller than 2^m.
Definition 22.1. A set of hypotheses H shatters a set S ⊆ X if there are hypotheses in H that label S in all possible 2^|S| ways, that is, H[S] = 2^|S|.
Definition 22.2. The VC dimension of a set of hypotheses H is the largest size of a set S that is shattered by H, i.e., max{|S| : H[S] = 2^|S|}. If sets of unbounded size are shattered, then the VC dimension is infinite.
Let us consider a few examples.
• For X = R and H being the class of threshold functions of the form

  h(x) = −1 for x ≤ t, and h(x) = 1 otherwise,

the VC dimension is 1. Any singleton set {x} is shattered because h(x) = −1 and h′(x) = 1 for suitable choices of h and h′. In contrast, for any set of two points x_1 ≤ x_2 ∈ R, it is impossible that h(x_1) = 1 but h(x_2) = −1.
• If H is finite, then the VC dimension is at most log2 |H|.
• If X is infinite and H contains all functions h: X → {−1, 1}, then the VC dimension is infinite.
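For small finite classes, the definitions can be checked by brute force. The following Python sketch (helper names are my own) computes the VC dimension of the threshold class from the first example, restricted to a tiny domain:

```python
from itertools import combinations

def shatters(hypotheses, subset):
    """True if the class realizes all 2^|subset| labelings of `subset`."""
    realized = {tuple(h(x) for x in subset) for h in hypotheses}
    return len(realized) == 2 ** len(subset)

def vc_dimension(hypotheses, domain):
    """Largest size of a subset of `domain` shattered by the class.
    Brute force, so only feasible for tiny examples."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, k)):
            best = k
    return best

# Threshold classifiers restricted to the domain {0, 1, 2, 3}.
thresholds = [-1, 0, 1, 2, 3]
H = [(lambda x, t=t: -1 if x <= t else 1) for t in thresholds]
print(vc_dimension(H, [0, 1, 2, 3]))  # 1, matching the first example
```

Every singleton is shattered, but no pair is: the label pattern (1, −1) with x_1 ≤ x_2 is never realized.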
2 Bounding the Growth Function by the VC Dimension
Theorem 22.3 (Sauer’s Lemma). Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ Σ_{i=0}^{d} (m choose i) .
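The bound is easy to evaluate numerically; the following sketch (the function name is mine) shows how it compares to the trivial 2^m bound:

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's Lemma bound on H[m] for a class of VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))

# For m <= d the bound is just the trivial 2^m; beyond d it is only
# polynomial in m (degree d).
print(sauer_bound(3, 3))    # 8 = 2^3
print(sauer_bound(10, 1))   # 11, matching H[m] <= m + 1 for thresholds
print(sauer_bound(20, 2))   # 211 = 1 + 20 + 190
```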
In order to prove Sauer’s Lemma, the following lemma will turn out to be very helpful.
Lemma 22.4. Consider a set of data points S ⊆ X and let L be an arbitrary set of labelings ℓ: S → {−1, 1}. Then L shatters at least |L| subsets of S. That is, there are at least |L| distinct sets S′ ⊆ S such that S′ can be labelled in all 2^|S′| different ways using functions from L.
Proof. We prove the claim by induction on |L|. The base case is |L| = 1. In this case, the empty set is shattered.
For the induction step, consider |L| > 1. In this case, there has to be some x ∈ S such that ℓ(x) = −1 for some ℓ ∈ L and ℓ′(x) = 1 for some ℓ′ ∈ L. Let L− = {ℓ ∈ L | ℓ(x) = −1} and L+ = {ℓ ∈ L | ℓ(x) = 1}. Now, apply the induction hypothesis to the sets L− and L+. Let T− ⊆ 2^S and T+ ⊆ 2^S denote the respective families of shattered sets. By the induction hypothesis, we have |T−| ≥ |L−| and |T+| ≥ |L+|.

Note that there is no S′ ∈ T− or S′ ∈ T+ with x ∈ S′ because the label of x is always fixed to −1 or 1, respectively.
All of T− ∪ T+ is shattered by L. Additionally, if S′ ∈ T− ∩ T+, then S′ ∪ {x} is also shattered by L because after assigning x an arbitrary label we can still assign all possible labels to S′ using a labeling in L. None of the sets constructed this way are contained in T− or T+ because they all contain x.
Consequently, the number of shattered sets is at least

|T− ∪ T+| + |T− ∩ T+| = |T−| + |T+| − |T− ∩ T+| + |T− ∩ T+| = |T−| + |T+| ≥ |L−| + |L+| = |L| .

Proof of Sauer’s Lemma. Given any set S ⊆ X of size m, we would like to bound H[S]. To this end, let L be the set of labelings ℓ: S → {−1, 1} obtained by applying different hypotheses from H to S. Formally, L = {h|_S | h ∈ H}. By definition, H[S] = |L|.
Furthermore, let T ⊆ 2^S be the family of subsets of S that are shattered by H. Using Lemma 22.4, we know that |T| ≥ |L|.
We also know that no set larger than d can be shattered, so T contains only sets of size at most d. Therefore, the size of T is bounded by the number of such sets:
|T| ≤ Σ_{i=0}^{d} (m choose i) .

In combination, H[S] = |L| ≤ |T| ≤ Σ_{i=0}^{d} (m choose i).
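Lemma 22.4 can also be stress-tested numerically: for random sets of labelings L of a small set S, count the shattered subsets and check that there are at least |L| of them. A sketch (all names are my own):

```python
import itertools
import random

def shattered_subsets(L, S):
    """All subsets S' of S that the labeling set L shatters (Lemma 22.4).
    The empty set always counts: it has 2^0 = 1 labeling."""
    out = []
    for k in range(len(S) + 1):
        for sub in itertools.combinations(S, k):
            realized = {tuple(l[x] for x in sub) for l in L}
            if len(realized) == 2 ** len(sub):
                out.append(sub)
    return out

random.seed(0)
S = range(5)
all_labelings = list(itertools.product([-1, 1], repeat=5))
for _ in range(100):
    # A random set of distinct labelings of S.
    L = random.sample(all_labelings, k=random.randint(1, 12))
    # Lemma 22.4: L shatters at least |L| subsets of S.
    assert len(shattered_subsets(L, S)) >= len(L)
```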
To simplify the expression in Sauer’s Lemma, we can use the following bound on the binomial coefficients for i ≤ d ≤ m:

(m choose i) = m! / ((m − i)! · i!) ≤ m^i / i! = (m/d)^i · d^i / i! ≤ (m/d)^d · d^i / i! .

Together with the definition of the exponential function e^x = Σ_{i=0}^{∞} x^i / i!, we get

Σ_{i=0}^{d} (m choose i) ≤ Σ_{i=0}^{d} (m/d)^d · d^i / i! = (m/d)^d · Σ_{i=0}^{d} d^i / i! ≤ (m/d)^d · e^d .

This gives us the following corollary.
Corollary 22.5. Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ (em/d)^d .
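A quick numerical sanity check of the corollary’s inequality over a small grid of values:

```python
from math import comb, e

# Check sum_{i=0}^{d} C(m, i) <= (e*m/d)^d for all 1 <= d <= m in a small range.
for d in range(1, 8):
    for m in range(d, 60):
        lhs = sum(comb(m, i) for i in range(d + 1))
        assert lhs <= (e * m / d) ** d, (m, d)
```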
Plugging this bound into Condition (1), we get: for a hypothesis class H of VC dimension d and all choices of ε > 0, δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε, (2/ε) · log2(2 (2em/d)^d / δ) } = max{ 8/ε, (2d/ε) · log2(2em/d) + (2/ε) · log2(2/δ) } ,

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.

Corollary 22.6. Any hypothesis class of finite VC dimension is PAC-learnable.
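Since m appears on both sides of the condition, a sufficient sample size can be found by simple iteration: the right-hand side grows only logarithmically in m, so repeatedly replacing m by the required value converges. A sketch (the function and its iteration scheme are mine, not from the notes):

```python
from math import ceil, e, log2

def sample_size(eps, delta, d):
    """A sample size sufficient for the condition above, using the bound
    H[2m] <= (2em/d)^d. Iterates because m appears on both sides."""
    m = d  # the corollary requires m >= d
    while True:
        needed = max(8 / eps,
                     (2 / eps) * log2(2 * (2 * e * m / d) ** d / delta))
        if m >= needed:
            return m
        m = ceil(needed)
```

For instance, `sample_size(0.1, 0.05, 3)` returns an m satisfying both parts of the max; the loop terminates because m strictly increases while the required value grows sublinearly.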
3 Infinite VC Dimension
Not all hypothesis classes have a finite VC dimension. One example is the set of all functions X → {−1, 1} for infinite X. As we will show, these hypothesis classes are not PAC-learnable.
Theorem 22.7. Any hypothesis class of infinite VC dimension is not PAC-learnable.
To show this theorem, we have to show that the function m_H in the definition of PAC-learnability does not exist. We will show the following.
Proposition 22.8. Let H be a hypothesis class of VC dimension at least d. Then for every learning algorithm there exists a distribution such that on a training set of size d/2 we have err_D(h_S) ≥ 1/8 with probability at least 1/7.
Proof. By definition, H shatters a set of size d. So, let T ⊆ X, |T| = d, be such a set. By definition, any labeling ℓ: T → {−1, 1} can be extended to a hypothesis f ∈ H such that ℓ(x) = f(x) for all x ∈ T. There are k = 2^d such labelings. Let f_1, ..., f_k be the respective extended hypotheses. Each of them can be the ground truth. Let D_i denote the uniform distribution over pairs (x, f_i(x)) for x ∈ T.
Our learning algorithm will have to infer the correct i. The important observation is that any sample of size at most d/2 tells us the correct labels of at most d/2 points in T. The others are still completely arbitrary.
Let h_S be the hypothesis computed by the learning algorithm on sample S. In principle, the algorithm may also be randomized. Our goal is to show that

max_i Pr[ err_{D_i}(h_S) ≥ 1/8 ] ≥ 1/7 .
We will apply Yao’s principle: Draw I uniformly from {1, ..., k} and consider D_I. This is potentially confusing: we first draw the index I randomly and then use the probability distribution D_I. Now, it suffices to show that
Pr[ err_{D_I}(h_S) ≥ 1/8 ] ≥ 1/7 .
Fix any x ∈ X. We bound the probability that h_S(x) ≠ f_I(x). To this end, we think of the labels f_I as being determined in a different way: first draw the sample S and determine the labels for the points in this set; based on this, compute h_S; only now determine the labels for the points not in the sample. If x is not in the sample, then h_S(x) is correct with probability 1/2. It is not in the sample with probability at least 1/2. Therefore
Pr[ h_S(x) ≠ f_I(x) ] ≥ 1/4 .

This holds for all x ∈ X, therefore

E[ err_{D_I}(h_S) ] ≥ 1/4 .

Now, we can apply Markov’s inequality to get
Pr[ err_{D_I}(h_S) < 1/8 ] = Pr[ 1 − err_{D_I}(h_S) > 7/8 ] ≤ (8/7) · E[ 1 − err_{D_I}(h_S) ] ≤ (8/7) · (3/4) = 6/7 .

This proves the claim.
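The adversarial construction can be simulated. The toy experiment below is purely illustrative (the memorizing learner is a stand-in of my choosing; the proposition holds for every learner): draw a random ground truth over a shattered set of size d = 16, train on a sample of size d/2, and record how often the true error stays below 1/8.

```python
import random

random.seed(1)
d, trials = 16, 2000
T = list(range(d))
low_error = 0
for _ in range(trials):
    # Ground truth: a uniformly random labeling of T (one of the 2^d extensions f_i).
    truth = {x: random.choice([-1, 1]) for x in T}
    # Training set: d/2 draws, uniform over T, as under the distribution D_I.
    sample = set(random.choice(T) for _ in range(d // 2))
    # Stand-in learner: memorize seen labels, guess +1 elsewhere.
    hyp = {x: (truth[x] if x in sample else 1) for x in T}
    err = sum(hyp[x] != truth[x] for x in T) / d
    low_error += (err < 1 / 8)
frac = low_error / trials
# The proposition says Pr[err >= 1/8] >= 1/7, so frac can be at most 6/7;
# empirically it is far smaller: at least d/2 points are never sampled,
# and each of them is mislabeled with probability 1/2.
```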
References and Further Reading
These notes are based on notes and lectures by Anna Karlin (https://courses.cs.washington.edu/courses/cse522/17sp/) and Avrim Blum (http://www.cs.cmu.edu/~avrim/ML14/). Also see the references therein.