Algorithms and Uncertainty, Winter 2018/19 Lecture 22 (4 pages)
VC Dimension
Instructor: Thomas Kesselheim
Recall our setting from last time. We have to classify data points from a set X using hypotheses h: X → {−1, 1}. The class of all hypotheses is called H. There is a ground truth f: X → {−1, 1} and we are in the realizable case, which means that f ∈ H.
By H[m] we denote the maximum number of distinct ways to label m data points from X using different functions in H. A trivial upper bound is H[m] ≤ 2^m, but the growth function can be much smaller.
Given m sample points x_1, ..., x_m with labels y_1, ..., y_m, the training error of a hypothesis h is

err_S(h) := (1/m) · |{i | h(x_i) ≠ y_i}| .

The true error err_D(h) of a hypothesis h with respect to a distribution D is

err_D(h) := Pr_{X∼D}[h(X) ≠ f(X)] .
For all choices of ε > 0, δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε, (2/ε) · log2(2 H[2m]/δ) } ,    (1)

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.
Today, we would like to better understand Condition (1). Note that it is equivalent to requiring that

ε ≥ max{ 8/m, (2/m) · log2(2 H[2m]/δ) } .
The question that we are interested in is whether the true error err_D(h) vanishes if we choose larger and larger m. This indeed works out if log2(H[2m])/m converges to 0.
For the trivial bound H[m] ≤ 2^m, this is not true. For threshold classifiers on a line, we could show that H[m] ≤ m + 1, which is sufficient. More generally, we ask: Is there a point after which H[m] grows subexponentially?
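The claim H[m] ≤ m + 1 for threshold classifiers can be verified by direct enumeration. A small Python sketch (the helper name is mine, not from the notes):

```python
def threshold_labelings(points):
    """Distinct labelings of `points` by thresholds h_t(x) = -1 if x <= t else 1.
    Only thresholds at the points themselves (plus one below all of them)
    can produce different labelings, so these candidates suffice."""
    points = sorted(points)
    candidates = [points[0] - 1] + points
    return {tuple(-1 if x <= t else 1 for x in points) for t in candidates}

# For m distinct points on the line, exactly m + 1 labelings arise.
for m in range(1, 9):
    assert len(threshold_labelings(list(range(m)))) == m + 1
```

The labeling is determined solely by how many points lie below the threshold, which is why only m + 1 outcomes are possible.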
1 VC Dimension
Today, we will get to know the central notion of VC dimension. It was introduced by Vapnik and Chervonenkis in 1968. The VC dimension of a set of hypotheses H is roughly the point from which on H[m] is smaller than 2^m.
Definition 22.1. A set of hypotheses H shatters a set S ⊆ X if there are hypotheses in H that label S in all possible 2^|S| ways, that is, H[S] = 2^|S|.
Definition 22.2. The VC dimension of a set of hypotheses H is the largest size of a set S that is shattered by H, i.e., max{|S| : H[S] = 2^|S|}. If sets of unbounded size are shattered, then the VC dimension is infinite.
Let us consider a few examples.
• For X = R and H being the class of threshold functions of the form

  h(x) = −1 for x ≤ t, and h(x) = 1 otherwise,

the VC dimension is 1. Any singleton set {x} is shattered because h(x) = −1 and h′(x) = 1 for suitable choices of h and h′. In contrast, for any set of two points x_1 ≤ x_2 ∈ R, it is impossible that h(x_1) = 1 but h(x_2) = −1.
• If H is finite, then the VC dimension is at most log2 |H|.
• If X is infinite and H contains all functions h: X → {−1, 1}, then the VC dimension is infinite.
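For small finite classes, the definitions can be checked by brute force. The following Python sketch (helper names are my own) computes the VC dimension of the threshold class from the first example, restricted to a tiny domain:

```python
from itertools import combinations

def shatters(hypotheses, subset):
    """True if the class realizes all 2^|subset| labelings of `subset`."""
    realized = {tuple(h(x) for x in subset) for h in hypotheses}
    return len(realized) == 2 ** len(subset)

def vc_dimension(hypotheses, domain):
    """Largest size of a subset of `domain` shattered by the class.
    Brute force, so only feasible for tiny examples."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, k)):
            best = k
    return best

# Threshold classifiers restricted to the domain {0, 1, 2, 3}.
thresholds = [-1, 0, 1, 2, 3]
H = [(lambda x, t=t: -1 if x <= t else 1) for t in thresholds]
print(vc_dimension(H, [0, 1, 2, 3]))  # 1, matching the first example
```

Every singleton is shattered, but no pair is: the label pattern (1, −1) with x_1 ≤ x_2 is never realized.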
2 Bounding the Growth Function by the VC Dimension
Theorem 22.3 (Sauer’s Lemma). Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ Σ_{i=0}^{d} (m choose i) .
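The bound is easy to evaluate numerically; the following sketch (the function name is mine) shows how it compares to the trivial 2^m bound:

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's Lemma bound on H[m] for a class of VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))

# For m <= d the bound is just the trivial 2^m; beyond d it is only
# polynomial in m (degree d).
print(sauer_bound(3, 3))    # 8 = 2^3
print(sauer_bound(10, 1))   # 11, matching H[m] <= m + 1 for thresholds
print(sauer_bound(20, 2))   # 211 = 1 + 20 + 190
```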
In order to prove Sauer’s Lemma, the following lemma will turn out to be very helpful.
Lemma 22.4. Consider a set of data points S ⊆ X and let L be an arbitrary set of labelings ℓ: S → {−1, 1}. Then L shatters at least |L| subsets of S. That is, there are at least |L| distinct sets S′ ⊆ S such that S′ can be labelled in all 2^|S′| different ways using functions from L.
Proof. We prove the claim by induction on |L|. The base case is |L| = 1. In this case, the empty set is shattered.
For the induction step, consider |L| > 1. In this case, there has to be some x ∈ S such that ℓ(x) = −1 for some ℓ ∈ L and ℓ′(x) = 1 for some ℓ′ ∈ L. Let L− = {ℓ ∈ L | ℓ(x) = −1} and L+ = {ℓ ∈ L | ℓ(x) = 1}. Now, apply the induction hypothesis to the sets L− and L+. Let T− ⊆ 2^S and T+ ⊆ 2^S denote the respective families of shattered sets. By the induction hypothesis, we have |T−| ≥ |L−| and |T+| ≥ |L+|.

Note that there is no S′ ∈ T− or S′ ∈ T+ with x ∈ S′ because the label of x is always fixed to −1 or 1, respectively.
All of T− ∪ T+ is shattered by L. Additionally, if S′ ∈ T− ∩ T+, then S′ ∪ {x} is also shattered by L because after assigning x an arbitrary label we can still assign all possible labels to S′ using a labeling in L. None of the sets constructed this way are contained in T− or T+ because they all contain x.
Consequently, the number of shattered sets is at least

|T− ∪ T+| + |T− ∩ T+| = |T−| + |T+| − |T− ∩ T+| + |T− ∩ T+| = |T−| + |T+| ≥ |L−| + |L+| = |L| .

Proof of Sauer’s Lemma. Given any set S ⊆ X of size m, we would like to bound H[S]. To this end, let L be the set of labelings ℓ: S → {−1, 1} obtained by applying different hypotheses from H to S. Formally, L = {h|_S | h ∈ H}. By definition, H[S] = |L|.
Furthermore, let T ⊆ 2^S be the family of subsets of S that are shattered by H. Using Lemma 22.4, we know that |T| ≥ |L|.
We also know that no set larger than d can be shattered, so T contains only sets of size at most d. Therefore, the size of T is bounded by the number of such sets:
|T| ≤ Σ_{i=0}^{d} (m choose i) .

In combination, H[S] = |L| ≤ |T| ≤ Σ_{i=0}^{d} (m choose i).
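Lemma 22.4 can also be stress-tested numerically: for random sets of labelings L of a small set S, count the shattered subsets and check that there are at least |L| of them. A sketch (all names are my own):

```python
import itertools
import random

def shattered_subsets(L, S):
    """All subsets S' of S that the labeling set L shatters (Lemma 22.4).
    The empty set always counts: it has 2^0 = 1 labeling."""
    out = []
    for k in range(len(S) + 1):
        for sub in itertools.combinations(S, k):
            realized = {tuple(l[x] for x in sub) for l in L}
            if len(realized) == 2 ** len(sub):
                out.append(sub)
    return out

random.seed(0)
S = range(5)
all_labelings = list(itertools.product([-1, 1], repeat=5))
for _ in range(100):
    # A random set of distinct labelings of S.
    L = random.sample(all_labelings, k=random.randint(1, 12))
    # Lemma 22.4: L shatters at least |L| subsets of S.
    assert len(shattered_subsets(L, S)) >= len(L)
```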
To simplify the expression in Sauer’s Lemma, we can use the following bound on the binomial coefficients for i ≤ d ≤ m:

(m choose i) = m! / ((m − i)! · i!) ≤ m^i / i! = (m/d)^i · d^i / i! ≤ (m/d)^d · d^i / i! .

Together with the definition of the exponential function e^x = Σ_{i=0}^{∞} x^i / i!, we get

Σ_{i=0}^{d} (m choose i) ≤ Σ_{i=0}^{d} (m/d)^d · d^i / i! = (m/d)^d · Σ_{i=0}^{d} d^i / i! ≤ (m/d)^d · e^d .

This gives us the following corollary.
Corollary 22.5. Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ (em/d)^d .
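A quick numerical sanity check of the corollary’s inequality over a small grid of values:

```python
from math import comb, e

# Check sum_{i=0}^{d} C(m, i) <= (e*m/d)^d for all 1 <= d <= m in a small range.
for d in range(1, 8):
    for m in range(d, 60):
        lhs = sum(comb(m, i) for i in range(d + 1))
        assert lhs <= (e * m / d) ** d, (m, d)
```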
Plugging this bound into Condition (1), we get: for a hypothesis class H of VC dimension d and all choices of ε > 0, δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε, (2/ε) · log2(2 (2em/d)^d / δ) } = max{ 8/ε, (2d/ε) · log2(2em/d) + (2/ε) · log2(2/δ) } ,

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.

Corollary 22.6. Any hypothesis class of finite VC dimension is PAC-learnable.
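Since m appears on both sides of the condition, a sufficient sample size can be found by simple iteration: the right-hand side grows only logarithmically in m, so repeatedly replacing m by the required value converges. A sketch (the function and its iteration scheme are mine, not from the notes):

```python
from math import ceil, e, log2

def sample_size(eps, delta, d):
    """A sample size sufficient for the condition above, using the bound
    H[2m] <= (2em/d)^d. Iterates because m appears on both sides."""
    m = d  # the corollary requires m >= d
    while True:
        needed = max(8 / eps,
                     (2 / eps) * log2(2 * (2 * e * m / d) ** d / delta))
        if m >= needed:
            return m
        m = ceil(needed)
```

For instance, `sample_size(0.1, 0.05, 3)` returns an m satisfying both parts of the max; the loop terminates because m strictly increases while the required value grows sublinearly.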
3 Infinite VC Dimension
Not all hypothesis classes have a finite VC dimension. One example is the set of all functions X → {−1, 1} for infinite X. As we will show, these hypothesis classes are not PAC-learnable.
Theorem 22.7. Any hypothesis class of infinite VC dimension is not PAC-learnable.
To show this theorem, we have to show that the function m_H in the definition of PAC-learnability does not exist. We will show the following.
Proposition 22.8. Let H be a hypothesis class of VC dimension at least d. Then for every learning algorithm there exists a distribution such that on a training set of size d/2 we have err_D(h_S) ≥ 1/8 with probability at least 1/7.
Proof. By definition, H shatters a set of size d. So, let T ⊆ X, |T| = d, be such a set. By definition, any labeling ℓ: T → {−1, 1} can be extended to a hypothesis f ∈ H such that ℓ(x) = f(x) for all x ∈ T. There are k = 2^d such labelings. Let f_1, ..., f_k be the respective extended hypotheses. Each of them can be the ground truth. Let D_i denote the uniform distribution over pairs (x, f_i(x)) for x ∈ T.
Our learning algorithm will have to infer the correct i. The important observation is that any sample of size at most d/2 tells us the correct labels of at most d/2 points in T. The others are still completely arbitrary.
Let h_S be the hypothesis computed by the learning algorithm on sample S. In principle, the algorithm may also be randomized. Our goal is to show that

max_i Pr[ err_{D_i}(h_S) ≥ 1/8 ] ≥ 1/7 .
We will apply Yao’s principle: Draw I uniformly from {1, ..., k} and consider D_I. This is potentially confusing: we first draw the index I randomly and then use the probability distribution D_I. Now, it suffices to show that
Pr[ err_{D_I}(h_S) ≥ 1/8 ] ≥ 1/7 .
Fix any x ∈ X. We bound the probability that h_S(x) ≠ f_I(x). To this end, we think of the labels f_I as being determined in a different way: first draw the sample S and determine the labels for the points in this set; based on this, compute h_S; only now determine the labels for the points not in the sample. If x is not in the sample, then h_S(x) is correct with probability 1/2. It is not in the sample with probability at least 1/2. Therefore
Pr[ h_S(x) ≠ f_I(x) ] ≥ 1/4 .

This holds for all x ∈ X, therefore

E[ err_{D_I}(h_S) ] ≥ 1/4 .

Now, we can apply Markov’s inequality to get
Pr[ err_{D_I}(h_S) < 1/8 ] = Pr[ 1 − err_{D_I}(h_S) > 7/8 ] ≤ (8/7) · E[ 1 − err_{D_I}(h_S) ] ≤ (8/7) · (3/4) = 6/7 .

This proves the claim.
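The adversarial construction can be simulated. The toy experiment below is purely illustrative (the memorizing learner is a stand-in of my choosing; the proposition holds for every learner): draw a random ground truth over a shattered set of size d = 16, train on a sample of size d/2, and record how often the true error stays below 1/8.

```python
import random

random.seed(1)
d, trials = 16, 2000
T = list(range(d))
low_error = 0
for _ in range(trials):
    # Ground truth: a uniformly random labeling of T (one of the 2^d extensions f_i).
    truth = {x: random.choice([-1, 1]) for x in T}
    # Training set: d/2 draws, uniform over T, as under the distribution D_I.
    sample = set(random.choice(T) for _ in range(d // 2))
    # Stand-in learner: memorize seen labels, guess +1 elsewhere.
    hyp = {x: (truth[x] if x in sample else 1) for x in T}
    err = sum(hyp[x] != truth[x] for x in T) / d
    low_error += (err < 1 / 8)
frac = low_error / trials
# The proposition says Pr[err >= 1/8] >= 1/7, so frac can be at most 6/7;
# empirically it is far smaller: at least d/2 points are never sampled,
# and each of them is mislabeled with probability 1/2.
```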
References and Further Reading
These notes are based on notes and lectures by Anna Karlin (https://courses.cs.washington.edu/courses/cse522/17sp/) and Avrim Blum (http://www.cs.cmu.edu/~avrim/ML14/). Also see the references therein.