
Algorithms and Uncertainty, Winter 2018/19 Lecture 22 (4 pages)

VC Dimension

Instructor: Thomas Kesselheim

Recall our setting from last time. We have to classify data points from a set X using a hypothesis h: X → {−1, 1}. The class of all hypotheses is called H. There is a ground truth f: X → {−1, 1}, and we are in the realizable case, which means that f ∈ H.

By H[m] we denote the maximum number of distinct ways to label m data points from X using different functions in H. A trivial upper bound is H[m] ≤ 2^m, but the function can be much smaller.

Given m sample points x_1, …, x_m with labels y_1, …, y_m, the training error of a hypothesis h is

err_S(h) := (1/m) · |{ i : h(x_i) ≠ y_i }| .

The true error err_D(h) of a hypothesis h with respect to a distribution D is

err_D(h) := Pr_{X∼D}[ h(X) ≠ f(X) ] .
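To make these two notions concrete, here is a minimal Python sketch (not part of the original notes) that computes the training error on a labeled sample and estimates the true error by Monte Carlo; the ground truth, the hypothesis, and the distribution D are illustrative assumptions.

```python
import random

def f(x):                        # assumed ground truth: threshold at 0.0
    return -1 if x <= 0.0 else 1

def h(x):                        # hypothetical hypothesis: threshold at 0.3
    return -1 if x <= 0.3 else 1

def training_error(h, sample):
    """err_S(h): fraction of sample points (x_i, y_i) with h(x_i) != y_i."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def true_error(h, f, draw, trials=100_000):
    """Monte Carlo estimate of err_D(h) = Pr_{X~D}[h(X) != f(X)]."""
    mistakes = 0
    for _ in range(trials):
        x = draw()
        if h(x) != f(x):
            mistakes += 1
    return mistakes / trials

draw = lambda: random.uniform(-1.0, 1.0)                   # assumed distribution D
sample = [(x, f(x)) for x in (draw() for _ in range(20))]  # realizable training set
print(training_error(h, sample), true_error(h, f, draw))
```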

For all choices of ε > 0 and δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε , (2/ε) · log_2( 2 · H[2m] / δ ) } ,   (1)

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.

Today, we would like to better understand Condition (1). Note that it is equivalent to requiring that

ε ≥ max{ 8/m , (2/m) · log_2( 2 · H[2m] / δ ) } .

The question that we are interested in is whether the true error err_D(h) vanishes if we choose larger and larger m. This indeed works out if log_2(H[2m]) / m converges to 0.

For the trivial bound H[m] ≤ 2^m, this is not true. For threshold classifiers on a line, we could show that H[m] ≤ m + 1. This is sufficient. More generally, we ask: Is there a point after which H[m] grows subexponentially?

1 VC Dimension

Today, we will get to know the central notion of VC dimension. It was introduced by Vapnik and Chervonenkis in 1968. The VC dimension of a set of hypotheses H is roughly the point from which on H[m] is smaller than 2^m.

Definition 22.1. A set of hypotheses H shatters a set S ⊆ X if there are hypotheses in H that label S in all possible 2^{|S|} ways, that is, H[S] = 2^{|S|}.

Definition 22.2. The VC dimension of a set of hypotheses H is the largest size of a set S that is shattered by H, i.e., max{ |S| : H[S] = 2^{|S|} }. If there are sets of unbounded sizes that are shattered, then the VC dimension is infinite.

Let us consider a few examples.


• For X = R and H being the class of functions of the form h(x) = −1 for x ≤ t and h(x) = 1 otherwise, the VC dimension is 1. This is because any singleton {x} is shattered: h(x) = −1 and h′(x) = 1 for suitable choices of h and h′. In contrast, for any set of two points x_1 ≤ x_2 ∈ R, it is impossible that h(x_1) = 1 but h(x_2) = −1. (A brute-force check of this example is sketched after this list.)

• If H is finite, then the VC dimension is at most log_2 |H|.

• If X is infinite and H contains all functions h:X → {−1,1}, then the VC dimension is infinite.
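As an illustration of Definitions 22.1 and 22.2 and of the first example above, here is a brute-force Python sketch (my own addition, not from the notes) that checks which subsets of a few candidate points are shattered by the threshold class and thereby recovers its VC dimension of 1.

```python
from itertools import combinations

def threshold_hypothesis(t):
    """h_t(x) = -1 if x <= t, and 1 otherwise."""
    return lambda x: -1 if x <= t else 1

def hypotheses_for(points):
    # On a finite point set, thresholds at each point plus one below the minimum
    # realize every labeling this class can produce on those points.
    return [threshold_hypothesis(t) for t in [min(points) - 1.0, *points]]

def shatters(points):
    """True if every labeling in {-1, 1}^|points| is realized by some threshold."""
    if not points:
        return True                      # the empty set is trivially shattered
    realized = {tuple(h(x) for x in points) for h in hypotheses_for(points)}
    return len(realized) == 2 ** len(points)

candidates = [0.0, 1.0, 2.0, 3.0]
largest_shattered = max(
    len(s)
    for r in range(len(candidates) + 1)
    for s in combinations(candidates, r)
    if shatters(list(s))
)
print(largest_shattered)   # 1: every singleton is shattered, no pair is
```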

2 Bounding the Growth Function by the VC Dimension

Theorem 22.3 (Sauer’s Lemma). Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ \sum_{i=0}^{d} \binom{m}{i} .

In order to prove Sauer’s Lemma, the following lemma will turn out to be very helpful.

Lemma 22.4. Consider a set of data points S ⊆ X and let L be an arbitrary set of labelings ℓ: S → {−1, 1}. Then L shatters at least |L| subsets of S. That is, there are at least |L| distinct sets S′ ⊆ S such that S′ can be labeled in all 2^{|S′|} different ways using functions from L.

Proof. We prove the claim by induction on |L|. The base case is |L| = 1. In this case, the empty set is shattered.

For the induction step, consider |L| > 1. In this case, there has to be some x ∈ S such that ℓ(x) = −1 for some ℓ ∈ L and ℓ′(x) = 1 for some ℓ′ ∈ L. Let L− = {ℓ ∈ L | ℓ(x) = −1} and L+ = {ℓ ∈ L | ℓ(x) = 1}. Now, apply the induction hypothesis to the sets L− and L+. Let T− ⊆ 2^S and T+ ⊆ 2^S denote the respective families of shattered sets. By the induction hypothesis, we have |T−| ≥ |L−| and |T+| ≥ |L+|.

Note that there is no S′ ∈ T− or S′ ∈ T+ with x ∈ S′, because the label of x is always fixed to −1 or 1, respectively.

All of T− ∪ T+ is shattered by L. Additionally, if S′ ∈ T− ∩ T+, then S′ ∪ {x} is also shattered by L, because after assigning x an arbitrary label we can still assign all possible labels to S′ using a labeling in L. None of the sets constructed this way is contained in T− or T+, because they all contain x.

Consequently, the number of shattered sets is at least

|T− ∪ T+| + |T− ∩ T+| = |T−| + |T+| − |T− ∩ T+| + |T− ∩ T+| = |T−| + |T+| ≥ |L−| + |L+| = |L| .

Proof of Sauer’s Lemma. Given any set S ⊆ X of size m, we would like to bound H[S]. To this end, let L be the set of possible labelings ℓ: S → {−1, 1} obtained by applying the different hypotheses from H to S. Formally, L = {h|_S | h ∈ H}. By definition, H[S] = |L|.

Furthermore, let T ⊆ 2^S be the family of subsets of S that are shattered by H. Using Lemma 22.4, we know that |T| ≥ |L|.

We also know that no set larger than d can be shattered, so T contains only sets of size at most d. Therefore, the size of T is bounded by the number of such sets:

|T| ≤ \sum_{i=0}^{d} \binom{m}{i} .

In combination, H[S] = |L| ≤ |T| ≤ \sum_{i=0}^{d} \binom{m}{i} .
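Lemma 22.4 can also be checked empirically. The following Python sketch (my own illustration, not part of the notes) takes a small set S and an arbitrary collection L of labelings, enumerates all subsets of S that L shatters, and compares their number with |L|.

```python
from itertools import product, combinations
import random

def shattered_subsets(n, labelings):
    """All index subsets S' of {0, ..., n-1} on which `labelings` realizes
    every labeling in {-1, 1}^|S'|; the empty set is always among them."""
    result = []
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            realized = {tuple(lab[i] for i in subset) for lab in labelings}
            if len(realized) == 2 ** r:      # all 2^|S'| patterns occur
                result.append(subset)
    return result

# Arbitrary example: |S| = 5 and a random collection L of 12 distinct labelings.
random.seed(0)
n = 5
L = random.sample(list(product([-1, 1], repeat=n)), 12)

count = len(shattered_subsets(n, L))
print(count, ">=", len(L), count >= len(L))   # Lemma 22.4 guarantees count >= |L|
```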

To simplify the expression in Sauer’s Lemma, we can use the following bound on the binomial coefficients:

\binom{m}{i} = m! / ((m − i)! · i!) ≤ m^i / i! = (m/d)^i · d^i / i! ≤ (m/d)^d · d^i / i! ,

where the last inequality uses m ≥ d and i ≤ d. Together with the definition of the exponential function e^x = \sum_{i=0}^{∞} x^i / i!, we get

\sum_{i=0}^{d} \binom{m}{i} ≤ \sum_{i=0}^{d} (m/d)^d · d^i / i! = (m/d)^d \sum_{i=0}^{d} d^i / i! ≤ (m/d)^d · e^d .

This gives us the following corollary.

Corollary 22.5. Let H be a hypothesis class of VC dimension d. Then for all m ≥ d

H[m] ≤ (em/d)^d .
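To get a feel for how much Corollary 22.5 gives away compared to the exact sum in Sauer’s Lemma, here is a short numerical comparison (my own addition; the values of d and m are arbitrary).

```python
from math import comb, e

def sauer_sum(m, d):
    """Exact bound from Sauer's Lemma: sum_{i=0}^{d} binom(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

def corollary_bound(m, d):
    """Simplified bound from Corollary 22.5: (e * m / d)^d."""
    return (e * m / d) ** d

for d, m in [(3, 10), (5, 100), (10, 1000)]:
    print(f"d={d}, m={m}: sum={sauer_sum(m, d)}, "
          f"(em/d)^d={corollary_bound(m, d):.3g}, 2^m={2**m:.3g}")
```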

Plugging this bound into Condition (1), we get the following: for a hypothesis class H of VC dimension d and all choices of ε > 0, δ > 0, if we draw m times independently from distribution D such that

m ≥ max{ 8/ε , (2/ε) · log_2( 2 · (2em/d)^d / δ ) } = max{ 8/ε , (2d/ε) · log_2(2em/d) + (2/ε) · log_2(2/δ) } ,

then with probability at least 1 − δ, all h ∈ H with err_S(h) = 0 have err_D(h) < ε.

Corollary 22.6. Any hypothesis class of finite VC dimension is PAC-learnable.
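Since m appears on both sides of this condition (inside the logarithm), a concrete sample size can be obtained by searching for an m that satisfies it. The following sketch is my own illustration of such a search; the chosen values of ε, δ, and d are arbitrary.

```python
from math import log2, e

def condition_holds(m, d, eps, delta):
    """Check m >= max{8/eps, (2d/eps) * log2(2*e*m/d) + (2/eps) * log2(2/delta)}."""
    if m < d:                                   # Corollary 22.5 assumes m >= d
        return False
    rhs = max(8 / eps,
              (2 * d / eps) * log2(2 * e * m / d) + (2 / eps) * log2(2 / delta))
    return m >= rhs

def sample_size(d, eps, delta):
    """An m satisfying the condition, found by doubling from d."""
    m = max(d, 1)
    while not condition_holds(m, d, eps, delta):
        m *= 2
    return m

print(sample_size(d=10, eps=0.1, delta=0.05))   # enough samples for err_D < 0.1 w.p. 0.95
```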

3 Infinite VC Dimension

Not all hypothesis classes have a finite VC dimension. One example is the class of all functions X → {−1, 1} on an infinite set X. As we will show, these hypothesis classes are not PAC-learnable.

Theorem 22.7. Any hypothesis class of infinite VC dimension is not PAC-learnable.

To show this theorem, we have to show that the function m_H in the definition of PAC-learnability does not exist. We will show the following.

Proposition 22.8. Let H be a hypothesis class of VC dimension at least d. Then for every learning algorithm there exists a distribution such that on a training set of size d/2 we have err_D(h_S) ≥ 1/8 with probability at least 1/7.

Proof. By definition, H shatters a set of size d. So, let T ⊆ X, |T| = d, be such a set. By definition, any labeling ℓ: T → {−1, 1} can be extended to a hypothesis f ∈ H such that ℓ(x) = f(x) for all x ∈ T. There are k = 2^d such labelings. Let f_1, …, f_k be the respective extended hypotheses. Each of them can be the ground truth. Let D_i denote the uniform distribution over pairs (x, f_i(x)) for x ∈ T.

Our learning algorithm will have to infer the correct i. The important observation is that any sample of size at most d/2 tells us the correct labels of only at most d/2 points in T. The others are still completely arbitrary.

Let h_S be the hypothesis computed by the learning algorithm on sample S. In principle, this may also be randomized. Our goal is to show that

max_i Pr[ err_{D_i}(h_S) ≥ 1/8 ] ≥ 1/7 .

We will apply Yao’s principle: Draw I uniformly from {1, …, k} and consider D_I. This is potentially confusing: We first draw the index I randomly and then we use the probability distribution D_I. Since the maximum over i is at least the average over a uniformly random I, it suffices to show that

Pr[ err_{D_I}(h_S) ≥ 1/8 ] ≥ 1/7 .

Fix any x ∈ T. We bound the probability that h_S(x) ≠ f_I(x). To this end, we think of the labels of f_I as being determined in a different way: First draw the sample S and determine the labels for the points in this sample. Based on this, compute h_S. Only now determine the labels for the points not in the sample. If x is not in the sample, then h_S(x) is correct with probability 1/2. It is not in the sample with probability at least 1/2. Therefore

Pr[ h_S(x) ≠ f_I(x) ] ≥ 1/4 .

This holds for all x ∈ T, therefore

E[ err_{D_I}(h_S) ] ≥ 1/4 .

Now, we can apply Markov’s inequality to get

Pr[ err_{D_I}(h_S) < 1/8 ] = Pr[ 1 − err_{D_I}(h_S) > 7/8 ] ≤ (8/7) · E[ 1 − err_{D_I}(h_S) ] ≤ (8/7) · (3/4) = 6/7 .

Hence Pr[ err_{D_I}(h_S) ≥ 1/8 ] ≥ 1/7, which proves the claim.
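The lower-bound argument can be illustrated by simulation. The sketch below (my own addition, not part of the notes) fixes one particular learner that memorizes the sample, draws a random ground truth on a shattered set of size d, trains on d/2 uniform samples, and estimates how often the true error is at least 1/8; Proposition 22.8 predicts a frequency of at least 1/7.

```python
import random

def trial(d, rng):
    """One experiment: random ground truth on T = {0, ..., d-1}, a memorizing
    learner trained on d/2 uniform draws, and the resulting true error."""
    truth = [rng.choice([-1, 1]) for _ in range(d)]            # random f_I on T
    seen = {x: truth[x] for x in (rng.randrange(d) for _ in range(d // 2))}
    predict = lambda x: seen.get(x, 1)   # correct on seen points, guesses +1 otherwise
    mistakes = sum(1 for x in range(d) if predict(x) != truth[x])
    return mistakes / d                  # err_{D_I}(h_S)

rng = random.Random(0)
d, runs = 40, 10_000
frequency = sum(trial(d, rng) >= 1 / 8 for _ in range(runs)) / runs
print(frequency, ">= 1/7 ?", frequency >= 1 / 7)
```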

References and Further Reading

These notes are based on notes and lectures by Anna Karlin (https://courses.cs.washington.edu/courses/cse522/17sp/) and Avrim Blum (http://www.cs.cmu.edu/~avrim/ML14/). Also see the references therein.
