Machine Learning
Volker Roth
Department of Mathematics & Computer Science University of Basel
Section 6
Elements of Statistical Learning Theory
A ’black box’ model of learning
[Diagram: generator G produces x, supervisor S produces y, learning machine LM observes the pairs (x, y).]
G: Generates i.i.d. samples x according to an unknown pdf p(x).
S: Outputs values y according to an unknown conditional p(y|x).
LM: Observes pairs (x1, y1), ..., (xn, yn) ∼ p(x, y) = p(x) p(y|x).
Tries to capture the relation between x and y.
Expected Risk
Learning Process: frequentist view
The learning process is the process of choosing an appropriate function from a given set of functions.
Note: from a Bayesian viewpoint we would rather define a distribution over functions.
A good function should incur only few errors, i.e. have a small expected risk:
Expected Risk
The quantity
R[f] = E_(x,y)∼p { Loss(y, f(x)) }
is called the expected risk and measures the loss averaged over the unknown distribution.
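The expected risk can be made concrete with a small Monte Carlo sketch. The synthetic distribution below (uniform x, labels flipped with probability 0.1) and all names are my own choices, not from the slides; the point is only that averaging the loss over many i.i.d. draws approximates R[f].

```python
import random

# Hedged sketch: a synthetic joint distribution p(x, y) (my choice) and a
# Monte Carlo estimate of the expected risk R[f] = E[Loss(y, f(x))].
random.seed(0)

def sample_pair():
    # p(x): uniform on [-1, 1]; p(y|x): y = sign(x), flipped with prob 0.1
    x = random.uniform(-1.0, 1.0)
    y = 1 if x >= 0 else -1
    if random.random() < 0.1:
        y = -y
    return x, y

def zero_one_loss(y, fx):
    return 0.5 * abs(1 - fx * y)

def f(x):
    # a fixed candidate classifier: the Bayes rule for this toy setup
    return 1 if x >= 0 else -1

# R[f] ~ average loss over many i.i.d. draws; the true value here is 0.1
n = 100_000
risk = sum(zero_one_loss(y, f(x)) for x, y in (sample_pair() for _ in range(n))) / n
print(round(risk, 2))   # close to 0.1
```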
Empirical Risk
The best possible risk is inf_f R[f]. The infimum is often achieved at a minimizer fρ that we call the target function.
...but we just have a sample S. To find "good" functions we typically have to restrict ourselves to a hypothesis space H containing functions with some property (regularity, smoothness etc.). In a given hypothesis space H, denote by f∗ the best possible function that can be implemented by the learning machine.
Empirical risk
The empirical risk of a function f is
Remp[f] = (1/n) Σ_{i=1}^n Loss(yi, f(xi)).
Denote by fS ∈ H the empirical risk minimizer on sample S.
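Empirical risk minimization over a small finite class can be sketched directly. The threshold classifiers, the noise level, and the sample below are illustrative assumptions of mine, not part of the slides.

```python
import random

# Hedged sketch: empirical risk Remp[f] on a sample S and the empirical
# risk minimizer f_S over a small finite hypothesis space H (all my choices).
random.seed(1)

def zero_one_loss(y, fx):
    return 0.5 * abs(1 - fx * y)

def remp(f, S):
    # empirical risk: average loss over the sample
    return sum(zero_one_loss(y, f(x)) for x, y in S) / len(S)

# Sample S: y = sign(x - 0.3) with 5% label noise
S = []
for _ in range(200):
    x = random.uniform(-1, 1)
    y = 1 if x >= 0.3 else -1
    if random.random() < 0.05:
        y = -y
    S.append((x, y))

# H: a few threshold classifiers f_t(x) = sign(x - t)
H = [lambda x, t=t: 1 if x >= t else -1 for t in [-0.5, 0.0, 0.3, 0.5]]
f_S = min(H, key=lambda f: remp(f, S))   # empirical risk minimizer on S
print(remp(f_S, S))
```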
Hypothesis space
Generalization
SLT gives results for bounding the error on the unseen data, given only the training data.
There needs to be a relation that couples past and future.
Sampling Assumption
The only assumption of SLT is that all samples (past and future) are i.i.d.
Typical structure of a bound: with probability 1−δ it holds that
R[fn] ≤ Remp[fn] + √( (a/n)·( capacity(H) + ln(b/δ) ) ),
with some constants a, b > 0. The first term Remp[fn] is known; the square-root term is the confidence term.
Convergence of random variables
Definition (Convergence in Probability)
Let X1, X2, ... be random variables. We say that Xn converges in probability to the random variable X as n → ∞ iff, for all ε > 0,
P(|Xn − X| > ε) → 0, as n → ∞.
We write Xn →p X as n → ∞.
The simplest case: Binary classification
Binary classification with 0/1 loss
Our analysis considers the case where
f : X → {−1, 1}, and L(y, f(x)) = (1/2)|1 − f(x)y|.
Note: We can use any hypothesis space and apply the signum function:
H0 = {f0 = sign(f) | f ∈ H}
Similar results from SLT are also available for other classification loss functions and for regression (but we will not discuss this here).
Consistency of ERM
The principle of empirical risk minimization is consistent if for any ε > 0,
lim_{n→∞} P(|R[fn] − R[f∗]| > ε) = 0 and
lim_{n→∞} P(|Remp[fn] − R[fn]| > ε) = 0.
A counter example
Why is bounding P(|Remp[fn] − R[f∗]| > ε) not sufficient?
[Figure: sketch contrasting the empirical risk minimizer fn with the best function f∗.]
Hoeffding’s inequality
Theorem (Hoeffding)
Let ξ1, ..., ξn be n independent instances of a bounded random variable ξ, with values in [a, b]. Denote their average by Qn = (1/n) Σ_i ξi. Then for any ε > 0,
P(Qn − E(ξ) ≥ ε) ≤ exp( −2nε² / (b−a)² ) and P(E(ξ) − Qn ≥ ε) ≤ exp( −2nε² / (b−a)² )   (1)
and
P(|Qn − E(ξ)| ≥ ε) ≤ 2 exp( −2nε² / (b−a)² )   (2)
Hoeffding’s inequality
Let ξ be the 0/1 loss:
ξ = (1/2)|1 − f(x)y| = L(y, f(x)).
Then
Qn[f] = (1/n) Σ_{i=1}^n ξi = (1/n) Σ_{i=1}^n L(yi, f(xi)) = Remp[f]   and
E[ξ] = E[L(y, f(x))] = R[f].
I.i.d. sampling assumption: the ξi are independent instances of a bounded random variable ξ, with values in [0, 1].
Hoeffding's inequality for fixed functions:
P(|Remp[f] − R[f]| > ε) ≤ 2 exp(−2nε²)
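The fixed-function bound is easy to check empirically. In the sketch below the Bernoulli setup (a fixed classifier whose per-point loss has mean 0.3) and all constants are my own illustrative assumptions; the observed deviation frequency should sit well below the Hoeffding bound 2 exp(−2nε²).

```python
import random, math

# Hedged simulation: check P(|Remp[f] - R[f]| > eps) <= 2 exp(-2 n eps^2)
# for ONE fixed classifier f whose 0/1 loss is Bernoulli(0.3) (my setup).
random.seed(2)

n, eps, trials = 100, 0.1, 2000
true_risk = 0.3

def remp_once():
    # empirical risk of the fixed f on a fresh i.i.d. sample of size n
    return sum(1 for _ in range(n) if random.random() < true_risk) / n

deviations = sum(1 for _ in range(trials) if abs(remp_once() - true_risk) > eps)
freq = deviations / trials
bound = 2 * math.exp(-2 * n * eps**2)   # = 2 e^{-2} ~ 0.27 for n=100, eps=0.1
print(freq, "<=", round(bound, 3))
```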
Hoeffding’s inequality
Hoeffding's inequality gives us rates of convergence for any fixed function.
Example: Let f ∈ H be an arbitrary fixed function.
For ε = 0.1 and n = 100: P(|Remp[f] − R[f]| > 0.1) ≤ 0.28.
For ε = 0.1 and n = 200: P(|Remp[f] − R[f]| > 0.1) ≤ 0.04.
Caution!
Hoeffding's inequality does not tell us that P(|Remp[fn] − R[fn]| > ε) ≤ 2 exp(−2nε²).
Because: fn is chosen to minimize Remp. This is not a fixed function!
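The danger can be demonstrated directly. In the construction below (mine, not from the slides) every "function" is a random labeling, so every f has true risk exactly 0.5; yet the empirical risk of the minimizer over 1000 such functions looks far better than 0.5 on the training sample, which is pure selection bias.

```python
import random

# Hedged illustration: fn minimizes Remp, so the fixed-function Hoeffding
# bound does not apply to it. Here R[f] = 0.5 for EVERY f in H by
# construction, yet Remp[fn] is far below 0.5 (my setup, not the slides').
random.seed(3)

n, N = 50, 1000                                        # sample size, |H|
y = [random.choice([-1, 1]) for _ in range(n)]         # true labels
H = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(N)]

def remp(preds):
    # 0/1 empirical risk of a fixed prediction vector
    return sum(p != t for p, t in zip(preds, y)) / n

best = min(remp(f) for f in H)   # Remp[fn] after minimizing over H
print(best)                      # well below the true risk 0.5
```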
Consistency
[Figure: risk as a function of sample size; Remp[f] converges to R[f] for each fixed f.]
For each fixed function f, Remp[f] →p R[f] as n → ∞.
This does not mean that the empirical risk minimizer fn will lead to a value of the risk that is as good as possible, R[f∗] (consistency).
Conditions for consistency
Let
fn := argmin_{f∈H} Remp[f],   f∗ := argmin_{f∈H} R[f].
Then, by definition of the two minimizers,
R[f] − R[f∗] ≥ 0, ∀f ∈ H   (in particular R[fn] − R[f∗] ≥ 0),
Remp[f] − Remp[fn] ≥ 0, ∀f ∈ H   (in particular Remp[f∗] − Remp[fn] ≥ 0).
Conditions for consistency
0 ≤ (R[fn] − R[f∗]) + (Remp[f∗] − Remp[fn])      (both terms are ≥ 0)
  = R[fn] − Remp[fn] + Remp[f∗] − R[f∗]
  ≤ sup_{f∈H} (R[f] − Remp[f]) + (Remp[f∗] − R[f∗]).
The last term, Remp[f∗] − R[f∗], converges to 0 in probability by Hoeffding, since f∗ is a fixed function.
Assume sup_{f∈H} (R[f] − Remp[f]) →p 0 as n → ∞:
one-sided uniform convergence over all functions in H.
Conditions for consistency
Under this assumption,
0 ≤ (R[fn] − R[f∗]) + (Remp[f∗] − Remp[fn]) →p 0 as n → ∞.
Together with Remp[f∗] − R[f∗] →p 0 (Hoeffding), this implies
R[fn] − R[f∗] →p 0   and   R[fn] − Remp[fn] →p 0.
Hence sup_{f∈H} (R[f] − Remp[f]) →p 0  ⇒  consistency of ERM.
Thus, it is a sufficient condition for consistency.
The key theorem of learning theory
Theorem (Vapnik & Chervonenkis ’98)
Let H be a set of functions with bounded loss for the distribution F(x, y): A ≤ R[f] ≤ B, ∀f ∈ H.
For the ERM principle to be consistent, it is necessary and sufficient that
lim_{n→∞} P( sup_{f∈H} (R[f] − Remp[f]) > ε ) = 0,   ∀ε > 0.
Note: here, we looked only at the sufficient condition for consistency.
For the necessary condition see (Vapnik & Chervonenkis ’98).
The key theorem of learning theory
The key theorem asserts that any analysis of the convergence of ERM must be a worst case analysis.
We will show:
Consistency depends on the capacity of the hypothesis space.
But there are some open questions:
How can we check the condition for the theorem (uniform one-sided convergence) in practice?
Are there “simple” hypothesis classes with guaranteed consistency?
Analysis is still asymptotic.
What can we say about finite sample sizes?
Finite hypothesis spaces
Assume the set H contains only 2 functions: H = {f1, f2}.
Let
Ci := {(x1, y1), ..., (xn, yn) | R[fi] − Remp[fi] > ε}
be the set of samples for which the risks of fi differ by more than ε.
Hoeffding's inequality: P(Ci) ≤ exp(−2nε²).
Union bound:
P( sup_{f∈H} (R[f] − Remp[f]) > ε ) = P(C1 ∪ C2) = P(C1) + P(C2) − P(C1 ∩ C2)
  ≤ P(C1) + P(C2) ≤ 2 exp(−2nε²).
Finite hypothesis spaces
Assume H contains a finite number of functions: H = {f1, ..., fN}.
Ci := {(x1, y1), ..., (xn, yn) | R[fi] − Remp[fi] > ε}
Hoeffding's inequality: P(Ci) ≤ exp(−2nε²).
Union bound: P(∪_{i=1}^N Ci) ≤ Σ_{i=1}^N P(Ci) ≤ N exp(−2nε²), so
P( sup_{f∈H} (R[f] − Remp[f]) > ε ) ≤ N exp(−2nε²) = exp(ln N − 2nε²).
For any finite hypothesis space, ERM is consistent.
The convergence is exponentially fast.
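The union bound can be simulated numerically. One simplification in the sketch below (my own, for brevity): each function's empirical risk is drawn from an independent Binomial mean, whereas in reality all Remp[fi] share the same sample; the true risks and constants are also illustrative assumptions.

```python
import random, math

# Hedged simulation of the finite-class union bound
# P(sup_f (R[f] - Remp[f]) > eps) <= N exp(-2 n eps^2)  (my setup).
random.seed(4)

N, n, eps, trials = 20, 100, 0.15, 1000
risks = [random.uniform(0.2, 0.5) for _ in range(N)]   # true risks R[f_i]

def sup_deviation():
    # one trial: simulate Remp[f_i] for every f_i (independently here, a
    # simplification -- the real Remp's would share one sample) and take
    # the largest one-sided deviation R[f_i] - Remp[f_i]
    sup = -1.0
    for R in risks:
        remp = sum(1 for _ in range(n) if random.random() < R) / n
        sup = max(sup, R - remp)
    return sup

freq = sum(1 for _ in range(trials) if sup_deviation() > eps) / trials
bound = N * math.exp(-2 * n * eps**2)
print(freq, "<=", round(bound, 3))
```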
Some consequences
P( sup_{f∈H} (R[f] − Remp[f]) > ε ) ≤ exp(ln N − 2nε²)
The bound holds uniformly for all functions in H, so we can use it for the function that minimizes Remp. We can bound the test error:
P(R[fn] − Remp[fn] > ε) ≤ exp(ln N − 2nε²).
Some consequences
We can derive a confidence interval: equate the r.h.s. to δ and solve for ε:
P(R[fn] − Remp[fn] > ε) ≤ δ(ε)
P(R[fn] − Remp[fn] ≤ ε) ≥ 1 − δ(ε)
With probability at least 1−δ it holds that
R[fn] ≤ Remp[fn] + ε(δ), i.e.
R[fn] ≤ Remp[fn] + √( (a/n)·( ln N + ln(b/δ) ) ),   with a = 1/2, b = 1,
where ln N plays the role of Capacity(H).
The bound depends only on H and n.
However: "simple" spaces (like the space of linear functions) contain infinitely many functions.
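Solving for ε(δ) is a one-line computation. The sketch below (function name and example values are mine) evaluates the confidence term √((ln N + ln(1/δ))/(2n)) and shows that it shrinks with n while growing only logarithmically with N.

```python
import math

# Direct evaluation of the confidence term from the slide, with a = 1/2,
# b = 1:  eps(delta) = sqrt( (a/n) * (ln N + ln(b/delta)) ).
def confidence_term(N, n, delta, a=0.5, b=1.0):
    return math.sqrt(a / n * (math.log(N) + math.log(b / delta)))

# Shrinks like 1/sqrt(n); depends on the class size only through ln N
print(round(confidence_term(N=1000, n=100, delta=0.05), 3))
print(round(confidence_term(N=1000, n=10000, delta=0.05), 3))
```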
Infinite to finite (?)
Observation: Remp[f] effectively refers only to a finite function class: on n sample points x1, ..., xn, the functions f ∈ H can take at most 2^n different value assignments y1, ..., yn.
But this does not yet solve our problem: the resulting confidence term ln(2^n)/n = ln 2 does not converge to 0 as n → ∞. But let's formalize this idea first...
Infinite case: Shattering Coefficient
Let a sample Zn := {(x1, y1), ..., (xn, yn)} be given.
Denote by N(H, Zn) the cardinality of H when restricted to {x1, ..., xn}, written H|Zn, i.e. the number of functions from H that can be distinguished on the given sample.
Consider now the maximum over all possible n-samples:
Definition (Shattering Coefficient)
The shattering coefficient is the maximum number of ways into which n points can be classified by the function class:
N(H, n) = max_{Zn} N(H, Zn).
Since f(x) ∈ {−1, 1}, N(H, n) is finite:
N(H, Zn) ≤ N(H, n) ≤ 2^n.
Example
Linear functions
H = {sign(⟨w, x⟩ + b) | w ∈ R², b ∈ R}
N(H, 2) = 4 = 2^2
Example
N(H, 3) = 8 = 2^3
Example
N(H, 4) = 14 < 2^4
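The counts 4, 8, 14 can be reproduced with Cover's counting function, a known closed form that is not derived on these slides: for hyperplanes with bias in R^d, n points in general position admit 2·Σ_{k=0}^{d} C(n−1, k) dichotomies.

```python
from math import comb

# Hedged check of the slide's numbers via Cover's counting function (a
# standard result, not derived here): dichotomies of n points in general
# position by affine hyperplanes in R^d.
def dichotomies(n, d):
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

# Lines in the plane (d = 2), matching the examples on the slides
for n in (2, 3, 4):
    print(n, dichotomies(n, d=2), 2**n)   # 4=2^2, 8=2^3, 14<2^4
```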
Capacity concepts
Recall: we search for other capacity measures of H replacing ln N.
We know
N(H, Zn) ≤ N(H, n) ≤ 2^n,
where N(H, Zn) depends on the sample and 2^n is too loose.
The dependency on the sample can be removed by averaging over all samples: E[N(H, Zn)]. It turns out that this is a valid capacity measure:
Theorem (Vapnik and Chervonenkis)
Let Z2n = ((x1, y1), ..., (x2n, y2n)) be a sample of size 2n. For any ε > 0 it holds that
P( sup_{f∈H} (R[f] − Remp[f]) > ε ) ≤ 4 exp( ln E[N(H, Z2n)] − nε²/8 ).
If ln E[N(H, Z2n)] grows sublinearly, we get a nontrivial bound.
Some consequences
P( sup_{f∈H} (R[f] − Remp[f]) > ε ) ≤ 4 exp( ln E[N(H, Z2n)] − nε²/8 )
The bound holds uniformly for all functions in H, so we can use it for the function that minimizes Remp. We can bound the test error:
P(R[fn] − Remp[fn] > ε) ≤ 4 E[N(H, Z2n)] exp(−nε²/8).
We can derive a confidence interval: equate the r.h.s. to δ and solve for ε. With probability at least 1−δ it holds that
R[fn] ≤ Remp[fn] + ε(δ) = Remp[fn] + √( (8/n)·( ln E[N(H, Z2n)] + ln(4/δ) ) ).
The bound depends on H, n and the unknown probability P(Z).
VC Dimension and other capacity concepts
Growth function: upper bound the expectation by the maximum:
G_H(n) = ln[ max_{Zn} N(H, Zn) ] = ln N(H, n),
the logarithm of the shattering coefficient.
VC dimension: recall that N(H, n) ≤ 2^n. Vapnik & Chervonenkis showed that either N(H, n) = 2^n for all n, or there exists some maximal n for which this is the case.
Definition
The VC dimension h of a class H is the largest n such that N(H, n) = 2^n, or, equivalently, G_H(n) = n ln 2.
Interpretation: the VC dimension is the maximal number of samples that can be classified in all 2^n possible ways.
VC Dimension
4 points in 2D cannot be labeled in all possible ways by linear functions.
The VC dimension is 3!
A remarkable property of the growth function
Theorem (Vapnik & Chervonenkis)
Let H be a class of functions with finite VC dimension h. Then for n ≤ h, G_H(n) grows linearly with the sample size, and for all n > h,
G_H(n) ≤ h·( ln(n/h) + 1 ).
[Figure: G_H(n) together with the bound h·(ln(n/h) + 1) as functions of n.]
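The effect of the theorem is easy to see numerically: past n = h the bound grows only logarithmically, far below the trivial n·ln 2. The helper below is my own sketch of the two regimes.

```python
import math

# Numeric illustration of the theorem: for n <= h the growth function can
# equal n*ln(2); for n > h it is bounded by h*(ln(n/h) + 1).
def gh_bound(n, h):
    if n <= h:
        return n * math.log(2)          # linear regime
    return h * (math.log(n / h) + 1)    # logarithmic regime

h = 3
for n in (3, 10, 100):
    print(n, round(gh_bound(n, h), 2), round(n * math.log(2), 2))
```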
Capacity concepts
Relation of capacity concepts:
ln E[N(H, Z2n)]  ≤  G_H(n)  ≤  h·( ln(n/h) + 1 ),
where the leftmost term is distribution dependent and the rightmost term is distribution independent and (sometimes) easy to compute.
Structure of bounds:
R[fn] ≤ Remp[fn] + √( (a/n)·( capacity(H) + ln(b/δ) ) )
If the VC dimension is finite, we get non-trivial bounds!
VC-Dimension for linear functions
Theorem
The VC dimension of linear functions in d-dimensional space is d + 1.
Question: does the number of parameters coincide with the VC dimension? No! Counter example:
FIGURE 7.5 in (Hastie et al.: The Elements of Statistical Learning). Solid curve: sin(50x) for x ∈ [0, 1]. Blue and green points illustrate how sign(sin(αx)) can separate an arbitrarily large number of points by choosing a high frequency α.
The VC dimension of {sign(sin(αx)) | α ∈ R} is infinite.
Linear functions: Role of the margin
Recall: the VC dimension of linear functions on R^d is d + 1.
We need finite VC dimension for “simple” nontrivial bounds.
Question: is learning impossible in infinite dimensional spaces (e.g. Gaussian RBF kernels)?
Not necessarily! The capacity of the subset of hyperplanes with large classification margin can be much smaller than the general VC dimension of all hyperplanes.
Recall: Decision hyperplanes
f(x; w) defines the distance r from x to the hyperplane: write x = xp + r·w/‖w‖ with f(xp) = 0; then f(x) = r‖w‖, i.e. r = f(x)/‖w‖.
[FIGURE 5.2: The linear decision boundary H, where g(x) = w^t x + w0 = 0, separates the feature space into two half-spaces R1 (where g(x) > 0) and R2 (where g(x) < 0). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Canonical hyperplanes
The definition of a hyperplane is not unique: the weight vector w can be multiplied by any nonzero constant.
The definition of a canonical hyperplane overcomes this ambiguity by additionally requiring
min_{i=1,...,n} |w^t xi + w0| = 1.
Distance between the canonical hyperplane and the closest point:
margin r = 1/‖w‖.
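The canonical rescaling is a two-line computation. The four points and the initial hyperplane below are my own illustrative choices; after dividing (w, w0) by the minimal |w^t xi + w0|, the closest point sits at functional value 1 and the margin is exactly 1/‖w‖.

```python
import math

# Hedged sketch (data is mine): rescale (w, w0) to canonical form,
# min_i |w^t x_i + w0| = 1, and read off the margin as 1/||w||.
points = [(2.0, 1.0), (0.5, -1.0), (-1.0, 0.5), (3.0, 3.0)]
w, w0 = (1.0, 1.0), -1.0          # some hyperplane w^t x + w0 = 0

m = min(abs(w[0] * x1 + w[1] * x2 + w0) for x1, x2 in points)
w_c = (w[0] / m, w[1] / m)        # canonical rescaling: divide by the
w0_c = w0 / m                     # minimal functional value m
margin = 1.0 / math.hypot(*w_c)   # margin = 1/||w_c|| = m/||w||

print(round(margin, 3))
```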
Structure on canonical hyperplanes
Theorem (Vapnik, 1982)
Let R be the radius of the smallest ball containing the points x1, ..., xn: BR(a) = {x ∈ R^d : ‖x − a‖ < R}, a ∈ R^d. The set of canonical hyperplane decision functions f(w, w0) = sign{w^t x + w0} satisfying ‖w‖ ≤ A has VC dimension h bounded by
h ≤ R²A² + 1.
Intuitive interpretation: margin = 1/‖w‖, so minimizing capacity(H) corresponds to maximizing the margin.
Structure of bounds:
R[fn] ≤ Remp[fn] + √( (a/n)·( capacity(H) + ln(b/δ) ) )
⇒ Large margin classifiers.
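The theorem above can be read off numerically. In the sketch below (function name and example numbers are mine), a larger guaranteed margin means a smaller admissible A = 1/margin and hence a smaller capacity bound h ≤ R²A² + 1, independently of the input dimension.

```python
# Hedged numeric reading of Vapnik's margin-based VC bound: canonical
# hyperplanes with ||w|| <= A on data inside a ball of radius R satisfy
# h <= R^2 * A^2 + 1. Since margin = 1/||w|| >= 1/A, a large margin
# (small A) caps the capacity regardless of the dimension d.
def vc_bound(R, A):
    return R * R * A * A + 1

print(vc_bound(R=1.0, A=10.0))   # margin >= 0.1  ->  h <= 101
print(vc_bound(R=1.0, A=2.0))    # margin >= 0.5  ->  h <= 5
```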