
and $k \gtrsim \varepsilon^{-2}\log n$, there exists a linear map $T : H \to \mathbb{R}^k$ such that
\[
(1-\varepsilon)\|x_i - x_j\| \le \|Tx_i - Tx_j\| \le (1+\varepsilon)\|x_i - x_j\| \quad \text{for all } 1 \le i,j \le n.
\]

This result should be interpreted in terms of compression: if we want to store the distances between $n$ points in a data structure, and if we tolerate a small distortion of order $\varepsilon$, it suffices to store an $n\times k$ matrix of size $\sim n\log n$ rather than the full $n\times n$ distance matrix of size $\sim n^2$.

At first sight, the Johnson-Lindenstrauss lemma has nothing to do with probability: it is a deterministic statement about the geometry of Hilbert spaces. However, the easiest way to find $T$ is to select it randomly!

a. Argue that we can assume without loss of generality that $H = \mathbb{R}^n$.
b. For a $k\times n$ random matrix $T$ such that $T_{ij}$ are i.i.d. $N(0, k^{-1})$, show that
\[
\mathbf{P}\big[\,\big|\|Tz\| - \mathbf{E}\|Tz\|\big| \ge \varepsilon\|z\|\,\big] \le 2e^{-k\varepsilon^2/2} \quad\text{for } z\in\mathbb{R}^n.
\]
Hint: Gaussian concentration.

c. Show that
\[
\sqrt{1-k^{-1}}\,\|z\| \le \mathbf{E}\|Tz\| \le \|z\|,
\]
and conclude that for $0<\varepsilon<1$ and $k\ge\varepsilon^{-1}$
\[
\mathbf{P}\big[(1-\varepsilon)\|z\| < \|Tz\| < (1+\varepsilon)\|z\|\big] \ge 1-2e^{-k\varepsilon^2/8} \quad\text{for } z\in\mathbb{R}^n.
\]
Hint: use $\mathbf{E}\|Tz\| \le \mathbf{E}[\|Tz\|^2]^{1/2}$ for the upper bound. For the lower bound, estimate $\mathrm{Var}\,\|Tz\|$ from above using the Gaussian Poincaré inequality.

d. Show that if $k > 24\,\varepsilon^{-2}\log n$, then
\[
\mathbf{P}\big[(1-\varepsilon)\|x_i-x_j\| < \|Tx_i - Tx_j\| < (1+\varepsilon)\|x_i-x_j\| \text{ for all } i,j\big] > 0.
\]
Hint: use a union bound.
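The random projection in this problem is easy to try numerically. The following minimal sketch (Python with NumPy; the point cloud, ambient dimension, and tolerance are arbitrary illustrative choices, not part of the problem) draws $T$ with i.i.d. $N(0,k^{-1})$ entries and checks the distortion of all pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n points in R^d standing in for the Hilbert space H.
n, d = 50, 1000
eps = 0.5
k = int(24 * eps**-2 * np.log(n)) + 1   # k > 24 eps^{-2} log n, as in part d.

X = rng.normal(size=(n, d))                        # arbitrary point cloud
T = rng.normal(scale=1 / np.sqrt(k), size=(k, d))  # T_ij i.i.d. N(0, 1/k)
Y = X @ T.T                                        # images T x_i in R^k

# distortion of all pairwise distances
i, j = np.triu_indices(n, k=1)
ratio = (np.linalg.norm(Y[i] - Y[j], axis=1)
         / np.linalg.norm(X[i] - X[j], axis=1))
print(f"k = {k}, distance ratios in [{ratio.min():.3f}, {ratio.max():.3f}]")
# with positive (in fact high) probability this lies inside (1-eps, 1+eps)
```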

5.2 Covering, packing, and approximation

If the set $T$ is infinite, the maximal inequalities of the previous section provide no information. This is, however, not surprising. We have seen that the inequalities for finite maxima work well when the random variables are independent. On the other hand, suppose that $T$ is infinite but that $t\mapsto X_t$ is continuous in a suitable sense. Then $\lim_{t\to s}X_t = X_s$, so $X_t$ and $X_s$ must be strongly dependent when $t$ and $s$ are nearby points! Thus the lack of independence should in fact help us to control the infinite supremum: we should apply the maximal inequalities of the previous section only to a finite number of well-separated points (at which the process might be expected to be nearly independent), and use continuity to control the fluctuations of the remaining (strongly dependent) degrees of freedom. In this section, we will develop the crudest illustration of this principle, which will be systematically developed in the sequel into a powerful machinery to control suprema.

To implement the above idea, we need to have a quantitative notion of continuity. In this section, we will use the simplest (but, as we will see, often unsatisfactory) such notion for random processes.

Definition 5.4 (Lipschitz process). The random process $\{X_t\}_{t\in T}$ is called Lipschitz for a metric $d$ on $T$ if there exists a random variable $C$ such that
\[
|X_t - X_s| \le C\,d(t,s) \quad\text{for all } t,s\in T.
\]

Given a Lipschitz process, our aim is to approximate the supremum over $T$ by the maximum over a finite set $N$, to which we will apply the inequalities of the previous section. To obtain a good bound, we have two competing demands: on the one hand, we would like the set $N$ to be as small as possible (so that the bound on the maximum is small); on the other hand, to control the approximation error, we must make sure that every point in $T$ is close to at least one of the points in $N$. This leads to the following concept.

Definition 5.5 ($\varepsilon$-net and covering number). A set $N$ is called an $\varepsilon$-net for $(T,d)$ if for every $t\in T$, there exists $\pi(t)\in N$ such that $d(t,\pi(t))\le\varepsilon$. The smallest cardinality of an $\varepsilon$-net for $(T,d)$ is called the covering number
\[
N(T,d,\varepsilon) := \inf\{|N| : N \text{ is an } \varepsilon\text{-net for } (T,d)\}.
\]

The covering number $N(T,d,\varepsilon)$ should be viewed as a measure of the complexity of the set $T$ at the scale $\varepsilon$: the more complex $T$, the more points we will need to approximate its structure up to a fixed precision. Alternatively, we can interpret the covering number as describing the geometry of the metric space $(T,d)$. Indeed, let $B(t,\varepsilon) = \{s : d(t,s)\le\varepsilon\}$ be a ball of radius $\varepsilon$. Then
\[
N \text{ is an } \varepsilon\text{-net} \quad\text{if and only if}\quad T \subseteq \bigcup_{t\in N} B(t,\varepsilon),
\]
so that the covering number $N(T,d,\varepsilon)$ is the smallest number of balls of radius $\varepsilon$ needed to cover $T$ (hence the name). We can therefore interpret the covering number as a measure of the degree of (non-)compactness of $(T,d)$.

Remark 5.6. In many applications, we may want to compute the supremum $\sup_{t\in T}X_t$ of a stochastic process $\{X_t\}_{t\in S}$ that is defined on a larger index set $S\supset T$. In this case, even though we are only interested in the process on the set $T$, it is not necessary to require that the $\varepsilon$-net $N$ is a subset of $T$: it can be convenient to approximate the set $T$ by points in $S\setminus T$ also. For this reason, we have not insisted in the above definition that $N\subseteq T$.

We are now ready to develop our first bound on the supremum of a random process. We adopt the notation of Definitions 5.4 and 5.5.

Lemma 5.7 (Lipschitz maximal inequality). Suppose $\{X_t\}_{t\in T}$ is a Lipschitz process (Definition 5.4) and $X_t$ is $\sigma^2$-subgaussian for every $t\in T$. Then
\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \inf_{\varepsilon>0}\Big\{\varepsilon\,\mathbf{E}[C] + \sqrt{2\sigma^2\log N(T,d,\varepsilon)}\Big\}.
\]

Note that this result is indeed a simple incarnation of the informal principle formulated in Chapter 1: if the process $X_t$ is “sufficiently continuous,” then $\sup_{t\in T}X_t$ is controlled by the “complexity” of the index set $T$.

Proof. Let $\varepsilon>0$ and let $N$ be an $\varepsilon$-net. Then
\[
\sup_{t\in T}X_t \le \sup_{t\in T}\{X_t - X_{\pi(t)}\} + \sup_{t\in T}X_{\pi(t)} \le C\varepsilon + \max_{t\in N}X_t.
\]
Taking the expectation and using Lemma 5.1 yields

\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \varepsilon\,\mathbf{E}[C] + \sqrt{2\sigma^2\log|N|}.
\]

Optimizing over $\varepsilon$-nets $N$ and $\varepsilon>0$ yields the result. $\square$

Remark 5.8. The idea behind Lemma 5.7 is that it allows us to trade off between exploiting independence (better at large scales) and controlling for dependence (worse at large scales). However, note that we never explicitly assume or use independence in the proof: instead, the distance $d$ could be interpreted as a proxy for the degree of independence. While the conclusion of Lemma 5.7 does not depend on the validity of this interpretation, we expect that such bounds (and the more powerful bounds to be developed in the sequel) will be most effective when the distance $d$ is chosen in such a way that large distance does indeed correspond to more independence. This is often the case in practice. In the case of Gaussian processes, for example, we will see in the next chapter that this idea holds to such a degree that we can obtain matching upper and lower bounds for the supremum of Gaussian processes in terms of the geometry of the index set $(T,d)$, albeit in a much more sophisticated manner than is captured by the trivial Lemma 5.7.
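To see the tradeoff in Lemma 5.7 concretely, the sketch below numerically optimizes the bound over $\varepsilon$. The inputs are illustrative assumptions, not data from the text: a covering-number bound of the form $(3/\varepsilon)^n$ (proved below in Lemma 5.13 for the Euclidean ball) and arbitrary values for $\mathbf{E}[C]$ and $\sigma$.

```python
import numpy as np

def lipschitz_maximal_bound(E_C, sigma, log_N, eps_grid):
    """Evaluate eps*E[C] + sqrt(2*sigma^2*log N(T,d,eps)) on a grid of eps
    and return the best epsilon and the resulting bound (Lemma 5.7)."""
    vals = eps_grid * E_C + np.sqrt(2 * sigma**2 * log_N(eps_grid))
    best = np.argmin(vals)
    return eps_grid[best], vals[best]

# Illustrative inputs: T = unit ball in R^5 with log N(eps) <= 5*log(3/eps),
# E[C] = 10, sigma = 1.
eps_grid = np.linspace(1e-3, 0.999, 2000)
best_eps, bound = lipschitz_maximal_bound(
    E_C=10.0, sigma=1.0, log_N=lambda e: 5 * np.log(3 / e), eps_grid=eps_grid)
print(f"optimal eps ~ {best_eps:.3f}, bound on E[sup X_t] ~ {bound:.2f}")
```

Small $\varepsilon$ shrinks the approximation error $\varepsilon\,\mathbf{E}[C]$ but inflates the entropy term, and vice versa; the optimum balances the two.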

Remark 5.9. When $N(T,d,\varepsilon)=\infty$, the bound of Lemma 5.7 is infinite. However, note that if $X_1,X_2,\ldots$ are i.i.d. unbounded random variables, then we already have $\sup_i X_i = \infty$ a.s. It is therefore to be expected that the supremum of a random process will typically indeed be infinite if it contains infinitely many independent degrees of freedom. Thus the fact that $N(T,d,\varepsilon)=\infty$ (which means there are infinitely many points in $T$ that are well separated) yields an infinite bound is not a shortcoming of Lemma 5.7. To obtain a finite supremum for noncompact index sets $T$ one must often add a penalty inside the supremum; such problems will be investigated in section 5.4 below.


In the remainder of this section, we will illustrate the application of Lemma 5.7 using two illuminating examples. Along the way, we will develop some useful techniques for controlling covering numbers.

Example 5.10 (Random matrices). Let $M$ be an $n\times m$ random matrix such that $M_{ij}$ are independent $\sigma^2$-subgaussian random variables. We would like to estimate the magnitude of the operator norm
\[
\|M\| := \sup_{v\in B_2^n,\,w\in B_2^m}\langle v, Mw\rangle = \sup_{(v,w)\in T} X_{v,w},
\]
where $B_2^n = \{x\in\mathbb{R}^n : \|x\|\le 1\}$ is the Euclidean unit ball in $\mathbb{R}^n$ and
\[
T := B_2^n\times B_2^m, \qquad X_{v,w} := \langle v, Mw\rangle = \sum_{i=1}^n\sum_{j=1}^m v_i M_{ij} w_j.
\]

It follows immediately from Azuma's inequality (Lemma 3.7) that $X_{v,w}$ is $\sigma^2$-subgaussian for every $(v,w)\in T$. On the other hand, note that
\begin{align*}
|X_{v,w}-X_{v',w'}| &= |\langle v, Mw\rangle - \langle v', Mw'\rangle|\\
&\le |\langle v-v', Mw\rangle| + |\langle v', M(w-w')\rangle|\\
&\le \|v-v'\|\,\|M\|\,\|w\| + \|v'\|\,\|M\|\,\|w-w'\|\\
&\le \|M\|\,\{\|v-v'\| + \|w-w'\|\}
\end{align*}
for $(v,w),(v',w')\in T$. If we define a metric on $T$ as
\[
d((v,w),(v',w')) := \|v-v'\| + \|w-w'\|,
\]
we see that the random process $\{X_{v,w}\}_{(v,w)\in T}$ is Lipschitz for the metric $d$.

Note that the random Lipschitz constant happens to be $\|M\|$, which is in fact the quantity we are trying to control in the first place! This is a rather peculiar situation, but we can nonetheless readily apply Lemma 5.7: this yields

\[
\mathbf{E}[\|M\|] \le \varepsilon\,\mathbf{E}[\|M\|] + \sqrt{2\sigma^2\log N(T,d,\varepsilon)} \quad\text{for every } \varepsilon>0,
\]
which we can rearrange to obtain
\[
\mathbf{E}[\|M\|] \le \inf_{0<\varepsilon<1}\frac{\sigma\sqrt{2}}{1-\varepsilon}\sqrt{\log N(T,d,\varepsilon)}.
\]

What remains is to estimate the covering number. To this end, we must introduce an additional idea that will be of significant importance in the sequel.

How can one construct a small $\varepsilon$-net $N$? The defining property of an $\varepsilon$-net is that every point in $T$ is within a distance at most $\varepsilon$ of some point in $N$. We can always achieve this by choosing a very dense set $N$. However, if we want $|N|$ to be small, we should intuitively choose the points in $N$ to be as far apart as possible. This motivates the following definition.

Definition 5.11 ($\varepsilon$-packing and packing number). A set $N\subseteq T$ is called an $\varepsilon$-packing of $(T,d)$ if $d(t,t')>\varepsilon$ for every $t,t'\in N$, $t\ne t'$. The largest cardinality of an $\varepsilon$-packing of $(T,d)$ is called the packing number
\[
D(T,d,\varepsilon) := \sup\{|N| : N \text{ is an } \varepsilon\text{-packing of } (T,d)\}.
\]

The key idea, which was already hinted at above, is that the notion of packing is dual to the notion of covering, as is made precise by the following result. This means that we can use covering and packing interchangeably (up to constants). In some cases it is easier to estimate packing numbers than covering numbers, as we will see shortly. On the other hand, we will see in the following chapter that packing numbers arise naturally when we aim to prove lower bounds for the suprema of random processes (as opposed to upper bounds, which are considered exclusively in this chapter).

Lemma 5.12 (Duality between covering and packing). For every $\varepsilon>0$,
\[
D(T,d,2\varepsilon) \le N(T,d,\varepsilon) \le D(T,d,\varepsilon).
\]

Note that this can indeed be viewed as a form of duality (in the sense of optimization): the packing number is defined in terms of a supremum, but the covering number is defined in terms of an infimum.

Proof. Let $D$ be a $2\varepsilon$-packing and let $N$ be an $\varepsilon$-net. For every $t\in D$, choose $\pi(t)\in N$ such that $d(t,\pi(t))\le\varepsilon$. Then for $t\ne t'$, we have
\[
2\varepsilon < d(t,t') \le d(t,\pi(t)) + d(\pi(t),\pi(t')) + d(\pi(t'),t') \le 2\varepsilon + d(\pi(t),\pi(t')),
\]
which implies $\pi(t)\ne\pi(t')$. Thus $\pi : D\to N$ is one-to-one, and therefore $|D|\le|N|$. This yields the first inequality $D(T,d,2\varepsilon)\le N(T,d,\varepsilon)$.

To obtain the second inequality, let $D$ be a maximal $\varepsilon$-packing of $(T,d)$ (that is, $|D| = D(T,d,\varepsilon)$). We claim that $D$ is necessarily an $\varepsilon$-net. Indeed, suppose this is not the case; then there is a point $t\in T$ such that $d(t,t')>\varepsilon$ for every $t'\in D$. But then $D\cup\{t\}$ must be an $\varepsilon$-packing also, which contradicts the maximality of $D$. Thus we have $D(T,d,\varepsilon) = |D| \ge N(T,d,\varepsilon)$. $\square$

We are now in a position to bound the covering number of the Euclidean ball $B_2^n$ with respect to the Euclidean distance. The proof of this elementary result uses a clever technique known as a volume argument.

Lemma 5.13. We have $N(B_2^n,\|\cdot\|,\varepsilon) = 1$ for $\varepsilon\ge1$ and
\[
\Big(\frac{1}{\varepsilon}\Big)^n \le N(B_2^n,\|\cdot\|,\varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^n \quad\text{for } 0<\varepsilon<1.
\]

Proof. That $N(B_2^n,\|\cdot\|,\varepsilon)=1$ for $\varepsilon\ge1$ is obvious: by definition, we have $\|t\| = \|t-0\| \le 1$ for every $t\in B_2^n$, so the singleton $\{0\}$ is an $\varepsilon$-net.

The main part of the proof is illustrated in the following figure: the colored ball is $B_2^n$. To obtain an upper bound on the covering number, we choose a $2\varepsilon$-packing $D$ of $B_2^n$ (black dots in the left figure). Then the balls of radius $\varepsilon$ around $t\in D$ must be disjoint, and all these balls are contained in a large ball of radius $1+\varepsilon$. As the sum of the volumes of the small balls (of which there are $|D|$) is bounded above by the volume of the large ball, we obtain an upper bound on the size of $D$ (and thus on the covering number by Lemma 5.12).

To obtain a lower bound on the covering number, we choose an $\varepsilon$-net $N$ of $B_2^n$ (black dots in the right figure). As the balls of radius $\varepsilon$ around $t\in N$ cover $B_2^n$, the sum of the volumes of these balls (of which there are $|N|$) is bounded below by the volume of $B_2^n$. This yields a lower bound on the size of $N$.

We now proceed to make this argument precise. Let us begin with the upper bound. Let $D$ be a $2\varepsilon$-packing of $B_2^n$. As $d(t,t')>2\varepsilon$ for all $t\ne t'$ in $D$, the balls $\{B(t,\varepsilon) : t\in D\}$ must be disjoint. On the other hand, every ball $B(t,\varepsilon)$ for $t\in B_2^n$ must be contained in the larger ball $B(0,1+\varepsilon)$. Thus
\[
\sum_{t\in D}\lambda(B(t,\varepsilon)) = \lambda\Big(\bigcup_{t\in D}B(t,\varepsilon)\Big) \le \lambda(B(0,1+\varepsilon)),
\]
where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^n$. By homogeneity of the Lebesgue measure, $\lambda(B(t,\alpha)) = \lambda(B(0,\alpha)) = \lambda(\alpha B(0,1)) = \alpha^n\lambda(B(0,1))$. Thus
\[
|D| \le \frac{\lambda(B(0,1+\varepsilon))}{\lambda(B(0,\varepsilon))} = \Big(\frac{1+\varepsilon}{\varepsilon}\Big)^n.
\]
As this holds for every $2\varepsilon$-packing $D$, we have evidently proved the upper bound $N(T,d,2\varepsilon) \le D(T,d,2\varepsilon) \le (1+1/\varepsilon)^n \le \big(\tfrac{3}{2\varepsilon}\big)^n$ for $2\varepsilon<1$.

To obtain the lower bound, let $N$ be an $\varepsilon$-net for $B_2^n$. Then
\[
\lambda(B_2^n) \le \lambda\Big(\bigcup_{t\in N}B(t,\varepsilon)\Big) \le \sum_{t\in N}\lambda(B(t,\varepsilon)),
\]
so we obtain
\[
|N| \ge \frac{\lambda(B_2^n)}{\lambda(B(0,\varepsilon))} = \Big(\frac{1}{\varepsilon}\Big)^n.
\]
As this holds for every $\varepsilon$-net $N$, we have proved $N(T,d,\varepsilon)\ge(1/\varepsilon)^n$. $\square$
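The maximality argument in the proof of Lemma 5.12 also suggests a simple recipe for building nets: greedily select points that are pairwise more than $\varepsilon$ apart until no more fit. The sketch below (Python with NumPy; the dimension, scale, and sample size are arbitrary choices) applies this to a dense random sample of $B_2^n$ as a stand-in for the whole ball, and compares the resulting net size with the volume bounds of Lemma 5.13.

```python
import numpy as np

rng = np.random.default_rng(1)

def greedy_packing(points, eps):
    """Greedily build a maximal eps-packing of a finite point set; by the
    argument in Lemma 5.12, the result is also an eps-net of that set."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return np.array(net)

# Dense uniform sample from the unit ball B_2^n as a stand-in for the ball.
n, eps, m = 3, 0.5, 20000
pts = rng.normal(size=(m, n))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # project to the sphere
pts *= rng.uniform(size=(m, 1)) ** (1 / n)          # radial part: uniform in ball

net = greedy_packing(pts, eps)
print(f"greedy net size {len(net)}; Lemma 5.13 bounds: "
      f"{(1 / eps) ** n:.0f} <= N <= {(3 / eps) ** n:.0f}")
```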

Remark 5.14. Lemma 5.13 quantifies explicitly the dependence of the covering number on dimension: the number of balls of radius $\varepsilon$ needed to cover a ball in $\mathbb{R}^n$ is polynomial in $1/\varepsilon$ of order $n$. This is not surprising: think of how many cubes of side length $\varepsilon$ can fit into the unit cube in $\mathbb{R}^n$. While balls do not pack as nicely as cubes, the ultimate conclusion is the same (in fact, the conclusion of Lemma 5.13 carries over to any norm on $\mathbb{R}^n$, see Problem 5.5). In this manner, the dependence on dimension will enter explicitly into our estimates of the suprema of random processes.

Beyond the concrete result on covering numbers in $\mathbb{R}^n$, Lemma 5.13 provides a good way to think about the notion of dimension in the first place. The classical idea that $\mathbb{R}^n$ is $n$-dimensional stems from its linear structure: there is a basis of size $n$ such that any vector in $\mathbb{R}^n$ can be written as a linear combination of these basis elements. This linear-algebraic notion of dimension is not very useful in general spaces that need not have any linear structure. Lemma 5.13 motivates a different notion of dimension that makes sense in any metric space: we say that a metric space $(T,d)$ has metric dimension $n$ if $N(T,d,\varepsilon)\sim\varepsilon^{-n}$. Lemma 5.13 shows that for (bounded subsets of) $\mathbb{R}^n$, the linear-algebraic and metric notions of dimension coincide; however, the definition of metric dimension is independent of the linear structure of the space. The notion of metric dimension certainly conforms to the intuitive notion that a high-dimensional space has more “room” than a low-dimensional space (the number of balls of fixed radius needed to cover the space increases exponentially in the dimension). Of course, not every metric space has finite metric dimension: we will shortly encounter an infinite-dimensional space $(T,d)$ for which the covering numbers grow exponentially in $1/\varepsilon$.

Having developed some basic estimates, we can now complete the example of random matrices. Here we are not interested in the covering number of $B_2^n$ itself, but rather in the covering number of $T = B_2^n\times B_2^m$ with respect to the metric $d$. The latter is however easily estimated using Lemma 5.13. Let $N$ be an $\varepsilon$-net for $B_2^n$ and let $M$ be an $\varepsilon$-net for $B_2^m$. Then $N\times M$ is a $2\varepsilon$-net for $T$ of cardinality $|N|\,|M|$: indeed, setting $\pi((t,s)) = (\pi(t),\pi(s))$, we have
\[
d((t,s),\pi((t,s))) = \|t-\pi(t)\| + \|s-\pi(s)\| \le 2\varepsilon.
\]
This evidently implies that
\[
N(T,d,2\varepsilon) \le N(B_2^n,\|\cdot\|,\varepsilon)\,N(B_2^m,\|\cdot\|,\varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^{n+m} \quad\text{for } \varepsilon\le1.
\]
We therefore obtain
\[
\mathbf{E}[\|M\|] \le \inf_{0<\varepsilon<1}\frac{\sigma\sqrt{2}}{1-\varepsilon}\sqrt{\log N(T,d,\varepsilon)} \lesssim \sigma\sqrt{n+m}.
\]
It turns out that this crude bound already captures the correct order of magnitude of the matrix norm! In particular, for square matrices, we obtain $\mathbf{E}[\|M\|]\lesssim\sqrt{n}$ as was already alluded to in Example 2.5.
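As a sanity check on the order $\sigma(\sqrt{n}+\sqrt{m})$, one can simulate Gaussian random matrices (here with $\sigma=1$) and compare the average operator norm with $\sqrt{n}+\sqrt{m}$; this is a numerical illustration, not part of the text's argument.

```python
import numpy as np

rng = np.random.default_rng(2)

# Average operator norm of an n x m matrix with i.i.d. N(0,1) entries,
# compared with sqrt(n) + sqrt(m).
for n, m in [(100, 100), (400, 100), (400, 400)]:
    norms = [np.linalg.norm(rng.normal(size=(n, m)), ord=2)  # top singular value
             for _ in range(20)]
    print(f"n={n:3d}, m={m:3d}: E||M|| ~ {np.mean(norms):6.1f}, "
          f"sqrt(n)+sqrt(m) = {np.sqrt(n) + np.sqrt(m):6.1f}")
```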


We now turn to our second example. Unlike in the previous example, where we got a sharp result with little work, we will not be so lucky here: we will derive a nontrivial bound from Lemma 5.7, but the methods we developed so far will prove to be too crude to capture the correct order of magnitude.

Example 5.15 (Wasserstein law of large numbers). Let $X_1,X_2,\ldots$ be i.i.d. random variables with values in the interval $[0,1]$. We denote their distribution as $X_i\sim\mu$. Define the empirical measure of $X_1,\ldots,X_n$ as
\[
\mu_n := \frac{1}{n}\sum_{k=1}^n \delta_{X_k}.
\]
Then it is easy to estimate
\[
\mathbf{E}|\mu_n f - \mu f| \le \mathbf{E}[|\mu_n f - \mu f|^2]^{1/2} \le \frac{\|f\|}{\sqrt{n}}.
\]

In particular, we have $\mu_n f\to\mu f$ in $L^1$ for every bounded function $f$: this is none other than the weak law of large numbers with the optimal $n^{-1/2}$ rate.

At what rate does the law of large numbers $\mu_n\to\mu$ hold when we consider other notions of distance between probability measures? In this spirit, we will presently attempt to estimate the expected Wasserstein distance $\mathbf{E}[W_1(\mu_n,\mu)]$ between the empirical measure and the underlying distribution. Recall that
\[
W_1(\mu_n,\mu) = \sup_{f\in\mathrm{Lip}([0,1])}\{\mu_n f - \mu f\} = \sup_{f\in\mathcal{F}}X_f,
\]
where we have defined
\[
X_f := \mu_n f - \mu f, \qquad \mathcal{F} := \{f\in\mathrm{Lip}([0,1]) : 0\le f\le1\}.
\]
Thus this question reduces to controlling the supremum of a random process. (Note that $|f(x)-f(y)|\le|x-y|\le1$ for $f\in\mathrm{Lip}([0,1])$ and $x,y\in[0,1]$; as $X_f$ is invariant under adding a constant to $f$, there is no loss of generality in restricting the supremum to functions $0\le f\le1$ in the definition of $W_1$.)

We begin by noting the trivial estimate
\[
|X_f - X_g| = |\mu_n(f-g) - \mu(f-g)| \le 2\|f-g\|.
\]

Thus the process $\{X_f\}_{f\in\mathcal{F}}$ is Lipschitz with respect to the uniform distance on $\mathcal{F}$. On the other hand, note that by definition
\[
X_f = \sum_{k=1}^n \frac{f(X_k)-\mu f}{n},
\]
which is a sum of i.i.d. random variables with values in the interval $[-\frac{1}{n},\frac{1}{n}]$. Thus $X_f$ is $\frac{1}{n}$-subgaussian for every $f\in\mathcal{F}$ by the Azuma-Hoeffding inequality (Lemma 3.6). We can therefore estimate using Lemma 5.7

\[
\mathbf{E}[W_1(\mu_n,\mu)] \le \inf_{\varepsilon>0}\Big\{2\varepsilon + \sqrt{\frac{2}{n}\log N(\mathcal{F},\|\cdot\|,\varepsilon)}\Big\}.
\]
To proceed, we must bound the covering number $N(\mathcal{F},\|\cdot\|,\varepsilon)$.

Lemma 5.16. There is a constant $c<\infty$ such that
\[
N(\mathcal{F},\|\cdot\|,\varepsilon) \le e^{c/\varepsilon} \quad\text{for } \varepsilon<\tfrac12, \qquad N(\mathcal{F},\|\cdot\|,\varepsilon) = 1 \quad\text{for } \varepsilon\ge\tfrac12.
\]

Remark 5.17. Note that, unlike in the case of a Euclidean ball where the covering number is polynomial in $1/\varepsilon$, the covering number of the family $\mathcal{F}$ of Lipschitz functions is exponential in $1/\varepsilon$. This indicates that the metric space $(\mathcal{F},\|\cdot\|)$ is in fact infinite-dimensional, which is not too surprising.

Proof. Fix $\varepsilon>0$. For every function $f\in\mathcal{F}$, we will construct a new function $\pi(f)$ in the manner illustrated in the following picture:

[Figure: a Lipschitz function $f$ on $[0,1]$, overlaid with its piecewise-constant approximation $\pi(f)$ on a horizontal grid of width $\varepsilon/2$ (marks $0,\varepsilon/2,\varepsilon,2\varepsilon,\ldots,1$) and a vertical grid of height $\varepsilon$.]

To be precise, we approximate $f : [0,1]\to[0,1]$ by $\pi(f) : [0,1]\to[0,1]$ defined as follows. Partition the horizontal axis into consecutive nonoverlapping intervals $I_1,\ldots,I_{\lceil 2/\varepsilon\rceil}$ of size $\varepsilon/2$ and the vertical axis into consecutive nonoverlapping intervals $J_1,\ldots,J_{\lceil 1/\varepsilon\rceil}$ of size $\varepsilon$. We then define
\[
\pi(f)(x) = \frac{\max J_\ell + \min J_\ell}{2} \quad\text{whenever } x\in I_k,\ f(\min I_k)\in J_\ell.
\]
That is, in each interval on the horizontal axis, we approximate $f$ by its value at the left endpoint of the interval, rounded to the center of the interval on the vertical axis to which it belongs. By construction, the set $N = \{\pi(f) : f\in\mathcal{F}\}$ is an $\varepsilon$-net: indeed, note that whenever $x\in I_k$ and $f(\min I_k)\in J_\ell$, we have

\begin{align*}
|f(x)-\pi(f)(x)| &\le |f(x)-f(\min I_k)| + \Big|f(\min I_k) - \frac{\max J_\ell + \min J_\ell}{2}\Big|\\
&\le |x-\min I_k| + \frac{\max J_\ell - \min J_\ell}{2} \le \varepsilon,
\end{align*}
where we have used the Lipschitz property of $f$ and the definition of $I_k, J_\ell$. (Note that $N\not\subseteq\mathcal{F}$: but this is not a problem, cf. Remark 5.6.)

As we now have an $\varepsilon$-net $N$, it remains to estimate $|N|$. The most naive bound would be $|N|\le\lceil1/\varepsilon\rceil^{\lceil2/\varepsilon\rceil}<\infty$, but we can do somewhat better by taking into account the Lipschitz property of the functions in $\mathcal{F}$. Note that
\[
|\pi(f)(\min I_k) - \pi(f)(\min I_{k+1})| \le |f(\min I_k) - f(\min I_{k+1})| + \varepsilon \le \tfrac32\varepsilon.
\]
As the possible values of $\pi(f)$ can only differ by multiples of $\varepsilon$, this implies that $\pi(f)(\min I_{k+1}) - \pi(f)(\min I_k)\in\{-\varepsilon,0,\varepsilon\}$. Thus $\pi(f)(0)$ can take any of $\lceil1/\varepsilon\rceil$ different values, but each subsequent interval can only differ from the previous one in three different ways. This implies the bound
\[
N(\mathcal{F},\|\cdot\|,\varepsilon) \le |N| \le \lceil1/\varepsilon\rceil\,3^{\lceil2/\varepsilon\rceil-1} \le e^{c/\varepsilon}
\]
for some constant $c$ and every $\varepsilon>0$. On the other hand, as $\|f-\tfrac12\|\le\tfrac12$ for every $f\in\mathcal{F}$, we clearly have $N(\mathcal{F},\|\cdot\|,\varepsilon)=1$ for $\varepsilon\ge\tfrac12$. $\square$
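The net map $f\mapsto\pi(f)$ from the proof is easy to implement and test. The following sketch (Python with NumPy; the test function and tolerance are arbitrary choices) builds $\pi(f)$ on the $\varepsilon/2$-by-$\varepsilon$ grid and verifies $\|f-\pi(f)\|\le\varepsilon$.

```python
import numpy as np

def pi(f, eps):
    """The net map f -> pi(f) from the proof of Lemma 5.16: round the value
    of f at the left endpoint of each horizontal cell I_k (width eps/2) to
    the center of the vertical cell J_l (height eps) containing it."""
    n_cells = int(np.ceil(2 / eps))
    left = np.arange(n_cells) * (eps / 2)               # left endpoints of I_k
    lvl = np.floor(np.minimum(f(left), 1 - 1e-12) / eps)  # index of J_l
    centers = (lvl + 0.5) * eps                         # midpoints of the J_l
    def pf(x):
        k = np.minimum((np.asarray(x) // (eps / 2)).astype(int), n_cells - 1)
        return centers[k]
    return pf

# Sanity check on a 1-Lipschitz test function with values in [0, 1].
f = lambda x: 0.5 + 0.5 * np.sin(x)
eps = 0.1
pf = pi(f, eps)
x = np.linspace(0, 1, 10001)
print("sup|f - pi(f)| =", float(np.abs(f(x) - pf(x)).max()), "<= eps =", eps)
```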

Having estimated the covering numbers of $\mathcal{F}$, we can now readily complete our bound on the convergence rate in the Wasserstein law of large numbers:
\[
\mathbf{E}[W_1(\mu_n,\mu)] \le \inf_{\varepsilon>0}\Big\{2\varepsilon + \sqrt{\frac{2c}{\varepsilon n}}\Big\} \lesssim n^{-1/3}.
\]
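Whether this $n^{-1/3}$ rate is sharp can be probed numerically. In one dimension one has the classical identity $W_1(\mu_n,\mu) = \int_0^1|F_n(x)-F(x)|\,dx$ for the CDFs $F_n, F$; the following sketch (Python with NumPy, with $\mu$ taken to be uniform on $[0,1]$ as an arbitrary test case) estimates $\mathbf{E}[W_1(\mu_n,\mu)]$ and compares it with $n^{-1/2}$ and $n^{-1/3}$.

```python
import numpy as np

rng = np.random.default_rng(3)

def w1_to_uniform(sample, grid_size=20001):
    """W_1(mu_n, mu) for mu = Uniform[0,1], using the one-dimensional
    identity W_1 = integral over [0,1] of |F_n(x) - F(x)| dx."""
    x = np.sort(sample)
    grid = np.linspace(0, 1, grid_size)
    Fn = np.searchsorted(x, grid, side="right") / len(x)  # empirical CDF
    return np.mean(np.abs(Fn - grid))   # Riemann sum of |F_n - F| on [0,1]

for n in [100, 1000, 10000]:
    w = np.mean([w1_to_uniform(rng.uniform(size=n)) for _ in range(50)])
    print(f"n={n:6d}: E[W1] ~ {w:.4f}   n^(-1/2)={n**-0.5:.4f}   "
          f"n^(-1/3)={n**(-1/3):.4f}")
```

The observed decay tracks $n^{-1/2}$ rather than the $n^{-1/3}$ bound, consistent with the discussion that follows.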

Recall that the rate of convergence in the law of large numbers for a single function is $\mathbf{E}|\mu_n f - \mu f| \lesssim n^{-1/2}$, but we have obtained a slower rate $n^{-1/3}$ when we consider the convergence uniformly over Lipschitz functions. Is this rate sharp? It turns out that this is not the case: in the present example, we will show in the next section that the optimal rate is actually still $\sim n^{-1/2}$.

Remark 5.18. There is no reason to expect, in general, that the rate of convergence uniformly over a class of functions will be the same as that for a single function. The fact that the rate still turns out to be $n^{-1/2}$ in the present setting is an artefact of the fact that we are working in one dimension: for random variables $X_k\in[0,1]^p$ for $p\ge2$, the optimal rates turn out to be strictly slower than $n^{-1/2}$. Nonetheless, even in this case, the method we have used in this section does not capture the correct rate of convergence.

The method that we have used in this section to control the suprema of random processes is too crude to obtain sharp results in most examples of interest. While we obtained a sharp result in the random matrix example, this was not the case for the Wasserstein law of large numbers. Unfortunately, the situation encountered in the second example is the norm. It is illuminating to understand in what part of the proof we incurred the loss of precision: this will directly motivate the more powerful approach for bounding the suprema of random processes that will be developed in the next section.

The approach of Lemma 5.7 relies on two steps: the approximation of the
