
Part I Concentration

5.3 The chaining method

In the previous section, we developed a simple method to bound the supremum of a random process that satisfies the Lipschitz property X_t − X_s ≲ d(t, s) in an almost sure sense. However, we have seen that this requirement is very restrictive: in many cases, the typical size of the increments X_t − X_s is much smaller than in the worst case. We therefore aim to develop a method to bound the suprema of random processes that only requires the Lipschitz property X_t − X_s ≲ d(t, s) to hold in probability in a suitable sense.

To understand how one might approach this problem, let us recall the basic idea behind the proof of Lemma 5.7. If N is an ε-net and we choose for each t a point π(t) ∈ N with d(t, π(t)) ≤ ε, we can estimate

E[sup_{t∈T} X_t] ≤ E[max_{s∈N} X_s] + E[sup_{t∈T} {X_t − X_{π(t)}}].

The first term is a finite maximum that can be controlled by the maximal inequality of Lemma 5.1. The second term is a small remainder: each variable inside the supremum has magnitude of order ε by the Lipschitz property of the process. If the Lipschitz property holds in an almost sure sense, the supremum drops out and we can immediately control the remainder term.

However, if the Lipschitz property only holds in probability, we cannot directly control the remainder term. Indeed, in this case each variable inside the supremum has "typical" size ε; however, we have to control the supremum of many such variables, whose magnitude can be much larger than ε (e.g., the maximum of n independent N(0, σ²) variables is of order σ√(log n), even though each variable is only of order σ). Therefore, in this case, the problem of controlling the remainder term is essentially of the same type as that of controlling the original supremum of interest. Nonetheless, we expect that the remainder term is smaller than the original supremum, as the size of each variable in the remainder term is now smaller. To shrink the remainder term further, we can approximate it once again by a finite maximum at a smaller scale. For example, if N′ is an ε/2-net and π′(t) ∈ N′ satisfies d(t, π′(t)) ≤ ε/2, then we can estimate

E[sup_{t∈T} {X_t − X_{π(t)}}] ≤ E[max_{t∈T} {X_{π′(t)} − X_{π(t)}}] + E[sup_{t∈T} {X_t − X_{π′(t)}}].

The first term on the right is a finite maximum that can be controlled by Lemma 5.1. The remainder term is still an infinite supremum, but now each variable inside the supremum is only of order ε/2: that is, we have cut the remainder term roughly by half. The key idea of this section is that we can repeat this procedure over and over again, each time cutting the size of the remainder term roughly by half. Let us investigate this idea a bit more systematically. For each k ≥ 0, let N_k be a 2^{−k}-net and choose π_k(t) ∈ N_k such that d(t, π_k(t)) ≤ 2^{−k}. Repeating the approximation n times, we obtain

E[sup_{t∈T} X_t] ≤ E[max_{t∈T} X_{π_0(t)}] + Σ_{k=1}^{n} E[max_{t∈T} {X_{π_k(t)} − X_{π_{k−1}(t)}}] + E[sup_{t∈T} {X_t − X_{π_n(t)}}].

The remainder term is now a supremum of variables of order 2^{−n}. Under mild conditions, the remainder term will disappear if we let n → ∞ without having to invoke any almost sure Lipschitz property of the process. Thus we surmount the inefficiency of Lemma 5.7 by approximating the supremum not at a single scale, but at infinitely many scales. The remaining bound is now an infinite sum: the kth term in the sum is a finite maximum of random variables at the scale 2^{−k}. To control these finite maxima, we also do not require an almost sure Lipschitz property: in view of Lemma 5.1, it suffices to assume that the Lipschitz property holds "in probability" in the following sense.
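
To see concretely why a finite maximum of variables of size ε can be of order ε√(log n) rather than ε, here is a minimal numerical sketch (not from the text; the sample sizes, repetition count, and seed are arbitrary choices) comparing the empirical mean of the maximum of n independent N(0, σ²) variables with the benchmark σ√(2 log n) from Lemma 5.1.

```python
import numpy as np

# Empirically compare E[max of n i.i.d. N(0, sigma^2)] with sigma * sqrt(2 log n),
# the bound given by the maximal inequality (Lemma 5.1).
rng = np.random.default_rng(0)
sigma = 1.0
for n in [10, 100, 1000, 10000]:
    samples = rng.normal(0.0, sigma, size=(2000, n))   # 2000 Monte Carlo repetitions
    emp_max = samples.max(axis=1).mean()               # empirical E[max_i X_i]
    bound = sigma * np.sqrt(2 * np.log(n))             # maximal-inequality benchmark
    print(f"n={n:6d}   E[max] ~ {emp_max:.3f}   sigma*sqrt(2 log n) = {bound:.3f}")
```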

Definition 5.20 (Subgaussian process). A random process {X_t}_{t∈T} on the metric space (T, d) is called subgaussian if E[X_t] = 0 and

E[e^{λ(X_t−X_s)}] ≤ e^{λ² d(t,s)²/2}   for all t, s ∈ T, λ ≥ 0.

Remark 5.21. The subgaussian property should indeed be interpreted as an "in probability" form of the Lipschitz property: by Problem 3.1, the subgaussian assumption is equivalent up to constants to an assumption of the form

P[|X_t − X_s| ≥ x d(t, s)] ≤ C e^{−x²/C}.

Note also that the assumption E[e^{λ(X_t−X_s)}] ≤ e^{λ² d(t,s)²/2} already implies E[X_t − X_s] = 0 (as lim_{λ↓0} {e^{λ² d(t,s)²/2} − 1}/λ = 0), so the assumption E[X_t] = 0 merely imposes a convenient normalization. In Section 5.4, we will see how to control the suprema of random processes with nontrivial mean t ↦ E[X_t].
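
As a sanity check of Definition 5.20 (a standard observation, not part of the text): a centered Gaussian process is subgaussian with respect to its canonical metric d(t, s) = (E[(X_t − X_s)²])^{1/2}, with equality in the defining bound.

```latex
% X_t - X_s is a centered Gaussian variable with variance d(t,s)^2, so its
% moment generating function is known exactly:
\mathbf{E}\bigl[e^{\lambda(X_t - X_s)}\bigr]
  = e^{\lambda^2 \mathbf{E}[(X_t - X_s)^2]/2}
  = e^{\lambda^2 d(t,s)^2/2},
\qquad d(t,s) := \bigl(\mathbf{E}[(X_t - X_s)^2]\bigr)^{1/2}.
```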

The technique that we have outlined above is known as chaining: the idea is to approximate X_t by a "chain" X_{π_k(t)} of increasingly accurate approximations (the "links" in the chain are the increments X_{π_k(t)} − X_{π_{k−1}(t)}). The main remaining difficulty in implementing the method is to show that the remainder term does indeed vanish as n → ∞. To get around this, we will impose a very mild technical assumption that holds in almost all cases of interest.

Definition 5.22 (Separable process). A random process {X_t}_{t∈T} is called separable if there is a countable set T_0 ⊆ T such that

X_t ∈ lim_{s→t, s∈T_0} X_s   for all t ∈ T a.s.

[Here x ∈ lim_{s→t} x_s means that there is a sequence s_n → t such that x_{s_n} → x.]

Remark 5.23. The assumption of separability is technical, and is almost always trivially satisfied. For example, if t ↦ X_t is continuous a.s., we can take T_0 to be any countable dense subset of T. At the same time, the separability assumption is in some sense intrinsic to the chaining argument. After all, the main idea of the chaining argument is to approximate X_t = lim_{k→∞} X_{π_k(t)} for every t ∈ T. If this is in fact valid, however, then the definition of a separable process will hold for the countable set T_0 = {π_k(t) : k ≥ 0, t ∈ T}.

For completeness, let us note a somewhat esoteric point that we swept under the rug. If T is uncountable, sup_{t∈T} X_t is the supremum of an uncountable family of random variables. In general, the supremum of uncountably many measurable functions is not even necessarily measurable. Measurability issues do arise, on occasion, in the control of suprema, but we will shamelessly ignore such problems in these notes. Under the separability assumption, however, sup_{t∈T} X_t = sup_{t∈T_0} X_t a.s., and thus no measurability problems arise (as a countable supremum of measurable functions is always measurable).

We now have all the ingredients to implement the chaining argument.

Theorem 5.24 (Dudley). Let {X_t}_{t∈T} be a separable subgaussian process on the metric space (T, d). Then we have the following estimate:

E[sup_{t∈T} X_t] ≤ Σ_{k∈ℤ} 6 · 2^{−k} √(log N(T, d, 2^{−k})).

Proof. We first prove the result in the finite case |T| < ∞, which allows us to easily eliminate the remainder term in the chaining argument. We subsequently use the separability assumption to lift this restriction.

Let |T| < ∞. Let k_0 be the largest integer such that 2^{−k_0} ≥ diam(T). Then any singleton N_{k_0} = {t_0} is trivially a 2^{−k_0}-net. We therefore start chaining at the scale 2^{−k_0}. For k > k_0, let N_k be a 2^{−k}-net such that |N_k| = N(T, d, 2^{−k}).

Running the chaining argument up to the scale 2^{−n} yields

E[sup_{t∈T} X_t] ≤ E[X_{t_0}] + Σ_{k=k_0+1}^{n} E[max_{t∈T} {X_{π_k(t)} − X_{π_{k−1}(t)}}] + E[sup_{t∈T} {X_t − X_{π_n(t)}}].

Let us consider each of the terms. As E[X_{t_0}] = 0 by assumption, the first term disappears. Moreover, as |T| < ∞, we can choose n sufficiently large so that N_n = T. Then the last term disappears. To control the terms inside the sum, note that the maximum in the kth term contains at most |N_k||N_{k−1}| ≤ |N_k|² terms (as |N_{k−1}| ≤ |N_k|). Moreover, we can readily estimate

d(π_k(t), π_{k−1}(t)) ≤ d(t, π_k(t)) + d(t, π_{k−1}(t)) ≤ 3 · 2^{−k}.

As X_{π_k(t)} − X_{π_{k−1}(t)} is d(π_k(t), π_{k−1}(t))²-subgaussian, Lemma 5.1 yields

E[max_{t∈T} {X_{π_k(t)} − X_{π_{k−1}(t)}}] ≤ 3 · 2^{−k} √(2 log(|N_k||N_{k−1}|)) ≤ 6 · 2^{−k} √(log N(T, d, 2^{−k})).

Summing over k_0 < k ≤ n yields the statement of the theorem in the case |T| < ∞.

In the proof we have used the assumption |T| < ∞ to control the remainder term in the chaining argument. We now use separability to show that one can approximate the general case by the finite case. Indeed, by separability, there is a countable subset T_0 ⊆ T such that sup_{t∈T} X_t = sup_{t∈T_0} X_t a.s. Denote by T_k the first k elements of T_0 (in arbitrary order). Then

E[sup_{t∈T} X_t] = E[sup_{t∈T_0} X_t] = lim_{k→∞} E[max_{t∈T_k} X_t]

by monotone convergence. Applying the chaining inequality to each finite maximum and using N(T_k, d, ε) ≤ N(T, d, ε) yields the general result. ⊓⊔

Very often the result of Theorem 5.24 is written in a slightly different form by noting that the sum can be viewed as a Riemann sum approximation to a certain integral. There is no particular mathematical significance to this reformulation: it is made for purely aesthetic reasons.

Corollary 5.25 (Entropy integral). Let {X_t}_{t∈T} be a separable subgaussian process on the metric space (T, d). Then we have the following estimate:

E[sup_{t∈T} X_t] ≤ 12 ∫_0^∞ √(log N(T, d, ε)) dε.

Proof. We can readily estimate

Σ_{k∈ℤ} 6 · 2^{−k} √(log N(T, d, 2^{−k})) = 12 Σ_{k∈ℤ} ∫_{2^{−k−1}}^{2^{−k}} √(log N(T, d, 2^{−k})) dε ≤ 12 ∫_0^∞ √(log N(T, d, ε)) dε,

where we have used that ε ↦ N(T, d, ε) is nonincreasing. The result therefore follows from Theorem 5.24. ⊓⊔

Remark 5.26. As N(T, d, ε) = 1 when ε ≥ diam(T), it suffices to take the integral in Corollary 5.25 only up to ε = diam(T).

Remark 5.27. The logarithm of the covering number log N(T, d, ε) is often called metric entropy in analogy with information theory: it measures the number of bits needed to specify an element of T up to precision ε. It is customary to refer to the integral in Corollary 5.25 as the entropy integral.
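
As a computational illustration of the entropy sum in Theorem 5.24 (a minimal sketch, not from the text: the point cloud is an arbitrary choice, and the greedy construction only produces an upper bound on the covering numbers), one can estimate covering numbers of a finite point set at dyadic scales and evaluate the resulting chaining bound.

```python
import numpy as np

def greedy_net(points, eps):
    """Greedily pick centers so that every point lies within eps of some center.
    The size of the resulting net upper-bounds the covering number N(T, d, eps)."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - c) > eps for c in net):
            net.append(x)
    return net

def chaining_bound(points, n_scales=12):
    """Evaluate sum_{k > k0} 6 * 2^{-k} * sqrt(log N(T, d, 2^{-k})) as in Theorem 5.24,
    truncated after n_scales dyadic scales (the remaining tail decays geometrically)."""
    diam = max(np.linalg.norm(x - y) for x in points for y in points)
    k0 = int(np.floor(-np.log2(diam)))           # largest k with 2^{-k} >= diam(T)
    total = 0.0
    for k in range(k0 + 1, k0 + 1 + n_scales):
        n_cover = len(greedy_net(points, 2.0 ** (-k)))
        total += 6 * 2.0 ** (-k) * np.sqrt(np.log(n_cover))
    return total

rng = np.random.default_rng(1)
T = rng.uniform(0.0, 1.0, size=(200, 2))         # 200 random points in the unit square
print("entropy-sum chaining bound:", chaining_bound(T))
```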

To illustrate Corollary 5.25, let us revisit Example 5.15.

Example 5.28 (Wasserstein law of large numbers revisited). We adopt the same setting and notation as in Example 5.15. Recall that we want to estimate the expected Wasserstein distance between the empirical and true measures

W₁(µ_n, µ) = sup_{f∈F} X_f,

where X_1, X_2, . . . are i.i.d. variables in [0, 1] with distribution µ and

X_f = (1/n) Σ_{k=1}^{n} {f(X_k) − µf},   F = {f ∈ Lip([0, 1]) : 0 ≤ f ≤ 1}.

By the Azuma–Hoeffding inequality (Corollary 3.9), we have

E[e^{λ(X_f−X_g)}] ≤ e^{λ² ‖f−g‖_∞²/2n}.

The process {X_f}_{f∈F} is therefore subgaussian with respect to the metric d(f, g) = n^{−1/2} ‖f−g‖_∞. We can consequently estimate using Corollary 5.25

E[W₁(µ_n, µ)] ≤ 12 ∫_0^∞ √(log N(F, n^{−1/2}‖·‖_∞, ε)) dε.

But it is easily seen that

N(F, n^{−1/2}‖·‖_∞, ε) = N(F, ‖·‖_∞, n^{1/2}ε),

so that changing variables in the integral and using Lemma 5.16 yields

E[W₁(µ_n, µ)] ≤ (12/√n) ∫_0^∞ √(log N(F, ‖·‖_∞, ε)) dε ≤ (12/√n) ∫_0^{1/2} √(c/ε) dε.

As ε^{−1/2} is integrable at the origin, we have proved

E[W₁(µ_n, µ)] ≲ n^{−1/2},

which is a huge improvement over the n^{−1/3} rate obtained by the crude method used in Example 5.15. It is evident from the above computations that the crucial improvement is due to the fact that |X_f − X_g| ≲ n^{−1/2} ‖f−g‖_∞ in probability (as is made precise by the subgaussian property), while the best almost sure Lipschitz bound one can hope for is |X_f − X_g| ≲ ‖f−g‖_∞.

In the present example, it is rather easy to obtain a matching lower bound on the Wasserstein distance. Indeed, note that for any function f ∈ F that is not constant µ-a.s., we obtain by the central limit theorem

E[W₁(µ_n, µ)] ≥ E[X_f ∨ X_{1−f}] = E|X_f| ∼ n^{−1/2}.

Thus the rate we obtained by chaining is sharp in the present setting.
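
The n^{−1/2} rate is easy to check by simulation. Here is a minimal Monte Carlo sketch (not from the text; the choice µ = Uniform[0, 1], the sample sizes, the repetition count, and the grid resolution are arbitrary), using the one-dimensional identity W₁(µ_n, µ) = ∫_0^1 |F_n(x) − F(x)| dx.

```python
import numpy as np

def w1_vs_uniform(samples, grid_size=5000):
    """W1(mu_n, mu) for mu = Uniform[0,1], via W1 = int_0^1 |F_n(x) - x| dx on a grid."""
    x = np.linspace(0.0, 1.0, grid_size)
    F_n = np.searchsorted(np.sort(samples), x, side="right") / len(samples)  # empirical CDF
    return np.abs(F_n - x).mean()                 # grid average approximates the integral

rng = np.random.default_rng(2)
for n in [100, 1000, 10000]:
    est = np.mean([w1_vs_uniform(rng.uniform(size=n)) for _ in range(200)])
    print(f"n={n:6d}   E[W1(mu_n, mu)] ~ {est:.4f}   n^(-1/2) = {n ** -0.5:.4f}")
```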

Now that we understand the chaining principle, we can use it to obtain more sophisticated results. For example, just as we could obtain a tail bound in Lemma 5.2 corresponding to the maximal inequality of Lemma 5.1, we can obtain a tail bound counterpart to Corollary 5.25.

Theorem 5.29 (Chaining tail inequality). Let {X_t}_{t∈T} be a separable subgaussian process on the metric space (T, d). Then for all t_0 ∈ T and x ≥ 0

P[ sup_{t∈T} {X_t − X_{t_0}} ≥ C ∫_0^∞ √(log N(T, d, ε)) dε + x ] ≤ C e^{−x²/C diam(T)²},

where C < ∞ is a universal constant.

Proof. The beginning of the proof is identical to that of Theorem 5.24, and we adopt the notation used there. As in Theorem 5.24, it is easily seen that it suffices to consider |T| < ∞, as we will assume in the remainder of the proof.

The idea here is to run the chaining argument without taking the expectation. As |T| < ∞, we have π_n(t) = t for n sufficiently large. Thus

X_t − X_{t_0} = Σ_{k>k_0} {X_{π_k(t)} − X_{π_{k−1}(t)}}

by the telescoping property of the sum. This elementary chaining identity lies at the heart of the chaining argument. We immediately obtain

sup_{t∈T} {X_t − X_{t_0}} ≤ Σ_{k>k_0} max_{t∈T} {X_{π_k(t)} − X_{π_{k−1}(t)}}.

Rather than bounding the expectation of this quantity, as we did in Theorem 5.24, we will bound the tail behavior of every term in this sum. To this end, note that the subgaussian property of {X_t}_{t∈T} and Lemma 5.2 yield, for every z ≥ 0,

P[ max_{t∈T} {X_{π_k(t)} − X_{π_{k−1}(t)}} ≥ 6 · 2^{−k} √(log N(T, d, 2^{−k})) + 3 · 2^{−k} z ] ≤ e^{−z²/2}.

We would like to show that all links at every scale are small simultaneously, that is, that the probability of the union over all k of the events in the above bound is small. We can use a crude union bound to control the latter probability, but it is clear that we must then choose z to be increasing in k in such a way that the probabilities of the individual events are summable. Choosing z at the scale 2^{−k} to grow with k − k_0, plus a fixed deviation parameter, the probabilities in the union bound sum to a convergent series, while the additional contribution to the chaining sum is of order 2^{−k_0} ≤ 2 diam(T) times the deviation parameter. Collecting these estimates, the stated tail bound follows, and the proof is readily completed. ⊓⊔

Remark 5.30. Note that the result of Theorem 5.29 is reminiscent of a concentration inequality. Indeed, if we could establish the concentration inequality

P[ sup_{t∈T} X_t ≥ E[sup_{t∈T} X_t] + x ] ≤ C e^{−x²/C diam(T)²},

then the conclusion of Theorem 5.29 would follow directly by combining this inequality with the chaining bound of Corollary 5.25 for the expected supremum. Despite the similarities, however, Theorem 5.29 should not be confused with a concentration inequality. Its conclusion is both weaker and stronger: weaker, because Theorem 5.29 cannot establish a deviation inequality from the mean, but only from a particular upper bound on the mean; stronger, because the subgaussian assumption of Theorem 5.29 is much weaker than would be required to establish a concentration inequality.

The proof of Theorem 5.29 suggests that at its core, the chaining method boils down to simultaneously controlling, using a union bound, the magnitude of all the links X_{π_k(t)} − X_{π_{k−1}(t)} in the chaining identity. We might therefore expect that chaining yields sharp results if the links {X_{π_k(t)} − X_{π_{k−1}(t)}}_{t∈T, k>k_0} are "nearly independent" in some sense. This is not entirely implausible, as two links are either far apart or are at a different scale. It turns out that the chaining method that we have developed here yields sharp results in many cases, but falls short in others. In the next chapter, we will see that the chaining method can be further improved to adapt to the structure of the set T. The resulting method, called the generic chaining, is so efficient that it captures exactly (up to universal constants) the magnitude of the supremum of Gaussian processes! Once this has been understood, we can truly conclude that chaining is the "correct" way to think about the suprema of random processes. Nonetheless, considering that we have ultimately used no idea more sophisticated than the union bound, the remarkably far-reaching power of the chaining method remains somewhat of a miracle to this author.

Problems

5.9 (The entropy integral and sum). Show that

(1/2) Σ_{k∈ℤ} 2^{−k} √(log N(T, d, 2^{−k})) ≤ ∫_0^∞ √(log N(T, d, ε)) dε ≤ Σ_{k∈ℤ} 2^{−k} √(log N(T, d, 2^{−k})).

Thus nothing is lost in expressing the chaining bound as an integral rather than a sum, as we have done in Corollary 5.25, up to a constant factor.
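
One way to see such a comparison (a sketch under the constants chosen in the display above; the intended constants of the original exercise may differ) is to bound the integrand on each dyadic interval by the covering numbers at its endpoints.

```latex
% For eps in [2^{-k-1}, 2^{-k}] the covering number satisfies
% N(T,d,2^{-k}) <= N(T,d,eps) <= N(T,d,2^{-k-1}), hence
2^{-k-1}\sqrt{\log N(T,d,2^{-k})}
  \le \int_{2^{-k-1}}^{2^{-k}} \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon
  \le 2^{-k-1}\sqrt{\log N(T,d,2^{-k-1})},
% and summing over k in Z gives the two-sided comparison between the
% entropy integral and the dyadic entropy sum.
```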

5.10 (Chaining with arbitrary tails). The chaining method is not restricted to subgaussian processes: it can be developed analogously for processes that are Lipschitz "in probability" in a more general sense.

Let {X_t}_{t∈T} be a separable process with E[X_t] = 0 and

5.11 (An improved chaining bound and Wasserstein LLN). The key improvement of the chaining bound of Corollary 5.25 over the crude approximation of Lemma 5.7 is that the former uses only an in probability Lipschitz property, while the latter uses a stronger almost sure Lipschitz property. These two ideas are not mutually exclusive, however: when the process {X_t}_{t∈T} satisfies both types of Lipschitz property, we can obtain an improved chaining bound that is a sort of hybrid between Corollary 5.25 and Lemma 5.7.

a. Prove the following theorem.

Theorem 5.31 (Improved chaining). Let {X_t}_{t∈T} be a separable process that is both subgaussian (Definition 5.20) and almost surely Lipschitz (Definition 5.4). Then we have the following estimate:

E[sup_{t∈T} X_t] ≲ inf_{δ>0} { δ + ∫_δ^∞ √(log N(T, d, ε)) dε }.

Hint: run the chaining argument only up to scale 2^{−n} and use the almost sure Lipschitz property to estimate the remainder term.
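
The following sketch spells out the hint (assuming, for concreteness, that the almost sure Lipschitz property of Definition 5.4 takes the form |X_t − X_s| ≤ C d(t, s) a.s. for some constant C; the precise normalization in Definition 5.4 may differ).

```latex
% Stop the chain at scale 2^{-n}: as in the proof of Theorem 5.24,
\mathbf{E}\Bigl[\sup_{t\in T} X_t\Bigr]
  \le \sum_{k_0 < k \le n} 6\cdot 2^{-k}\sqrt{\log N(T,d,2^{-k})}
      + \mathbf{E}\Bigl[\sup_{t\in T}\{X_t - X_{\pi_n(t)}\}\Bigr],
% and the remainder is bounded almost surely via the Lipschitz property:
\sup_{t\in T}\{X_t - X_{\pi_n(t)}\} \le C\,\sup_{t\in T} d(t,\pi_n(t)) \le C\, 2^{-n}
\quad \text{a.s.}
% Optimizing over n (equivalently, over the cutoff scale) yields Theorem 5.31.
```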

To understand the advantage of Theorem 5.31, we first note the following.

b. Show that N(T, d, ε) diverges as ε ↓ 0 whenever |T| = ∞.

As the covering number diverges, a nontrivial application of Corollary 5.25 requires that this divergence is sufficiently slow that √(log N(T, d, ε)) is integrable at zero. This is not always the case. On the other hand, Lemma 5.7 would give a nontrivial bound even when the covering number is not integrable, but the use of the almost sure Lipschitz property yields a very pessimistic bound.

Theorem 5.31 provides the best of both worlds: it uses the "in probability" Lipschitz property as much as possible, while using the almost sure Lipschitz property to cut off the divergent part of the integral.
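
To see the cutoff mechanism at work, here is a minimal numeric sketch (not from the text: it assumes, purely for illustration, a covering-number growth log N(ε) ≈ ε^{−d}, drops all constants, and minimizes the schematic objective δ + n^{−1/2} ∫_δ^1 ε^{−d/2} dε over the cutoff δ), compared with n^{−1/d}; for d = 2 the minimum exceeds n^{−1/2} by a logarithmic factor, consistent with the rates in part d below.

```python
import numpy as np

def hybrid_bound(n, d, deltas=np.logspace(-8, 0, 800)):
    """Schematic hybrid bound: minimize  delta + n^{-1/2} * int_delta^1 eps^{-d/2} d(eps)
    over the cutoff delta (all constants dropped, log N(eps) taken to be eps^{-d})."""
    if d == 2:
        tail = np.log(1.0 / deltas)                        # int_delta^1 eps^{-1} d(eps)
    else:
        p = 1.0 - d / 2.0
        tail = (1.0 - deltas ** p) / p                     # int_delta^1 eps^{-d/2} d(eps)
    return np.min(deltas + tail / np.sqrt(n))

for d in [2, 3, 5]:
    for n in [10**4, 10**6, 10**8]:
        b = hybrid_bound(n, d)
        print(f"d={d}  n={n:>10d}   truncated bound ~ {b:.2e}   n^(-1/d) = {n ** (-1 / d):.2e}")
```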

To illustrate the efficiency of Theorem 5.31, let us revisit once more the Wasserstein law of large numbers. We have resolved completely the rate of convergence in one dimension in Example 5.28. However, in higher dimensions, we have so far only obtained pessimistic rates in Problem 5.8.

c. Show that we cannot obtain any nontrivial bound for the Wasserstein law of large numbers in dimensions d ≥ 2 from Corollary 5.25.

d. Using Theorem 5.31, show that in the setting of Problem 5.8

E[W₁(µ_n, µ)] ≲
    n^{−1/2}          for d = 1,
    n^{−1/2} log n    for d = 2,
    n^{−1/d}          for d ≥ 3.

Unlike in the one-dimensional case, a lower bound (and hence the sharpness of the above estimates for the rates) is not immediately obvious in dimensions d ≥ 2. We must work a little bit harder to obtain some insight.

e. Suppose that µ(dx) = ρ(x) dx with ‖ρ‖_∞ < ∞. Show that

E[ min_{i=1,...,n} ‖x − X_i‖ ] ≳ n^{−1/d}   for all x ∈ [0, 1]^d.

Hint: use P[min_{i≤n} ‖x − X_i‖ ≥ t] = P[‖x − X_1‖ ≥ t]^n and integrate.
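
A quick simulation (a minimal sketch, not from the text; the evaluation point x = (1/2, ..., 1/2), the repetition count, and the sample sizes are arbitrary choices) is consistent with the n^{−1/d} scaling of the nearest-sample distance when µ is uniform on [0, 1]^d.

```python
import numpy as np

# Monte Carlo estimate of E[min_i ||x - X_i||] for X_1, ..., X_n i.i.d. uniform on [0,1]^d,
# compared with n^{-1/d}; here x is taken to be the centre of the cube.
rng = np.random.default_rng(3)
for d in [1, 2, 3]:
    x = np.full(d, 0.5)
    for n in [100, 1000, 10000]:
        pts = rng.uniform(size=(200, n, d))                      # 200 repetitions
        nearest = np.linalg.norm(pts - x, axis=2).min(axis=1)    # min_i ||x - X_i|| per repetition
        print(f"d={d}  n={n:6d}   E[min_i ||x - X_i||] ~ {nearest.mean():.4f}   "
              f"n^(-1/d) = {n ** (-1 / d):.4f}")
```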

f. Conclude that when µ has a bounded density, we have in any dimension d

E[W₁(µ_n, µ)] ≳ n^{−1/d}.

Hint: consider the (random) function f(x) = −min_{i≤n} ‖x − X_i‖.

Taking together all the upper and lower bounds that we have proved for the Wasserstein law of large numbers, we have evidently obtained sharp rates ∼ n^{−1/2} in dimension d = 1 and ∼ n^{−1/d} in dimension d ≥ 3. The only case still in question is dimension d = 2, where there remains a gap between our lower and upper bounds n^{−1/2} ≲ E[W₁(µ_n, µ)] ≲ n^{−1/2} log n. It turns out that neither bound is sharp in this case: the correct rate is ∼ n^{−1/2}(log n)^{1/2}. It has been shown by Talagrand that this rather deep result, due to Ajtai, Komlós, and Tusnády, can be derived (in a nontrivial manner) using the more sophisticated generic chaining method that will be developed in Chapter 6.
