
5 Maxima, approximation, and chaining

We have shown in the previous chapters that in many cases a function $f(X_1,\ldots,X_n)$ of i.i.d. random variables is close to its mean $\mathbf{E}[f(X_1,\ldots,X_n)]$. The concentration phenomenon says nothing, however, about the magnitude of the mean $\mathbf{E}[f(X_1,\ldots,X_n)]$ itself. One cannot hope to address such questions at the same level of generality as we investigated concentration: some additional structure is needed in order to develop any meaningful theory.

The type of structure that will be investigated in the sequel is suprema
\[
F = \sup_{t\in T} X_t,
\]

where $\{X_t\}_{t\in T}$ is a random process that is defined on some index set $T$. Such problems arise in numerous high-dimensional applications, such as random matrix theory and probability in Banach spaces, control of empirical processes in statistics and machine learning, random optimization problems, etc. It is typically the case that the distribution of the individual $X_t$ is well understood, so that the main difficulty lies in understanding the effect of the supremum. To this end, we formulated in Chapter 1 the following informal principle:

If $\{X_t\}_{t\in T}$ is “sufficiently continuous,” the magnitude of $\sup_{t\in T} X_t$ is controlled by the “complexity” of the index set $T$.

In the sequel, we proceed to make this informal idea precise.

5.1 Finite maxima

Before we can develop a general theory to control suprema of random processes, we must understand the simplest possible situation: the maximum of a finite number of random variables, that is, the case where the index set $T$ has finite cardinality $|T| < \infty$. In fact, this special case will form the most basic ingredient of our theory. To develop a more general theory, the fundamental idea in the sequel will be to approximate the supremum over a general index set by the maximum over a finite set in increasingly sophisticated ways.

By appropriately combining these two basic ingredients—finite maxima and approximation—we will develop powerful tools that yield remarkably sharp control over the suprema of many random processes.

How can one bound the maximum of a finite number of random variables?

The most naive approach imaginable is to bound the supremum by a sum:

\[
\sup_{t\in T} X_t \le \sum_{t\in T} |X_t|.
\]

Plugging this trivial fact into an expectation, we obtain
\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \sum_{t\in T} \mathbf{E}|X_t| \le |T|\,\sup_{t\in T}\mathbf{E}|X_t|.
\]

Thus if we can control the magnitude of every random variable $X_t$ individually, then we obtain a bound that grows linearly in the cardinality $|T|$.
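As a quick concrete instance (an illustration added here, not part of the original text): for $n$ i.i.d. standard Gaussian variables $X_1,\ldots,X_n$ the naive bound gives
\[
\mathbf{E}\Big[\max_{i\le n} X_i\Big] \le \sum_{i\le n}\mathbf{E}|X_i| = n\sqrt{2/\pi},
\]
which grows linearly in $n$, whereas Problem 5.1 below shows that the true order of the maximum is only $\sqrt{2\log n}$.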

Of course, bounding a maximum by a sum is an exceedingly crude idea, and it seems unlikely a priori that one could draw any remotely accurate conclusions from such a procedure. Nonetheless, this simple idea is not as bad as it may appear at first sight if we use it a bit more carefully. Suppose, for example, that the random variables $X_t$ have bounded $p$th moment. Then

\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \mathbf{E}\Big[\sup_{t\in T}|X_t|^p\Big]^{1/p} \le \Big(\sum_{t\in T}\mathbf{E}|X_t|^p\Big)^{1/p} \le |T|^{1/p}\,\sup_{t\in T}\big(\mathbf{E}|X_t|^p\big)^{1/p},
\]
where we have bounded the maximum by a sum after applying Jensen's inequality. This has significantly improved the dependence on the cardinality from $|T|$ to $|T|^{1/p}$. Evidently our control of the maximum of random variables is closely related to the tail behavior of these random variables: the thinner the tails (i.e., the larger $p$), the better we can control their maximum. Once this idea has been understood, however, there is no need to stop at moments:

if the random variables $X_t$ possess a finite moment generating function, we can apply an exponential transformation precisely as in the development of Chernoff bounds in Section 3.1 to estimate the maximum.

Lemma 5.1 (Maximal inequality). Suppose that $\log\mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$ for all $\lambda \ge 0$ and $t\in T$, where $\psi$ is convex and $\psi(0) = \psi'(0) = 0$. Then
\[
\mathbf{E}\Big[\sup_{t\in T} X_t\Big] \le \psi^{*-1}(\log|T|),
\]
where $\psi^*(x) = \sup_{\lambda\ge 0}\{\lambda x - \psi(\lambda)\}$ denotes the Legendre dual of $\psi$ and $\psi^{*-1}$ its inverse.
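For concreteness, here is the computation in the subgaussian case (a worked example added here, not part of the original statement): if $\psi(\lambda) = \lambda^2\sigma^2/2$, then
\[
\psi^*(x) = \sup_{\lambda\ge 0}\Big\{\lambda x - \frac{\lambda^2\sigma^2}{2}\Big\} = \frac{x^2}{2\sigma^2}
\quad\text{for } x\ge 0,
\qquad
\psi^{*-1}(u) = \sqrt{2\sigma^2 u},
\]
so the lemma yields $\mathbf{E}[\sup_{t\in T}X_t] \le \sqrt{2\sigma^2\log|T|}$, the bound recorded in the proof below.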


Proof. By Jensen's inequality, we have for any $\lambda > 0$
\[
e^{\lambda\,\mathbf{E}[\sup_{t\in T}X_t]} \le \mathbf{E}\big[e^{\lambda\sup_{t\in T}X_t}\big] = \mathbf{E}\Big[\sup_{t\in T}e^{\lambda X_t}\Big] \le \sum_{t\in T}\mathbf{E}\big[e^{\lambda X_t}\big] \le |T|\,e^{\psi(\lambda)},
\]
so that
\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \frac{\log|T| + \psi(\lambda)}{\lambda}.
\]
As $\lambda > 0$ is arbitrary, we can now optimize over $\lambda$ on the right-hand side. In the special case that $X_t$ is $\sigma^2$-subgaussian (so that $\psi(\lambda) = \lambda^2\sigma^2/2$), we obtain
\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \inf_{\lambda>0}\bigg\{\frac{\log|T|}{\lambda} + \frac{\lambda\sigma^2}{2}\bigg\} = \sqrt{2\sigma^2\log|T|}.
\]
In the general case, the only difficulty is to evaluate the infimum in
\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \inf_{\lambda>0}\frac{\log|T| + \psi(\lambda)}{\lambda}.
\]
To this end, note that $\lambda x - \psi(\lambda) \le \psi^*(x)$ for all $\lambda \ge 0$ by the definition of $\psi^*$, and that the inequality is attained if we choose $\lambda$ to be the optimizer in the definition of $\psi^*$. Thus
\[
\inf_{\lambda>0}\frac{\psi^*(x) + \psi(\lambda)}{\lambda} = x \quad\text{for } x \ge 0.
\]
Setting $\psi^*(x) = \log|T|$ yields the conclusion.

It remains to show that $\psi^*$ is invertible. As $\psi^*$ is the supremum of linear functions, $x \mapsto \psi^*(x)$ is convex and strictly increasing except at those values $x$ where the maximum in the definition of $\psi^*$ is attained at $\lambda = 0$, that is, when $\lambda x - \psi(\lambda) \le -\psi(0)$ for all $\lambda \ge 0$. By the first-order condition for convexity, the latter occurs if and only if $x \le \psi'(0) = 0$. Moreover, as $\psi^*(0) = 0$, we conclude that $x \mapsto \psi^*(x)$ is convex, strictly increasing, and nonnegative for $x \ge 0$. Thus the inverse $\psi^{*-1}(x)$ is well defined for $x \ge 0$. $\square$

Lemma 5.1 should be viewed as an analogue of the Chernoff bound of Lemma 3.1 in the setting of maxima of random variables. Recall that the Chernoff bound states that if $\log\mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$ for all $\lambda \ge 0$ and $t\in T$, then

\[
\mathbf{P}[X_t \ge x] \le e^{-\psi^*(x)} \quad\text{for all } x \ge 0,\ t\in T.
\]

Thus our bound on the magnitude of the maximum depends on $|T|$ as the inverse of the tail probability of the individual random variables (as the inverse of the function $e^{\psi^*(x)}$ is $\psi^{*-1}(\log x)$). This is not a coincidence. In fact, we can use the Chernoff bound directly to estimate the tail probabilities of the maximum (rather than the expectation as in Lemma 5.1) as follows.
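As a quick numerical sanity check of the subgaussian maximal inequality (an illustrative sketch added here, not part of the text; the values of $n$ and the trial count are arbitrary choices), one can compare a Monte Carlo estimate of $\mathbf{E}[\max_{i\le n}X_i]$ for i.i.d. standard Gaussians with the bound $\sqrt{2\log n}$ from Lemma 5.1:

```python
# Sketch: Monte Carlo check that E[max of n i.i.d. N(0,1)] <= sqrt(2 log n).
# The trial count and values of n are arbitrary illustration choices.
import numpy as np

rng = np.random.default_rng(0)
trials = 5000
for n in [10, 100, 1000]:
    maxima = rng.standard_normal((trials, n)).max(axis=1)  # max over each trial
    print(f"n={n:5d}  E[max] ~ {maxima.mean():.3f}   "
          f"bound sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```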

Lemma 5.2 (Maximal tail inequality). Suppose that $\log\mathbf{E}[e^{\lambda X_t}] \le \psi(\lambda)$ for all $\lambda \ge 0$ and $t\in T$, where $\psi$ is convex and $\psi(0) = \psi'(0) = 0$. Then
\[
\mathbf{P}\Big[\sup_{t\in T}X_t \ge \psi^{*-1}(\log|T| + u)\Big] \le e^{-u} \quad\text{for all } u \ge 0.
\]
In particular, if $X_t$ is $\sigma^2$-subgaussian for every $t\in T$ (so that $\psi(\lambda) = \lambda^2\sigma^2/2$), then
\[
\mathbf{P}\Big[\sup_{t\in T}X_t \ge \sqrt{2\sigma^2\log|T|} + \sqrt{2\sigma^2 u}\Big] \le e^{-u} \quad\text{for all } u \ge 0.
\]


Proof. We readily estimate using the Chernoff bound
\[
\mathbf{P}\Big[\sup_{t\in T}X_t \ge x\Big] = \mathbf{P}\bigg[\bigcup_{t\in T}\{X_t \ge x\}\bigg] \le \sum_{t\in T}\mathbf{P}[X_t \ge x] \le e^{\log|T| - \psi^*(x)}.
\]
Writing $u = \psi^*(x) - \log|T|$ yields the first inequality (the invertibility of $\psi^*$ was shown in the proof of Lemma 5.1). In the subgaussian case,
\[
\psi^{*-1}(\log|T| + u) = \sqrt{2\sigma^2(\log|T| + u)} \le \sqrt{2\sigma^2\log|T|} + \sqrt{2\sigma^2 u}
\]
yields the second inequality. $\square$
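As an illustrative numerical check of the subgaussian form of Lemma 5.2 (a sketch added here, not from the text; the sample sizes are arbitrary), one can estimate the tail probability of the maximum of $n$ i.i.d. standard Gaussians at the threshold $\sqrt{2\log n} + \sqrt{2u}$ and compare it with $e^{-u}$:

```python
# Sketch: empirical check of P[max_i X_i >= sqrt(2 log n) + sqrt(2u)] <= e^{-u}
# for n i.i.d. standard Gaussians (sigma^2 = 1); parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 10000
maxima = rng.standard_normal((trials, n)).max(axis=1)
for u in [0.5, 1.0, 2.0]:
    threshold = np.sqrt(2 * np.log(n)) + np.sqrt(2 * u)
    empirical = (maxima >= threshold).mean()   # empirical tail probability
    print(f"u={u:.1f}  empirical tail = {empirical:.4f}   bound e^-u = {np.exp(-u):.4f}")
```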

The argument used in the proof of Lemma 5.2 is called a union bound: we have estimated the probability of a union of events by the sum of the probabilities, $\mathbf{P}[A\cup B] \le \mathbf{P}[A] + \mathbf{P}[B]$. This crude estimate plays exactly the same role in the proof of Lemma 5.2 as does bounding the maximum of random variables by their sum in the proof of Lemma 5.1.

Remark 5.3. While this may not be evident at the outset, the proofs of Lemmas 5.1 and 5.2 are based on precisely the same idea. Indeed, the union bound is merely another example of bounding a maximum by a sum:
\[
\mathbf{P}[A_1\cup\cdots\cup A_n] = \mathbf{E}[\max\{\mathbf{1}_{A_1},\ldots,\mathbf{1}_{A_n}\}] \le \mathbf{E}[\mathbf{1}_{A_1}] + \cdots + \mathbf{E}[\mathbf{1}_{A_n}].
\]

Lemmas 5.1 and 5.2 are therefore ultimately implementing the same bound in a slightly different way. In fact, it is not difficult to deduce a form of Lemma 5.1 with a slightly worse constant directly from Lemma 5.2 by integrating the tail bound, that is, using $\mathbf{E}[Z] = \int_0^\infty \mathbf{P}[Z \ge z]\,dz$ for $Z \ge 0$.
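For instance (a computation added here for illustration, not part of the original remark), in the $\sigma^2$-subgaussian case one may split the integral at $\sqrt{2\sigma^2\log|T|}$ and apply Lemma 5.2 with $u = s^2/2\sigma^2$ on the remaining piece:
\[
\mathbf{E}\Big[\sup_{t\in T}X_t\Big] \le \int_0^\infty \mathbf{P}\Big[\sup_{t\in T}X_t \ge z\Big]\,dz
\le \sqrt{2\sigma^2\log|T|} + \int_0^\infty e^{-s^2/2\sigma^2}\,ds
= \sqrt{2\sigma^2\log|T|} + \sqrt{\pi\sigma^2/2},
\]
which matches the bound of Lemma 5.1 up to the additive constant $\sqrt{\pi\sigma^2/2}$.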

We have obtained above some simple bounds on the maximum of a finite number of random variables. How good are these bounds? There are several reasons to be suspicious. On the one hand, we have obtained our estimates in an exceedingly crude fashion by bounding a maximum by a sum. On the other hand, while we made assumptions about the tail behavior of the individual variables $X_t$, we made no assumptions of any kind about the joint distribution of $\{X_t\}_{t\in T}$. One would expect dependencies between the random variables $X_t$ to make a significant difference to their maximum. As an extreme example, suppose $\{X_t\}_{t\in T}$ are completely dependent in the sense that $X_t = X_s$ for all $t, s\in T$. Then $\mathbf{E}[\sup_t X_t] = \mathbf{E}[X_s]$ does not depend on $|T|$ at all, whereas the bound in Lemma 5.1 necessarily grows with $|T|$. Of course, there is no contradiction: Lemma 5.1 is correct, but it is evidently far from sharp in the presence of strong dependence between the random variables $X_t$.

Remarkably, however, Lemmas 5.1 and 5.2 prove to be essentially sharp when the random variables $\{X_t\}_{t\in T}$ are independent. It is perhaps surprising that a method as crude as bounding a maximum by a sum would lead to a sharp result in any nontrivial situation. However, it turns out that this idea is not as bad as might be expected at first sight in the presence of independence.

For example, consider the union bound $\mathbf{P}[A\cup B] \le \mathbf{P}[A] + \mathbf{P}[B]$. Equality holds when $A$ and $B$ are disjoint, but this is certainly not the case in the proof of Lemma 5.2. Nonetheless, when $A$ and $B$ are independent, the probability that they occur simultaneously is much smaller than the individual probabilities, so that we still have $\mathbf{P}[A\cup B] \gtrsim \mathbf{P}[A] + \mathbf{P}[B]$. This idea will be exploited in Problem 5.1 below to show that Lemmas 5.1 and 5.2 are essentially sharp in the independent case. When viewed in terms of a sum of random variables, we see that in this setting the sum is dominated by its largest term, so that approximating the maximum by a sum is not such a bad idea after all.
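As a quick numerical instance (added here, not part of the text): if $A$ and $B$ are independent with $\mathbf{P}[A] = \mathbf{P}[B] = p$, then
\[
\mathbf{P}[A\cup B] = 2p - p^2 = \Big(1 - \frac{p}{2}\Big)\big(\mathbf{P}[A] + \mathbf{P}[B]\big),
\]
so for $p = 0.01$ the union bound gives $0.02$ while the true probability is $0.0199$; for rare events the union bound loses almost nothing.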

Problems

5.1 (Maxima of independent random variables). The proofs of the maximal inequalities in the present section rely on a very crude device: bounding the maximum of random variables by a sum. Nonetheless, when the random variables are independent, the bounds we obtain above are often sharp. To understand why, we must prove lower bounds of the same order.

It is easiest to consider first the setting of Lemma 5.2. Let us begin by proving matching upper and lower union bounds for independent events.

a. Show that if $A_1, \ldots, A_n$ are independent events, then
\[
(1 - e^{-1})\bigg(1 \wedge \sum_{i=1}^n \mathbf{P}[A_i]\bigg) \le \mathbf{P}[A_1\cup\cdots\cup A_n] \le 1 \wedge \sum_{i=1}^n \mathbf{P}[A_i].
\]

b. Let $\eta$ be a strictly increasing convex function. Suppose that
\[
\mathbf{P}[X_t \ge x] \ge e^{-\eta(x)} \quad\text{for all } x \ge 0,\ t\in T,
\]
and that the random variables $\{X_t\}_{t\in T}$ are independent. Use part (a) to deduce a corresponding lower bound on the tail probability $\mathbf{P}[\sup_{t\in T}X_t \ge x]$, and compare with the corresponding upper bound in Lemma 5.2.

Now that we have obtained a lower bound on the tail probability of the maximum (corresponding to the upper bound of Lemma 5.2), we can obtain a lower bound on the expectation of the maximum (corresponding to the upper bound of Lemma 5.1) by integrating the tail bound.

c. Deduce from the previous part a lower bound, valid for every $x \ge 0$, on $\mathbf{E}[0\vee\sup_{t\in T}X_t]$ in terms of $x$ and the tail bound from part (b).

d. Conclude that if

\[
e^{-\eta^*(x)} \le \mathbf{P}[X_t \ge x] \le e^{-\psi^*(x)} \quad\text{for all } x \ge 0,\ t\in T,
\]
then we have
\[
\eta^{*-1}(\log|T|) + \sup_{t\in T}\mathbf{E}[0\wedge X_t] \;\lesssim\; \mathbf{E}\Big[\sup_{t\in T}X_t\Big] \;\lesssim\; \psi^{*-1}(\log|T|).
\]
Hint: use $\mathbf{E}[0\vee Z] = \int_0^\infty \mathbf{P}[Z \ge x]\,dx$.

The upper and lower bounds in the previous part are generally of the same order, provided that we start with upper and lower bounds on $\mathbf{P}[X_t \ge x]$ of the same order. For example, let us consider the case of Gaussian variables.

e. For $X \sim N(0,1)$, show that
\[
\mathbf{P}[X \ge x] \ge \frac{e^{-x^2}}{2\sqrt{2}} \quad\text{for all } x \ge 0.
\]

Hint: write the probability as an integral and use $(v+x)^2 \le 2v^2 + 2x^2$.

f. Let $X_1, \ldots, X_n$ be i.i.d. Gaussian random variables with zero mean and unit variance. Show that the above bound implies
\[
\frac{1-e^{-1}}{2}\,\sqrt{2\log\big(n\,2^{-3/4}\big)} - \frac{1}{\sqrt{2\pi}} \;\le\; \mathbf{E}\Big[\max_{i\le n} X_i\Big] \;\le\; \sqrt{2\log n}.
\]

In particular, $c\sqrt{\log n} \le \mathbf{E}[\max_{i\le n} X_i] \le C\sqrt{\log n}$ for $n$ sufficiently large.

g. If $X_1, X_2, \ldots$ are i.i.d. Gaussian, prove the asymptotic
\[
\frac{\max_{i\le n} X_i}{\sqrt{2\log n}} \xrightarrow[n\to\infty]{} 1 \quad\text{in probability.}
\]
Hint: for the upper bound, see Problem 3.5. For the lower bound, proceed analogously using a suitable improvement on the Gaussian tail lower bound obtained above (use $(v+x)^2 \le (1+\varepsilon^{-1})v^2 + (1+\varepsilon)x^2$).
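The convergence in part (g) is already visible numerically (an illustrative sketch added here, not part of the problem; the trial counts are arbitrary):

```python
# Sketch: the ratio max_{i<=n} X_i / sqrt(2 log n) concentrates near 1 as n grows,
# for i.i.d. standard Gaussians; illustrates Problem 5.1(f)-(g).
import numpy as np

rng = np.random.default_rng(2)
for n in [100, 10_000, 1_000_000]:
    maxima = np.array([rng.standard_normal(n).max() for _ in range(200)])
    ratio = maxima / np.sqrt(2 * np.log(n))
    print(f"n={n:8d}  mean ratio = {ratio.mean():.3f}  std = {ratio.std():.3f}")
```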

5.2 (Approximating a maximum by a sum). Show that for $\lambda > 0$
\[
\max_{t\in T} X_t \;\le\; \frac{1}{\lambda}\log\sum_{t\in T} e^{\lambda X_t} \;\le\; \max_{t\in T} X_t + \frac{\log|T|}{\lambda}.
\]
Thus when $\lambda$ is large, the sum is increasingly dominated by its largest term. This simple observation is often useful in problems where a smooth approximation of the maximum function $x \mapsto \max_i x_i$ is needed.
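The two inequalities of Problem 5.2 are easy to see in action (an illustrative sketch added here, not part of the problem; the vector `x` below is an arbitrary choice):

```python
# Sketch: smooth-max bounds  max_t x_t <= (1/lam) log sum_t exp(lam x_t)
#                                       <= max_t x_t + log|T| / lam.
# In practice one would compute the middle quantity with scipy.special.logsumexp
# for numerical stability; plain numpy suffices for this small illustration.
import numpy as np

x = np.array([0.3, -1.2, 2.5, 2.4, 0.0])   # an arbitrary finite collection {x_t}
for lam in [1.0, 5.0, 25.0]:
    smooth_max = np.log(np.exp(lam * x).sum()) / lam
    print(f"lambda={lam:5.1f}  smooth max = {smooth_max:.4f}  "
          f"true max = {x.max():.4f}  gap bound log|T|/lambda = {np.log(x.size)/lam:.4f}")
```

As $\lambda$ increases, the smooth maximum approaches the true maximum, in line with the $\log|T|/\lambda$ gap bound.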

5.3 (Johnson-Lindenstrauss lemma). The following functional analysis result has found many applications in computer science and signal processing.

