

P[ max_{i≤n} {X_i − E X_i} ≥ (1+ε)σ√(2 log n) ] → 0 as n → ∞, for all ε > 0.

Hint: use the union bound

P[X ∨ Y ≥ t] = P[X ≥ t or Y ≥ t] ≤ P[X ≥ t] + P[Y ≥ t].

This problem shows that the maximum max_{i≤n} {X_i − E X_i} of σ²-subgaussian random variables is at most of order σ√(2 log n). This is the simplest example of the crucial role played by tail bounds in estimating the size of maxima of random variables. The second part of this course will be entirely devoted to the investigation of such problems (using much deeper ideas).
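
The claim can be checked numerically in a special case. The sketch below (an illustration added here, not taken from the text, assuming i.i.d. standard Gaussians, which are 1-subgaussian) compares the sample maximum to σ√(2 log n) and estimates how often it exceeds (1+ε)σ√(2 log n), which the problem asserts becomes rare as n grows.

import numpy as np

rng = np.random.default_rng(0)
sigma, eps, trials = 1.0, 0.1, 200
for n in [100, 5_000, 50_000]:
    # maximum of n i.i.d. N(0, sigma^2) variables, repeated over many trials
    maxima = sigma * rng.standard_normal((trials, n)).max(axis=1)
    scale = sigma * np.sqrt(2 * np.log(n))
    print(n, "mean max / (sigma*sqrt(2 log n)):", maxima.mean() / scale,
          " frac exceeding (1+eps)*sigma*sqrt(2 log n):", np.mean(maxima >= (1 + eps) * scale))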

3.2 The martingale method

Let X_1, ..., X_n be independent random variables. In the previous chapter, we showed that the variance of f(X_1, ..., X_n) can be bounded in many cases by a “square gradient” of the function f. The aim of this chapter is to obtain a much stronger type of result: we would like to show that f(X_1, ..., X_n) is subgaussian with variance proxy controlled by a “square gradient” of f.

A key idea developed in the previous chapter was to use tensorization to reduce the problem to the one-dimensional case. With the tensorization inequality in hand, we could even apply a trivial bound such as Lemma 2.1 to obtain a nontrivial variance inequality in terms of bounded differences.

Our first instinct in the present setting is therefore to prove a tensorization inequality for the subgaussian property, which could then be combined with Hoeffding’s Lemma 3.6 (which plays the analogous role in the present setting to the trivial Lemma 2.1 for the variance) in order to obtain a concentration inequality in terms of bounded differences. Unfortunately, it turns out that unlike in the case of the variance, the subgaussian property does not tensorize in a natural manner, and thus we cannot directly implement this program. One of the most important ideas that will be developed in the following sections is that the proof of subgaussian inequalities can be reduced to a strengthened form of Poincaré inequalities, called log-Sobolev inequalities, that do tensorize exactly in the same manner as the variance. This will provide us with a very powerful tool to prove subgaussian concentration.

There is, however, a more elementary approach that should be attempted before we begin introducing new ideas. Even though the subgaussian property does not tensorize in the same manner as the variance, we can still repeat some of the steps in the proof of the tensorization Theorem 2.3 in the subgaussian setting. Recall that the main idea of the proof of Theorem 2.3 is to write


f(X_1, ..., X_n) − E f(X_1, ..., X_n) = Σ_{k=1}^n ∆_k,   where

∆_k = E[f(X_1, ..., X_n) | X_1, ..., X_k] − E[f(X_1, ..., X_n) | X_1, ..., X_{k−1}]

are martingale differences. The following simple result, which exploits the nice behavior of the exponential of a sum, could be viewed as a sort of poor man’s tensorization property for sums of martingale increments. By working directly with the martingale increments, we will be able to derive a first concentration inequality. This approach is commonly known as the martingale method.

Lemma 3.7 (Azuma). Let {F_k}_{k≤n} be any filtration, and let ∆_1, ..., ∆_n be random variables that satisfy the following properties for k = 1, ..., n:

1. Martingale difference property: ∆_k is F_k-measurable and E[∆_k | F_{k−1}] = 0.

2. Conditional subgaussian property: E[e^{λ∆_k} | F_{k−1}] ≤ e^{λ²σ_k²/2} a.s.

Then the sum Σ_{k=1}^n ∆_k is subgaussian with variance proxy Σ_{k=1}^n σ_k².

Proof. For any 1 ≤ k ≤ n, we can compute

E[e^{λ Σ_{i=1}^k ∆_i}] = E[e^{λ Σ_{i=1}^{k−1} ∆_i} E[e^{λ∆_k} | F_{k−1}]] ≤ e^{λ²σ_k²/2} E[e^{λ Σ_{i=1}^{k−1} ∆_i}].

It follows by induction that E[e^{λ Σ_{i=1}^n ∆_i}] ≤ e^{λ² Σ_{i=1}^n σ_i²/2}. □

Remark 3.8. While we did not explicitly use the martingale difference property in the proof, E[e^{λ∆_k} | F_{k−1}] ≤ e^{λ²σ_k²/2} can in fact only hold if E[∆_k | F_{k−1}] = 0 (consider (E[e^{λ∆_k} | F_{k−1}] − 1)/λ as λ ↓ 0). In general, the conditional subgaussian property of X given F should read E[e^{λ(X − E[X|F])} | F] ≤ e^{λ²σ²/2} a.s.
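
One way to spell out the observation in Remark 3.8 (a short argument left implicit in the text): by Jensen’s inequality, e^{λ E[∆_k | F_{k−1}]} ≤ E[e^{λ∆_k} | F_{k−1}] ≤ e^{λ²σ_k²/2}, so λ E[∆_k | F_{k−1}] ≤ λ²σ_k²/2 for every λ ∈ R. Dividing by λ > 0 and letting λ ↓ 0 gives E[∆_k | F_{k−1}] ≤ 0, while dividing by λ < 0 and letting λ ↑ 0 gives E[∆_k | F_{k−1}] ≥ 0.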

In combination with Hoeffding’s Lemma 3.6, we now obtain a classical result on the tail behavior of sums of martingale differences.

Corollary 3.9 (Azuma-Hoeffding inequality). Let {F_k}_{k≤n} be any filtration, and let ∆_k, A_k, B_k satisfy the following properties for k = 1, ..., n:

1. Martingale difference property: ∆_k is F_k-measurable and E[∆_k | F_{k−1}] = 0.

2. Predictable bounds: A_k, B_k are F_{k−1}-measurable and A_k ≤ ∆_k ≤ B_k a.s.

Then Σ_{k=1}^n ∆_k is subgaussian with variance proxy (1/4) Σ_{k=1}^n ‖B_k − A_k‖². In particular, we obtain for every t ≥ 0 the tail bound

P[ Σ_{k=1}^n ∆_k ≥ t ] ≤ exp( −2t² / Σ_{k=1}^n ‖B_k − A_k‖² ).

Proof. Applying Hoeffding’s Lemma 3.6 to ∆_k conditionally on F_{k−1} implies E[e^{λ∆_k} | F_{k−1}] ≤ e^{λ²(B_k − A_k)²/8}. The result now follows from Lemma 3.7. □
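
A minimal simulation sketch (added here, not part of the text) of the Azuma-Hoeffding bound for a toy martingale: ∆_k = c_k ε_k with Rademacher ε_k and deterministic weights c_k, so one may take A_k = −c_k, B_k = c_k, and the tail bound reads exp(−t²/(2 Σ_k c_k²)).

import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 200, 50_000, 5.0
c = 1.0 / np.sqrt(np.arange(1, n + 1))            # deterministic weights; A_k = -c_k, B_k = c_k
eps = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher increments
M = (eps * c).sum(axis=1)                         # sum of Delta_k = c_k * eps_k, per trial
proxy = (c ** 2).sum()                            # variance proxy (1/4) * sum_k (B_k - A_k)^2
bound = np.exp(-t ** 2 / (2 * proxy))             # = exp(-2 t^2 / sum_k (B_k - A_k)^2)
print("empirical P[sum Delta_k >= t] =", np.mean(M >= t))
print("Azuma-Hoeffding bound         =", bound)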


Example 3.10. The Azuma-Hoeffding inequality is often applied in the following setting. Let X_1, ..., X_n be independent random variables such that a ≤ X_i ≤ b for all i. Applying Corollary 3.9 with ∆_k = (X_k − E X_k)/n yields

P[ (1/n) Σ_{k=1}^n {X_k − E X_k} ≥ t ] ≤ e^{−2nt²/(b−a)²}.

By the central limit theorem, this bound is of the correct order both in terms of the size of the sum and its Gaussian tail behavior. However, just as for the case of the variance (see the discussion in section 2.1), this bound can be pessimistic in that it does not capture any information on the distribution of the variables X_i: in particular, the variance proxy (b−a)²/4 may be much larger than the actual variance of the random variables X_i. Much of the effort in developing concentration inequalities is to obtain bounds in terms of “good” variance proxies for the purposes of the application at hand.
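
The pessimism can be made concrete in a hypothetical special case: for X_i ~ Bernoulli(p) with small p, the proxy (b−a)²/4 = 1/4 greatly exceeds the true variance p(1−p), and the resulting tail bound, while valid, is far from the empirical tail. A minimal sketch:

import numpy as np

rng = np.random.default_rng(2)
n, p, t, trials = 1000, 0.01, 0.01, 200_000
counts = rng.binomial(n, p, size=trials)           # sum of n i.i.d. Bernoulli(p) variables, per trial
dev = counts / n - p                               # (1/n) sum_k (X_k - E X_k)
print("empirical P[dev >= t]        =", np.mean(dev >= t))
print("bound exp(-2 n t^2/(b-a)^2)  =", np.exp(-2 * n * t ** 2))   # here b - a = 1
print("true Var X_i = p(1-p)        =", p * (1 - p), " vs proxy (b-a)^2/4 = 0.25")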

We motivated the development of tail bounds for martingale differences as a partial replacement of the tensorization inequality for the variance. Let us therefore return to the case of functions f(X_1, ..., X_n) of independent random variables X_1, ..., X_n. Using the Azuma-Hoeffding inequality, we readily obtain our first and simplest subgaussian concentration inequality. Recall that

D_i f(x) := sup_z f(x_1, ..., x_{i−1}, z, x_{i+1}, ..., x_n) − inf_z f(x_1, ..., x_{i−1}, z, x_{i+1}, ..., x_n)

are the discrete derivatives defined in section 2.1.

Theorem 3.11 (McDiarmid). For X_1, ..., X_n independent, f(X_1, ..., X_n) is subgaussian with variance proxy (1/4) Σ_{k=1}^n ‖D_k f‖². In particular,

P[f(X_1, ..., X_n) − E f(X_1, ..., X_n) ≥ t] ≤ e^{−2t² / Σ_{k=1}^n ‖D_k f‖²}.

Proof. As in the proof of the tensorization Theorem 2.3, we write

f(X_1, ..., X_n) − E f(X_1, ..., X_n) = Σ_{k=1}^n ∆_k,   where

∆_k = E[f(X_1, ..., X_n) | X_1, ..., X_k] − E[f(X_1, ..., X_n) | X_1, ..., X_{k−1}]

are martingale differences. Note that A_k ≤ ∆_k ≤ B_k with

A_k = E[ inf_z f(X_1, ..., X_{k−1}, z, X_{k+1}, ..., X_n) − f(X_1, ..., X_n) | X_1, ..., X_{k−1} ],
B_k = E[ sup_z f(X_1, ..., X_{k−1}, z, X_{k+1}, ..., X_n) − f(X_1, ..., X_n) | X_1, ..., X_{k−1} ],

where we have used the independence of X_k and X_1, ..., X_{k−1}, X_{k+1}, ..., X_n. The result now follows immediately from the Azuma-Hoeffding inequality of Corollary 3.9 once we note that |B_k − A_k| ≤ ‖D_k f‖. □
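
As a quick illustration (an example added here, not taken from the text), consider f(X_1, ..., X_n) = the number of distinct values among n i.i.d. uniform draws from a set of n labels. Changing one coordinate changes f by at most 1, so ‖D_k f‖ ≤ 1 and McDiarmid’s inequality gives P[f − E f ≥ t] ≤ e^{−2t²/n}. A minimal numerical check:

import numpy as np

rng = np.random.default_rng(3)
n, trials, t = 500, 20_000, 20.0
X = rng.integers(0, n, size=(trials, n))                 # i.i.d. uniform draws from {0, ..., n-1}
f = np.array([np.unique(row).size for row in X])         # number of distinct values per trial
# the sample mean of f stands in for E f below
print("empirical P[f - mean(f) >= t] =", np.mean(f - f.mean() >= t))
print("McDiarmid bound exp(-2t^2/n)  =", np.exp(-2 * t ** 2 / n))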

McDiarmid’s inequality should be viewed as a subgaussian form of the bounded difference inequality of Corollary 2.4. In Corollary 2.4, the variance is controlled by the expectation of the “square gradient” of the function f. In contrast, McDiarmid’s inequality yields the stronger subgaussian property, but here the variance proxy is controlled by a uniform upper bound on the “square gradient” rather than its expectation. Of course, it makes sense that a stronger property requires a stronger assumption. We will repeatedly encounter this idea in the setting of concentration inequalities: typically the expectation of the “square gradient” controls the variance, while a uniform bound on the “square gradient” controls the subgaussian variance proxy.

However, from this viewpoint, the result of Theorem 3.11 is not satisfactory: as the appropriate notion of “square gradient” in the bounded difference inequality is Σ_{k=1}^n |D_k f|², we would expect a variance proxy of order ‖Σ_{k=1}^n |D_k f|²‖; however, Theorem 3.11 only yields control in terms of the larger quantity Σ_{k=1}^n ‖D_k f‖². The former would constitute a crucial improvement over the latter in many situations (for example, in the setting of the random matrix Example 2.5). Unfortunately, the martingale method is far too crude to capture this idea. In the sequel, we will develop new techniques for proving subgaussian concentration inequalities that will make it possible to prove much more refined bounds in many settings.

Problems

3.6 (Bin packing). For the bin packing Problem 2.3, show that the variance bound Var[B_n] ≤ n/4 can be strengthened to a Gaussian tail bound

P[|B_n − E B_n| ≥ t] ≤ 2e^{−2t²/n}.

In view of Problem 2.3, this bound has the correct order.

3.7 (Rademacher processes). Let ε_1, ..., ε_n be independent symmetric Bernoulli random variables P[ε_i = ±1] = 1/2, and let T ⊆ R^n. Define

Z = sup_{t∈T} Σ_{k=1}^n ε_k t_k.

In Problem 2.2, we showed that

Var[Z] ≤ 4 sup_{t∈T} Σ_{k=1}^n t_k².

Show that McDiarmid’s inequality can give, at best, a bound of the form

P[|Z − E Z| ≥ t] ≤ 2e^{−t²/2σ²}   with   σ² = Σ_{k=1}^n sup_{t∈T} t_k².

Show by means of an example that the variance proxy in McDiarmid’s inequality can exhibit a vastly incorrect scaling as a function of the dimension n.
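
A small experimentation harness (added here, not part of the problem) for comparing the empirical variance of Z with McDiarmid’s proxy σ² = Σ_k sup_{t∈T} t_k², for a finite T given as the rows of a matrix. The placeholder T below is arbitrary; finding a T with vastly incorrect scaling is left to the problem.

import numpy as np

rng = np.random.default_rng(4)
n, trials = 50, 20_000
T = rng.standard_normal((10, n))                  # placeholder: 10 arbitrary points of T in R^n
eps = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher variables
Z = (eps @ T.T).max(axis=1)                       # Z = sup_{t in T} sum_k eps_k t_k, per trial
sigma2 = (T ** 2).max(axis=0).sum()               # sum_k sup_{t in T} t_k^2
print("empirical Var[Z]        =", Z.var())
print("McDiarmid proxy sigma^2 =", sigma2)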


3.8 (Empirical frequencies). Let X_1, ..., X_n be i.i.d. random variables with any distribution µ on a measurable space E, and let C be a countable class of measurable subsets of E. By the law of large numbers,

#{k ∈ {1, ..., n} : X_k ∈ C} / n ≈ µ(C)

when n is large. In order to analyze empirical risk minimization methods in machine learning, it is important to control the deviation between the true probability µ(C) and its empirical average uniformly over the class C. In particular, one would like to guarantee that the uniform deviation

Z_n = sup_{C∈C} | #{k ∈ {1, ..., n} : X_k ∈ C} / n − µ(C) |

does not exceed a certain level with high probability. As a starting point towards proving such a result, show that for every n ≥ 1 and t ≥ 0

P[Z_n ≥ E Z_n + t] ≤ e^{−2nt²}.

To obtain a bound on P[Z_n ≥ t], it therefore remains to control E Z_n (the techniques for this will be developed in the second part of the course).
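
For a concrete (hypothetical) instance of this bound, take E = [0, 1], µ the uniform distribution, and C = {[0, c] : c ∈ [0, 1] ∩ Q}; then Z_n coincides with the Kolmogorov-Smirnov statistic sup_c |F_n(c) − c| of the empirical distribution function F_n. A minimal simulation sketch:

import numpy as np

rng = np.random.default_rng(5)
n, trials, t = 200, 20_000, 0.05
U = np.sort(rng.random((trials, n)), axis=1)                    # order statistics of n uniforms, per trial
grid = np.arange(1, n + 1) / n
Zn = np.maximum(grid - U, U - (grid - 1.0 / n)).max(axis=1)     # sup_c |F_n(c) - c|, per trial
# the sample mean of Z_n stands in for E Z_n below
print("empirical P[Z_n >= mean(Z_n) + t] =", np.mean(Zn >= Zn.mean() + t))
print("bound exp(-2 n t^2)               =", np.exp(-2 * n * t ** 2))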

3.9 (Sums in Hilbert space). Let X_1, ..., X_n be independent random variables with zero mean in a Hilbert space, and suppose that ‖X_k‖ ≤ C a.s. for every k. Let us prove a sort of Hilbert-valued analogue of Example 3.10.

a. Show that for all t ≥ 0
