
The following theorem shows the closeness of the joint cumulative distribution functions (c.d.f.-s) of the square-root likelihood ratio statistics $\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)}$ and of their bootstrap counterparts $\sqrt{2L^{\flat}_k(\tilde{\theta}^{\flat}_k)-2L^{\flat}_k(\tilde{\theta}_k)}$. The approximating error term $\Delta_{\mathrm{total}}$ equals a sum of the errors from all the steps in the scheme (3.1).

Theorem 3.1. Under the conditions of Section 5 it holds with probability $\geq 1-12e^{-x}$ for $z_k \geq C\sqrt{p_k}$:

$$\Big|\,\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\Big\{\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)} > z_k\Big\}\Big)-\mathbb{P}^{\flat}\Big(\bigcup\nolimits_{k=1}^{K}\Big\{\sqrt{2L^{\flat}_k(\tilde{\theta}^{\flat}_k)-2L^{\flat}_k(\tilde{\theta}_k)} > z_k\Big\}\Big)\Big| \;\leq\; \Delta_{\mathrm{total}}.$$

The approximating total error $\Delta_{\mathrm{total}} \geq 0$ is deterministic, and in the case of i.i.d. observations (see Section 5.3) it holds:

$$\Delta_{\mathrm{total}} \;\leq\; C\Big(\frac{p^{3}_{\max}}{n}\Big)^{1/8}\log^{9/8}(K)\log^{3/8}(np_{\mathrm{sum}}).$$

Remark 3.1. The obtained approximation bound is mainly of theoretical interest, although it shows the impact of $p_{\max}$, $K$ and $n$ on the quality of the bootstrap procedure.

For more details on the error term see Remark A.1.

The next theorem justifies the bootstrap procedure under the $(SmB)_d$ condition. The theorem says that the bootstrap quantile functions $z^{\flat}_k(\cdot)$ with the bootstrap-corrected for multiplicity confidence levels $1-c^{\flat}(\alpha)$ can be used for the construction of the simultaneous confidence set in the $Y$-world.

Theorem 3.2 (Bootstrap validity for a small modeling bias). Assume the conditions of Theorem 3.1, and $c(\alpha),\, 0.5\,c^{\flat}(\alpha) \geq \Delta_{\mathrm{full,max}}$; then for $\alpha \leq 1-8e^{-x}$ it holds with probability $\geq 1-12e^{-x}$:

$$\Big|\,\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\Big\{\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)} > z^{\flat}_k\big(c^{\flat}(\alpha)\big)\Big\}\Big)-\alpha\,\Big| \;\leq\; \Delta_{z,\mathrm{total}},$$

where $\Delta_{\mathrm{full,max}} \leq C\{(p_{\max}+x)^{3}/n\}^{1/8}$ in the case of i.i.d. observations (see Section 5.3), and $\Delta_{z,\mathrm{total}} \leq 3\Delta_{\mathrm{total}}$; their explicit definitions are given in (C.11) and (C.14). Moreover,

$$c^{\flat}(\alpha) \leq c(\alpha+\Delta_c)+\Delta_{\mathrm{full,max}}, \qquad c^{\flat}(\alpha) \geq c(\alpha-\Delta_c)-\Delta_{\mathrm{full,max}},$$

for $0 \leq \Delta_c \leq 2\Delta_{\mathrm{total}}$, defined in (C.15).

The following theorem does not assume the $(SmB)_d$ condition to be fulfilled. It turns out that in this case the bootstrap procedure becomes conservative: the bootstrap critical values corrected for the multiplicity, $z^{\flat}_k(c^{\flat}(\alpha))$, are increased by the modelling bias

$$\sqrt{\operatorname{tr}\{D_k^{-1}H_k^{2}D_k^{-1}\}}-\sqrt{\operatorname{tr}\{D_k^{-1}(H_k^{2}-B_k^{2})D_k^{-1}\}},$$

and therefore the confidence set based on the bootstrap estimates can be conservative.

Theorem 3.3 (Bootstrap conservativeness for a large modeling bias). Under the conditions of Section 5 except for $(SmB)_d$ it holds with probability $\geq 1-14e^{-x}$ for $z_k \geq C\sqrt{p_k}$, $1 \leq C < 2$:

$$\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\Big\{\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)} > z_k\Big\}\Big) \;\leq\; \mathbb{P}^{\flat}\Big(\bigcup\nolimits_{k=1}^{K}\Big\{\sqrt{2L^{\flat}_k(\tilde{\theta}^{\flat}_k)-2L^{\flat}_k(\tilde{\theta}_k)} > z_k\Big\}\Big)+\Delta_{b,\mathrm{total}}.$$

The deterministic value $\Delta_{b,\mathrm{total}} \in [0, \Delta_{\mathrm{total}}]$ (see (3.6) in the case of Section 5.3). Moreover, the bootstrap-corrected for multiplicity confidence level $1-c^{\flat}(\alpha)$ is conservative in comparison with the true corrected confidence level:

$$1-c^{\flat}(\alpha) \;\geq\; 1-c(\alpha+\Delta_{b,c})-\Delta_{\mathrm{full,max}},$$

and it holds for all $k = 1, \dots, K$ and $\alpha \leq 1-8e^{-x}$:

$$z^{\flat}_k\big(c^{\flat}(\alpha)\big) \;\geq\; z_k\big(c(\alpha+\Delta_{b,c})+\Delta_{\mathrm{full,max}}\big)+\sqrt{\operatorname{tr}\{D_k^{-1}H_k^{2}D_k^{-1}\}}-\sqrt{\operatorname{tr}\{D_k^{-1}(H_k^{2}-B_k^{2})D_k^{-1}\}}-\Delta_{qf,1,k},$$

for $0 \leq \Delta_{b,c} \leq 2\Delta_{\mathrm{total}}$, defined in (C.18); the positive value $\Delta_{qf,1,k}$ is bounded from above by $(a_k^{2}+a_{B,k}^{2})(\sqrt{8xp_k}+6x)$ for the constants $a_k^{2} > 0$, $a_{B,k}^{2} \geq 0$ from conditions (I), (IB).

The $(SmB)_d$ condition is automatically fulfilled if all the parametric models are correct, or in the case of i.i.d. observations. This condition is checked for generalised linear models and linear quantile regression in Spokoiny and Zhilova (2014) (the version of 2015).

4 Numerical experiments

Here we check the performance of the bootstrap procedure by constructing simultaneous confidence sets based on the local constant and local quadratic estimates; the former is also known as the Nadaraya-Watson estimate (Nadaraya, 1964; Watson, 1964). Let $Y_1, \dots, Y_n$ be independent random scalar observations and $X_1, \dots, X_n$ some deterministic design points. In Sections 4.1-4.3 below we introduce the models and the data; Sections 4.4-4.6 present the results of the experiments.

4.1 Local constant regression

Consider the following quadratic likelihood function reweighted with a kernel function $K(\cdot)$:

$$L(\theta, x, h) \stackrel{\mathrm{def}}{=} -\frac{1}{2}\sum_{i=1}^{n}(Y_i-\theta)^{2}w_i(x,h), \qquad w_i(x,h) \stackrel{\mathrm{def}}{=} K(\{x-X_i\}/h),$$

$$K(x)\in[0,1], \qquad \int_{\mathbb{R}}K(x)\,dx = 1, \qquad K(x)=K(-x).$$

Here $h > 0$ denotes the bandwidth, i.e. the local smoothing parameter. The target parameter and the local MLE read as:

$$\theta(x,h) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n}w_i(x,h)\,\mathbb{E}Y_i}{\sum_{i=1}^{n}w_i(x,h)}, \qquad \tilde{\theta}(x,h) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n}w_i(x,h)\,Y_i}{\sum_{i=1}^{n}w_i(x,h)}.$$

Let us fix a bandwidth $h$ and consider the range of points $x_1, \dots, x_K$. They yield $K$ local constant models with the target parameters $\theta_k \stackrel{\mathrm{def}}{=} \theta(x_k,h)$ and the likelihood functions $L_k(\theta) \stackrel{\mathrm{def}}{=} L(\theta, x_k, h)$ for $k = 1, \dots, K$.

The bootstrap local likelihood function is defined similarly to the global one (2.2), by reweighting $L(\theta, x, h)$ with the bootstrap multipliers $u_1, \dots, u_n$:

$$L^{\flat}_k(\theta) \stackrel{\mathrm{def}}{=} L^{\flat}(\theta, x_k, h) \stackrel{\mathrm{def}}{=} -\frac{1}{2}\sum_{i=1}^{n}(Y_i-\theta)^{2}w_i(x_k,h)\,u_i, \qquad \tilde{\theta}^{\flat}_k \stackrel{\mathrm{def}}{=} \tilde{\theta}^{\flat}(x_k,h) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^{n}w_i(x_k,h)\,u_i\,Y_i}{\sum_{i=1}^{n}w_i(x_k,h)\,u_i}.$$
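For illustration, a minimal numerical sketch of the two estimators above, assuming the Epanechnikov kernel that Section 4.3 uses (the function names here are illustrative, not from the paper):

```python
import numpy as np

def epanechnikov(t):
    # K(t) = 0.75 (1 - t^2) on [-1, 1]: bounded by 1, integrates to 1, symmetric
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def local_constant_mle(x, X, Y, h, u=None):
    # theta-tilde(x, h); with multipliers u it is the bootstrap version theta-tilde^flat(x, h)
    w = epanechnikov((x - X) / h)
    if u is not None:
        w = w * u                  # reweighting by the bootstrap multipliers u_1, ..., u_n
    return np.sum(w * Y) / np.sum(w)

# toy usage: one bootstrap draw with Gaussian weights u_i ~ N(1, 1)
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 400)
Y = 5.0 + rng.standard_normal(400)
u = rng.normal(1.0, 1.0, size=400)
print(local_constant_mle(0.5, X, Y, h=0.12),
      local_constant_mle(0.5, X, Y, h=0.12, u=u))
```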

4.2 Local quadratic regression

Here the local likelihood function reads as

$$L(\theta, x, h) \stackrel{\mathrm{def}}{=} -\frac{1}{2}\sum_{i=1}^{n}(Y_i-\Psi_i^{\top}\theta)^{2}w_i(x,h), \qquad \theta, \Psi_i \in \mathbb{R}^{3}, \quad \Psi_i \stackrel{\mathrm{def}}{=} \big(1, X_i, X_i^{2}\big)^{\top},$$

and

$$\theta(x,h) \stackrel{\mathrm{def}}{=} \big\{\Psi W(x,h)\Psi^{\top}\big\}^{-1}\Psi W(x,h)\,\mathbb{E}Y, \qquad \tilde{\theta}(x,h) \stackrel{\mathrm{def}}{=} \big\{\Psi W(x,h)\Psi^{\top}\big\}^{-1}\Psi W(x,h)\,Y,$$

where $Y \stackrel{\mathrm{def}}{=} (Y_1, \dots, Y_n)^{\top}$, $\Psi \stackrel{\mathrm{def}}{=} (\Psi_1, \dots, \Psi_n) \in \mathbb{R}^{3\times n}$ and $W(x,h) \stackrel{\mathrm{def}}{=} \operatorname{diag}\{w_1(x,h), \dots, w_n(x,h)\}$. Similarly for the bootstrap objects:

$$L^{\flat}(\theta, x, h) \stackrel{\mathrm{def}}{=} -\frac{1}{2}\sum_{i=1}^{n}(Y_i-\Psi_i^{\top}\theta)^{2}w_i(x,h)\,u_i, \qquad \tilde{\theta}^{\flat}(x,h) \stackrel{\mathrm{def}}{=} \big\{\Psi U W(x,h)\Psi^{\top}\big\}^{-1}\Psi U W(x,h)\,Y,$$

for $U \stackrel{\mathrm{def}}{=} \operatorname{diag}\{u_1, \dots, u_n\}$.
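The same kind of sketch extends to the local quadratic case: the estimate is a weighted least-squares solution, and the bootstrap version multiplies the kernel weights by $u_i$ (a self-contained, hypothetical implementation):

```python
import numpy as np

def local_quadratic_mle(x, X, Y, h, u=None):
    # solves {Psi W Psi^T} theta = Psi W Y with Psi_i = (1, X_i, X_i^2)^T;
    # with multipliers u the diagonal weight matrix W is replaced by U W
    Psi = np.vstack([np.ones_like(X), X, X ** 2])            # 3 x n design
    w = np.maximum(0.0, 0.75 * (1.0 - ((x - X) / h) ** 2))   # Epanechnikov weights w_i(x, h)
    if u is not None:
        w = w * u
    A = (Psi * w) @ Psi.T                                    # Psi W Psi^T  (3 x 3)
    b = (Psi * w) @ Y                                        # Psi W Y      (3,)
    return np.linalg.solve(A, b)                             # theta(x, h) in R^3
```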

4.3 Simulated data

In the numerical experiments we constructed two 90% simultaneous confidence bands: one using Monte Carlo (MC) samples and one using the bootstrap procedure with Gaussian weights $u_i \sim N(1,1)$; in each case we used $10^{4}$ independent samples of $\{Y_i\}$ and of $\{u_i\}$. The sample size is $n = 400$, and $K(x)$ is the Epanechnikov kernel function. The independent random observations $Y_i$ are generated as follows:

$$Y_i = f(X_i)+N(0,1), \qquad X_i \text{ equidistant on } [0,1], \tag{4.1}$$

$$f(x) = \begin{cases} 5, & x\in[0,0.25]\cup[0.65,1];\\ 5+3.8\{1-100(x-0.35)^{2}\}, & x\in[0.25,0.45];\\ 5-3.8\{1-100(x-0.55)^{2}\}, & x\in[0.45,0.65]. \end{cases} \tag{4.2}$$

The number of local models is $K = 71$; the points $x_1, \dots, x_{71}$ are equidistant on $[0,1]$. For the bandwidth we considered two cases: $h = 0.12$ and $h = 0.3$.
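A short simulation sketch of the design (4.1)-(4.2) (one possible implementation; the seed and variable names are arbitrary):

```python
import numpy as np

def f(x):
    # piecewise target function (4.2); continuous at 0.25, 0.45 and 0.65
    x = np.asarray(x, dtype=float)
    bump_up = 5.0 + 3.8 * (1.0 - 100.0 * (x - 0.35) ** 2)
    bump_down = 5.0 - 3.8 * (1.0 - 100.0 * (x - 0.55) ** 2)
    return np.where((x >= 0.25) & (x < 0.45), bump_up,
                    np.where((x >= 0.45) & (x < 0.65), bump_down, 5.0))

rng = np.random.default_rng(0)
n, K = 400, 71
X = np.linspace(0.0, 1.0, n)          # equidistant design on [0, 1]
Y = f(X) + rng.standard_normal(n)     # observations from model (4.1)
x_grid = np.linspace(0.0, 1.0, K)     # centers x_1, ..., x_71 of the local models
u = rng.normal(1.0, 1.0, size=n)      # one draw of bootstrap weights u_i ~ N(1, 1)
```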

4.4 Effect of the modeling bias on the width of a bootstrap confidence band

The function $f(x)$ defined in (4.2) should yield a considerable modeling bias for both the local constant and the local quadratic estimators. Figures 4.1 and 4.2 demonstrate that the bootstrap confidence bands become conservative (i.e. wider than the MC confidence band) when the local model is misspecified. The top graphs in Figures 4.1 and 4.2 show the 90% confidence bands, the middle graphs show their width, and the bottom graphs show the value of the modelling bias for the $K = 71$ local models (see formulas (4.3) and (4.4) below). For the local constant estimate (Figure 4.1) the width of the bootstrap confidence sets is considerably increased by the modeling bias when $x \in [0.25, 0.65]$. In this case the expression for the modeling bias term of the $k$-th model (see also the $(SmB)_d$ condition) reads as:

$$H_k^{-1}B_k^{2}H_k^{-1} = \frac{\sum_{i=1}^{n}\{\mathbb{E}Y_i-\theta(x_k)\}^{2}w_i^{2}(x_k,h)}{\sum_{i=1}^{n}\mathbb{E}\{Y_i-\theta(x_k)\}^{2}w_i^{2}(x_k,h)} = 1-\left(1+\frac{\sum_{i=1}^{n}w_i^{2}(x_k,h)\{f(X_i)-\theta(x_k)\}^{2}}{\sum_{i=1}^{n}w_i^{2}(x_k,h)}\right)^{-1}. \tag{4.3}$$

And for the local quadratic estimate it holds:

$$H_k^{-1}B_k^{2}H_k^{-1} = I_p-H_k^{-1}\Big\{\sum\nolimits_{i=1}^{n}\Psi_i\Psi_i^{\top}w_i^{2}(x_k,h)\Big\}H_k^{-1}, \tag{4.4}$$

where $I_p$ is the identity matrix of dimension $p\times p$ (here $p = 3$), and

$$H_k^{2} = \sum_{i=1}^{n}\Psi_i\Psi_i^{\top}w_i^{2}(x_k,h)\,\mathbb{E}\{Y_i-\Psi_i^{\top}\theta(x_k)\}^{2} = \sum_{i=1}^{n}\Psi_i\Psi_i^{\top}w_i^{2}(x_k,h)\{f(X_i)-\Psi_i^{\top}\theta(x_k)\}^{2}+\sum_{i=1}^{n}\Psi_i\Psi_i^{\top}w_i^{2}(x_k,h). \tag{4.5}$$

Therefore, if $\max_{1\leq k\leq K}\{f(X_i)-\theta(x_k)\}^{2} = 0$ (for all $i$), then $H_k^{-1}B_k^{2}H_k^{-1} = 0$. In Figure 4.1 both the modelling bias and the difference between the widths of the bootstrap and MC confidence bands are close to zero in the regions where the true function $f(x)$ is constant. In Figure 4.2 the modelling bias for $h = 0.12$ is overall smaller than the corresponding value in Figure 4.1. For the bigger bandwidth $h = 0.3$ the modelling biases in Figures 4.1 and 4.2 are comparable with each other.

Thus the numerical experiment is consistent with the theoretical results from Section 3.2, and confirms that in the case when a (local) parametric model is close to the true distribution the simultaneous bootstrap confidence set is valid. Otherwise the bootstrap procedure is conservative: the modelling bias widens the simultaneous bootstrap confidence set.
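To make the bias term concrete, a sketch computing (4.3) on the grid of local models, assuming $\operatorname{Var}(Y_i) = 1$ as in (4.1), so that $\mathbb{E}\{Y_i-\theta(x_k)\}^{2} = 1+\{f(X_i)-\theta(x_k)\}^{2}$ (the helper name is illustrative; `f`, `X`, `x_grid` are as in the data-generation sketch above):

```python
import numpy as np

def modeling_bias_lc(xk, X, h, f):
    # scalar term H_k^{-1} B_k^2 H_k^{-1} from (4.3) for the local constant model at x_k
    w = np.maximum(0.0, 0.75 * (1.0 - ((xk - X) / h) ** 2))   # Epanechnikov weights
    theta_k = np.sum(w * f(X)) / np.sum(w)                    # target theta(x_k, h), IE Y_i = f(X_i)
    num = np.sum(w ** 2 * (f(X) - theta_k) ** 2)              # squared-bias part
    return num / (num + np.sum(w ** 2))                       # = 1 - (1 + num / sum w^2)^{-1}

# example: bias profile over the K = 71 local models
# bias = [modeling_bias_lc(xk, X, h=0.12, f=f) for xk in x_grid]
```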

4.5 Effective coverage probability (local constant estimate)

In this part of the experiment we check the bootstrap validity by computing the effective coverage probabilities. This requires performing many independent experiments: for each of 5000 independent samples $\{Y_i\}$ from (4.1) we took $10^{4}$ independent bootstrap samples $\{u_i\} \sim N(1,1)$, and constructed simultaneous bootstrap confidence sets for a range of confidence levels. The second row of Table 1 contains this range: $(1-\alpha) = 0.95, 0.90, \dots, 0.50$.

Figure 4.1: Local constant regression: confidence bands, their widths, and the modeling bias (bandwidths $h = 0.12$ and $h = 0.3$). Top graphs: the 90% bootstrap and MC simultaneous confidence bands, the true function $f(x)$, the local constant MLE, and the smoothed target function. Middle and bottom graphs: the widths of the 90% bootstrap and MC confidence bands, and the modeling bias from expression (4.3).

Figure 4.2: Local quadratic regression: confidence bands, their widths, and the modeling bias (bandwidths $h = 0.12$ and $h = 0.3$). Top graphs: the 90% bootstrap and MC simultaneous confidence bands, the true function $f(x)$, the local quadratic MLE, and the smoothed target function. Middle and bottom graphs: the widths of the 90% bootstrap and MC confidence bands, and the modeling bias from expression (4.4).

The third and the fourth rows of Table 1 show the frequencies of the event

$$\max_{1\leq k\leq K}\Big\{\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)}-z^{\flat}_k\big(c^{\flat}(\alpha)\big)\Big\} \;\leq\; 0$$

among the 5000 data samples, for the bandwidths $h = 0.12, 0.3$, and for the range of $(1-\alpha)$. The results show that the bootstrap procedure is rather conservative for both $h = 0.12$ and $h = 0.3$; however, the larger bandwidth yields bigger coverage probabilities.

Table 1: Effective coverage probabilities for the local constant regression

Confidence levels (1-alpha):
h      0.95   0.90   0.85   0.80   0.75   0.70   0.65   0.60   0.55   0.50
0.12   0.971  0.947  0.917  0.888  0.863  0.830  0.800  0.769  0.738  0.702
0.3    0.982  0.963  0.942  0.918  0.895  0.868  0.842  0.815  0.784  0.750
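A compressed sketch of this coverage experiment for the local constant case (replication counts are reduced for speed; for the quadratic likelihood of Section 4.1 the statistic $\sqrt{2L_k(\tilde{\theta}_k)-2L_k(\theta^{*}_k)}$ reduces to $\sqrt{\sum_i w_i}\,|\tilde{\theta}_k-\theta^{*}_k|$, and its bootstrap analog to $\sqrt{\sum_i w_iu_i}\,|\tilde{\theta}^{\flat}_k-\tilde{\theta}_k|$; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, B, h, alpha = 400, 71, 1000, 0.12, 0.10       # paper: 5000 samples, 10^4 bootstrap draws

def f(x):
    up = 5.0 + 3.8 * (1.0 - 100.0 * (x - 0.35) ** 2)
    dn = 5.0 - 3.8 * (1.0 - 100.0 * (x - 0.55) ** 2)
    return np.where((x >= 0.25) & (x < 0.45), up,
                    np.where((x >= 0.45) & (x < 0.65), dn, 5.0))

X = np.linspace(0.0, 1.0, n)
x_grid = np.linspace(0.0, 1.0, K)
W = np.maximum(0.0, 0.75 * (1.0 - ((x_grid[:, None] - X[None, :]) / h) ** 2))  # K x n
sw = W.sum(axis=1)                                  # sum_i w_i(x_k, h)
theta_star = W @ f(X) / sw                          # targets theta(x_k, h)

def stat_boot(Y, u):
    # sqrt(2 L_k^flat(theta-tilde_k^flat) - 2 L_k^flat(theta-tilde_k)) for all k at once
    Wu = W * u                                      # multiplier-reweighted kernel weights
    swu = Wu.sum(axis=1)                            # can be near zero in rare draws
    return np.sqrt(np.abs(swu)) * np.abs(Wu @ Y / swu - W @ Y / sw)

def corrected_level(S, alpha):
    # largest marginal level c (on a grid between Bonferroni and alpha) such that
    # the simultaneous non-coverage of the per-model (1 - c)-quantiles stays <= alpha
    best = alpha / S.shape[1]
    for c in np.linspace(alpha / S.shape[1], alpha, 100):
        z = np.quantile(S, 1.0 - c, axis=0)
        if ((S > z[None, :]).any(axis=1)).mean() <= alpha:
            best = c
    return best

cover, n_rep = 0, 100                               # paper uses 5000 repetitions
for _ in range(n_rep):
    Y = f(X) + rng.standard_normal(n)
    S = np.stack([stat_boot(Y, rng.normal(1.0, 1.0, n)) for _ in range(B)])
    z_b = np.quantile(S, 1.0 - corrected_level(S, alpha), axis=0)   # z_k^flat(c^flat(alpha))
    cover += (np.sqrt(sw) * np.abs(W @ Y / sw - theta_star) <= z_b).all()
print("effective coverage:", cover / n_rep)
```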

4.6 Correction for multiplicity

Here we compare the $Y$-world and the bootstrap corrections for multiplicity, i.e. the values $c(\alpha)$ and $c^{\flat}(\alpha)$ defined in (1.8) and (2.4). The numerical results in Tables 2 and 3 are based on $10^{4}$ independent samples $\{Y_i\} \sim (4.1)$ and $10^{4}$ independent bootstrap samples $\{u_i\} \sim N(1,1)$. The second line in Tables 2 and 3 contains the range of the nominal confidence levels $(1-\alpha) = 0.95, 0.90, \dots, 0.50$ (similarly to Table 1). The first column contains the values of the bandwidth $h = 0.12, 0.3$, and the second column the resampling scheme: Monte Carlo (MC) or bootstrap (B). The Monte Carlo experiment yields the corrected confidence levels $1-c(\alpha)$, and the bootstrap yields $1-c^{\flat}(\alpha)$. Lines 3-6 contain the average values of $1-c(\alpha)$ and $1-c^{\flat}(\alpha)$ over all the experiments. The results show that for the smaller bandwidth both the MC and the bootstrap corrections are bigger than the ones for the larger bandwidth. In the case of a smaller bandwidth the local models have fewer intersections with each other, and hence the corrections for multiplicity are closer to the Bonferroni bound.
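The MC rows of Tables 2 and 3 can be produced the same way from $Y$-world statistics; a brief continuation of the coverage sketch above (it reuses `W`, `sw`, `theta_star`, `f`, `X`, `rng` and `corrected_level` defined there):

```python
import numpy as np

# Monte Carlo analog of the correction: c(alpha) from the Y-world statistics themselves
S_mc = np.stack([np.sqrt(sw) * np.abs(W @ (f(X) + rng.standard_normal(400)) / sw - theta_star)
                 for _ in range(10_000)])
print("1 - c(alpha) for alpha = 0.10:", 1.0 - corrected_level(S_mc, 0.10))
```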

Remark 4.1. The theoretical results of this paper can be extended to the case when the set of considered local models has the cardinality of the continuum, and the confidence bands are uniform w.r.t. the local parameter. This extension would require some uniform statements, such as a locally uniform square-root Wilks approximation (see e.g. Spokoiny and Zhilova (2013)).

Remark 4.2. The use of the bootstrap procedure in the problem of choosing an optimal bandwidth is considered in Spokoiny and Willrich (2015).

Table 2: Local constant regression: MC vs bootstrap confidence levels corrected for multiplicity

Confidence levels (1-alpha):
h      r.m.   0.95   0.90   0.85   0.80   0.75   0.70   0.65   0.60   0.55   0.50
0.12   MC     0.997  0.994  0.989  0.985  0.980  0.975  0.969  0.963  0.956  0.949
0.12   B      0.998  0.995  0.991  0.988  0.984  0.979  0.975  0.969  0.963  0.957
0.3    MC     0.993  0.983  0.973  0.962  0.949  0.936  0.922  0.906  0.891  0.873
0.3    B      0.994  0.986  0.977  0.968  0.958  0.947  0.935  0.922  0.908  0.893

Table 3: Local quadratic regression: MC vs bootstrap confidence levels corrected for multiplicity

Confidence levels (1-alpha):
h      r.m.   0.95   0.90   0.85   0.80   0.75   0.70   0.65   0.60   0.55   0.50
0.12   MC     0.997  0.993  0.989  0.985  0.979  0.974  0.968  0.961  0.954  0.946
0.12   B      0.998  0.995  0.991  0.988  0.984  0.979  0.974  0.969  0.963  0.956
0.3    MC     0.993  0.983  0.973  0.961  0.949  0.936  0.921  0.904  0.887  0.868
0.3    B      0.996  0.991  0.985  0.978  0.971  0.963  0.954  0.944  0.934  0.923

5 Conditions

Here we state the conditions necessary for the main results. The conditions in Section 5.1 come from the general finite sample theory by Spokoiny (2012a); they are required for the results of Sections B.1 and B.2. The conditions in Section 5.2 are necessary to prove the statements on the multiplier bootstrap validity.

5.1 Basic conditions

Introduce the stochastic part of the $k$-th likelihood process: $\zeta_k(\theta) \stackrel{\mathrm{def}}{=} L_k(\theta)-\mathbb{E}L_k(\theta)$, and its marginal summand: $\zeta_{i,k}(\theta) \stackrel{\mathrm{def}}{=} \ell_{i,k}(\theta)-\mathbb{E}\ell_{i,k}(\theta)$ for $\ell_{i,k}(\theta)$ defined in (2.1).

(ED0) For each $k = 1, \dots, K$ there exist a positive-definite $p_k\times p_k$ symmetric matrix $V_k^{2}$ and constants $g_k > 0$, $\nu_k \geq 1$ such that $\operatorname{Var}\{\nabla_{\theta}\zeta_k(\theta^{*}_k)\} \leq V_k^{2}$ and

$$\sup_{\gamma\in\mathbb{R}^{p_k}}\log\mathbb{E}\exp\Big\{\lambda\,\frac{\gamma^{\top}\nabla_{\theta}\zeta_k(\theta^{*}_k)}{\|V_k\gamma\|}\Big\} \leq \nu_k^{2}\lambda^{2}/2, \qquad |\lambda| \leq g_k.$$

(ED2) For each $k = 1, \dots, K$ there exist a constant $\omega_k > 0$ and for each $r > 0$ a constant $g_{2,k}(r)$ such that it holds for all $\theta\in\Theta_{0,k}(r)$ and for $j = 1, 2$:

$$\sup_{\gamma_j\in\mathbb{R}^{p_k},\,\|\gamma_j\|\leq 1}\log\mathbb{E}\exp\Big\{\frac{\lambda}{\omega_k}\,\gamma_1^{\top}D_k^{-1}\nabla^{2}_{\theta}\zeta_k(\theta)D_k^{-1}\gamma_2\Big\} \leq \nu_k^{2}\lambda^{2}/2, \qquad |\lambda| \leq g_{2,k}(r).$$

(L0) For each $k = 1, \dots, K$ and each $r > 0$ there exists a constant $\delta_k(r) \geq 0$ such that $\delta_k(r) \leq 1/2$ for $r \leq r_{0,k}$ ($r_{0,k}$ comes from condition (B.1) of Theorem B.1 in Section B.1), and for all $\theta\in\Theta_{0,k}(r)$ it holds

$$\|D_k^{-1}\check{D}_k^{2}(\theta)D_k^{-1}-I_{p_k}\| \leq \delta_k(r),$$

where $\check{D}_k^{2}(\theta) \stackrel{\mathrm{def}}{=} -\nabla^{2}_{\theta}\mathbb{E}L_k(\theta)$ and $\Theta_{0,k}(r) \stackrel{\mathrm{def}}{=} \{\theta\in\Theta_k : \|D_k(\theta-\theta^{*}_k)\| \leq r\}$.

(I) There exist constants $a_k > 0$ for all $k = 1, \dots, K$ s.t. $a_k^{2}D_k^{2} \geq V_k^{2}$. Denote $\widehat{a}^{2} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}a_k^{2}$.

(Lr) For each $k = 1, \dots, K$ and $r \geq r_{0,k}$ there exists a value $b_k(r) > 0$ s.t. $rb_k(r)\to\infty$ as $r\to\infty$, and for all $\theta\in\Theta_k$ with $\|D_k(\theta-\theta^{*}_k)\| = r$ it holds

$$-2\{\mathbb{E}L_k(\theta)-\mathbb{E}L_k(\theta^{*}_k)\} \geq r^{2}b_k(r).$$

5.2 Conditions required for the bootstrap validity

$(SmB)_d$ There exists a constant $\widehat{\delta}_{smb} \geq 0$ such that it holds for the matrices $B_k^{2}$ and $H_k^{2}$ defined in (3.5):

$$\max_{1\leq k\leq K}\big\|H_k^{-1}B_k^{2}H_k^{-1}\big\| \leq \widehat{\delta}_{smb}^{2}, \qquad \widehat{\delta}_{smb}^{2} \leq C\Big(\frac{n}{p^{13}_{\max}}\Big)^{1/8}\log^{-7/8}(K)\log^{-3/8}(np_{\mathrm{sum}}).$$

(ED2m) For each $k = 1, \dots, K$, $r > 0$, $i = 1, \dots, n$, $j = 1, 2$ and for all $\theta\in\Theta_{0,k}(r)$ it holds for the values $\omega_k \geq 0$ and $g_{2,k}(r)$ from condition (ED2):

$$\sup_{\gamma_j\in\mathbb{R}^{p_k},\,\|\gamma_j\|\leq 1}\log\mathbb{E}\exp\Big\{\frac{\lambda}{\omega_k}\,\gamma_1^{\top}D_k^{-1}\nabla^{2}_{\theta}\zeta_{i,k}(\theta)D_k^{-1}\gamma_2\Big\} \leq \frac{\nu_0^{2}\lambda^{2}}{2n}, \qquad |\lambda| \leq g_{2,k}(r).$$

(L0m) For each $k = 1, \dots, K$, $r > 0$, $i = 1, \dots, n$ and for all $\theta\in\Theta_{0,k}(r)$ there exists a value $C_{m,k}(r) \geq 0$ such that

$$\|D_k^{-1}\nabla^{2}_{\theta}\mathbb{E}\ell_{i,k}(\theta)D_k^{-1}\| \leq C_{m,k}(r)\,n^{-1}.$$

(IB) For each $k = 1, \dots, K$ there exists a constant $a_{B,k}^{2} > 0$ s.t. $a_{B,k}^{2}D_k^{2} \geq B_k^{2}$. Denote $\widehat{a}_B^{2} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}a_{B,k}^{2}$.

(SD1) There exists a constant $0 \leq \delta_{v}^{2} \leq Cp_{\mathrm{sum}}/n$ such that it holds for all $i = 1, \dots, n$ with exponentially high probability:

$$\Big\|\widehat{H}^{-1}\big\{g_ig_i^{\top}-\mathbb{E}\big[g_ig_i^{\top}\big]\big\}\widehat{H}^{-1}\Big\| \leq \delta_v^{2},$$

where

$$g_i \stackrel{\mathrm{def}}{=} \big(\nabla_{\theta}\ell_{i,1}(\theta^{*}_1)^{\top}, \dots, \nabla_{\theta}\ell_{i,K}(\theta^{*}_K)^{\top}\big)^{\top}\in\mathbb{R}^{p_{\mathrm{sum}}}, \qquad \widehat{H}^{2} \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n}\mathbb{E}\big\{g_ig_i^{\top}\big\}, \qquad p_{\mathrm{sum}} \stackrel{\mathrm{def}}{=} p_1+\dots+p_K.$$

(Eb) The i.i.d. bootstrap weights $u_i$ are independent of $Y$, and for all $i = 1, \dots, n$ it holds for some constants $g > 0$, $\nu_0 \geq 1$:

$$\mathbb{E}u_i = 1, \qquad \operatorname{Var}u_i = 1, \qquad \log\mathbb{E}\exp\{\lambda(u_i-1)\} \leq \nu_0^{2}\lambda^{2}/2, \quad |\lambda| \leq g.$$
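For instance, the Gaussian weights $u_i\sim N(1,1)$ used in Section 4 satisfy (Eb) exactly, since by the normal moment generating function

$$\log\mathbb{E}\exp\{\lambda(u_i-1)\} = \log\mathbb{E}\exp\{\lambda\varepsilon\} = \frac{\lambda^{2}}{2}, \qquad \varepsilon\sim N(0,1),$$

so that (Eb) holds with $\nu_0 = 1$ and an arbitrarily large $g$.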

5.3 Dependence of the involved terms on the sample size and the cardinality of the parameters' set

Here we consider the case of i.i.d. observations $Y_1, \dots, Y_n$ and $x = C\log n$, in order to specify the dependence of the non-asymptotic bounds on $n$ and $p$. In the paper by Spokoiny and Zhilova (2014) (the version of 2015) this is done in detail for the i.i.d. case, the generalized linear model and quantile regression.

Example 5.1 in Spokoiny (2012a) demonstrates that in this situation $g_k = C\sqrt{n}$ and $\omega_k = C/\sqrt{n}$; then $Z_k(x) = C\sqrt{p_k+x}$ for some constant $C \geq 1.85$, for the function $Z_k(x)$ given in (B.3) in Section B.1. Similarly it can be checked that $g_{2,k}(r)$ from condition (ED2) is proportional to $\sqrt{n}$: due to the independence of the observations

$$\log\mathbb{E}\exp\Big\{\frac{\lambda}{\omega_k}\,\gamma_1^{\top}D_k^{-1}\nabla^{2}_{\theta}\zeta_k(\theta)D_k^{-1}\gamma_2\Big\} = \sum_{i=1}^{n}\log\mathbb{E}\exp\Big\{\frac{\lambda}{\sqrt{n}\,\omega_k}\,\gamma_1^{\top}d_k^{-1}\nabla^{2}_{\theta}\zeta_{i,k}(\theta)d_k^{-1}\gamma_2\Big\} \leq n\,\frac{\lambda^{2}}{n}\,C \qquad \text{for } |\lambda| \leq g_{2,k}(r)\sqrt{n},$$

where $\zeta_{i,k}(\theta) \stackrel{\mathrm{def}}{=} \ell_{i,k}(\theta)-\mathbb{E}\ell_{i,k}(\theta)$, $d_k^{2} \stackrel{\mathrm{def}}{=} -\nabla^{2}_{\theta}\mathbb{E}\ell_{i,k}(\theta^{*}_k)$, and $D_k^{2} = nd_k^{2}$ in the i.i.d. case. Here $g_{2,k}(r)$ on the right-hand side denotes the marginal analog of the function $g_{2,k}(r)$ from (ED2).

Let us show that for the value $\delta_k(r)$ from condition (L0) it holds $\delta_k(r) = Cr/\sqrt{n}$. Suppose for all $\theta\in\Theta_{0,k}(r)$ and $\gamma\in\mathbb{R}^{p_k}$ with $\|\gamma\| = 1$ that $\|D_k^{-1}\gamma^{\top}\nabla^{3}_{\theta}\mathbb{E}L_k(\theta)D_k^{-1}\| \leq C$; then it holds for some $\theta\in\Theta_{0,k}(r)$:

$$\|D_k^{-1}\check{D}^{2}(\theta)D_k^{-1}-I_{p_k}\| = \|D_k^{-1}(\theta^{*}_k-\theta)^{\top}\nabla^{3}_{\theta}\mathbb{E}L_k(\theta)D_k^{-1}\| = \|D_k^{-1}(\theta^{*}_k-\theta)^{\top}D_kD_k^{-1}\nabla^{3}_{\theta}\mathbb{E}L_k(\theta)D_k^{-1}\| \leq r\|D_k^{-1}\|\,\|D_k^{-1}\gamma^{\top}\nabla^{3}_{\theta}\mathbb{E}L_k(\theta)D_k^{-1}\| \leq Cr/\sqrt{n}.$$

Similarly $C_{m,k}(r) \leq Cr/\sqrt{n}+C$ in condition (L0m).

The next remark helps to check the global identifiability condition (Lr) in many situations. Suppose that the parameter domain $\Theta_k$ is compact and $n$ is sufficiently large; then the value $b_k(r)$ from condition (Lr) can be taken as $C\{1-r/\sqrt{n}\}\approx C$. Indeed, for $\theta$ with $\|D_k(\theta-\theta^{*}_k)\| = r$:

$$-2\{\mathbb{E}L_k(\theta)-\mathbb{E}L_k(\theta^{*}_k)\} \geq r^{2}\big\{1-r\|D_k^{-1}\|\,\|D_k^{-1}\gamma^{\top}\nabla^{3}_{\theta}\mathbb{E}L_k(\theta)D_k^{-1}\|\big\} \geq r^{2}(1-Cr/\sqrt{n}).$$

Due to the obtained orders, the conditions (B.1) and (B.9) of Theorems B.1 and B.5 on concentration of the MLEs $\tilde{\theta}_k, \tilde{\theta}^{\flat}_k$ require $r_{0,k} \geq C\sqrt{p_k+x}$.

A Approximation of the joint distributions of $\ell_2$-norms

Let us first introduce some notation:

$1_K \stackrel{\mathrm{def}}{=} (1, \dots, 1)^{\top}\in\mathbb{R}^{K}$;
$\|\cdot\|$ is the Euclidean norm for a vector and the spectral norm for a matrix;
$\|\cdot\|_{\max}$ is the maximum of the absolute values of the elements of a vector or a matrix;
$\|\cdot\|_{1}$ is the sum of the absolute values of the elements of a vector or a matrix.

Consider $K$ random centered vectors $\phi_k\in\mathbb{R}^{p_k}$ for $k = 1, \dots, K$. Each vector equals a sum of $n$ centered independent vectors:

$$\phi_k = \phi_{k,1}+\dots+\phi_{k,n}, \qquad \mathbb{E}\phi_k = \mathbb{E}\phi_{k,i} = 0 \quad \forall\, 1\leq i\leq n. \tag{A.1}$$

Introduce similarly the vectors $\psi_k\in\mathbb{R}^{p_k}$ for $k = 1, \dots, K$:

$$\psi_k = \psi_{k,1}+\dots+\psi_{k,n}, \qquad \mathbb{E}\psi_k = \mathbb{E}\psi_{k,i} = 0 \quad \forall\, 1\leq i\leq n, \tag{A.2}$$

with the same independence properties as $\phi_{k,i}$, and also independent of all $\phi_{k,i}$. The goal of this section is to compare the joint distributions of the $\ell_2$-norms of the sets of vectors $\phi_k$ and $\psi_k$, $k = 1, \dots, K$ (i.e. the probability laws $\mathcal{L}(\|\phi_1\|, \dots, \|\phi_K\|)$ and $\mathcal{L}(\|\psi_1\|, \dots, \|\psi_K\|)$), assuming that their correlation structures are close to each other.

Denote

$$p_{\max} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}p_k, \qquad p_{\mathrm{sum}} \stackrel{\mathrm{def}}{=} p_1+\dots+p_K, \qquad \lambda^{2}_{\phi,\max} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}\|\operatorname{Var}(\phi_k)\|, \qquad \lambda^{2}_{\psi,\max} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}\|\operatorname{Var}(\psi_k)\|,$$

$$z_{\max} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}z_k, \qquad z_{\min} \stackrel{\mathrm{def}}{=} \min_{1\leq k\leq K}z_k, \qquad \delta_{z,\max} \stackrel{\mathrm{def}}{=} \max_{1\leq k\leq K}\delta_{z_k}, \qquad \delta_{z,\min} \stackrel{\mathrm{def}}{=} \min_{1\leq k\leq K}\delta_{z_k};$$

let also

$$\Delta_{\varepsilon} \stackrel{\mathrm{def}}{=} \Big(\frac{p^{3}_{\max}}{n}\Big)^{1/8}\log^{9/16}(K)\log^{3/8}(np_{\mathrm{sum}})\,z_{\min}^{1/8}\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}^{3/4}\log^{-1/8}(5n^{1/2}). \tag{A.3}$$

The following conditions are necessary for Proposition A.1.

(C1) For some $g_k, \nu_k, c_{\phi}, c_{\psi} > 0$ and for all $i = 1, \dots, n$, $k = 1, \dots, K$:

$$\sup_{\gamma_k\in\mathbb{R}^{p_k},\,\|\gamma_k\|=1}\log\mathbb{E}\exp\big\{\lambda\sqrt{n}\,\gamma_k^{\top}\phi_{k,i}/c_{\phi}\big\} \leq \lambda^{2}\nu_k^{2}/2, \qquad |\lambda| < g_k,$$

$$\sup_{\gamma_k\in\mathbb{R}^{p_k},\,\|\gamma_k\|=1}\log\mathbb{E}\exp\big\{\lambda\sqrt{n}\,\gamma_k^{\top}\psi_{k,i}/c_{\psi}\big\} \leq \lambda^{2}\nu_k^{2}/2, \qquad |\lambda| < g_k,$$

where $c_{\phi} \geq C\lambda_{\phi,\max}$ and $c_{\psi} \geq C\lambda_{\psi,\max}$.

(C2) For some $\delta^{2}_{\Sigma} \geq 0$

$$\max_{1\leq k_1,k_2\leq K}\big\|\operatorname{Cov}(\phi_{k_1},\phi_{k_2})-\operatorname{Cov}(\psi_{k_1},\psi_{k_2})\big\|_{\max} \leq \delta^{2}_{\Sigma}. \tag{A.4}$$

Proposition A.1 (Approximation of the joint distributions of $\ell_2$-norms). Consider the centered random vectors $\phi_1, \dots, \phi_K$ and $\psi_1, \dots, \psi_K$ given in (A.1), (A.2). Let the conditions (C1) and (C2) be fulfilled, and let the values $z_k \geq \sqrt{p_k}+\Delta_{\varepsilon}$ and $\delta_{z_k} \geq 0$ be s.t. $C\max\{n^{-1/2}, \delta_{z,\max}\} \leq \Delta_{\varepsilon} \leq Cz_{\max}^{-1}$; then it holds with dominating probability:

$$\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\{\|\phi_k\| > z_k\}\Big)-\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\{\|\psi_k\| > z_k-\delta_{z_k}\}\Big) \geq -\Delta_{\ell_2},$$

$$\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\{\|\phi_k\| > z_k\}\Big)-\mathbb{P}\Big(\bigcup\nolimits_{k=1}^{K}\{\|\psi_k\| > z_k+\delta_{z_k}\}\Big) \leq \Delta_{\ell_2},$$

for the deterministic non-negative value

$$\Delta_{\ell_2} \leq 12.5C\Big(\frac{p^{3}_{\max}}{n}\Big)^{1/8}\log^{9/8}(K)\log^{3/8}(np_{\mathrm{sum}})\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}^{3/4}$$
$$\qquad +3.2C\delta^{2}_{\Sigma}\Big(\frac{p^{3}_{\max}}{n}\Big)^{1/4}p_{\max}z_{\min}^{1/2}\log^{2}(K)\log^{3/4}(np_{\mathrm{sum}})\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}^{7/2}$$
$$\leq 25C\Big(\frac{p^{3}_{\max}}{n}\Big)^{1/8}\log^{9/8}(K)\log^{3/8}(np_{\mathrm{sum}})\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}^{3/4},$$

where the last inequality holds for

$$\delta^{2}_{\Sigma} \leq 4C\Big(\frac{n}{p^{13}_{\max}}\Big)^{1/8}\log^{-7/8}(K)\log^{-3/8}(np_{\mathrm{sum}})\big(\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}\big)^{-11/4}.$$

Remark A.1. The approximating error term $\Delta_{\ell_2}$ consists of three errors, which correspond to: the Gaussian approximation result (Lemma A.2), the Gaussian comparison (Lemma A.7), and the anti-concentration inequality (Lemma A.8). The bound on $\Delta_{\ell_2}$ above implies that the number $K$ of the random vectors $\phi_1, \dots, \phi_K$ should satisfy $\log K \lesssim (n/p^{3}_{\max})^{1/12}$ in order to keep the approximating error term $\Delta_{\ell_2}$ small. This condition can be relaxed by using a sharper Gaussian approximation result. For instance, using in Lemma A.2 the Slepian-Stein technique plus the induction argument from the recent paper by Chernozhukov et al. (2014b), instead of the Lindeberg approach, would lead to the improved bound $C(p^{3}_{\max}/n)^{1/6}$ multiplied by a logarithmic term.

A.1 Joint Gaussian approximation of $\ell_2$-norms of sums of independent vectors by Lindeberg's method

Introduce the following random vectors in $\mathbb{R}^{p_{\mathrm{sum}}}$:

$$\Phi \stackrel{\mathrm{def}}{=} \big(\phi_1^{\top}, \dots, \phi_K^{\top}\big)^{\top}, \qquad \Phi_i \stackrel{\mathrm{def}}{=} \big(\phi_{1,i}^{\top}, \dots, \phi_{K,i}^{\top}\big)^{\top}, \quad i = 1, \dots, n, \qquad \Phi = \sum_{i=1}^{n}\Phi_i, \qquad \mathbb{E}\Phi = \mathbb{E}\Phi_i = 0. \tag{A.5}$$

Define their Gaussian analogs $\overline{\Phi}$, $\overline{\Phi}_i$, $\overline{\phi}_j$ as in (A.6)-(A.8).

Lemma A.2 (Joint GAR with equal covariance matrices). Consider the sets of random vectors $\phi_j$ and $\overline{\phi}_j$, $j = 1, \dots, K$, defined in (A.1) and (A.5)-(A.8). If the conditions of Lemmas A.4 and A.5 are fulfilled, then the corresponding Gaussian approximation bound holds for all $\Delta, \beta > 0$ and $z_j \geq \max\big\{\Delta+\sqrt{p_j},\, 2.25\log(K)/\beta\big\}$ with dominating probability.

Proof of Lemma A.2. Let us approximate the $\max_{1\leq j\leq K}$ function using the smooth maximum

$$h_{\beta}(\{x_j\}) \stackrel{\mathrm{def}}{=} \beta^{-1}\log\Big\{\sum\nolimits_{j=1}^{K}\exp(\beta x_j)\Big\}, \qquad \max_{1\leq j\leq K}x_j \;\leq\; h_{\beta}(\{x_j\}) \;\leq\; \max_{1\leq j\leq K}x_j+\beta^{-1}\log K,$$

and the indicator function $1\!\mathrm{I}\{x > 0\}$ with a three times differentiable function $g(x)$ growing monotonously from 0 to 1; the resulting bound holds whenever $z_j \geq 2.25\log(K)/\beta$. Denote $z \stackrel{\mathrm{def}}{=} (z_1, \dots, z_K)^{\top}\in\mathbb{R}^{K}$, $z_j > 0$.
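As a quick numerical sanity check of the smooth-maximum bound above (a sketch; the names are illustrative):

```python
import numpy as np

def h_beta(x, beta):
    # beta^{-1} log sum_j exp(beta x_j), computed stably via the log-sum-exp trick
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.sum(np.exp(beta * (x - m)))) / beta

x = np.array([0.3, -1.2, 0.25, 2.0])
for beta in (1.0, 10.0, 100.0):
    gap = h_beta(x, beta) - x.max()          # always in [0, log(K)/beta]
    print(f"beta={beta:6.1f}  gap={gap:.6f}  bound={np.log(x.size)/beta:.6f}")
```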

Then by (A.10) and (A.11), Lemma A.6 checks that $F_{\Delta,\beta}(\cdot, z)$ admits applying Lindeberg's telescopic sum device (see Lindeberg (1922)) in order to approximate $\mathbb{E}F_{\Delta,\beta}(\Phi, z)$ with $\mathbb{E}F_{\Delta,\beta}(\overline{\Phi}, z)$. The difference $F_{\Delta,\beta}(\Phi, z)-F_{\Delta,\beta}(\overline{\Phi}, z)$ can be represented as the telescopic sum

$$F_{\Delta,\beta}(\Phi, z)-F_{\Delta,\beta}(\overline{\Phi}, z) = \sum_{i=1}^{n}\big\{F_{\Delta,\beta}(S_i+\Phi_i, z)-F_{\Delta,\beta}(S_i+\overline{\Phi}_i, z)\big\}, \qquad S_i \stackrel{\mathrm{def}}{=} \sum_{m=1}^{i-1}\overline{\Phi}_m+\sum_{m=i+1}^{n}\Phi_m,$$

with $S_{j,i}$ denoting the $j$-th sub-vector of $S_i$ in $\mathbb{R}^{p_j}$; the same bound holds for $\big(\mathbb{E}\max_{1\leq j\leq K}\|S_{j,i}+\overline{\phi}_{j,i}\|^{6}\big)^{1/2}$. The remaining terms are bounded with probability $\geq 1-6\exp(-x)$, and the derived bounds imply the statement of the lemma.

The next lemma is formulated separately, since it is used in the proof of another result.

Lemma A.3 (Smooth uniform GAR). Under the conditions of Lemma A.2, the inequalities 1.1, 1.2, 2.1 and 2.2 hold with dominating probability for the function $F_{\Delta,\beta}(\cdot, z)$ given in (A.12).

Proof of Lemma A.3. The first inequality 1.1 is obtained in (A.16); the second inequality 1.2 follows similarly from (A.14) and (A.15). The inequalities 2.1 and 2.2 are given in (A.13) and (A.14).

Lemma A.4. Let for some $c_{\phi}, g_1, \nu_0 > 0$ and for all $i = 1, \dots, n$, $j = 1, \dots, p_{\mathrm{sum}}$

$$\log\mathbb{E}\exp\big\{\lambda\sqrt{n}\,|\phi_{ji}|/c_{\phi}\big\} \leq \lambda^{2}\nu_0^{2}/2, \qquad |\lambda| < g_1,$$

where $\phi_{ji}$ denotes the $j$-th coordinate of the vector $\Phi_i$. Then it holds for all $i = 1, \dots, n$ and $m, t > 0$:

$$\mathbb{P}\Big(\max_{1\leq j\leq p_{\mathrm{sum}}}|\phi_{ji}|^{m} > t\Big) \leq \exp\Big\{-\frac{nt^{2/m}}{2c^{2}_{\phi}\nu_0^{2}}+\log(p_{\mathrm{sum}})\Big\}.$$

Proof of Lemma A.4. Let us bound $\max_j|\phi_{ji}|$ using the following bound for the maximum:

$$\max_{1\leq j\leq p_{\mathrm{sum}}}|\phi_{ji}| \leq \log\Big\{\sum\nolimits_{j=1}^{p_{\mathrm{sum}}}\exp|\phi_{ji}|\Big\}.$$

By the lemma's condition

$$\mathbb{E}\exp\Big\{\max_{1\leq j\leq p_{\mathrm{sum}}}\frac{\lambda\sqrt{n}}{c_{\phi}}|\phi_{ji}|\Big\} \leq \exp\big\{\lambda^{2}\nu_0^{2}/2+\log p_{\mathrm{sum}}\big\}.$$

Thus, the statement follows from the exponential Chebyshev inequality.
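Spelled out, a sketch of the omitted optimization step: for $0 < \lambda < g_1$ the exponential Chebyshev inequality gives

$$\mathbb{P}\Big(\max_{1\leq j\leq p_{\mathrm{sum}}}|\phi_{ji}| > s\Big) \leq \exp\Big\{-\frac{\lambda\sqrt{n}\,s}{c_{\phi}}+\frac{\lambda^{2}\nu_0^{2}}{2}+\log(p_{\mathrm{sum}})\Big\},$$

and choosing $\lambda = \sqrt{n}\,s/(c_{\phi}\nu_0^{2})$ (admissible for $g_1$ large enough) together with $s = t^{1/m}$ yields the claimed bound.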

Lemma A.5. If for the centered random vectors $\overline{\phi}_j\in\mathbb{R}^{p_j}$, $j = 1, \dots, K$,

$$\sup_{\gamma\in\mathbb{R}^{p_j},\,\|\gamma\|\neq 0}\log\mathbb{E}\exp\Big\{\lambda\,\frac{\gamma^{\top}\overline{\phi}_j}{\|\operatorname{Var}^{1/2}(\overline{\phi}_j)\gamma\|}\Big\} \leq \nu_0^{2}\lambda^{2}/2, \qquad |\lambda| \leq g,$$

for some constants $\nu_0 > 0$ and $g \geq \nu_0^{-1}\max_{1\leq j\leq K}\sqrt{2p_j\log(K)}$, then

$$\mathbb{E}\max_{1\leq j\leq K}\|\overline{\phi}_j\| \leq C\nu_0\max_{1\leq j\leq K}\|\operatorname{Var}^{1/2}(\overline{\phi}_j)\|\sqrt{2p_{\max}\log(K)},$$

$$\Big(\mathbb{E}\max_{1\leq j\leq K}\|\overline{\phi}_j\|^{6}\Big)^{1/2} \leq C\nu_0\max_{1\leq j\leq K}\|\operatorname{Var}^{1/2}(\overline{\phi}_j)\|^{3}\sqrt{2p_{\max}\log(K)}\,(p_{\max}+6x),$$

where the second bound holds with probability $\geq 1-2e^{-x}$.

Proof of Lemma A.5. Let us take for each $j = 1, \dots, K$ a finite $\varepsilon$-grid $G_j(\varepsilon)\subset\mathbb{R}^{p_j}$ on the $(p_j-1)$-sphere of radius 1 s.t.

$$\forall\gamma\in\mathbb{R}^{p_j} \text{ s.t. } \|\gamma\| = 1 \;\;\exists\gamma_0\in G_j(\varepsilon):\; \|\gamma-\gamma_0\| \leq \varepsilon, \quad \|\gamma_0\| = 1.$$

Then, by inequality (A.9) and the imposed condition, the corresponding exponential-moment bound holds for all $0 < \mu < g/\max_{1\leq j\leq K}\|\operatorname{Var}^{1/2}(\overline{\phi}_j)\|$, which implies the first part of the statement. For the second part of the statement we combine the first part with the result of Theorem B.3 on the deviation of a random quadratic form, which holds with dominating probability.

Proof of Lemma A.6. Denote

$$s(\Gamma) \stackrel{\mathrm{def}}{=} \sum_{j=1}^{K}\exp\Big(\beta\,\frac{\|\gamma_j\|^{2}-z_j^{2}}{2z_j}\Big), \qquad h_{\beta}(s(\Gamma)) \stackrel{\mathrm{def}}{=} \beta^{-1}\log\{s(\Gamma)\}, \tag{A.17}$$

then $F_{\beta,\Delta}(\Gamma, z) = g\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)$. Let $\gamma_q$ denote the $q$-th coordinate of the vector $\Gamma\in\mathbb{R}^{p_{\mathrm{sum}}}$. It holds for $q, l, b, r = 1, \dots, p_{\mathrm{sum}}$:

$$\frac{d}{d\gamma_q}F_{\beta,\Delta}(\Gamma, z) = \frac{1}{\Delta}g'\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\frac{d}{d\gamma_q}h_{\beta}(s(\Gamma)),$$

$$\frac{d^{2}}{d\gamma_qd\gamma_l}F_{\beta,\Delta}(\Gamma, z) = \frac{1}{\Delta^{2}}g''\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\frac{dh_{\beta}(s(\Gamma))}{d\gamma_q}\frac{dh_{\beta}(s(\Gamma))}{d\gamma_l}+\frac{1}{\Delta}g'\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\frac{d^{2}h_{\beta}(s(\Gamma))}{d\gamma_qd\gamma_l},$$

$$\frac{d^{3}}{d\gamma_qd\gamma_ld\gamma_b}F_{\beta,\Delta}(\Gamma, z) = \frac{1}{\Delta^{3}}g'''\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\frac{dh_{\beta}}{d\gamma_q}\frac{dh_{\beta}}{d\gamma_l}\frac{dh_{\beta}}{d\gamma_b}$$
$$\qquad +\frac{1}{\Delta^{2}}g''\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\Big\{\frac{d^{2}h_{\beta}}{d\gamma_qd\gamma_b}\frac{dh_{\beta}}{d\gamma_l}+\frac{dh_{\beta}}{d\gamma_q}\frac{d^{2}h_{\beta}}{d\gamma_ld\gamma_b}+\frac{dh_{\beta}}{d\gamma_b}\frac{d^{2}h_{\beta}}{d\gamma_qd\gamma_l}\Big\}+\frac{1}{\Delta}g'\big(\Delta^{-1}h_{\beta}(s(\Gamma))\big)\frac{d^{3}h_{\beta}}{d\gamma_qd\gamma_ld\gamma_b}.$$

For $1\leq q\leq p_{\mathrm{sum}}$ let $j(q)$ denote the index from 1 to $K$ s.t. the coordinate $\gamma_q$ of the vector $\Gamma = \big(\gamma_1^{\top}, \dots, \gamma_K^{\top}\big)^{\top}$ belongs to its sub-vector $\gamma_{j(q)}$. Then

$$\frac{d}{d\gamma_q}h_{\beta}(s(\Gamma)) = \frac{1}{\beta}\frac{1}{s(\Gamma)}\frac{d}{d\gamma_q}s(\Gamma) = \frac{1}{s(\Gamma)}\frac{\gamma_q}{z_{j(q)}}\exp\Big(\beta\,\frac{\|\gamma_{j(q)}\|^{2}-z_{j(q)}^{2}}{2z_{j(q)}}\Big),$$

and the higher-order derivatives of $h_{\beta}(s(\Gamma))$ are computed in the same way.

The following lemma shows how to compare the expected values of a twice differentiable function evaluated at independent centered Gaussian vectors. This statement is used for the Gaussian comparison step in the scheme (3.1). The proof of the result is based on the Gaussian interpolation method introduced by Stein (1981) and Slepian (1962) (see also Röllin (2013), Chernozhukov et al. (2013b) and references therein). The proof is given here in order to keep the text self-contained.

Lemma A.7 (Gaussian comparison using Slepian interpolation). Let the $\mathbb{R}^{p_{\mathrm{sum}}}$-dimensional random centered vectors $\overline{\Phi}$ and $\overline{\Psi}$ be independent and normally distributed, and let $f(Z):\mathbb{R}^{p_{\mathrm{sum}}}\mapsto\mathbb{R}$ be any twice differentiable function s.t. the expected values in the expression (A.18) below are bounded; then the comparison bound (A.18) holds.

Proof of Lemma A.7. Introduce for $t\in[0,1]$ the Gaussian vector process $Z_t$ and the deterministic scalar-valued function $\varkappa(t)$:

$$Z_t \stackrel{\mathrm{def}}{=} \overline{\Phi}\sqrt{t}+\overline{\Psi}\sqrt{1-t}, \qquad \varkappa(t) \stackrel{\mathrm{def}}{=} \mathbb{E}f(Z_t).$$

Further we use the Gaussian integration by parts formula (see e.g. Section A.6 in Talagrand (2003)) for a centered Gaussian vector $(x_1, \dots, x_{p_{\mathrm{sum}}})^{\top}$, with $\partial_jf$ denoting the partial derivative of $f(Z_t)$ w.r.t. the $j$-th coordinate of $Z_t$; the first term is then treated due to (A.19), and similarly for the second term in (A.18).
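For orientation, the standard form of the comparison identity that this interpolation argument produces (a sketch consistent with the lemma's setup, not necessarily the paper's exact display (A.18)) reads:

$$\mathbb{E}f(\overline{\Phi})-\mathbb{E}f(\overline{\Psi}) = \frac{1}{2}\sum_{q,l=1}^{p_{\mathrm{sum}}}\big\{\operatorname{Cov}(\overline{\Phi})_{ql}-\operatorname{Cov}(\overline{\Psi})_{ql}\big\}\int_{0}^{1}\mathbb{E}\,\partial_q\partial_lf(Z_t)\,dt.$$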

A.3 Simultaneous anti-concentration for $\ell_2$-norms of Gaussian vectors

Lemma A.8 (Simultaneous Gaussian anti-concentration). Let $\overline{\phi}_1, \dots, \overline{\phi}_K$ be the centered Gaussian vectors introduced above; then the anti-concentration bounds (A.20) and (A.21) hold.

Proof of Lemma A.8. The inequality (A.20) continues with the anti-concentration inequality by Chernozhukov et al. (2014c) for the maximum of a centered high-dimensional Gaussian vector (see Theorem A.9 below), applied to $\max_{1\leq j\leq K}\sup_{\gamma\in G_j}(\gamma^{\top}\overline{\phi}_j)$; then (A.21) holds with exponentially high probability due to the Gaussianity of the vectors $\overline{\phi}_j$ and Theorem 1.2 in Spokoiny (2012b), hence the statement follows.

Theorem A.9 (Anti-concentration inequality for maxima of a Gaussian random vector, Chernozhukov et al. (2014c)). Let $(X_1, \dots, X_p)^{\top}$ be a centered Gaussian random vector with $\sigma_j^{2} \stackrel{\mathrm{def}}{=} \mathbb{E}X_j^{2} > 0$ for all $1\leq j\leq p$; then for every $\epsilon > 0$

$$\sup_{z\in\mathbb{R}}\mathbb{P}\Big(\Big|\max_{1\leq j\leq p}X_j-z\Big| \leq \epsilon\Big) \leq C_{ac}\,\epsilon\sqrt{\log(p/\epsilon)},$$

where $C_{ac}$ depends only on $\underline{\sigma} \stackrel{\mathrm{def}}{=} \min_{1\leq j\leq p}\sigma_j$ and $\overline{\sigma} \stackrel{\mathrm{def}}{=} \max_{1\leq j\leq p}\sigma_j$. When the variances are all equal, namely $\underline{\sigma} = \overline{\sigma} = \sigma$, $\log(p/\epsilon)$ on the right side can be replaced by $\log p$.

A.4 Proof of Proposition A.1

Proof of Proposition A.1. Let $\Phi \stackrel{\mathrm{def}}{=} \big(\phi_1^{\top}, \dots, \phi_K^{\top}\big)^{\top}\in\mathbb{R}^{p_{\mathrm{sum}}}$ for $p_{\mathrm{sum}} \stackrel{\mathrm{def}}{=} p_1+\dots+p_K$ (as in (A.5)), and similarly $\Psi \stackrel{\mathrm{def}}{=} \big(\psi_1^{\top}, \dots, \psi_K^{\top}\big)^{\top}\in\mathbb{R}^{p_{\mathrm{sum}}}$. Let also $\overline{\Phi}\sim N(0, \operatorname{Var}\Phi)$ and $\overline{\Psi}\sim N(0, \operatorname{Var}\Psi)$. Introduce the value coming from Lemma A.7 on Gaussian comparison; the bound in the inverse direction is derived similarly. Denote the approximating error term obtained in (A.26) as $\Delta_{\ell_2}$, and consider this term in more detail by inequality (A.23). Let us take $\beta = \log(K)$; then the second inequality holds for $\delta_{z,\min}+5\Delta \leq 1/(2z_{\max})$, and the last one holds for $\delta_{z,\max} \leq \Delta$ and $\Delta \geq n^{-1/2}$. After minimizing the sum of the expressions (A.28) and (A.29) w.r.t. $\Delta$, we obtain the bound $\Delta_{\ell_2} \leq 12.5C(\cdots)$ stated in Proposition A.1, where the last inequality there holds for

$$\delta^{2}_{\Sigma} \leq 4Cp_{\max}^{-1}z_{\min}^{-1/2}\Big(\frac{p^{3}_{\max}}{n}\Big)^{-1/8}\log^{-7/8}(K)\log^{-3/8}(np_{\mathrm{sum}})\big(\max\{\lambda_{\phi,\max},\lambda_{\psi,\max}\}\big)^{-11/4}.$$

B Square-root Wilks approximations

This section's goal is to derive square-root Wilks approximations simultaneously for $K$ parametric models, for the $Y$ and the bootstrap worlds. This is done in Section B.3 below. Both of the results are used in the approximating scheme (3.1) for the bootstrap justification. In order to make the text self-contained, we recall in Section B.1 some results from the general finite sample theory by Spokoiny (2012a,b, 2013). In Section B.2 we recall similar finite sample results for the bootstrap world for a single parametric model, obtained in Spokoiny and Zhilova (2014).

B.1 Finite sample theory

Let us use the notation given in the introduction: $L_k(\theta)$, $k = 1, \dots, K$, are the log-likelihood processes, which depend on the data $Y$ and correspond to the regular