Modulation Estimators and Confidence Sets
Rudolf Beran and Lutz Dümbgen
University of California, Berkeley, and Universität Heidelberg
January 1996, Revised August 1997

Abstract. An unknown signal plus white noise is observed at $n$ discrete time points. Within a large convex class of linear estimators of $\xi$, we choose the estimator $\hat\xi$ that minimizes estimated quadratic risk. By construction, $\hat\xi$ is nonlinear. This estimation is done after orthogonal transformation of the data to a reasonable coordinate system. The procedure adaptively tapers the coefficients of the transformed data. If the class of candidate estimators satisfies a uniform entropy condition, then $\hat\xi$ is asymptotically minimax in Pinsker's sense over certain ellipsoids in the parameter space and shares one such asymptotic minimax property with the James-Stein estimator. We describe computational algorithms for $\hat\xi$ and construct confidence sets for the unknown signal. These confidence sets are centered at $\hat\xi$, have correct asymptotic coverage probability, and have relatively small risk as set-valued estimators of $\xi$.

AMS 1991 subject classifications. Primary 62H12; secondary 62M10.
Key words and phrases. Adaptivity, asymptotic minimax, bootstrap, bounded variation, coverage probability, isotonic regression, orthogonal transformation, signal recovery, Stein's unbiased estimator of risk, tapering.
Research supported in part by National Science Foundation Grant DMS95-30492 and in part by Sonderforschungsbereich 373 at Humboldt-Universität zu Berlin.
Research supported in part by European Union Human Capital and Mobility Program ERB CHRX-CT 940693.
1 Introduction
The problem of recovering a signal from observation of the signal plus noise may be formulated as follows. Let $X = X_n = [X(t)]_{t \in T}$ be a random function observed on the set $T = T_n = \{1, 2, \ldots, n\}$. The components $X(t)$ are independent with $\mathrm{IE}[X(t)] = \xi(t) = \xi_n(t)$ and $\mathrm{Var}[X(t)] = \sigma^2$ for every $t \in T$. Working with functions on $T$ rather than vectors in $\mathbf{R}^n$ is very convenient for the present purposes. As just indicated, we will usually drop the subscript $n$ for notational simplicity. The signal $\xi$ and the noise variance $\sigma^2$ are both unknown. For simplicity we assume throughout that $X$ is Gaussian. Portions of the argument that hold for non-Gaussian $X$ are expressed by the lemmas in Section 6.2.

For any $g \in \mathbf{R}^T$, the space of real-valued functions defined on $T$, let
$$\mathrm{ave}(g) := n^{-1} \sum_{t \in T} g(t).$$
The loss of any estimator $\hat\xi$ for $\xi$ is defined to be
$$L(\hat\xi, \xi) := \mathrm{ave}[(\hat\xi - \xi)^2] \qquad (1.1)$$
and the corresponding risk of $\hat\xi$ is
$$\rho(\hat\xi, \xi, \sigma^2) := \mathrm{IE}\, L(\hat\xi, \xi).$$
The first goal is to devise an estimator that is efficient in terms of this risk. If $\xi$ and $X$ are electrical voltages, then $\mathrm{ave}(\xi^2)$ and $L(\hat\xi, \xi)$ are the time-averaged powers dissipated in passing the signal $\xi$ and the error $\hat\xi - \xi$ through a unit resistance.

Any estimator $\hat\xi$ of $\xi$ is governed by the asymptotic minimax bound
$$\liminf_{n\to\infty}\ \inf_{\hat\xi}\ \sup_{\mathrm{ave}(\xi^2)\le c} \rho(\hat\xi, \xi, \sigma^2) \;\ge\; \frac{\sigma^2 c}{\sigma^2 + c} \qquad (1.2)$$
for every positive $c$ and $\sigma^2$. Inequality (1.2) follows from a more general bound proved by Pinsker (1980) for signal recovery in Gaussian noise (see Nussbaum 1996 and Section 2). It may also be derived from ideas in Stein (1956) by considering best orthogonally equivariant estimators in the submodel where $\mathrm{ave}(\xi^2) = c$ (see Beran 1996b).
Let $\hat\sigma^2 = \hat\sigma^2_n$ be an estimator of $\sigma^2$ that is consistent as in display (2.2) of Section 2. Then
$$\hat\xi_S := [1 - \hat\sigma^2/\mathrm{ave}(X^2)]_+\, X$$
is essentially the James-Stein (1961) estimator, where $[\cdot]_+$ denotes the positive-part function. It achieves the Pinsker bound (1.2) because
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \rho(\hat\xi_S, \xi, \sigma^2) = \frac{\sigma^2 c}{\sigma^2 + c} \qquad (1.3)$$
for every positive $c$ and $\sigma^2$. The limit (1.3) follows from Corollary 2.3 or from asymptotics in Casella and Hwang (1982). For the maximum likelihood estimator $\hat\xi_{ML} = X$, the risk is always $\sigma^2$, which is strictly greater than the Pinsker bound.
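For concreteness, here is a minimal Python sketch (ours, not code from the paper) of the positive-part James-Stein rule $\hat\xi_S$ above; the test signal, the seed and the helper name `james_stein` are illustrative assumptions, and the noise variance is estimated by the first-difference formula that reappears as (2.4) in Section 2.

```python
import numpy as np

def james_stein(x, sigma2_hat):
    """Positive-part James-Stein estimator [1 - sigma2_hat/ave(x^2)]_+ * x."""
    ave_x2 = np.mean(x ** 2)
    shrink = max(1.0 - sigma2_hat / ave_x2, 0.0)  # scalar modulator in [0, 1]
    return shrink * x

# Toy illustration: signal plus Gaussian noise on T = {1, ..., n}.
rng = np.random.default_rng(0)
n, sigma = 500, 1.0
xi = np.exp(-np.arange(n) / 100.0)                      # an arbitrary decaying signal
x = xi + sigma * rng.normal(size=n)
sigma2_hat = np.sum(np.diff(x) ** 2) / (2 * (n - 1))    # first-difference estimator, cf. (2.4)
xi_hat = james_stein(x, sigma2_hat)
print(np.mean((xi_hat - xi) ** 2), np.mean((x - xi) ** 2))  # losses L(xi_hat, xi) and L(X, xi)
```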
Section 2 of this paper constructs estimators of $\xi$ that are asymptotically minimax over a variety of ellipsoids in the parameter space while achieving, in particular, the asymptotic minimax bound (1.2) for every $c > 0$. These modulation estimators take the form $\hat f X = [\hat f(t)X(t)]_{t\in T}$. Here $\hat f : T \to [0,1]$ depends on $X$ and is chosen to minimize the estimated risk of the linear estimator $fX$ over all functions $f$ in a class $\mathcal{F} = \mathcal{F}_n \subset [0,1]^T$. Many well-known estimators are of this form with special classes $\mathcal{F}$. In the present paper we analyze such estimators under rather general assumptions on $\mathcal{F}$. How large this class may be is at the heart of the analysis. Taking $\mathcal{F}$ to be the set of all functions from $T$ to $[0,1]$ leads to a poor modulation estimator. A successful choice is to let $\mathcal{F}$ be a closed convex set of functions with well-behaved uniform covering numbers. One example is the set of all functions in $[0,1]^T$ that are nonincreasing. The asymptotic theory of such modulation estimators, including links with the literature, is the subject of Section 2. Section 4 develops algorithms for computing $\hat f X$ in the example of $\mathcal{F}$ just cited.

Section 3 constructs confidence sets that are centered at a modulation estimator $\hat f X$ and have asymptotic coverage probability $\alpha$ for $\xi$. The risk of the modulation estimator at the center is shown to determine the risk of the confidence set, when that is viewed as a set-valued estimator for $\xi$. In this manner, efficiency of a modulation estimator determines the efficiency of the associated confidence set.

Before estimation of $\xi$, the data $X$ may be transformed orthogonally without changing its Gaussian character. A modulation estimator computed in the new coordinate system can be transformed back into the original coordinate system to yield an estimator of $\xi$. Standard choices for such a preliminary orthogonal transformation include Fourier transforms, wavelet transforms, or analysis-of-variance transforms. When applied in this manner, modulation estimators perform data-driven tapering of empirical Fourier, wavelet or analysis-of-variance coefficients. Section 5 includes numerical examples of modulation estimators and confidence bounds after Fourier transformation.
2 Modulation estimators
After defining modulation estimators, this section obtains uniform asymptotic approximations to their risks. Let $\mathcal{F} = \mathcal{F}_n$ be a given subset of $[0,1]^T$. Each function $f \in \mathcal{F}$ is called a modulator and defines a candidate linear estimator $fX = [f(t)X(t)]_{t\in T}$ for $\xi$. The risk of this candidate estimator under quadratic loss (1.1) is
$$\rho(fX, \xi, \sigma^2) = \mathrm{IE}\, L(fX, \xi) = \mathrm{ave}[\sigma^2 f^2 + \xi^2(1-f)^2]. \qquad (2.1)$$
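As a small companion to (2.1), the following helper (our illustration, not the paper's code) evaluates the exact risk of a candidate modulator when $\xi$ and $\sigma^2$ are known.

```python
import numpy as np

def exact_risk(f, xi, sigma2):
    """R(f, xi, sigma^2) = ave[sigma^2 f^2 + xi^2 (1 - f)^2], cf. (2.1)."""
    return np.mean(sigma2 * f ** 2 + xi ** 2 * (1 - f) ** 2)
```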
For brevity, we will write $R(f, \xi, \sigma^2)$ in place of $\rho(fX, \xi, \sigma^2)$.

We will first construct a suitably consistent estimator $\hat R(f)$ of this risk. Suppose that $\hat\sigma^2 = \hat\sigma_n^2$ is an estimator of $\sigma^2$, constructed (for instance) by one of the methods described later. Let $X^*$ be a bootstrap random vector in $\mathbf{R}^T$ such that $\mathcal{L}(X^* \mid X, \hat\sigma^2) = N_T(X, \hat\sigma^2 I)$. The corresponding bootstrap risk estimator for $R(f, \xi, \sigma^2)$ is
$$\mathrm{IE}[L(fX^*, X) \mid X, \hat\sigma^2] = R(f, X, \hat\sigma^2).$$
We call $R(f, X, \hat\sigma^2)$ the naive risk estimator because it is badly biased upwards, even asymptotically. The key point is
$$\mathrm{IE}\, R(f, X, \sigma^2) = \mathrm{ave}[f^2\sigma^2 + (1-f)^2(\xi^2 + \sigma^2)] = R(f, \xi, \sigma^2) + \mathrm{ave}[(1-f)^2\sigma^2].$$
Two possible corrections to the naive risk estimator are
$$\hat R_C(f) := \mathrm{ave}[f^2\hat\sigma^2 + (1-f)^2(X^2 - \hat\sigma^2)] = R(f, X, \hat\sigma^2) - \mathrm{ave}[(1-f)^2\hat\sigma^2],$$
$$\hat R_B(f) := \max\big\{\mathrm{ave}(f^2\hat\sigma^2),\ \hat R_C(f)\big\} = \mathrm{ave}(f^2\hat\sigma^2) + \big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+.$$
Risk estimator $\hat R_C$ is essentially Mallows' (1973) $C_L$ criterion or Stein's (1981) unbiased estimator of risk, with estimation of $\sigma^2$ incorporated. Risk estimator $\hat R_B$ corrects the possible negativity of $\mathrm{ave}[(1-f)^2(X^2 - \hat\sigma^2)]$ as an estimator for $\mathrm{ave}[(1-f)^2\xi^2]$.
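To show how these formulas are used to select a modulator, here is a short sketch (ours; the helper names are illustrative) that evaluates $\hat R_C$ and $\hat R_B$ and, as a toy example, minimizes $\hat R_C$ over the nested model-selection modulators that reappear in Example 3 below.

```python
import numpy as np

def risk_C(f, x, sigma2_hat):
    """Mallows/Stein-type corrected risk estimate R_C-hat(f)."""
    return np.mean(f ** 2 * sigma2_hat + (1 - f) ** 2 * (x ** 2 - sigma2_hat))

def risk_B(f, x, sigma2_hat):
    """R_B-hat(f) = max{ ave(f^2 sigma2_hat), R_C-hat(f) }."""
    return max(np.mean(f ** 2) * sigma2_hat, risk_C(f, x, sigma2_hat))

def best_nested_modulator(x, sigma2_hat):
    """Minimize R_C-hat over the nested modulators f_k(t) = 1{t <= k}, k = 0, ..., n."""
    n = len(x)
    best_k, best_val = 0, np.inf
    for k in range(n + 1):
        f = (np.arange(1, n + 1) <= k).astype(float)
        val = risk_C(f, x, sigma2_hat)
        if val < best_val:
            best_k, best_val = k, val
    return best_k
```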
Let $X^{**}$ be a random vector in $\mathbf{R}^T$ such that $\mathcal{L}(X^{**} \mid X, \hat\sigma^2)$ is $N_T(\check\xi, \hat\sigma^2 I)$, where $\check\xi = \check\xi(X, \hat\sigma^2)$ is a vector such that
$$\mathrm{ave}[(1-f)^2\check\xi^2] = \big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+,$$
for instance $\check\xi^2 := X^2\,\big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+ \big/ \mathrm{ave}[(1-f)^2 X^2]$ componentwise. Then the bootstrap risk estimator $\mathrm{IE}[L(fX^{**}, \check\xi) \mid X, \hat\sigma^2]$ is precisely $\hat R_B$.
Let $\hat R$ denote either $\hat R_C$ or $\hat R_B$. We propose to estimate $\xi$ by the modulation estimator $\hat f X$, where $\hat f$ is any function in $\mathcal{F}$ that minimizes $\hat R(f)$. Unless stated otherwise it is assumed throughout that $\mathcal{F}$ is a closed convex subset of $[0,1]^T$ containing all constants $c \in [0,1]$.

Because both $\hat R_C(\cdot)$ and $\hat R_B(\cdot)$ are convex functions on $[0,1]^T$, the minimizer $\hat f$ over $\mathcal{F}$ exists in each case. These minimizers are unique with probability one because $\hat R_C(f)$ is strictly convex in $f$ whenever $X(t) \neq 0$ for every $t \in T$. Similarly, the risk function $R(f, \xi, \sigma^2)$ defined through (2.1) is strictly convex over $[0,1]^T$, with unique minimizer $\tilde f$.
REMARK A. The modulation estimator $\hat f X$ behaves poorly when the class $\mathcal{F}$ is too large. For instance, let $\mathcal{F}$ be the class of all functions in $[0,1]^T$. The minimizer of $R(\cdot, \xi, \sigma^2)$ over $[0,1]^T$ is the "oracle" modulator (cf. Donoho and Johnstone 1994)
$$\tilde g := \xi^2/(\xi^2 + \sigma^2),$$
the division being componentwise, while the minimizer of $\hat R(\cdot)$ over $\mathcal{F}$ is now the greedy modulator $\hat g_+$, where
$$\hat g := (X^2 - \hat\sigma^2)/X^2.$$
To simplify the discussion, suppose that $\sigma^2$ is known and $\hat\sigma^2 \equiv \sigma^2$. Then the estimator $\hat g_+ X$ is of the general form $\hat\xi := [S(X(t))]_{t\in T}$ for some measurable function $S$ on the line. Since the maximum likelihood estimator $X$ is componentwise admissible, the risk function $\rho(\hat\xi, \xi, \sigma^2)$ of $\hat\xi$ is either identical to $\rho(X, \xi, \sigma^2) \equiv \sigma^2$, or there is a real number $\mu$ such that $\int (\mu - S)^2\, dN(\mu, \sigma^2) > \sigma^2$. Then, if $\xi(\cdot) \equiv \mu$,
$$\rho(\hat\xi, \xi, \sigma^2) > \sigma^2 = \rho(X, \xi, \sigma^2) > \sigma^2\mu^2/(\sigma^2 + \mu^2),$$
the latter being the asymptotic risk of the James-Stein estimator $\hat\xi_S$. Thus, the maximum risk of $\hat g_+ X$ is worse than that of estimators achieving Pinsker's asymptotic minimax bound (1.2) and is even worse than that of the naive estimator $X$.

It should be mentioned that greedy modulation can be made successful in some sense if one overestimates the variance $\sigma^2$ systematically. Donoho and Johnstone (1994) propose threshold estimators of the form $\hat\xi = (1 - \lambda_n\sigma/|X|)_+ X$ or $\hat\xi = 1\{|X| \ge \lambda_n\sigma\} X$, and prove that they have surprising optimality properties if $\lambda_n = (2\log n)^{1/2}(1 + \epsilon_n)$ with a suitable sequence $(\epsilon_n)_n$ tending to zero. These estimators are similar to $\hat g_+ X$ if $\hat g$ is computed with $\hat\sigma_n^2 := \lambda_n^2\sigma^2$. While showing good performance in case of "sparse signals", these estimators do not achieve the Pinsker bound (1.2) or the minimax bounds in Corollary 2.3 below. Also, the construction of confidence bounds for their loss seems to be intractable. Section 5 illustrates the possibly poor performance of hard thresholding for non-sparse signals.
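The contrast drawn in this remark is easy to reproduce numerically. The sketch below (our illustration; the sparse test signal is an assumption) computes the oracle modulator, the greedy modulator, and the greedy modulator with inflated variance, and prints their realized losses.

```python
import numpy as np

def oracle_modulator(xi, sigma2):
    """g-tilde = xi^2 / (xi^2 + sigma^2), componentwise (requires the true signal)."""
    return xi ** 2 / (xi ** 2 + sigma2)

def greedy_modulator(x, sigma2_hat):
    """g-hat_+ = [(X^2 - sigma2_hat) / X^2]_+, componentwise."""
    return np.clip(1.0 - sigma2_hat / x ** 2, 0.0, 1.0)

rng = np.random.default_rng(1)
n, sigma = 1000, 1.0
xi = np.zeros(n); xi[:20] = 5.0                        # a sparse signal
x = xi + sigma * rng.normal(size=n)
lam = np.sqrt(2 * np.log(n))                           # lambda_n with epsilon_n = 0
for g in (oracle_modulator(xi, sigma ** 2),
          greedy_modulator(x, sigma ** 2),             # greedy: keeps much pure noise
          greedy_modulator(x, (lam * sigma) ** 2)):    # inflated variance ~ thresholding
    print(np.mean((g * x - xi) ** 2))                  # realized loss L(gX, xi)
```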
REMARK B. Kneip's (1994) ordered linear smoothers are equivalent to certain modulation estimators computed after suitable orthogonal transformation of $X$. The conditions that we impose on $\mathcal{F}$ in this paper are substantially weaker than the ordering of $\mathcal{F}$ required by Kneip. Consequently, our results also apply to the ridge regression, spline estimation, and kernel estimation examples discussed in Kneip's paper. The earlier paper of Li (1987) treated non-diagonal linear estimators indexed by a parameter $h$. Li's optimality result may be compared with Theorem 2.1 below. However, it does not seem easy to relate Li's conditions on the range of $h$ to our conditions on $\mathcal{F}$. The latter conditions give access to empirical process results that yield asymptotic distributions for the loss of $\hat f X$ and hence confidence sets for $\xi$ centered at modulation estimators.

REMARK C. Nussbaum (1996) surveyed constructions of adaptive estimators that achieve Pinsker-type asymptotic minimax bounds. For instance, Golubev and Nussbaum (1992) treated adaptive, asymptotically minimax estimation when $\xi_i = g(x_i)$ and $g$ lies in an ellipsoid of unknown radius within a Sobolev space of unknown order. Corollary 2.3 below is of related character. However, our results make no smoothness assumptions on $\xi$. For instance, sample paths up to time $n$ of suitably scaled, discrete-time, independent white noise ultimately lie, as $n \to \infty$, within the ball $\mathrm{ave}(\xi^2) \le c$.
Useful classes of modulators $\mathcal{F}$ can be characterized through their uniform covering numbers, which are defined as follows. For any probability measure $Q$ on $T$, consider the pseudo-distance $d_Q(f, g)^2 := \int (f-g)^2\, dQ$ on $[0,1]^T$. For every positive $u$, let
$$N(u, \mathcal{F}, d_Q) := \min\Big\{\#\mathcal{F}_o : \mathcal{F}_o \subset \mathcal{F},\ \inf_{f_o \in \mathcal{F}_o} d_Q(f_o, f) \le u\ \ \forall\, f \in \mathcal{F}\Big\}.$$
Define the uniform covering number $N(u, \mathcal{F}) := \sup_Q N(u, \mathcal{F}, d_Q)$, where the supremum is taken over all probabilities on $T$. Let
$$J(\mathcal{F}) := \int_0^1 \sqrt{\log N(u, \mathcal{F})}\; du.$$
Throughout, $C$ denotes a generic universal real constant which does not depend on $n$, $\xi$, $\sigma^2$ or $\mathcal{F}$, but whose value may be different in various places.

THEOREM 2.1. Let $\mathcal{F}$ be any closed subset of $[0,1]^T$ containing $0$, let $\tilde f$ be a minimizer of $R(f, \xi, \sigma^2)$ over $f \in \mathcal{F}$, and let $\hat f$ minimize either $\hat R_C(f)$ or $\hat R_B(f)$ over $f \in \mathcal{F}$. Then
$$\mathrm{IE}\,\hat G - R(\tilde f, \xi, \sigma^2) \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|,$$
where $\hat G$ is any one of the following quantities: $L(\hat f X, \xi)$, $\inf_{f\in\mathcal{F}} L(fX, \xi)$, $\hat R_C(\hat f)$, $\hat R_B(\hat f)$. In particular,
$$\rho(\hat f X, \xi, \sigma^2) - R(\tilde f, \xi, \sigma^2) \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|.$$
This theorem is about convergence of losses and risks. The next result uses convexity of $\mathcal{F}$ to establish that $\hat f$ and $\tilde f$, as well as $\hat f X$ and $\tilde f X$, converge to one another. Note that the second bound holds uniformly in $\xi \in \mathbf{R}^T$.

THEOREM 2.2. Let $\hat f$ be the minimizer of $\hat R_C$. Then
$$\mathrm{IE}\,\mathrm{ave}[(\xi^2 + \sigma^2)(\hat f - \tilde f)^2] \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|,$$
$$\mathrm{IE}\,\mathrm{ave}[(\hat f X - \tilde f X)^2] \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|.$$
Given consistency of $\hat\sigma^2$ and boundedness of $\sigma^2 + \mathrm{ave}(\xi^2)$, a key assumption on $\mathcal{F}$ that ensures success of the modulation estimator $\hat f X$ defined above is that $J(\mathcal{F}) = o(n^{1/2})$.

Here are some examples of modulator classes $\mathcal{F}$ to which Theorem 2.1 applies.
EXAMPLE 1 (Stein shrinkage). Suppose that $\mathcal{F}$ consists of all constant functions in $[0,1]^T$. The minimizer over $\mathcal{F}$ of $R(f, \xi, \sigma^2)$ is
$$\tilde f \equiv \tilde f_S \equiv 1 - \sigma^2/[\sigma^2 + \mathrm{ave}(\xi^2)].$$
The minimizer of both $\hat R_C$ and $\hat R_B$ is
$$\hat f \equiv \hat f_S \equiv [1 - \hat\sigma^2/\mathrm{ave}(X^2)]_+.$$
The resulting modulation estimator $\hat f_S X$ is the (modified) James-Stein estimator $\hat\xi_S$ of Section 1. Here one easily shows that $N(u, \mathcal{F}) \le 1 + (2u)^{-1}$, whence $J(\mathcal{F})$ is bounded by a universal constant.
EXAMPLE 2 (Multiple Stein shrinkage). Let $\mathcal{B} = \mathcal{B}_n$ be a partition of $T$ and define
$$\mathcal{F} := \Big\{\sum_{B \in \mathcal{B}} 1_B\, c(B) : c \in [0,1]^{\mathcal{B}}\Big\},$$
where $1_B$ is the indicator function of $B$. The values of $c(B)$ that define $\tilde f$ and $\hat f$, respectively, are
$$\tilde c(B) = \mathrm{ave}(1_B \xi^2)\big/\mathrm{ave}[1_B(\xi^2 + \sigma^2)], \qquad \hat c(B) = \big[\mathrm{ave}\{1_B(X^2 - \hat\sigma^2)\}\big]_+\big/\mathrm{ave}(1_B X^2).$$
The modulation estimator $\hat f X$ now has the asymptotic form of the multiple shrinkage estimator in Stein (1966). Elementary calculations show that $N(u, \mathcal{F}) \le [1 + (2u)^{-1}]^{\#\mathcal{B}}$. Thus $J(\mathcal{F})$ is bounded by a universal constant times $(\#\mathcal{B})^{1/2}$, so that $J(\mathcal{F}) = o(n^{1/2})$ follows from the intuitively appealing condition $\#\mathcal{B} = o(n)$.
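A minimal sketch of this blockwise rule (ours; the dyadic block layout below is just one possible choice, not prescribed by the paper) follows.

```python
import numpy as np

def multiple_stein(x, sigma2_hat, blocks):
    """Blockwise modulator c-hat(B) = [ave_B(X^2 - sigma2_hat)]_+ / ave_B(X^2), applied to X."""
    f = np.empty_like(x)
    for block in blocks:                      # `blocks` partitions the index set {0, ..., n-1}
        x2 = x[block] ** 2
        f[block] = max(np.mean(x2) - sigma2_hat, 0.0) / np.mean(x2)
    return f * x

# Example: dyadic blocks, e.g. for empirical Fourier or wavelet coefficients.
n = 1024
edges = [0] + [2 ** j for j in range(1, int(np.log2(n)) + 1)]
blocks = [np.arange(a, b) for a, b in zip(edges[:-1], edges[1:])]
```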
EXAMPLE 3 (Monotone shrinkage). Let $\mathcal{F}_{\mathrm{mon}}$ be the set of all nonincreasing functions in $[0,1]^T$. The class of candidate estimators $\{fX : f \in \mathcal{F}_{\mathrm{mon}}\}$ includes the nested model-selection estimators $f_k X$, $0 \le k \le n$, defined by $f_k(t) := 1\{t \le k\}$. In fact, $\mathcal{F}_{\mathrm{mon}}$ is the convex hull of $\mathcal{D}_{MS} := \{f_0, f_1, \ldots, f_n\}$. Elementary calculations show that
$$N(u, \mathcal{D}_{MS}) \le 1 + u^{-2} \le 2u^{-2} \quad \text{for } 0 < u \le 1.$$
Together with Theorem 5.1 of Dudley (1987) it follows that
$$\log N(u, \mathcal{F}_{\mathrm{mon}}) \le C u^{-1} \quad \text{for all } u \in\; ]0,1].$$
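Section 4 develops the actual algorithms for this example. As a rough illustration only: since $\hat R_C(f) = \mathrm{ave}[X^2(f - \hat g)^2] + \mathrm{const}$ with $\hat g := (X^2 - \hat\sigma^2)/X^2$, the minimizer over $\mathcal{F}_{\mathrm{mon}}$ is a weighted least-squares fit of $\hat g$ by a nonincreasing sequence, truncated to $[0,1]$. The pooled-adjacent-violators routine below is our own generic sketch, not the paper's implementation.

```python
import numpy as np

def pava_nonincreasing(y, w):
    """Weighted least-squares fit of a nonincreasing sequence to y (pool adjacent violators)."""
    y, w = y[::-1], w[::-1]            # fit a nondecreasing sequence to the reversed data
    vals, wts, cnts = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:   # pool blocks violating monotonicity
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            wt, c = wts[-2] + wts[-1], cnts[-2] + cnts[-1]
            vals, wts, cnts = vals[:-2] + [v], wts[:-2] + [wt], cnts[:-2] + [c]
    return np.repeat(vals, cnts)[::-1]

def monotone_modulator(x, sigma2_hat):
    """Minimize R_C-hat over nonincreasing f in [0,1]^T (Example 3)."""
    g_hat = (x ** 2 - sigma2_hat) / x ** 2         # unconstrained minimizer, weights X^2
    f_hat = pava_nonincreasing(g_hat, x ** 2)
    return np.clip(f_hat, 0.0, 1.0)                # truncation to [0,1] preserves optimality
```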
EXAMPLE 4 (Monotone shrinkage with respect to a quasi-order). Let $\preceq$ be a quasi-order relation on $T$ (cf. Robertson et al. 1988, Chapter 1.3), and let $\mathcal{F}$ be the set of all functions in $[0,1]^T$ that are nonincreasing with respect to $\preceq$. That means, for all $f \in \mathcal{F}$ and $s, t \in T$,
$$f(s) \ge f(t) \quad \text{if } s \preceq t.$$
Here one can easily deduce from the conclusion of Example 3 that
$$\log N(u, \mathcal{F}) \le C N u^{-1} \quad \text{for } 0 < u \le 1,$$
where $N = N_n$ is the minimal cardinality of a partition of $(T, \preceq)$ into totally ordered subsets. Thus $J(\mathcal{F})$ is of order $O(N^{1/2})$. To give an example, suppose that $X$ consists of $n = 2^{k+1} - 1$ empirical Haar (or wavelet) coefficients, arranged as a binary tree. If this tree is equipped with its natural order $\preceq$, then the monotonicity constraint $\hat f \in \mathcal{F}$ means that $\hat f X$ is a mixture of histogram estimators (cf. Engel 1994). Here $N = 2^k > n/2$. Therefore, in order to apply our theory one has to replace the class $\mathcal{F}$ with suitable subclasses.
EXAMPLE 5 (Shrinkage with bounded total variation). Let $\mathcal{F}(M)$ be the set of all functions $f$ in $[0,1]^T$ with total variation not greater than $M = M_n$, i.e.
$$\sum_{t=2}^{n} |f(t) - f(t-1)| \le M.$$
For instance, the class of functions $f(t) := \max\{\min\{p(t), 1\}, 0\}$, where $p$ is a polynomial of degree less than or equal to $M$, belongs to $\mathcal{F}(M)$. Any $f \in \mathcal{F}(M)$ can be written as $(M+1)(f_1 - f_2)$ with $f_1, f_2 \in \mathcal{F}_{\mathrm{mon}}$. Hence
$$\log N(u, \mathcal{F}(M)) \le 2 \log N\big([2(M+1)]^{-1}u,\ \mathcal{F}_{\mathrm{mon}}\big) \le C(M+1)u^{-1} \quad \text{for } 0 < u \le 1.$$
In particular, $J(\mathcal{F}(M)) = O[(M+1)^{1/2}]$.

The minimizers $\tilde f$ and $\hat f$ in Examples 3-5 lack closed forms. Section 4 describes computational algorithms for $\tilde f$ and $\hat f$ in Examples 3-4. Example 5 differs from the remaining examples both theoretically and computationally and will be treated in detail elsewhere.
A particular consequence of Theorem 2.1 is that the modulation estimators are asymptotically minimax optimal for a large class of submodels for $(\xi, \sigma^2)$. Namely, for $a \in [1,\infty]^T$ and $c > 0$ define the linear minimax risk
$$\rho^*(a, c, \sigma^2) := \inf_{g \in [0,1]^T}\ \sup_{\mathrm{ave}(a\xi^2)\le c} R(g, \xi, \sigma^2).$$
It is shown by Pinsker (1980) that the linear minimax risk approximates the unrestricted minimax risk in that
$$\inf_{\hat\xi}\ \sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat\xi, \xi, \sigma^2)\Big/\rho^*(a, c, \sigma^2) \;\to\; 1 \quad \text{as } n\,\rho^*(a, c, \sigma^2) \to \infty.$$
Moreover,
$$\rho^*(a, c, \sigma^2) = \sup_{\mathrm{ave}(a\xi^2)\le c} R(g_o, \xi, \sigma^2) = R(g_o, \xi_o, \sigma^2),$$
where $g_o := [1 - (a/\lambda_o)^{1/2}]_+$, $\xi_o^2 := \sigma^2[(\lambda_o/a)^{1/2} - 1]_+$, and $\lambda_o > 0$ is the unique real number satisfying $\mathrm{ave}\big(a[(\lambda_o/a)^{1/2} - 1]_+\big) = c/\sigma^2$. The special case $a \equiv 1$ yields (1.2).

If the minimax modulator $g_o = g_o(\cdot \mid a, c/\sigma^2)$ happens to be in $\mathcal{F}$, which is certainly true for $a \equiv 1$, then
$$\sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat fX, \xi, \sigma^2) \;\le\; \sup_{\mathrm{ave}(a\xi^2)\le c} \big[\rho(\hat fX, \xi, \sigma^2) - R(\tilde f, \xi, \sigma^2)\big] + \rho^*(a, c, \sigma^2).$$
Thus Theorem 2.1 immediately implies the following minimax result, where the distribution of $(X, \hat\sigma^2)$ is assumed to depend on $(\xi, \sigma^2)$ only.
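To make Pinsker's solution concrete, the following sketch (ours) solves $\mathrm{ave}\big(a[(\lambda/a)^{1/2} - 1]_+\big) = c/\sigma^2$ for $\lambda_o$ by bisection and returns $g_o$; it assumes all entries of $a$ are finite.

```python
import numpy as np

def pinsker_modulator(a, c, sigma2, tol=1e-10):
    """Minimax modulator g_o = [1 - (a/lambda_o)^{1/2}]_+ for the ellipsoid ave(a*xi^2) <= c."""
    target = c / sigma2
    def h(lam):  # ave(a * [(lam/a)^{1/2} - 1]_+), continuous and nondecreasing in lam
        return np.mean(a * np.clip(np.sqrt(lam / a) - 1.0, 0.0, None))
    lo, hi = 0.0, 1.0
    while h(hi) < target:                      # bracket the root
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):        # bisection for lambda_o
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < target else (lo, mid)
    lam_o = 0.5 * (lo + hi)
    return np.clip(1.0 - np.sqrt(a / lam_o), 0.0, 1.0)
```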
COROLLARY 2.3. Suppose that $J(\mathcal{F}) = o(n^{1/2})$, and that for every $c, \sigma^2 > 0$,
$$\gamma_n(c, \sigma^2) := \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IE}|\hat\sigma^2 - \sigma^2| \;\to\; 0 \qquad (n \to \infty). \qquad (2.2)$$
Then the modulation estimator $\hat f X$ achieves the asymptotic minimax bound (1.2). More generally, let $a = a_n \in [1,\infty]^T$ be such that
$$[1 - (a/\lambda)^{1/2}]_+ \in \mathcal{F} \quad \text{for all constants } \lambda \ge 1. \qquad (2.3)$$
Then for every $c, \sigma^2 > 0$,
$$\sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat fX, \xi, \sigma^2) \;\le\; \rho^*(a, c, \sigma^2) + O\big[n^{-1/2}J(\mathcal{F}) + \gamma_n(c, \sigma^2)\big].$$

Specifically, let $a(t) = 1$ for $t \in A \subset T$ and $a(t) = \infty$ otherwise. Then $\mathrm{ave}(a\xi^2) \le c$ is equivalent to $\mathrm{ave}(\xi^2) \le c$ and $\xi = 0$ on $T \setminus A$. Here one can easily see that condition (2.3) is equivalent to $1_A \in \mathcal{F}$. The linear minimax risk equals
$$\rho^*(a, c, \sigma^2) = \frac{\sigma^2\,\mathrm{ave}(1_A)\, c}{\sigma^2\,\mathrm{ave}(1_A) + c},$$
which can be significantly smaller than the bound in (1.2). In case of $\mathcal{F} = \mathcal{F}_{\mathrm{mon}}$, condition (2.3) is equivalent to $a$ being nondecreasing on $T$.
We end this section with some examples for $\hat\sigma^2$. Internal estimators of $\sigma^2$ depend only on $X$ and require additional smoothness or dimensionality restrictions on the possible values of $\xi$ to achieve the consistency property (2.2). One internal estimator of $\sigma^2$, analyzed by Rice (1984) and by Gasser et al. (1986), is
$$\hat\sigma^2_{(1)} = [2(n-1)]^{-1}\sum_{t=2}^{n} [X(t) - X(t-1)]^2. \qquad (2.4)$$
Here $\mathrm{IE}|\hat\sigma^2_n - \sigma^2_n| \to 0$ as $n \to \infty$, provided that
$$n^{-1}\sum_{t=2}^{n} [\xi_n(t) - \xi_n(t-1)]^2 \;\to\; 0.$$
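In code, (2.4) is the short helper below (ours), already used informally in the sketch of Section 1.

```python
import numpy as np

def rice_variance(x):
    """First-difference estimator (2.4): sum of squared differences / (2(n-1))."""
    return np.sum(np.diff(x) ** 2) / (2 * (len(x) - 1))
```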
External estimators of variance are available in linear models, where one observes an $N$-dimensional normal random vector $Y$ with mean $\mathrm{IE}\,Y = D\beta$ and covariance matrix $\mathrm{Cov}(Y) = \sigma^2 I_N$ for some design matrix $D \in \mathbf{R}^{N\times n}$, $N = N_n > n$. After suitable linear transformation of $Y$ and $\beta$ one may assume that $\xi$ is the expectation of the vector $X := (Y_1, Y_2, \ldots, Y_n)$. Then the standard estimator for $\sigma^2$ is given by
$$\hat\sigma^2_{(2)} := (N - n)^{-1}\sum_{i=n+1}^{N} Y_i^2,$$
which is independent of $X$ with $(N-n)\,\sigma^{-2}\hat\sigma^2_{(2)} \sim \chi^2_{N-n}$. This estimator also satisfies (2.2), provided that $N - n \to \infty$.

3 Confidence sets
Having replaced the maximum likelihood estimator $X$ with $\hat f X$, a natural question is to what extent $\hat f X$ is closer to the unknown signal $\xi$ than $X$. More precisely, we want to compare the distance $L(X, \hat f X)^{1/2}$ with an upper confidence bound $\hat r = \hat r(X, \hat\sigma^2)$ for $L(\hat f X, \xi)^{1/2}$. In geometrical terms, the confidence ball of primary interest is defined by
$$\hat C = \hat C_n := \{\xi \in \mathbf{R}^T : L(\hat f X, \xi) \le \hat r^2\}.$$
The radius $\hat r$ is chosen so that the coverage probability $\mathrm{IP}(\xi \in \hat C)$ converges to $\alpha \in\; ]0,1[$ as $n$ increases. The full definition of $\hat C$ follows the theorem below. Underlying the construction is the confidence set idea sketched at the end of Stein (1981). The quality of $\hat C$ as a set-valued estimator of $\xi$ will be measured through the quadratic loss
$$L(\hat C, \xi) := \sup_{\eta \in \hat C} L(\eta, \xi) = \big[L(\hat f X, \xi)^{1/2} + \hat r\big]^2. \qquad (3.1)$$
This is a natural extension of the quadratic loss defined in (1.1) and has an appealing projection-pursuit interpretation; see Beran (1996a).
One main assumption for this section is that
$$X_n \text{ and } \hat\sigma^2_n \text{ are independent, with } \mathcal{L}(\sigma_n^{-2}\hat\sigma^2_n) \text{ depending only on } n, \qquad (3.2)$$
such that
$$\lim_{n\to\infty} m\big[\mathcal{L}\{n^{1/2}(\sigma^{-2}\hat\sigma^2_n - 1)\},\ N(0, \tau^2)\big] = 0.$$
Here $\tau^2 \ge 0$ is a given constant and $m(\cdot, \cdot)$ metrizes weak convergence of distributions on the line. For instance, the estimator $\hat\sigma^2_{(2)}$ of Section 2 satisfies Condition (3.2) with $\tau^2 := 2\lim_{n\to\infty} n/(N_n - n)$, provided that this limit exists. Condition (3.2) is made for the sake of simplicity. It could be replaced with weaker, but more technical, conditions in order to include special internal estimators of variance such as $\hat\sigma^2_{(1)}$.

A second key assumption is that
$$\int_0^1 \sqrt{\sup_n\, \log N(u, \mathcal{F}_n)}\; du \;<\; \infty. \qquad (3.3)$$
Roughly speaking, this condition allows us to pretend that $\hat f$ is equal to $\tilde f$. It is satisfied in all Examples 1-5, provided that $\#\mathcal{B}_n = O(1)$ in Example 2, $N_n = O(1)$ in Example 4, and $M_n = O(1)$ in Example 5.
At first let us consider confidence balls centered at the naive estimator $X$. Since $n\sigma^{-2}\,\mathrm{ave}[(X - \xi)^2]$ has a chi-squared distribution with $n$ degrees of freedom, we consider
$$\hat C_N := \big\{\xi \in \mathbf{R}^T : \mathrm{ave}[(X - \xi)^2] \le \hat\sigma^2(1 + n^{-1/2}c)\big\}$$
for some fixed $c$. The inequality $\mathrm{ave}[(X-\xi)^2] \le \hat\sigma^2(1 + n^{-1/2}c)$ is equivalent to
$$n^{1/2}\big[\sigma^{-2}\mathrm{ave}\{(X-\xi)^2\} - 1\big] - n^{1/2}(\sigma^{-2}\hat\sigma^2 - 1) \;\le\; \sigma^{-2}\hat\sigma^2 c = c + o_p(1).$$
Thus the Central Limit Theorem for the chi-squared distribution together with Condition (3.2) implies that $c = (2 + \tau^2)^{1/2}\Phi^{-1}(\alpha)$ yields a confidence set $\hat C_N$ with
$$\lim_{n\to\infty}\ \sup_{\xi\in\mathbf{R}^T,\ \sigma^2 > 0} \big|\mathrm{IP}\{\xi \in \hat C_N\} - \alpha\big| = 0,$$
where $\Phi^{-1}(\alpha)$ denotes the $\alpha$-th quantile of $N(0,1)$. Moreover,
$$\lim_{n\to\infty}\ \sup_{\xi\in\mathbf{R}^T} \mathrm{IP}\big\{|L(\hat C_N, \xi) - 4\sigma^2| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0.$$
In what follows we shall see that confidence sets centered at a good modulation estimator $\hat f X$ dominate the naive confidence set $\hat C_N$ in terms of the loss $L(\hat C, \xi)$.

To construct these confidence sets, we first determine the asymptotic distribution of
$$\hat d = \hat d_n := n^{1/2}\big[L(\hat f X, \xi) - \hat R_C(\hat f)\big].$$
This difference compares the loss of $\hat f X$ with an estimate for the expected loss of $\hat f X$.
THEOREM 3.1. Under Conditions (3.2) and (3.3),
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} m\big[\mathcal{L}(\hat d),\ N(0, \varsigma^2)\big] = 0$$
for arbitrary $c, \sigma^2 > 0$, where
$$\varsigma^2 = \varsigma_n^2(\xi, \sigma^2) := 2\sigma^4\,\mathrm{ave}[(2\tilde f - 1)^2] + \tau^2\sigma^4\,[\mathrm{ave}(2\tilde f - 1)]^2 + 4\sigma^2\,\mathrm{ave}[\xi^2(1 - \tilde f)^2].$$
A consistent estimator $\hat\varsigma^2 = \hat\varsigma_n^2$ of $\varsigma^2$ is obtained by substituting $\hat\sigma^2$ for $\sigma^2$, $\hat f$ for $\tilde f$ and $X^2 - \hat\sigma^2$ for $\xi^2$ in the expression for $\varsigma^2$. The implied estimator of the approximating normal distribution $N(0, \varsigma^2)$ is $N(0, \hat\varsigma^2)$. This leads to the following definition of a confidence ball for $\xi$ that is centered at the modulation estimator $\hat f X$:
$$\hat C := \big\{\xi \in \mathbf{R}^T : L(\hat f X, \xi) \le \hat R_C(\hat f) + n^{-1/2}\hat\varsigma\,\Phi^{-1}(\alpha)\big\}.$$
The intended coverage probability of $\hat C$ is $\alpha$. The next theorem establishes asymptotic properties of this confidence set construction. Beran (1994) treats in detail the example where $\hat f X$ is the James-Stein estimator. That situation is much easier to analyze than the general case.
THEOREM 3.2. Under the conditions of Theorem 3.1, for arbitrary $c, \sigma^2 > 0$,
$$\lim_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|L(\hat C, \xi) - 4R(\tilde f, \xi, \sigma^2)| \ge K n^{-1/2}\big\} = 0$$
and
$$\lim_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|\hat r^2 - R(\tilde f, \xi, \sigma^2)| \ge K n^{-1/2}\big\} = 0.$$
Moreover, $\hat\varsigma^2$ is consistent in that
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|\hat\varsigma^2 - \varsigma^2| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0.$$
If
$$\liminf_{n\to\infty}\ \inf_{\mathrm{ave}(\xi^2)\le c} \varsigma_n^2(\xi, \sigma^2) > 0, \qquad (3.4)$$
then
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \big|\mathrm{IP}\{\xi \in \hat C\} - \alpha\big| = 0.$$
A sufficient condition for (3.4) is the following: For every $n$, $\mathcal{F} = \mathcal{F}_n$ is such that
$$1\{f \ge c\}\, f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F} \text{ and } c \in [0,1]. \qquad (3.5)$$
Condition (3.4) ensures that $\mathcal{L}(\hat d)$ does not approach a degenerate distribution. Note that Condition (3.5) is satisfied in Examples 1-4. When $R(\tilde f, \xi, \sigma^2) = O(n^{-1/2})$, our confidence ball has loss $L(\hat C, \xi) = O_p(n^{-1/2})$. In fact, according to Theorem 2.1 of Li (1989) this is the smallest possible order of magnitude for a Euclidean confidence ball, unless one imposes further constraints on the signal. The result of Theorem 3.2 on asymptotic coverage of $\hat C$ may be compared with the lower bound in Theorem 3.2 of Li (1989).
A key step in the proof of Theorem 3.1 is that in the definition of $\hat d$ one may replace $\tilde f$ with $\hat f$. Instead of the normal approximation underlying $\hat C$, a bootstrap approximation of $H = H_n := \mathcal{L}(\hat d)$ that imitates the estimation of $\tilde f$ seems to be more reliable in moderate dimensions. Precisely, let $\hat H = \hat H_n$ be the conditional distribution (function) of $\hat d^*$ given $(X, \hat\sigma^2)$, where $\hat d^*$ is computed as $\hat d$ with the pair $(X^*, \hat\sigma^{*2})$ in place of $(X, \hat\sigma^2)$. More precisely, let $\hat\xi = \hat\xi(\cdot \mid X, \hat\sigma^2)$ be an estimator for $\xi$. Let $S_n^2$ be a random variable with a specified distribution depending only on $n$ such that
$$\lim_{n\to\infty} m\big[\mathcal{L}\{n^{1/2}(S_n^2 - 1)\},\ N(0, \tau^2)\big] = 0,$$
where $S_n^2$ and $(X, \hat\sigma^2)$ are independent. Then
$$\mathcal{L}(X^*, \hat\sigma^{*2} \mid X, \hat\sigma^2) \;=\; N(\hat\xi, \hat\sigma^2 I) \otimes \mathcal{L}(\hat\sigma^2 S_n^2 \mid X, \hat\sigma^2),$$
the product of the probability measures $N(\hat\xi, \hat\sigma^2 I)$ and $\mathcal{L}(\hat\sigma^2 S_n^2 \mid X, \hat\sigma^2)$. The resulting bootstrap confidence bound $\hat r_b(\alpha)$ for $L(\hat f X, \xi)$ is given by
$$\hat r_b^2(\alpha) = \hat R(\hat f) + n^{-1/2}\hat H^{-1}(\alpha).$$
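A rough Monte Carlo sketch of this bootstrap bound (ours) is given below; the pilot estimator $\hat\xi$ and the law of $S_n^2$ are left to the caller, since Theorem 3.3 below shows that naive choices such as $\hat\xi = X$ or $\hat\xi = \hat f X$ are not adequate here.

```python
import numpy as np

def bootstrap_radius_sq(x, sigma2_hat, xi_pilot, draw_S2, fit_modulator,
                        alpha=0.95, n_boot=500, rng=None):
    """Bootstrap upper bound r_b^2(alpha) = R_C-hat(f-hat) + n^{-1/2} * H-hat^{-1}(alpha)."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    f_hat = fit_modulator(x, sigma2_hat)
    d_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = xi_pilot + np.sqrt(sigma2_hat) * rng.normal(size=n)   # X* ~ N(xi-hat, sigma2_hat I)
        s2_star = sigma2_hat * draw_S2(rng)                            # sigma*^2 = sigma2_hat * S_n^2
        f_star = fit_modulator(x_star, s2_star)
        loss_star = np.mean((f_star * x_star - xi_pilot) ** 2)
        rc_star = np.mean(f_star ** 2 * s2_star + (1 - f_star) ** 2 * (x_star ** 2 - s2_star))
        d_star[b] = np.sqrt(n) * (loss_star - rc_star)                 # bootstrap version of d-hat
    rc_hat = np.mean(f_hat ** 2 * sigma2_hat + (1 - f_hat) ** 2 * (x ** 2 - sigma2_hat))
    return rc_hat + np.quantile(d_star, alpha) / np.sqrt(n)
```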
The last theorem of this section states conditions under which $\hat H$ is a consistent estimator for $H$. An interesting fact is that neither $\hat\xi = X$ nor $\hat\xi = \hat f X$ satisfies these conditions.

THEOREM 3.3. Under the assumptions of Theorem 3.1,
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{m(\hat H_n, H_n) > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0,$$
provided that
$$\hat f = \mathop{\mathrm{argmin}}_{f\in\mathcal{F}} R(f, \hat\xi, \hat\sigma^2) \quad \text{almost surely,} \qquad (3.6)$$
$$\limsup_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\{\mathrm{ave}(\hat\xi^2) > K\} = 0, \qquad (3.7)$$
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{\big|\mathrm{ave}[\hat\xi^2(1-\hat f)^2] - \mathrm{ave}[\xi^2(1-\tilde f)^2]\big| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0. \qquad (3.8)$$
In particular, suppose that each $\mathcal{F}_n$ has the following property: For all