Modulation Estimators and Confidence Sets
Rudolf Beran and Lutz Dümbgen
University of California, Berkeley, and Universität Heidelberg
January 1996, Revised August 1997

Abstract. An unknown signal plus white noise is observed at $n$ discrete time points. Within a large convex class of linear estimators of $\xi$, we choose the estimator $\hat\xi$ that minimizes estimated quadratic risk. By construction, $\hat\xi$ is nonlinear. This estimation is done after orthogonal transformation of the data to a reasonable coordinate system. The procedure adaptively tapers the coefficients of the transformed data. If the class of candidate estimators satisfies a uniform entropy condition, then $\hat\xi$ is asymptotically minimax in Pinsker's sense over certain ellipsoids in the parameter space and shares one such asymptotic minimax property with the James-Stein estimator. We describe computational algorithms for $\hat\xi$ and construct confidence sets for the unknown signal. These confidence sets are centered at $\hat\xi$, have correct asymptotic coverage probability, and have relatively small risk as set-valued estimators of $\xi$.

AMS 1991 subject classifications. Primary 62H12; secondary 62M10.
Key words and phrases. Adaptivity, asymptotic minimax, bootstrap, bounded variation, coverage probability, isotonic regression, orthogonal transformation, signal recovery, Stein's unbiased estimator of risk, tapering.
Research supported in part by National Science Foundation Grant DMS95-30492 and in part by Sonderforschungsbereich 373 at Humboldt-Universität zu Berlin.
Research supported in part by European Union Human Capital and Mobility Program ERB CHRX-CT 940693.
1 Introduction
The problem of recovering a signal from observation of the signal plus noise may be formulated as follows. Let $X = X_n = [X(t)]_{t \in T}$ be a random function observed on the set $T = T_n = \{1, 2, \ldots, n\}$. The components $X(t)$ are independent with $\mathrm{IE}[X(t)] = \xi(t) = \xi_n(t)$ and $\mathrm{Var}[X(t)] = \sigma^2$ for every $t \in T$. Working with functions on $T$ rather than vectors in $\mathbf{R}^n$ is very convenient for the present purposes. As just indicated, we will usually drop the subscript $n$ for notational simplicity. The signal $\xi$ and the noise variance $\sigma^2$ are both unknown. For simplicity we assume throughout that $X$ is Gaussian. Portions of the argument that hold for non-Gaussian $X$ are expressed by the lemmas in Section 6.2.

For any $g \in \mathbf{R}^T$, the space of real-valued functions defined on $T$, let
$$\mathrm{ave}(g) := n^{-1} \sum_{t \in T} g(t).$$
The loss of any estimator $\hat\xi$ for $\xi$ is defined to be
$$L(\hat\xi, \xi) := \mathrm{ave}[(\hat\xi - \xi)^2] \qquad (1.1)$$
and the corresponding risk of $\hat\xi$ is
$$\rho(\hat\xi, \xi, \sigma^2) := \mathrm{IE}\, L(\hat\xi, \xi).$$
The first goal is to devise an estimator that is efficient in terms of this risk. If $\xi$ and $X$ are electrical voltages, then $\mathrm{ave}(\xi^2)$ and $L(\hat\xi, \xi)$ are the time-averaged powers dissipated in passing the signal $\xi$ and the error $\hat\xi - \xi$ through a unit resistance.

Any estimator $\hat\xi$ of $\xi$ is governed by the asymptotic minimax bound
$$\liminf_{n\to\infty}\ \inf_{\hat\xi}\ \sup_{\mathrm{ave}(\xi^2)\le c} \rho(\hat\xi, \xi, \sigma^2) \;\ge\; \frac{\sigma^2 c}{\sigma^2 + c} \qquad (1.2)$$
for every positive $c$ and $\sigma^2$. Inequality (1.2) follows from a more general bound proved by Pinsker (1980) for signal recovery in Gaussian noise (see Nussbaum 1996 and Section 2). It may also be derived from ideas in Stein (1956) by considering best orthogonally equivariant estimators in the submodel where $\mathrm{ave}(\xi^2) = c$ (see Beran 1996b).
Let $\hat\sigma^2 = \hat\sigma^2_n$ be an estimator of $\sigma^2$ that is consistent as in display (2.2) of Section 2. Then
$$\hat\xi_S := [1 - \hat\sigma^2/\mathrm{ave}(X^2)]_+\, X$$
is essentially the James-Stein (1961) estimator, where $[\cdot]_+$ denotes the positive-part function. It achieves the Pinsker bound (1.2) because
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \rho(\hat\xi_S, \xi, \sigma^2) = \frac{\sigma^2 c}{\sigma^2 + c} \qquad (1.3)$$
for every positive $c$ and $\sigma^2$. The limit (1.3) follows from Corollary 2.3 or from asymptotics in Casella and Hwang (1982). For the maximum likelihood estimator $\hat\xi_{ML} = X$, the risk is always $\sigma^2$, which is strictly greater than the Pinsker bound.
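For concreteness, here is a minimal Python sketch (ours, not code from the paper) of the positive-part James-Stein rule $\hat\xi_S$ above; the test signal, the seed and the helper name `james_stein` are illustrative assumptions, and the noise variance is estimated by the first-difference formula that reappears as (2.4) in Section 2.

```python
import numpy as np

def james_stein(x, sigma2_hat):
    """Positive-part James-Stein estimator [1 - sigma2_hat/ave(x^2)]_+ * x."""
    ave_x2 = np.mean(x ** 2)
    shrink = max(1.0 - sigma2_hat / ave_x2, 0.0)  # scalar modulator in [0, 1]
    return shrink * x

# Toy illustration: signal plus Gaussian noise on T = {1, ..., n}.
rng = np.random.default_rng(0)
n, sigma = 500, 1.0
xi = np.exp(-np.arange(n) / 100.0)                      # an arbitrary decaying signal
x = xi + sigma * rng.normal(size=n)
sigma2_hat = np.sum(np.diff(x) ** 2) / (2 * (n - 1))    # first-difference estimator, cf. (2.4)
xi_hat = james_stein(x, sigma2_hat)
print(np.mean((xi_hat - xi) ** 2), np.mean((x - xi) ** 2))  # losses L(xi_hat, xi) and L(X, xi)
```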
Section 2 of this paper constructs estimators of $\xi$ that are asymptotically minimax over a variety of ellipsoids in the parameter space while achieving, in particular, the asymptotic minimax bound (1.2) for every $c > 0$. These modulation estimators take the form $\hat f X = [\hat f(t)X(t)]_{t\in T}$. Here $\hat f : T \to [0,1]$ depends on $X$ and is chosen to minimize the estimated risk of the linear estimator $fX$ over all functions $f$ in a class $\mathcal{F} = \mathcal{F}_n \subset [0,1]^T$. Many well-known estimators are of this form with special classes $\mathcal{F}$. In the present paper we analyze such estimators under rather general assumptions on $\mathcal{F}$. How large this class may be is at the heart of the analysis. Taking $\mathcal{F}$ to be the set of all functions from $T$ to $[0,1]$ leads to a poor modulation estimator. A successful choice is to let $\mathcal{F}$ be a closed convex set of functions with well-behaved uniform covering numbers. One example is the set of all functions in $[0,1]^T$ that are nonincreasing. The asymptotic theory of such modulation estimators, including links with the literature, is the subject of Section 2. Section 4 develops algorithms for computing $\hat f X$ in the example of $\mathcal{F}$ just cited.

Section 3 constructs confidence sets that are centered at a modulation estimator $\hat f X$ and have asymptotic coverage probability $\alpha$ for $\xi$. The risk of the modulation estimator at the center is shown to determine the risk of the confidence set, when that is viewed as a set-valued estimator for $\xi$. In this manner, efficiency of a modulation estimator determines the efficiency of the associated confidence set.

Before estimation of $\xi$, the data $X$ may be transformed orthogonally without changing its Gaussian character. A modulation estimator computed in the new coordinate system can be transformed back into the original coordinate system to yield an estimator of $\xi$. Standard choices for such a preliminary orthogonal transformation include Fourier transforms, wavelet transforms, or analysis-of-variance transforms. When applied in this manner, modulation estimators perform data-driven tapering of empirical Fourier, wavelet or analysis-of-variance coefficients. Section 5 includes numerical examples of modulation estimators and confidence bounds after Fourier transformation.
2 Modulation estimators
After defining modulation estimators, this section obtains uniform asymptotic approximations to their risks. Let $\mathcal{F} = \mathcal{F}_n$ be a given subset of $[0,1]^T$. Each function $f \in \mathcal{F}$ is called a modulator and defines a candidate linear estimator $fX = [f(t)X(t)]_{t\in T}$ for $\xi$. The risk of this candidate estimator under quadratic loss (1.1) is
$$\rho(fX, \xi, \sigma^2) = \mathrm{IE}\, L(fX, \xi) = \mathrm{ave}[\sigma^2 f^2 + \xi^2(1-f)^2]. \qquad (2.1)$$
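As a small companion to (2.1), the following helper (our illustration, not the paper's code) evaluates the exact risk of a candidate modulator when $\xi$ and $\sigma^2$ are known.

```python
import numpy as np

def exact_risk(f, xi, sigma2):
    """R(f, xi, sigma^2) = ave[sigma^2 f^2 + xi^2 (1 - f)^2], cf. (2.1)."""
    return np.mean(sigma2 * f ** 2 + xi ** 2 * (1 - f) ** 2)
```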
For brevity, we will write $R(f, \xi, \sigma^2)$ in place of $\rho(fX, \xi, \sigma^2)$.

We will first construct a suitably consistent estimator $\hat R(f)$ of this risk. Suppose that $\hat\sigma^2 = \hat\sigma_n^2$ is an estimator of $\sigma^2$, constructed (for instance) by one of the methods described later. Let $X^*$ be a bootstrap random vector in $\mathbf{R}^T$ such that $\mathcal{L}(X^* \mid X, \hat\sigma^2) = N_T(X, \hat\sigma^2 I)$. The corresponding bootstrap risk estimator for $R(f, \xi, \sigma^2)$ is
$$\mathrm{IE}[L(fX^*, X) \mid X, \hat\sigma^2] = R(f, X, \hat\sigma^2).$$
We call $R(f, X, \hat\sigma^2)$ the naive risk estimator because it is badly biased upwards, even asymptotically. The key point is
$$\mathrm{IE}\, R(f, X, \sigma^2) = \mathrm{ave}[f^2\sigma^2 + (1-f)^2(\xi^2 + \sigma^2)] = R(f, \xi, \sigma^2) + \mathrm{ave}[(1-f)^2\sigma^2].$$
Two possible corrections to the naive risk estimator are
$$\hat R_C(f) := \mathrm{ave}[f^2\hat\sigma^2 + (1-f)^2(X^2 - \hat\sigma^2)] = R(f, X, \hat\sigma^2) - \mathrm{ave}[(1-f)^2\hat\sigma^2],$$
$$\hat R_B(f) := \max\big\{\mathrm{ave}(f^2\hat\sigma^2),\ \hat R_C(f)\big\} = \mathrm{ave}(f^2\hat\sigma^2) + \big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+.$$
Risk estimator $\hat R_C$ is essentially Mallows' (1973) $C_L$ criterion or Stein's (1981) unbiased estimator of risk, with estimation of $\sigma^2$ incorporated. Risk estimator $\hat R_B$ corrects the possible negativity of $\mathrm{ave}[(1-f)^2(X^2 - \hat\sigma^2)]$ as an estimator for $\mathrm{ave}[(1-f)^2\xi^2]$.
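To show how these formulas are used to select a modulator, here is a short sketch (ours; the helper names are illustrative) that evaluates $\hat R_C$ and $\hat R_B$ and, as a toy example, minimizes $\hat R_C$ over the nested model-selection modulators that reappear in Example 3 below.

```python
import numpy as np

def risk_C(f, x, sigma2_hat):
    """Mallows/Stein-type corrected risk estimate R_C-hat(f)."""
    return np.mean(f ** 2 * sigma2_hat + (1 - f) ** 2 * (x ** 2 - sigma2_hat))

def risk_B(f, x, sigma2_hat):
    """R_B-hat(f) = max{ ave(f^2 sigma2_hat), R_C-hat(f) }."""
    return max(np.mean(f ** 2) * sigma2_hat, risk_C(f, x, sigma2_hat))

def best_nested_modulator(x, sigma2_hat):
    """Minimize R_C-hat over the nested modulators f_k(t) = 1{t <= k}, k = 0, ..., n."""
    n = len(x)
    best_k, best_val = 0, np.inf
    for k in range(n + 1):
        f = (np.arange(1, n + 1) <= k).astype(float)
        val = risk_C(f, x, sigma2_hat)
        if val < best_val:
            best_k, best_val = k, val
    return best_k
```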
Let $X^{**}$ be a random vector in $\mathbf{R}^T$ such that $\mathcal{L}(X^{**} \mid X, \hat\sigma^2)$ is $N_T(\check\xi, \hat\sigma^2 I)$, where $\check\xi = \check\xi(X, \hat\sigma^2)$ is a vector such that
$$\mathrm{ave}[(1-f)^2\check\xi^2] = \big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+,$$
for instance $\check\xi^2 := X^2\,\big[\mathrm{ave}\{(1-f)^2(X^2 - \hat\sigma^2)\}\big]_+ \big/ \mathrm{ave}[(1-f)^2 X^2]$ componentwise. Then the bootstrap risk estimator $\mathrm{IE}[L(fX^{**}, \check\xi) \mid X, \hat\sigma^2]$ is precisely $\hat R_B$.
Let $\hat R$ denote either $\hat R_C$ or $\hat R_B$. We propose to estimate $\xi$ by the modulation estimator $\hat f X$, where $\hat f$ is any function in $\mathcal{F}$ that minimizes $\hat R(f)$. Unless stated otherwise it is assumed throughout that $\mathcal{F}$ is a closed convex subset of $[0,1]^T$ containing all constants $c \in [0,1]$.

Because both $\hat R_C(\cdot)$ and $\hat R_B(\cdot)$ are convex functions on $[0,1]^T$, the minimizer $\hat f$ over $\mathcal{F}$ exists in each case. These minimizers are unique with probability one because $\hat R_C(f)$ is strictly convex in $f$ whenever $X(t) \neq 0$ for every $t \in T$. Similarly, the risk function $R(f, \xi, \sigma^2)$ defined through (2.1) is strictly convex over $[0,1]^T$, with unique minimizer $\tilde f$.
REMARK A. The modulation estimator $\hat f X$ behaves poorly when the class $\mathcal{F}$ is too large. For instance, let $\mathcal{F}$ be the class of all functions in $[0,1]^T$. The minimizer of $R(\cdot, \xi, \sigma^2)$ over $[0,1]^T$ is the "oracle" modulator (cf. Donoho and Johnstone 1994)
$$\tilde g := \xi^2/(\xi^2 + \sigma^2),$$
the division being componentwise, while the minimizer of $\hat R(\cdot)$ over $\mathcal{F}$ is now the greedy modulator $\hat g_+$, where
$$\hat g := (X^2 - \hat\sigma^2)/X^2.$$
To simplify the discussion, suppose that $\sigma^2$ is known and $\hat\sigma^2 \equiv \sigma^2$. Then the estimator $\hat g_+ X$ is of the general form $\hat\xi := [S(X(t))]_{t\in T}$ for some measurable function $S$ on the line. Since the maximum likelihood estimator $X$ is componentwise admissible, the risk function $\rho(\hat\xi, \xi, \sigma^2)$ of $\hat\xi$ is either identical to $\rho(X, \xi, \sigma^2) \equiv \sigma^2$, or there is a real number $\mu$ such that $\int (\mu - S)^2\, dN(\mu, \sigma^2) > \sigma^2$. Then, if $\xi(\cdot) \equiv \mu$,
$$\rho(\hat\xi, \xi, \sigma^2) > \sigma^2 = \rho(X, \xi, \sigma^2) > \sigma^2\mu^2/(\sigma^2 + \mu^2),$$
the latter being the asymptotic risk of the James-Stein estimator $\hat\xi_S$. Thus, the maximum risk of $\hat g_+ X$ is worse than that of estimators achieving Pinsker's asymptotic minimax bound (1.2) and is even worse than that of the naive estimator $X$.

It should be mentioned that greedy modulation can be made successful in some sense if one overestimates the variance $\sigma^2$ systematically. Donoho and Johnstone (1994) propose threshold estimators of the form $\hat\xi = (1 - \lambda_n\sigma/|X|)_+ X$ or $\hat\xi = 1\{|X| \ge \lambda_n\sigma\} X$, and prove that they have surprising optimality properties if $\lambda_n = (2\log n)^{1/2}(1 + \epsilon_n)$ with a suitable sequence $(\epsilon_n)_n$ tending to zero. These estimators are similar to $\hat g_+ X$ if $\hat g$ is computed with $\hat\sigma_n^2 := \lambda_n^2\sigma^2$. While showing good performance in case of "sparse signals", these estimators do not achieve the Pinsker bound (1.2) or the minimax bounds in Corollary 2.3 below. Also, the construction of confidence bounds for their loss seems to be intractable. Section 5 illustrates the possibly poor performance of hard thresholding for non-sparse signals.
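The contrast drawn in this remark is easy to reproduce numerically. The sketch below (our illustration; the sparse test signal is an assumption) computes the oracle modulator, the greedy modulator, and the greedy modulator with inflated variance, and prints their realized losses.

```python
import numpy as np

def oracle_modulator(xi, sigma2):
    """g-tilde = xi^2 / (xi^2 + sigma^2), componentwise (requires the true signal)."""
    return xi ** 2 / (xi ** 2 + sigma2)

def greedy_modulator(x, sigma2_hat):
    """g-hat_+ = [(X^2 - sigma2_hat) / X^2]_+, componentwise."""
    return np.clip(1.0 - sigma2_hat / x ** 2, 0.0, 1.0)

rng = np.random.default_rng(1)
n, sigma = 1000, 1.0
xi = np.zeros(n); xi[:20] = 5.0                        # a sparse signal
x = xi + sigma * rng.normal(size=n)
lam = np.sqrt(2 * np.log(n))                           # lambda_n with epsilon_n = 0
for g in (oracle_modulator(xi, sigma ** 2),
          greedy_modulator(x, sigma ** 2),             # greedy: keeps much pure noise
          greedy_modulator(x, (lam * sigma) ** 2)):    # inflated variance ~ thresholding
    print(np.mean((g * x - xi) ** 2))                  # realized loss L(gX, xi)
```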
REMARK B. Kneip's (1994) ordered linear smoothers are equivalent to certain modulation estimators computed after suitable orthogonal transformation of $X$. The conditions that we impose on $\mathcal{F}$ in this paper are substantially weaker than the ordering of $\mathcal{F}$ required by Kneip. Consequently, our results also apply to the ridge regression, spline estimation, and kernel estimation examples discussed in Kneip's paper. The earlier paper of Li (1987) treated non-diagonal linear estimators indexed by a parameter $h$. Li's optimality result may be compared with Theorem 2.1 below. However, it does not seem easy to relate Li's conditions on the range of $h$ to our conditions on $\mathcal{F}$. The latter conditions give access to empirical process results that yield asymptotic distributions for the loss of $\hat f X$ and hence confidence sets for $\xi$ centered at modulation estimators.

REMARK C. Nussbaum (1996) surveyed constructions of adaptive estimators that achieve Pinsker-type asymptotic minimax bounds. For instance, Golubev and Nussbaum (1992) treated adaptive, asymptotically minimax estimation when $\xi_i = g(x_i)$ and $g$ lies in an ellipsoid of unknown radius within a Sobolev space of unknown order. Corollary 2.3 below is of related character. However, our results make no smoothness assumptions on $\xi$. For instance, sample paths up to time $n$ of suitably scaled, discrete-time, independent white noise ultimately lie, as $n \to \infty$, within the ball $\mathrm{ave}(\xi^2) \le c$.
Useful classes of modulators $\mathcal{F}$ can be characterized through their uniform covering numbers, which are defined as follows. For any probability measure $Q$ on $T$, consider the pseudo-distance $d_Q(f, g)^2 := \int (f-g)^2\, dQ$ on $[0,1]^T$. For every positive $u$, let
$$N(u, \mathcal{F}, d_Q) := \min\Big\{\#\mathcal{F}_o : \mathcal{F}_o \subset \mathcal{F},\ \inf_{f_o \in \mathcal{F}_o} d_Q(f_o, f) \le u\ \ \forall\, f \in \mathcal{F}\Big\}.$$
Define the uniform covering number $N(u, \mathcal{F}) := \sup_Q N(u, \mathcal{F}, d_Q)$, where the supremum is taken over all probabilities on $T$. Let
$$J(\mathcal{F}) := \int_0^1 \sqrt{\log N(u, \mathcal{F})}\; du.$$
Throughout, $C$ denotes a generic universal real constant which does not depend on $n$, $\xi$, $\sigma^2$ or $\mathcal{F}$, but whose value may be different in various places.

THEOREM 2.1. Let $\mathcal{F}$ be any closed subset of $[0,1]^T$ containing $0$, let $\tilde f$ be a minimizer of $R(f, \xi, \sigma^2)$ over $f \in \mathcal{F}$, and let $\hat f$ minimize either $\hat R_C(f)$ or $\hat R_B(f)$ over $f \in \mathcal{F}$. Then
$$\mathrm{IE}\,\hat G - R(\tilde f, \xi, \sigma^2) \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|,$$
where $\hat G$ is any one of the following quantities: $L(\hat f X, \xi)$, $\inf_{f\in\mathcal{F}} L(fX, \xi)$, $\hat R_C(\hat f)$, $\hat R_B(\hat f)$. In particular,
$$\rho(\hat f X, \xi, \sigma^2) - R(\tilde f, \xi, \sigma^2) \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|.$$
This theorem is about convergence of losses and risks. The next result uses convexity of $\mathcal{F}$ to establish that $\hat f$ and $\tilde f$, as well as $\hat f X$ and $\tilde f X$, converge to one another. Note that the second bound holds uniformly in $\xi \in \mathbf{R}^T$.

THEOREM 2.2. Let $\hat f$ be the minimizer of $\hat R_C$. Then
$$\mathrm{IE}\,\mathrm{ave}[(\xi^2 + \sigma^2)(\hat f - \tilde f)^2] \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2 + \sigma\sqrt{\mathrm{ave}(\xi^2)}}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|,$$
$$\mathrm{IE}\,\mathrm{ave}[(\hat f X - \tilde f X)^2] \;\le\; C\, J(\mathcal{F})\, \frac{\sigma^2}{\sqrt{n}} \;+\; \mathrm{IE}|\hat\sigma^2 - \sigma^2|.$$
Given consistency of $\hat\sigma^2$ and boundedness of $\sigma^2 + \mathrm{ave}(\xi^2)$, a key assumption on $\mathcal{F}$ that ensures success of the modulation estimator $\hat f X$ defined above is that $J(\mathcal{F}) = o(n^{1/2})$.

Here are some examples of modulator classes $\mathcal{F}$ to which Theorem 2.1 applies.
EXAMPLE 1 (Stein shrinkage). Suppose that $\mathcal{F}$ consists of all constant functions in $[0,1]^T$. The minimizer over $\mathcal{F}$ of $R(f, \xi, \sigma^2)$ is
$$\tilde f \equiv \tilde f_S \equiv 1 - \sigma^2/[\sigma^2 + \mathrm{ave}(\xi^2)].$$
The minimizer of both $\hat R_C$ and $\hat R_B$ is
$$\hat f \equiv \hat f_S \equiv [1 - \hat\sigma^2/\mathrm{ave}(X^2)]_+.$$
The resulting modulation estimator $\hat f_S X$ is the (modified) James-Stein estimator $\hat\xi_S$ of Section 1. Here one easily shows that $N(u, \mathcal{F}) \le 1 + (2u)^{-1}$, whence $J(\mathcal{F})$ is bounded by a universal constant.
EXAMPLE 2 (Multiple Stein shrinkage). Let $\mathcal{B} = \mathcal{B}_n$ be a partition of $T$ and define
$$\mathcal{F} := \Big\{\sum_{B \in \mathcal{B}} 1_B\, c(B) : c \in [0,1]^{\mathcal{B}}\Big\},$$
where $1_B$ is the indicator function of $B$. The values of $c(B)$ that define $\tilde f$ and $\hat f$, respectively, are
$$\tilde c(B) = \mathrm{ave}(1_B \xi^2)\big/\mathrm{ave}[1_B(\xi^2 + \sigma^2)], \qquad \hat c(B) = \big[\mathrm{ave}\{1_B(X^2 - \hat\sigma^2)\}\big]_+\big/\mathrm{ave}(1_B X^2).$$
The modulation estimator $\hat f X$ now has the asymptotic form of the multiple shrinkage estimator in Stein (1966). Elementary calculations show that $N(u, \mathcal{F}) \le [1 + (2u)^{-1}]^{\#\mathcal{B}}$. Thus $J(\mathcal{F})$ is bounded by a universal constant times $(\#\mathcal{B})^{1/2}$, so that $J(\mathcal{F}) = o(n^{1/2})$ follows from the intuitively appealing condition $\#\mathcal{B} = o(n)$.
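A minimal sketch of this blockwise rule (ours; the dyadic block layout below is just one possible choice, not prescribed by the paper) follows.

```python
import numpy as np

def multiple_stein(x, sigma2_hat, blocks):
    """Blockwise modulator c-hat(B) = [ave_B(X^2 - sigma2_hat)]_+ / ave_B(X^2), applied to X."""
    f = np.empty_like(x)
    for block in blocks:                      # `blocks` partitions the index set {0, ..., n-1}
        x2 = x[block] ** 2
        f[block] = max(np.mean(x2) - sigma2_hat, 0.0) / np.mean(x2)
    return f * x

# Example: dyadic blocks, e.g. for empirical Fourier or wavelet coefficients.
n = 1024
edges = [0] + [2 ** j for j in range(1, int(np.log2(n)) + 1)]
blocks = [np.arange(a, b) for a, b in zip(edges[:-1], edges[1:])]
```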
EXAMPLE 3 (Monotone shrinkage). Let $\mathcal{F}_{\mathrm{mon}}$ be the set of all nonincreasing functions in $[0,1]^T$. The class of candidate estimators $\{fX : f \in \mathcal{F}_{\mathrm{mon}}\}$ includes the nested model-selection estimators $f_k X$, $0 \le k \le n$, defined by $f_k(t) := 1\{t \le k\}$. In fact, $\mathcal{F}_{\mathrm{mon}}$ is the convex hull of $\mathcal{D}_{MS} := \{f_0, f_1, \ldots, f_n\}$. Elementary calculations show that
$$N(u, \mathcal{D}_{MS}) \le 1 + u^{-2} \le 2u^{-2} \quad \text{for } 0 < u \le 1.$$
Together with Theorem 5.1 of Dudley (1987) it follows that
$$\log N(u, \mathcal{F}_{\mathrm{mon}}) \le C u^{-1} \quad \text{for all } u \in\; ]0,1].$$
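Section 4 develops the actual algorithms for this example. As a rough illustration only: since $\hat R_C(f) = \mathrm{ave}[X^2(f - \hat g)^2] + \mathrm{const}$ with $\hat g := (X^2 - \hat\sigma^2)/X^2$, the minimizer over $\mathcal{F}_{\mathrm{mon}}$ is a weighted least-squares fit of $\hat g$ by a nonincreasing sequence, truncated to $[0,1]$. The pooled-adjacent-violators routine below is our own generic sketch, not the paper's implementation.

```python
import numpy as np

def pava_nonincreasing(y, w):
    """Weighted least-squares fit of a nonincreasing sequence to y (pool adjacent violators)."""
    y, w = y[::-1], w[::-1]            # fit a nondecreasing sequence to the reversed data
    vals, wts, cnts = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:   # pool blocks violating monotonicity
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / (wts[-2] + wts[-1])
            wt, c = wts[-2] + wts[-1], cnts[-2] + cnts[-1]
            vals, wts, cnts = vals[:-2] + [v], wts[:-2] + [wt], cnts[:-2] + [c]
    return np.repeat(vals, cnts)[::-1]

def monotone_modulator(x, sigma2_hat):
    """Minimize R_C-hat over nonincreasing f in [0,1]^T (Example 3)."""
    g_hat = (x ** 2 - sigma2_hat) / x ** 2         # unconstrained minimizer, weights X^2
    f_hat = pava_nonincreasing(g_hat, x ** 2)
    return np.clip(f_hat, 0.0, 1.0)                # truncation to [0,1] preserves optimality
```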
EXAMPLE 4 (Monotone shrinkage with respect to a quasi-order). Let $\preceq$ be a quasi-order relation on $T$ (cf. Robertson et al. 1988, Chapter 1.3), and let $\mathcal{F}$ be the set of all functions in $[0,1]^T$ that are nonincreasing with respect to $\preceq$. That means, for all $f \in \mathcal{F}$ and $s, t \in T$,
$$f(s) \ge f(t) \quad \text{if } s \preceq t.$$
Here one can easily deduce from the conclusion of Example 3 that
$$\log N(u, \mathcal{F}) \le C N u^{-1} \quad \text{for } 0 < u \le 1,$$
where $N = N_n$ is the minimal cardinality of a partition of $(T, \preceq)$ into totally ordered subsets. Thus $J(\mathcal{F})$ is of order $O(N^{1/2})$. To give an example, suppose that $X$ consists of $n = 2^{k+1} - 1$ empirical Haar (or wavelet) coefficients, arranged as a binary tree. If this tree is equipped with its natural order $\preceq$, then the monotonicity constraint $\hat f \in \mathcal{F}$ means that $\hat f X$ is a mixture of histogram estimators (cf. Engel 1994). Here $N = 2^k > n/2$. Therefore, in order to apply our theory one has to replace the class $\mathcal{F}$ with suitable subclasses.
EXAMPLE 5 (Shrinkage with bounded total variation). Let $\mathcal{F}(M)$ be the set of all functions $f$ in $[0,1]^T$ with total variation not greater than $M = M_n$, i.e.
$$\sum_{t=2}^{n} |f(t) - f(t-1)| \le M.$$
For instance, the class of functions $f(t) := \max\{\min\{p(t), 1\}, 0\}$, where $p$ is a polynomial of degree less than or equal to $M$, belongs to $\mathcal{F}(M)$. Any $f \in \mathcal{F}(M)$ can be written as $(M+1)(f_1 - f_2)$ with $f_1, f_2 \in \mathcal{F}_{\mathrm{mon}}$. Hence
$$\log N(u, \mathcal{F}(M)) \le 2 \log N\big([2(M+1)]^{-1}u,\ \mathcal{F}_{\mathrm{mon}}\big) \le C(M+1)u^{-1} \quad \text{for } 0 < u \le 1.$$
In particular, $J(\mathcal{F}(M)) = O[(M+1)^{1/2}]$.

The minimizers $\tilde f$ and $\hat f$ in Examples 3-5 lack closed forms. Section 4 describes computational algorithms for $\tilde f$ and $\hat f$ in Examples 3-4. Example 5 differs from the remaining examples both theoretically and computationally and will be treated in detail elsewhere.
A particular consequence of Theorem 2.1 is that the modulation estimators are asymptotically minimax optimal for a large class of submodels for $(\xi, \sigma^2)$. Namely, for $a \in [1,\infty]^T$ and $c > 0$ define the linear minimax risk
$$\rho^*(a, c, \sigma^2) := \inf_{g \in [0,1]^T}\ \sup_{\mathrm{ave}(a\xi^2)\le c} R(g, \xi, \sigma^2).$$
It is shown by Pinsker (1980) that the linear minimax risk approximates the unrestricted minimax risk in that
$$\inf_{\hat\xi}\ \sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat\xi, \xi, \sigma^2)\Big/\rho^*(a, c, \sigma^2) \;\to\; 1 \quad \text{as } n\,\rho^*(a, c, \sigma^2) \to \infty.$$
Moreover,
$$\rho^*(a, c, \sigma^2) = \sup_{\mathrm{ave}(a\xi^2)\le c} R(g_o, \xi, \sigma^2) = R(g_o, \xi_o, \sigma^2),$$
where $g_o := [1 - (a/\lambda_o)^{1/2}]_+$, $\xi_o^2 := \sigma^2[(\lambda_o/a)^{1/2} - 1]_+$, and $\lambda_o > 0$ is the unique real number satisfying $\mathrm{ave}\big(a[(\lambda_o/a)^{1/2} - 1]_+\big) = c/\sigma^2$. The special case $a \equiv 1$ yields (1.2).

If the minimax modulator $g_o = g_o(\cdot \mid a, c/\sigma^2)$ happens to be in $\mathcal{F}$, which is certainly true for $a \equiv 1$, then
$$\sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat fX, \xi, \sigma^2) \;\le\; \sup_{\mathrm{ave}(a\xi^2)\le c} \big[\rho(\hat fX, \xi, \sigma^2) - R(\tilde f, \xi, \sigma^2)\big] + \rho^*(a, c, \sigma^2).$$
Thus Theorem 2.1 immediately implies the following minimax result, where the distribution of $(X, \hat\sigma^2)$ is assumed to depend on $(\xi, \sigma^2)$ only.
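To make Pinsker's solution concrete, the following sketch (ours) solves $\mathrm{ave}\big(a[(\lambda/a)^{1/2} - 1]_+\big) = c/\sigma^2$ for $\lambda_o$ by bisection and returns $g_o$; it assumes all entries of $a$ are finite.

```python
import numpy as np

def pinsker_modulator(a, c, sigma2, tol=1e-10):
    """Minimax modulator g_o = [1 - (a/lambda_o)^{1/2}]_+ for the ellipsoid ave(a*xi^2) <= c."""
    target = c / sigma2
    def h(lam):  # ave(a * [(lam/a)^{1/2} - 1]_+), continuous and nondecreasing in lam
        return np.mean(a * np.clip(np.sqrt(lam / a) - 1.0, 0.0, None))
    lo, hi = 0.0, 1.0
    while h(hi) < target:                      # bracket the root
        hi *= 2.0
    while hi - lo > tol * max(hi, 1.0):        # bisection for lambda_o
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < target else (lo, mid)
    lam_o = 0.5 * (lo + hi)
    return np.clip(1.0 - np.sqrt(a / lam_o), 0.0, 1.0)
```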
COROLLARY 2.3. Suppose that $J(\mathcal{F}) = o(n^{1/2})$, and that for every $c, \sigma^2 > 0$,
$$\gamma_n(c, \sigma^2) := \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IE}|\hat\sigma^2 - \sigma^2| \;\to\; 0 \qquad (n \to \infty). \qquad (2.2)$$
Then the modulation estimator $\hat f X$ achieves the asymptotic minimax bound (1.2). More generally, let $a = a_n \in [1,\infty]^T$ be such that
$$[1 - (a/\lambda)^{1/2}]_+ \in \mathcal{F} \quad \text{for all constants } \lambda \ge 1. \qquad (2.3)$$
Then for every $c, \sigma^2 > 0$,
$$\sup_{\mathrm{ave}(a\xi^2)\le c} \rho(\hat fX, \xi, \sigma^2) \;\le\; \rho^*(a, c, \sigma^2) + O\big[n^{-1/2}J(\mathcal{F}) + \gamma_n(c, \sigma^2)\big].$$

Specifically, let $a(t) = 1$ for $t \in A \subset T$ and $a(t) = \infty$ otherwise. Then $\mathrm{ave}(a\xi^2) \le c$ is equivalent to $\mathrm{ave}(\xi^2) \le c$ and $\xi = 0$ on $T \setminus A$. Here one can easily see that condition (2.3) is equivalent to $1_A \in \mathcal{F}$. The linear minimax risk equals
$$\rho^*(a, c, \sigma^2) = \frac{\sigma^2\,\mathrm{ave}(1_A)\, c}{\sigma^2\,\mathrm{ave}(1_A) + c},$$
which can be significantly smaller than the bound in (1.2). In case of $\mathcal{F} = \mathcal{F}_{\mathrm{mon}}$, condition (2.3) is equivalent to $a$ being nondecreasing on $T$.
We end this section with some examples for $\hat\sigma^2$. Internal estimators of $\sigma^2$ depend only on $X$ and require additional smoothness or dimensionality restrictions on the possible values of $\xi$ to achieve the consistency property (2.2). One internal estimator of $\sigma^2$, analyzed by Rice (1984) and by Gasser et al. (1986), is
$$\hat\sigma^2_{(1)} = [2(n-1)]^{-1}\sum_{t=2}^{n} [X(t) - X(t-1)]^2. \qquad (2.4)$$
Here $\mathrm{IE}|\hat\sigma^2_n - \sigma^2_n| \to 0$ as $n \to \infty$, provided that
$$n^{-1}\sum_{t=2}^{n} [\xi_n(t) - \xi_n(t-1)]^2 \;\to\; 0.$$
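In code, (2.4) is the short helper below (ours), already used informally in the sketch of Section 1.

```python
import numpy as np

def rice_variance(x):
    """First-difference estimator (2.4): sum of squared differences / (2(n-1))."""
    return np.sum(np.diff(x) ** 2) / (2 * (len(x) - 1))
```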
External estimators of variance are available in linear models, where one observes an $N$-dimensional normal random vector $Y$ with mean $\mathrm{IE}\,Y = D\beta$ and covariance matrix $\mathrm{Cov}(Y) = \sigma^2 I_N$ for some design matrix $D \in \mathbf{R}^{N\times n}$, $N = N_n > n$. After suitable linear transformation of $Y$ and $\beta$ one may assume that $\xi$ is the expectation of the vector $X := (Y_1, Y_2, \ldots, Y_n)$. Then the standard estimator for $\sigma^2$ is given by
$$\hat\sigma^2_{(2)} := (N - n)^{-1}\sum_{i=n+1}^{N} Y_i^2,$$
which is independent of $X$ with $(N-n)\,\sigma^{-2}\hat\sigma^2_{(2)} \sim \chi^2_{N-n}$. This estimator also satisfies (2.2), provided that $N - n \to \infty$.

3 Confidence sets
Having replaced the maximum likelihood estimator $X$ with $\hat f X$, a natural question is to what extent $\hat f X$ is closer to the unknown signal $\xi$ than $X$. More precisely, we want to compare the distance $L(X, \hat f X)^{1/2}$ with an upper confidence bound $\hat r = \hat r(X, \hat\sigma^2)$ for $L(\hat f X, \xi)^{1/2}$. In geometrical terms, the confidence ball of primary interest is defined by
$$\hat C = \hat C_n := \{\xi \in \mathbf{R}^T : L(\hat f X, \xi) \le \hat r^2\}.$$
The radius $\hat r$ is chosen so that the coverage probability $\mathrm{IP}(\xi \in \hat C)$ converges to $\alpha \in\; ]0,1[$ as $n$ increases. The full definition of $\hat C$ follows the theorem below. Underlying the construction is the confidence set idea sketched at the end of Stein (1981). The quality of $\hat C$ as a set-valued estimator of $\xi$ will be measured through the quadratic loss
$$L(\hat C, \xi) := \sup_{\eta \in \hat C} L(\eta, \xi) = \big[L(\hat f X, \xi)^{1/2} + \hat r\big]^2. \qquad (3.1)$$
This is a natural extension of the quadratic loss defined in (1.1) and has an appealing projection-pursuit interpretation; see Beran (1996a).
One main assumption for this section is that
$$X_n \text{ and } \hat\sigma^2_n \text{ are independent, with } \mathcal{L}(\sigma_n^{-2}\hat\sigma^2_n) \text{ depending only on } n, \qquad (3.2)$$
such that
$$\lim_{n\to\infty} m\big[\mathcal{L}\{n^{1/2}(\sigma^{-2}\hat\sigma^2_n - 1)\},\ N(0, \tau^2)\big] = 0.$$
Here $\tau^2 \ge 0$ is a given constant and $m(\cdot, \cdot)$ metrizes weak convergence of distributions on the line. For instance, the estimator $\hat\sigma^2_{(2)}$ of Section 2 satisfies Condition (3.2) with $\tau^2 := 2\lim_{n\to\infty} n/(N_n - n)$, provided that this limit exists. Condition (3.2) is made for the sake of simplicity. It could be replaced with weaker, but more technical, conditions in order to include special internal estimators of variance such as $\hat\sigma^2_{(1)}$.

A second key assumption is that
$$\int_0^1 \sqrt{\sup_n\, \log N(u, \mathcal{F}_n)}\; du \;<\; \infty. \qquad (3.3)$$
Roughly speaking, this condition allows us to pretend that $\hat f$ is equal to $\tilde f$. It is satisfied in all Examples 1-5, provided that $\#\mathcal{B}_n = O(1)$ in Example 2, $N_n = O(1)$ in Example 4, and $M_n = O(1)$ in Example 5.
At first let us consider confidence balls centered at the naive estimator $X$. Since $n\sigma^{-2}\,\mathrm{ave}[(X - \xi)^2]$ has a chi-squared distribution with $n$ degrees of freedom, we consider
$$\hat C_N := \big\{\xi \in \mathbf{R}^T : \mathrm{ave}[(X - \xi)^2] \le \hat\sigma^2(1 + n^{-1/2}c)\big\}$$
for some fixed $c$. The inequality $\mathrm{ave}[(X-\xi)^2] \le \hat\sigma^2(1 + n^{-1/2}c)$ is equivalent to
$$n^{1/2}\big[\sigma^{-2}\mathrm{ave}\{(X-\xi)^2\} - 1\big] - n^{1/2}(\sigma^{-2}\hat\sigma^2 - 1) \;\le\; \sigma^{-2}\hat\sigma^2 c = c + o_p(1).$$
Thus the Central Limit Theorem for the chi-squared distribution together with Condition (3.2) implies that $c = (2 + \tau^2)^{1/2}\Phi^{-1}(\alpha)$ yields a confidence set $\hat C_N$ with
$$\lim_{n\to\infty}\ \sup_{\xi\in\mathbf{R}^T,\ \sigma^2 > 0} \big|\mathrm{IP}\{\xi \in \hat C_N\} - \alpha\big| = 0,$$
where $\Phi^{-1}(\alpha)$ denotes the $\alpha$-th quantile of $N(0,1)$. Moreover,
$$\lim_{n\to\infty}\ \sup_{\xi\in\mathbf{R}^T} \mathrm{IP}\big\{|L(\hat C_N, \xi) - 4\sigma^2| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0.$$
In what follows we shall see that confidence sets centered at a good modulation estimator $\hat f X$ dominate the naive confidence set $\hat C_N$ in terms of the loss $L(\hat C, \xi)$.

To construct these confidence sets, we first determine the asymptotic distribution of
$$\hat d = \hat d_n := n^{1/2}\big[L(\hat f X, \xi) - \hat R_C(\hat f)\big].$$
This difference compares the loss of $\hat f X$ with an estimate for the expected loss of $\hat f X$.
THEOREM 3.1. Under Conditions (3.2) and (3.3),
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} m\big[\mathcal{L}(\hat d),\ N(0, \varsigma^2)\big] = 0$$
for arbitrary $c, \sigma^2 > 0$, where
$$\varsigma^2 = \varsigma_n^2(\xi, \sigma^2) := 2\sigma^4\,\mathrm{ave}[(2\tilde f - 1)^2] + \tau^2\sigma^4\,[\mathrm{ave}(2\tilde f - 1)]^2 + 4\sigma^2\,\mathrm{ave}[\xi^2(1 - \tilde f)^2].$$
A consistent estimator $\hat\varsigma^2 = \hat\varsigma_n^2$ of $\varsigma^2$ is obtained by substituting $\hat\sigma^2$ for $\sigma^2$, $\hat f$ for $\tilde f$ and $X^2 - \hat\sigma^2$ for $\xi^2$ in the expression for $\varsigma^2$. The implied estimator of the approximating normal distribution $N(0, \varsigma^2)$ is $N(0, \hat\varsigma^2)$. This leads to the following definition of a confidence ball for $\xi$ that is centered at the modulation estimator $\hat f X$:
$$\hat C := \big\{\xi \in \mathbf{R}^T : L(\hat f X, \xi) \le \hat R_C(\hat f) + n^{-1/2}\hat\varsigma\,\Phi^{-1}(\alpha)\big\}.$$
The intended coverage probability of $\hat C$ is $\alpha$. The next theorem establishes asymptotic properties of this confidence set construction. Beran (1994) treats in detail the example where $\hat f X$ is the James-Stein estimator. That situation is much easier to analyze than the general case.
THEOREM 3.2. Under the conditions of Theorem 3.1, for arbitrary $c, \sigma^2 > 0$,
$$\lim_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|L(\hat C, \xi) - 4R(\tilde f, \xi, \sigma^2)| \ge K n^{-1/2}\big\} = 0$$
and
$$\lim_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|\hat r^2 - R(\tilde f, \xi, \sigma^2)| \ge K n^{-1/2}\big\} = 0.$$
Moreover, $\hat\varsigma^2$ is consistent in that
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{|\hat\varsigma^2 - \varsigma^2| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0.$$
If
$$\liminf_{n\to\infty}\ \inf_{\mathrm{ave}(\xi^2)\le c} \varsigma_n^2(\xi, \sigma^2) > 0, \qquad (3.4)$$
then
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \big|\mathrm{IP}\{\xi \in \hat C\} - \alpha\big| = 0.$$
A sufficient condition for (3.4) is the following: For every $n$, $\mathcal{F} = \mathcal{F}_n$ is such that
$$1\{f \ge c\}\, f \in \mathcal{F} \quad \text{for all } f \in \mathcal{F} \text{ and } c \in [0,1]. \qquad (3.5)$$
Condition (3.4) ensures that $\mathcal{L}(\hat d)$ does not approach a degenerate distribution. Note that Condition (3.5) is satisfied in Examples 1-4. When $R(\tilde f, \xi, \sigma^2) = O(n^{-1/2})$, our confidence ball has loss $L(\hat C, \xi) = O_p(n^{-1/2})$. In fact, according to Theorem 2.1 of Li (1989) this is the smallest possible order of magnitude for a Euclidean confidence ball, unless one imposes further constraints on the signal. The result of Theorem 3.2 on asymptotic coverage of $\hat C$ may be compared with the lower bound in Theorem 3.2 of Li (1989).
A key step in the proof of Theorem 3.1 is that in the definition of $\hat d$ one may replace $\tilde f$ with $\hat f$. Instead of the normal approximation underlying $\hat C$, a bootstrap approximation of $H = H_n := \mathcal{L}(\hat d)$ that imitates the estimation of $\tilde f$ seems to be more reliable in moderate dimensions. Precisely, let $\hat H = \hat H_n$ be the conditional distribution (function) of $\hat d^*$ given $(X, \hat\sigma^2)$, where $\hat d^*$ is computed as $\hat d$ with the pair $(X^*, \hat\sigma^{*2})$ in place of $(X, \hat\sigma^2)$. More precisely, let $\hat\xi = \hat\xi(\cdot \mid X, \hat\sigma^2)$ be an estimator for $\xi$. Let $S_n^2$ be a random variable with a specified distribution depending only on $n$ such that
$$\lim_{n\to\infty} m\big[\mathcal{L}\{n^{1/2}(S_n^2 - 1)\},\ N(0, \tau^2)\big] = 0,$$
where $S_n^2$ and $(X, \hat\sigma^2)$ are independent. Then
$$\mathcal{L}(X^*, \hat\sigma^{*2} \mid X, \hat\sigma^2) \;=\; N(\hat\xi, \hat\sigma^2 I) \otimes \mathcal{L}(\hat\sigma^2 S_n^2 \mid X, \hat\sigma^2),$$
the product of the probability measures $N(\hat\xi, \hat\sigma^2 I)$ and $\mathcal{L}(\hat\sigma^2 S_n^2 \mid X, \hat\sigma^2)$. The resulting bootstrap confidence bound $\hat r_b(\alpha)$ for $L(\hat f X, \xi)$ is given by
$$\hat r_b^2(\alpha) = \hat R(\hat f) + n^{-1/2}\hat H^{-1}(\alpha).$$
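A rough Monte Carlo sketch of this bootstrap bound (ours) is given below; the pilot estimator $\hat\xi$ and the law of $S_n^2$ are left to the caller, since Theorem 3.3 below shows that naive choices such as $\hat\xi = X$ or $\hat\xi = \hat f X$ are not adequate here.

```python
import numpy as np

def bootstrap_radius_sq(x, sigma2_hat, xi_pilot, draw_S2, fit_modulator,
                        alpha=0.95, n_boot=500, rng=None):
    """Bootstrap upper bound r_b^2(alpha) = R_C-hat(f-hat) + n^{-1/2} * H-hat^{-1}(alpha)."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    f_hat = fit_modulator(x, sigma2_hat)
    d_star = np.empty(n_boot)
    for b in range(n_boot):
        x_star = xi_pilot + np.sqrt(sigma2_hat) * rng.normal(size=n)   # X* ~ N(xi-hat, sigma2_hat I)
        s2_star = sigma2_hat * draw_S2(rng)                            # sigma*^2 = sigma2_hat * S_n^2
        f_star = fit_modulator(x_star, s2_star)
        loss_star = np.mean((f_star * x_star - xi_pilot) ** 2)
        rc_star = np.mean(f_star ** 2 * s2_star + (1 - f_star) ** 2 * (x_star ** 2 - s2_star))
        d_star[b] = np.sqrt(n) * (loss_star - rc_star)                 # bootstrap version of d-hat
    rc_hat = np.mean(f_hat ** 2 * sigma2_hat + (1 - f_hat) ** 2 * (x ** 2 - sigma2_hat))
    return rc_hat + np.quantile(d_star, alpha) / np.sqrt(n)
```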
The last theorem of this section states conditions under which $\hat H$ is a consistent estimator for $H$. An interesting fact is that neither $\hat\xi = X$ nor $\hat\xi = \hat f X$ satisfies these conditions.

THEOREM 3.3. Under the assumptions of Theorem 3.1,
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{m(\hat H_n, H_n) > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0,$$
provided that
$$\hat f = \mathop{\mathrm{argmin}}_{f\in\mathcal{F}} R(f, \hat\xi, \hat\sigma^2) \quad \text{almost surely,} \qquad (3.6)$$
$$\limsup_{n\to\infty,\ K\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\{\mathrm{ave}(\hat\xi^2) > K\} = 0, \qquad (3.7)$$
$$\lim_{n\to\infty}\ \sup_{\mathrm{ave}(\xi^2)\le c} \mathrm{IP}\big\{\big|\mathrm{ave}[\hat\xi^2(1-\hat f)^2] - \mathrm{ave}[\xi^2(1-\tilde f)^2]\big| > \epsilon\big\} = 0 \qquad \forall\, \epsilon > 0. \qquad (3.8)$$
In particular, suppose that each $\mathcal{F}_n$ has the following property: For all