Constructive Asymptotic Equivalence of Density Estimation and Gaussian White Noise

Michael Nussbaum, Weierstrass Institute, Berlin
Jussi Klemelä, Rolf Nevanlinna Institute, University of Helsinki

Abstract

A recipe is provided for producing, from a sequence of procedures in the Gaussian regression model, an asymptotically equivalent sequence in the density estimation model with i.i.d. observations. The recipe is, roughly, to calculate square roots of normalised frequencies over certain intervals, add a small random distortion, and pretend that these are observations from a discrete Gaussian regression model.

Mathematics Subject Classifications: 62G07, 62B15, 62G20

Key Words: Nonparametric experiments, deficiency distance, Markov kernel, asymptotic minimax risk, curve estimation.

1 Introduction

In the first lecture notes of L. Le Cam from 1969 it is said: "En général une expérience est compliquée. Alors il faut l'approcher par une expérience plus simple." ("In general an experiment is complicated. So one has to approximate it by a simpler experiment.") So the purpose is to approximate an experiment by a simpler one. The simplicity of an experiment is not defined precisely, but in a simple experiment one should be able to find optimal estimators, or at least asymptotically optimal estimators. By an optimal estimator we mean either a minimax estimator or a Bayes estimator. Gaussian experiments are primary examples of simple experiments. Simplicity is also achieved by reducing the dimension of the observations, as in the classical case of finding sufficient statistics whose dimension does not depend on the number of observations. Approximations are especially useful when one is able to transform the optimal procedures of a simple experiment into optimal procedures of the more complicated experiment.

(Footnote: The research of Jussi Klemelä was financed by Sonderforschungsbereich 373 "Quantifikation und Simulation Ökonomischer Prozesse" and, to a smaller extent, by the Jenny and Antti Wihuri Foundation. The authors wish to thank G. Golubev for helpful discussions.)

The precise sense of "approximate" is given by Le Cam's $\Delta$-distance. When the $\Delta$-distance between two experiments vanishes asymptotically, we say that the two experiments are asymptotically equivalent. The known results about asymptotic equivalence of experiments include the asymptotic equivalence of Gaussian discrete regression and Gaussian signal recovery by Brown and Low (1996), the asymptotic equivalence of density estimation with i.i.d. observations and Gaussian signal recovery by Nussbaum (1996), the asymptotic equivalence of non-Gaussian and Gaussian regression by Grama and Nussbaum (1996), and the asymptotic equivalence of spectral density and regression estimation by Golubev and Nussbaum (1998). With the exception of Brown and Low (1996), the previous articles do not give an explicit recipe for transforming a sequence of procedures of a simple experiment into an asymptotically equivalent sequence for the more complex experiment. In this article we study density estimation with i.i.d. observations and give a construction of a Markov kernel which makes it possible to transform all procedures constructed for the Gaussian experiment into asymptotically equivalent procedures for the density experiment.

The basic idea might be described as "reduction to the Gaussian case by small distortions" (cp. Chapter 11.8 of Le Cam, 1986). Roughly speaking, the idea is as follows. Assume a simple model with a real-valued parameter in which a real-valued statistic $T_n$ is sufficient and asymptotically normal. Then, by small distortions involving some additional randomization, one smooths $T_n$ in such a way that the "randomized" statistic has a density. This density then should reasonably converge to a normal density, entailing total variation convergence of the laws. The recipe is then "take the sufficient statistic, randomize, and pretend the resulting data are Gaussian". Versions of such a theory (local and global) have been developed in parametric models by Le Cam (1986, Chapter 11.8) and by Müller (1981), and in a more indirect fashion in Shiryaev and Spokoiny (1993). In the framework of nonparametric asymptotic equivalence, a first result of this type was given by Brown and Low (1996b) for non-Gaussian regression. Our result pertains to the i.i.d. model, and takes the empirical distribution function as the sufficient statistic to start with.

Apart from the initial smoothing, our proof is totally different from that of Brown and Low; it is inspired by Müller (1981), building upon rates of convergence in the functional central limit theorem. We do not attain the optimal smoothness index 1/2, but our method has the potential of being applicable wherever improved functional CLTs hold.

Before stating the main results, we have to give some preliminary definitions.

Definition 1

(i) A measurable space $(\Omega, \mathcal{A})$ is called STANDARD BOREL if there is a measurable space $(\Omega_0, \mathcal{A}_0)$ such that $\Omega \in \mathcal{A}_0$, $\mathcal{A}$ is the trace of $\mathcal{A}_0$ on $\Omega$, i.e. $\mathcal{A} = \{A \cap \Omega : A \in \mathcal{A}_0\}$, and there is a metric on $\Omega_0$ such that $\Omega_0$ becomes a Polish space and $\mathcal{A}_0$ is the Borel sigma-algebra generated by the metric.

(ii) An experiment $E = (\Omega, \mathcal{A}, (P_\theta, \theta \in \Theta))$ is called POLISH if the measurable space $(\Omega, \mathcal{A})$ is standard Borel.

(iii) An experiment $E = (\Omega, \mathcal{A}, (P_\theta, \theta \in \Theta))$ is called DOMINATED if there exists a $\sigma$-finite measure $\mu$ on $(\Omega, \mathcal{A})$ such that $P_\theta \ll \mu$, $\theta \in \Theta$.

(iv) Given two measurable spaces $(\Omega_i, \mathcal{A}_i)$, $i = 0, 1$, a MARKOV KERNEL is a mapping $K : \Omega_0 \times \mathcal{A}_1 \to [0,1]$ such that

(a) $K(\omega, \cdot)$ is a probability measure on $\mathcal{A}_1$ for each $\omega \in \Omega_0$,

(b) $K(\cdot, A)$ is a measurable function on $(\Omega_0, \mathcal{A}_0)$ for each $A \in \mathcal{A}_1$.

(v) The TOTAL VARIATION DISTANCE between probability measures $P$ and $Q$ is
$$ \|P - Q\|_{TV} = \sup_{\|g\|_\infty \le 1} \int g \,(dP - dQ). $$

Given experiments $E_i = (\Omega_i, \mathcal{A}_i, (P_{\theta i}, \theta \in \Theta))$, $i = 0, 1$, the set of Markov kernels associated with the measurable spaces $(\Omega_i, \mathcal{A}_i)$ is denoted by $\mathcal{R}(E_0, E_1)$. Given $K \in \mathcal{R}(E_0, E_1)$ and a probability measure $P$ on $(\Omega_0, \mathcal{A}_0)$, $KP$ is a probability measure on $(\Omega_1, \mathcal{A}_1)$, defined by $KP(A) = \int K(\omega, A)\, dP(\omega)$.

Definition 2

Let $E_i = (\Omega_i, \mathcal{A}_i, (P_{\theta i}, \theta \in \Theta))$, $i = 0, 1$, be two Polish dominated experiments.

(i) The DEFICIENCY of $E_0$ with respect to $E_1$ is
$$ \delta(E_0, E_1) = \inf_{K \in \mathcal{R}(E_0, E_1)} \ \sup_{\theta \in \Theta} \ \| K P_{\theta 0} - P_{\theta 1} \|_{TV}. $$

(ii) The $\Delta$-pseudodistance of $E_0$ and $E_1$ is
$$ \Delta(E_0, E_1) = \max\{ \delta(E_0, E_1), \, \delta(E_1, E_0) \}. $$

We say that the sequence of experiments $E_{1n}$ is asymptotically less informative than the sequence $E_{0n}$ if $\delta(E_{0n}, E_{1n}) \to 0$. We say that the sequences of experiments $E_{0n}$ and $E_{1n}$ are asymptotically equivalent if $\Delta(E_{0n}, E_{1n}) \to 0$. Now we are ready to give a definition of central importance to this article. According to this definition, a sequence of Markov kernels is asymptotically sufficient if the kernels achieve the infimum in the definition of the deficiency and if the two sequences of experiments between which they act are asymptotically equivalent.

Definition 3

Let $E_{in} = (\Omega_{ni}, \mathcal{A}_{ni}, (P_{\theta, ni}, \theta \in \Theta))$, $i = 0, 1$, be two Polish dominated experiments. A sequence of Markov kernels $K_n \in \mathcal{R}(E_{0n}, E_{1n})$ is ASYMPTOTICALLY SUFFICIENT if

(i) $\lim_{n\to\infty} \sup_{\theta \in \Theta} \| K_n P_{\theta, n0} - P_{\theta, n1} \|_{TV} = 0$,

(ii) $\lim_{n\to\infty} \Delta(E_{0n}, E_{1n}) = 0$.

There are different ways of defining asymptotic sufficiency of a statistic or a sub-$\sigma$-field, stemming from Le Cam (1986). See for example Strasser (1985, Definition 81.3) or Laredo (1990, Corollary 1). One can define, for example, that a statistic is asymptotically sufficient for an experiment if there is an experiment, defined on the same probability space, which is asymptotically equivalent to the original experiment and for which the statistic is exactly sufficient. This case is contained in Definition 3.

The usefulness of a sequence of asymptotically sufficient Markov kernels $K_n$ lies in the following fact. If we have a sequence of statistics $T_{n1}$ for the experiments $E_{n1}$, that is, measurable maps defined on $(\Omega_{n1}, \mathcal{A}_{n1})$ with values in some other measurable space, then the sequence of statistics for the experiments $E_{n0}$ defined by
$$ T_{n0}(\omega_0) = \int T_{n1}(\omega_1) \, K_n(\omega_0, d\omega_1) $$
has the same asymptotic minimax risk and Bayes risk for bounded continuous loss functions as the sequence $T_{n1}$. Also, if we have a sequence of procedures $L_{n1}$ for the experiments $E_{n1}$, that is, Markov kernels defined on $(\Omega_{n1}, \mathcal{A}_{n1})$ with values in some other measurable space, then the sequence of procedures for the experiments $E_{n0}$ defined by
$$ L_{n0}(\omega_0, A) = \int L_{n1}(\omega_1, A) \, K_n(\omega_0, d\omega_1) $$
has the same asymptotic minimax risk and Bayes risk for bounded continuous loss functions as the sequence $L_{n1}$.
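To make the use of such a kernel concrete, here is a minimal Python sketch (not from the paper; `kernel_sampler` and `T_n1` are hypothetical user-supplied callables) that transports a statistic $T_{n1}$ back to the experiment $E_{n0}$ by Monte Carlo integration of the displayed formula.

```python
import numpy as np

def transport_statistic(omega0, kernel_sampler, T_n1, n_draws=100, rng=None):
    """Approximate T_n0(omega0) = integral of T_n1(omega1) K_n(omega0, d omega1)
    by averaging T_n1 over draws omega1 ~ K_n(omega0, .).
    kernel_sampler(omega0, rng) must return one draw from K_n(omega0, .)."""
    rng = np.random.default_rng() if rng is None else rng
    draws = [T_n1(kernel_sampler(omega0, rng)) for _ in range(n_draws)]
    return np.mean(draws, axis=0)
```

Alternatively one may use a single draw $\omega_1 \sim K_n(\omega_0, \cdot)$, which corresponds to transporting $L_{n1}$ as a randomized procedure.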

Define a parameter space of densities as follows. For $s \in \,]1,2]$ and $M > 0$, let $\Lambda_s(M)$ be the set of functions $f : [0,1] \to \mathbb{R}$ satisfying the condition
$$ |f(t) - f(t_0) - f'(t_0)(t - t_0)| \le M |t - t_0|^s. \tag{1} $$
For $\epsilon > 0$ define a set $\mathcal{F}_\epsilon$ as the set of densities on $[0,1]$ bounded below by $\epsilon$:
$$ \mathcal{F}_\epsilon = \Big\{ f : \int_0^1 f = 1, \ f(x) \ge \epsilon, \ x \in [0,1] \Big\}. $$
Define an a priori set, for given $s \in \,]1,2]$, $M > 0$, $\epsilon > 0$,
$$ \Sigma = \Sigma_{s,M,\epsilon} = \Lambda_s(M) \cap \mathcal{F}_\epsilon. $$
In Nussbaum (1996) it was shown that density estimation with i.i.d. observations and Gaussian signal recovery are asymptotically equivalent when the Hölder smoothness index satisfies $s > 1/2$, but here we will have to assume that $s > 3/2$.

Let $X_1, \dots, X_n$ be i.i.d. random variables with density function $f \in \Sigma$. Let $P_{fn}$ be the distribution of $(X_1, \dots, X_n)$ and
$$ E_{0n} = \big( [0,1]^n, \mathcal{B}^n_{[0,1]}, (P_{fn}, f \in \Sigma) \big). \tag{2} $$

A simple experiment which is asymptotically equivalent to the experiment $E_{0n}$ is the following experiment of discrete Gaussian regression. Let
$$ y_i = \sqrt{f(s_i)} + \tfrac{1}{2} \Big(\frac{N}{n}\Big)^{1/2} \xi_i, \qquad i = 1, \dots, N, \tag{3} $$
where the $\xi_i$ are i.i.d. $N(0,1)$, $s_i = (t_{i-1} + t_i)/2$, and $t_i = i/N$. Let
$$ Y_n = (y_i)_{i=1,\dots,N} $$
and let $Q_{fn}$ be the distribution of $Y_n$. Let $E_{1n}$ be the corresponding experiment,
$$ E_{1n} = \big( \mathbb{R}^N, \mathcal{B}^N_{\mathbb{R}}, (Q_{fn}, f \in \Sigma) \big). \tag{4} $$
The main theorem states, roughly, that the recipe for transforming density data into regression data without losing information asymptotically is to first calculate the relative frequencies over the intervals $]t_{i-1}, t_i]$, multiply them by $N$, add a certain small randomization, and take the square root of the result. Indeed, let
$$ w_i = \int \varphi_i \, d\big[\hat F_n (1 + n^{-1/2} z_0)\big] + n^{-1/2} \sigma_n z_i, \qquad i = 1, \dots, N, \tag{5} $$
where
$$ \varphi_i = N \, I_{]t_{i-1}, t_i]}, $$
$\hat F_n(t) = n^{-1} \sum_{i=1}^n I_{[0,t]}(X_i)$ is the empirical distribution function, $z_0, z_1, \dots, z_N$ are i.i.d. $N(0,1)$ independent of $X_1, \dots, X_n$, and $\sigma_n > 0$.

Theorem 1

Let $s > 3/2$ and $N = [\, n^{(1+\varepsilon_1)/(2s)} \,]$, where $0 < \varepsilon_1 < 2s/3 - 1$. The Markov kernel
$$ (X_1, \dots, X_n) \longmapsto \Big( \sqrt{\max\{w_i, 0\}} \Big)_{i=1,\dots,N} \tag{6} $$
from the density experiment $E_{0n}$ to the Gaussian experiment $E_{1n}$ is asymptotically sufficient, when we choose $\sigma_n$ in the definition of $w_i$ such that
$$ \sigma_n = o(1), \qquad \sigma_n^{-1} = o\Big( n^{-3(1+\varepsilon_1)/(4s)} \, n^{1/2} (\log n)^{-1} \Big). \tag{7} $$
A recipe almost similar to the one given in Theorem 1 was suggested by Donoho, Johnstone, Kerkyacharian, and Picard (1995, page 327).
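Purely as an illustration of the recipe (5)-(6), the following NumPy sketch maps an i.i.d. sample on $[0,1]$ to pseudo-observations of the regression model (3). The concrete values of $\varepsilon_1$ and $\sigma_n$ below are ad hoc choices of ours, not prescriptions from the paper; (7) only constrains their asymptotic order.

```python
import numpy as np

def density_to_regression(x, s=2.0, eps1=0.1, rng=None):
    """Map an i.i.d. sample x in [0,1]^n to pseudo-data for the Gaussian
    regression model (3), following the recipe (5)-(6)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    N = int(n ** ((1.0 + eps1) / (2.0 * s)))      # N = [n^{(1+eps1)/(2s)}]
    # relative frequencies over ]t_{i-1}, t_i], multiplied by N
    counts, _ = np.histogram(x, bins=np.linspace(0.0, 1.0, N + 1))
    freq = N * counts / n
    # small random distortions: a global factor and i.i.d. additive noise
    sigma_n = n ** -0.04                          # ad hoc choice; (7) fixes only the order
    w = freq * (1.0 + rng.standard_normal() / np.sqrt(n)) \
        + sigma_n * rng.standard_normal(N) / np.sqrt(n)
    # pretend that sqrt(max(w_i, 0)) are the observations y_i of model (3)
    return np.sqrt(np.maximum(w, 0.0))
```

The output can then be fed into any procedure designed for the model (3); by Theorem 1 the resulting procedure for the density experiment loses no information asymptotically.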


Remark. Suppose we observe a Poisson process $Y_n$ on $[0,1]$ with intensity function $nf$, $f \in \Sigma$. Let
$$ \tau_i = Y_n(i/n) - Y_n((i-1)/n), \qquad i = 1, \dots, n, $$
and let $\tilde F_n(t)$ be the partial sum process formed from the $\tau_i$,
$$ \tilde F_n(t) = n^{-1} \sum_{i=1}^{[nt]} \tau_i = n^{-1} Y_n([nt]/n). $$
Let $P_{fn}$ be the distribution of $\tilde F_n$ and
$$ \tilde E_{0n} = \big( D[0,1], \mathcal{A}, (P_{fn}, f \in \Sigma) \big), $$
where $D[0,1]$ is the space of functions on $[0,1]$ which are continuous from the right and have left-hand limits. Replacing the modified empirical distribution function $\hat F_n(1 + n^{-1/2} z_0)$ by $\tilde F_n$ in the definition of $w_i$ in (5), the statement of Theorem 1 will also hold for the experiment $\tilde E_{0n}$. $\Box$
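For the Poisson variant an analogous sketch can be written (again purely illustrative; the aggregation of the fine counts and the tuning of $\sigma_n$ are assumptions on our part). It uses the fact that $\int \varphi_i \, d\tilde F_n$ is just $N/n$ times the Poisson count in $]t_{i-1}, t_i]$, and that no factor $1 + n^{-1/2} z_0$ is needed.

```python
import numpy as np

def poisson_to_regression(counts_fine, n, N, sigma_n, rng=None):
    """counts_fine: Poisson counts on the n intervals ](i-1)/n, i/n].
    Returns sqrt(max(w_i, 0)) with dF_n-tilde in place of the modified e.d.f. in (5)."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.floor(np.linspace(0, n, N + 1)).astype(int)
    coarse = np.add.reduceat(counts_fine, edges[:-1])   # counts in ]t_{i-1}, t_i]
    w = N * coarse / n + sigma_n * rng.standard_normal(N) / np.sqrt(n)
    return np.sqrt(np.maximum(w, 0.0))
```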

In the course of proving Theorem 1 we will also give an asymptotically sufficient Markov kernel from the density experiment to the following heteroscedastic Gaussian experiment. Let
$$ \tilde y_i = f(s_i) + \Big(\frac{N}{n}\Big)^{1/2} f^{1/2}(s_i) \, \xi_i, \qquad i = 1, \dots, N, $$
where the $\xi_i$ are i.i.d. $N(0,1)$, $s_i = (t_{i-1} + t_i)/2$, and $t_i = i/N$. Let
$$ \tilde Y_n = (\tilde y_i)_{i=1,\dots,N} $$
and let $\tilde Q_{fn}$ be the distribution of $\tilde Y_n$. Let $\tilde E_{1n}$ be the corresponding experiment,
$$ \tilde E_{1n} = \big( \mathbb{R}^N, \mathcal{B}^N_{\mathbb{R}}, (\tilde Q_{fn}, f \in \Sigma) \big). \tag{8} $$
The heteroscedastic experiment $\tilde E_{1n}$ might be considered a bit more complicated than the homoscedastic experiment $E_{1n}$. The following theorem states that a Markov kernel similar to the one defined in Theorem 1, except that this time we do not take square roots, is asymptotically sufficient from the density experiment $E_{n0}$ to the heteroscedastic Gaussian experiment $\tilde E_{n1}$.


Theorem 2

Let $s > 3/2$ and $N = [\, n^{(1+\varepsilon_1)/(2s)} \,]$, where $0 < \varepsilon_1 < 2s/3 - 1$. Let $w_i$ be as defined in (5). The Markov kernel
$$ (X_1, \dots, X_n) \longmapsto (w_i)_{i=1,\dots,N} \tag{9} $$
from the density experiment $E_{0n}$ to the heteroscedastic Gaussian experiment $\tilde E_{1n}$ is asymptotically sufficient, when we choose $\sigma_n$ in the definition of $w_i$ as in (7).

1.1 Proof of Theorems 1 and 2

We will start by proving in Section 2 that if $\tilde K_n \in \mathcal{R}(E_{n0}, \tilde E_{n1})$ is the Markov kernel defined in (9), where $E_{n0}$ is the density experiment defined in (2) and $\tilde E_{n1}$ is the heteroscedastic Gaussian experiment defined in (8), then
$$ \lim_{n\to\infty} \sup_{f \in \Sigma} \| \tilde K_n P_{fn} - \tilde Q_{fn} \|_{TV} = 0. \tag{10} $$
Secondly, we will prove in Section 3 that if $\bar K_n \in \mathcal{R}(\tilde E_{n1}, E_{n1})$ is the Markov kernel defined by
$$ (\tilde y_i)_{i=1,\dots,N} \longmapsto \Big( \sqrt{\max\{\tilde y_i, 0\}} \Big)_{i=1,\dots,N}, $$
where $E_{n1}$ is the homoscedastic Gaussian experiment defined in (4), then
$$ \lim_{n\to\infty} \sup_{f \in \Sigma} \| \bar K_n \tilde Q_{fn} - Q_{fn} \|_{TV} = 0. \tag{11} $$
Thirdly, we will prove in Section 4 the following theorem, which states that the homoscedastic Gaussian experiment $E_{1n}$ is asymptotically equivalent to a continuous Gaussian white noise model.

Theorem 3

Let
$$ E_{2n} = \big( C[0,1], \mathcal{B}_{C[0,1]}, (Q^{(2)}_{fn}, f \in \Sigma) \big), $$
where $Q^{(2)}_{fn}$ is the distribution of the process
$$ dX_n(t) = \sqrt{f(t)}\, dt + \tfrac{1}{2} n^{-1/2} \, dW(t), \qquad t \in [0,1]. \tag{12} $$
The experiments $E_{1n}$ and $E_{2n}$ are asymptotically equivalent, that is,
$$ \Delta(E_{1n}, E_{2n}) \longrightarrow 0. $$


Theorem 1 is proved by noting that from (10) and (11) it follows that condition (i) of Definition 3 is satisfied with the Markov kernel defined in (6). Thus we have also proved that the homoscedastic Gaussian experiment is asymptotically less informative than the density experiment, that is, $\delta(E_{0n}, E_{1n}) \to 0$. It remains to prove that $\delta(E_{1n}, E_{0n}) \to 0$. From Theorem 3 we have that $\Delta(E_{1n}, E_{2n}) \to 0$. We know from Nussbaum (1996) that $\Delta(E_{2n}, E_{0n}) \to 0$. Thus we have also proved $\delta(E_{1n}, E_{0n}) \to 0$, and hence $\Delta(E_{0n}, E_{1n}) \to 0$. Thus condition (ii) of Definition 3 is also satisfied.

Theorem 2 is proved by noting that from (10) it follows that condition (i) of Definition 3 is satisfied with the Markov kernel defined in (9). Thus we have also proved that the heteroscedastic Gaussian experiment is asymptotically less informative than the density experiment, that is, $\delta(E_{0n}, \tilde E_{1n}) \to 0$. It remains to prove that $\delta(\tilde E_{1n}, E_{0n}) \to 0$. From (11) it follows that $\delta(\tilde E_{1n}, E_{1n}) \to 0$. From Theorem 3 we have that $\Delta(E_{1n}, E_{2n}) \to 0$. We know from Nussbaum (1996) that $\Delta(E_{2n}, E_{0n}) \to 0$. Thus we have also proved $\delta(\tilde E_{1n}, E_{0n}) \to 0$, and hence $\Delta(E_{0n}, \tilde E_{1n}) \to 0$. Thus condition (ii) of Definition 3 is also satisfied.

1.2 Inequalities for the Hellinger Distance

In the following, the $\Delta$-distance will mostly be estimated using the Hellinger distance. The Hellinger distance between probability measures $P$ and $Q$ is defined by
$$ H^2(P, Q) = \int \big( f_P^{1/2} - f_Q^{1/2} \big)^2, $$
where $f_P$ and $f_Q$ are probability densities of the distributions $P$ and $Q$ with respect to any dominating measure. Note that $\|P - Q\|_{TV} = \frac{1}{2} \int |f_P - f_Q|$. It holds that
$$ \tfrac{1}{2} H^2(P, Q) \le \|P - Q\|_{TV} \le H(P, Q). \tag{13} $$
When $P$ and $Q$ are product measures, $P = \prod_{i=1}^N p_i$, $Q = \prod_{i=1}^N q_i$, then it holds that
$$ H^2(P, Q) \le 2 \sum_{i=1}^N H^2(p_i, q_i). \tag{14} $$
The following inequality holds:
$$ H^2\big( N(\mu_1, \sigma^2), N(\mu_2, \sigma^2) \big) \le \frac{(\mu_1 - \mu_2)^2}{4\sigma^2}. \tag{15} $$
From Golubev and Nussbaum (1998),
$$ H^2\big( N(\mu, R_1), N(\mu, R_2) \big) \le \frac{1}{16} \, \|R_1^{-1}\| \, \|R_2^{-1}\| \sum_{k,l} \big( [R_1]_{k,l} - [R_2]_{k,l} \big)^2, \tag{16} $$
where
$$ \|R\| = \sup_{\|x\| \le 1} x' R x \le \mathrm{Const} \Big( \sum_{k,l} [R]_{k,l}^2 \Big)^{1/2}. $$
A special case of the previous inequality is
$$ H^2\big( N(\mu, \mathrm{diag}(\sigma_{1i}^2)), \, N(\mu, \mathrm{diag}(\sigma_{2i}^2)) \big) \le \mathrm{Const} \Big( \sum_{i=1}^N \sigma_{1i}^{-2} \Big)^{1/2} \Big( \sum_{i=1}^N \sigma_{2i}^{-2} \Big)^{1/2} \sum_{i=1}^N \big( \sigma_{1i}^2 - \sigma_{2i}^2 \big)^2. \tag{17} $$
Finally we need that
$$ H^2\Big( \mathcal{L}\big( f(t)\,dt + n^{-1/2} dW(t) \big), \ \mathcal{L}\big( g(t)\,dt + n^{-1/2} dW(t) \big) \Big) \le \mathrm{Const} \cdot n \int (f - g)^2. \tag{18} $$
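As a quick numerical sanity check (not part of the paper), the closed forms $H^2 = 2\bigl(1 - \exp(-(\mu_1-\mu_2)^2/(8\sigma^2))\bigr)$ and $\|P - Q\|_{TV} = 2\Phi\bigl(|\mu_1-\mu_2|/(2\sigma)\bigr) - 1$ for two Gaussians with equal variance can be used to verify (13) and (15):

```python
from math import erf, exp, sqrt

def hellinger2_gauss(mu1, mu2, sigma):
    """H^2 between N(mu1, sigma^2) and N(mu2, sigma^2), closed form."""
    return 2.0 * (1.0 - exp(-((mu1 - mu2) ** 2) / (8.0 * sigma ** 2)))

def tv_gauss(mu1, mu2, sigma):
    """||P - Q||_TV = (1/2) int |f_P - f_Q| for equal-variance Gaussians."""
    d = abs(mu1 - mu2) / (2.0 * sigma)
    Phi = 0.5 * (1.0 + erf(d / sqrt(2.0)))
    return 2.0 * Phi - 1.0

mu1, mu2, sigma = 0.0, 1.0, 1.0
H2, TV = hellinger2_gauss(mu1, mu2, sigma), tv_gauss(mu1, mu2, sigma)
assert 0.5 * H2 <= TV <= sqrt(H2)                    # inequality (13)
assert H2 <= (mu1 - mu2) ** 2 / (4.0 * sigma ** 2)   # inequality (15)
print(round(H2, 4), round(TV, 4))
```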

2 The Heteroscedastic Experiment

In this section we will prove (10). That is, we will prove constructively that the heteroscedastic Gaussian experiment $\tilde E_{1n}$ is asymptotically less informative than the density experiment $E_{0n}$. The proof takes as its starting point the weak convergence
$$ n^{1/2} (\hat F_n - F) \Rightarrow B \circ F, \tag{19} $$
where $B$ is a Brownian bridge process and $F(t) = \int_0^t f(s)\,ds$. From (19) one moves to the space of finite-dimensional sequences,
$$ \int \varphi_i \, d\hat F_n \ \approx \ \int \varphi_i \, dF + n^{-1/2} \int \varphi_i \, d(B \circ F), \qquad i = 1, \dots, N, \tag{20} $$
which suggests the accompanying heteroscedastic Gaussian experiment $\tilde E_{n1}$ defined in (8).

In Section 2.1 we will prove the weak convergence with a rate for certain finite-dimensional sequences, in Section 2.2 we will strengthen the weak convergence to convergence in the total variation norm, and in Section 2.3 we will move from (20) to the accompanying experiment $\tilde E_{n1}$ defined in (8).

2.1 Weak Convergence

We will modify the empirical distribution function $\hat F_n$ to
$$ \tilde F_n = \hat F_n (1 + n^{-1/2} z_0), $$
where $z_0 \sim N(0,1)$ is independent of $X_1, \dots, X_n$. Instead of (19) we now have
$$ n^{1/2} (\tilde F_n - F) \Rightarrow W \circ F, \tag{21} $$
where $W$ is a Wiener process, or Brownian motion, on $[0,1]$. We will start by proving a rate for the weak convergence of certain finite-dimensional sequences, to be defined in (25) and (26). For $x \in \mathbb{R}^N$, let the norm be
$$ \|x\|_{\mathrm{seq}} = \sup_{i=1,\dots,N} N^{-1} |x_i|. $$
The rate of the weak convergence will be given in terms of the bounded Lipschitz distance. The bounded Lipschitz distance between probability distributions $P$ and $Q$ on $\mathbb{R}^N$ is defined by
$$ \|P - Q\|_{BL} = \sup_{g \in BL} \int g \,(dP - dQ), \tag{22} $$
where
$$ BL = \big\{ g : \mathbb{R}^N \to \mathbb{R}, \ \|g\|_\infty \le 1, \ \|g\|_L \le 1 \big\}, \tag{23} $$
with
$$ \|g\|_L = \sup_{x \ne y} \frac{|g(x) - g(y)|}{\|x - y\|_{\mathrm{seq}}}. $$
The bounded Lipschitz distance metrizes weak convergence.

We can write
$$ n^{1/2} (\tilde F_n - F) - W \circ F = n^{1/2} (\hat F_n - F) - (W \circ F - z_0 F) + z_0 (\hat F_n - F) = n^{1/2} (\hat F_n - F) - B \circ F + z_0 (\hat F_n - F), $$

where $B$ is a Brownian bridge, defined in terms of the Wiener process $W$ as $B(t) = W(t) - t W(1)$ (the Wiener process being chosen so that $W(1) = z_0$). By Bretagnolle and Massart (1989),
$$ P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > n^{-1/2} (x + 12 \log n) \Big) \le 2 \exp(-x/6), $$
where $x > 0$ and, for $G$ in the Skorohod space $D[0,1]$, $\|G\|_\infty = \sup_{t \in [0,1]} |G(t)|$. Thus, choosing for example $x = 6 \log n$,
$$ P\Big( \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty > 36\, n^{-1/2} \log n \Big) \le P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty + |z_0| \, \|\hat F_n - F\|_\infty > 36\, n^{-1/2} \log n \Big) $$
$$ \le P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > 18\, n^{-1/2} \log n \Big) + P\Big( |z_0| \, \|\hat F_n - F\|_\infty > 18\, n^{-1/2} \log n \Big) \le \mathrm{Const}\, n^{-r}, \tag{24} $$
where $r > 1/2$ and $\mathrm{Const}$ does not depend on the unknown distribution $F$. Here we used the estimate
$$ P\Big( |z_0| \, \|\hat F_n - F\|_\infty > 18\, n^{-1/2} \log n \Big) \le P\Big( |z_0| \, \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > 9 \log n \Big) + P\Big( |z_0| \, \|B \circ F\|_\infty > 9 \log n \Big) $$
and then the facts that
$$ P\big( |z_0| > (9 \log n)^{1/2} \big) \le n^{-9/2} $$
and
$$ P\big( \|B \circ F\|_\infty > (9 \log n)^{1/2} \big) \le n^{-9/2}, $$
which follows from Talagrand (1988, Lemma 4). Now we will move to the space of sequences. Let

$$ \zeta_n = \Big( \int \varphi_i \, d\big[ n^{1/2} (\tilde F_n - F) \big] \Big)_{i=1,\dots,N}, \tag{25} $$
where, as before, $\varphi_i(t) = N \, I_{]t_{i-1}, t_i]}(t)$ and $t_i = i/N$. Let also
$$ \eta_n = \Big( \int \varphi_i \, d(W \circ F) \Big)_{i=1,\dots,N}. \tag{26} $$
For $Y = n^{1/2} (\tilde F_n - F)$ and for $Y = W \circ F$ we have
$$ \int \varphi_i(t) \, dY(t) = N \int_{t_{i-1}}^{t_i} dY(t) = N \big( Y(t_i) - Y(t_{i-1}) \big), $$
and thus
$$ \Big| \int \varphi_i(t) \, dY(t) \Big| \le 2 N \|Y\|_\infty. $$
Thus
$$ \| \zeta_n - \eta_n \|_{\mathrm{seq}} \le 2 \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty, $$
and thus, by (24),
$$ P\big( \| \zeta_n - \eta_n \|_{\mathrm{seq}} > \mathrm{Const}\, n^{-1/2} \log n \big) \le P\Big( \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty > \mathrm{Const}\, n^{-1/2} \log n \Big) \le \mathrm{Const}\, n^{-r}. $$
Now, for sufficiently large $n$,
$$ \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} \le \sup_{g \in BL} E \big| g(\zeta_n) - g(\eta_n) \big| \le \mathrm{Const}\, n^{-1/2} \log n + 2 P\big( \| \zeta_n - \eta_n \|_{\mathrm{seq}} > \mathrm{Const}\, n^{-1/2} \log n \big) $$
$$ \le \mathrm{Const}\, n^{-1/2} \log n + \mathrm{Const}\, n^{-r} \le \mathrm{Const}\, n^{-1/2} \log n. $$
Because $\mathrm{Const}$ does not depend on the density $f$, we have proved that
$$ \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} = O\big( n^{-1/2} \log n \big). \tag{27} $$
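As an informal illustration of (25)-(27) (this simulation is ours, not the paper's; we take $f \equiv 1$ so that $F(t) = t$), one can draw $\zeta_n$ from data and $\eta_n$ from its Gaussian limit, whose $i$-th coordinate is $N\bigl(W(F(t_i)) - W(F(t_{i-1}))\bigr) \sim N\bigl(0, N^2 (F(t_i) - F(t_{i-1}))\bigr)$ independently over $i$, and compare their coordinatewise variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, reps = 10_000, 20, 2000
t = np.linspace(0.0, 1.0, N + 1)          # bin edges t_i = i/N
F = t                                     # illustrative f = 1 on [0,1], so F(t) = t

def zeta(rng):
    x = rng.uniform(size=n)               # i.i.d. sample from f
    Fhat = np.searchsorted(np.sort(x), t, side="right") / n
    Ftilde = Fhat * (1.0 + rng.standard_normal() / np.sqrt(n))
    return N * np.sqrt(n) * (np.diff(Ftilde) - np.diff(F))

def eta(rng):                             # increments of W o F, scaled by N
    return N * np.sqrt(np.diff(F)) * rng.standard_normal(N)

Z = np.array([zeta(rng) for _ in range(reps)])
E = np.array([eta(rng) for _ in range(reps)])
print(Z.var(axis=0).round(1))             # both should be close to
print(E.var(axis=0).round(1))             # N^2 (F(t_i) - F(t_{i-1})) = N here
```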

2.2 Total Variation Convergence

In this section we will strengthen the convergence in (27) to convergence in the total variation distance, but the convergence in the total variation distance will be without a rate.

Let $\zeta_n$ be as defined in (25) and let $\eta_n$ be as defined in (26). We will add to $\zeta_n$ a certain small random disturbance $U_n$ in order to strengthen the convergence in (27) to convergence in the total variation norm. A similar kind of reasoning has been used by Müller (1981). Let
$$ U_n = (\sigma_n z_i)_{i=1,\dots,N}, $$
where the $z_i$ are i.i.d. $N(0,1)$ and $\sigma_n$ is defined in (7). Now
$$ \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n) \|_{TV} \le \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} \tag{28} $$
$$ \qquad\qquad + \| \mathcal{L}(\eta_n + U_n) - \mathcal{L}(\eta_n) \|_{TV}. \tag{29} $$
The term (28) can be characterised as a variance term and (29) as a bias term. To estimate (28),
$$ \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} = \sup_{\|g\|_\infty \le 1} \Big| E_{\zeta_n} E_{U_n} g(\zeta_n + U_n) - E_{\eta_n} E_{U_n} g(\eta_n + U_n) \Big|. \tag{30} $$
We shall establish that when $g : \mathbb{R}^N \to \mathbb{R}$ with $\|g\|_\infty \le 1$, then the function $x \mapsto E_{U_n} g(x + U_n)$ can be made a bounded Lipschitz function $\mathbb{R}^N \to \mathbb{R}$ with respect to the $\|\cdot\|_{\mathrm{seq}}$ norm.

Lemma 4

Let $g : \mathbb{R}^N \to \mathbb{R}$ be such that $\|g\|_\infty \le 1$. Then for $x, y \in \mathbb{R}^N$,
$$ \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| \le L_n \|x - y\|_{\mathrm{seq}}, $$
where
$$ L_n = 2^{-1/2} \sigma_n^{-1} N^{3/2}. $$

Proof. Using (13), (14), and (15),
$$ \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| \le \sup_{\|g\|_\infty \le 1} \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| = \| \mathcal{L}(x + U_n) - \mathcal{L}(y + U_n) \|_{TV} $$
$$ \le H\big( \mathcal{L}(x + U_n), \mathcal{L}(y + U_n) \big) \le \Big[ 2 \sum_{i=1}^N H^2\big( \mathcal{L}(x_i + \sigma_n z_i), \mathcal{L}(y_i + \sigma_n z_i) \big) \Big]^{1/2} $$
$$ \le 2^{1/2} \Big[ \sum_{i=1}^N \frac{(x_i - y_i)^2}{4 \sigma_n^2} \Big]^{1/2} \le 2^{-1/2} \sigma_n^{-1} N^{3/2} \|x - y\|_{\mathrm{seq}}. \qquad \Box $$

From Lemma 4 we have that the function $x \mapsto E_{U_n} g(x + U_n) / \max\{L_n, 1\}$ is in $BL$, where $BL$ was defined in (23). From (22), (27), and (30) we obtain an upper bound for (28):
$$ \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} \le \max\{L_n, 1\} \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} = \max\{L_n, 1\} \, O\big( n^{-1/2} \log n \big). \tag{31} $$
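A small numerical illustration of Lemma 4 (our own check, with an arbitrary discontinuous test function $g$ and Monte Carlo in place of the exact expectation over $U_n$): convolving a bounded $g$ with the noise $U_n$ yields a function that is Lipschitz in $\|\cdot\|_{\mathrm{seq}}$ with constant at most $L_n = 2^{-1/2} \sigma_n^{-1} N^{3/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma_n = 5, 0.5
L_n = N ** 1.5 / (sigma_n * np.sqrt(2.0))        # Lipschitz constant of Lemma 4

def g(v):                                        # bounded (|g| <= 1) but discontinuous
    return np.sign(np.sum(v, axis=-1))

def smoothed_g(x, m=200_000):                    # x -> E_{U_n} g(x + U_n), by Monte Carlo
    U = sigma_n * rng.standard_normal((m, N))
    return np.mean(g(x + U))

x = rng.standard_normal(N)
y = x + 0.01 * rng.standard_normal(N)
lhs = abs(smoothed_g(x) - smoothed_g(y))
seq_norm = np.max(np.abs(x - y)) / N             # ||x - y||_seq = max_i N^{-1}|x_i - y_i|
print(lhs, L_n * seq_norm)                       # lhs <= L_n * ||x - y||_seq, up to MC error
```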
