Constructive Asymptotic Equivalence of Density Estimation and Gaussian White Noise

Michael Nussbaum, Weierstrass Institute, Berlin
Jussi Klemelä, Rolf Nevanlinna Institute, University of Helsinki

Abstract

A recipe is provided for producing, from a sequence of procedures in the Gaussian regression model, an asymptotically equivalent sequence in the density estimation model with i.i.d. observations. The recipe is, roughly, to calculate square roots of normalised frequencies over certain intervals, add a small random distortion, and pretend that these are observations from a discrete Gaussian regression model.

Mathematics Subject Classifications: 62G07, 62B15, 62G20

Key Words: Nonparametric experiments, deficiency distance, Markov kernel, asymptotic minimax risk, curve estimation.

1 Introduction

In the first lecture notes of L. Le Cam from 1969 it is said: "En général une expérience est compliquée. Alors il faut l'approcher par une expérience plus simple." ("In general an experiment is complicated. So one has to approximate it by a simpler experiment.") So the purpose is to approximate an experiment by a simpler one. The simplicity of an experiment is not defined precisely, but in a simple experiment one should be able to find optimal estimators, or at least asymptotically optimal estimators. By an optimal estimator we mean either a minimax estimator or a Bayes estimator. Gaussian experiments are primary examples of simple experiments. Simplicity is also achieved by reducing the dimension of the observations, as in the classical case of finding sufficient statistics whose dimension does not depend on the number of observations. Approximations are especially useful when one is able to transform the optimal procedures of a simple experiment into optimal procedures of the more complicated experiment.

(Footnote: The research of Jussi Klemelä was financed by Sonderforschungsbereich 373 "Quantifikation und Simulation Ökonomischer Prozesse" and, to a smaller extent, by the Jenny and Antti Wihuri Foundation. The authors wish to thank G. Golubev for helpful discussions.)

The precise sense of "approximate" is given by Le Cam's $\Delta$-distance. When the $\Delta$-distance between two experiments vanishes asymptotically, we say that the two experiments are asymptotically equivalent. The known results about asymptotic equivalence of experiments include the asymptotic equivalence of Gaussian discrete regression and Gaussian signal recovery by Brown and Low (1996), the asymptotic equivalence of density estimation with i.i.d. observations and Gaussian signal recovery by Nussbaum (1996), the asymptotic equivalence of non-Gaussian and Gaussian regression by Grama and Nussbaum (1996), and the asymptotic equivalence of spectral density and regression estimation by Golubev and Nussbaum (1998). With the exception of Brown and Low (1996), the previous articles do not give an explicit recipe for transforming a sequence of procedures of a simple experiment into an asymptotically equivalent sequence for the more complex experiment. In this article we study density estimation with i.i.d. observations and give a construction of a Markov kernel which makes it possible to transform all procedures constructed for the Gaussian experiment into asymptotically equivalent procedures for the density experiment.

The basic idea might be described as "reduction to the Gaussian case by small distortions" (cp. Chapter 11.8 of Le Cam, 1986). Roughly speaking, the idea is as follows. Assume a simple model with a real-valued parameter in which a real-valued statistic $T_n$ is sufficient and asymptotically normal. Then, by small distortions involving some additional randomization, one smooths $T_n$ in such a way that the "randomized" statistic has a density. This density then should reasonably converge to a normal density, entailing total variation convergence of the laws. The recipe is then "take the sufficient statistic, randomize, and pretend the resulting data are Gaussian". Versions of such a theory (local and global) have been developed in parametric models by Le Cam (1986, Chapter 11.8) and by Müller (1981), and in a more indirect fashion in Shiryaev and Spokoiny (1993). In the framework of nonparametric asymptotic equivalence, a first result of this type was given by Brown and Low (1996b) for non-Gaussian regression. Our result pertains to the i.i.d. model, and takes the empirical distribution function as the sufficient statistic to start with.

Apart from the initial smoothing, our proof is totally different from that of Brown and Low; it is inspired by Müller (1981), building upon rates of convergence in the functional central limit theorem. We do not attain the optimal smoothness index 1/2, but our method has the potential of being applicable wherever improved functional CLTs hold.

Before stating the main results, we have to give some preliminary definitions.

Definition 1

(i) A measurable space $(\Omega, \mathcal{A})$ is called STANDARD BOREL if there is a measurable space $(\Omega_0, \mathcal{A}_0)$ such that $\Omega \in \mathcal{A}_0$, $\mathcal{A}$ is the trace of $\mathcal{A}_0$ on $\Omega$, i.e. $\mathcal{A} = \{A \cap \Omega : A \in \mathcal{A}_0\}$, and there is a metric on $\Omega_0$ such that $\Omega_0$ becomes a Polish space and $\mathcal{A}_0$ is the Borel sigma-algebra generated by the metric.

(ii) An experiment $E = (\Omega, \mathcal{A}, (P_\theta, \theta \in \Theta))$ is called POLISH if the measurable space $(\Omega, \mathcal{A})$ is standard Borel.

(iii) An experiment $E = (\Omega, \mathcal{A}, (P_\theta, \theta \in \Theta))$ is called DOMINATED if there exists a $\sigma$-finite measure $\mu$ on $(\Omega, \mathcal{A})$ such that $P_\theta \ll \mu$, $\theta \in \Theta$.

(iv) Given two measurable spaces $(\Omega_i, \mathcal{A}_i)$, $i = 0, 1$, a MARKOV KERNEL is a mapping $K : \Omega_0 \times \mathcal{A}_1 \to [0,1]$ such that

(a) $K(\omega, \cdot)$ is a probability measure on $\mathcal{A}_1$ for each $\omega \in \Omega_0$,

(b) $K(\cdot, A)$ is a measurable function on $(\Omega_0, \mathcal{A}_0)$ for each $A \in \mathcal{A}_1$.

(v) The TOTAL VARIATION DISTANCE between probability measures $P$ and $Q$ is
$$ \|P - Q\|_{TV} = \sup_{\|g\|_\infty \le 1} \int g \,(dP - dQ). $$

Given experiments $E_i = (\Omega_i, \mathcal{A}_i, (P_{\theta i}, \theta \in \Theta))$, $i = 0, 1$, the set of Markov kernels associated with the measurable spaces $(\Omega_i, \mathcal{A}_i)$ is denoted by $\mathcal{R}(E_0, E_1)$. Given $K \in \mathcal{R}(E_0, E_1)$ and a probability measure $P$ on $(\Omega_0, \mathcal{A}_0)$, $KP$ is a probability measure on $(\Omega_1, \mathcal{A}_1)$, defined by $KP(A) = \int K(\omega, A)\, dP(\omega)$.

Definition 2

Let $E_i = (\Omega_i, \mathcal{A}_i, (P_{\theta i}, \theta \in \Theta))$, $i = 0, 1$, be two Polish dominated experiments.

(i) The DEFICIENCY of $E_0$ with respect to $E_1$ is
$$ \delta(E_0, E_1) = \inf_{K \in \mathcal{R}(E_0, E_1)} \ \sup_{\theta \in \Theta} \ \| K P_{\theta 0} - P_{\theta 1} \|_{TV}. $$

(ii) The $\Delta$-pseudodistance of $E_0$ and $E_1$ is
$$ \Delta(E_0, E_1) = \max\{ \delta(E_0, E_1), \, \delta(E_1, E_0) \}. $$

We say that the sequence of experiments $E_{1n}$ is asymptotically less informative than the sequence $E_{0n}$ if $\delta(E_{0n}, E_{1n}) \to 0$. We say that the sequences of experiments $E_{0n}$ and $E_{1n}$ are asymptotically equivalent if $\Delta(E_{0n}, E_{1n}) \to 0$. Now we are ready to give a definition of central importance to this article. According to this definition, a sequence of Markov kernels is asymptotically sufficient if the kernels achieve the infimum in the definition of the deficiency and if the two sequences of experiments between which they act are asymptotically equivalent.

Definition 3

Let $E_{in} = (\Omega_{ni}, \mathcal{A}_{ni}, (P_{\theta, ni}, \theta \in \Theta))$, $i = 0, 1$, be two Polish dominated experiments. A sequence of Markov kernels $K_n \in \mathcal{R}(E_{0n}, E_{1n})$ is ASYMPTOTICALLY SUFFICIENT if

(i) $\lim_{n\to\infty} \sup_{\theta \in \Theta} \| K_n P_{\theta, n0} - P_{\theta, n1} \|_{TV} = 0$,

(ii) $\lim_{n\to\infty} \Delta(E_{0n}, E_{1n}) = 0$.

There are different ways of defining asymptotic sufficiency of a statistic or a sub-$\sigma$-field, stemming from Le Cam (1986). See for example Strasser (1985, Definition 81.3) or Laredo (1990, Corollary 1). One can define, for example, that a statistic is asymptotically sufficient for an experiment if there is an experiment, defined on the same probability space, which is asymptotically equivalent to the original experiment and for which the statistic is exactly sufficient. This case is contained in Definition 3.

The usefulness of a sequence of asymptotically sufficient Markov kernels $K_n$ lies in the following fact. If we have a sequence of statistics $T_{n1}$ for the experiments $E_{n1}$, that is, measurable maps defined on $(\Omega_{n1}, \mathcal{A}_{n1})$ with values in some other measurable space, then the sequence of statistics for the experiments $E_{n0}$ defined by
$$ T_{n0}(\omega_0) = \int T_{n1}(\omega_1) \, K_n(\omega_0, d\omega_1) $$
has the same asymptotic minimax risk and Bayes risk for bounded continuous loss functions as the sequence $T_{n1}$. Also, if we have a sequence of procedures $L_{n1}$ for the experiments $E_{n1}$, that is, Markov kernels defined on $(\Omega_{n1}, \mathcal{A}_{n1})$ with values in some other measurable space, then the sequence of procedures for the experiments $E_{n0}$ defined by
$$ L_{n0}(\omega_0, A) = \int L_{n1}(\omega_1, A) \, K_n(\omega_0, d\omega_1) $$
has the same asymptotic minimax risk and Bayes risk for bounded continuous loss functions as the sequence $L_{n1}$.
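To make the use of such a kernel concrete, here is a minimal Python sketch (not from the paper; `kernel_sampler` and `T_n1` are hypothetical user-supplied callables) that transports a statistic $T_{n1}$ back to the experiment $E_{n0}$ by Monte Carlo integration of the displayed formula.

```python
import numpy as np

def transport_statistic(omega0, kernel_sampler, T_n1, n_draws=100, rng=None):
    """Approximate T_n0(omega0) = integral of T_n1(omega1) K_n(omega0, d omega1)
    by averaging T_n1 over draws omega1 ~ K_n(omega0, .).
    kernel_sampler(omega0, rng) must return one draw from K_n(omega0, .)."""
    rng = np.random.default_rng() if rng is None else rng
    draws = [T_n1(kernel_sampler(omega0, rng)) for _ in range(n_draws)]
    return np.mean(draws, axis=0)
```

Alternatively one may use a single draw $\omega_1 \sim K_n(\omega_0, \cdot)$, which corresponds to transporting $L_{n1}$ as a randomized procedure.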

Define a parameter space of densities as follows. For $s \in \,]1,2]$ and $M > 0$, let $\Lambda_s(M)$ be the set of functions $f : [0,1] \to \mathbb{R}$ satisfying the condition
$$ |f(t) - f(t_0) - f'(t_0)(t - t_0)| \le M |t - t_0|^s. \tag{1} $$
For $\epsilon > 0$ define a set $\mathcal{F}_\epsilon$ as the set of densities on $[0,1]$ bounded below by $\epsilon$:
$$ \mathcal{F}_\epsilon = \Big\{ f : \int_0^1 f = 1, \ f(x) \ge \epsilon, \ x \in [0,1] \Big\}. $$
Define an a priori set, for given $s \in \,]1,2]$, $M > 0$, $\epsilon > 0$,
$$ \Sigma = \Sigma_{s,M,\epsilon} = \Lambda_s(M) \cap \mathcal{F}_\epsilon. $$
In Nussbaum (1996) it was shown that density estimation with i.i.d. observations and Gaussian signal recovery are asymptotically equivalent when the Hölder smoothness index satisfies $s > 1/2$, but here we will have to assume that $s > 3/2$.

Let $X_1, \dots, X_n$ be i.i.d. random variables with density function $f \in \Sigma$. Let $P_{fn}$ be the distribution of $(X_1, \dots, X_n)$ and
$$ E_{0n} = \big( [0,1]^n, \mathcal{B}^n_{[0,1]}, (P_{fn}, f \in \Sigma) \big). \tag{2} $$

A simple experiment which is asymptotically equivalent to the experiment $E_{0n}$ is the following experiment of discrete Gaussian regression. Let
$$ y_i = \sqrt{f(s_i)} + \tfrac{1}{2} \Big(\frac{N}{n}\Big)^{1/2} \xi_i, \qquad i = 1, \dots, N, \tag{3} $$
where the $\xi_i$ are i.i.d. $N(0,1)$, $s_i = (t_{i-1} + t_i)/2$, and $t_i = i/N$. Let
$$ Y_n = (y_i)_{i=1,\dots,N} $$
and let $Q_{fn}$ be the distribution of $Y_n$. Let $E_{1n}$ be the corresponding experiment,
$$ E_{1n} = \big( \mathbb{R}^N, \mathcal{B}^N_{\mathbb{R}}, (Q_{fn}, f \in \Sigma) \big). \tag{4} $$
The main theorem states, roughly, that the recipe for transforming density data into regression data without losing information asymptotically is to first calculate the relative frequencies over the intervals $]t_{i-1}, t_i]$, multiply them by $N$, add a certain small randomization, and take the square root of the result. Indeed, let
$$ w_i = \int \varphi_i \, d\big[\hat F_n (1 + n^{-1/2} z_0)\big] + n^{-1/2} \sigma_n z_i, \qquad i = 1, \dots, N, \tag{5} $$
where
$$ \varphi_i = N \, I_{]t_{i-1}, t_i]}, $$
$\hat F_n(t) = n^{-1} \sum_{i=1}^n I_{[0,t]}(X_i)$ is the empirical distribution function, $z_0, z_1, \dots, z_N$ are i.i.d. $N(0,1)$ independent of $X_1, \dots, X_n$, and $\sigma_n > 0$.

Theorem 1

Let $s > 3/2$ and $N = [\, n^{(1+\varepsilon_1)/(2s)} \,]$, where $0 < \varepsilon_1 < 2s/3 - 1$. The Markov kernel
$$ (X_1, \dots, X_n) \longmapsto \Big( \sqrt{\max\{w_i, 0\}} \Big)_{i=1,\dots,N} \tag{6} $$
from the density experiment $E_{0n}$ to the Gaussian experiment $E_{1n}$ is asymptotically sufficient, when we choose $\sigma_n$ in the definition of $w_i$ such that
$$ \sigma_n = o(1), \qquad \sigma_n^{-1} = o\Big( n^{-3(1+\varepsilon_1)/(4s)} \, n^{1/2} (\log n)^{-1} \Big). \tag{7} $$
A recipe almost similar to the one given in Theorem 1 was suggested by Donoho, Johnstone, Kerkyacharian, and Picard (1995, page 327).
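Purely as an illustration of the recipe (5)-(6), the following NumPy sketch maps an i.i.d. sample on $[0,1]$ to pseudo-observations of the regression model (3). The concrete values of $\varepsilon_1$ and $\sigma_n$ below are ad hoc choices of ours, not prescriptions from the paper; (7) only constrains their asymptotic order.

```python
import numpy as np

def density_to_regression(x, s=2.0, eps1=0.1, rng=None):
    """Map an i.i.d. sample x in [0,1]^n to pseudo-data for the Gaussian
    regression model (3), following the recipe (5)-(6)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    N = int(n ** ((1.0 + eps1) / (2.0 * s)))      # N = [n^{(1+eps1)/(2s)}]
    # relative frequencies over ]t_{i-1}, t_i], multiplied by N
    counts, _ = np.histogram(x, bins=np.linspace(0.0, 1.0, N + 1))
    freq = N * counts / n
    # small random distortions: a global factor and i.i.d. additive noise
    sigma_n = n ** -0.04                          # ad hoc choice; (7) fixes only the order
    w = freq * (1.0 + rng.standard_normal() / np.sqrt(n)) \
        + sigma_n * rng.standard_normal(N) / np.sqrt(n)
    # pretend that sqrt(max(w_i, 0)) are the observations y_i of model (3)
    return np.sqrt(np.maximum(w, 0.0))
```

The output can then be fed into any procedure designed for the model (3); by Theorem 1 the resulting procedure for the density experiment loses no information asymptotically.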


Remark. Suppose we observe a Poisson process $Y_n$ on $[0,1]$ with intensity function $nf$, $f \in \Sigma$. Let
$$ \tau_i = Y_n(i/n) - Y_n((i-1)/n), \qquad i = 1, \dots, n, $$
and let $\tilde F_n(t)$ be the partial sum process formed from the $\tau_i$,
$$ \tilde F_n(t) = n^{-1} \sum_{i=1}^{[nt]} \tau_i = n^{-1} Y_n([nt]/n). $$
Let $P_{fn}$ be the distribution of $\tilde F_n$ and
$$ \tilde E_{0n} = \big( D[0,1], \mathcal{A}, (P_{fn}, f \in \Sigma) \big), $$
where $D[0,1]$ is the space of functions on $[0,1]$ which are continuous from the right and have left-hand limits. Replacing the modified empirical distribution function $\hat F_n(1 + n^{-1/2} z_0)$ by $\tilde F_n$ in the definition of $w_i$ in (5), the statement of Theorem 1 will also hold for the experiment $\tilde E_{0n}$. $\Box$
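For the Poisson variant an analogous sketch can be written (again purely illustrative; the aggregation of the fine counts and the tuning of $\sigma_n$ are assumptions on our part). It uses the fact that $\int \varphi_i \, d\tilde F_n$ is just $N/n$ times the Poisson count in $]t_{i-1}, t_i]$, and that no factor $1 + n^{-1/2} z_0$ is needed.

```python
import numpy as np

def poisson_to_regression(counts_fine, n, N, sigma_n, rng=None):
    """counts_fine: Poisson counts on the n intervals ](i-1)/n, i/n].
    Returns sqrt(max(w_i, 0)) with dF_n-tilde in place of the modified e.d.f. in (5)."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.floor(np.linspace(0, n, N + 1)).astype(int)
    coarse = np.add.reduceat(counts_fine, edges[:-1])   # counts in ]t_{i-1}, t_i]
    w = N * coarse / n + sigma_n * rng.standard_normal(N) / np.sqrt(n)
    return np.sqrt(np.maximum(w, 0.0))
```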

In the course of proving Theorem 1 we will also give an asymptotically sufficient Markov kernel from the density experiment to the following heteroscedastic Gaussian experiment. Let
$$ \tilde y_i = f(s_i) + \Big(\frac{N}{n}\Big)^{1/2} f^{1/2}(s_i) \, \xi_i, \qquad i = 1, \dots, N, $$
where the $\xi_i$ are i.i.d. $N(0,1)$, $s_i = (t_{i-1} + t_i)/2$, and $t_i = i/N$. Let
$$ \tilde Y_n = (\tilde y_i)_{i=1,\dots,N} $$
and let $\tilde Q_{fn}$ be the distribution of $\tilde Y_n$. Let $\tilde E_{1n}$ be the corresponding experiment,
$$ \tilde E_{1n} = \big( \mathbb{R}^N, \mathcal{B}^N_{\mathbb{R}}, (\tilde Q_{fn}, f \in \Sigma) \big). \tag{8} $$
The heteroscedastic experiment $\tilde E_{1n}$ might be considered a bit more complicated than the homoscedastic experiment $E_{1n}$. The following theorem states that a Markov kernel similar to the one defined in Theorem 1, except that this time we do not take square roots, is asymptotically sufficient from the density experiment $E_{n0}$ to the heteroscedastic Gaussian experiment $\tilde E_{n1}$.


Theorem 2

Let $s > 3/2$ and $N = [\, n^{(1+\varepsilon_1)/(2s)} \,]$, where $0 < \varepsilon_1 < 2s/3 - 1$. Let $w_i$ be as defined in (5). The Markov kernel
$$ (X_1, \dots, X_n) \longmapsto (w_i)_{i=1,\dots,N} \tag{9} $$
from the density experiment $E_{0n}$ to the heteroscedastic Gaussian experiment $\tilde E_{1n}$ is asymptotically sufficient, when we choose $\sigma_n$ in the definition of $w_i$ as in (7).

1.1 Proof of Theorems 1 and 2

We will start by proving in Section 2 that if $\tilde K_n \in \mathcal{R}(E_{n0}, \tilde E_{n1})$ is the Markov kernel defined in (9), where $E_{n0}$ is the density experiment defined in (2) and $\tilde E_{n1}$ is the heteroscedastic Gaussian experiment defined in (8), then
$$ \lim_{n\to\infty} \sup_{f \in \Sigma} \| \tilde K_n P_{fn} - \tilde Q_{fn} \|_{TV} = 0. \tag{10} $$
Secondly, we will prove in Section 3 that if $\bar K_n \in \mathcal{R}(\tilde E_{n1}, E_{n1})$ is the Markov kernel defined by
$$ (\tilde y_i)_{i=1,\dots,N} \longmapsto \Big( \sqrt{\max\{\tilde y_i, 0\}} \Big)_{i=1,\dots,N}, $$
where $E_{n1}$ is the homoscedastic Gaussian experiment defined in (4), then
$$ \lim_{n\to\infty} \sup_{f \in \Sigma} \| \bar K_n \tilde Q_{fn} - Q_{fn} \|_{TV} = 0. \tag{11} $$
Thirdly, we will prove in Section 4 the following theorem, which states that the homoscedastic Gaussian experiment $E_{1n}$ is asymptotically equivalent to a continuous Gaussian white noise model.

Theorem 3

Let
$$ E_{2n} = \big( C[0,1], \mathcal{B}_{C[0,1]}, (Q^{(2)}_{fn}, f \in \Sigma) \big), $$
where $Q^{(2)}_{fn}$ is the distribution of the process
$$ dX_n(t) = \sqrt{f(t)}\, dt + \tfrac{1}{2} n^{-1/2} \, dW(t), \qquad t \in [0,1]. \tag{12} $$
The experiments $E_{1n}$ and $E_{2n}$ are asymptotically equivalent, that is,
$$ \Delta(E_{1n}, E_{2n}) \longrightarrow 0. $$


Theorem 1 is proved by noting that from (10) and (11) it follows that condition (i) of Definition 3 is satisfied with the Markov kernel defined in (6). Thus we have also proved that the homoscedastic Gaussian experiment is asymptotically less informative than the density experiment, that is, $\delta(E_{0n}, E_{1n}) \to 0$. It remains to prove that $\delta(E_{1n}, E_{0n}) \to 0$. From Theorem 3 we have that $\Delta(E_{1n}, E_{2n}) \to 0$. We know from Nussbaum (1996) that $\Delta(E_{2n}, E_{0n}) \to 0$. Thus we have also proved $\delta(E_{1n}, E_{0n}) \to 0$, and hence $\Delta(E_{0n}, E_{1n}) \to 0$. Thus condition (ii) of Definition 3 is also satisfied.

Theorem 2 is proved by noting that from (10) it follows that condition (i) of Definition 3 is satisfied with the Markov kernel defined in (9). Thus we have also proved that the heteroscedastic Gaussian experiment is asymptotically less informative than the density experiment, that is, $\delta(E_{0n}, \tilde E_{1n}) \to 0$. It remains to prove that $\delta(\tilde E_{1n}, E_{0n}) \to 0$. From (11) it follows that $\delta(\tilde E_{1n}, E_{1n}) \to 0$. From Theorem 3 we have that $\Delta(E_{1n}, E_{2n}) \to 0$. We know from Nussbaum (1996) that $\Delta(E_{2n}, E_{0n}) \to 0$. Thus we have also proved $\delta(\tilde E_{1n}, E_{0n}) \to 0$, and hence $\Delta(E_{0n}, \tilde E_{1n}) \to 0$. Thus condition (ii) of Definition 3 is also satisfied.

1.2 Inequalities for the Hellinger Distance

In the following, the $\Delta$-distance will mostly be estimated using the Hellinger distance. The Hellinger distance between probability measures $P$ and $Q$ is defined by
$$ H^2(P, Q) = \int \big( f_P^{1/2} - f_Q^{1/2} \big)^2, $$
where $f_P$ and $f_Q$ are probability densities of the distributions $P$ and $Q$ with respect to any dominating measure. Note that $\|P - Q\|_{TV} = \frac{1}{2} \int |f_P - f_Q|$. It holds that
$$ \tfrac{1}{2} H^2(P, Q) \le \|P - Q\|_{TV} \le H(P, Q). \tag{13} $$
When $P$ and $Q$ are product measures, $P = \prod_{i=1}^N p_i$, $Q = \prod_{i=1}^N q_i$, then it holds that
$$ H^2(P, Q) \le 2 \sum_{i=1}^N H^2(p_i, q_i). \tag{14} $$
The following inequality holds:
$$ H^2\big( N(\mu_1, \sigma^2), N(\mu_2, \sigma^2) \big) \le \frac{(\mu_1 - \mu_2)^2}{4\sigma^2}. \tag{15} $$
From Golubev and Nussbaum (1998),
$$ H^2\big( N(\mu, R_1), N(\mu, R_2) \big) \le \frac{1}{16} \, \|R_1^{-1}\| \, \|R_2^{-1}\| \sum_{k,l} \big( [R_1]_{k,l} - [R_2]_{k,l} \big)^2, \tag{16} $$
where
$$ \|R\| = \sup_{\|x\| \le 1} x' R x \le \mathrm{Const} \Big( \sum_{k,l} [R]_{k,l}^2 \Big)^{1/2}. $$
A special case of the previous inequality is
$$ H^2\big( N(\mu, \mathrm{diag}(\sigma_{1i}^2)), \, N(\mu, \mathrm{diag}(\sigma_{2i}^2)) \big) \le \mathrm{Const} \Big( \sum_{i=1}^N \sigma_{1i}^{-2} \Big)^{1/2} \Big( \sum_{i=1}^N \sigma_{2i}^{-2} \Big)^{1/2} \sum_{i=1}^N \big( \sigma_{1i}^2 - \sigma_{2i}^2 \big)^2. \tag{17} $$
Finally we need that
$$ H^2\Big( \mathcal{L}\big( f(t)\,dt + n^{-1/2} dW(t) \big), \ \mathcal{L}\big( g(t)\,dt + n^{-1/2} dW(t) \big) \Big) \le \mathrm{Const} \cdot n \int (f - g)^2. \tag{18} $$
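As a quick numerical sanity check (not part of the paper), the closed forms $H^2 = 2\bigl(1 - \exp(-(\mu_1-\mu_2)^2/(8\sigma^2))\bigr)$ and $\|P - Q\|_{TV} = 2\Phi\bigl(|\mu_1-\mu_2|/(2\sigma)\bigr) - 1$ for two Gaussians with equal variance can be used to verify (13) and (15):

```python
from math import erf, exp, sqrt

def hellinger2_gauss(mu1, mu2, sigma):
    """H^2 between N(mu1, sigma^2) and N(mu2, sigma^2), closed form."""
    return 2.0 * (1.0 - exp(-((mu1 - mu2) ** 2) / (8.0 * sigma ** 2)))

def tv_gauss(mu1, mu2, sigma):
    """||P - Q||_TV = (1/2) int |f_P - f_Q| for equal-variance Gaussians."""
    d = abs(mu1 - mu2) / (2.0 * sigma)
    Phi = 0.5 * (1.0 + erf(d / sqrt(2.0)))
    return 2.0 * Phi - 1.0

mu1, mu2, sigma = 0.0, 1.0, 1.0
H2, TV = hellinger2_gauss(mu1, mu2, sigma), tv_gauss(mu1, mu2, sigma)
assert 0.5 * H2 <= TV <= sqrt(H2)                    # inequality (13)
assert H2 <= (mu1 - mu2) ** 2 / (4.0 * sigma ** 2)   # inequality (15)
print(round(H2, 4), round(TV, 4))
```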

2 The Heteroscedastic Experiment

In this section we will prove (10). That is, we will prove constructively that the heteroscedastic Gaussian experiment $\tilde E_{1n}$ is asymptotically less informative than the density experiment $E_{0n}$. The proof takes as its starting point the weak convergence
$$ n^{1/2} (\hat F_n - F) \Rightarrow B \circ F, \tag{19} $$
where $B$ is a Brownian bridge process and $F(t) = \int_0^t f(s)\,ds$. From (19) one moves to the space of finite-dimensional sequences,
$$ \int \varphi_i \, d\hat F_n \ \approx \ \int \varphi_i \, dF + n^{-1/2} \int \varphi_i \, d(B \circ F), \qquad i = 1, \dots, N, \tag{20} $$
which suggests the accompanying heteroscedastic Gaussian experiment $\tilde E_{n1}$ defined in (8).

In Section 2.1 we will prove the weak convergence with a rate for certain finite-dimensional sequences, in Section 2.2 we will strengthen the weak convergence to convergence in the total variation norm, and in Section 2.3 we will move from (20) to the accompanying experiment $\tilde E_{n1}$ defined in (8).

2.1 Weak Convergence

We will modify the empirical distribution function $\hat F_n$ to
$$ \tilde F_n = \hat F_n (1 + n^{-1/2} z_0), $$
where $z_0 \sim N(0,1)$ is independent of $X_1, \dots, X_n$. Instead of (19) we now have
$$ n^{1/2} (\tilde F_n - F) \Rightarrow W \circ F, \tag{21} $$
where $W$ is a Wiener process, or Brownian motion, on $[0,1]$. We will start by proving a rate for the weak convergence of certain finite-dimensional sequences, to be defined in (25) and (26). For $x \in \mathbb{R}^N$, let the norm be
$$ \|x\|_{\mathrm{seq}} = \sup_{i=1,\dots,N} N^{-1} |x_i|. $$
The rate of the weak convergence will be given in terms of the bounded Lipschitz distance. The bounded Lipschitz distance between probability distributions $P$ and $Q$ on $\mathbb{R}^N$ is defined by
$$ \|P - Q\|_{BL} = \sup_{g \in BL} \int g \,(dP - dQ), \tag{22} $$
where
$$ BL = \big\{ g : \mathbb{R}^N \to \mathbb{R}, \ \|g\|_\infty \le 1, \ \|g\|_L \le 1 \big\}, \tag{23} $$
with
$$ \|g\|_L = \sup_{x \ne y} \frac{|g(x) - g(y)|}{\|x - y\|_{\mathrm{seq}}}. $$
The bounded Lipschitz distance metrizes weak convergence.

We can write
$$ n^{1/2} (\tilde F_n - F) - W \circ F = n^{1/2} (\hat F_n - F) - (W \circ F - z_0 F) + z_0 (\hat F_n - F) = n^{1/2} (\hat F_n - F) - B \circ F + z_0 (\hat F_n - F), $$

where $B$ is a Brownian bridge, defined in terms of the Wiener process $W$ as $B(t) = W(t) - t W(1)$ (the Wiener process being chosen so that $W(1) = z_0$). By Bretagnolle and Massart (1989),
$$ P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > n^{-1/2} (x + 12 \log n) \Big) \le 2 \exp(-x/6), $$
where $x > 0$ and, for $G$ in the Skorohod space $D[0,1]$, $\|G\|_\infty = \sup_{t \in [0,1]} |G(t)|$. Thus, choosing for example $x = 6 \log n$,
$$ P\Big( \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty > 36\, n^{-1/2} \log n \Big) \le P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty + |z_0| \, \|\hat F_n - F\|_\infty > 36\, n^{-1/2} \log n \Big) $$
$$ \le P\Big( \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > 18\, n^{-1/2} \log n \Big) + P\Big( |z_0| \, \|\hat F_n - F\|_\infty > 18\, n^{-1/2} \log n \Big) \le \mathrm{Const}\, n^{-r}, \tag{24} $$
where $r > 1/2$ and $\mathrm{Const}$ does not depend on the unknown distribution $F$. Here we used the estimate
$$ P\Big( |z_0| \, \|\hat F_n - F\|_\infty > 18\, n^{-1/2} \log n \Big) \le P\Big( |z_0| \, \big\| n^{1/2} (\hat F_n - F) - B \circ F \big\|_\infty > 9 \log n \Big) + P\Big( |z_0| \, \|B \circ F\|_\infty > 9 \log n \Big) $$
and then the facts that
$$ P\big( |z_0| > (9 \log n)^{1/2} \big) \le n^{-9/2} $$
and
$$ P\big( \|B \circ F\|_\infty > (9 \log n)^{1/2} \big) \le n^{-9/2}, $$
which follows from Talagrand (1988, Lemma 4). Now we will move to the space of sequences. Let

$$ \zeta_n = \Big( \int \varphi_i \, d\big[ n^{1/2} (\tilde F_n - F) \big] \Big)_{i=1,\dots,N}, \tag{25} $$
where, as before, $\varphi_i(t) = N \, I_{]t_{i-1}, t_i]}(t)$ and $t_i = i/N$. Let also
$$ \eta_n = \Big( \int \varphi_i \, d(W \circ F) \Big)_{i=1,\dots,N}. \tag{26} $$
For $Y = n^{1/2} (\tilde F_n - F)$ and for $Y = W \circ F$ we have
$$ \int \varphi_i(t) \, dY(t) = N \int_{t_{i-1}}^{t_i} dY(t) = N \big( Y(t_i) - Y(t_{i-1}) \big), $$
and thus
$$ \Big| \int \varphi_i(t) \, dY(t) \Big| \le 2 N \|Y\|_\infty. $$
Thus
$$ \| \zeta_n - \eta_n \|_{\mathrm{seq}} \le 2 \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty, $$
and thus, by (24),
$$ P\big( \| \zeta_n - \eta_n \|_{\mathrm{seq}} > \mathrm{Const}\, n^{-1/2} \log n \big) \le P\Big( \big\| n^{1/2} (\tilde F_n - F) - W \circ F \big\|_\infty > \mathrm{Const}\, n^{-1/2} \log n \Big) \le \mathrm{Const}\, n^{-r}. $$
Now, for sufficiently large $n$,
$$ \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} \le \sup_{g \in BL} E \big| g(\zeta_n) - g(\eta_n) \big| \le \mathrm{Const}\, n^{-1/2} \log n + 2 P\big( \| \zeta_n - \eta_n \|_{\mathrm{seq}} > \mathrm{Const}\, n^{-1/2} \log n \big) $$
$$ \le \mathrm{Const}\, n^{-1/2} \log n + \mathrm{Const}\, n^{-r} \le \mathrm{Const}\, n^{-1/2} \log n. $$
Because $\mathrm{Const}$ does not depend on the density $f$, we have proved that
$$ \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} = O\big( n^{-1/2} \log n \big). \tag{27} $$
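As an informal illustration of (25)-(27) (this simulation is ours, not the paper's; we take $f \equiv 1$ so that $F(t) = t$), one can draw $\zeta_n$ from data and $\eta_n$ from its Gaussian limit, whose $i$-th coordinate is $N\bigl(W(F(t_i)) - W(F(t_{i-1}))\bigr) \sim N\bigl(0, N^2 (F(t_i) - F(t_{i-1}))\bigr)$ independently over $i$, and compare their coordinatewise variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, reps = 10_000, 20, 2000
t = np.linspace(0.0, 1.0, N + 1)          # bin edges t_i = i/N
F = t                                     # illustrative f = 1 on [0,1], so F(t) = t

def zeta(rng):
    x = rng.uniform(size=n)               # i.i.d. sample from f
    Fhat = np.searchsorted(np.sort(x), t, side="right") / n
    Ftilde = Fhat * (1.0 + rng.standard_normal() / np.sqrt(n))
    return N * np.sqrt(n) * (np.diff(Ftilde) - np.diff(F))

def eta(rng):                             # increments of W o F, scaled by N
    return N * np.sqrt(np.diff(F)) * rng.standard_normal(N)

Z = np.array([zeta(rng) for _ in range(reps)])
E = np.array([eta(rng) for _ in range(reps)])
print(Z.var(axis=0).round(1))             # both should be close to
print(E.var(axis=0).round(1))             # N^2 (F(t_i) - F(t_{i-1})) = N here
```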

2.2 Total Variation Convergence

In this section we will strengthen the convergence in (27) to convergence in the total variation distance, but the convergence in the total variation distance will be without a rate.

Let $\zeta_n$ be as defined in (25) and let $\eta_n$ be as defined in (26). We will add to $\zeta_n$ a certain small random disturbance $U_n$ in order to strengthen the convergence in (27) to convergence in the total variation norm. A similar kind of reasoning has been used by Müller (1981). Let
$$ U_n = (\sigma_n z_i)_{i=1,\dots,N}, $$
where the $z_i$ are i.i.d. $N(0,1)$ and $\sigma_n$ is defined in (7). Now
$$ \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n) \|_{TV} \le \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} \tag{28} $$
$$ \qquad\qquad + \| \mathcal{L}(\eta_n + U_n) - \mathcal{L}(\eta_n) \|_{TV}. \tag{29} $$
The term (28) can be characterised as a variance term and (29) as a bias term. To estimate (28),
$$ \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} = \sup_{\|g\|_\infty \le 1} \Big| E_{\zeta_n} E_{U_n} g(\zeta_n + U_n) - E_{\eta_n} E_{U_n} g(\eta_n + U_n) \Big|. \tag{30} $$
We shall establish that when $g : \mathbb{R}^N \to \mathbb{R}$ with $\|g\|_\infty \le 1$, then the function $x \mapsto E_{U_n} g(x + U_n)$ can be made a bounded Lipschitz function $\mathbb{R}^N \to \mathbb{R}$ with respect to the $\|\cdot\|_{\mathrm{seq}}$ norm.

Lemma 4

Let $g : \mathbb{R}^N \to \mathbb{R}$ be such that $\|g\|_\infty \le 1$. Then for $x, y \in \mathbb{R}^N$,
$$ \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| \le L_n \|x - y\|_{\mathrm{seq}}, $$
where
$$ L_n = 2^{-1/2} \sigma_n^{-1} N^{3/2}. $$

Proof. Using (13), (14), and (15),
$$ \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| \le \sup_{\|g\|_\infty \le 1} \big| E_{U_n} g(x + U_n) - E_{U_n} g(y + U_n) \big| = \| \mathcal{L}(x + U_n) - \mathcal{L}(y + U_n) \|_{TV} $$
$$ \le H\big( \mathcal{L}(x + U_n), \mathcal{L}(y + U_n) \big) \le \Big[ 2 \sum_{i=1}^N H^2\big( \mathcal{L}(x_i + \sigma_n z_i), \mathcal{L}(y_i + \sigma_n z_i) \big) \Big]^{1/2} $$
$$ \le 2^{1/2} \Big[ \sum_{i=1}^N \frac{(x_i - y_i)^2}{4 \sigma_n^2} \Big]^{1/2} \le 2^{-1/2} \sigma_n^{-1} N^{3/2} \|x - y\|_{\mathrm{seq}}. \qquad \Box $$

From Lemma 4 we have that the function $x \mapsto E_{U_n} g(x + U_n) / \max\{L_n, 1\}$ is in $BL$, where $BL$ was defined in (23). From (22), (27), and (30) we obtain an upper bound for (28):
$$ \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n + U_n) - \mathcal{L}(\eta_n + U_n) \|_{TV} \le \max\{L_n, 1\} \sup_{f \in \Sigma} \| \mathcal{L}(\zeta_n) - \mathcal{L}(\eta_n) \|_{BL} = \max\{L_n, 1\} \, O\big( n^{-1/2} \log n \big). \tag{31} $$
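A small numerical illustration of Lemma 4 (our own check, with an arbitrary discontinuous test function $g$ and Monte Carlo in place of the exact expectation over $U_n$): convolving a bounded $g$ with the noise $U_n$ yields a function that is Lipschitz in $\|\cdot\|_{\mathrm{seq}}$ with constant at most $L_n = 2^{-1/2} \sigma_n^{-1} N^{3/2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma_n = 5, 0.5
L_n = N ** 1.5 / (sigma_n * np.sqrt(2.0))        # Lipschitz constant of Lemma 4

def g(v):                                        # bounded (|g| <= 1) but discontinuous
    return np.sign(np.sum(v, axis=-1))

def smoothed_g(x, m=200_000):                    # x -> E_{U_n} g(x + U_n), by Monte Carlo
    U = sigma_n * rng.standard_normal((m, N))
    return np.mean(g(x + U))

x = rng.standard_normal(N)
y = x + 0.01 * rng.standard_normal(N)
lhs = abs(smoothed_g(x) - smoothed_g(y))
seq_norm = np.max(np.abs(x - y)) / N             # ||x - y||_seq = max_i N^{-1}|x_i - y_i|
print(lhs, L_n * seq_norm)                       # lhs <= L_n * ||x - y||_seq, up to MC error
```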
