Regression with Fractional Time Series Errors
Yuanhua Feng
Department of Mathematics and Statistics
University of Konstanz
Summary. Consider the estimation of $g^{(\nu)}$, the $\nu$-th derivative of the mean function, in a fixed design nonparametric regression model with a linear, invertible, stationary time series error process $\xi_i$. Assume that $g \in C^k$ and that the spectral density of $\xi_i$ has the form $f(\lambda) \sim c_f|\lambda|^{-\alpha}$ as $\lambda \to 0$ with constants $c_f > 0$ and $\alpha \in (-1,1)$. Let $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$. It is shown that the optimal convergence rate for $\hat g^{(\nu)}$ is $n^{-r_\nu}$. This rate is achieved by local polynomial fitting. It is also shown that the required regularity conditions on the innovation distribution in the current context are the same as those in nonparametric regression with iid errors.

Keywords: Nonparametric regression, optimal convergence rate, long memory, antipersistence, inverse process.
1 Introduction
Consider the estimation of the $\nu$-th derivative of the mean function, $g^{(\nu)}$, in the equidistant design nonparametric regression model

(1.1)    $Y_i = g(x_i) + \xi_i$,

where $x_i = i/n$, $g : [0,1] \to \mathbb{R}$ is a smooth function and $\xi_i$ is a linear, (second order and strictly) stationary process generated by an iid (independent, identically distributed) innovation series $\varepsilon_i$ through a linear filter. For the autocovariance function $\gamma(k) = \mathrm{cov}(\xi_i, \xi_{i+k})$ it is assumed that $\gamma(k) \to 0$ as $|k| \to \infty$. Equation (1.1) represents a nonparametric regression model with short memory (including iid $\xi_i$ as a special case), long memory and antipersistence. Here, a stationary process $\xi_i$ is said to have long memory (or long-range dependence) if $\sum \gamma(k) = \infty$. A stronger assumption is that the spectral density $f(\lambda) = (2\pi)^{-1} \sum \gamma(k)\exp(-ik\lambda)$ has a pole at the origin of the form

(1.2)    $f(\lambda) \sim c_f|\lambda|^{-\alpha}$  (as $\lambda \to 0$)

for some $\alpha \in (0,1)$, where $c_f > 0$ is a constant and `$\sim$' means that the ratio of the left and the right hand sides converges to one (see Beran, 1994, and references therein). Note that (1.2) implies $\gamma(k) \sim c_\gamma|k|^{\alpha-1}$, so that $\sum\gamma(k) = \infty$; hence $\xi_i$ then has long memory. If (1.2) holds with $\alpha = 0$, then we have $0 < \sum\gamma(k) < \infty$ and $\xi_i$ is said to have short memory. On the other hand, a stationary process is said to be antipersistent if (1.2) holds for $\alpha \in (-1,0)$, implying that $\sum\gamma(k) = 0$.
The aim of this paper is to investigate the minimax optimal convergence rate of a nonparametric estimator of $g^{(\nu)}$ (see e.g. Farrell, 1972, Stone, 1980, 1982, and Hall and Hart, 1990a, for related work). For a summary of the nonparametric minimax theory we refer the reader to Hall (1989). Hall and Hart (1990a) obtained the optimal convergence rate for estimating $g$ in nonparametric regression with Gaussian stationary short- and long-memory errors. In this paper a unified formula for the optimal convergence rate for estimating $g^{(\nu)}$ in nonparametric regression with short-memory, long-memory and antipersistent errors is given. It is shown that this rate is achieved by local polynomial fitting (Beran and Feng, 2001a). Our finding generalizes previous results of Stone (1980) and Hall and Hart (1990a) in various ways. A simple condition under which a sequence $n^{-r_\nu}$ forms a lower bound to the convergence rate is given for nonparametric regression with stationary time series errors at any dependence level. Results in this paper are given for Gaussian and non-Gaussian error processes satisfying some regularity conditions.
The estimator and the error process are defined in Section 2. Section 3 describes the conditions on the distribution and provides the main results. It turns out that the required regularity conditions on the marginal innovation distribution are the same for all $\alpha \in (-1,1)$ and hence do not depend on the dependence structure. Some auxiliary results, which can be thought of as part of the proofs, are given in Section 4. Detailed proofs are put in the appendix.
2 The estimator and the error process

2.1 The local polynomial fitting
A kernel estimator of $g$ in nonparametric regression with short-memory and long-memory errors was proposed by Hall and Hart (1990a). Beran (1999) extended the kernel estimator to nonparametric regression with antipersistence. However, it is well known that the kernel estimator is affected by the boundary problem. Another attractive nonparametric approach is local polynomial fitting, introduced by Stone (1977) and Cleveland (1979). Beran and Feng (2001a) proposed local polynomial fitting in nonparametric regression with short-memory, long-memory and antipersistent errors. In this paper we will use the proposal in Beran and Feng (2001a) to show the achievability of the optimal convergence rate.
Let $k \ge 2$ be a positive integer. The function class considered in this paper is $C^k(B)$, the collection of all $k$ times differentiable functions $g$ on $[0,1]$ which satisfy

$\sup_{0 \le x \le 1}\ \max_{\nu=0,1,\ldots,k} |g^{(\nu)}(x)| \le B.$
Let $p = k - 1$. Then $g$ can be locally approximated by a polynomial of order $p$ for $x$ in the neighbourhood of a point $x_0$:

(2.1)    $g(x) = g(x_0) + g'(x_0)(x - x_0) + \ldots + g^{(p)}(x_0)(x - x_0)^p/p! + R_p,$
where $R_p$ is a remainder term. Let $K$ be a second order kernel (a symmetric density) having compact support $[-1,1]$. Given $n$ observations $Y_1, \ldots, Y_n$, we can obtain an estimator of $g^{(\nu)}$ ($\nu \le p$) by solving the locally weighted least squares problem
(2.2)    $Q = \sum_{i=1}^{n} \Bigl\{Y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j\Bigr\}^2 K\Bigl(\frac{x_i - x_0}{h}\Bigr) \to \min,$
where $h$ is the bandwidth. Let $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)'$ be the solution of (2.2). Then it is clear from (2.1) that $\hat g^{(\nu)}(x_0) := \nu!\,\hat\beta_\nu$ estimates $g^{(\nu)}(x_0)$, $\nu = 0, 1, \ldots, p$; this is the local polynomial estimator of $g^{(\nu)}$. Note in particular that $\hat g^{(\nu)}$ is the same for nonparametric regression with stationary time series errors at any dependence level.
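To make (2.2) concrete, the following minimal sketch (our own illustration in Python, not part of the original development; the Epanechnikov kernel and all names are illustrative choices) computes $\hat g^{(\nu)}(x_0) = \nu!\,\hat\beta_\nu$ by weighted least squares.

```python
import math
import numpy as np

def local_poly_deriv(y, x0, h, nu=0, p=None):
    """Sketch of (2.2): estimate g^(nu)(x0) from equidistant data
    Y_i = g(i/n) + xi_i by locally weighted least squares of order p.
    K(u) = 0.75 (1 - u^2) on [-1, 1] (Epanechnikov) is one admissible
    second-order kernel."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    if p is None:
        p = nu + 1  # a common choice; the paper uses p = k - 1
    # Design matrix with columns (x_i - x0)^j, j = 0, ..., p
    X = np.vander(x - x0, p + 1, increasing=True)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return math.factorial(nu) * beta[nu]  # ghat^(nu)(x0) = nu! * betahat_nu
```

As emphasized above, the code never references $\alpha$: the estimator is identical at any dependence level; only its risk and the optimal bandwidth depend on the dependence structure.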
2.2 The fractional error process

In this paper it is assumed that the spectral density of $\xi_i$ has the form (1.2). Hence $\xi_i$ will be called a fractional time series error process. $\xi_i$ is also assumed to be causal, linear and invertible. That is, $\xi_i$ can be expressed in two ways:
(2.3)    $\xi_i = \psi(B)\varepsilon_i$

and

(2.4)    $\varepsilon_i = \varphi(B)\xi_i,$
where the innovations $\varepsilon_i$ are iid mean zero random variables with $\mathrm{var}(\varepsilon_i) = \sigma_\varepsilon^2 < \infty$, $B$ is the backshift operator, and $\psi(B) = \sum_{j=0}^{\infty} a_j B^j$ and $\varphi(B) = \sum_{j=0}^{\infty} b_j B^j$ are the characteristic polynomials of the MA and AR representations of $\xi_i$, respectively, with $a_0 = b_0 = 1$, $\sum a_j^2 < \infty$ and $\sum b_j^2 < \infty$. The causality assumption on $\xi_i$ is made here for convenience.
Some properties of $\xi_i$ can be understood more easily by means of its inverse process. Following Chatfield (1979), the inverse process of $\xi_i$, denoted by $\xi_i^I$, is the process with the same innovations $\varepsilon_i$ and with $\varphi(B)$ resp. $\psi(B)$ as the characteristic polynomials of its MA resp. AR representation, i.e.
(2.5)    $\xi_i^I = \varphi(B)\varepsilon_i$

and

(2.6)    $\varepsilon_i = \psi(B)\xi_i^I.$
Following Shaman (1975), the spectral density of $\xi_i^I$, $f_I(\lambda)$ say, is

(2.7)    $f_I(\lambda) = \sigma_\varepsilon^4(2\pi)^{-2}(f(\lambda))^{-1} \sim c_f^I|\lambda|^{-\alpha_I}$  (as $\lambda \to 0$),
where $c_f^I = \sigma_\varepsilon^4(2\pi)^{-2}(c_f)^{-1}$ and $\alpha_I = -\alpha$. Equation (2.7) implies: 1. if $\xi_i$ is a short-memory process, so is $\xi_i^I$ (in particular, the inverse process of an iid process is the process itself); 2. if $\xi_i$ is a long-memory process with $0 < \alpha < 1$, then $\xi^I$ is an antipersistent process with $\alpha_I = -\alpha$, and vice versa.
From (2.3) we see that the autocovariances of $\xi_i$ are $\gamma(k) = \sigma_\varepsilon^2 \sum a_j a_{j+|k|}$. The inverse autocovariances of $\xi_i$ (Cleveland, 1972, and Chatfield, 1979), i.e. the autocovariances of $\xi_i^I$, are given by $\gamma^I(k) = \sigma_\varepsilon^2 \sum_{j=0}^{\infty} b_j b_{j+|k|}$. Hence we have $\sum\gamma(k) = \sigma_\varepsilon^2(\sum a_j)^2$ and $\sum\gamma^I(k) = \sigma_\varepsilon^2(\sum b_j)^2$. This results in $\sum a_j = \infty$, $\sum b_j = 0$ for $\alpha > 0$ and $\sum a_j = 0$, $\sum b_j = \infty$ for $\alpha < 0$. For $\alpha = 0$ we have both $0 < \sum a_j < \infty$ and $0 < \sum b_j < \infty$.
A class of processes having the property (1.2) is the class of FARIMA$(p, \delta, q)$ (fractional ARIMA) processes (Granger and Joyeux, 1980, and Hosking, 1981), where $\delta \in (-0.5, 0.5)$ is the fractional differencing parameter. It is well known that the spectral density of a FARIMA process has the form (1.2) with $\alpha = 2\delta$.
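To illustrate these relations, the following sketch (ours; the binomial-coefficient recursion for the weights of $(1-B)^{\mp\delta}$ is standard, while all function names are assumptions) computes truncated MA and AR weights of a FARIMA$(0,\delta,0)$ process, checks the behaviour of their partial sums, and simulates a sample path.

```python
import numpy as np

def ma_weights(delta, J):
    """MA(inf) weights a_j of (1 - B)^(-delta), i.e. of FARIMA(0, delta, 0):
    a_j = Gamma(j + delta) / (Gamma(j + 1) Gamma(delta)), via the stable
    recursion a_j = a_{j-1} (j - 1 + delta) / j with a_0 = 1.  The AR
    weights b_j follow by replacing delta with -delta."""
    a = np.empty(J)
    a[0] = 1.0
    for j in range(1, J):
        a[j] = a[j - 1] * (j - 1 + delta) / j
    return a

def farima0d0(n, delta, J=5000, rng=None):
    """Simulate xi_i = sum_{j < J} a_j eps_{i-j} (truncated MA representation);
    the spectral density then behaves like |lambda|^(-2 delta) near 0."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(n + J - 1)
    return np.convolve(eps, ma_weights(delta, J), mode="valid")

delta = 0.2                                # alpha = 2 delta = 0.4: long memory
print(ma_weights(delta, 10 ** 5).sum())    # partial sums of a_j diverge slowly
print(ma_weights(-delta, 10 ** 5).sum())   # partial sums of b_j tend to 0
```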
3 Optimal convergence rates
3.1 Assumptions on the innovation distribution
An important finding of this paper is that the derivation showing that a given sequence is a lower bound to the convergence rate in nonparametric regression with error process $\xi_i$ is similar to that for nonparametric regression with the iid errors $\varepsilon_i$. Furthermore, it turns out that the required conditions on the marginal distribution of $\varepsilon_i$ under model (1.1) are the same for any $\alpha \in (-1,1)$, i.e. they do not depend on the dependence level. In the following we adapt the regularity conditions in Stone (1980, 1982) to fixed design nonparametric regression. Assume that $Z(g)$ is a real random variable depending on $g \in \mathbb{R}$. It is assumed that the density function $f(z;g)$ is strictly positive and that $f(z;g) = f(z-g;0)$, where $g$ is the mean of $Z(g)$, i.e.

$\int z f(z;g)\,dz = g$

for all $g \in \mathbb{R}$. It is further assumed that the equation

$\int f(z;g)\,dz = 1$

can be twice continuously differentiated with respect to $g$ to yield

$\int f'(z;g)\,dz = 0$

and

$\int f''(z;g)\,dz = 0.$
Here $f(z;0)$ is the marginal density of the innovations $\varepsilon_i$, which will be simply denoted by $f(z)$ in the following. Using this notation the density of $Z(g)$ may be represented as $f(z;g) = f(z-g)$. Set $l(z;g) = \log f(z;g)$. It is assumed that there are positive constants $\delta_0$ and $C$ and that there is a function $M(z;g)$ such that, for $g \in \mathbb{R}$,

$|l''(z; g+\Delta)| \le M(z;g)$ for $|\Delta| \le \delta_0$

and

$\int M(z;g) f(z;g)\,dz \le C.$

Note that the last condition is fulfilled if $l''(z;g)$ is bounded.
Remark 1. It is easy to show that all of these conditions are fulfilled if $Z(g)$ is Gaussian with

$f(z;g) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\, e^{-\frac{(z-g)^2}{2\sigma_\varepsilon^2}}, \quad -\infty < z, g < \infty.$
It is also not hard to show that these conditions are fulfilled if the marginal distribution of $\varepsilon_i$ is the Student $t_m$ distribution with $m \ge 3$, i.e. if $f(z;g)$ is given by

$f_m(z;g) = \frac{\Gamma[(m+1)/2]}{\Gamma(m/2)\sqrt{m\pi}}\Bigl(1 + \frac{(z-g)^2}{m}\Bigr)^{-(m+1)/2}, \quad -\infty < z, g < \infty.$
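Since the integral condition above is fulfilled whenever $l''$ is bounded, both cases of Remark 1 can be checked in closed form. A small numerical sketch (ours):

```python
import numpy as np

def l2_gauss(u, sigma2=1.0):
    """l''(z; g) for the Gaussian case at u = z - g: the constant -1/sigma2."""
    return -np.ones_like(u) / sigma2

def l2_t(u, m):
    """l''(z; g) for the t_m case at u = z - g: with
    l(u) = c - ((m + 1)/2) log(1 + u^2/m) one gets
    l''(u) = -(m + 1)(m - u^2) / (m + u^2)^2, bounded by (m + 1)/m."""
    return -(m + 1) * (m - u ** 2) / (m + u ** 2) ** 2

u = np.linspace(-100.0, 100.0, 200001)
print(np.abs(l2_gauss(u)).max())   # 1.0
print(np.abs(l2_t(u, m=3)).max())  # (3 + 1)/3 = 1.333...
```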
Remark 2. Observe, however, that other distributions considered by Stone (1980), e.g. the exponential distribution, do not satisfy the regularity conditions given above. If the $\varepsilon_i$ are iid exponentially distributed with $E(\varepsilon_i) = 0$ and $\mathrm{var}(\varepsilon_i) = \theta^2$, then the density function of $Z(g)$ is given by

$f(z;g) = \frac{1}{\theta}\, e^{-(z+\theta-g)/\theta}, \quad -\infty < g < \infty \text{ and } g - \theta \le z < \infty,$

and zero otherwise. The support of $f > 0$ for this distribution depends on $g$.
3.2 Lower bounds to convergence rates
For the minimax optimal convergence rate we will use the following definition (see e.g. Farrell, 1972, Stone, 1980, and Hall and Hart, 1990a). Let $\nu < k$ be a nonnegative integer and let $\tilde g_n^{(\nu)}$ denote a generic nonparametric estimator of $g^{(\nu)}$ based on $(Y_1, \ldots, Y_n)$. Let $r_\nu$ be a positive number. The sequence $n^{-r_\nu}$ is called a lower bound to the convergence rate at $x_0$ if

(3.1)    $\liminf_n\ \sup_{g \in C^k} P\bigl(|\tilde g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) > 0$

for $c$ sufficiently small. $n^{-r_\nu}$ is called an achievable convergence rate if there is a sequence of estimators $\hat g_n^{(\nu)}$ such that

(3.2)    $\lim_{c\to\infty}\ \limsup_n\ \sup_{g \in C^k} P\bigl(|\hat g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) = 0.$
Also, the sequence $n^{-r_\nu}$ is called the optimal convergence rate if it is an achievable lower bound to the convergence rate. The optimal convergence rate for a nonparametric regression estimator of $g^{(\nu)}$ with iid errors is $n^{-(k-\nu)/(2k+1)}$ (Stone, 1980). In fact, $n^{-(k-\nu)/(2k+1)}$ is also the optimal convergence rate for estimating $g^{(\nu)}$ in nonparametric regression with short-memory errors (results for $\alpha = 0$ may be found in Hall and Hart, 1990a). In the case with $0 < \alpha < 1$, Hall and Hart (1990a) showed that the optimal convergence rate for estimating $g$ is $n^{-(1-\alpha)k/(2k+1-\alpha)}$. In this paper we will show that $n^{-r_\nu}$ with $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$ is the optimal convergence rate for estimating $g^{(\nu)}$, uniformly for $\alpha \in (-1,1)$. The following theorem shows first that $n^{-r_\nu}$ is a lower bound to the convergence rate, i.e. $n^{-r_\nu}$ satisfies (3.1).
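The unified exponent is easy to tabulate. A short sketch (ours) shows how long memory slows down, and antipersistence accelerates, the rate relative to the iid case:

```python
def r(k, nu, alpha):
    """Optimal rate exponent r_nu = (1 - alpha)(k - nu) / (2k + 1 - alpha)."""
    return (1 - alpha) * (k - nu) / (2 * k + 1 - alpha)

k = 2
for alpha in (-0.5, 0.0, 0.5):  # antipersistence, short memory, long memory
    print(alpha, round(r(k, 0, alpha), 4), round(r(k, 1, alpha), 4))
# alpha = 0 recovers Stone's n^{-(k - nu)/(2k + 1)}: r_0 = 0.4, r_1 = 0.2;
# alpha = 0.5 gives r_0 = 0.2222; alpha = -0.5 gives r_0 = 0.5455.
```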
Theorem 1. Let model (1.1) hold with $g \in C^k$. Let $x_0 \in (0,1)$ be an interior point of the support of $g$. Let $\nu < k$ and $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$. Assume that the regularity conditions on the marginal innovation distribution described in Section 3.1 hold. Then $n^{-r_\nu}$ is a lower bound to the convergence rate for estimating $g^{(\nu)}(x_0)$.
The proof of Theorem 1 is given in the appendix.

Theorem 1 extends previous results obtained by Stone (1980) and Hall and Hart (1990a) in different ways. The results in Stone (1980) are extended to nonparametric regression with fractional time series errors. The main differences between the results of Theorem 1 and those given in Hall and Hart (1990a) are: 1. these results are given for all $\alpha \in (-1,1)$, including the antipersistent case; 2. these results are available for non-Gaussian error processes satisfying regularity conditions on the marginal innovation distribution; 3. the estimation of derivatives is also considered.
Remark 3. The sequence $n^{-r_\nu}$ as defined in Theorem 1 is of course also a lower bound to the convergence rate for estimation at the two boundary points $x_0 = 0$ or $x_0 = 1$, since the set of all measurable functions of the observations at $x_0 = 0$ (resp. $x_0 = 1$) under the restriction that there are no observations on the left (resp. right) hand side is a subset of all measurable functions.
Remark 4. In the proof of Theorem 1 a two-point discrimination argument is used. It will be shown that the probability on the right hand side of (3.1) can be made arbitrarily close to $\frac{1}{2}$. If a more sophisticated multi-point discrimination argument is used, as in Stone (1980), then it can be shown that

(3.3)    $\lim_{c\to 0}\ \liminf_n\ \sup_{g \in C^k} P\bigl(|\tilde g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) = 1.$
Remark 5. The results of Theorem 1 are in general not available for random design nonparametric regression or density estimation with dependent observations, since the effect of dependence in such cases tends to be less profound than in the model discussed here (see Hall and Hart, 1990b).
3.3 Achievability
Beran and Feng (2001a) showed that for $g \in C^k$ with $k - \nu$ even, the uniform convergence rate of the local polynomial estimator $\hat g^{(\nu)}$ is of order $n^{-r_\nu}$ for all $x \in [0,1]$, if a bandwidth of the optimal order $n^{-(1-\alpha)/(2k+1-\alpha)}$ is used, where $r_\nu$ is as defined in Theorem 1 (see Theorem 2 in Beran and Feng, 2001a). Similar results hold for the function class $C^k$ with $k - \nu > 0$ odd. This result can be used to show the achievability of the lower bound to the convergence rate as defined in Theorem 1, i.e. (3.2) holds for the local polynomial estimator $\hat g^{(\nu)}$ with $n^{-r_\nu}$, also at the two boundary points $x_0 = 0$ and $x_0 = 1$. This results in
Theorem 2. Let $x_0 \in [0,1]$. Under the conditions of Theorem 1, $n^{-r_\nu}$ is the optimal convergence rate for estimating $g^{(\nu)}(x_0)$.
The additional proof of Theorem 2 is straightforward and is omitted to save space.
Remark 6. Indeed, the convergence rate $n^{-r_\nu}$ as defined in Theorem 1 may be achieved under much weaker conditions. It is clear that (3.2) will hold if $\hat g^{(\nu)}$ is asymptotically normal. Some sufficient conditions under which $\hat g^{(\nu)}$ is asymptotically normal are given in Beran and Feng (2001b); these are much weaker than the conditions described in Section 3.1.
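As a toy end-to-end illustration (reusing the hypothetical `farima0d0` and `local_poly_deriv` sketches from Section 2; all constants are our choices), one can generate data from model (1.1) with long-memory errors, plug in a bandwidth of the optimal order and estimate $g$ and $g'$ at an interior point:

```python
import numpy as np

n, k, alpha = 1000, 2, 0.4                     # long memory with delta = alpha/2
h = n ** (-(1 - alpha) / (2 * k + 1 - alpha))  # bandwidth of the optimal order
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.25 * farima0d0(n, delta=alpha / 2, rng=1)
print(local_poly_deriv(y, x0=0.5, h=h, nu=0))  # true value: g(0.5) = 0
print(local_poly_deriv(y, x0=0.5, h=h, nu=1))  # true value: g'(0.5) = -2 pi
```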
4 Auxiliary results

4.1 Notations
Note that $r_\nu < 1$ for all $\alpha \in (-1,1)$ and that the interpolation error is of order $n^{-1}$, which is hence negligible. Therefore we may assume without loss of generality that $x_0$ is of the form $i_0/n$. It is notationally convenient to take $x_0 = i_0/n = 0$, so we will consider the shifted model

$Y_i = g(i/n) + \xi_i, \quad i = -n, \ldots, -1, 0, 1, \ldots, n,$

and estimate $g^{(\nu)}$ at the origin. Moreover, we shall assume that both the infinite past and the infinite future are given, i.e. we observe

(4.1)    $Y_i = g(i/n) + \xi_i, \quad -\infty < i < \infty.$

Model (4.1) is assumed only for notational convenience; it saves us symbols for distinguishing finite and infinite sample paths. It turns out that the extra information is of negligible benefit for the derivation of a lower bound to the convergence rate.
The main idea of the proof of Theorem 1 is to construct two sequences of functions. If these two sequences are "hard to distinguish", then their difference will form a lower bound to the convergence rate. If they are at the same time "far apart", then their difference will form an achievable convergence rate, and hence we will obtain the optimal convergence rate. Following Stone (1980) and Hall and Hart (1990a), let $\psi_0$ be a $(k+1)$-times differentiable function on $(-\infty, \infty)$, vanishing outside $(-1, 1)$ and satisfying $\psi_0^{(\nu)}(0) > 0$ for $\nu = 0, 1, \ldots, k$. Put

$B_0 = \sup_{0 \le x \le 1}\ \max_{\nu=0,1,\ldots,k} |\psi_0^{(\nu)}(x)|.$

Choose $a > 0$ so small that $aB_0 < B$. Let $0 < s < 1$ and set $h = n^{-s}$. Define

(4.2)    $g_\eta(x) = \eta\, a h^k \psi_0(x/h).$

Then $g_\eta(x)$ for $\eta \in \{0, 1\}$ are two sequences of functions in $C^k$.
In the following we will denote the limits $\lim_{n\to\infty} \prod_{i=-n}^{n}$ and $\lim_{n\to\infty} \sum_{i=-n}^{n}$ by $\prod$ and $\sum$ for simplicity. For $-\infty < i < \infty$, define the doubly infinite column vectors $\xi = (\xi_i)$, $\varepsilon = (\varepsilon_i)$ and $g = (g_1(i/n))$. Define the doubly infinite matrices $\Gamma = (\gamma(i-j))$ and $\Gamma^I = (\gamma^I(i-j))$. Let $\Phi = (b_{i-j})$ as given in (A.6) in the appendix, where $b_i = 0$ for $i < 0$. Let $\Psi = (a_{i-j})$ be defined as $\Phi$ but with $b_{i-j}$ replaced by $a_{i-j}$. Then we have $\xi = \Psi\varepsilon$ and $\varepsilon = \Phi\xi$. Let $Y_\eta = (Y_i)$; we have $Y_\eta = \eta g + \xi$. Define $X_\eta = \Phi Y_\eta = \eta\zeta + \varepsilon$, where $\zeta = \Phi g$. Note that $X_0 = \varepsilon$ and $Y_0 = \xi$. Furthermore, we see that $X_1$ is a sequence of independent random variables.
4.2 The likelihood functions and the error probabilities
Let $L_0$ and $\bar L_0$ denote the likelihood functions of $X_0 = \varepsilon$ and $Y_0 = \xi$, respectively. Observe that $L_0(x) = \prod f(x_i)$, where $x = (\ldots, x_{-1}, x_0, x_1, \ldots)'$ is a doubly infinite vector and $f$ is the marginal density function of $\varepsilon_i$. The following lemma gives the relationship between these two likelihood functions.
Lemma 1. For the fractional time series process defined by (2.3) and (2.4), and a doubly infinite real vector $y$, we have

(4.3)    $\bar L_0(y) = L_0(x) = \prod_{i=-\infty}^{\infty} f(x_i),$

where $x = \Phi y$ with $x_i = \sum_{j=-\infty}^{\infty} b_j y_{i-j}$, $-\infty < i < \infty$, and $f$ is the marginal density function of $\varepsilon_i$.
The proof of Lemma 1 is given in the appendix. Lemma 1 shows that $\bar L$ is uniquely determined by $L$. Note that, conversely, $L$ is also uniquely determined by $\bar L$. Following Lemma 1, the estimation of the likelihood function of an invertible stationary time series is equivalent to that of the corresponding iid innovations. The idea behind this lemma plays a very important role in the derivation of asymptotic results in nonparametric regression with dependent errors: it shows that discussions of asymptotic results in this case may often be reduced to those for models with iid errors after a suitable transformation. Note that Lemma 1 only holds for causal processes.
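The content of Lemma 1 can be mimicked on a finite sample: filtering $Y$ with truncated AR$(\infty)$ weights yields a series whose likelihood factorizes. The sketch below (ours, under a Gaussian assumption, reusing the hypothetical `ma_weights` and `farima0d0` from Section 2) is necessarily approximate near the start of the sample, since the lemma itself assumes the infinite past:

```python
import numpy as np

def whiten(y, b):
    """Finite-sample analogue of x = Phi y in Lemma 1:
    x_i = sum_{j=0}^{i} b_j y_{i-j}, a causal lower-triangular filter with
    b_0 = 1 (unit Jacobian).  Truncating the infinite past distorts only
    the first few x_i."""
    return np.convolve(y, b)[: len(y)]

def loglik_iid_gauss(x, sigma2=1.0):
    """log L_0(x) = sum_i log f(x_i) for N(0, sigma2) innovations."""
    return -0.5 * np.sum(x ** 2 / sigma2 + np.log(2.0 * np.pi * sigma2))

delta = 0.2
y = farima0d0(1000, delta, rng=0)        # fractional noise, alpha = 0.4
x = whiten(y, ma_weights(-delta, 2000))  # AR weights: delta -> -delta
print(loglik_iid_gauss(x))               # likelihood of the whitened series
```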
Let $L_1$ and $\bar L_1$ denote the likelihood functions of $X_1 = \varepsilon + \zeta$ and $Y_1 = \xi + g$, respectively. To prove Theorem 1 we need to estimate $P(\bar L_0 < \bar L_1 \mid \eta = 0)$ and $P(\bar L_0 > \bar L_1 \mid \eta = 1)$. The following corollary of Lemma 1 reduces the estimation of these error probabilities to that for the independent sequences $X_\eta$.
Corollary 1. Under the assumptions of Lemma 1, we have

$P(\bar L_0(y) < \bar L_1(y) \mid \eta = 0) = P(L_0(x) < L_1(x) \mid \eta = 0)$

and

$P(\bar L_0(y) > \bar L_1(y) \mid \eta = 1) = P(L_0(x) > L_1(x) \mid \eta = 1),$

where $x = \Phi y$.
The proof of Corollary 1 is given in the appendix. Following Corollary 1, a method for estimating the error probability developed for nonparametric regression with iid errors can be adapted to the current case. In this paper we will use the methodology proposed by Stone (1980). Note that $\zeta$, the deterministic part of $X_1$, does not necessarily have the same smoothness properties as $g$, the deterministic part of $Y_1$. However, this does not affect the estimation of the error probability.
4.3 A sufficient condition
Let $\delta_n = \frac{1}{2}g_1(0) = \frac{1}{2}a\psi_0(0)h^k = c_0 h^k$, where $c_0 = \frac{1}{2}a\psi_0(0)$. Let $\delta_{\nu,n} = c_\nu h^{k-\nu}$, where $c_\nu = \frac{1}{2}a\psi_0^{(\nu)}(0)$ for $\nu < k$. If the $\xi_i$ in model (1.1) are iid then, following Stone (1980), it can be shown that a sufficient condition under which $\delta_{\nu,n}$ is a lower rate of convergence for estimating $g^{(\nu)}$ is that there is an $M > 0$ such that $\sum g_1^2(i/n) < M$ (see equation (2.1) in Stone, 1980). The following lemma gives a simple extension of this result to the case when the $\xi_i$ are fractional stationary time series errors defined by (2.3) and (2.4).
Lemma 2. Let $\xi_i$ be defined by (2.3) and (2.4). Consider the estimation of $g^{(\nu)}$. Then $\delta_{\nu,n}$ is a lower rate of convergence if there is an $M > 0$ such that

(4.4)    $\sum_{i=-\infty}^{\infty} \zeta_i^2 = g'\,\Phi'\Phi\, g < M,$

where the $\zeta_i$ are the elements of $\zeta = \Phi g$.
The proof of Lemma 2 is given in the appendix. Note that $g = g_1 - g_0$ (recall $g_0 \equiv 0$), i.e. $g$ is the difference sequence between the two functions $g_0$ and $g_1$. Lemma 2 requires that this sequence, after transformation by $\Phi$, is square summable. From Lemma 2 we can also see that, if $\delta_n$ is a lower rate of convergence for estimating $g$, then $\delta_{\nu,n}$, the corresponding sequence for the $\nu$-th derivative, is a lower rate of convergence for estimating $g^{(\nu)}$, provided $\psi_0^{(\nu)}(0) > 0$.
It is easy to show that condition (4.4) is equivalent to

(4.5)    $g'\,\Gamma^I g < \sigma_\varepsilon^2 M$

and further equivalent to

(4.6)    $g'\,\Gamma^{-1} g < \sigma_\varepsilon^{-2} M.$

Proofs of (4.5) and (4.6) are given in the appendix. These two representations are easy to understand, and equation (4.6) directly shows the change in this sufficient condition caused by the dependence structure. The following remarks clarify the above results.
Remark 7. For iid errors $\xi_i = \varepsilon_i$ we have $\Phi = I$, $\Gamma = \sigma_\varepsilon^2 I$ and $\Gamma^{-1} = \sigma_\varepsilon^{-2} I$, where $I$ denotes the doubly infinite identity matrix. In this case we simply have $\sum g_1^2(i/n) < M$. Note that $D = \sqrt{\sum g_1^2(i/n)}$ is the $L_2$-norm of $g$. Lemma 2 implies that any method of deciding between $\eta = 0$ and $\eta = 1$, i.e. of deciding between the vector $g$ and the zero vector, must have overall positive error probability if the norm of $g$ is bounded.
Remark 8. Assume that the $\varepsilon_i$ are normal. Following Hall and Hart (1990a) it can be shown that the overall error probability of any estimator of $\eta$ based on $Y_\eta$ is at least

(4.7)    $P_a = 1 - \Phi\bigl((g'\,\Gamma^{-1} g)^{1/2}\bigr),$

where $\Phi(\cdot)$ here denotes the standard normal distribution function. The error probability $P_a$ will be positive if $g'\Gamma^{-1}g$ is finite. $P_a$ in (4.7) can be made arbitrarily close to $\frac{1}{2}$ by choosing the constant $a$ in (4.2) so that $a \to 0$ and hence $g'\Gamma^{-1}g \to 0$.
5 Acknowledgements
This work was finished under the advice of Prof. Jan Beran, Chair of the Department of Mathematics and Statistics, University of Konstanz, Germany, and was financially supported by the Center of Finance and Econometrics (CoFE) at the University of Konstanz. The author gratefully acknowledges Prof. Jan Beran's useful advice and comments, which led to improvements in the quality of this paper.
Appendix: Proofs

Proof of Lemma 1. It is well known that, under common conditions, the likelihood functions of two random vectors forming a reciprocal one-to-one mapping are uniquely determined by each other (see e.g. Theorem 2 of Section 4.4 in Rohatgi and Saleh, 2001, p. 127). Note that this result can be extended to doubly infinite random vectors. For the proof of Lemma 1 it remains to check that all conditions of this theorem hold. At first, $\varepsilon = \Phi\xi$ forms a doubly infinite dimensional reciprocal one-to-one mapping with the inverse transformation $\xi = \Psi\varepsilon$, where both the original and the inverse transformation are linear. Hence, conditions (a) to (c) of Theorem 2 of Section 4.4 in Rohatgi and Saleh (2001) hold. Furthermore, $\Phi$ is also the matrix of the partial derivatives of $\varepsilon$ with respect to $\xi$, and the Jacobian $J$ of the inverse transformation is the determinant $|\Psi| = 1$, since $\Psi$ is a (doubly infinite) lower triangular matrix whose diagonal elements are identically one. Hence the relationship between $L_0$ and $\bar L_0$ as given in Lemma 1 holds. □
Proof of Corollary 1. Observe that $X_1 = X_0 + \zeta$ and $Y_1 = Y_0 + g$. Hence we have $L_1(x) = L_0(x - \zeta)$ and $\bar L_1(y) = \bar L_0(y - g)$. It follows from Lemma 1 that, for any doubly infinite real vectors $y$ and $g$,

(A.1)    $\bar L_1(y) = \bar L_0(y - g) = L_0(x - \zeta) = L_1(x) = \prod_{i=-\infty}^{\infty} f(x_i - \zeta_i),$

where $x = \Phi y$, $\zeta = \Phi g$ and $f$ is the marginal density function of $\varepsilon_i$. Equations (4.3) and (A.1) together show that $\bar L_0(y) < \bar L_1(y)$ (or $\bar L_0(y) > \bar L_1(y)$, or $\bar L_0(y) = \bar L_1(y)$) if and only if $L_0(x) < L_1(x)$ (or $L_0(x) > L_1(x)$, or $L_0(x) = L_1(x)$), where $x = \Phi y$. Corollary 1 follows from this fact. □
The proofs given in the following are related to those in Stone (1980) and Hall and Hart (1990a); hence some details will be omitted to save space, and we refer the reader to the proofs in those works. We also refer the reader to Theorem 1 in Hall (1989) and its proof. Note that the symbol $\alpha$ in this paper is defined differently from the $\alpha$ used in Hall and Hart (1990a).
Proof of Lemma 2. Let $\delta_{\nu,n}$ be as defined in Lemma 2. Note that

$\sup_{g\in C^k} P_g\{|\tilde g_n^{(\nu)}(0) - g^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\}.$

Let $\tilde\eta_n = 0$ or $1$ minimize $|\tilde g_n^{(\nu)}(0) - g_{\tilde\eta}^{(\nu)}(0)|$. Then $\tilde\eta_n \ne \eta$ implies $|\tilde g_n^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}$, and hence

(A.2)    $\max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \max_{\eta=0,1} P_\eta(\tilde\eta \ne \eta) \ge \tfrac{1}{2}\{P_0(\tilde\eta = 1) + P_1(\tilde\eta = 0)\} \ge \tfrac{1}{2}\{P_0(\hat\eta = 1) + P_1(\hat\eta = 0)\},$

where $\hat\eta$ is the maximum likelihood estimator of $\eta$ (or the likelihood ratio discriminator) in the two-parameter problem. The last inequality follows from the Neyman–Pearson lemma. From Corollary 1 we have

(A.3)    $\max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \tfrac{1}{2}\bigl(P_0(\bar L_0 < \bar L_1) + P_1(\bar L_1 < \bar L_0)\bigr) = \tfrac{1}{2}\bigl(P_0(L_0 < L_1) + P_1(L_1 < L_0)\bigr).$
Let $L_R$ denote the likelihood ratio $L_1/L_0$. By calculations similar to those given on pages 1352–1353 of Stone (1980) it can be shown, under the regularity conditions on the marginal distribution of $\varepsilon_i$ given in Section 3.1, that there is a positive constant $M_1$ such that

(A.4)    $E_0|\log(L_R)| < M_1$

and

(A.5)    $\lim_{a\to 0} E_0|\log(L_R)| = 0.$

Formulas similar to (A.4) and (A.5) hold for the expectation under $\eta = 1$ with another positive constant $M_2$. Let $M_0 = \max(M_1, M_2)$. Then we can find an integer $K \ge 2$ and $0 < \pi < \tfrac{1}{2}$ such that, if $L_R > (1-\pi)/\pi$ or $L_R < \pi/(1-\pi)$, then $|\log(L_R)| \ge K M_0$. Following the Markov inequality,

$P_0\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}, \qquad P_1\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}.$

Put prior probabilities $\tfrac{1}{2}$ each on $\eta = 0$ and $\eta = 1$. Then

$P(\eta = 1 \mid Y) = \frac{\tfrac{1}{2}\bar L_1}{\tfrac{1}{2}\bar L_1 + \tfrac{1}{2}\bar L_0} = \frac{L_R}{L_R + 1}$

and

$P\bigl(\pi \le P(\eta = 1 \mid Y) \le 1 - \pi\bigr) = P\Bigl(\pi \le \frac{L_R}{L_R+1} \le 1-\pi\Bigr) = \tfrac{1}{2} P_0\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) + \tfrac{1}{2} P_1\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}.$

That is, the error probability of $\hat\eta$ is at least $\pi(K-1)/K$. Note that $\pi(K-1)/K$ can be made arbitrarily close to $\tfrac{1}{2}$ as $a \to 0$, by choosing $K$ sufficiently large and $\pi$ sufficiently close to $\tfrac{1}{2}$ at the same time. □
Proof of equations (4.5) and (4.6). The matrix $\Phi$ is given by

(A.6)    $\Phi = \begin{pmatrix} \ddots & \ddots & \ddots & \ddots & \ddots & & \\ \cdots & 1 & 0 & 0 & \cdots & 0 & 0 & \cdots \\ \cdots & b_1 & 1 & 0 & \cdots & 0 & 0 & \cdots \\ \cdots & b_2 & b_1 & 1 & \cdots & 0 & 0 & \cdots \\ & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \\ \cdots & b_{n-1} & b_{n-2} & b_{n-3} & \cdots & 1 & 0 & \cdots \\ \cdots & b_n & b_{n-1} & b_{n-2} & \cdots & b_1 & 1 & \cdots \\ & & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \end{pmatrix}.$

Following the definition of $\gamma^I(i-j)$ we have $\Gamma^I = \sigma_\varepsilon^2\,\Phi'\Phi$. Furthermore, it can be shown that $\Phi'\Phi = \Phi\Phi'$. The equivalence between (4.4) and (4.5) follows from this fact. The equivalence between the two conditions (4.5) and (4.6) is due to the fact that $\Gamma^{-1} = \sigma_\varepsilon^{-4}\,\Gamma^I$ in the sense that $\Gamma\,\Gamma^I\,\sigma_\varepsilon^{-4} = I$ (see e.g. Shaman, 1975, and Beran, 1994, pp. 109 ff.). □
Proof of Theorem 1. Without loss of generality we will assume $\sigma_\varepsilon^2 = 1$ for convenience. For $\nu = 0$ let $\delta_n = c_0 h^k$ be equal to the rate $c_0 n^{-r_0}$, where $r_0 = (1-\alpha)k/(2k+1-\alpha)$ is as defined in Theorem 1. Then we have $h = n^{-s}$ with $s = (1-\alpha)/(2k+1-\alpha)$. Following Lemma 2, we have to show that the sequence $g$ under this choice of $h$ satisfies the condition $\sum \zeta_i^2 = g'\Phi'\Phi g < \infty$, in order that $c_\nu n^{-r_\nu}$ is a lower rate of convergence for estimating $g^{(\nu)}$.
Let $m = nh$ and let $\theta = (\theta_i)$ with $\theta_i = \psi_0(i/m)$ denote the corresponding doubly infinite vector, so that $g = a h^k \theta$. Then we have

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, \theta'\Phi'\Phi\theta.$

Observe that $\theta_i = 0$ for $i < -m$ or $i > m$. We have

(A.7)    $\theta'\Phi'\Phi\theta = \sum_{j=-\infty}^{\infty} \Bigl(\sum_{i=-m}^{m} \theta_i b_{i+j}\Bigr)^2 = (2m+1) \sum_{k=-2m}^{2m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1} \sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr].$
Equation (A.7) can also be obtained by directly analysing $\Phi'\Phi$. Based on (A.7) we can obtain results for the cases $\alpha = 0$, $0 < \alpha < 1$ and $-1 < \alpha < 0$ separately. Note that the methodology used in the proof of Theorem 3.1 in Hall and Hart (1990a) for the case with $0 < \alpha < 1$ is based on the assumption $b_i = b_{-i}$ for $i = 1, 2, \ldots$, and is hence not suitable for the causal error process considered in this paper, since now we have $b_i = 0$ for $i < 0$. The methodology used in the following is instead based on the property (1.2) of a fractional time series, which does not involve the exact structure of the $b_i$.
Assume that $\alpha = 0$. Note that in this case $\sum \gamma^I(k) > 0$ and $\sum |\gamma^I(k)| < \infty$. From (A.7) we have

$\theta'\Phi'\Phi\theta \doteq (2m+1) \sum \gamma^I(k) \int_{-1}^{1} \psi_0^2(u)\,du,$

where $\doteq$ denotes asymptotic equality of both sides. Note that $h = n^{-1/(2k+1)}$ and $m = nh = n^{2k/(2k+1)} = h^{-2k}$ for $\alpha = 0$, whence

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, \theta'\Phi'\Phi\theta < \infty.$
In the case with $0 < \alpha < 1$ the inverse process $\xi^I$ is an antipersistent process with parameter $-1 < \alpha_I = -\alpha < 0$ in (2.7), and hence for $|k|$ sufficiently large we have $\gamma^I(k) \sim c_\gamma^I |k|^{-\alpha-1}$, where $c_\gamma^I = 2 c_f^I \Gamma(1-\alpha_I)\sin(\pi\alpha_I/2) < 0$ (see Beran, 1994, and Beran and Feng, 2001a), which implies that the $\gamma^I(k)$ are ultimately negative for $|k|$ sufficiently large. Furthermore, we have $\sum \gamma^I(k) = 0$ and hence $\sum_{k=-m}^{m} \gamma^I(k) = -2\sum_{k>m} \gamma^I(k) = O(m^{-\alpha})$.
Thus

$\theta'\Phi'\Phi\theta = (2m+1) \sum_{k=-2m}^{2m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$\le (2m+1) \sum_{k=-m}^{m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$= (2m+1)\, O\Bigl(\sum_{k=-m}^{m} \gamma^I(k)\Bigr) = O(m^{1-\alpha}).$

Now we have $h = n^{-(1-\alpha)/(2k+1-\alpha)}$ and $m = nh = n^{2k/(2k+1-\alpha)}$. This results in $m^{1-\alpha} = h^{-2k}$, so that

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, O(h^{-2k}) < \infty.$
If $-1 < \alpha < 0$, the inverse process $\xi^I$ is a long-memory process with parameter $0 < \alpha_I = -\alpha < 1$ in (2.7) and hence, for $|k|$ sufficiently large, $\gamma^I(k) \sim c_\gamma^I |k|^{-\alpha-1}$, where $c_\gamma^I = 2 c_f^I \Gamma(1-\alpha_I)\sin(\pi\alpha_I/2) > 0$, so that $\gamma^I(k) > 0$ for $|k|$ sufficiently large. Furthermore, we have $\sum \gamma^I(k) = \infty$ with $\sum_{k=-2m}^{2m} \gamma^I(k) = O(m^{-\alpha})$. Note that $\psi_0$ can be chosen so that, for large $k$, $\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\} < \sum_{j=-m}^{m} \psi_0^2(j/m)$. Hence we have
$\theta'\Phi'\Phi\theta = (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m}\psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$\le (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m}\psi_0^2(j/m)\Bigr]$
$\doteq (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\int_{-1}^{1}\psi_0^2(u)\,du = O(m^{1-\alpha}).$
In fact, we have

$\theta'\Phi'\Phi\theta = O(m^{1-\alpha})$

uniformly for $\alpha \in (-1,1)$; however, the derivation of this result is a little different in the three cases. Now, note that $h = n^{-(1-\alpha)/(2k+1-\alpha)}$, whence, as before, $m^{1-\alpha} = h^{-2k}$, so that

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, O(h^{-2k}) < \infty.$

Theorem 1 is proved. □
References

Beran, J. (1994). Statistics for Long-Memory Processes. New York: Chapman & Hall.

Beran, J. (1999). SEMIFAR models: A semiparametric framework for modelling trends, long range dependence and nonstationarity. Discussion Paper No. 99/16, Center of Finance and Econometrics, University of Konstanz.

Beran, J. and Feng, Y. (2001a). Local polynomial fitting with long-memory, short-memory and antipersistent errors. To appear in Annals of the Institute of Statistical Mathematics.

Beran, J. and Feng, Y. (2001b). Local polynomial estimation with a FARIMA-GARCH error process. To appear in Bernoulli.

Chatfield, C. (1979). Inverse autocorrelations. J. R. Statist. Soc., Ser. A, 142, 363–377.

Cleveland, W.S. (1972). The inverse autocorrelations of a time series and their applications (with discussion). Technometrics, 14, 277–298.

Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829–836.

Farrell, R.H. (1972). On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. Ann. Math. Statist., 43, 170–180.

Granger, C.W.J. and Joyeux, R. (1980). An introduction to long-range time series models and fractional differencing. J. Time Ser. Anal., 1, 15–30.

Hall, P. (1989). On convergence rates in nonparametric problems. Intern. Statist. Review, 57, 45–58.

Hall, P. and Hart, J.D. (1990a). Nonparametric regression with long-range dependence. Stochastic Process. Appl., 36, 339–351.

Hall, P. and Hart, J.D. (1990b). Convergence rates in density estimation for data from infinite-order moving average processes. Probab. Theory Rel. Fields, 87, 253–274.

Härdle, W., Hall, P. and Marron, J.S. (1992). Regression smoothing parameters that are not far from their optimum. J. Amer. Statist. Assoc., 87, 227–233.

Hosking, J.R.M. (1981). Fractional differencing. Biometrika, 68, 165–176.

Rohatgi, V.K. and Saleh, A.K.Md.E. (2001). An Introduction to Probability and Statistics. 2nd ed. New York: Wiley.

Shaman, P. (1975). An approximate inverse for the covariance matrix of moving average and autoregressive processes. Ann. Statist., 3, 532–538.

Stone, C.J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist., 5, 595–620.

Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist., 8, 1348–1360.

Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, 1040–1053.