Regression with Fractional Time Series Errors
Yuanhua Feng
Department of Mathematics and Statistics
University of Konstanz
Summary. Consider the estimation of $g^{(\nu)}$, the $\nu$-th derivative of the mean function, in a fixed design nonparametric regression model with a linear, invertible, stationary time series error process $\xi_i$. Assume that $g \in C^k$ and that the spectral density of $\xi_i$ has the form $f(\lambda) \sim c_f|\lambda|^{-\alpha}$ as $\lambda \to 0$ with constants $c_f > 0$ and $\alpha \in (-1,1)$. Let $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$. It is shown that the optimal convergence rate for $\hat g^{(\nu)}$ is $n^{-r_\nu}$. This rate is achieved by local polynomial fitting. It is also shown that the required regularity conditions on the innovation distribution in the current context are the same as those in nonparametric regression with iid errors.

Keywords: Nonparametric regression, optimal convergence rate, long memory, antipersistence, inverse process.
1 Introduction
Consider the estimation of the $\nu$-th derivative of the mean function, $g^{(\nu)}$, in the equidistant design nonparametric regression model

(1.1)    $Y_i = g(x_i) + \xi_i$,

where $x_i = i/n$, $g : [0,1] \to \mathbb{R}$ is a smooth function and $\xi_i$ is a linear, (second order and strictly) stationary process generated by an iid (independent, identically distributed) innovation series $\varepsilon_i$ through a linear filter. For the autocovariance function $\gamma(k) = \mathrm{cov}(\xi_i, \xi_{i+k})$ it is assumed that $\gamma(k) \to 0$ as $|k| \to \infty$. Equation (1.1) represents a nonparametric regression model with short memory (including iid $\xi_i$ as a special case), long memory and antipersistence. Here, a stationary process $\xi_i$ is said to have long memory (or long-range dependence) if $\sum \gamma(k) = \infty$. A stronger assumption is that the spectral density $f(\lambda) = (2\pi)^{-1} \sum \gamma(k)\exp(-ik\lambda)$ has a pole at the origin of the form

(1.2)    $f(\lambda) \sim c_f|\lambda|^{-\alpha}$  (as $\lambda \to 0$)

for some $\alpha \in (0,1)$, where $c_f > 0$ is a constant and `$\sim$' means that the ratio of the left and the right hand sides converges to one (see Beran, 1994, and references therein). Note that (1.2) implies $\gamma(k) \sim c_\gamma|k|^{\alpha-1}$, so that $\sum\gamma(k) = \infty$; hence $\xi_i$ then has long memory. If (1.2) holds with $\alpha = 0$, then we have $0 < \sum\gamma(k) < \infty$ and $\xi_i$ is said to have short memory. On the other hand, a stationary process is said to be antipersistent if (1.2) holds for $\alpha \in (-1,0)$, implying that $\sum\gamma(k) = 0$.
The aim of this paper is to investigate the minimax optimal convergence rate of a nonparametric estimator of $g^{(\nu)}$ (see e.g. Farrell, 1972, Stone, 1980, 1982, and Hall and Hart, 1990a, for related work). For a summary of the nonparametric minimax theory we refer the reader to Hall (1989). Hall and Hart (1990a) obtained the optimal convergence rate for estimating $g$ in nonparametric regression with Gaussian stationary short- and long-memory errors. In this paper a unified formula for the optimal convergence rate for estimating $g^{(\nu)}$ in nonparametric regression with short-memory, long-memory and antipersistent errors is given. It is shown that this rate is achieved by local polynomial fitting (Beran and Feng, 2001a). Our finding generalizes previous results of Stone (1980) and Hall and Hart (1990a) in various ways. A simple condition under which a sequence $n^{-r_\nu}$ forms a lower bound to the convergence rate is given for nonparametric regression with stationary time series errors at any dependence level. Results in this paper are given for Gaussian and non-Gaussian error processes satisfying some regularity conditions.
The estimator and the error process are defined in Section 2. Section 3 describes the conditions on the distribution and provides the main results. It turns out that the required regularity conditions on the marginal innovation distribution are the same for all $\alpha \in (-1,1)$ and hence do not depend on the dependence structure. Some auxiliary results, which can be thought of as part of the proofs, are given in Section 4. Detailed proofs are put in the appendix.
2 The estimator and the error process

2.1 The local polynomial fitting
A kernel estimator of $g$ in nonparametric regression with short-memory and long-memory errors was proposed by Hall and Hart (1990a). Beran (1999) extended the kernel estimator to nonparametric regression with antipersistence. However, it is well known that the kernel estimator is affected by the boundary problem. Another attractive nonparametric approach is local polynomial fitting, introduced by Stone (1977) and Cleveland (1979). Beran and Feng (2001a) proposed local polynomial fitting in nonparametric regression with short-memory, long-memory and antipersistent errors. In this paper we will use the proposal in Beran and Feng (2001a) to show the achievability of the optimal convergence rate.
Let $k \ge 2$ be a positive integer. The function class considered in this paper is $C^k(B)$, the collection of all $k$ times differentiable functions $g$ on $[0,1]$ which satisfy

$\sup_{0 \le x \le 1}\ \max_{\nu=0,1,\ldots,k} |g^{(\nu)}(x)| \le B.$
Let $p = k - 1$. Then $g$ can be locally approximated by a polynomial of order $p$ for $x$ in the neighbourhood of a point $x_0$:

(2.1)    $g(x) = g(x_0) + g'(x_0)(x - x_0) + \ldots + g^{(p)}(x_0)(x - x_0)^p/p! + R_p,$
where $R_p$ is a remainder term. Let $K$ be a second order kernel (a symmetric density) having compact support $[-1,1]$. Given $n$ observations $Y_1, \ldots, Y_n$, we can obtain an estimator of $g^{(\nu)}$ ($\nu \le p$) by solving the locally weighted least squares problem
(2.2)    $Q = \sum_{i=1}^{n} \Bigl\{Y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j\Bigr\}^2 K\Bigl(\frac{x_i - x_0}{h}\Bigr) \to \min,$
where $h$ is the bandwidth. Let $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)'$ be the solution of (2.2). Then it is clear from (2.1) that $\hat g^{(\nu)}(x_0) := \nu!\,\hat\beta_\nu$ estimates $g^{(\nu)}(x_0)$, $\nu = 0, 1, \ldots, p$; this is the local polynomial estimator of $g^{(\nu)}$. Note in particular that $\hat g^{(\nu)}$ is the same for nonparametric regression with stationary time series errors at any dependence level.
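To make (2.2) concrete, the following minimal sketch (our own illustration in Python, not part of the original development; the Epanechnikov kernel and all names are illustrative choices) computes $\hat g^{(\nu)}(x_0) = \nu!\,\hat\beta_\nu$ by weighted least squares.

```python
import math
import numpy as np

def local_poly_deriv(y, x0, h, nu=0, p=None):
    """Sketch of (2.2): estimate g^(nu)(x0) from equidistant data
    Y_i = g(i/n) + xi_i by locally weighted least squares of order p.
    K(u) = 0.75 (1 - u^2) on [-1, 1] (Epanechnikov) is one admissible
    second-order kernel."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    if p is None:
        p = nu + 1  # a common choice; the paper uses p = k - 1
    # Design matrix with columns (x_i - x0)^j, j = 0, ..., p
    X = np.vander(x - x0, p + 1, increasing=True)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return math.factorial(nu) * beta[nu]  # ghat^(nu)(x0) = nu! * betahat_nu
```

As emphasized above, the code never references $\alpha$: the estimator is identical at any dependence level; only its risk and the optimal bandwidth depend on the dependence structure.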
2.2 The fractional error process

In this paper it is assumed that the spectral density of $\xi_i$ has the form (1.2). Hence $\xi_i$ will be called a fractional time series error process. $\xi_i$ is also assumed to be causal, linear and invertible. That is, $\xi_i$ can be expressed in two ways:
(2.3)    $\xi_i = \psi(B)\varepsilon_i$

and

(2.4)    $\varepsilon_i = \varphi(B)\xi_i,$
where the innovations $\varepsilon_i$ are iid mean zero random variables with $\mathrm{var}(\varepsilon_i) = \sigma_\varepsilon^2 < \infty$, $B$ is the backshift operator, and $\psi(B) = \sum_{j=0}^{\infty} a_j B^j$ and $\varphi(B) = \sum_{j=0}^{\infty} b_j B^j$ are the characteristic polynomials of the MA and AR representations of $\xi_i$, respectively, with $a_0 = b_0 = 1$, $\sum a_j^2 < \infty$ and $\sum b_j^2 < \infty$. The causality assumption on $\xi_i$ is made here for convenience.
Some properties of $\xi_i$ can be understood more easily by means of its inverse process. Following Chatfield (1979), the inverse process of $\xi_i$, denoted by $\xi_i^I$, is the process with the same innovations $\varepsilon_i$ and with $\varphi(B)$ resp. $\psi(B)$ as the characteristic polynomials of its MA resp. AR representation, i.e.
(2.5)    $\xi_i^I = \varphi(B)\varepsilon_i$

and

(2.6)    $\varepsilon_i = \psi(B)\xi_i^I.$
Following Shaman (1975), the spectral density of $\xi_i^I$, $f_I(\lambda)$ say, is

(2.7)    $f_I(\lambda) = \sigma_\varepsilon^4(2\pi)^{-2}(f(\lambda))^{-1} \sim c_f^I|\lambda|^{-\alpha_I}$  (as $\lambda \to 0$),
where $c_f^I = \sigma_\varepsilon^4(2\pi)^{-2}(c_f)^{-1}$ and $\alpha_I = -\alpha$. Equation (2.7) implies: 1. if $\xi_i$ is a short-memory process, so is $\xi_i^I$ (in particular, the inverse process of an iid process is the process itself); 2. if $\xi_i$ is a long-memory process with $0 < \alpha < 1$, then $\xi^I$ is an antipersistent process with $\alpha_I = -\alpha$, and vice versa.
From (2.3) we see that the autocovariances of $\xi_i$ are $\gamma(k) = \sigma_\varepsilon^2 \sum a_j a_{j+|k|}$. The inverse autocovariances of $\xi_i$ (Cleveland, 1972, and Chatfield, 1979), i.e. the autocovariances of $\xi_i^I$, are given by $\gamma^I(k) = \sigma_\varepsilon^2 \sum_{j=0}^{\infty} b_j b_{j+|k|}$. Hence we have $\sum\gamma(k) = \sigma_\varepsilon^2(\sum a_j)^2$ and $\sum\gamma^I(k) = \sigma_\varepsilon^2(\sum b_j)^2$. This results in $\sum a_j = \infty$, $\sum b_j = 0$ for $\alpha > 0$ and $\sum a_j = 0$, $\sum b_j = \infty$ for $\alpha < 0$. For $\alpha = 0$ we have both $0 < \sum a_j < \infty$ and $0 < \sum b_j < \infty$.
A class of processes having the property (1.2) is the class of FARIMA$(p, \delta, q)$ (fractional ARIMA) processes (Granger and Joyeux, 1980, and Hosking, 1981), where $\delta \in (-0.5, 0.5)$ is the fractional differencing parameter. It is well known that the spectral density of a FARIMA process has the form (1.2) with $\alpha = 2\delta$.
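To illustrate these relations, the following sketch (ours; the binomial-coefficient recursion for the weights of $(1-B)^{\mp\delta}$ is standard, while all function names are assumptions) computes truncated MA and AR weights of a FARIMA$(0,\delta,0)$ process, checks the behaviour of their partial sums, and simulates a sample path.

```python
import numpy as np

def ma_weights(delta, J):
    """MA(inf) weights a_j of (1 - B)^(-delta), i.e. of FARIMA(0, delta, 0):
    a_j = Gamma(j + delta) / (Gamma(j + 1) Gamma(delta)), via the stable
    recursion a_j = a_{j-1} (j - 1 + delta) / j with a_0 = 1.  The AR
    weights b_j follow by replacing delta with -delta."""
    a = np.empty(J)
    a[0] = 1.0
    for j in range(1, J):
        a[j] = a[j - 1] * (j - 1 + delta) / j
    return a

def farima0d0(n, delta, J=5000, rng=None):
    """Simulate xi_i = sum_{j < J} a_j eps_{i-j} (truncated MA representation);
    the spectral density then behaves like |lambda|^(-2 delta) near 0."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(n + J - 1)
    return np.convolve(eps, ma_weights(delta, J), mode="valid")

delta = 0.2                                # alpha = 2 delta = 0.4: long memory
print(ma_weights(delta, 10 ** 5).sum())    # partial sums of a_j diverge slowly
print(ma_weights(-delta, 10 ** 5).sum())   # partial sums of b_j tend to 0
```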
3 Optimal convergence rates
3.1 Assumptions on the innovation distribution
An important finding of this paper is that the derivation showing that a given sequence is a lower bound to the convergence rate in nonparametric regression with error process $\xi_i$ is similar to that for nonparametric regression with the iid errors $\varepsilon_i$. Furthermore, it turns out that the required conditions on the marginal distribution of $\varepsilon_i$ under model (1.1) are the same for any $\alpha \in (-1,1)$, i.e. they do not depend on the dependence level. In the following we adapt the regularity conditions in Stone (1980, 1982) to fixed design nonparametric regression. Assume that $Z(g)$ is a real random variable depending on $g \in \mathbb{R}$. It is assumed that the density function $f(z;g)$ is strictly positive and that $f(z;g) = f(z-g;0)$, where $g$ is the mean of $Z(g)$, i.e.

$\int z f(z;g)\,dz = g$

for all $g \in \mathbb{R}$. It is further assumed that the equation

$\int f(z;g)\,dz = 1$

can be twice continuously differentiated with respect to $g$ to yield

$\int f'(z;g)\,dz = 0$

and

$\int f''(z;g)\,dz = 0.$
Here $f(z;0)$ is the marginal density of the innovations $\varepsilon_i$, which will be simply denoted by $f(z)$ in the following. Using this notation the density of $Z(g)$ may be represented as $f(z;g) = f(z-g)$. Set $l(z;g) = \log f(z;g)$. It is assumed that there are positive constants $\delta_0$ and $C$ and that there is a function $M(z;g)$ such that, for $g \in \mathbb{R}$,

$|l''(z; g+\Delta)| \le M(z;g)$ for $|\Delta| \le \delta_0$

and

$\int M(z;g) f(z;g)\,dz \le C.$

Note that the last condition is fulfilled if $l''(z;g)$ is bounded.
Remark 1. It is easy to show that all of these conditions are fulfilled if $Z(g)$ is Gaussian with

$f(z;g) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}\, e^{-\frac{(z-g)^2}{2\sigma_\varepsilon^2}}, \quad -\infty < z, g < \infty.$
It is also not hard to show that these conditions are fulfilled if the marginal distribution of $\varepsilon_i$ is the Student $t_m$ distribution with $m \ge 3$, i.e. if $f(z;g)$ is given by

$f_m(z;g) = \frac{\Gamma[(m+1)/2]}{\Gamma(m/2)\sqrt{m\pi}}\Bigl(1 + \frac{(z-g)^2}{m}\Bigr)^{-(m+1)/2}, \quad -\infty < z, g < \infty.$
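Since the integral condition above is fulfilled whenever $l''$ is bounded, both cases of Remark 1 can be checked in closed form. A small numerical sketch (ours):

```python
import numpy as np

def l2_gauss(u, sigma2=1.0):
    """l''(z; g) for the Gaussian case at u = z - g: the constant -1/sigma2."""
    return -np.ones_like(u) / sigma2

def l2_t(u, m):
    """l''(z; g) for the t_m case at u = z - g: with
    l(u) = c - ((m + 1)/2) log(1 + u^2/m) one gets
    l''(u) = -(m + 1)(m - u^2) / (m + u^2)^2, bounded by (m + 1)/m."""
    return -(m + 1) * (m - u ** 2) / (m + u ** 2) ** 2

u = np.linspace(-100.0, 100.0, 200001)
print(np.abs(l2_gauss(u)).max())   # 1.0
print(np.abs(l2_t(u, m=3)).max())  # (3 + 1)/3 = 1.333...
```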
Remark 2. Observe, however, that other distributions considered by Stone (1980), e.g. the exponential distribution, do not satisfy the regularity conditions given above. If the $\varepsilon_i$ are iid exponentially distributed with $E(\varepsilon_i) = 0$ and $\mathrm{var}(\varepsilon_i) = \theta^2$, then the density function of $Z(g)$ is given by

$f(z;g) = \frac{1}{\theta}\, e^{-(z+\theta-g)/\theta}, \quad -\infty < g < \infty \text{ and } g - \theta \le z < \infty,$

and zero otherwise. The support of $f > 0$ for this distribution depends on $g$.
3.2 Lower bounds to convergence rates
For the minimax optimal convergence rate we will use the following definition (see e.g. Farrell, 1972, Stone, 1980, and Hall and Hart, 1990a). Let $\nu < k$ be a nonnegative integer and let $\tilde g_n^{(\nu)}$ denote a generic nonparametric estimator of $g^{(\nu)}$ based on $(Y_1, \ldots, Y_n)$. Let $r_\nu$ be a positive number. The sequence $n^{-r_\nu}$ is called a lower bound to the convergence rate at $x_0$ if

(3.1)    $\liminf_n\ \sup_{g \in C^k} P\bigl(|\tilde g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) > 0$

for $c$ sufficiently small. $n^{-r_\nu}$ is called an achievable convergence rate if there is a sequence of estimators $\hat g_n^{(\nu)}$ such that

(3.2)    $\lim_{c\to\infty}\ \limsup_n\ \sup_{g \in C^k} P\bigl(|\hat g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) = 0.$
Also, the sequence $n^{-r_\nu}$ is called the optimal convergence rate if it is an achievable lower bound to the convergence rate. The optimal convergence rate for a nonparametric regression estimator of $g^{(\nu)}$ with iid errors is $n^{-(k-\nu)/(2k+1)}$ (Stone, 1980). In fact, $n^{-(k-\nu)/(2k+1)}$ is also the optimal convergence rate for estimating $g^{(\nu)}$ in nonparametric regression with short-memory errors (results for $\alpha = 0$ may be found in Hall and Hart, 1990a). In the case with $0 < \alpha < 1$, Hall and Hart (1990a) showed that the optimal convergence rate for estimating $g$ is $n^{-(1-\alpha)k/(2k+1-\alpha)}$. In this paper we will show that $n^{-r_\nu}$ with $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$ is the optimal convergence rate for estimating $g^{(\nu)}$, uniformly for $\alpha \in (-1,1)$. The following theorem shows first that $n^{-r_\nu}$ is a lower bound to the convergence rate, i.e. $n^{-r_\nu}$ satisfies (3.1).
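The unified exponent is easy to tabulate. A short sketch (ours) shows how long memory slows down, and antipersistence accelerates, the rate relative to the iid case:

```python
def r(k, nu, alpha):
    """Optimal rate exponent r_nu = (1 - alpha)(k - nu) / (2k + 1 - alpha)."""
    return (1 - alpha) * (k - nu) / (2 * k + 1 - alpha)

k = 2
for alpha in (-0.5, 0.0, 0.5):  # antipersistence, short memory, long memory
    print(alpha, round(r(k, 0, alpha), 4), round(r(k, 1, alpha), 4))
# alpha = 0 recovers Stone's n^{-(k - nu)/(2k + 1)}: r_0 = 0.4, r_1 = 0.2;
# alpha = 0.5 gives r_0 = 0.2222; alpha = -0.5 gives r_0 = 0.5455.
```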
Theorem 1. Let model (1.1) hold with $g \in C^k$. Let $x_0 \in (0,1)$ be an interior point of the support of $g$. Let $\nu < k$ and $r_\nu = (1-\alpha)(k-\nu)/(2k+1-\alpha)$. Assume that the regularity conditions on the marginal innovation distribution described in Section 3.1 hold. Then $n^{-r_\nu}$ is a lower bound to the convergence rate for estimating $g^{(\nu)}(x_0)$.
The proof of Theorem 1 is given in the appendix.

Theorem 1 extends previous results obtained by Stone (1980) and Hall and Hart (1990a) in different ways. The results in Stone (1980) are extended to nonparametric regression with fractional time series errors. The main differences between the results of Theorem 1 and those given in Hall and Hart (1990a) are: 1. these results are given for all $\alpha \in (-1,1)$, including the antipersistent case; 2. these results are available for non-Gaussian error processes satisfying regularity conditions on the marginal innovation distribution; 3. the estimation of derivatives is also considered.
Remark 3. The sequence $n^{-r_\nu}$ as defined in Theorem 1 is of course also a lower bound to the convergence rate for estimation at the two boundary points $x_0 = 0$ or $x_0 = 1$, since the set of all measurable functions of the observations at $x_0 = 0$ (resp. $x_0 = 1$) under the restriction that there are no observations on the left (resp. right) hand side is a subset of all measurable functions.
Remark 4. In the proof of Theorem 1 a two-point discrimination argument is used. It will be shown that the probability on the right hand side of (3.1) can be made arbitrarily close to $\frac{1}{2}$. If a more sophisticated multi-point discrimination argument is used, as in Stone (1980), then it can be shown that

(3.3)    $\lim_{c\to 0}\ \liminf_n\ \sup_{g \in C^k} P\bigl(|\tilde g_n^{(\nu)}(x_0) - g^{(\nu)}(x_0)| > c\,n^{-r_\nu}\bigr) = 1.$
Remark 5. The results of Theorem 1 are in general not available for random design nonparametric regression or density estimation with dependent observations, since the effect of dependence in such cases tends to be less profound than in the model discussed here (see Hall and Hart, 1990b).
3.3 Achievability
Beran and Feng (2001a) showed that for $g \in C^k$ with $k - \nu$ even, the uniform convergence rate of the local polynomial estimator $\hat g^{(\nu)}$ is of order $n^{-r_\nu}$ for all $x \in [0,1]$, if a bandwidth of the optimal order $n^{-(1-\alpha)/(2k+1-\alpha)}$ is used, where $r_\nu$ is as defined in Theorem 1 (see Theorem 2 in Beran and Feng, 2001a). Similar results hold for the function class $C^k$ with $k - \nu > 0$ odd. This result can be used to show the achievability of the lower bound to the convergence rate as defined in Theorem 1, i.e. (3.2) holds for the local polynomial estimator $\hat g^{(\nu)}$ with $n^{-r_\nu}$, also at the two boundary points $x_0 = 0$ and $x_0 = 1$. This results in
Theorem 2. Let $x_0 \in [0,1]$. Under the conditions of Theorem 1, $n^{-r_\nu}$ is the optimal convergence rate for estimating $g^{(\nu)}(x_0)$.
The additional proof of Theorem 2 is straightforward and is omitted to save space.
Remark 6. Indeed, the convergence rate $n^{-r_\nu}$ as defined in Theorem 1 may be achieved under much weaker conditions. It is clear that (3.2) will hold if $\hat g^{(\nu)}$ is asymptotically normal. Some sufficient conditions under which $\hat g^{(\nu)}$ is asymptotically normal are given in Beran and Feng (2001b); these are much weaker than the conditions described in Section 3.1.
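As a toy end-to-end illustration (reusing the hypothetical `farima0d0` and `local_poly_deriv` sketches from Section 2; all constants are our choices), one can generate data from model (1.1) with long-memory errors, plug in a bandwidth of the optimal order and estimate $g$ and $g'$ at an interior point:

```python
import numpy as np

n, k, alpha = 1000, 2, 0.4                     # long memory with delta = alpha/2
h = n ** (-(1 - alpha) / (2 * k + 1 - alpha))  # bandwidth of the optimal order
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.25 * farima0d0(n, delta=alpha / 2, rng=1)
print(local_poly_deriv(y, x0=0.5, h=h, nu=0))  # true value: g(0.5) = 0
print(local_poly_deriv(y, x0=0.5, h=h, nu=1))  # true value: g'(0.5) = -2 pi
```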
4 Auxiliary results

4.1 Notations
Note that $r_\nu < 1$ for all $\alpha \in (-1,1)$ and that the interpolation error is of order $n^{-1}$, which is hence negligible. Therefore we may assume without loss of generality that $x_0$ is of the form $i_0/n$. It is notationally convenient to take $x_0 = i_0/n = 0$, so we will consider the shifted model

$Y_i = g(i/n) + \xi_i, \quad i = -n, \ldots, -1, 0, 1, \ldots, n,$

and estimate $g^{(\nu)}$ at the origin. Moreover, we shall assume that both the infinite past and the infinite future are given, i.e. we observe

(4.1)    $Y_i = g(i/n) + \xi_i, \quad -\infty < i < \infty.$

Model (4.1) is assumed only for notational convenience; it saves us symbols for distinguishing finite and infinite sample paths. It turns out that the extra information is of negligible benefit for the derivation of a lower bound to the convergence rate.
The main idea of the proof of Theorem 1 is to construct two sequences of functions. If these two sequences are "hard to distinguish", then their difference will form a lower bound to the convergence rate. If they are at the same time "far apart", then their difference will form an achievable convergence rate, and hence we will obtain the optimal convergence rate. Following Stone (1980) and Hall and Hart (1990a), let $\psi_0$ be a $(k+1)$-times differentiable function on $(-\infty, \infty)$, vanishing outside $(-1, 1)$ and satisfying $\psi_0^{(\nu)}(0) > 0$ for $\nu = 0, 1, \ldots, k$. Put

$B_0 = \sup_{0 \le x \le 1}\ \max_{\nu=0,1,\ldots,k} |\psi_0^{(\nu)}(x)|.$

Choose $a > 0$ so small that $aB_0 < B$. Let $0 < s < 1$ and set $h = n^{-s}$. Define

(4.2)    $g_\eta(x) = \eta\, a h^k \psi_0(x/h).$

Then $g_\eta(x)$ for $\eta \in \{0, 1\}$ are two sequences of functions in $C^k$.
In the following we will denote the limits $\lim_{n\to\infty} \prod_{i=-n}^{n}$ and $\lim_{n\to\infty} \sum_{i=-n}^{n}$ by $\prod$ and $\sum$ for simplicity. For $-\infty < i < \infty$, define the doubly infinite column vectors $\xi = (\xi_i)$, $\varepsilon = (\varepsilon_i)$ and $g = (g_1(i/n))$. Define the doubly infinite matrices $\Gamma = (\gamma(i-j))$ and $\Gamma^I = (\gamma^I(i-j))$. Let $\Phi = (b_{i-j})$ as given in (A.6) in the appendix, where $b_i = 0$ for $i < 0$. Let $\Psi = (a_{i-j})$ be defined as $\Phi$ but with $b_{i-j}$ replaced by $a_{i-j}$. Then we have $\xi = \Psi\varepsilon$ and $\varepsilon = \Phi\xi$. Let $Y_\eta = (Y_i)$; we have $Y_\eta = \eta g + \xi$. Define $X_\eta = \Phi Y_\eta = \eta\zeta + \varepsilon$, where $\zeta = \Phi g$. Note that $X_0 = \varepsilon$ and $Y_0 = \xi$. Furthermore, we see that $X_1$ is a sequence of independent random variables.
4.2 The likelihood functions and the error probabilities
Let $L_0$ and $\bar L_0$ denote the likelihood functions of $X_0 = \varepsilon$ and $Y_0 = \xi$, respectively. Observe that $L_0(x) = \prod f(x_i)$, where $x = (\ldots, x_{-1}, x_0, x_1, \ldots)'$ is a doubly infinite vector and $f$ is the marginal density function of $\varepsilon_i$. The following lemma gives the relationship between these two likelihood functions.
Lemma 1. For the fractional time series process defined by (2.3) and (2.4), and a doubly infinite real vector $y$, we have

(4.3)    $\bar L_0(y) = L_0(x) = \prod_{i=-\infty}^{\infty} f(x_i),$

where $x = \Phi y$ with $x_i = \sum_{j=-\infty}^{\infty} b_j y_{i-j}$, $-\infty < i < \infty$, and $f$ is the marginal density function of $\varepsilon_i$.
The proof of Lemma 1 is given in the appendix. Lemma 1 shows that $\bar L$ is uniquely determined by $L$. Note that, conversely, $L$ is also uniquely determined by $\bar L$. Following Lemma 1, the estimation of the likelihood function of an invertible stationary time series is equivalent to that of the corresponding iid innovations. The idea behind this lemma plays a very important role in the derivation of asymptotic results in nonparametric regression with dependent errors: it shows that discussions of asymptotic results in this case may often be reduced to those for models with iid errors after a suitable transformation. Note that Lemma 1 only holds for causal processes.
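The content of Lemma 1 can be mimicked on a finite sample: filtering $Y$ with truncated AR$(\infty)$ weights yields a series whose likelihood factorizes. The sketch below (ours, under a Gaussian assumption, reusing the hypothetical `ma_weights` and `farima0d0` from Section 2) is necessarily approximate near the start of the sample, since the lemma itself assumes the infinite past:

```python
import numpy as np

def whiten(y, b):
    """Finite-sample analogue of x = Phi y in Lemma 1:
    x_i = sum_{j=0}^{i} b_j y_{i-j}, a causal lower-triangular filter with
    b_0 = 1 (unit Jacobian).  Truncating the infinite past distorts only
    the first few x_i."""
    return np.convolve(y, b)[: len(y)]

def loglik_iid_gauss(x, sigma2=1.0):
    """log L_0(x) = sum_i log f(x_i) for N(0, sigma2) innovations."""
    return -0.5 * np.sum(x ** 2 / sigma2 + np.log(2.0 * np.pi * sigma2))

delta = 0.2
y = farima0d0(1000, delta, rng=0)        # fractional noise, alpha = 0.4
x = whiten(y, ma_weights(-delta, 2000))  # AR weights: delta -> -delta
print(loglik_iid_gauss(x))               # likelihood of the whitened series
```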
Let $L_1$ and $\bar L_1$ denote the likelihood functions of $X_1 = \varepsilon + \zeta$ and $Y_1 = \xi + g$, respectively. To prove Theorem 1 we need to estimate $P(\bar L_0 < \bar L_1 \mid \eta = 0)$ and $P(\bar L_0 > \bar L_1 \mid \eta = 1)$. The following corollary of Lemma 1 reduces the estimation of these error probabilities to that for the independent sequences $X_\eta$.
Corollary 1. Under the assumptions of Lemma 1, we have

$P(\bar L_0(y) < \bar L_1(y) \mid \eta = 0) = P(L_0(x) < L_1(x) \mid \eta = 0)$

and

$P(\bar L_0(y) > \bar L_1(y) \mid \eta = 1) = P(L_0(x) > L_1(x) \mid \eta = 1),$

where $x = \Phi y$.
The proof of Corollary 1 is given in the appendix. Following Corollary 1, a method for estimating the error probability developed for nonparametric regression with iid errors can be adapted to the current case. In this paper we will use the methodology proposed by Stone (1980). Note that $\zeta$, the deterministic part of $X_1$, does not necessarily have the same smoothness properties as $g$, the deterministic part of $Y_1$. However, this does not affect the estimation of the error probability.
4.3 A sufficient condition
Let $\delta_n = \frac{1}{2}g_1(0) = \frac{1}{2}a\psi_0(0)h^k = c_0 h^k$, where $c_0 = \frac{1}{2}a\psi_0(0)$. Let $\delta_{\nu,n} = c_\nu h^{k-\nu}$, where $c_\nu = \frac{1}{2}a\psi_0^{(\nu)}(0)$ for $\nu < k$. If the $\xi_i$ in model (1.1) are iid then, following Stone (1980), it can be shown that a sufficient condition under which $\delta_{\nu,n}$ is a lower rate of convergence for estimating $g^{(\nu)}$ is that there is an $M > 0$ such that $\sum g_1^2(i/n) < M$ (see equation (2.1) in Stone, 1980). The following lemma gives a simple extension of this result to the case when the $\xi_i$ are fractional stationary time series errors defined by (2.3) and (2.4).
Lemma 2. Let $\xi_i$ be defined by (2.3) and (2.4). Consider the estimation of $g^{(\nu)}$. Then $\delta_{\nu,n}$ is a lower rate of convergence if there is an $M > 0$ such that

(4.4)    $\sum_{i=-\infty}^{\infty} \zeta_i^2 = g'\,\Phi'\Phi\, g < M,$

where the $\zeta_i$ are the elements of $\zeta = \Phi g$.
The proof of Lemma 2 is given in the appendix. Note that $g = g_1 - g_0$ (recall $g_0 \equiv 0$), i.e. $g$ is the difference sequence between the two functions $g_0$ and $g_1$. Lemma 2 requires that this sequence, after transformation by $\Phi$, is square summable. From Lemma 2 we can also see that, if $\delta_n$ is a lower rate of convergence for estimating $g$, then $\delta_{\nu,n}$, the corresponding sequence for the $\nu$-th derivative, is a lower rate of convergence for estimating $g^{(\nu)}$, provided $\psi_0^{(\nu)}(0) > 0$.
It is easy to show that condition (4.4) is equivalent to

(4.5)    $g'\,\Gamma^I g < \sigma_\varepsilon^2 M$

and further equivalent to

(4.6)    $g'\,\Gamma^{-1} g < \sigma_\varepsilon^{-2} M.$

Proofs of (4.5) and (4.6) are given in the appendix. These two representations are easy to understand, and equation (4.6) directly shows the change in this sufficient condition caused by the dependence structure. The following remarks clarify the above results.
Remark 7. For iid errors $\xi_i = \varepsilon_i$ we have $\Phi = I$, $\Gamma = \sigma_\varepsilon^2 I$ and $\Gamma^{-1} = \sigma_\varepsilon^{-2} I$, where $I$ denotes the doubly infinite identity matrix. In this case we simply have $\sum g_1^2(i/n) < M$. Note that $D = \sqrt{\sum g_1^2(i/n)}$ is the $L_2$-norm of $g$. Lemma 2 implies that any method of deciding between $\eta = 0$ and $\eta = 1$, i.e. of deciding between the vector $g$ and the zero vector, must have overall positive error probability if the norm of $g$ is bounded.
Remark 8. Assume that the $\varepsilon_i$ are normal. Following Hall and Hart (1990a) it can be shown that the overall error probability of any estimator of $\eta$ based on $Y_\eta$ is at least

(4.7)    $P_a = 1 - \Phi\bigl((g'\,\Gamma^{-1} g)^{1/2}\bigr),$

where $\Phi(\cdot)$ here denotes the standard normal distribution function. The error probability $P_a$ will be positive if $g'\Gamma^{-1}g$ is finite. $P_a$ in (4.7) can be made arbitrarily close to $\frac{1}{2}$ by choosing the constant $a$ in (4.2) so that $a \to 0$ and hence $g'\Gamma^{-1}g \to 0$.
5 Acknowledgements
This work was finished under the advice of Prof. Jan Beran, Chair of the Department of Mathematics and Statistics, University of Konstanz, Germany, and was financially supported by the Center of Finance and Econometrics (CoFE) at the University of Konstanz. The author gratefully acknowledges Prof. Jan Beran's useful advice and comments, which led to improvements in the quality of this paper.
Appendix: Proofs

Proof of Lemma 1. It is well known that, under common conditions, the likelihood functions of two random vectors forming a reciprocal one-to-one mapping are uniquely determined by each other (see e.g. Theorem 2 of Section 4.4 in Rohatgi and Saleh, 2001, p. 127). Note that this result can be extended to doubly infinite random vectors. For the proof of Lemma 1 it remains to check that all conditions of this theorem hold. At first, $\varepsilon = \Phi\xi$ forms a doubly infinite dimensional reciprocal one-to-one mapping with the inverse transformation $\xi = \Psi\varepsilon$, where both the original and the inverse transformation are linear. Hence, conditions (a) to (c) of Theorem 2 of Section 4.4 in Rohatgi and Saleh (2001) hold. Furthermore, $\Phi$ is also the matrix of the partial derivatives of $\varepsilon$ with respect to $\xi$, and the Jacobian $J$ of the inverse transformation is the determinant $|\Psi| = 1$, since $\Psi$ is a (doubly infinite) lower triangular matrix whose diagonal elements are identically one. Hence the relationship between $L_0$ and $\bar L_0$ as given in Lemma 1 holds. □
Proof of Corollary 1. Observe that $X_1 = X_0 + \zeta$ and $Y_1 = Y_0 + g$. Hence we have $L_1(x) = L_0(x - \zeta)$ and $\bar L_1(y) = \bar L_0(y - g)$. It follows from Lemma 1 that, for any doubly infinite real vectors $y$ and $g$,

(A.1)    $\bar L_1(y) = \bar L_0(y - g) = L_0(x - \zeta) = L_1(x) = \prod_{i=-\infty}^{\infty} f(x_i - \zeta_i),$

where $x = \Phi y$, $\zeta = \Phi g$ and $f$ is the marginal density function of $\varepsilon_i$. Equations (4.3) and (A.1) together show that $\bar L_0(y) < \bar L_1(y)$ (or $\bar L_0(y) > \bar L_1(y)$, or $\bar L_0(y) = \bar L_1(y)$) if and only if $L_0(x) < L_1(x)$ (or $L_0(x) > L_1(x)$, or $L_0(x) = L_1(x)$), where $x = \Phi y$. Corollary 1 follows from this fact. □
The proofs given in the following are related to those in Stone (1980) and Hall and Hart (1990a); hence some details will be omitted to save space, and we refer the reader to the proofs in those works. We also refer the reader to Theorem 1 in Hall (1989) and its proof. Note that the symbol $\alpha$ in this paper is defined differently from the $\alpha$ used in Hall and Hart (1990a).
Proof of Lemma 2. Let $\delta_{\nu,n}$ be as defined in Lemma 2. Note that

$\sup_{g\in C^k} P_g\{|\tilde g_n^{(\nu)}(0) - g^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\}.$

Let $\tilde\eta_n = 0$ or $1$ minimize $|\tilde g_n^{(\nu)}(0) - g_{\tilde\eta}^{(\nu)}(0)|$. Then $\tilde\eta_n \ne \eta$ implies $|\tilde g_n^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}$, and hence

(A.2)    $\max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \max_{\eta=0,1} P_\eta(\tilde\eta \ne \eta) \ge \tfrac{1}{2}\{P_0(\tilde\eta = 1) + P_1(\tilde\eta = 0)\} \ge \tfrac{1}{2}\{P_0(\hat\eta = 1) + P_1(\hat\eta = 0)\},$

where $\hat\eta$ is the maximum likelihood estimator of $\eta$ (or the likelihood ratio discriminator) in the two-parameter problem. The last inequality follows from the Neyman–Pearson lemma. From Corollary 1 we have

(A.3)    $\max_{\eta=0,1} P_\eta\{|\tilde g^{(\nu)}(0) - g_\eta^{(\nu)}(0)| \ge \delta_{\nu,n}\} \ge \tfrac{1}{2}\bigl(P_0(\bar L_0 < \bar L_1) + P_1(\bar L_1 < \bar L_0)\bigr) = \tfrac{1}{2}\bigl(P_0(L_0 < L_1) + P_1(L_1 < L_0)\bigr).$
Let $L_R$ denote the likelihood ratio $L_1/L_0$. By calculations similar to those given on pages 1352–1353 of Stone (1980) it can be shown, under the regularity conditions on the marginal distribution of $\varepsilon_i$ given in Section 3.1, that there is a positive constant $M_1$ such that

(A.4)    $E_0|\log(L_R)| < M_1$

and

(A.5)    $\lim_{a\to 0} E_0|\log(L_R)| = 0.$

Formulas similar to (A.4) and (A.5) hold for the expectation under $\eta = 1$ with another positive constant $M_2$. Let $M_0 = \max(M_1, M_2)$. Then we can find an integer $K \ge 2$ and $0 < \pi < \tfrac{1}{2}$ such that, if $L_R > (1-\pi)/\pi$ or $L_R < \pi/(1-\pi)$, then $|\log(L_R)| \ge K M_0$. Following the Markov inequality,

$P_0\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}, \qquad P_1\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}.$

Put prior probabilities $\tfrac{1}{2}$ each on $\eta = 0$ and $\eta = 1$. Then

$P(\eta = 1 \mid Y) = \frac{\tfrac{1}{2}\bar L_1}{\tfrac{1}{2}\bar L_1 + \tfrac{1}{2}\bar L_0} = \frac{L_R}{L_R + 1}$

and

$P\bigl(\pi \le P(\eta = 1 \mid Y) \le 1 - \pi\bigr) = P\Bigl(\pi \le \frac{L_R}{L_R+1} \le 1-\pi\Bigr) = \tfrac{1}{2} P_0\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) + \tfrac{1}{2} P_1\Bigl(\frac{\pi}{1-\pi} \le L_R \le \frac{1-\pi}{\pi}\Bigr) > \frac{K-1}{K}.$

That is, the error probability of $\hat\eta$ is at least $\pi(K-1)/K$. Note that $\pi(K-1)/K$ can be made arbitrarily close to $\tfrac{1}{2}$ as $a \to 0$, by choosing $K$ sufficiently large and $\pi$ sufficiently close to $\tfrac{1}{2}$ at the same time. □
Proof of equations (4.5) and (4.6). The matrix $\Phi$ is given by

(A.6)    $\Phi = \begin{pmatrix} \ddots & \ddots & \ddots & \ddots & \ddots & & \\ \cdots & 1 & 0 & 0 & \cdots & 0 & 0 & \cdots \\ \cdots & b_1 & 1 & 0 & \cdots & 0 & 0 & \cdots \\ \cdots & b_2 & b_1 & 1 & \cdots & 0 & 0 & \cdots \\ & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \\ \cdots & b_{n-1} & b_{n-2} & b_{n-3} & \cdots & 1 & 0 & \cdots \\ \cdots & b_n & b_{n-1} & b_{n-2} & \cdots & b_1 & 1 & \cdots \\ & & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots \end{pmatrix}.$

Following the definition of $\gamma^I(i-j)$ we have $\Gamma^I = \sigma_\varepsilon^2\,\Phi'\Phi$. Furthermore, it can be shown that $\Phi'\Phi = \Phi\Phi'$. The equivalence between (4.4) and (4.5) follows from this fact. The equivalence between the two conditions (4.5) and (4.6) is due to the fact that $\Gamma^{-1} = \sigma_\varepsilon^{-4}\,\Gamma^I$ in the sense that $\Gamma\,\Gamma^I\,\sigma_\varepsilon^{-4} = I$ (see e.g. Shaman, 1975, and Beran, 1994, pp. 109 ff.). □
Proof of Theorem 1. Without loss of generality we will assume $\sigma_\varepsilon^2 = 1$ for convenience. For $\nu = 0$ let $\delta_n = c_0 h^k$ be equal to the rate $c_0 n^{-r_0}$, where $r_0 = (1-\alpha)k/(2k+1-\alpha)$ is as defined in Theorem 1. Then we have $h = n^{-s}$ with $s = (1-\alpha)/(2k+1-\alpha)$. Following Lemma 2, we have to show that the sequence $g$ under this choice of $h$ satisfies the condition $\sum \zeta_i^2 = g'\Phi'\Phi g < \infty$, in order that $c_\nu n^{-r_\nu}$ is a lower rate of convergence for estimating $g^{(\nu)}$.
Let $m = nh$ and let $\theta = (\theta_i)$ with $\theta_i = \psi_0(i/m)$ denote the corresponding doubly infinite vector, so that $g = a h^k \theta$. Then we have

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, \theta'\Phi'\Phi\theta.$

Observe that $\theta_i = 0$ for $i < -m$ or $i > m$. We have

(A.7)    $\theta'\Phi'\Phi\theta = \sum_{j=-\infty}^{\infty} \Bigl(\sum_{i=-m}^{m} \theta_i b_{i+j}\Bigr)^2 = (2m+1) \sum_{k=-2m}^{2m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1} \sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr].$
Equation (A.7) can also be obtained by directly analysing $\Phi'\Phi$. Based on (A.7) we can obtain results for the cases $\alpha = 0$, $0 < \alpha < 1$ and $-1 < \alpha < 0$ separately. Note that the methodology used in the proof of Theorem 3.1 in Hall and Hart (1990a) for the case with $0 < \alpha < 1$ is based on the assumption $b_i = b_{-i}$ for $i = 1, 2, \ldots$, and is hence not suitable for the causal error process considered in this paper, since now we have $b_i = 0$ for $i < 0$. The methodology used in the following is instead based on the property (1.2) of a fractional time series, which does not involve the exact structure of the $b_i$.
Assume that $\alpha = 0$. Note that in this case $\sum \gamma^I(k) > 0$ and $\sum |\gamma^I(k)| < \infty$. From (A.7) we have

$\theta'\Phi'\Phi\theta \doteq (2m+1) \sum \gamma^I(k) \int_{-1}^{1} \psi_0^2(u)\,du,$

where $\doteq$ denotes asymptotic equality of both sides. Note that $h = n^{-1/(2k+1)}$ and $m = nh = n^{2k/(2k+1)} = h^{-2k}$ for $\alpha = 0$, whence

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, \theta'\Phi'\Phi\theta < \infty.$
In the case with $0 < \alpha < 1$ the inverse process $\xi^I$ is an antipersistent process with parameter $-1 < \alpha_I = -\alpha < 0$ in (2.7), and hence for $|k|$ sufficiently large we have $\gamma^I(k) \sim c_\gamma^I |k|^{-\alpha-1}$, where $c_\gamma^I = 2 c_f^I \Gamma(1-\alpha_I)\sin(\pi\alpha_I/2) < 0$ (see Beran, 1994, and Beran and Feng, 2001a), which implies that the $\gamma^I(k)$ are ultimately negative for $|k|$ sufficiently large. Furthermore, we have $\sum \gamma^I(k) = 0$ and hence $\sum_{k=-m}^{m} \gamma^I(k) = -2\sum_{k>m} \gamma^I(k) = O(m^{-\alpha})$.
Thus

$\theta'\Phi'\Phi\theta = (2m+1) \sum_{k=-2m}^{2m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$\le (2m+1) \sum_{k=-m}^{m} \gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$= (2m+1)\, O\Bigl(\sum_{k=-m}^{m} \gamma^I(k)\Bigr) = O(m^{1-\alpha}).$

Now we have $h = n^{-(1-\alpha)/(2k+1-\alpha)}$ and $m = nh = n^{2k/(2k+1-\alpha)}$. This results in $m^{1-\alpha} = h^{-2k}$, so that

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, O(h^{-2k}) < \infty.$
If $-1 < \alpha < 0$, the inverse process $\xi^I$ is a long-memory process with parameter $0 < \alpha_I = -\alpha < 1$ in (2.7) and hence, for $|k|$ sufficiently large, $\gamma^I(k) \sim c_\gamma^I |k|^{-\alpha-1}$, where $c_\gamma^I = 2 c_f^I \Gamma(1-\alpha_I)\sin(\pi\alpha_I/2) > 0$, so that $\gamma^I(k) > 0$ for $|k|$ sufficiently large. Furthermore, we have $\sum \gamma^I(k) = \infty$ with $\sum_{k=-2m}^{2m} \gamma^I(k) = O(m^{-\alpha})$. Note that $\psi_0$ can be chosen so that, for large $k$, $\sum_{j=-m}^{m} \psi_0(j/m)\,\psi_0\{(j+k)/m\} < \sum_{j=-m}^{m} \psi_0^2(j/m)$. Hence we have
$\theta'\Phi'\Phi\theta = (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m}\psi_0(j/m)\,\psi_0\{(j+k)/m\}\Bigr]$
$\le (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\,\Bigl[\frac{1}{2m+1}\sum_{j=-m}^{m}\psi_0^2(j/m)\Bigr]$
$\doteq (2m+1)\sum_{k=-2m}^{2m}\gamma^I(k)\int_{-1}^{1}\psi_0^2(u)\,du = O(m^{1-\alpha}).$
In fact, we have

$\theta'\Phi'\Phi\theta = O(m^{1-\alpha})$

uniformly for $\alpha \in (-1,1)$; however, the derivation of this result is a little different in the three cases. Now, note that $h = n^{-(1-\alpha)/(2k+1-\alpha)}$, whence, as before, $m^{1-\alpha} = h^{-2k}$, so that

$g'\,\Phi'\Phi\, g = a^2 h^{2k}\, O(h^{-2k}) < \infty.$

Theorem 1 is proved. □
References

Beran, J. (1994). Statistics for Long-Memory Processes. New York: Chapman & Hall.

Beran, J. (1999). SEMIFAR models: A semiparametric framework for modelling trends, long range dependence and nonstationarity. Discussion Paper No. 99/16, Center of Finance and Econometrics, University of Konstanz.

Beran, J. and Feng, Y. (2001a). Local polynomial fitting with long-memory, short-memory and antipersistent errors. To appear in Annals of the Institute of Statistical Mathematics.

Beran, J. and Feng, Y. (2001b). Local polynomial estimation with a FARIMA-GARCH error process. To appear in Bernoulli.

Chatfield, C. (1979). Inverse autocorrelations. J. R. Statist. Soc., Ser. A, 142, 363–377.

Cleveland, W.S. (1972). The inverse autocorrelations of a time series and their applications (with discussion). Technometrics, 14, 277–298.

Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc., 74, 829–836.

Farrell, R.H. (1972). On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. Ann. Math. Statist., 43, 170–180.

Granger, C.W.J. and Joyeux, R. (1980). An introduction to long-range time series models and fractional differencing. J. Time Ser. Anal., 1, 15–30.

Hall, P. (1989). On convergence rates in nonparametric problems. Intern. Statist. Review, 57, 45–58.

Hall, P. and Hart, J.D. (1990a). Nonparametric regression with long-range dependence. Stochastic Process. Appl., 36, 339–351.

Hall, P. and Hart, J.D. (1990b). Convergence rates in density estimation for data from infinite-order moving average processes. Probab. Theory Rel. Fields, 87, 253–274.

Härdle, W., Hall, P. and Marron, J.S. (1992). Regression smoothing parameters that are not far from their optimum. J. Amer. Statist. Assoc., 87, 227–233.

Hosking, J.R.M. (1981). Fractional differencing. Biometrika, 68, 165–176.

Rohatgi, V.K. and Saleh, A.K.Md.E. (2001). An Introduction to Probability and Statistics. 2nd ed. New York: Wiley.

Shaman, P. (1975). An approximate inverse for the covariance matrix of moving average and autoregressive processes. Ann. Statist., 3, 532–538.

Stone, C.J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist., 5, 595–620.

Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist., 8, 1348–1360.

Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, 1040–1053.