
PROPERTIES OF A BACKFITTING PROJECTION ALGORITHM UNDER WEAK CONDITIONS

O. Linton$^1$, E. Mammen$^2$, and J. Nielsen$^3$

May 8, 1998

Abstract

We derive the asymptotic distribution of a new backfitting procedure for estimating the closest additive approximation to a nonparametric regression function. The procedure employs a recent projection interpretation of popular kernel estimators provided by Mammen et al. (1997), and the asymptotic theory of our estimators is derived using the theory of additive projections reviewed in Bickel et al. (1995). Our procedure achieves the same bias and variance as the oracle estimator based on knowing the other components, and in this sense improves on the method analyzed in Opsomer and Ruppert (1997). We provide `high level' conditions independent of the sampling scheme. We then verify that these conditions are satisfied in a time series autoregression under weak conditions.


AMS 1991 subject classifications. Primary 62G07; secondary 62G20.

Keywords and phrases. Additive models; alternating projections; backfitting; kernel smoothing; local polynomials; nonparametric regression.

Short title. Backfitting under weak conditions.

1 Introduction

Separable models are important in exploratory analyses of nonparametric regression. The backfitting technique has long been the state of the art method for estimating these models, see Hastie and Tibshirani (1991). While backfitting has proven very useful in application and simulation studies, it has been somewhat difficult to analyze theoretically, which has long been a drawback to its universal acceptance. Recently, a new method, called marginal integration, has been proposed, see Linton and Nielsen (1995), Tjøstheim and Auestad (1994) and Newey (1994) [see also earlier work by Auestad and Tjøstheim (1991)]. This method is perhaps easier to understand for non-statisticians since it involves averaging rather than iterative solution of nonlinear equations. Its statistical properties are trivial to obtain, and have been established in the aforementioned papers. Although tractable, marginal integration is not generally efficient. Fan, Mammen, and Härdle (1996) and Linton (1996) showed how to improve on the efficiency of the marginal integration estimator in regression; in the latter paper, this was achieved by carrying out one backfitting iteration from this initial consistent


$^1$Cowles Foundation for Research in Economics, Yale University, 30 Hillhouse Avenue, New Haven, CT 06520-8281, USA. Phone: (203) 432-3699. Fax: (203) 432-6167. http://www.econ.yale.edu/~linton. Supported by the National Science Foundation and the North Atlantic Treaty Organization.

$^2$Institut für Angewandte Mathematik, Ruprecht-Karls-Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany. Supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 373 "Quantifikation und Simulation ökonomischer Prozesse", Humboldt-Universität zu Berlin.

$^3$PFA Pension, Sundkrogsgade 4, DK-2100 Copenhagen, Denmark.


starting point. This modification actually achieves full oracle efficiency, i.e., one achieves the same result as if one knew the other components. This suggests that backfitting itself is also efficient in the same sense. Moreover, backfitting, since it relies only on one-dimensional smooths, is free from the curse of dimensionality.

Recent work by Opsomer and Ruppert (1997) and Opsomer (1997) has addressed the algorithmic and statistical properties of backfitting. Specifically, they gave sufficient conditions for the existence and uniqueness of a version of backfitting, or rather an exact solution to the empirical projection equations, suitable for any (recentred) smoother matrix. They also derived an expansion for the conditional mean squared error of their version of backfitting: the asymptotic variance is equal to the oracle bound, while the precise form of the bias, as for the integration method, depends on the way recentering is carried out, but in any case is not oracle, except when the covariates are mutually independent. This important work confirms the efficiency, at least with respect to variance, of [their version of] backfitting. Unfortunately, their version of backfitting is not design adaptive, which is somewhat surprising given that they use local polynomial smoothers throughout. Furthermore, their proof technique required rather strong conditions; specifically, the amount of dependence in the covariates was strictly limited.

In this paper, we define a new backfitting-type estimator for additive nonparametric regression. We make use of an interpretation of the Nadaraya-Watson estimator and the local linear estimator as projections in an appropriate Hilbert space, which was first provided by Mammen et al. (1997). Our additive estimator is defined as the further projection of these multivariate estimators down onto the space of additive functions. We examine this estimator and show how, in both the Nadaraya-Watson case and in the local linear case, the estimator can be interpreted as a backfitting estimator defined through iterative solution of the empirical equations. We establish the geometric convergence of the backfitting equations to the unique solution using the theory of additive projections, see Bickel et al. (1995). We use this result to establish the limiting behaviour of the estimates: we give both


the asymptotic distribution and a uniform convergence result. Our procedure achieves the same bias and variance as the oracle estimator based on knowing the other components, and in this sense improves on the method analyzed in Opsomer and Ruppert (1997). Although the criterion function is defined in terms of the high-dimensional estimates, we show that the estimator is also characterized by equations that only depend on one- and two-dimensional marginals, so that the curse of dimensionality truly does not operate here. Our first results are established using ideas from Hilbert space mathematics and hold under `high level' conditions, which are formulated independently of specific sampling assumptions. We then verify these conditions in a time series regression with strong mixing data. Our conditions are strictly weaker than those of Opsomer and Ruppert (1997), and do not necessarily restrict the dependence between the covariates in any way.

This paper is organized as follows. In section 2 we show how local polynomial estimators can be interpreted as projections. In section 3 we introduce our additive estimators in the simplest situation, i.e., for the Nadaraya-Watson-like pilot estimator, establishing the convergence of the backfitting algorithm and the asymptotic distribution of the estimator under high level conditions that are suitable for a range of sampling schemes. In section 4 we extend the analysis to local polynomials. In section 5 we investigate a time series setting and give primitive conditions that imply the high level conditions. In section 6 we illustrate our procedure on financial data. All proofs are contained in the appendix.

2 A projection interpretation of local polynomials

Let $(Y, X)$ be random variables of dimensions 1 and $d$ respectively, and let $(Y_1, X_1), \ldots, (Y_n, X_n)$ be a random sample drawn from $(Y, X)$. We first provide a new interpretation of local polynomial estimators of the regression function $m(x_1, \ldots, x_d) = E(Y \mid X = x)$ evaluated at the vector $x = (x_1, \ldots, x_d)^T$; see Mammen, Marron, Turlach and Wand (1997). This new point of view will be useful for interpreting our estimators of the restricted additive function $m(x) = \alpha + m_1(x_1) + \cdots + m_d(x_d)$.

The full-dimensional local [$q$th order] polynomial regression smoother, which we denote by $\widehat{\mathbf m}(x) = (\widehat m_0(x), \ldots, \widehat m_{qd}(x))^T$, satisfies

$$\widehat{\mathbf m}(x) = \operatorname*{arg\,min}_{\mathbf m = (m_0, \ldots, m_{qd})^T}\ \sum_{i=1}^n \Big[ Y_i - m_0 - \frac{X_{i1} - x_1}{h}\, m_1 - \cdots - \Big( \frac{X_{id} - x_d}{h} \Big)^q m_{qd} \Big]^2 \prod_{\ell=1}^d K_h(X_{i\ell} - x_\ell), \tag{1}$$

where $q$ is the order of the polynomial approximation. In fact, for simplicity of notation we will concentrate on the local linear case considered in Ruppert and Wand (1995), for which $q = 1$; the Nadaraya-Watson case, for which $q = 0$, is even simpler, see below. Define the matrices [of dimension $n \times (d+1)$ and $n \times n$, respectively]

$$\mathbf X(x) = \begin{pmatrix} 1 & \frac{X_{11} - x_1}{h} & \cdots & \frac{X_{1d} - x_d}{h} \\ \vdots & \vdots & & \vdots \\ 1 & \frac{X_{n1} - x_1}{h} & \cdots & \frac{X_{nd} - x_d}{h} \end{pmatrix}, \qquad \mathbf K(x) = \operatorname{diag}\Big( \prod_{\ell=1}^d K_h(X_{1\ell} - x_\ell), \ldots, \prod_{\ell=1}^d K_h(X_{n\ell} - x_\ell) \Big),$$

and write

$$\widehat{\mathbf m}(x) = \big[ \mathbf X(x)^T \mathbf K(x)\, \mathbf X(x) \big]^{-1} \mathbf X(x)^T \mathbf K(x)\, \mathbf Y \equiv \widehat{\mathbf V}^{-1}(x)\, \widehat{\mathbf R}(x), \tag{2}$$

where $\mathbf Y = (Y_1, \ldots, Y_n)^T$, $\widehat{\mathbf V}(x) = \mathbf X(x)^T \mathbf K(x)\, \mathbf X(x)$ and $\widehat{\mathbf R}(x) = \mathbf X(x)^T \mathbf K(x)\, \mathbf Y$.
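To fix ideas, here is a minimal numerical sketch of the estimator (2). It is an illustration only (the paper contains no code); the Gaussian product kernel, the single bandwidth h, and the function names are choices made here for concreteness.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Evaluate the full-dimensional local linear smoother of (2) at the point x.

    X: (n, d) array of covariates, Y: (n,) responses, h: scalar bandwidth.
    Returns the (d + 1)-vector (m_0(x), ..., m_d(x)); m_0(x) estimates m(x).
    """
    n, d = X.shape
    U = (X - x) / h                              # rows are ((X_i - x)/h)^T
    Xmat = np.column_stack([np.ones(n), U])      # design matrix X(x), n x (d+1)
    # diagonal of K(x): product Gaussian kernel weights prod_l K_h(X_il - x_l)
    Kdiag = np.prod(np.exp(-0.5 * U**2) / (np.sqrt(2 * np.pi) * h), axis=1)
    V = Xmat.T @ (Kdiag[:, None] * Xmat)         # V(x) = X(x)^T K(x) X(x)
    R = Xmat.T @ (Kdiag * Y)                     # R(x) = X(x)^T K(x) Y
    return np.linalg.solve(V, R)                 # m(x) = V(x)^{-1} R(x)

# toy usage: here m(0, 0) = sin(0) + 0 = 0, so the first entry should be near 0
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(200)
print(local_linear(np.zeros(2), X, Y, h=0.3))
```

Only the first coordinate of the output is the regression estimate; the remaining coordinates are the local (rescaled) slope coefficients of (1).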

For the new interpretation of local linear estimators we think of the data $\mathbf Y = (Y_1, \ldots, Y_n)^T$ as an element of the space of tuples of functions

$$\mathcal F = \big\{ (f_{ij} : i = 1, \ldots, n;\ j = 0, \ldots, d) \big\},$$

where the $f_{ij}$ are functions from $\mathbb R^d$ to $\mathbb R$. We do this by putting $f_{i0}(x) \equiv Y_i$ and $f_{ij}(x) \equiv 0$ for $j \neq 0$. We define the following norm on $\mathcal F$:

$$\|f\|^2 = \int \frac{1}{n} \sum_{i=1}^n \Big[ f_{i0}(x) + \sum_{j=1}^d f_{ij}(x)\, \frac{x_j - X_{ij}}{h} \Big]^2 \prod_{j=1}^d K_h(X_{ij} - x_j)\, dx,
$$

where $K_h(\cdot) = K(\cdot/h)/h$ with $K(\cdot)$ a univariate kernel. Consider now the following subspaces of $\mathcal F$:

$$\mathcal F_{\mathrm{full}} = \{ f \in \mathcal F : f_{ij} \text{ does not depend on } i \text{ for } j = 0, \ldots, d \},$$

$$\mathcal F_{\mathrm{add}} = \{ f \in \mathcal F_{\mathrm{full}} : f_{i0}(x) = g_1(x_1) + \cdots + g_d(x_d) \text{ for some functions } g_j : \mathbb R \to \mathbb R,\ j = 1, \ldots, d, \text{ and, for } j \neq 0,\ f_{ij}(x) = \tilde g_j(x_j) \text{ for some functions } \tilde g_j : \mathbb R \to \mathbb R,\ j = 1, \ldots, d \}.$$

The estimate $\widehat{\mathbf m}(x)$ defines an element of $\mathcal F$ by putting $f_{ij}(x) = \widehat m_j(x)$, $j = 0, 1, \ldots, d$. This is an element of $\mathcal F_{\mathrm{full}}$. It is easy to see that, with respect to $\|\cdot\|$, $\widehat{\mathbf m}$ is the orthogonal projection of $\mathbf Y$ onto $\mathcal F_{\mathrm{full}}$. Below we introduce our version $\widetilde{\mathbf m}$ of the backfitting estimator as the orthogonal projection of $\widehat{\mathbf m}$ onto $\mathcal F_{\mathrm{add}}$ [with respect to $\|\cdot\|$]. For an understanding of $\widetilde{\mathbf m}$ it will become essential that it is also the orthogonal projection of $\mathbf Y$ onto $\mathcal F_{\mathrm{add}}$. For the definition of such norms and linear spaces for higher order local polynomials and for other smoothers we refer to Mammen, Marron, Turlach and Wand (1997). Each local polynomial estimator corresponds to a specific choice of inner product in a Hilbert space, and the definition of the corresponding additive estimators is then the projection further down onto $\mathcal F_{\mathrm{add}}$. In particular, for the local constant estimator (Nadaraya-Watson-like smoothers) one chooses:

$$\mathcal F = \{ (f_i : i = 1, \ldots, n) \}, \text{ where the } f_i \text{ are functions from } \mathbb R^d \text{ to } \mathbb R,$$

$$\mathcal F_{\mathrm{full}} = \{ f \in \mathcal F : f_i \text{ does not depend on } i \},$$

$$\mathcal F_{\mathrm{add}} = \{ f \in \mathcal F_{\mathrm{full}} : f_i(x) = g_1(x_1) + \cdots + g_d(x_d) \text{ for some functions } g_j : \mathbb R \to \mathbb R \},$$

$$\|f\|^2 = \int \frac1n \sum_{i=1}^n \big[ f_i(x) \big]^2 \prod_{j=1}^d K_h(X_{ij} - x_j)\, dx.$$

Note that for functions $\mathbf m$ in $\mathcal F_{\mathrm{full}}$ [i.e. $m := m_1 = \cdots = m_n$] we get

$$\|\mathbf m\|^2 = \int m(x)^2\, \widehat p(x)\, dx,$$

where $\widehat p(x) = n^{-1} \sum_{i=1}^n \prod_{j=1}^d K_h(X_{ij} - x_j)$ is the kernel density estimate of the design density. In particular, in this case $\widetilde{\mathbf m}$ is the projection of the full-dimensional Nadaraya-Watson estimate onto the subspace of additive functions with respect to the norm of the space $L_2(\widehat p)$. We give a slightly different motivation for the projection estimate $\widetilde{\mathbf m}$ in the next section, see (7). There we will discuss the case of local constant smoothing in detail.

3 Estimation with Nadaraya-Watson-Like Smoothers

In this section we will motivate our backfitting estimate based on regression smoothers like the Nadaraya-Watson estimator

$$\widehat m(x) = \frac{ n^{-1} \sum_{i=1}^n \prod_{\ell=1}^d K_h(x_\ell - X_{i\ell})\, Y_i }{ n^{-1} \sum_{i=1}^n \prod_{\ell=1}^d K_h(x_\ell - X_{i\ell}) }. \tag{3}$$

The specific choice of the Nadaraya-Watson estimator is not important, but the smoother is supposed to have the ratio form

$$\widehat m(x) = \frac{\widehat r(x)}{\widehat p(x)} = \sum_{i=1}^n w_i(x)\, Y_i, \tag{4}$$

where $\widehat p(x)$, which depends only on $\mathcal X_n = \{X_1, \ldots, X_n\}$, is an estimator of $p(x)$, the marginal density of $X$. Here, the weighting sequence $\{w_i(x)\}_{i=1}^n$ only depends on $\mathcal X_n$, as does the weighting sequence $\{\bar w_i(x)\}_{i=1}^n$ of the numerator $\widehat r(x) = \sum_{i=1}^n \bar w_i(x)\, Y_i$. The assumption that the pilot estimate $\widehat m$ exists [i.e., is everywhere and always finite] will be dropped in our asymptotic analysis in the next section, which will allow us to include the case of high dimensions $d$.
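As a concrete sketch of the ratio form (4) (again an illustration with a Gaussian product kernel; none of this code or naming is from the paper):

```python
import numpy as np

def nw_weights(x, X, h):
    """Numerator weights w_bar_i(x) and density estimate p_hat(x) for (4)."""
    U = (X - x) / h
    Kprod = np.prod(np.exp(-0.5 * U**2) / (np.sqrt(2 * np.pi) * h), axis=1)
    w_bar = Kprod / len(X)       # r_hat(x) = sum_i w_bar_i(x) Y_i
    return w_bar, w_bar.sum()    # p_hat(x) = n^{-1} sum_i prod_l K_h(x_l - X_il)

def nw(x, X, Y, h):
    """Nadaraya-Watson pilot estimate (3): m_hat(x) = r_hat(x) / p_hat(x)."""
    w_bar, p_hat = nw_weights(x, X, h)
    return (w_bar @ Y) / p_hat   # equals sum_i w_i(x) Y_i with w_i = w_bar_i / p_hat
```

Note that both weight sequences depend on the covariates only, as the ratio form requires.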

We assume for the most part that

$$m(x) = \alpha + m_1(x_1) + \cdots + m_d(x_d), \tag{5}$$

although our definitions make sense more generally [i.e., when the regression function is not additive], in which case the target function is the closest additive approximation to the regression function. For identifiability we assume that

$$\int m_j(x_j)\, p_j(x_j)\, dx_j = 0, \quad j = 1, \ldots, d, \tag{6}$$

where the marginal density of $X_j$ is denoted by $p_j(\cdot)$. Denote also the marginal density of $(X_i, X_j)$ by $p_{ij}(\cdot)$, respectively ($i, j = 1, \ldots, d$). The vector $(X_i : i \neq j)$ is denoted by $X_{-j}$ and its density by $p_{-j}$.

Recall that backfitting is motivated as solving an empirical version of the set of equations

$$\begin{aligned} m_1(x_1) &= E(Y \mid X_1 = x_1) - \alpha - E\{m_2(X_2) \mid X_1 = x_1\} - \cdots - E\{m_d(X_d) \mid X_1 = x_1\} \\ &\ \ \vdots \\ m_d(x_d) &= E(Y \mid X_d = x_d) - \alpha - E\{m_1(X_1) \mid X_d = x_d\} - \cdots - E\{m_{d-1}(X_{d-1}) \mid X_d = x_d\}. \end{aligned}$$

In the sample, one replaces $E(Y \mid X_j = x_j)$ by one-dimensional smoothers $\widehat m_j(\cdot)$, and iterates from some arbitrary starting values for $m_j(\cdot)$; see Hastie and Tibshirani (1991, p. 108). Let $\widehat p(x)$ and $\widehat m(x)$ be the multidimensional density and regression smoothers defined above. We define backfitting estimates $\widetilde m_j$ as the minimizers of the following norm

$$\|\widehat m - m\|^2_{\widehat p} = \int \big[ \widehat m(x) - \alpha - m_1(x_1) - \cdots - m_d(x_d) \big]^2\, \widehat p(x)\, dx, \tag{7}$$

where the minimization runs over all functions $m(x) = \alpha + \sum_j m_j(x_j)$ with $\int m_j(x_j)\, \widehat p_j(x_j)\, dx_j = 0$ [see Nielsen and Linton (1996); we suppose that the density estimate $\widehat p$ is non-negative]. This means that $\widetilde m(x) = \widehat\alpha + \widetilde m_1(x_1) + \cdots + \widetilde m_d(x_d)$ is the projection in the space $L_2(\widehat p)$ of $\widehat m$ onto the affine subspace of additive functions $\{ m \in L_2(\widehat p) : m(x) = \alpha + m_1(x_1) + \cdots + m_d(x_d) \}$. This is a central point of our discussion. For projection operators backfitting is well understood (method of alternating projections, see below). Therefore, this interpretation will enable us to understand the convergence of the backfitting algorithm and the asymptotics of $\widetilde m_j$. We remark that not every backfitting algorithm based on iterative smoothing can be interpreted as an alternating projection method. The solution to (7) is characterized by the following system of equations ($j = 1, \ldots, d$):

$$\widetilde m_j(x_j) = \int \widehat m(x)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} - \sum_{k \neq j} \int \widetilde m_k(x_k)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} - \widehat\alpha, \tag{8}$$

$$\widehat\alpha = \int \widehat m(x)\, \widehat p(x)\, dx, \tag{9}$$

where $\widehat m_j(x_j) = n^{-1} \sum_{i=1}^n K_h(x_j - X_{ij})\, Y_i \big/ \widehat p_j(x_j)$ is the univariate Nadaraya-Watson regression smoother, in which $\widehat p_j(x_j) = \int \widehat p(x)\, dx_{-j}$ is the marginal of the density estimate $\widehat p(x)$. Straightforward algebra gives

$$\int \widehat m(x)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} = \widehat p_j^{\,-1}(x_j)\, n^{-1} \sum_{i=1}^n K_h(x_j - X_{ij})\, Y_i \int \prod_{\ell \neq j} K_h(x_\ell - X_{i\ell})\, dx_{-j} = \widehat m_j(x_j).$$

Furthermore, $\widehat\alpha = \int \widehat m(x)\, \widehat p(x)\, dx = \int \widehat r(x)\, dx$, and when $\int \bar w_i(x)\, dx = n^{-1}$ [as holds for the Nadaraya-Watson weights] we find, as in Hastie and Tibshirani (1991), that $\widehat\alpha = n^{-1} \sum_{i=1}^n Y_i$, i.e., that $\widehat\alpha$ is the sample mean. So $\widehat\alpha$ is a $\sqrt n$-consistent estimate of the population mean $\alpha$, and the randomness from this estimation is of smaller order and can be effectively ignored. Note also that

$$\widehat\alpha = \int \widehat m_j(x_j)\, \widehat p_j(x_j)\, dx_j \quad \text{for } j = 1, \ldots, d. \tag{10}$$

We therefore define a backfitting estimator $\widetilde m_j(x_j)$, $j = 1, \ldots, d$, as a solution to the system of equations

$$\widetilde m_j(x_j) = \widehat m_j(x_j) - \sum_{k \neq j} \int \widetilde m_k(x_k)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} - \widehat\alpha, \quad j = 1, \ldots, d,$$

with $\widehat\alpha$ defined by (10). Up to now we have assumed that multivariate estimates of the density and of the regression function exist. This assumption is not reasonable for large dimensions $d$ (or at least such estimates can perform very poorly). Furthermore, this assumption is not necessary. Note that (8) can be rewritten as

$$\widetilde m_j(x_j) = \widehat m_j(x_j) - \sum_{k \neq j} \int \widetilde m_k(x_k)\, \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_j(x_j)}\, dx_k - \widehat\alpha. \tag{11}$$

In this equation only two-dimensional marginals of $\widehat p$ are used. Note also that the solutions $\widetilde m_j(x_j)$ to (11) inherit the smoothness properties of $\widehat m(x)$ and $\widehat p(x)$. We can therefore estimate the derivatives of $m_j(x_j)$, for example, by

$$\frac{d^r \widetilde m_j(x_j)}{dx_j^r} = \frac{d^r \widehat m_j(x_j)}{dx_j^r} - \sum_{k \neq j} \int \widetilde m_k(x_k)\, \frac{d^r}{dx_j^r}\, \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_j(x_j)}\, dx_k, \quad r = 1, 2, \ldots
$$

In the next section we will discuss estimates $\widetilde m_j$ that are defined by (11), along with their asymptotic properties. In practice, our backfitting algorithm works as follows. One starts with an arbitrary initial guess $\widetilde m_j^{[0]}$ for $\widetilde m_j$. In the $j$-th step of the $r$-th iteration cycle one puts

$$\widetilde m_j^{[r]}(x_j) = \widehat m_j(x_j) - \sum_{k < j} \int \widetilde m_k^{[r]}(x_k)\, \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_j(x_j)}\, dx_k - \sum_{k > j} \int \widetilde m_k^{[r-1]}(x_k)\, \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_j(x_j)}\, dx_k - \widehat\alpha,$$

and the process is iterated until a desired convergence criterion is satisfied. The integrals are computed numerically; see section 4 below for further comments.
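The following grid-based sketch implements this iteration (our illustration, not the authors' code: the Gaussian kernel, equally spaced grids, plain Riemann sums for the integrals, the recentring step enforcing the empirical norming $\int \widetilde m_j\, \widehat p_j\, dx_j = 0$ from (7), and all function names are choices made here). It uses only the one- and two-dimensional marginals $\widehat p_j$ and $\widehat p_{jk}$, as in (11).

```python
import numpy as np

def kh(u, h):
    """Univariate Gaussian kernel K_h(u) = K(u/h)/h (illustrative choice)."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def backfit_nw(X, Y, h, G=101, max_cycles=100, tol=1e-8):
    """Backfitting estimates m_tilde_j of (11) on per-coordinate grids."""
    n, d = X.shape
    grids = [np.linspace(X[:, j].min(), X[:, j].max(), G) for j in range(d)]
    dx = [g[1] - g[0] for g in grids]
    # one-dimensional ingredients: kernel matrices, p_hat_j, and NW smooths m_hat_j
    Kmats = [kh(g[:, None] - X[:, j][None, :], h) for j, g in enumerate(grids)]
    p = [Km.mean(axis=1) for Km in Kmats]                    # p_hat_j on the grid
    m_hat = [(Km @ Y) / n / pj for Km, pj in zip(Kmats, p)]  # m_hat_j as in (8)
    # two-dimensional marginals p_hat_jk(x_j, x_k) on grid x grid
    p2 = {(j, k): Kmats[j] @ Kmats[k].T / n
          for j in range(d) for k in range(d) if j != k}
    alpha = Y.mean()                                         # alpha_hat, see (10)
    m = [np.zeros(G) for _ in range(d)]                      # starting values m^{[0]}
    for _ in range(max_cycles):
        change = 0.0
        for j in range(d):                                   # j-th step of the cycle
            corr = sum(p2[(j, k)] @ (m[k] * dx[k]) for k in range(d) if k != j)
            new = m_hat[j] - corr / p[j] - alpha
            new -= (new * p[j] * dx[j]).sum() / (p[j] * dx[j]).sum()  # norming, cf. (6)-(7)
            change = max(change, np.max(np.abs(new - m[j])))
            m[j] = new
        if change < tol:     # successive differences shrink geometrically; cf. Theorem 1 below
            break
    return grids, m, alpha
```

Bandwidth, grid size, and stopping rule are arbitrary here, and truncating the integrals to the sample range introduces boundary effects that this sketch ignores.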

3.1 Asymptotics for the Nadaraya-Watson-like Version

We consider now estimates $\widetilde m_j$ that are defined by (11), with $\widehat\alpha$ defined by (10), where $\widehat m_j$, $\widehat p_{jk}$ and $\widehat p_j$ are some given estimates. The next theorem gives conditions under which, with probability tending to one, there exists a solution $\widetilde m_j$ of (11) that is unique and that can be calculated by backfitting. Furthermore, the backfitting algorithm converges with geometric rate. Our assumptions, given below, are `high-level' and refer only to properties of $\widehat m_j$, $\widehat p_{jk}$ and $\widehat p_j$ [for example, we do not require that $p$ is the underlying density of $X$, or that $\widehat m_j$, $\widehat p_{jk}$ and $\widehat p_j$ are kernel estimates]; these properties can be verified for a range of smoothers under quite general heterogeneous and dependent sampling schemes, and we investigate this in section 5 below.

Assumptions. We suppose that there exists a density function $p$ on $\mathbb R^d$ with marginals

$$p_j(x_j) = \int p(x)\, dx_{-j} \qquad \text{and} \qquad p_{jk}(x_j, x_k) = \int p(x)\, dx_{-(j,k)} \quad \text{for } j \neq k.$$

(A1) For all $j \neq k$ it holds that

$$\int \frac{p_{jk}^2(x_j, x_k)}{p_k(x_k)\, p_j(x_j)}\, dx_j\, dx_k < \infty.$$

(A2) For all $j \neq k$ it holds that

$$\int \Big[ \frac{\widehat p_{jk}(x_j, x_k)}{p_k(x_k)\, \widehat p_j(x_j)} - \frac{p_{jk}(x_j, x_k)}{p_k(x_k)\, p_j(x_j)} \Big]^2 p_k(x_k)\, p_j(x_j)\, dx_j\, dx_k = o_P(1).$$

Furthermore, $\int \widehat m_j(x_j)\, \widehat p_j(x_j)\, dx_j \equiv$ const. By definition this constant is equal to $\widehat\alpha$, see (10).

(A3) There exists a constant $C$ such that, with probability tending to one, for all $j$,

$$\int \widehat m_j^2(x_j)\, p_j(x_j)\, dx_j \le C.$$

(A4) There exists a constant $C$ such that, with probability tending to one, for all $j \neq k$,

$$\sup_{x_k} \int \frac{\widehat p_{jk}^2(x_j, x_k)}{\widehat p_k^2(x_k)\, \widehat p_j(x_j)}\, dx_j \le C.$$

(A5) We suppose that for a sequence $\delta_n \downarrow 0$ the one-dimensional smoothers $\widehat m_j$ can be decomposed as $\widehat m_j = \widehat m_j^A + \widehat m_j^B$ with $\int \widehat m_j(x_j)\, \widehat p_j(x_j)\, dx_j$ not depending on $j$, and where the first component $\widehat m_j^A$ is mean zero and satisfies

$$\sup_{x_k} \Big| \int \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_k(x_k)}\, \widehat m_j^A(x_j)\, dx_j \Big| = o_P\Big( \frac{\delta_n}{\log n} \Big).$$

For $s = A$ and $s = B$ we define $\widetilde m_j^s$ as the solution of the following equation:

$$\widetilde m_j^s(x_j) = \widehat m_j^s(x_j) - \sum_{k \neq j} \int \widetilde m_k^s(x_k)\, \frac{\widehat p_{jk}(x_j, x_k)}{\widehat p_j(x_j)}\, dx_k - \widehat\alpha_s, \tag{12}$$

where $\widehat\alpha_s = \int \widehat m^s(x)\, \widehat p(x)\, dx$. Existence and uniqueness of $\widetilde m_j^A$ and $\widetilde m_j^B$ is stated in the next theorem. Note that $\widetilde m_j^s$ is defined as $\widetilde m_j$ in equation (11) with $\widehat m_j$ replaced by $\widehat m_j^s$. We suppose that for (deterministic) functions $\beta_{jn}(\cdot)$ the term $\widetilde m_j^B$ satisfies

$$\widetilde m_j^B(x_j) = \beta_{jn}(x_j) + o_P(\delta_n).$$

These conditions, which we discuss further below, are all straightforward to verify, except perhaps (A5), and turn out to be weaker than those made by Opsomer and Ruppert (1997).

The following result is crucial in establishing the asymptotic properties of the estimates.

Theorem 1 [Convergence of backfitting]. Suppose that conditions (A1)-(A2) hold. Then, with probability tending to one, there exists a solution $\widetilde m_j$ of (11) and (10) that is unique. Furthermore, there exist constants $0 < \gamma < 1$ and $c > 0$ such that, with probability tending to one, the following inequality holds:

$$\int \big[ \widetilde m_j^{[r]}(x_j) - \widetilde m_j(x_j) \big]^2\, p_j(x_j)\, dx_j \le c\, \gamma^{2r} \int \widetilde m^{[0]}(x)^2\, p(x)\, dx. \tag{13}$$

Here, for $r = 0$, the function $\widetilde m^{[r]}(x) = \widetilde m_1^{[r]}(x_1) + \cdots + \widetilde m_d^{[r]}(x_d)$ is the starting value of the backfitting algorithm.


Furthermore, for $s = A$ and $s = B$, with probability tending to one there exists a solution $\widetilde m_j^s$ of (12) that is unique.

Our next theorem states that the stochastic part of the backfitting estimate is easy to understand. It coincides with the stochastic part of a one-dimensional smooth. Therefore, for an understanding of the asymptotic properties of the backfitting estimate it remains to study its asymptotic bias. This will be done after the theorem for the special case that an asymptotic theory is available for the pilot estimate $\widehat m$.

Theorem 2. Suppose that conditions (A1)-(A5) hold for a sequence $\delta_n$. Then it holds that

$$\sup_{x_j} \big| \widetilde m_j^A(x_j) - \widehat m_j^A(x_j) \big| = o_P(\delta_n).$$

In particular, one gets

$$\widetilde m_j(x_j) = \widehat m_j^A(x_j) + \beta_{jn}(x_j) + o_P(\delta_n).$$

We now apply Theorem 2 to the case that full-dimensional pilot estimates $\widehat p(x)$, $\widehat r(x)$ and $\widehat m(x) = \widehat r(x)/\widehat p(x) = \sum_{i=1}^n w_i(x)\, Y_i$ exist and that $\widehat\alpha, \widetilde m_1, \ldots, \widetilde m_d$ are defined as minimizers of (7) [i.e., $\widehat\alpha + \widetilde m_1 + \cdots + \widetilde m_d$ is the projection of $\widehat m$ onto the class of additive functions in $L_2(\widehat p)$]. For the one-dimensional smooths $\widehat m_j$ we have, with appropriate weights $w_{ji}(x_j)$, that

$$\widehat m_j(x_j) = \int \widehat m(x)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} = \sum_{i=1}^n w_{ji}(x_j)\, Y_i.$$

We compare now the estimate $\widetilde m_j$ with the infeasible estimate $\bar m_j$ that uses the knowledge of the other components $m_l$ with $l \neq j$. More precisely, we define the infeasible estimator $\bar m_j(x_j)$ to be the one-dimensional smooth of the unobserved data $\bar Y_i = m_j(X_{ij}) + \varepsilon_i$, with $\varepsilon_i = Y_i - [\alpha + \sum_{k=1}^d m_k(X_{ik})]$, on $X_{ij}$; thus

$$\bar m_j(x_j) = \sum_{i=1}^n w_{ji}(x_j)\, \bar Y_i, \quad j = 1, \ldots, d. \tag{14}$$

Then, under appropriate regularity conditions,

$$n^{2/5} \{ \bar m_j(x_j) - m_j(x_j) \} \Longrightarrow N\{ b_j(x_j), v_j(x_j) \}, \quad j = 1, \ldots, d, \tag{15}$$

for certain functions $b_j(\cdot)$ and $v_j(\cdot)$. Moreover, because $\operatorname{cov}\{ \bar m_j(x_j), \bar m_k(x_k) \} = o(n^{-4/5})$, one has that

$$n^{2/5} \{ \bar m_j(x_j) - m_j(x_j) \} \ \text{and} \ n^{2/5} \{ \bar m_k(x_k) - m_k(x_k) \} \ \text{are asymptotically independent for } j \neq k. \tag{16}$$

The additional information that $\int m_j(x_j)\, p_j(x_j)\, dx_j = 0$ may have some value, and we can define the mean-corrected version of $\bar m_j(x_j)$ by $\bar m_j^c(x_j) = \bar m_j(x_j) - n^{-1} \sum_{i=1}^n \bar m_j(X_{ij})$, which has the same asymptotic variance as $\bar m_j(x_j)$ but bias $b_j^c(x_j) = b_j(x_j) - \int b_j(x_j)\, p_j(x_j)\, dx_j$.

We suppose now that our conditions hold with $\widehat m^A(x) = \sum_{i=1}^n w_i(x)\, \varepsilon_i$ and $\widehat m^B(x) = \sum_{i=1}^n w_i(x)\, m(X_i)$. One can decompose

$$\widehat m_j^A(x_j) = \int \widehat m^A(x)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} = \sum_{i=1}^n w_{ji}(x_j)\, \varepsilon_i, \qquad \widehat m_j^B(x_j) = \int \widehat m^B(x)\, \frac{\widehat p(x)}{\widehat p_j(x_j)}\, dx_{-j} = \sum_{i=1}^n w_{ji}(x_j)\, m(X_i).$$

Suppose now that it can be shown for a function $b$ that

$$\widehat m^B(x) = m(x) + n^{-2/5}\, b(x) + o_P(n^{-2/5}). \tag{17}$$

We have the following

Corollary 1. Suppose that conditions (A1)-(A5) hold with $\delta_n = n^{-2/5}$ and that (14)-(17) apply.

Then

$$n^{2/5} \begin{bmatrix} \widetilde m_1(x_1) - m_1(x_1) \\ \vdots \\ \widetilde m_d(x_d) - m_d(x_d) \end{bmatrix} \Longrightarrow N\left( \begin{bmatrix} b_1^*(x_1) \\ \vdots \\ b_d^*(x_d) \end{bmatrix},\ \begin{bmatrix} v_1^*(x_1) & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & v_d^*(x_d) \end{bmatrix} \right),$$

where $v_j^*(x_j) = v_j(x_j)$, $j = 1, \ldots, d$, are defined above, while the $b_j^*(x_j)$ are solutions to the following minimization problem:

$$\min_{b_0,\, b_1^*(\cdot), \ldots, b_d^*(\cdot)} \int \big[ b(x) - b_0 - b_1^*(x_1) - \cdots - b_d^*(x_d) \big]^2\, p(x)\, dx \quad \text{s.t.} \quad \int b_j^*(x_j)\, p_j(x_j)\, dx_j = 0, \quad j = 1, \ldots, d.$$

For the special case that the function $b$ is already of additive form, $b(x) = b_1(x_1) + \cdots + b_d(x_d)$, the bias functions $b_j^*(x_j)$ coincide with the bias $b_j^c(x_j)$ of the `corrected' oracle estimate $\bar m_j^c(x_j)$. Also

$$n^{2/5} \{ \widetilde m(x) - m(x) \} \Longrightarrow N\big[ b^+(x), v^+(x) \big],$$

where $b^+(x) = \sum_j b_j^*(x_j)$ and $v^+(x) = \sum_j v_j(x_j)$.

Suppose additionally that for a sequence $\tau_n$ with $n^{-2/5} = o(\tau_n)$,

$$\sup_x \big| \widehat m^B(x) - m(x) - n^{-2/5}\, b(x) \big| = O_P(\tau_n), \qquad \sup_{x_j} \big| \bar m_j(x_j) - m_j(x_j) \big| = O_P(\tau_n) \quad \text{for } j = 1, \ldots, d.$$

Then, we have for $j = 1, \ldots, d$,

$$\sup_{x_j} \big| \widetilde m_j(x_j) - \bar m_j(x_j) \big| = O_P(\tau_n).$$
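To see the oracle phenomenon of Corollary 1 numerically, one can compare the backfitting component with the infeasible oracle smooth (14) computed from the unobservable data. The sketch below reuses kh and backfit_nw from the sketch in the previous section; the design, bandwidth, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 500, 0.25
X = rng.uniform(-1.0, 1.0, size=(n, 2))
m2 = X[:, 1] ** 2 - 1.0 / 3.0                     # centred second component on [-1, 1]
Y = np.sin(np.pi * X[:, 0]) + m2 + 0.2 * rng.standard_normal(n)

grids, m_tilde, alpha = backfit_nw(X, Y, h)

# oracle data Y_bar_i = Y_i - alpha - m_2(X_i2), smoothed on X_i1 as in (14)
Ybar = Y - alpha - m2
K1 = kh(grids[0][:, None] - X[:, 0][None, :], h)
m_bar = (K1 @ Ybar) / K1.sum(axis=1)              # univariate NW smooth of the oracle data

# the two curves should be close, up to sampling and boundary error
print(np.max(np.abs(m_tilde[0] - m_bar)))
```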

4 Estimation with Local Polynomials

We discuss now local polynomials. For simplicity of notation we consider only local linear smoothing. All arguments and theoretical results given for this special case can be easily generalized to local polynomials of higher degree.

Backfitting estimators based on local polynomials can be written in the form of equation (7) by choosing

$$\widehat p(x) = \widehat V^{00}(x) - \widehat V^{0,-0}(x)\, \big[ \widehat V^{-0,-0}(x) \big]^{-1}\, \widehat V^{-0,0}(x),$$

where

$$\widehat{\mathbf V}(x) = \begin{pmatrix} \widehat V^{00}(x) & \widehat V^{0,-0}(x) \\ \widehat V^{-0,0}(x) & \widehat V^{-0,-0}(x) \end{pmatrix} \equiv \frac1n\, \mathbf X(x)^T \mathbf K(x)\, \mathbf X(x)$$

[relative to (2) we normalize $\widehat{\mathbf V}$ by $n^{-1}$ here; the factor cancels in (2)], with the scalar $\widehat V^{00}(x) = n^{-1} \sum_{i=1}^n \prod_{\ell=1}^d K_h(X_{i\ell} - x_\ell)$, and with $\widehat V^{-0,0}(x)$ and $\widehat V^{-0,-0}(x)$ defined appropriately; the index $-0$ collects the components $1, \ldots, d$.

This approach has two disadvantages. First, it may work only in low dimensions [since for the asymptotics, existence of the inverse $[\widehat V^{-0,-0}(x)]^{-1}$ and convergence of $\widehat V^{-0,-0}(x)$ are required under our assumptions, and this may hold only for a low-dimensional argument $x$]. Second, the corresponding backfitting algorithm does not consist in iterative local polynomial smoothing.

We now discuss another approach based on local polynomials that works in higher dimensions and that is based on iterative local polynomial smoothing. We motivate this approach for the case that $\widehat{\mathbf V}(x)$ does exist, but we will see that the definition of the backfitting estimate is based on only one- and two-dimensional `marginals' of $\widehat{\mathbf V}(x)$. So its asymptotic treatment requires only consistency of these marginals, and the asymptotics work also for higher dimensions. This is similar to the discussion in the last section, where consistency was needed only for one- and two-dimensional marginals of the kernel density estimate $\widehat p$.

For functions $f = (f_0, \ldots, f_d)$ with components $f_j : \mathbb R^d \to \mathbb R$ [the class of all such tuples will be denoted $\mathcal M$], and a $(d+1) \times (d+1)$ positive definite matrix function $M(\cdot)$, define the norm

$$\|f\|_M^2 = \int f(x)^T M(x)\, f(x)\, dx.$$

There is a one-to-one correspondence between such functions $f$ and functions in $\mathcal F_{\mathrm{full}}$. Furthermore, taking $M = \widehat{\mathbf V}$, the norm $\|\cdot\|_M$ is simply the norm induced by the norm $\|\cdot\|$. In Section 2 our version $\widetilde{\mathbf m}(x) = (\widetilde m_0(x), \ldots, \widetilde m_d(x))^T$ of the backfitting estimate was defined as the projection of the function in $\mathcal F_{\mathrm{full}}$ [corresponding to $\widehat{\mathbf m}$, see (1)] with respect to $\|\cdot\|$ onto the space $\mathcal F_{\mathrm{add}}$. Therefore, $\widetilde{\mathbf m}$ coincides with the $L_2(\widehat{\mathbf V})$ projection, with respect to the norm $\|f\|_{\widehat{\mathbf V}}$, of $\widehat{\mathbf m}$ onto the subspace $\mathcal M_{\mathrm{add}}$, where

$$\mathcal M_{\mathrm{add}} = \Big\{ \mathbf u(x) = (u_0(x), \ldots, u_d(x))^T \in \mathcal M :\ u_0(x) = \alpha + u_1(x_1) + \cdots + u_d(x_d) \text{ and } u_\ell(x) = w_\ell(x_\ell) \text{ for } \ell = 1, \ldots, d, \text{ where } u_1, \ldots, u_d \text{ are functions } \mathbb R \to \mathbb R \text{ with } \int \widehat V_j^{00}(x_j)\, u_j(x_j)\, dx_j = 0 \text{ for } j = 1, \ldots, d, \text{ and where } w_\ell,\ \ell = 1, \ldots, d, \text{ are functions } \mathbb R \to \mathbb R \Big\},$$

where for each $j$ the $(d+1) \times (d+1)$ matrix $\widehat{\mathbf V}_j(x_j) = \int \widehat{\mathbf V}(x)\, dx_{-j}$. The class $\mathcal M_{\mathrm{add}}$ contains functions that are additive in the first component [for $\ell = 0$] and whose other components [for $\ell = 1, \ldots, d$] depend only on a one-dimensional argument. A function $f$ in $\mathcal M_{\mathrm{add}}$ is specified by a constant and $2d$ functions $\mathbb R \to \mathbb R$. Because $f_\ell$, $\ell = 1, \ldots, d$, depend only on one argument, in abuse of notation we also write $f_\ell(x_\ell)$ instead of $f_\ell(x)$. Note that there is a one-to-one correspondence between elements of $\mathcal M_{\mathrm{add}}$ and $\mathcal F_{\mathrm{add}}$.

We now discuss how $\widetilde{\mathbf m}$ is calculated by backfitting. Note that $\widetilde{\mathbf m}$ is defined as the minimizer of $\|\widehat{\mathbf m} - \mathbf m\|_{\widehat{\mathbf V}}$. Recall that this is equivalent to minimizing $\|\mathbf Y - \mathbf m\|^2$ over $\mathcal F_{\mathrm{add}}$. We discuss now minimization of this term with respect to the $j$-th components, namely $m_j(x_j)$ and the slope component $m_j^{(1)}(x_j) := w_j(x_j)$. Define for each $j$

$$\|f\|_j^2(x_j) = \int \frac1n \sum_{i=1}^n \Big[ f_{i0}(x) + \sum_{k=1}^d f_{ik}(x)\, \frac{x_k - X_{ik}}{h} \Big]^2 \prod_{k=1}^d K_h(X_{ik} - x_k)\, dx_{-j},$$

and note the obvious fact that

$$\|f\|^2 = \int \|f\|_j^2(x_j)\, dx_j, \quad j = 1, \ldots, d.$$

Therefore, because an integral is minimized by minimizing the integrand, our problem is solved by minimizing $\|\mathbf Y - \mathbf m\|_j^2(x_j)$ for fixed $x_j$ with respect to $m_j(x_j)$ and $m_j^{(1)}(x_j)$, for $j = 1, \ldots, d$. After some standard calculations, this leads to:

$$\widetilde m_j(x_j)\, \widehat V_j^{00}(x_j) + \widetilde m_j^{(1)}(x_j)\, \widehat V_j^{j0}(x_j) = \frac1n \sum_{i=1}^n K_h(X_{ij} - x_j)\, Y_i - \widehat\alpha\, \widehat V_j^{00}(x_j) - \sum_{\ell \neq j} \int \widetilde m_\ell(x_\ell)\, \widehat V_{\ell j}^{00}(x_\ell, x_j)\, dx_\ell - \sum_{\ell \neq j} \int \widetilde m_\ell^{(1)}(x_\ell)\, \widehat V_{\ell j}^{\ell 0}(x_\ell, x_j)\, dx_\ell, \tag{18}$$

$$\widetilde m_j(x_j)\, \widehat V_j^{j0}(x_j) + \widetilde m_j^{(1)}(x_j)\, \widehat V_j^{jj}(x_j) = \frac1n \sum_{i=1}^n \frac{X_{ij} - x_j}{h}\, K_h(X_{ij} - x_j)\, Y_i - \widehat\alpha\, \widehat V_j^{j0}(x_j) - \sum_{\ell \neq j} \int \widetilde m_\ell(x_\ell)\, \widehat V_{\ell j}^{0j}(x_\ell, x_j)\, dx_\ell - \sum_{\ell \neq j} \int \widetilde m_\ell^{(1)}(x_\ell)\, \widehat V_{\ell j}^{\ell j}(x_\ell, x_j)\, dx_\ell. \tag{19}$$

Here, we have used one- and two-dimensional marginals of the matrix $\widehat{\mathbf V}$:

$$\widehat{\mathbf V}_r(x_r) = \int \widehat{\mathbf V}(x)\, dx_{-r}, \tag{20}$$

$$\widehat{\mathbf V}_{rs}(x_r, x_s) = \int \widehat{\mathbf V}(x)\, dx_{-(r,s)}. \tag{21}$$

The elements of these matrices are denoted by $\widehat V_r^{pq}(x_r)$ and $\widehat V_{rs}^{pq}(x_r, x_s)$ with $p, q = 0, \ldots, d$. Together with the norming condition

$$\int \widetilde m_j(x_j)\, \widehat V_j^{00}(x_j)\, dx_j = 0, \tag{22}$$

equations (18) and (19) define $\widehat\alpha$, $\widetilde m_j$ and $\widetilde m_j^{(1)}$ for given $\mathbf Y$ and $[\widetilde m_\ell, \widetilde m_\ell^{(1)} : \ell \neq j]$.

Equations (18) and (19) can be rewritten as

$$\widetilde m_j(x_j) = \check m_j(x_j) + \omega_j(x_j), \tag{23}$$

$$\widetilde m_j^{(1)}(x_j) = \check m_j^{(1)}(x_j) + \omega_j^{(1)}(x_j), \tag{24}$$

where $\check m_j(x_j)$, $\omega_j(x_j)$, $\check m_j^{(1)}(x_j)$ and $\omega_j^{(1)}(x_j)$ are defined by:

$$\check m_j(x_j)\, \widehat V_j^{00}(x_j) + \check m_j^{(1)}(x_j)\, \widehat V_j^{j0}(x_j) = \frac1n \sum_{i=1}^n K_h(X_{ij} - x_j)\, Y_i. \tag{25}$$
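A grid-based sketch of one backfitting step for this system, solving the 2x2 systems (18)-(19) pointwise (our illustration under the same Gaussian-kernel and Riemann-sum conventions as the earlier sketches; cycling over j and recentring via (22) completes the algorithm):

```python
import numpy as np

def kh(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def ll_update_j(j, m, m1, X, Y, grids, h, alpha):
    """Update the value m[j] and slope m1[j] on the grid via (18)-(19).

    Uses only the one- and two-dimensional marginal elements of V_hat
    from (20)-(21), estimated directly from the data.
    """
    n, d = X.shape
    Dj = (X[:, j][None, :] - grids[j][:, None]) / h          # (X_ij - x_j)/h, shape (G, n)
    Kj = kh(X[:, j][None, :] - grids[j][:, None], h)
    V00, Vj0, Vjj = Kj.mean(1), (Dj * Kj).mean(1), (Dj**2 * Kj).mean(1)   # elements of (20)
    r0 = (Kj @ Y) / n - alpha * V00                          # right side of (18)
    r1 = ((Dj * Kj) @ Y) / n - alpha * Vj0                   # right side of (19)
    for l in range(d):
        if l == j:
            continue
        dxl = grids[l][1] - grids[l][0]
        Dl = (X[:, l][None, :] - grids[l][:, None]) / h
        Kl = kh(X[:, l][None, :] - grids[l][:, None], h)
        r0 -= (m[l] * dxl) @ (Kl @ Kj.T / n)                 # V_{lj}^{00} term of (18)
        r0 -= (m1[l] * dxl) @ ((Dl * Kl) @ Kj.T / n)         # V_{lj}^{l0} term of (18)
        r1 -= (m[l] * dxl) @ (Kl @ (Dj * Kj).T / n)          # V_{lj}^{0j} term of (19)
        r1 -= (m1[l] * dxl) @ ((Dl * Kl) @ (Dj * Kj).T / n)  # V_{lj}^{lj} term of (19)
    det = V00 * Vjj - Vj0**2                                 # 2x2 solve at each grid point
    m[j] = (Vjj * r0 - Vj0 * r1) / det
    m1[j] = (V00 * r1 - Vj0 * r0) / det
```

Sweeping over j = 1, ..., d as in section 3, recentring each m[j] after its update to impose (22), and iterating to convergence yields the local linear backfitting estimator.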
