Efficient Estimation in Single-Index Regression

Michel DELECROIX

ENSAI and CREST Rue Blaise Pascal Campus de Ker Lann 35170 Bruz, FRANCE

Wolfgang HÄRDLE

Institut für Statistik und Ökonometrie, Humboldt-Universität zu Berlin

Spandauer Str. 1 D-10178 Berlin, GERMANY

Marian HRISTACHE

ENSAI and CREST Rue Blaise Pascal Campus de Ker Lann 35170 Bruz, FRANCE

January 15, 1997

Abstract

Semiparametric single-index regression involves an unknown finite-dimensional parameter and an unknown (link) function. We consider estimation of the parameter via the pseudo maximum likelihood method. For this purpose we estimate the conditional density of the response given a candidate index and maximize the obtained likelihood. We show that this technique of adaptation yields an asymptotically efficient estimator: it has minimal asymptotic variance among all estimators.

Acknowledgements: The second author was supported by Sonderforschungsbereich 373 "Quantifikation und Simulation Ökonomischer Prozesse", Deutsche Forschungsgemeinschaft. This work was substantially completed while the first and third authors were visiting Sonderforschungsbereich 373.

The authors thank V. Spokoiny and J. Horowitz for helpful discussions and comments.


1 Introduction

A single-index response model has the form

$$E(Y \mid X = x) = E(Y \mid X'\theta = x'\theta) \qquad (1.1)$$

where $Y$ is a scalar dependent variable, $X$ is a $d$-dimensional vector of explanatory variables, and $x'\theta$, the scalar product of $x$ with a vector $\theta$ of parameters whose values are unknown, is the index. Many widely used parametric models have this form. Examples are linear regression, binary logit and probit, and tobit models.

These models assume in (1.1) a "link" between the index $x'\theta$ and the response. In linear regression, for example, this link is the identity. In the logit model it is the conditional distribution function of a logistic distribution. In this paper we consider estimation of the parameter $\theta$ in (1.1) without imposing further restrictions on this link. Moreover, we derive the asymptotic normal distribution of this estimator and show that it is efficient in the sense of achieving minimal variance among all estimators of $\theta$.
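For concreteness (an illustration we add here, writing the link as $\psi$, a notation not used elsewhere in the paper), the linear and logit examples correspond to

$$E(Y \mid X = x) = \psi(x'\theta), \qquad \psi_{\mathrm{lin}}(u) = u, \qquad \psi_{\mathrm{logit}}(u) = \frac{1}{1 + e^{-u}},$$

and the semiparametric problem treated here is to estimate $\theta$ while leaving $\psi$ unspecified.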

Adaptation in the linear regression model has been considered by Carroll (1982) and Robinson (1987). They consider the case of unknown heteroskedasticity of the error variables and show that a nonparametric estimate of the unknown heteroskedasticity function gives full adaptation, in the sense of yielding the same variance as the Aitken estimator with known heteroskedasticity function. The model (1.1) is more general since it allows arbitrary relations between the index $X'\theta$ and $Y$.

Several estimators of $\theta$ that do not require a fully parametric specification of (1.1) already exist. Ichimura (1993) developed a semiparametric least squares estimator of $\theta$. This estimator is closely related to projection pursuit regression (Friedman and Stuetzle (1981)) since it minimizes a least squares criterion based on nonparametric estimation of the link. Han (1987) and Sherman (1993) describe a maximum rank correlation estimator.

Klein and Spady (1993) developed a quasi-maximum likelihood estimator for the case in which $Y$ is a binary response. This estimator achieves the asymptotic efficiency bound of Cosslett (1987) if the link is a conditional distribution function. Horowitz and Härdle (1995) considered fast non-iterative methods for single-index models in the case of discrete covariates. The estimators of Ichimura, Han, Klein and Spady, Sherman, and Horowitz and Härdle are $n^{1/2}$-consistent and asymptotically normal under regularity conditions.

The foregoing estimators were designed for specific data situations, such as discrete covariates or binary responses, or for computational efficiency. The focus on such particular aspects makes them not necessarily efficient. The variance of the direct (computationally efficient) estimator of Horowitz and Härdle (1995), for example, is not the best possible one, as computed in Klein and Spady (1993). The object of this paper is to construct an asymptotically efficient estimator for general single-index response models. It deviates from the ideas of projection pursuit since we use a pseudo maximum likelihood criterion.

Our method is based on nonparametric estimation of the semiparametric conditional density $f(y, x'\theta)$ of the distribution $\mathcal{L}(Y \mid X)$. We do not assume a specific structure, such as a binary response as in Klein and Spady (1993), for this conditional density. Our approach thus covers efficient estimation in linear regression with unknown error distribution (Bickel (1982)) as well as nonlinear response models with single-index structure (see Huet, Jolivet, Messean (1989)).


Suppose we are given i.i.d. observations $Z_i = (X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$, with

$$E(Y_i \mid X_i = x) = E(Y_i \mid X_i'\theta_0 = x'\theta_0), \qquad i = 1, \ldots, n, \qquad (1.2)$$

where $\theta_0 \in \mathbb{R}^d$ is the true value of the parameter in the model. Assume that for all $x \in \operatorname{supp} X$ the conditional density $f(y \mid x)$ of $Y$ given $X = x$ with respect to a $\sigma$-finite measure $\mu$ exists. This density is supposed to depend upon $x$ through $x'\theta_0$. Thus a positive function $f$ defined on $\operatorname{supp} Y \times M$ ($M \subset \mathbb{R}$) is given, satisfying:

$$f(y \mid x) = f(y, x'\theta_0), \qquad (x, y) \in \operatorname{supp} Z. \qquad (1.3)$$

The main idea of our estimator is to estimate the function $f$ in (1.3) and then to optimize the (estimated) pseudo-likelihood over the parameter vector $\theta$. This technique is called pseudo maximum likelihood estimation (PMLE). We use the kernel estimation method here since it is easy to compute in practice and auxiliary asymptotic results are available in the literature. In order to present our estimator we need some more notation.

Let $S$ be a fixed subset of the support of $Z = (X, Y)$ and let $S_X = \{x : \exists y \text{ s.t. } (x, y) \in S\}$. We assume that for all $x$ in $S_X$ and all $\theta$ in $\Theta$, one can define the conditional density $f_\theta(y, x'\theta)$ of $Y$ given $X'\theta = x'\theta$ and $Z \in S$. We then define $n$ estimators $\hat f_{\theta,i}$ of $f_\theta$ at the point $(y, x'\theta)$, for $(x, y)$ in the fixed subset $S$, by

$$\hat f_{\theta,i}(y, x'\theta) = N_{\theta,in}(y, x'\theta) / D_{\theta,in}(x'\theta) \qquad (1.4)$$

with

$$N_{\theta,in}(y, t) = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} K_{h_n}(y - Y_j)\, K_{h_n}(t - X_j'\theta)\, I\{Z_j \in S\},$$

$$D_{\theta,in}(t) = \frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} K_{h_n}(t - X_j'\theta)\, I\{Z_j \in S\}, \qquad (1.5)$$

where $h_n$ is the bandwidth, $K$ is a fixed kernel density, and $K_h(\cdot) = K(\cdot/h)/h$.

We define $\hat\theta_n$ to be the solution of

$$\hat L_n(\hat\theta_n) = \max_{\theta \in \Theta} \hat L_n(\theta) \qquad (1.6)$$

with

$$\hat L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \hat f_{\theta,i}(Y_i, X_i'\theta)\, I\{Z_i \in S\}. \qquad (1.7)$$

Let $L_n$ be the log-likelihood function defined by

$$L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f_\theta(Y_i, X_i'\theta)\, I\{Z_i \in S\}. \qquad (1.8)$$

Define also

$$L(\theta) = E\left\{\log f_\theta(Y_i, X_i'\theta)\, I\{Z_i \in S\}\right\}. \qquad (1.9)$$

The idea is to maximize the proxy $\hat L_n(\theta)$ for $L_n(\theta)$, which itself is a proxy for $L(\theta)$.
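To illustrate how (1.4)-(1.7) can be computed, the following sketch (our code, not part of the paper; it uses a Gaussian kernel for brevity, whereas assumption (C($\gamma$)) in Section 2 requires a fourth-order kernel) evaluates the leave-one-out density estimates and the criterion $\hat L_n(\theta)$ for a candidate $\theta$:

```python
import numpy as np

def gauss_kernel(u):
    # Gaussian kernel, used only to keep the sketch short; the paper's
    # assumption (C(gamma)) requires a symmetric fourth-order kernel
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def pseudo_log_likelihood(theta, X, Y, h, in_S):
    """Criterion L_hat_n(theta) of (1.7), built from the leave-one-out
    kernel estimates (1.4)-(1.5).

    X: (n, d) covariates, Y: (n,) responses, h: bandwidth h_n,
    in_S: (n,) boolean trimming indicator I{Z_i in S}.
    """
    n = len(Y)
    t = X @ theta                        # candidate indices X_i' theta
    total = 0.0
    for i in range(n):
        if not in_S[i]:
            continue
        j = np.arange(n) != i            # leave observation i out
        w = gauss_kernel((t[i] - t[j]) / h) * in_S[j]
        num = np.mean(w * gauss_kernel((Y[i] - Y[j]) / h)) / h ** 2  # N_{theta,in}
        den = np.mean(w) / h                                         # D_{theta,in}
        total += np.log(num / den)       # log f_hat_{theta,i}(Y_i, X_i' theta)
    return total / n
```

Maximizing this criterion over $\theta$, subject to an identification constraint such as fixing the last component at 1 (the normalization used in Section 5), yields $\hat\theta_n$.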


2 Consistency of the semiparametric estimator

First, we show that the estimate $\hat\theta_n$ defined in (1.6) converges almost surely to $\theta_0$ as $n$ tends to $\infty$. We shall prove the following:

$$\sup_{\theta \in \Theta} |\hat L_n(\theta) - L_n(\theta)| \xrightarrow[n \to \infty]{a.s.} 0 \qquad (2.1)$$

and

$$\sup_{\theta \in \Theta} |L_n(\theta) - L(\theta)| \xrightarrow[n \to \infty]{a.s.} 0. \qquad (2.2)$$

Then, if $L(\theta)$ has a unique maximum at $\theta_0$, the PMLE $\hat\theta_n$ converges almost surely towards this maximum. The precise assumptions are as follows:

(A1) $(X_i, Y_i)$ are i.i.d. random vectors.

(A2) $\Theta$ is a compact subset of $\mathbb{R}^d$.

(A3) The random vectors $(Y_i, X_i)$ have a continuous distribution.

(A4) The compact subset $S$ of the support of $Z_i = (Y_i, X_i)$ is such that:

i) for all $\theta$ in $\Theta$, the conditional density $h_\theta$ of $X_i'\theta$ given that $Z_i \in S$ verifies $\inf_{(\theta, x) \in \Theta \times S_X} h_\theta(x'\theta) > 0$;

ii) $\inf_{(\theta, z) \in \Theta \times S} f_\theta(y, x'\theta) > 0$, where $z = (x, y)$.

(A5) i) $h_\theta$ and $f_\theta$ are uniformly continuous;

ii) $h_\theta(t)$ and $f_\theta(y, t)$ are three times differentiable with respect to $t$, and the third derivatives satisfy Lipschitz conditions for $t \in T_\theta(S)$, uniformly in $\theta \in \Theta$ and $y \in S_Y$;

iii) $R_\theta(x) := f_\theta(y, x'\theta)$ and $D_\theta(x) := h_\theta(x'\theta)$ are twice continuously differentiable with respect to $\theta$ on $S$.

(A6) There exists a unique $\theta_0 \in \Theta$ such that relation (1.3) holds.

(A7) i) For all $\theta \in \Theta$ and $x \in S_X$, the distributions $P_\theta$ and $P_{\theta_0}$ defined by the densities $f_\theta(\cdot, x'\theta)$ and $f_{\theta_0}(\cdot, x'\theta_0)$ are equivalent.

ii) There exists a subset $A \subset S_X$ of positive Lebesgue measure such that $X_i$ is continuous on $A$.

(A8) The matrix

$$M = E\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta = \theta_0} I\{Z_i \in S\}\right]$$

is positive definite.

(C($\gamma$)) $K$ is a real symmetric fourth-order kernel and $h_n = c\, n^{-\gamma}$ with $c > 0$.

The following preliminary results are shown in the appendix.


Lemma 2.1 Under the assumptions (A1)-(A5) and (C($\gamma$)) with $\gamma \in \,]0, 1/3[$, we have

$$\sup_{z \in S}\, \sup_{\theta \in \Theta} \left|\log \hat f_\theta(y, x'\theta) - \log f_\theta(y, x'\theta)\right| \xrightarrow[n \to \infty]{a.s.} 0.$$

Lemma 2.2 Under the assumptions (A1) through (A5) and (C($\gamma$)) with $\gamma \in \,]0, 1/3[$, we have:

$$\sup_{\theta \in \Theta} \left\{\left|\hat L_n(\theta) - L_n(\theta)\right| + \left|L_n(\theta) - L(\theta)\right|\right\} \xrightarrow[n \to \infty]{a.s.} 0.$$

From Lemma 2.2 it follows that $\sup_{\theta \in \Theta} |\hat L_n(\theta) - L(\theta)| \xrightarrow[n \to \infty]{a.s.} 0$.

Remark: Inspection of the proofs of Lemmas 2.1 and 2.2 shows that we do not need (A5) in its full strength; Lipschitz continuity is sufficient. For better exposition we use the stronger smoothness assumption throughout.

Lemma 2.3 Under assumptions (A1), (A3) and (A5)-(A7), the function $L(\theta)$ has a unique maximum at $\theta_0$.

The proof relies on the properties of Kullback information and can be found in Bonneu, Delecroix and Hristache (1995). Application of Gourieroux and Monfort (1989, page 431) together with Lemmas 2.1 and 2.2 yields the following:

Theorem 2.1 Under assumptions (A1)-(A7) and (C($\gamma$)) with $\gamma \in \,]0, 1/3[$, the estimator $\hat\theta_n$ defined in (1.6) satisfies:

$$\hat\theta_n \xrightarrow[n \to \infty]{a.s.} \theta_0.$$

3 Asymptotic distribution of the semiparametric estimator

In order to obtain the asymptotic normality of $\hat\theta_n$ we show uniform convergence of the first and second derivatives of $\hat f_\theta$.

Lemma 3.1 Under assumptions (A1)-(A5) and (C($\gamma$)) with $\gamma \in \,]1/8, 1/6[$,

$$n^{1/4} \sup_{(z, \theta) \in S \times \Theta} \left|\hat f_\theta(Y, X'\theta) - f_\theta(Y, X'\theta)\right| \xrightarrow[n \to \infty]{P} 0, \qquad (3.1)$$

$$n^{1/4} \sup_{(z, \theta) \in S \times \Theta} \left|\frac{\partial \hat f_\theta(Y, X'\theta)}{\partial \theta} - \frac{\partial f_\theta(Y, X'\theta)}{\partial \theta}\right| \xrightarrow[n \to \infty]{P} 0, \qquad (3.2)$$

$$\sup_{(z, \theta) \in S \times \Theta} \left|\frac{\partial^2 \hat f_\theta(Y, X'\theta)}{\partial \theta\,\partial \theta'} - \frac{\partial^2 f_\theta(Y, X'\theta)}{\partial \theta\,\partial \theta'}\right| \xrightarrow[n \to \infty]{P} 0. \qquad (3.3)$$


We will show that $-\hat L_n(\theta)$ verifies the assumptions of Lemma 5.1 of Ichimura (1993). This is a consequence of Lemma 3.1 and of the following:

Lemma 3.2 Under the assumptions (A1)-(A5),

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{\frac{\partial}{\partial\theta} \log \hat f_\theta(Y_i, X_i'\theta)\Big|_{\theta = \theta_0} - \frac{\partial}{\partial\theta} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta = \theta_0}\right\} I\{Z_i \in S\} \xrightarrow[n \to \infty]{P} 0.$$

The asymptotic distribution of $\hat\theta_n$ is then given by the following:

Theorem 3.1 Under assumptions (A1)-(A8) and (C($\gamma$)) with $\gamma \in \,]1/8, 1/6[$, and if $\theta_0$ is an interior point of $\Theta$, then

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{L}} N(0, \Sigma) \qquad (3.4)$$

with $\Sigma = M^{-1} V M^{-1}$, where

$$V = E\left\{\frac{\partial}{\partial\theta} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0} \frac{\partial}{\partial\theta'} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0} I\{Z_i \in S\}\right\}$$

and $M$ was defined in (A8).

Proof of Theorem 3.1: It is sufficient to show that

$$-\frac{1}{n} \sum_{i=1}^{n} \log \hat f_{\theta,i}(Y_i, X_i'\theta)\, I\{Z_i \in S\}$$

verifies the conditions (i)-(iv) of Lemma 5.1 of Ichimura (1993).

(i) $\hat\theta_n$ converges almost surely to $\theta_0$, by Theorem 2.1.

(ii) $-\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f_\theta(Y_i, X_i'\theta)\big|_{\theta_0}\, I\{Z_i \in S\} \xrightarrow[n \to \infty]{\mathcal{L}} N(0, V)$, since

$$\theta_0 = \arg\max_{\theta \in \Theta} E\left[\log f_\theta(Y_i, X_i'\theta)\, I\{Z_i \in S\}\right] \;\Longrightarrow\; E\left[\frac{\partial}{\partial\theta} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0} I\{Z_i \in S\}\right] = 0,$$

and

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[\frac{\partial}{\partial\theta} \log \hat f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0} - \frac{\partial}{\partial\theta} \log f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0}\right] I\{Z_i \in S\}$$

converges to 0 in probability by Lemma 3.2.

(iii) $-\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial\theta\,\partial\theta'} \log \hat f_\theta(Y_i, X_i'\theta)\, I\{Z_i \in S\} \xrightarrow[n \to \infty]{P} E\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'} \log f_\theta(Y_i, X_i'\theta)\, I\{Z_i \in S\}\right]$ uniformly in $\theta \in \Theta$ (Lemma 3.1 and assumption (A5 iii)).

(iv) $M(\theta_0) = E\left[-\frac{\partial^2}{\partial\theta\,\partial\theta'} \log f_\theta(Y_i, X_i'\theta)\big|_{\theta_0}\, I\{Z_i \in S\}\right]$ is a positive-definite matrix by (A8).


4 Efficiency of the semiparametric estimator

Let $\varphi_0(y, x)$ be the density of $Z_i = (Y_i, X_i)$ given that $Z_i \in S$, where $S$ is a subset of the support of $Z_i$ (we assume that $(Y_i, X_i)$ is absolutely continuous with respect to a $\sigma$-finite measure $\mu$). Here we do not suppose that $S$ satisfies assumption (A4), so that $S$ may coincide with the support of $Z_i$, even in the case where $Z_i$ is not compactly supported. According to (1.3) (with $f$ replaced by $f_0$), for each $z \in S$ we have a decomposition of the form:

$$\varphi_0(y, x) = f_0(y, x'\theta_0)\, g_0(x) \qquad (4.1)$$

where $g_0(x)$ is the marginal density of $X_i$ given that $Z_i \in S$. Hence, our semiparametric model is defined by the family of distributions

$$\mathcal{P} = \left\{P : \frac{dP}{d\mu} = \varphi(y, x, \theta, f, g),\ \theta \in \Theta,\ f \in \mathcal{F},\ g \in \mathcal{G}\right\} \qquad (4.2)$$

with the densities satisfying:

i) $\varphi(y, x, \theta, f, g) = f(y, x'\theta)\, g(x)$;

ii) $\varphi(y, x, \theta_0, f_0, g_0) = \varphi_0(y, x)$.

Following Bickel, Klaassen, Ritov and Wellner (1993), in order to determine the bound on the asymptotic variance of an estimator of $\theta_0$, we need to calculate the efficient score. For this purpose, we first need to determine the tangent space $\dot{\mathcal{P}}_2$ corresponding to the nonparametric part

$$\mathcal{P}_2 = \left\{P : \frac{dP}{d\mu} = \varphi(y, x, \theta_0, f, g),\ f \in \mathcal{F},\ g \in \mathcal{G}\right\} \qquad (4.3)$$

of the model. This is the closed linear span of the union of tangent spaces corresponding to (one-dimensional) regular parametric submodels $\mathcal{Q} \subset \mathcal{P}_2$. Let

$$\mathcal{Q} = \left\{P_\eta : \frac{dP_\eta}{d\mu} = \varphi_\eta(y, x, \theta_0) = \varphi(y, x, \theta_0, f(\cdot, \cdot, \eta), g(\cdot, \eta)),\ \eta \in H \subset \mathbb{R}\right\} \qquad (4.4)$$

be such a submodel. Thus $\{f(\cdot, \cdot, \eta)\}_{\eta \in H} \subset \mathcal{F}$, $\{g(\cdot, \eta)\}_{\eta \in H} \subset \mathcal{G}$, and there exists an element $\eta_0 \in H$ such that $\varphi_{\eta_0}(y, x, \theta_0) = \varphi_0(y, x)$. The tangent space $\dot{\mathcal{Q}}$ of $\mathcal{Q}$ (at $P_0$) is simply the linear subspace of $L_2(P_0) = L_2(\varphi_0\,\mu)$ spanned by the score function

$$S_\eta = \frac{\partial \ln \varphi_\eta(Y_i, X_i, \theta_0)}{\partial \eta}\Big|_{\eta = \eta_0}.$$

We have:

$$S_\eta = \frac{\partial \ln f(Y_i, X_i'\theta_0, \eta)}{\partial \eta}\Big|_{\eta_0} + \frac{\partial \ln g(X_i, \eta)}{\partial \eta}\Big|_{\eta_0} \qquad (4.5)$$

so that

$$S_\eta \in \mathcal{S} = \left\{s_1(Y_i, X_i'\theta_0) + s_2(X_i) : E_0\left[s_1(Y_i, X_i'\theta_0) \mid X_i'\theta_0\right] = 0,\ E_0\left[s_2(X_i)\right] = 0\right\}$$

where $E_0$ means that the expectation is taken with respect to the probability measure $P_0 = \varphi_0\,\mu$. This means that the tangent space $\dot{\mathcal{P}}_2$ is a subspace of $\mathcal{S}$. Let

$$S_\theta = \frac{\partial \ln \varphi(Y_i, X_i, \theta, f_0, g_0)}{\partial \theta}\Big|_{\theta_0} = \frac{\partial \ln f_0(Y_i, X_i'\theta)}{\partial \theta}\Big|_{\theta_0} = \partial_2 \ln f_0(Y_i, X_i'\theta_0)\, X_i, \qquad (4.6)$$

where $\partial_2$ denotes the derivative with respect to the second argument. According to Bickel, Klaassen, Ritov and Wellner (1993), Corollary 3.4.1, the information bound on $\theta_0$ is given by $I_0 = E_0\, S^* S^{*\prime}$, where $S^*$, the efficient score, is the residual of the projection of $S_\theta$ on $\dot{\mathcal{P}}_2$. Since $\dot{\mathcal{P}}_2 \subset \mathcal{S}$, we have $I_0 \geq E_0\, S_1 S_1'$, where $S_1 = S_\theta - \operatorname{proj}(S_\theta \mid \mathcal{S})$ (see Lemma 9 of Bonneu, Delecroix and Hristache (1995)).

On the other hand, if

$$\mathcal{Q} = \left\{P_\theta : \frac{dP_\theta}{d\mu} = \varphi(y, x, \theta, f(\cdot, \cdot, \theta), g(\cdot, \theta)),\ \theta \in \Theta\right\} \qquad (4.7)$$

is a regular parametric submodel of $\mathcal{P}$ containing $P_0$, then the information bound $I(\theta_0, \mathcal{Q})$ on $\theta_0$ in $\mathcal{Q}$ is such that $I(\theta_0, \mathcal{Q}) \geq I_0$. This means that if we can find a parametric submodel $\mathcal{Q}$ such that $I(\theta_0, \mathcal{Q}) = E_0\, S_1 S_1'$, we have an explicit formula for $I_0$:

$$I_0 = E_0\, S_1 S_1' = E_0\left\{\left[\partial_2 \ln f_0(Y_i, X_i'\theta_0)\right]^2 \left[X_i - E_0(X_i \mid X_i'\theta_0)\right] \left[X_i - E_0(X_i \mid X_i'\theta_0)\right]'\right\},$$

since the projection of a vector $s(Y_i, X_i) \in L_2(P_0)$ such that $E_0[s(Y_i, X_i) \mid X_i] = 0$ on $\mathcal{S}$ is simply $E_0[s(Y_i, X_i) \mid Y_i, X_i'\theta_0]$. It is not difficult to see that the submodel:

$$\mathcal{Q} = \left\{P_\theta : \frac{dP_\theta}{d\mu} = \varphi(y, x, \theta, f_\theta, g_0),\ \theta \in \Theta\right\} \qquad (4.8)$$

has the desired property $I(\theta_0, \mathcal{Q}) = E_0\, S_1 S_1'$, since

$$\frac{d}{d\theta} f_\theta(Y_i, X_i'\theta)\Big|_{\theta = \theta_0} = \frac{\partial}{\partial t} f_0(Y_i, t)\Big|_{t = X_i'\theta_0} \left[X_i - E(X_i \mid X_i'\theta_0)\right]. \qquad (4.9)$$

If we compare $I_0$ with the asymptotic variance-covariance matrix of our estimator, we see that, for a given set $S$ satisfying assumption (A4), the two coincide. This means that the estimator we proposed is efficient in the model built on the data set $(X_i^*, Y_i^*)$, $i \geq 1$, where $(X_i^*, Y_i^*)$ is the $i$-th among those $(X_j, Y_j) \in S$. If $S$ is the support of $Z_i$, our estimator is efficient in the initial model.
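To spell the comparison out (our reconstruction of the argument; these displays do not appear in the paper): dividing (4.9) by $f_{\theta_0}$ shows that the score of the submodel (4.8) at $\theta_0$ is

$$\frac{d}{d\theta} \ln f_\theta(Y_i, X_i'\theta)\Big|_{\theta_0} = \partial_2 \ln f_0(Y_i, X_i'\theta_0)\left[X_i - E(X_i \mid X_i'\theta_0)\right] = S_1,$$

so that $V = E_0\, S_1 S_1' = I_0$; the information-matrix equality $M = V$ (valid under regularity, since $f_{\theta_0}$ is the true conditional density) then collapses the sandwich of Theorem 3.1 to

$$\Sigma = M^{-1} V M^{-1} = V^{-1} = I_0^{-1},$$

so $\sqrt{n}(\hat\theta_n - \theta_0)$ attains the variance bound.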

5 Simulation study

The asymptotic efficiency of an estimator is not always the most important argument for a practitioner, who may prefer an estimator that is easy to compute even if it is not optimal from a theoretical point of view. This is why methods like those proposed by Powell, Stock and Stoker (1989) or Horowitz and Härdle (1996) are, and will be, preferred in practice to an estimator requiring optimization procedures, like the one defined by equations (1.6) and (1.7). A possible solution to this problem¹ is to use a one-step estimator, as a compromise between asymptotic and computational efficiency, whenever an easily computed preliminary estimate is available.

This can be done in the following way: if $\hat\theta_n$ is defined by (1.6) and (1.7), then

$$\frac{\partial \hat L_n}{\partial \theta}(\hat\theta_n) = 0.$$

If $\tilde\theta_n$ is a preliminary $\sqrt{n}$-consistent estimator of $\theta_0$ (we can take, for example, $\tilde\theta_n$ to be the weighted average derivative estimator of Powell, Stock and Stoker (1989)), then we have

$$0 = \frac{\partial \hat L_n}{\partial \theta}(\hat\theta_n) = \frac{\partial \hat L_n}{\partial \theta}(\tilde\theta_n) + \frac{\partial^2 \hat L_n}{\partial \theta\,\partial \theta'}(\tilde\theta_n)\,(\hat\theta_n - \tilde\theta_n) + o_P\left(\left\|\hat\theta_n - \tilde\theta_n\right\|\right).$$

By assumption (A8) and the fact that $\hat\theta_n$ and $\tilde\theta_n$ are root-$n$-consistent estimators of $\theta_0$, we obtain:

$$\hat\theta_n = \tilde\theta_n - \left[\frac{\partial^2 \hat L_n}{\partial \theta\,\partial \theta'}(\tilde\theta_n)\right]^{-1} \frac{\partial \hat L_n}{\partial \theta}(\tilde\theta_n) + o_P\left(\frac{1}{\sqrt{n}}\right).$$

If we define

$$\bar\theta_n = \tilde\theta_n - \left[\frac{\partial^2 \hat L_n}{\partial \theta\,\partial \theta'}(\tilde\theta_n)\right]^{-1} \frac{\partial \hat L_n}{\partial \theta}(\tilde\theta_n) \qquad (5.1)$$

we then obtain an asymptotically efficient estimator, since this one-step estimator has the same asymptotic distribution as $\hat\theta_n$.
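A minimal sketch of the update (5.1), assuming the criterion $\hat L_n$ is available as a function (for instance the `pseudo_log_likelihood` sketched in Section 1) and using finite differences for its derivatives (the step size and names are our choices):

```python
import numpy as np

def one_step(theta_tilde, L_hat, eps=1e-4):
    """One-step estimator (5.1): a single Newton step on the estimated
    criterion L_hat, starting from a root-n-consistent theta_tilde."""
    d = len(theta_tilde)
    grad = np.zeros(d)
    hess = np.zeros((d, d))
    e = eps * np.eye(d)
    for k in range(d):                  # central differences for the gradient
        grad[k] = (L_hat(theta_tilde + e[k]) - L_hat(theta_tilde - e[k])) / (2 * eps)
    for k in range(d):                  # and for the Hessian
        for l in range(d):
            hess[k, l] = (L_hat(theta_tilde + e[k] + e[l])
                          - L_hat(theta_tilde + e[k] - e[l])
                          - L_hat(theta_tilde - e[k] + e[l])
                          + L_hat(theta_tilde - e[k] - e[l])) / (4 * eps ** 2)
    # theta_bar_n = theta_tilde - [Hessian]^{-1} gradient, as in (5.1)
    return theta_tilde - np.linalg.solve(hess, grad)
```

Under the identification constraint used below (last component fixed at 1), the step would be applied to the free components only.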

In order to evaluate the performance of the one-step estimator, which is asymptotically equivalent to $\hat\theta_n$ but easier to compute, for small sample sizes, we give here the results of a simulation study. We considered the model

$$Y_i = X_i'\theta_0 + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $Y_i \in \mathbb{R}$, $X_i = (X_i^{(1)}, X_i^{(2)}) \in \mathbb{R}^2$, $\theta_0 = (-1, 1)$, and $X_i^{(1)}$ and $X_i^{(2)}$ are independent and of the same law, a mixture of two normal laws,

$$X_i^{(1)}, X_i^{(2)} \sim 0.2\, N(0, 1) + 0.8\, N(0.25, 2),$$

and the errors are normal with mean zero and variance equal to $(X_i'\theta_0)^2 = (X_i^{(1)} - X_i^{(2)})^2$:

$$\varepsilon_i \sim N(0, |X_i'\theta_0|).$$
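This design can be reproduced as follows (a sketch; we read $N(m, s)$ as mean $m$ and standard deviation $s$, which is an assumption about the paper's notation):

```python
import numpy as np

def simulate(n, rng):
    """One sample from the simulation design of Section 5."""
    # mixture 0.2 N(0,1) + 0.8 N(0.25, 2): each coordinate picks its
    # component independently
    pick = rng.random((n, 2)) < 0.2
    X = np.where(pick,
                 rng.normal(0.0, 1.0, size=(n, 2)),
                 rng.normal(0.25, 2.0, size=(n, 2)))
    theta0 = np.array([-1.0, 1.0])
    index = X @ theta0
    eps = rng.normal(0.0, np.abs(index))   # epsilon_i ~ N(0, |X_i' theta_0|)
    return X, index + eps

X, Y = simulate(200, np.random.default_rng(0))
```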

As the initial estimator we used the weighted average derivative estimator defined by:

$$\tilde\theta_n = -\frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \frac{1}{h_n^3}\, K'\!\left(\frac{X_i - X_j}{h_n}\right) Y_i$$

¹ As suggested by the practical experience of J. Horowitz.


where $K'\!\left(\frac{X_i - X_j}{h_n}\right)$ is a notation for the vector

$$\begin{pmatrix} K'\!\left(\dfrac{X_i^{(1)} - X_j^{(1)}}{h_n}\right) K\!\left(\dfrac{X_i^{(2)} - X_j^{(2)}}{h_n}\right) \\[2ex] K\!\left(\dfrac{X_i^{(1)} - X_j^{(1)}}{h_n}\right) K'\!\left(\dfrac{X_i^{(2)} - X_j^{(2)}}{h_n}\right) \end{pmatrix} \in \mathbb{R}^2,$$

the real-valued kernel function is defined by

$$K(u) = \begin{cases} \frac{1}{4}\,(7 - 31u^2) & |u| \leq \frac{1}{2} \\[1ex] \frac{1}{4}\,(u^2 - 1) & \frac{1}{2} \leq |u| \leq 1 \\[1ex] 0 & |u| > 1 \end{cases}$$

and the bandwidth $h_n$ is of the form $h_n = 6\, n^{-1/5}$.
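A sketch of this kernel, its derivative, and the resulting weighted average derivative estimate for the bivariate design above (our vectorized implementation, continuing with the simulated (X, Y) from the earlier sketch; variable names are ours):

```python
import numpy as np

def K(u):
    # the fourth-order kernel displayed above
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 0.5, 0.25 * (7.0 - 31.0 * u ** 2),
           np.where(np.abs(u) <= 1.0, 0.25 * (u ** 2 - 1.0), 0.0))

def K_prime(u):
    # its derivative (zero outside [-1, 1])
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 0.5, -15.5 * u,
           np.where(np.abs(u) <= 1.0, 0.5 * u, 0.0))

def wade(X, Y, h):
    """Weighted average derivative estimate theta_tilde_n, written for
    the bivariate design (d = 2) of this section."""
    n = len(Y)
    total = np.zeros(2)
    for i in range(n):
        U = (X[i] - X) / h              # (X_i - X_j)/h_n for all j
        U[i] = 2.0                      # kernel vanishes there, so j = i drops out
        g = np.stack([K_prime(U[:, 0]) * K(U[:, 1]),     # gradient of the
                      K(U[:, 0]) * K_prime(U[:, 1])], 1)  # product kernel
        total += Y[i] * g.sum(axis=0)
    return -2.0 * total / (n * (n - 1) * h ** 3)

theta_tilde = wade(X, Y, h=6.0 * len(Y) ** (-1.0 / 5.0))  # h_n = 6 n^{-1/5}
```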

For the one-step estimator $\bar\theta_n$ given by (5.1), we used the same kernel $K$ in the definition of $\hat L_n(\theta)$ and a bandwidth of the form $h_n = 2.5\, n^{-1/7.5}$.

As only the direction of $\theta_0$ can be identified, and not $\theta_0$ itself, we imposed on the estimators the same constraint as on $\theta_0$, namely that the last component equals 1. The results for the estimation of $\theta_0^{(1)} = -1$ using the weighted average derivative estimator $\tilde\theta_n$ and the one-step estimator $\bar\theta_n$, with sample sizes $n \in \{50, 100, 200, 400\}$, are summarized in the following table, which contains the empirical mean and the empirical mean squared error for each case:

                     n = 50             n = 100            n = 200            n = 400
$\tilde\theta_n$     -1.0406 (0.1067)   -1.0174 (0.0305)   -1.0187 (0.0152)   -1.0030 ($78.45 \times 10^{-4}$)
$\bar\theta_n$       -0.9599 (0.0866)   -0.9649 (0.0269)   -0.9808 (0.0138)   -0.9806 ($77.81 \times 10^{-4}$)

As a general conclusion, we can say that the one-step estimator works better than the initial one. However, the rate of improvement of the mean squared error decreases with the sample size (18.83%, 11.80%, 9.21% and 0.81%, respectively), but this may be only a consequence of our bandwidth choice, which is in no way guaranteed to be optimal. Moreover, if we change the constant in the bandwidth used to obtain $\bar\theta_n$ from 2.5 to 2.0, taking $h_n = 2.0\, n^{-1/7.5}$, this phenomenon disappears, but the general conclusion remains the same, namely that $\bar\theta_n$ provides better estimates of $\theta_0$ than $\tilde\theta_n$ (except for a small "accident" in the case $n = 100$):

                     n = 50             n = 100            n = 200            n = 400
$\tilde\theta_n$     -1.0406 (0.1067)   -1.0174 (0.0305)   -1.0187 (0.0152)   -1.0030 ($78.45 \times 10^{-4}$)
$\bar\theta_n$       -0.9812 (0.0961)   -0.9786 (0.0307)   -0.9886 (0.0134)   -0.9857 ($75.37 \times 10^{-4}$)
