Variable data driven bandwidth choice in nonparametric quantile regression


Klaus Abberger, University of Konstanz, Germany

Abstract:

The choice of a smoothing parameter or bandwidth is crucial when applying nonparametric regression estimators. In nonparametric mean regression various methods for bandwidth selection exist, but in nonparametric quantile regression bandwidth choice is still an unsolved problem. In this paper a selection procedure for locally varying bandwidths based on the asymptotic mean squared error (MSE) of the local linear quantile estimator is discussed. To estimate the unknown quantities of the MSE, local linear quantile regression based on cross-validation and local likelihood estimation are used.

Key Words: quantile regression, nonparametric regression, conditional quantile estimation, local linear estimation, local bandwidth selection, local likelihood, generalized logistic distribution

1 Introduction

An interesting problem in studying the interdependence between a random variable $Y$ and a covariate $X$ is how to estimate the quantiles of $Y$ for a given value of $X$. For fixed $\alpha \in (0,1)$, the quantile regression function gives the $\alpha$th quantile $q_\alpha(x)$ of the conditional distribution of $Y$ given $X = x$. Quantile regression can be used to study the dependence of $Y$ on $X$ not only in the center of the conditional distribution but also in its lower and upper tails.

Various nonparametric estimation methods for quantile regression have been discussed. These methods include spline smoothing, kernel estimation, nearest-neighbour estimation and locally weighted polynomial regression. Yu and Jones (1998) propose two kinds of local linear quantile regression.

In this paper the locally weighted linear quantile regression estimator is used. The estimator is defined by setting $\hat q_\alpha(x) = \hat a$, where $\hat a$ and $\hat b$ minimize

$$\sum_{i=1}^{n} \rho_\alpha\bigl(Y_i - a - b(X_i - x)\bigr)\, K\!\left(\frac{x - X_i}{h}\right), \tag{1}$$

with kernel function $K(\cdot)$, bandwidth $h$ and loss function

$$\rho_\alpha(u) = \alpha\, 1_{\{u \ge 0\}}(u)\, u + (\alpha - 1)\, 1_{\{u < 0\}}(u)\, u \tag{2}$$

introduced by Koenker and Bassett (1978) in connection with parametric quantile regression. For a discussion of this nonparametric estimator see Heiler (2000) or Yu and Jones (1998), who also derive the mean squared error (MSE) of this estimator. The considerations in Sec. 2 of this paper are based on this MSE.
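The estimator defined by (1) and (2) can be illustrated with a minimal Python sketch. The Epanechnikov kernel and all function names are assumptions made for illustration; they are not taken from the paper. The brute-force search relies on the fact that a weighted check-loss fit with two parameters attains its minimum at a line through two data points. It is exact but cubic in the number of points in the window, so it is only meant for small examples.

```python
def check_loss(u, alpha):
    """Loss (2): alpha*u for u >= 0 and (alpha - 1)*u for u < 0."""
    return alpha * u if u >= 0 else (alpha - 1.0) * u


def epanechnikov(t):
    """Assumed kernel K; any symmetric density could be used instead."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def objective(a, b, x0, xs, ys, h, alpha):
    """Criterion (1) evaluated at intercept a and slope b."""
    return sum(
        check_loss(y - a - b * (x - x0), alpha) * epanechnikov((x0 - x) / h)
        for x, y in zip(xs, ys)
    )


def local_linear_quantile(x0, xs, ys, h, alpha):
    """Return (a_hat, b_hat); a_hat is the quantile estimate at x0."""
    window = [(x, y) for x, y in zip(xs, ys) if epanechnikov((x0 - x) / h) > 0.0]
    best = None
    # Candidate fits: every line through two points that carry positive weight.
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            (x1, y1), (x2, y2) = window[i], window[j]
            if x1 == x2:
                continue  # no unique line through two points with equal x
            b = (y2 - y1) / (x2 - x1)
            a = y1 + b * (x0 - x1)
            value = objective(a, b, x0, xs, ys, h, alpha)
            if best is None or value < best[0]:
                best = (value, a, b)
    if best is None:
        raise ValueError("need two distinct design points inside the window")
    return best[1], best[2]
```

For realistic sample sizes a linear programming or interior point solver would replace the double loop; the criterion being minimized stays the same.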

The practical performance of $\hat q_\alpha(x)$ depends strongly on the bandwidth $h$. Yu and Jones (1998) develop a rule-of-thumb bandwidth choice procedure based on the plug-in idea. The starting point is the asymptotically optimal bandwidth minimizing the MSE. Since this bandwidth depends on unknown quantities, the authors introduce some simplifying assumptions. These assumptions result in the bandwidth selection strategy

$$h_\alpha = h_{\text{mean}} \left\{ \frac{\alpha(1-\alpha)}{\varphi\bigl(\Phi^{-1}(\alpha)\bigr)^2} \right\}^{1/5}. \tag{3}$$

Here $\varphi$ and $\Phi$ are the standard normal density and distribution function, and $h_{\text{mean}}$ is a bandwidth chosen for mean regression estimation by one of the various existing methods.


quantiles.
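Rule (3) is simple enough to evaluate directly. The following sketch uses `statistics.NormalDist` from the Python standard library for $\varphi$ and $\Phi^{-1}$; the function name `quantile_bandwidth` is an assumption for illustration.

```python
from statistics import NormalDist


def quantile_bandwidth(h_mean, alpha):
    """Rule (3): rescale a mean-regression bandwidth for the alpha-quantile."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(alpha)   # Phi^{-1}(alpha)
    density = std_normal.pdf(z)     # phi(Phi^{-1}(alpha))
    return h_mean * (alpha * (1.0 - alpha) / density ** 2) ** 0.2
```

By the symmetry of the normal density the factor in (3) is the same for $\alpha$ and $1-\alpha$, and it grows as $\alpha$ moves from 0.5 towards 0 or 1.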

Abberger (1998) adapts the cross-validation idea to kernel quantile regression and presents some simulation examples.

In contrast to the above two bandwidth selection strategies, where one global bandwidth is chosen, in this paper a method for locally varying bandwidth choice is developed. An algorithm based on the MSE-optimal bandwidth is discussed in Sec. 2 and some simulation examples are presented in Sec. 3.

2 Variable bandwidth choice

For local linear quantile regression, the asymptotic form of the mean squared error is

$$\mathrm{MSE}(\hat q_\alpha(x)) \approx \frac{1}{4} h^4 \mu_2(K)^2 q_\alpha''(x)^2 + \frac{R(K)\,\alpha(1-\alpha)}{n h\, g(x)\, f(q_\alpha(x) \mid x)^2}, \tag{4}$$

where $\mu_2(K) = \int u^2 K(u)\,du$, $R(K) = \int K^2(u)\,du$, and $g$ is the design density, i.e. the marginal density of $X$. Further, $f$ denotes the conditional density $f(y \mid x)$ of $Y$ given $X = x$ and $q_\alpha''(x)$ the second derivative of the conditional $\alpha$-quantile (see Yu and Jones (1998)).

From (4) follows the asymptotically optimal bandwidth

$$h_\alpha^5(x) = \frac{R(K)\,\alpha(1-\alpha)}{n\, \mu_2(K)^2\, q_\alpha''(x)^2\, g(x)\, f(q_\alpha(x) \mid x)^2}. \tag{5}$$
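The step from (4) to (5) is a one-line minimization over $h$. The shorthand $A(x)$ and $B(x)$ for the bias and variance constants is introduced here only for this derivation and does not appear in the paper.

```latex
\begin{align*}
A(x) &= \mu_2(K)^2\, q_\alpha''(x)^2, \qquad
B(x) = \frac{R(K)\,\alpha(1-\alpha)}{g(x)\, f(q_\alpha(x) \mid x)^2}, \\
\mathrm{MSE}(h) &\approx \tfrac{1}{4} A(x)\, h^4 + \frac{B(x)}{n h}, \\
\frac{d}{dh}\,\mathrm{MSE}(h) &= A(x)\, h^3 - \frac{B(x)}{n h^2} = 0
\quad\Longrightarrow\quad h^5 = \frac{B(x)}{n\, A(x)} .
\end{align*}
```

Since both terms of the approximation are convex in $h > 0$, the stationary point is the minimizer, and the optimal bandwidth is of order $n^{-1/5}$.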

This bandwidth depends on the unknown quantities $g(x)$, $q_\alpha(x)$ and $f(y \mid x)$. Plug-in estimates for $h_\alpha(x)$ use formula (5), replacing the unknown quantities by estimates. Before calculating the local bandwidths it is necessary to estimate:

(i) the design density $g(x)$

(ii) the conditional quantile $q_\alpha(x)$ and its second derivative $q_\alpha''(x)$

(iii) the conditional density $f(y \mid x)$ at $y = q_\alpha(x)$.
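Once estimates of the three quantities are available, the plug-in bandwidth is a direct evaluation of (5). The sketch below is a hedged illustration: the function names are hypothetical, and the default kernel constants $\mu_2(K) = 1/5$ and $R(K) = 3/5$ assume an Epanechnikov kernel, which the paper does not prescribe.

```python
def optimal_bandwidth(n, alpha, q_second, g, f_at_q, mu2=0.2, roughness=0.6):
    """Formula (5); the defaults for mu2 and R(K) are Epanechnikov constants."""
    if q_second == 0.0:
        raise ValueError("q''(x) = 0: formula (5) gives an infinite bandwidth")
    h5 = roughness * alpha * (1.0 - alpha) / (
        n * mu2 ** 2 * q_second ** 2 * g * f_at_q ** 2
    )
    return h5 ** 0.2


def asymptotic_mse(h, n, alpha, q_second, g, f_at_q, mu2=0.2, roughness=0.6):
    """Formula (4), included only to check the minimizer numerically."""
    bias_sq = 0.25 * h ** 4 * mu2 ** 2 * q_second ** 2
    variance = roughness * alpha * (1.0 - alpha) / (n * h * g * f_at_q ** 2)
    return bias_sq + variance
```

The formula breaks down where the estimated second derivative is close to zero, so a practical implementation would need to bound the resulting bandwidth from above; how to do so is not specified in the text.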

An algorithm is needed which gives estimates for these quantities. In this paper the following procedure is chosen:

(i) $g(x)$ is the easiest quantity to estimate. Various nonparametric density estimators can be applied, and bandwidth choice procedures for them exist. In equidistant designs $g(x)$ is uniform.

(ii) Prior estimates of $q_\alpha(x)$ and its second derivative are obtained by local quadratic quantile regression

$$\min_{a,b,c} \left\{ \sum_{i=1}^{n} \rho_\alpha\bigl(Y_i - a - b(X_i - x) - c(X_i - x)^2\bigr)\, K\!\left(\frac{x - X_i}{h}\right) \right\}, \tag{6}$$

with $\hat q_\alpha(x) = \hat a$ and $\hat q_\alpha''(x) = 2\hat c$, since $c$ is the coefficient of the quadratic Taylor term $q_\alpha''(x)(X_i - x)^2/2$ (see Fan and Gijbels (1996) for local polynomial estimation in general). These estimates are based on a global bandwidth chosen by cross-validation, that is, the bandwidth solving

$$\min_{h} \left\{ \sum_{i=1}^{n} \rho_\alpha\bigl(Y_i - \hat q_\alpha^{(-i)}(X_i)\bigr) \right\}, \tag{7}$$

where $\hat q_\alpha^{(-i)}(X_i)$ is the so-called leave-one-out estimator: the estimator of the conditional quantile at $X_i$ is calculated without using the observation $(Y_i, X_i)$ (see Abberger (1998) for details).

(iii) The most crucial point is the estimation of the conditional density $f(\cdot \mid x)$ at $q_\alpha(x)$. To estimate this density we use local likelihood estimation similar to Staniswalis (1989). With a presumed density $\tilde f_\omega$, parameter vector $\omega$ and parameter space $\Omega$, the parameters are estimated locally as maximizers of the weighted likelihood criterion

$$\hat\omega(x) = \arg\max_{\omega \in \Omega} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) \log \tilde f_\omega(Y_i). \tag{8}$$

The estimate $\hat\omega(x)$ determines the fitted density $\tilde f_{\hat\omega}$, and the value of $\tilde f_{\hat\omega}(\hat q_\alpha(x) \mid x)$ is calculated. For this, a pilot bandwidth and a density $\tilde f_\omega$ have to be chosen. As discussed by Staniswalis (1989), a global bandwidth can be selected by cross-validation, similar to step (ii) of the present algorithm. What remains is the choice of a family of densities. For this purpose the location-scale-shape model of the generalized logistic distribution is used. The generalized logistic distribution with location ($\theta$), scale ($\sigma$) and shape ($b$) parameters has the density

$$f(x) = \frac{b}{\sigma}\, \frac{e^{-(x-\theta)/\sigma}}{\bigl(1 + e^{-(x-\theta)/\sigma}\bigr)^{b+1}}, \qquad b > 0,\ \sigma > 0,\ \theta \in \mathbb{R},\ x \in \mathbb{R}. \tag{9}$$

This distribution and the maximum likelihood estimation of its parameters are discussed in detail by Abberger and Heiler (2000). For $b = 1$ the distribution is symmetric, for $b < 1$ it is skewed to the left and for $b > 1$ it is skewed to the right.

The logistic distribution and its various generalizations are discussed in Johnson, Kotz and Balakrishnan (1995). The logistic distribution is one of the most important statistical distributions because of its simplicity and its historical importance as a growth curve. The generalized logistic distributions form very useful classes of densities, as they cover a wide range of skewness and kurtosis values. Therefore, an important application of these distributions is their use in studying the robustness of estimators. In bandwidth choice the flexibility of the generalized logistic distribution is used to approximate a wide range of possible underlying distributions. Obviously other distributions might be used, and for any particular problem at hand there may be other natural choices. But the generalized logistic seems to be a suitable choice in general.

After estimation of the parameters, the value of $f(q_\alpha(x) \mid x)$ can be estimated and inserted, together with the estimates from steps (i) and (ii), into formula (5) to obtain an estimate of the asymptotically optimal bandwidth.
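The leave-one-out criterion (7) used in step (ii) can be sketched as follows. To keep the example short and dependency-free, the local quadratic fit of (6) is swapped for a local constant fit (a kernel-weighted sample quantile); this substitution, the Epanechnikov kernel, the bandwidth grid and all function names are assumptions for illustration, not the paper's implementation.

```python
def check_loss(u, alpha):
    """Loss (2)."""
    return alpha * u if u >= 0 else (alpha - 1.0) * u


def epanechnikov(t):
    """Assumed kernel with support (-1, 1)."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def weighted_quantile(values, weights, alpha):
    """Smallest value whose cumulative weight reaches alpha times the total."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0.0)
    if not pairs:
        return None  # no observation carries weight
    target = alpha * sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    return pairs[-1][0]


def cv_score(h, xs, ys, alpha):
    """Criterion (7): summed check loss of the leave-one-out predictions."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        others_y = ys[:i] + ys[i + 1:]
        weights = [epanechnikov((xi - x) / h) for x in xs[:i] + xs[i + 1:]]
        fit = weighted_quantile(others_y, weights, alpha)
        if fit is None:
            return float("inf")  # bandwidth too small to predict at X_i
        total += check_loss(yi - fit, alpha)
    return total


def cv_bandwidth(xs, ys, alpha, grid):
    """Global bandwidth: the grid value with the smallest CV score."""
    return min(grid, key=lambda h: cv_score(h, xs, ys, alpha))
```

Here `xs` and `ys` are plain Python lists. Replacing `weighted_quantile` by a local quadratic quantile fit gives the procedure described in step (ii); the loop structure is unchanged.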

The above three steps build a framework for the bandwidth selector which clearly could be varied at several stages. For global bandwidth choice in steps (ii) and (iii) other procedures might be used. In step (iii) the local likelihood might be based on a different distribution family. If further information about the underlying data generating process is available, e.g. symmetry of the conditional distribution or heavy tails, this can be taken into account in the selection of the distribution family. The settings used above are very general. Their applicability is demonstrated in some simulation examples in the next section.
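Step (iii) can be sketched in the same spirit: maximize the kernel-weighted log-likelihood (8) for the generalized logistic density (9) and evaluate the fitted density at the point of interest. The parameter names `theta` and `sigma`, the Epanechnikov kernel, the starting values and the simple compass search are all assumptions chosen to keep the example self-contained; the text does not say which optimizer is used.

```python
import math


def epanechnikov(t):
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def genlogistic_logpdf(y, theta, sigma, b):
    """Log of density (9), written to avoid overflow in exp()."""
    z = (y - theta) / sigma
    softplus = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))  # log(1 + e^{-z})
    return math.log(b) - math.log(sigma) - z - (b + 1.0) * softplus


def local_loglik(params, x0, xs, ys, h):
    """Criterion (8); params = (theta, log sigma, log b) keeps sigma, b > 0."""
    theta, log_sigma, log_b = params
    sigma, b = math.exp(log_sigma), math.exp(log_b)
    return sum(
        epanechnikov((x0 - x) / h) * genlogistic_logpdf(y, theta, sigma, b)
        for x, y in zip(xs, ys)
    )


def compass_search(objective, start, step=1.0, tol=1e-4, max_iter=2000):
    """Derivative-free maximizer: try +/- step per coordinate, halve on failure."""
    point, value = list(start), objective(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for k in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[k] += delta
                trial_value = objective(trial)
                if trial_value > value:
                    point, value, improved = trial, trial_value, True
        if not improved:
            step *= 0.5
    return point, value


def local_density_at(y0, x0, xs, ys, h):
    """Estimate f(y0 | x0) by maximizing (8) and evaluating the fitted density."""
    inside = [y for x, y in zip(xs, ys) if epanechnikov((x0 - x) / h) > 0.0]
    mean = sum(inside) / len(inside)
    spread = (sum((y - mean) ** 2 for y in inside) / len(inside)) ** 0.5
    start = [mean, math.log(spread), 0.0]  # b = 1: ordinary logistic shape
    (theta, log_sigma, log_b), _ = compass_search(
        lambda p: local_loglik(p, x0, xs, ys, h), start
    )
    return math.exp(
        genlogistic_logpdf(y0, theta, math.exp(log_sigma), math.exp(log_b))
    )
```

Calling `local_density_at(q_hat, x0, xs, ys, h)` with the quantile estimate from step (ii) would give the density value needed in (5). The compass search only finds a local maximum, so a production version would more likely rely on a tested numerical optimizer.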

3 Simulation examples

In this section some simulation results are presented. Two different densities are chosen. In one example the true underlying distribution is exponential with density

$$f(y) = s\, e^{-s y - 1}\, 1_{\{y > -1/s\}}(y), \qquad s > 0. \tag{10}$$

This distribution is asymmetric and has expectation zero for all $s > 0$. With $x = 1, 2, \ldots, 600$ we choose

$$s = 1.5 + \sin\left(\frac{x}{100}\right). \tag{11}$$

Thus an equidistant design is used for $g(x)$. The second distribution under study is the lognormal distribution, also with scale parameter $s$ as defined in (11). The generalized logistic distribution is intentionally not used as the data generating distribution, so that the flexibility of the above algorithm is demonstrated.
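The two data generating processes can be reproduced with the Python standard library. Density (10) is that of an exponential variable with rate $s$ shifted by $-1/s$, which gives the closed-form quantile used below. For the lognormal case the text only states that $s$ is the scale parameter; the sketch assumes $\log Y \sim N(0, s^2)$, which is one possible reading, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def scale(x):
    """Scale function (11)."""
    return 1.5 + math.sin(x / 100.0)


def draw_exponential(x, rng):
    """Draw from density (10): an Exp(s) variable shifted by -1/s (mean zero)."""
    s = scale(x)
    return rng.expovariate(s) - 1.0 / s


def true_exponential_quantile(x, alpha):
    """Solve 1 - exp(-s*q - 1) = alpha for q."""
    return (-math.log(1.0 - alpha) - 1.0) / scale(x)


def draw_lognormal(x, rng):
    """Assumed parameterization: log Y ~ N(0, s^2)."""
    return rng.lognormvariate(0.0, scale(x))


def true_lognormal_quantile(x, alpha):
    """Quantile exp(s * Phi^{-1}(alpha)) under the same assumption."""
    return math.exp(scale(x) * NormalDist().inv_cdf(alpha))


def simulate(draw, seed=0, n=600):
    """One data set on the equidistant design x = 1, ..., n."""
    rng = random.Random(seed)
    xs = list(range(1, n + 1))
    return xs, [draw(x, rng) for x in xs]
```

For example, `simulate(draw_exponential, seed=1)` returns one exponential data set, and `true_exponential_quantile(x, 0.75)` gives the target curve against which estimates can be compared.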

The two data settings are quite extreme, as Figure 1 shows. This figure presents two data sets generated by the two distributions. The exponential data are very smooth and not really exciting, in contrast to the lognormal data, where strong swings can be observed.

Figure 1: Two simulated data sets with scale function as defined in equation (11) (panels: exponential data, lognormal data; horizontal axis: x)

In both settings our aim is to estimate the conditional 0.75 quantiles. The true quantile functions are presented in Figures 2 and 3. They look identical, but mind the different scales on the ordinates.

To evaluate the resulting quantile estimates, 100 repetitions are calculated for each setting. Local linear quantile estimation with locally chosen bandwidths is used and compared with local linear quantile estimation based on a global bandwidth chosen by cross-validation. The resulting local MSE are shown in Figures 4 and 5.

Figure 2: True 0.75 quantiles for the lognormal distribution (horizontal axis: x)

Figure 4 contains the well-behaved case of exponential data. It can be seen that, in terms of the MSE, estimation based on local bandwidth choice and estimation with a global bandwidth selected by cross-validation perform almost identically. There are, however, changes in the components of the MSE. Compared to the global procedure, local bandwidth choice using the above algorithm leads to an increase in the bias but to a decrease in the variance part. Local bandwidth choice seems not to be really necessary in this case. On the other hand there is also no disadvantage in using it.
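The split of the simulated MSE into a bias and a variance part follows the usual identity MSE = squared bias + variance, computed pointwise over the repetitions. A minimal sketch, with a hypothetical function name and data layout:

```python
def pointwise_mse(estimates, truth):
    """estimates[r][j]: estimate at grid point j in repetition r."""
    reps = len(estimates)
    result = []
    for j, target in enumerate(truth):
        column = [run[j] for run in estimates]
        mean = sum(column) / reps
        variance = sum((v - mean) ** 2 for v in column) / reps
        bias_sq = (mean - target) ** 2
        mse = sum((v - target) ** 2 for v in column) / reps
        result.append({"mse": mse, "bias_sq": bias_sq, "variance": variance})
    return result
```

With the variance normalized by the number of repetitions, as here, the three reported values satisfy the identity exactly at every grid point.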

Figure 5 presents a different situation. In this more extreme data situation local bandwidth choice clearly beats the global method. In the peaks of the quantile function local bandwidth choice leads to a considerable reduction of the MSE.

Figure 3: True 0.75 quantiles for the exponential distribution (horizontal axis: x)

Finally the ability of the local likelihood approach based on the generalized logistic distribution to approximate the behaviour of the underlying lognormal distribution is demonstrated. Figure 6 shows, for one example, the difference between the local likelihood based density estimate of step (iii) and the values obtained from the true underlying lognormal distribution. The conditional quantiles are estimated as described in step (ii) of the algorithm. The figure shows that the local likelihood estimate is quite reasonable.

To sum up the two examples, it can be stated that the presented algorithm works well. Local bandwidth choice is not needed in general. But there are data situations, as demonstrated in the lognormal example, where local bandwidth choice leads to remarkable improvements over the global choice.

Figure 4: Simulated MSE for local and cross-validation bandwidth choice with exponential data (horizontal axis: x; vertical axis: estimated MSE)

Figure 5: Simulated MSE for local and cross-validation bandwidth choice with lognormal data (horizontal axis: x; vertical axis: estimated MSE)

Figure 6: Example of the calculated conditional density at $\hat q_{0.75}(x)$, first using the true underlying lognormal density and second using the approximation with the estimated generalized logistic density (horizontal axis: x; vertical axis: estimated density)

References

Abberger K. (1998). Cross-validation in nonparametric quantile regression. Allgemeines Statistisches Archiv 82, 149-161.

Abberger K., Heiler S. (2000). Simultaneous estimation of parameters for a generalized logistic distribution and application to time series models. Allgemeines Statistisches Archiv 84, 41-50.

Fan J., Gijbels I. (1996). Local polynomial modeling and its applications. Chapman and Hall, London.

Heiler S. (2000). Nonparametric time series analysis. In: A course in time series analysis, edited by D. Peña and G.C. Tiao. John Wiley, New York.

Johnson N.L., Kotz S., Balakrishnan N. (1995). Continuous univariate distributions, volume 2. John Wiley, New York.

Koenker R., Bassett G. (1978). Regression quantiles. Econometrica 46, 33-50.

Staniswalis J.G. (1989). The kernel estimate of a regression function in likelihood-based models. Journal of the American Statistical Association 84, 276-283.

Yu K., Jones M.C. (1998). Local linear quantile regression. Journal of the American Statistical Association 93, 228-237.