

4.3.2 Locally Linear Estimation and Bandwidth Choice

Denote the non–parametric estimate of y_i by the function m(x_i), which is the solution to the following problem:

$$m(x_i) = \hat{\beta}_0, \quad \text{where } [\hat{\beta}_0, \hat{\beta}_1] = \arg\min_{\beta_0, \beta_1} \sum_{j=1}^{n} K\!\left(\frac{x_j - x_i}{h}\right) \left[ y_j - \beta_0 - \beta_1 (x_j - x_i) \right]^2. \qquad (4.2)$$

The estimator m(x_i) is therefore the constant of a linear fit around each x_i, weighting the neighboring observations around x_i with the kernel function K(·).

Another representation of the estimation of the local coefficient vector is (Loader 2004)

$$\hat{\beta} = (x'Wx)^{-1} x'Wy, \qquad (4.3)$$

CHAPTER 4 Non–Monotonicity in the Income–Longevity Relationship

where $\hat{\beta} = [m(x_i), \beta_1]'$ and W is a diagonal matrix with the respective kernel weights on the main diagonal. Note that these kernel weights are distinct from the weights ω in Equation (4.1). In this case of a linear fit, the asymptotic bias of the estimated function m(x_i) is zero, which is not the case for a local constant estimator (the Nadaraya–Watson estimator); see Mittelhammer et al. (2000, pp. 622). As Loader (2004) shows, the asymptotic bias vanishes whenever the degree of the polynomial is odd, and in particular the bias at the boundaries of the data set decreases compared to the Nadaraya–Watson estimator. This property is especially useful in the setting applied here, as a potentially downward–sloping area of the estimated function m(x_i) at the left boundary of x is analyzed. See e.g. Fan and Gijbels (1992), Fan (1992), Pagan and Ullah (1999, pp. 105), and Fan and Gijbels (2003, pp. 60) for a discussion of this topic: the bias of m(x_i) from locally linear regression does not depend on the density of x, hence it is not subject to the question of whether the local regression is performed at the boundaries or in the interior of x.
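To illustrate the estimator in Equation (4.3), the following is a minimal numpy sketch of a local linear fit at a single point, assuming a Gaussian kernel; the function and variable names are illustrative, not taken from the chapter:

```python
import numpy as np

def local_linear(x, y, x0, h):
    """Local linear estimate m(x0): the intercept of a kernel-weighted
    linear fit centered at x0 (Gaussian kernel, bandwidth h)."""
    u = (x - x0) / h
    w = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])   # design matrix [1, x - x0]
    W = np.diag(w)                                   # diagonal weight matrix
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y) # (x'Wx)^{-1} x'Wy
    return beta[0]                                   # beta[0] = m(x0)

# Example: recover a smooth curve from noisy data at x0 = 0.5
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 200)
est = local_linear(x, y, 0.5, 0.1)
```

Repeating this at every grid point of x traces out the estimated function; for exactly linear data the weighted fit reproduces the line regardless of the bandwidth.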

Still, I have to choose the weighting kernel K(·). There are several major proposals for a weighting scheme, among them the Gaussian and the Epanechnikov kernel. The latter proves to be the efficient one; see e.g. Pagan and Ullah (1999, p. 28). Using the Gaussian kernel, however, no observation (no matter how far from x_i) ever receives a weight of zero, which eliminates some computational burden,5 and it is therefore applied here. In general, a kernel function only has to fulfill non–negativity and symmetry around its center x_i,6 and the choice of kernel function is only a minor determinant of the later results. The main difference between kernels in application is their relative efficiency (as compared to the Epanechnikov kernel), where the Gaussian kernel I apply here reaches .9512. On properties of kernels and their efficiency, see e.g. the discussion in Mittelhammer et al. (2000, pp. 602).
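The relative efficiency figure of .9512 can be checked numerically from the kernel constants σ_K · ∫K²(u)du, whose ratio across kernels gives the relative efficiency; the following is a small numpy verification (the grid-based integration is an illustrative shortcut):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

# fine grid for simple Riemann-sum integration
u = np.linspace(-8.0, 8.0, 1_600_001)
du = u[1] - u[0]

def T(K):
    """sigma_K * int K^2(u) du -- the kernel constant entering
    the asymptotic MISE, whose ratio yields relative efficiency."""
    r = np.sum(K(u) ** 2) * du        # int K^2(u) du
    s2 = np.sum(u**2 * K(u)) * du     # int u^2 K(u) du  (= sigma_K^2)
    return np.sqrt(s2) * r

eff = T(epanechnikov) / T(gauss)
print(round(eff, 4))  # -> 0.9512
```

The Epanechnikov kernel minimizes this constant, so the ratio is its efficiency bound relative to the Gaussian, matching the figure quoted above.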

After the choice of the kernel, a bandwidth h has to be determined. A bandwidth chosen too high will leave the estimate 'over–smoothed' and potentially ignores specific patterns, whereas an under–smoothed estimate may hide the pattern of interest behind erratic components, leading in the limit (as h → 0) to an exact replication of the unfitted data. This phenomenon is known as the bias–variance tradeoff (see e.g. Yatchew 1998). The optimal bandwidth can be approximated by a rule–of–thumb, which may be advisable when using large data sets. The method I apply can be understood as a refinement of Silverman's Rule–of–Thumb, which is

5 A large number of zero weights may yield a computational difficulty due to singular matrices: the matrix x'Wx may be singular for certain outcomes of the kernel weights, and thus not invertible.

6 In addition, as the kernel determines weights, it has to integrate to one, and, with the exception of the center x_i, it has to be continuous (this admits, e.g., the triangular kernel with its non–differentiable kink at x_i).

broadly discussed in the literature.7 Assuming a normal kernel, Silverman (1986) proposes the bandwidth to be

$$h = 1.06\, \sigma_x\, n^{-1/5}, \qquad (4.4)$$

where σ_x is the standard deviation of the x–variable and n the number of observations. Yet, this formula relies on parametric distributional assumptions.8 These assumptions can be replaced by distributional properties of the kernel function and the data itself, i.e. by measures of the variance and skewness, to derive an improved plug–in method.
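Equation (4.4) amounts to a one-line computation; the following sketch assumes a simple numpy setting with simulated data:

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth h = 1.06 * sigma_x * n^(-1/5), Eq. (4.4)."""
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-1 / 5)

# Example: for n = 1000 draws with sigma near 2, h is roughly 0.53
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, 1000)
h = silverman_bandwidth(x)
```

The n^{-1/5} rate shrinks the bandwidth only slowly with the sample size: a hundredfold increase in n reduces h by a factor of about 2.5.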

One specific plug–in method for the bandwidth is characterized by Loader (1999, 2004), who proposes the optimal bandwidth to be

$$h = \left( \frac{\sigma^2 (b-a) \int K^2(u)\,du}{n \left( \int u^2 K(u)\,du \right)^{2} \int \left[ m''(x) \right]^2 dx} \right)^{1/5}, \qquad (4.5)$$

where σ² is the error variance, m''(x) is the second derivative of the estimated function, and a and b are the lower and upper bounds of x. Using a first stage or pilot estimate, the error variance can be estimated by

$$\hat{\sigma}^2 = \frac{1}{n - 2\nu_1 + \nu_2} \sum_{i=1}^{n} \left[ y_i - m(x_i) \right]^2, \qquad (4.6)$$

with ν₁ and ν₂ adjusting the degrees of freedom (see Loader 2004 for the computation). The second derivative m''(x) of the estimate is obtained by fitting a local quadratic function (as the pilot estimate) to the data first, hence by solving

$$\left[ \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2 \right] = \arg\min_{\beta_0, \beta_1, \beta_2} \sum_{j=1}^{n} K\!\left( \frac{x_j - x_i}{h} \right) \left[ y_j - \beta_0 - \beta_1 (x_j - x_i) - \beta_2 (x_j - x_i)^2 \right]^2. \qquad (4.7)$$

An estimate for the second derivative is then given by the vector 2β̂₂. However, there remains a pilot bandwidth to be chosen, as the respective β̂₂ and m''(x) are sensitive to the bandwidth as well. Following Silverman's Rule–of–Thumb, I choose the pilot bandwidth to be 1.06 σ_x n^{-1/5}. Of course, the final bandwidth of

7 See e.g. the textbooks by Fan and Gijbels (2003, pp. 47) or Li and Racine (2007, pp. 14).

8 See also Pagan and Ullah (1999, p. 103), who propose the bandwidth to be of the order of magnitude n^{-1/5}.

Equation (4.5) varies with the pilot bandwidth. To give a short overview, a test with the data analyzed here resulted in an under–proportional inverse relation of the pilot bandwidth with the optimal bandwidth, meaning that the optimal bandwidth does not vary as much as the pilot bandwidth.9
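The plug-in steps described above can be sketched schematically as follows. This is an illustrative simplification, not Loader's locfit implementation: the degrees-of-freedom adjustment via ν₁ and ν₂ is replaced by a crude parameter count, the integral of m''(x)² is approximated by a sample average, and the kernel constants are those of the standard Gaussian (for which ∫u²K(u)du = 1):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_quadratic(x, y, x0, h):
    """Pilot fit at x0: returns (m(x0), m''(x0)) from a local quadratic,
    cf. Eq. (4.7); the second derivative is 2 * beta_2."""
    d = x - x0
    w = gauss(d / h)
    X = np.column_stack([np.ones_like(d), d, d**2])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0], 2.0 * beta[2]

def plugin_bandwidth(x, y):
    """Schematic plug-in bandwidth following the structure of Eq. (4.5)."""
    n = len(x)
    a, b = x.min(), x.max()
    h_pilot = 1.06 * np.std(x, ddof=1) * n ** (-1 / 5)  # Silverman pilot
    fits = np.array([local_quadratic(x, y, xi, h_pilot) for xi in x])
    m_hat, m2 = fits[:, 0], fits[:, 1]
    # crude df correction (an assumption standing in for nu_1, nu_2)
    sigma2 = np.sum((y - m_hat) ** 2) / (n - 3)
    int_m2 = np.mean(m2 ** 2) * (b - a)   # crude quadrature of int m''^2 dx
    RK = 1.0 / (2.0 * np.sqrt(np.pi))     # int K^2(u) du, Gaussian kernel
    return (sigma2 * (b - a) * RK / (n * int_m2)) ** (1 / 5)
```

As the text notes, the resulting bandwidth reacts only under-proportionally to the pilot choice, since the pilot enters solely through the estimates of σ² and m''(x) inside the fifth root.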

I denote the application of this method, either unconditionally or conditionally on certain outcomes of the control variables (stratification), as strategy (I).

4.3.3 Approximate Confidence Interval

I approximate a point–wise confidence interval around the estimate of m(x_i) using conditional standard errors σ(x_i) at each grid point of x. The confidence bounds are given by (see Härdle et al. 2004, pp. 119)

$$m_{CB}(x_i) = m(x_i) \pm z_\alpha \sqrt{ \frac{\|K\|_2^2\, \hat{\sigma}^2(x_i)}{n h \hat{f}(x_i)} }, \qquad (4.8)$$

where z_α is 2.58, given the number of observations and the desired confidence level of 99%. The estimated conditional (or local) standard error is

$$\hat{\sigma}^2(x_i) = \frac{1}{n} \sum w(x_i) \left[ y_i - m(x_i) \right]^2, \qquad (4.9)$$

where the density f̂(x_i) of x_i is a non–parametric estimate applying the Gaussian kernel, and ‖K‖₂² denotes ∫K²(u)du, which is 4.37335 in the case of the normal kernel. The kernel weights w(x_i) are the respective elements of the weighting matrix W; see Equation (4.3).
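The band in Equations (4.8)–(4.9) can be sketched for a single grid point as follows. This is an illustrative simplification of the chapter's procedure: the constant ∫K²(u)du is taken as 1/(2√π) for the standard normal density (the chapter quotes a different value for its kernel normalization), and the local residuals are measured against the fit at x₀ rather than against each m(x_i):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def confidence_band(x, y, x0, h, z_alpha=2.58):
    """Pointwise band m(x0) +/- z * sqrt(||K||^2 sigma^2(x0) / (n h f(x0))),
    following the structure of Eqs. (4.8)-(4.9); z = 2.58 gives 99%."""
    n = len(x)
    w = gauss((x - x0) / h)
    # local linear fit at x0, as in Eq. (4.3)
    X = np.column_stack([np.ones(n), x - x0])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    m0 = beta[0]
    # local error variance: kernel-weighted squared residuals, cf. Eq. (4.9)
    wn = w / w.sum()
    sigma2 = np.sum(wn * (y - m0) ** 2)
    f0 = w.sum() / (n * h)                # Gaussian kernel density at x0
    RK = 1.0 / (2.0 * np.sqrt(np.pi))    # int K^2(u) du (an assumption here)
    half = z_alpha * np.sqrt(RK * sigma2 / (n * h * f0))
    return m0 - half, m0 + half
```

Repeating the computation over the grid of x yields the pointwise band around the estimated function; note that the factor 1/f̂(x₀) widens the band where data are sparse, e.g. at the boundaries of x.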