
4.2 Predictor components and priors

4.2.2 Continuous covariates and time scales

4.2.2.1 P-Splines

One increasingly popular approach to estimating smooth effects of continuous covariates is penalized splines, or P-splines, introduced by Eilers & Marx (1996). The fundamental assumption of this approach is that the unknown smooth function $f_j$ of a covariate $x_j$

can be approximated by a polynomial spline. For notational simplicity we will drop the index j in the following discussion.

A polynomial spline function (see Dierckx (1993) or de Boor (1978) for mathematically rigorous treatments) is defined based on a set of $M+1$ (not necessarily equally spaced) knots $x_{\min} = \kappa_0 < \kappa_1 < \cdots < \kappa_{M-1} < \kappa_M = x_{\max}$ within the domain of the covariate $x$. To be more specific, a function $g \colon [a,b] \to \mathbb{R}$ is called a polynomial spline of degree $l$, $l \in \mathbb{N}_0$, based on the knots $\kappa_0, \ldots, \kappa_M$, if it satisfies the following conditions:

1. $g(x)$ is $(l-1)$ times continuously differentiable, and

2. $g(x)$ is a polynomial of degree $l$ for $x \in [\kappa_m, \kappa_{m+1})$, $m = 0, \ldots, M-1$.

The space of polynomial splines can be shown to be an $(M+l)$-dimensional subspace of the space of $(l-1)$ times continuously differentiable functions. Therefore, assuming that $f(x)$ can be approximated by a polynomial spline leads to a representation in terms of a linear combination of $d = M + l$ basis functions $B_m$, i.e.

\[
f(x) = \sum_{m=1}^{d} \xi_m B_m(x),
\]
where $\xi = (\xi_1, \ldots, \xi_d)'$ corresponds to the vector of unknown regression coefficients.

One possible set of basis functions is given by the truncated polynomials $B_1(x) = 1$, $B_2(x) = x$, \ldots, $B_{l+1}(x) = x^l$, $B_{l+2}(x) = (x - \kappa_1)_+^l$, \ldots, $B_{M+l}(x) = (x - \kappa_{M-1})_+^l$, where
\[
(x - \kappa_m)_+^l =
\begin{cases}
(x - \kappa_m)^l & \text{if } x \ge \kappa_m,\\
0 & \text{if } x < \kappa_m.
\end{cases}
\]

Although frequently used to introduce penalized splines (see e.g. Wand (2003) or Ruppert, Wand & Carroll (2003)), truncated polynomials exhibit a number of drawbacks:

Due to their polynomial nature, the evaluation of truncated polynomials may cause numerical problems for large values of the covariate $x$. More important is the question of how to incorporate a smoothness prior or penalization into the model. With truncated polynomials this is usually achieved by assuming that the coefficients associated with the truncated basis functions $B_{l+2}(x), \ldots, B_{M+l}(x)$ are i.i.d. Gaussian random effects.

Hence, increasing the amount of penalization leads (in the limit) to a polynomial of degree $l$, i.e. the limiting polynomial is tied to the degree of the basis functions. Applying a different set of basis functions makes it possible to control the degree of the limiting polynomial and the degree of the basis functions separately, thus leading to a richer model class.

This different class of basis functions is called the B(asis)-spline basis, where basis functions of degree $l$ are defined recursively via

\[
B_m^l(x) = \frac{x - \kappa_m}{\kappa_{m+l} - \kappa_m}\, B_m^{l-1}(x) + \frac{\kappa_{m+l+1} - x}{\kappa_{m+l+1} - \kappa_{m+1}}\, B_{m+1}^{l-1}(x)
\]
with
\[
B_m^0(x) = I_{[\kappa_m, \kappa_{m+1})}(x) =
\begin{cases}
1 & \kappa_m \le x < \kappa_{m+1},\\
0 & \text{else.}
\end{cases}
\]

Note that additional knots $\kappa_{-l} < \cdots < \kappa_{-1} < a$ and $b < \kappa_{M+1} < \kappa_{M+2} < \cdots < \kappa_{M+l}$ are needed for the recursive construction of the full B-spline basis. In the case of equidistant knots, the additional knots are easily defined using the same spacing as for the inner knots. With nonequidistant knots an additional rule has to be specified, e.g. using the distance between the two leftmost and the two rightmost inner knots, respectively.
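As a small illustration of this recursive construction, the following Python sketch (not part of the original exposition; the function name bspline_basis, the knot grid, the degree and the evaluation points are illustrative choices) evaluates a complete B-spline basis with NumPy.

import numpy as np

def bspline_basis(x, knots, l):
    # Evaluate all B-spline basis functions of degree l at the points x.
    # `knots` must already contain the l additional knots on each side
    # (kappa_{-l}, ..., kappa_{M+l}), as discussed above.
    x = np.atleast_1d(x)
    # degree 0: indicator functions of the knot intervals
    B = np.array([(knots[m] <= x) & (x < knots[m + 1])
                  for m in range(len(knots) - 1)], dtype=float)
    # Cox-de Boor recursion over the degree
    for deg in range(1, l + 1):
        B_new = []
        for m in range(len(knots) - deg - 1):
            left = (x - knots[m]) / (knots[m + deg] - knots[m])
            right = (knots[m + deg + 1] - x) / (knots[m + deg + 1] - knots[m + 1])
            B_new.append(left * B[m] + right * B[m + 1])
        B = np.array(B_new)
    return B.T  # rows: evaluation points, columns: basis functions

# Example: M = 10 inner intervals on [0, 1], cubic splines (l = 3)
l, M = 3, 10
inner = np.linspace(0, 1, M + 1)                 # kappa_0, ..., kappa_M
h = inner[1] - inner[0]
knots = np.concatenate([inner[0] - h * np.arange(l, 0, -1),
                        inner,
                        inner[-1] + h * np.arange(1, l + 1)])
x = np.linspace(0, 1 - 1e-9, 200)                # stay inside [kappa_0, kappa_M)
B = bspline_basis(x, knots, l)
print(B.shape)                                   # (200, 13): d = M + l basis functions
print(np.allclose(B.sum(axis=1), 1.0))           # True: B-splines form a partition of unity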

Figure 4.1 shows a small set of B-spline basis functions for different degrees $l$ and different knot choices. Obviously, B-splines form a local basis, since each basis function is positive only on an interval spanned by $l+2$ knots. This property is essential for the construction of the smoothness penalty for P-splines. Furthermore, the basis functions are bounded, yielding better numerical properties compared to the truncated power series basis. In the case of equidistant knots (shown in the left panel), all basis functions have the same functional form and are only shifted along the x-axis. In the case of nonequidistant knots (shown in the right panel), the functional form of the basis functions changes considerably between areas with dense knots and areas where knots are placed at only a few positions. As required by the definition, the smoothness of the basis functions increases with increasing degree.

Figure 4.1: B-spline basis functions of degrees $l = 1$ (upper row), $l = 2$ (middle row), and $l = 3$ (lower row) for equally spaced knots (left panel) and unequally spaced knots (right panel).

Regardless of the specific basis used in an implementation, the $d$-dimensional design vector $v_i$ consists of the basis functions evaluated at the observations $x_i$, i.e. $v_i = (B_1(x_i), \ldots, B_d(x_i))'$. The crucial choice is the number of knots: for a small number of knots, the resulting spline may not be flexible enough to capture the variability of the data. For a large number of knots, however, the estimated curves tend to overfit the data and, as a result, overly rough function estimates are obtained. Figure 4.2 illustrates the influence of the number of knots when estimating a sinusoidal function. Clearly, 5 or 10 knots lead to a smooth and more or less appropriate fit, while more knots tend to overfit the data.
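The following sketch (again purely illustrative, with simulated covariate values) stacks the design vectors $v_i$ into a design matrix; it assumes that SciPy's BSpline.design_matrix is available, which is the case in recent SciPy versions.

import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=100))         # observed covariate values x_i

l, M = 3, 20                                     # cubic basis, M inner intervals
inner = np.linspace(0, 1, M + 1)
h = inner[1] - inner[0]
t = np.concatenate([inner[0] - h * np.arange(l, 0, -1),
                    inner,
                    inner[-1] + h * np.arange(1, l + 1)])

V = BSpline.design_matrix(x, t, l).toarray()     # row i equals v_i' = (B_1(x_i), ..., B_d(x_i))
print(V.shape)                                   # (100, M + l) = (100, 23)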

Whereas adaptive knot selection approaches use specific criteria and strategies to choose an optimal set of basis functions, Eilers & Marx (1996) suggest using a moderately large number of equally spaced knots (usually between 20 and 40) to ensure enough flexibility, and defining a roughness penalty based on differences of adjacent B-spline coefficients to guarantee sufficient smoothness of the fitted curves. In their formulation this leads to penalized likelihood estimation with penalty terms

\[
\operatorname{pen}(\lambda) = \frac{1}{2}\lambda \sum_{m=k+1}^{d} (\Delta^k \xi_m)^2, \qquad (4.10)
\]

Figure 4.2: Influence of the number of knots ((a) 5 knots, (b) 10 knots, (c) 20 knots, (d) 40 knots). The true function is given by the dashed line, the estimated curve by the solid line.

where $\lambda$ is a smoothing parameter and the factor $1/2$ is introduced only for notational convenience. The difference operator $\Delta^k$ of order $k$ is defined recursively by

\[
\begin{aligned}
\Delta^1 \xi_m &= \xi_m - \xi_{m-1}, & (4.11)\\
\Delta^2 \xi_m &= \Delta^1(\Delta^1 \xi_m) = \xi_m - 2\xi_{m-1} + \xi_{m-2}, & (4.12)\\
\Delta^k \xi_m &= \Delta^1(\Delta^{k-1} \xi_m).
\end{aligned}
\]

While first order differences penalize abrupt jumps between successive parameters, second order differences penalize deviations from the linear trend $2\xi_{m-1} - \xi_{m-2}$. As a consequence, large values of the smoothing parameter $\lambda$ lead to estimates close to a horizontal line (first order differences) or a linear effect (second order differences). In general, the limiting polynomial obtained when the smoothing parameter goes to infinity is of order $k-1$ and is therefore independent of the degree of the spline basis. This is in contrast to the results for the truncated power series basis discussed before.
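The following sketch (illustrative only, with simulated sinusoidal data) combines the pieces for a Gaussian response: a B-spline design matrix, the difference matrix $D_k$, the penalty term (4.10), and the resulting penalized least squares estimate $\hat{\xi} = (V'V + \lambda D_k'D_k)^{-1}V'y$. SciPy's BSpline.design_matrix is again assumed to be available; for $k = 2$ and very large $\lambda$ the fit approaches a straight line, as stated above.

import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
n, l, M, k = 200, 3, 20, 2
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + rng.normal(0, 0.3, n)            # simulated noisy sinusoidal data

# cubic B-spline design matrix on [-3, 3] with equidistant knots
inner = np.linspace(-3, 3, M + 1)
h = inner[1] - inner[0]
t = np.concatenate([inner[0] - h * np.arange(l, 0, -1), inner,
                    inner[-1] + h * np.arange(1, l + 1)])
V = BSpline.design_matrix(x, t, l).toarray()     # n x d design matrix, d = M + l
d = V.shape[1]

D = np.diff(np.eye(d), n=k, axis=0)              # k-th order difference matrix D_k
K = D.T @ D                                      # penalty matrix K = D_k' D_k

for lam in (0.1, 10.0, 1e6):
    # penalized least squares estimate for Gaussian data
    xi_hat = np.linalg.solve(V.T @ V + lam * K, V.T @ y)
    pen = 0.5 * lam * np.sum((D @ xi_hat) ** 2)  # pen(lambda) as in (4.10)
    print(f"lambda={lam:g}: pen(lambda)={pen:.3f}, "
          f"max |second diff of xi|={np.abs(np.diff(xi_hat, 2)).max():.2e}")
# As lambda grows (k = 2), the second differences of the coefficients vanish and
# the fitted curve approaches a straight line, the limiting polynomial of order k - 1.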

In a Bayesian framework, penalized splines are introduced by replacing the difference penalties with their stochastic analogues, i.e., random walks of order $k$ are used as priors for the regression coefficients (Lang & Brezger (2004) and Brezger & Lang (2005)). First and second order random walks for equidistant knots are given by

\[
\xi_m = \xi_{m-1} + u_m, \quad m = 2, \ldots, d, \qquad (4.13)
\]
and
\[
\xi_m = 2\xi_{m-1} - \xi_{m-2} + u_m, \quad m = 3, \ldots, d, \qquad (4.14)
\]
with Gaussian errors $u_m \sim N(0, \tau^2)$. Diffuse priors $p(\xi_1) \propto \text{const}$ and $p(\xi_1), p(\xi_2) \propto \text{const}$, respectively, are chosen for the initial values. Alternatively, assumptions (4.13) and (4.14) can be replaced by
\[
\xi_m \mid \xi_{m-1} \sim N(\xi_{m-1}, \tau^2) \quad \text{and} \quad
\xi_m \mid \xi_{m-1}, \xi_{m-2} \sim N(2\xi_{m-1} - \xi_{m-2}, \tau^2).
\]

Figure 4.3 illustrates these assumptions. While a first order random walk induces a constant trend for the conditional expectation of $\xi_m$ given $\xi_{m-1}$, a second order random walk results in a linear trend depending on the two previous values $\xi_{m-1}$ and $\xi_{m-2}$. Figure 4.3 also reveals the impact of the variance parameter $\tau^2$ more clearly: if $\tau^2$ is large, ample deviations from the trend assumed by the random walk prior are possible. In contrast, if $\tau^2$ is small, deviations from this trend will also be small, leading to an almost deterministic behavior of the regression coefficients.
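To make the role of $\tau^2$ concrete, the following small simulation sketch (illustrative only; the initial values are simply fixed at zero rather than given diffuse priors, and indexing is zero-based) draws realizations from the random walk priors (4.13) and (4.14) for a small and a large variance.

import numpy as np

rng = np.random.default_rng(3)
d = 30

def rw_draw(order, tau2):
    # draw one realization of a random walk prior of the given order
    xi = np.zeros(d)                                  # initial values fixed at zero
    u = rng.normal(0, np.sqrt(tau2), d)
    for m in range(order, d):
        if order == 1:
            xi[m] = xi[m - 1] + u[m]                  # (4.13)
        else:
            xi[m] = 2 * xi[m - 1] - xi[m - 2] + u[m]  # (4.14)
    return xi

for tau2 in (0.01, 1.0):
    print("tau^2 =", tau2)
    print("  RW1 draw:", np.round(rw_draw(1, tau2)[:5], 2))
    print("  RW2 draw:", np.round(rw_draw(2, tau2)[:5], 2))
# A small tau^2 keeps the draws close to the deterministic trend (constant for RW1,
# linear for RW2); a large tau^2 allows ample deviations from that trend.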

Figure 4.3: Trends induced by first and second order random walk priors.

The joint distribution of the regression parameters $\xi$ is easily computed as a product of the conditional densities defined by the random walk priors and can be brought into the general form (4.9) of a multivariate but improper Gaussian distribution. The precision matrix is of the form $K = D_k'D_k$, where $D_k$ is a difference matrix of order $k$, i.e.

\[
D_1 =
\begin{pmatrix}
-1 & 1 & & & \\
 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{pmatrix}
\quad \bigl((d-1) \times d\bigr),
\]
\[
D_2 =
\begin{pmatrix}
1 & -2 & 1 & & & \\
 & 1 & -2 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & -2 & 1
\end{pmatrix}
\quad \bigl((d-2) \times d\bigr),
\]
\[
D_k = D_1 D_{k-1} \quad \bigl((d-k) \times d\bigr).
\]

For first order random walks this yields the precision matrix

\[
K =
\begin{pmatrix}
1 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 1
\end{pmatrix}
\qquad (4.15)
\]

and for second order random walks we obtain

\[
K =
\begin{pmatrix}
1 & -2 & 1 & & & & \\
-2 & 5 & -4 & 1 & & & \\
1 & -4 & 6 & -4 & 1 & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & 1 & -4 & 6 & -4 & 1 \\
 & & & 1 & -4 & 5 & -2 \\
 & & & & 1 & -2 & 1
\end{pmatrix}.
\]

Since the rows of matrix (4.15) sum to zero, the rank deficiency of this matrix equals one. In general, the penalty matrix constructed from a random walk of order $k$ has rank $d-k$.
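The structure of these precision matrices and the stated rank deficiency can be checked numerically; the following sketch (not part of the original text, with an arbitrary small dimension $d = 8$) constructs $D_k$ and $K = D_k'D_k$ with NumPy.

import numpy as np

d = 8
for k in (1, 2):
    D = np.diff(np.eye(d), n=k, axis=0)        # difference matrix D_k, (d - k) x d
    K = D.T @ D                                # precision (penalty) matrix K = D_k' D_k
    print(f"k={k}: upper left block of K =\n{K[:3, :5]}")
    print("rank of K:", np.linalg.matrix_rank(K), "(= d - k =", d - k, ")")
    print("row sums:", np.round(K.sum(axis=1), 10))   # all zero for k = 1 and k = 2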

Instead of conditioning only on preceding values of $\xi_m$, we can also compute the conditional distribution of $\xi_m$ given all other entries of $\xi$. This distribution then depends on both preceding and succeeding values of $\xi_m$. From the joint distribution (4.9) these conditional distributions can be easily derived using standard calculations for normal distributions: the weight associated with parameter $\xi_r$ in the conditional distribution of $\xi_m$ is given by $-k_{mr}/k_{mm}$, where $k_{mr}$ and $k_{mm}$ are the corresponding elements of $K$. The variance of the distribution is given by $\tau^2/k_{mm}$ (see for example Rue & Held 2005, p. 22).

Obviously, the weights for a first order random walk equal zero for all but the two nearest neighbors. More specifically we obtain

\[
\xi_m \mid \cdot \sim
\begin{cases}
N(\xi_2, \tau^2) & m = 1,\\[1mm]
N\!\left(\tfrac{1}{2}(\xi_{m-1} + \xi_{m+1}), \tfrac{\tau^2}{2}\right) & m = 2, \ldots, d-1,\\[1mm]
N(\xi_{d-1}, \tau^2) & m = d,
\end{cases}
\qquad (4.16)
\]

i.e. the conditional expectations are given by the local mean of the two nearest neighbors for all but the boundary coefficients. From a regression perspective, this can also be interpreted as a local linear fit through the two nearest neighbors, as illustrated in Figure 4.4. Note that we could also start directly by assuming a local linear conditional expectation for $\xi_m$, which would result in the same prior distribution as discussed above. This interpretation will be used in Section 4.2.6, where P-splines are extended to two dimensions.
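The following small sketch (illustrative) reads the weights $-k_{mr}/k_{mm}$ and the variance $\tau^2/k_{mm}$ directly off the first order random walk precision matrix and reproduces the interior case of (4.16).

import numpy as np

d, tau2 = 8, 1.0
D1 = np.diff(np.eye(d), n=1, axis=0)
K = D1.T @ D1                                   # RW1 precision matrix as in (4.15)

m = 4                                           # an interior coefficient (0-based index)
weights = -K[m] / K[m, m]                       # weights -k_mr / k_mm
weights[m] = 0.0                                # exclude xi_m itself
print(np.round(weights, 3))                     # 1/2 on the two nearest neighbours, 0 elsewhere
print("conditional variance:", tau2 / K[m, m])  # tau^2 / 2, matching (4.16)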

Figure 4.4: Local regression models induced by first and second order random walk priors.

For a second order random walk, similar calculations yield the conditional distributions

\[
\xi_m \mid \cdot \sim
\begin{cases}
N(2\xi_2 - \xi_3, \tau^2) & m = 1,\\[1mm]
N\!\left(\tfrac{2}{5}\xi_1 + \tfrac{4}{5}\xi_3 - \tfrac{1}{5}\xi_4, \tfrac{\tau^2}{5}\right) & m = 2,\\[1mm]
N\!\left(-\tfrac{1}{6}\xi_{m-2} + \tfrac{4}{6}\xi_{m-1} + \tfrac{4}{6}\xi_{m+1} - \tfrac{1}{6}\xi_{m+2}, \tfrac{\tau^2}{6}\right) & m = 3, \ldots, d-2,\\[1mm]
N\!\left(-\tfrac{1}{5}\xi_{d-3} + \tfrac{4}{5}\xi_{d-2} + \tfrac{2}{5}\xi_d, \tfrac{\tau^2}{5}\right) & m = d-1,\\[1mm]
N(-\xi_{d-2} + 2\xi_{d-1}, \tau^2) & m = d.
\end{cases}
\qquad (4.17)
\]

Again, these distributions (at least for the coefficients not affected by the boundaries) can be interpreted in a regression setting. Consider a regression of $\xi_m$ on the four nearest neighbors $\xi_{m-2}$, $\xi_{m-1}$, $\xi_{m+1}$ and $\xi_{m+2}$, where we assume a local quadratic influence. This corresponds to a linear regression model with design matrix

\[
X =
\begin{pmatrix}
1 & m-2 & (m-2)^2 \\
1 & m-1 & (m-1)^2 \\
1 & m+1 & (m+1)^2 \\
1 & m+2 & (m+2)^2
\end{pmatrix}
\]
and response vector $y = (\xi_{m-2}, \xi_{m-1}, \xi_{m+1}, \xi_{m+2})'$. If we predict $\xi_m$ from this regression model, we obtain exactly the same coefficients as in (4.17). Figure 4.4 illustrates the interpretation of (4.17) as a local quadratic fit.

Let us now take a closer look at the null space of the precision matrices of random walks. From (4.11) it is obvious that the penalization induced by first order differences remains unchanged when adding a constant to the parameter vector $\xi$. Equivalently, the prior distribution defined by a first order random walk remains unchanged when adding a constant. Therefore, the one-dimensional null space of the precision matrix (4.15) is spanned by a $d$-dimensional vector of ones. Similarly, a closer look at (4.12) reveals that adding a linear trend affects neither a second order difference penalty nor a second order random walk. Thorough considerations show that, in general, the null space of a P-spline precision matrix is spanned by polynomials of degree up to $k-1$, evaluated at the knots of the P-spline. More specifically, the column vectors of the matrix
\[
\begin{pmatrix}
1 & \kappa_1 & \ldots & \kappa_1^{k-1} \\
\vdots & \vdots & & \vdots \\
1 & \kappa_d & \ldots & \kappa_d^{k-1}
\end{pmatrix}
\qquad (4.18)
\]

form a basis of this null space. Equivalently, we may use any set of equidistant values instead of the knots. For instance, the column vectors of the matrix
\[
\begin{pmatrix}
1 & 1 & \ldots & 1^{k-1} \\
1 & 2 & \ldots & 2^{k-1} \\
\vdots & \vdots & & \vdots \\
1 & d & \ldots & d^{k-1}
\end{pmatrix}
\]

span the same null space as (4.18).
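The null space claim can again be checked numerically; the sketch below (illustrative) verifies that $K$ annihilates the columns of a polynomial basis of degree $k-1$ evaluated at equidistant positions $1, \ldots, d$.

import numpy as np

d = 10
pos = np.arange(1, d + 1, dtype=float)
for k in (1, 2, 3):
    D = np.diff(np.eye(d), n=k, axis=0)
    K = D.T @ D
    P = np.column_stack([pos ** j for j in range(k)])   # basis 1, m, ..., m^{k-1}
    print(f"k={k}:", np.allclose(K @ P, 0.0))            # True: columns lie in the null space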

Although we restricted the description of P-splines to the equidistant knot setting, they can also be used with nonequidistant knots by defining weighted random walks or weighted difference penalties. However, due to their construction, P-splines should be rather insensitive to the position and number of the knots as long as a sufficiently large number of knots is used. Nonequidistant knots may nevertheless lead to an improved fit if the distribution of the covariate $x$ over its domain differs significantly from a uniform distribution. In that case it can be useful to define the knots on the basis of quantiles of $x$ instead of equidistant values. Weighted random walks will be described in the following section.

One reason why P-splines are nowadays one of the most popular smoothing techniques is their low-rank property in the sense of Hastie (1996): P-splines allow a broad class of flexible functions to be described using only a moderate number of parameters. In contrast, smoothing splines require a number of parameters that approximately equals the number of observations, making direct simultaneous estimation of several smoothing splines numerically intractable even for a small number of functions $f_j(x_j)$. Iterative techniques like the backfitting algorithm have to be employed in such a setting. Based on P-splines, a large number of nonparametric effects can be estimated simultaneously. Moreover, routinely used regression diagnostics like the elements of the hat matrix, which are not available from iterative procedures, can easily be computed (see also Marx & Eilers 1998).