
Note that the estimation of individual-specific effects with ridge-penalized least squares base learners is a natural concept in analogy to Gaussian random effects in additive mixed models.

The quadratic form of the penalty corresponds to the log-density of Gaussian random effects priors from a Bayesian perspective. (This is for example clarified in Appendix A.2 of Hofner, 2011). As will be further pointed out in Section 5.3, the individual-specific effects of a STAQ model can be interpreted in analogy to the conditional view of random effects in additive mixed models.

[Figure]

Figure 4.2 Heteroscedastic data example with n = 200. Dashed black lines show the true conditional quantile curves for τ = 0.9; grey solid lines illustrate the stepwise boosting fit after every 300 iterations, beginning at the horizontal line. Left plot: starting value = 90% quantile (horizontal line); right plot: starting value = median (horizontal line).

Step length ν

Originally, the step-length factor ν ∈ (0, 1] was regarded as an additional tuning parameter of the boosting algorithm and optimized in every step of the algorithm (see, e.g., Friedman, 2001). Later it was established that this parameter is of only minor importance for the predictive accuracy of the estimators as long as ν is chosen “sufficiently small” (Bühlmann and Hothorn, 2007).

The step length and the optimal number of boosting iterations mstop trade off against each other: smaller step lengths result in more boosting iterations and vice versa. Thus, when one of these two parameters is fixed, an optimal choice has to be derived only for the remaining one.

Since mstop is easier to vary in practice, the step length ν is fixed at a small value, e.g., ν = 0.1, to ensure small steps and therefore weak base learners.

As was illustrated by Figure 4.1, the stepwise increments of the estimators can be very small in quantile regression when ν is set to 0.1, due to the binary character of the gradient residuals. This potentially results in a large number of boosting iterations. To avoid excessive computational effort, it can therefore make sense to fix ν at a value greater than 0.1 in the context of quantile regression, e.g., ν = 0.2 or ν = 0.4, as was done in our applications.

Note that multiplying ν by a constant c > 0 has the same impact on the estimation result as multiplying the original loss function (and its gradient) by c. For example, the standard loss function for median regression is the absolute value loss, i.e., L(y, η) = |y − η|, while the check function for τ = 0.5 is exactly half of this quantity, i.e., ρ0.5(y − η) = 0.5 |y − η|. Thus, quantile boosting with ν = 0.2 and τ = 0.5 is equivalent to boosting with the absolute value loss function and ν = 0.1.
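This equivalence can be verified numerically: a boosting update adds ν times (a fit to) the negative gradient, so scaling the loss by c and dividing ν by c leaves the update unchanged. The following sketch (the function names are ours, chosen for illustration) compares the raw update vectors for the two parameterizations.

```python
import numpy as np

def check_loss_neg_grad(y, eta, tau):
    """Negative gradient of the check loss rho_tau(y - eta):
    tau where y > eta, tau - 1 where y < eta (binary residuals)."""
    return np.where(y > eta, tau, tau - 1.0)

def abs_loss_neg_grad(y, eta):
    """Negative gradient of the absolute value loss |y - eta|: sign(y - eta)."""
    return np.sign(y - eta)

rng = np.random.default_rng(0)
y = rng.normal(size=10)
eta = np.zeros(10)

# Update with check loss (tau = 0.5) and nu = 0.2 ...
step_check = 0.2 * check_loss_neg_grad(y, eta, tau=0.5)
# ... equals the update with absolute value loss and nu = 0.1.
step_abs = 0.1 * abs_loss_neg_grad(y, eta)

assert np.allclose(step_check, step_abs)
```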

Degrees of freedom df(λd)

It is important to note that in the boosting algorithm the smoothing parameters λd > 0 of the penalized least squares base learners, d = 1, . . . , D, are not treated as hyperparameters to be optimized. This is one of the main differences between boosting and other penalized model approaches, where λ is often the major tuning parameter.

However, when specifying different degrees of freedom for different base learners, one would run the risk of a biased selection of base learners. A base learner with more degrees of freedom, i.e., less penalization, offers greater flexibility than a base learner with fewer degrees of freedom, i.e., more penalization, and therefore has a greater chance of being selected by the boosting algorithm.

To avoid this bias in the base learner selection, Kneib et al. (2009) and Hofner et al. (2011a) suggested fixing the initial degrees of freedom df(λd) at the same (small) value for all penalized base learners, for example at df(λd) = 1 for d = 1, . . . , D. This ensures that the complexity of each base learner is comparable. Since there is a direct relationship between the smoothing parameter λd and the degrees of freedom df(λd) of a base learner (Bühlmann and Yu, 2003), the smoothing parameters λd can be derived by solving the initial equation df(λd) = 1 for λd, d = 1, . . . , D (see Hofner et al., 2011a, Lemma 1, for technical details).
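In practice, solving df(λd) = 1 for λd amounts to a one-dimensional root search, since df(λ) decreases monotonically in λ. A minimal sketch, assuming a ridge-type base learner with an identity penalty matrix (the design and penalty here are illustrative, not from the source):

```python
import numpy as np

# Hypothetical base learner: design matrix X, penalty matrix K = identity.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
K = np.eye(5)

def df(lam):
    """Degrees of freedom tr(S_lambda) of the penalized least squares
    base learner with hat matrix S_lambda = X (X'X + lambda K)^{-1} X'."""
    S = X @ np.linalg.solve(X.T @ X + lam * K, X.T)
    return np.trace(S)

# df(lam) is monotonically decreasing, so bisection finds lam with df(lam) = 1.
lo, hi = 1e-8, 1e8
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if df(mid) > 1.0:
        lo = mid   # too little penalization: increase lambda
    else:
        hi = mid
lam_d = 0.5 * (lo + hi)
print(round(df(lam_d), 4))  # ≈ 1.0
```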

Regarding the degrees of freedom of a base learner, Kneib et al. (2009) proposed to use the standard definition from the smoothing literature. According to that definition, the degrees of freedom of a penalized least squares estimator are the trace of the hat matrix, i.e., df(λd) = tr(Sd), with the hat matrix of a base learner resulting from (4.4) on page 62. Soon afterwards, Hofner et al. (2011a) deduced the alternative df(λd) = tr(2Sd − Sd⊤Sd) and demonstrated why applying this definition in the boosting algorithm makes more sense than the classical one when the aim is an unbiased selection of base learners.
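The two definitions can be compared directly for a concrete smoother. For a symmetric hat matrix with eigenvalues s in (0, 1], we have 2s − s² = 1 − (1 − s)² ≥ s, so the alternative definition never assigns fewer degrees of freedom than the trace. A small check, again with an illustrative ridge-type base learner:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
lam = 50.0

# Hat matrix S_d of a ridge-penalized least squares base learner.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)

df_classical = np.trace(S)            # tr(S_d)
df_alternative = np.trace(2 * S - S.T @ S)  # tr(2 S_d - S_d' S_d)

# The alternative definition is at least as large as the classical one.
print(df_classical <= df_alternative)  # True
```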

Note that, due to the repeated selection of a base learner, the degree of smoothness of a penalized effect in the final model can be of higher order than the one imposed by the initial degrees of freedom (Bühlmann and Hothorn, 2007). In addition, different degrees of smoothness can be obtained for different functional effects as a result of different selection rates of the corresponding base learners.

Regarding nonlinear effects based on P-splines, the degrees of freedom of a smooth nonlinear effect cannot be made arbitrarily small, even for large smoothing parameters λd: with a difference penalty of order δ, a polynomial of order δ − 1 always remains unpenalized. For this reason, Kneib et al. (2009) suggested decomposing the nonlinear effect into a linear part and a nonlinear deviation, as was described in (4.5) on page 67. By splitting the complete effect into three base learners for intercept, linear part and nonlinear deviation, the degrees of freedom of each part can be set to one. In this context, Hofner et al. (2011a) advocated that the base learner of a categorical covariate should also be penalized to one degree of freedom.

Number of boosting iterations mstop

The number of boosting iterations mstop is the most important parameter of the boosting algorithm since it controls the variable selection and overfitting behaviour of the algorithm, including the amount of shrinkage and the smoothness of the estimators.

However, the danger of overfitting is in general relatively small for boosting algorithms when weak base learners with small degrees of freedom and small step lengths are used (Bühlmann and Hothorn, 2007). Stopping the boosting algorithm early enough (early stopping) is nevertheless crucial to induce shrinkage of the estimators towards zero. Shrinkage is desirable since shrunken estimates yield more accurate and stable predictions due to their reduced variance (see, e.g., Hastie et al., 2009). In addition, early stopping is important to exploit the inherent variable selection and model choice abilities of boosting (which we will further discuss in Section 4.4).

The optimal number of boosting iterations mstop for STAQ models can be determined by cross-validation techniques, such as k-fold cross-validation, bootstrap or subsampling. With each of these techniques, the data is split into two parts: a training sample and a test sample. Boosting estimation is then carried out on the training sample with a very large initial number of iterations, while the empirical risk is evaluated on the test sample (out-of-bag risk) for each boosting iteration. The optimal mstop finally arises as the minimizer of the aggregated empirical out-of-bag risks.
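The procedure can be sketched for a single subsampling split: fit a simple quantile boosting model (here with one linear base learner and a quantile offset as starting value, both illustrative choices, not the mboost implementation) on the training half, track the out-of-bag check-loss risk per iteration, and take its minimizer as mstop.

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau, nu = 400, 0.9, 0.4
x = rng.uniform(0, 5, size=n)
y = 2 * x + (1 + x) * rng.normal(size=n)     # heteroscedastic toy data

train = rng.random(n) < 0.5                  # subsampling split
xt, yt, xo, yo = x[train], y[train], x[~train], y[~train]

def check_risk(y, eta, tau):
    """Empirical check-loss risk rho_tau averaged over the sample."""
    u = y - eta
    return np.mean(u * (tau - (u < 0)))

# Starting value (offset): the empirical tau-quantile of the training sample.
eta_t = np.full(xt.size, np.quantile(yt, tau))
eta_o = np.full(xo.size, np.quantile(yt, tau))

oob = []
for m in range(1000):
    grad = np.where(yt > eta_t, tau, tau - 1.0)   # negative gradient residuals
    beta = np.sum(xt * grad) / np.sum(xt * xt)    # least squares base learner fit
    eta_t += nu * beta * xt                       # update training predictor
    eta_o += nu * beta * xo                       # update out-of-bag predictor
    oob.append(check_risk(yo, eta_o, tau))

m_stop = int(np.argmin(oob)) + 1                  # iteration with minimal OOB risk
```

In practice one would aggregate the out-of-bag risk curves over many splits before taking the minimizer, rather than relying on a single split as done here.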

To save computational effort, Mayr et al. (2012b) recently proposed a sequential and fully data-driven approach for the search of the optimal mstop. This approach also avoids that the user has to specify the initial number of boosting iterations.