
Component-Wise Functional Gradient Descent Boosting

If we interpret $\boldsymbol{\eta} = (\eta_1, \dots, \eta_n)^t = (\eta(\tilde{x}_1), \dots, \eta(\tilde{x}_n))^t$ as an $n$-dimensional parameter vector obtained by applying the additive predictor $\eta(\cdot)$ to the data points $\tilde{x}_i$ [30, 35], problem (3.2) can be seen as searching for the minimizing vector of parameters

$$\hat{\boldsymbol{\eta}} = \operatorname*{argmin}_{\boldsymbol{\eta}} \sum_{i=1}^{n} \rho(y_i, \eta_i). \qquad (3.3)$$

In each iteration $m$, the negative gradient of the loss function evaluated at the current parameter vector $\hat{\eta}^{[m-1]}(\tilde{x}_i)$ (the estimate obtained in the previous iteration) is derived. This results in an $n \times 1$ dimensional gradient vector $\boldsymbol{u}^{[m]} = (u_1^{[m]}, \dots, u_n^{[m]})$ with entries

$$u_i^{[m]} = -\left.\frac{\partial}{\partial \eta}\,\rho(y_i, \eta)\right|_{\eta = \hat{\eta}^{[m-1]}(\tilde{x}_i)} \qquad (3.4)$$


for $i = 1, \dots, n$. The estimate of the additive predictor is initialized by a starting value $\hat{\boldsymbol{\eta}} = \hat{\boldsymbol{\eta}}^{[0]}$, such as $\hat{\eta}_i^{[0]} = 0$ for $i = 1, \dots, n$ [35], and updated in each iteration according to the steepest descent of the loss function. In gradient descent boosting, this update is given by

$$\hat{\boldsymbol{\eta}}^{[m]} = \hat{\boldsymbol{\eta}}^{[m-1]} + \nu^{[m]} \boldsymbol{u}^{[m]}, \qquad (3.5)$$

where $\nu^{[m]}$ denotes the step length for the update in iteration $m$. A suitable choice will be discussed below.
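For concreteness, the parameter-space update (3.5) can be sketched in a few lines; the squared error loss and all function names below are illustrative assumptions, not taken from the text or from any particular software package:

```python
import numpy as np

def negative_gradient(y, eta):
    """Negative gradient of the squared error loss rho(y, eta) = (y - eta)^2,
    evaluated element-wise at the current estimate (cf. the entries in (3.4))."""
    return 2.0 * (y - eta)

def gradient_descent_boost(y, m_stop=100, nu=0.1):
    """Plain gradient descent boosting in parameter space: the n-dimensional
    vector eta_hat is moved along the steepest descent direction of the
    empirical loss in every iteration, as in update (3.5)."""
    eta_hat = np.zeros_like(y, dtype=float)   # starting value eta^[0] = 0
    for m in range(m_stop):
        u = negative_gradient(y, eta_hat)     # gradient vector u^[m]
        eta_hat = eta_hat + nu * u            # update (3.5) with fixed step length
    return eta_hat
```

Note that without the restriction to base-learners introduced next, this update merely drives $\hat{\boldsymbol{\eta}}$ towards the observations themselves.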

Functional Gradient Descent:The updating step (3.5) may be adjusted such that it makes use of the steepest descent direction, but at the same time connects the update to the desired class of functions of the covariates given by the base-learners [36]. This link is established by fitting a base-learner to the negative gradient of the loss function, e.g. via least squares estimation [30]. The result is a constrained estimate

$$\hat{\boldsymbol{u}}^{[m]} = (\hat{u}_1^{[m]}, \dots, \hat{u}_n^{[m]}) = (\hat{f}^{[m]}(\tilde{x}_1), \dots, \hat{f}^{[m]}(\tilde{x}_n)) = \hat{\boldsymbol{f}}^{[m]} \qquad (3.6)$$

of the steepest descent direction, in which $\hat{f}(\cdot)$ denotes the fitted base-learner. By making use of $\hat{\boldsymbol{f}}^{[m]}$ instead of the negative gradient $\boldsymbol{u}^{[m]}$ directly, the update in iteration $m$ is changed to

$$\hat{\boldsymbol{\eta}}^{[m]} = \hat{\boldsymbol{\eta}}^{[m-1]} + \nu^{[m]} \hat{\boldsymbol{f}}^{[m]}, \qquad (3.7)$$

which is known as functional gradient boosting.
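A single functional gradient step can be sketched accordingly; the simple linear base-learner fitted by ordinary least squares is an illustrative assumption (any base-learner from the considered class could take its place):

```python
import numpy as np

def fit_base_learner(x, u):
    """Least squares fit of a simple linear base-learner f(x) = a + b*x to the
    negative gradient u; returns the fitted values, i.e. the constrained
    estimate of the steepest descent direction (cf. (3.6))."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, u, rcond=None)
    return X @ coef

def functional_gradient_step(y, x, eta_hat, nu=0.1):
    """One iteration of functional gradient boosting, update (3.7),
    assuming the squared error loss."""
    u = 2.0 * (y - eta_hat)                        # negative gradient
    f_hat = fit_base_learner(x, u)                 # fitted base-learner values
    return eta_hat + nu * f_hat
```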

Component-Wise Boosting: A single function $f(\cdot)$ was considered above to estimate the negative gradient of the loss function. However, the inclusion of multiple base-learners in functional gradient boosting is possible and often desired, as it allows for a component-wise approach facilitating variable selection. Bühlmann and Yu introduced the concept of component-wise functional gradient boosting [37]. It differs from the approach outlined above in that it fits each base-learner $f_j(\cdot)$ separately to the negative gradient. This results in estimates $\hat{\boldsymbol{f}}_j$, $j = 1, \dots, J$. The best-fitting base-learner $\hat{\boldsymbol{f}}_{j^*}$ is determined via

$$j^* = \operatorname*{argmin}_{1 \le j \le J} \sum_{i=1}^{n} \left( u_i^{[m]} - \hat{f}_j(\tilde{x}_i) \right)^2 \qquad (3.8)$$

as the one minimizing the residual sum of squares. In each iteration, the identified $\hat{\boldsymbol{f}}_{j^*}$ is added to the current estimate of the additive predictor according to

$$\hat{\boldsymbol{\eta}}^{[m]} = \hat{\boldsymbol{\eta}}^{[m-1]} + \nu^{[m]} \hat{\boldsymbol{f}}_{j^*}^{[m]} \qquad (3.9)$$


in a stagewise fashion, leaving previously added function estimates unchanged [30]. In each iteration, a single base-learner, multiplied by the step length, is incorporated into the model.

However, repeated selection of the same base-learner is possible and leads to an increased weight of the corresponding function in the estimate of $\hat{\eta}(\cdot)$. Thus, the final additive predictor is a weighted sum over all base-learners selected in at least one iteration.
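The complete component-wise procedure can be sketched as follows; one simple linear base-learner per covariate, the squared error loss, and centred, non-constant covariates are assumptions made purely for illustration:

```python
import numpy as np

def componentwise_boost(y, X, m_stop=250, nu=0.1):
    """Component-wise functional gradient boosting with one simple linear
    base-learner per (centred) column of X. In every iteration the base-learner
    with the smallest residual sum of squares with respect to the negative
    gradient is selected, as in (3.8), and added to the additive predictor
    in a stagewise fashion, as in (3.9)."""
    n, p = X.shape
    eta_hat = np.zeros(n)                # eta^[0] = 0
    coefs = np.zeros(p)                  # accumulated (shrunken) coefficients
    for m in range(m_stop):
        u = 2.0 * (y - eta_hat)          # negative gradient, squared error loss
        best = (np.inf, None, None, None)
        for j in range(p):               # fit every base-learner separately
            beta_j = X[:, j] @ u / (X[:, j] @ X[:, j])
            fit_j = beta_j * X[:, j]
            rss_j = np.sum((u - fit_j) ** 2)
            if rss_j < best[0]:
                best = (rss_j, j, fit_j, beta_j)
        _, j_star, fit_star, beta_star = best
        eta_hat += nu * fit_star         # stagewise update (3.9)
        coefs[j_star] += nu * beta_star  # repeated selection accumulates weight
    return eta_hat, coefs
```

Because previously added contributions are never revisited, the coefficient of a base-learner selected in several iterations simply accumulates, mirroring the weighted sum described above.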

Since different base-learners typically depend on differing subsets of the considered variables, not selecting a particular base-learner indicates the exclusion of the respective variables from the model. Hence, stopping the procedure sufficiently (but not too) early automatically leads to variable selection. The algorithm then returns a prediction model for the trait of interest and simultaneously identifies the most influential variables during model estimation [34].

Choice of Parameters: The maximum number of iterations, $m_{stop}$, is an important tuning parameter of the algorithm. Additional iterations usually decrease the training risk; however, this may lead to overfitting [30]. This phenomenon occurs if the training data are fitted to such an extent that the resulting predictor performs poorly on new observations. A well-advised choice of $m_{stop}$ is therefore crucial to prevent overfitting [34]. An optimal value for $m_{stop}$ may be determined for a single dataset by means of cross-validation techniques. Here, the data are repeatedly divided into training and test samples, which are then used to fit (training data) and validate (test data) the model. The optimal $m_{stop}$ is the value leading to the lowest empirical risk on the held-out data [34].
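A rough sketch of such a scheme, reusing the componentwise_boost function sketched above and treating the candidate grid, the number of folds, and the use of the mean squared error as assumptions:

```python
import numpy as np

def cv_mstop(y, X, candidates=(50, 100, 250, 500), n_folds=5, nu=0.1):
    """Select m_stop by k-fold cross-validation: fit on the training folds,
    evaluate the empirical risk (here the mean squared error) on the held-out
    fold, and return the candidate with the lowest average test risk."""
    n = len(y)
    folds = np.array_split(np.random.permutation(n), n_folds)
    mean_risks = []
    for m_stop in candidates:
        fold_risks = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            _, coefs = componentwise_boost(y[train_idx], X[train_idx], m_stop, nu)
            pred = X[test_idx] @ coefs            # linear base-learners only
            fold_risks.append(np.mean((y[test_idx] - pred) ** 2))
        mean_risks.append(np.mean(fold_risks))
    return candidates[int(np.argmin(mean_risks))]
```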

The number of iterations is influenced by the step length $\nu^{[m]}$ employed in the updating step of the algorithm. For $0 < \nu < 1$, the step length is a shrinkage factor scaling the contribution of each incorporated base-learner [30]. One way to derive an appropriate value for $\nu^{[m]}$ in a gradient descent approach is to define it as the minimizer

$$\nu^{[m]} = \operatorname*{argmin}_{\nu}\ \rho\!\left(\boldsymbol{y},\, \hat{\boldsymbol{\eta}}^{[m-1]} + \nu\, \boldsymbol{u}^{[m]}\right) \qquad (3.10)$$

in each iteration step. The step length $\nu^{[m]}$ can be understood as the learning rate of the procedure.

It has been found empirically that smaller values ($\nu \le 0.1$) are favourable, as they improve the algorithm's performance considerably compared to no shrinkage ($\nu = 1$) [30, 36].

Decreasing the step length, however, leads to a higher number of iterations and thus increases the computational burden of the algorithm. In practice, (3.10) does not have to be solved in every iteration. Instead, a small constant may be chosen for $\nu$; a useful default is $\nu = 0.1$ [34].
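Should one nevertheless want to solve (3.10) explicitly, a one-dimensional line search per iteration suffices; the use of scipy's bounded scalar minimizer, the squared error loss, and the restriction of $\nu$ to $[0, 1]$ are assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def step_length(y, eta_hat, u):
    """Line search for nu^[m] as in (3.10): minimize the empirical squared
    error loss of the proposed update eta_hat + nu * u over the step length nu."""
    empirical_risk = lambda nu: np.sum((y - (eta_hat + nu * u)) ** 2)
    result = minimize_scalar(empirical_risk, bounds=(0.0, 1.0), method="bounded")
    return result.x
```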

Data Focus: Boosting algorithms, as mentioned above, focus on the observations that are most difficult to classify. In traditional classification algorithms, such as AdaBoost, the data are reweighted in every step: previously misclassified observations are upweighted, while correctly classified ones are downweighted, thereby iteratively assigning more influence to the difficult observations [30].

Gradient descent boosting implicitly shifts the focus to the more challenging measurements by considering the gradient of the loss function instead. This may be regarded as fitting the errors of the previous iteration [34] and is best seen by looking at an exemplary loss function, such as the commonly used squared error loss $\rho(y, \eta(x)) = (y - \eta(x))^2$. Here, the negative gradient equals $2(y - \hat{\eta}(x))$, essentially leading to a refitting of the residuals.
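Writing this out for the squared error loss (keeping the factor 2 used in the text rather than the loss $\tfrac{1}{2}(y - \eta(x))^2$, which would absorb it) gives

$$-\frac{\partial}{\partial \eta}\,\rho(y, \eta) \;=\; -\frac{\partial}{\partial \eta}\,(y - \eta)^2 \;=\; 2\,(y - \eta),$$

so that, evaluated at $\eta = \hat{\eta}^{[m-1]}(x)$, each entry of the negative gradient vector is exactly twice the current residual.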