

4.1.2. Stepwise selection

Different stepwise selection procedures were considered as they are widely used in LiDAR applications (e.g., Næsset, 2002; Gobakken et al., 2012).

In stepwise selection a subset of the $J$ available variables is selected from the full set of covariates following statistical selection criteria. One can distinguish between three different approaches: (a) forward stepwise selection, (b) backward stepwise selection, and (c) a hybrid approach.

Forward stepwise regression starts with a model containing no covariates. Then variables are added sequentially. At each step the covariate that provides the greatest additional improvement to the fit is added (James et al., 2013). Forward stepwise selection is a so-called greedy algorithm producing a nested sequence of models (Hastie et al., 2009).

Different criteria such as the AIC or BIC (see below) are used to evaluate improvement (or decline) in model performance. In forward selection the procedure terminates when no further improvement is possible (Fahrmeir et al., 2013).
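
To make the search concrete, the following is a minimal sketch of greedy forward selection driven by the OLS form of the AIC given below. The function names (`ols_aic`, `forward_stepwise`) and the simulated toy data are illustrative assumptions and not taken from the study; NumPy is assumed.

```python
import numpy as np

def ols_aic(y, X):
    """AIC of an OLS fit in the form n*log(s2_e) + 2J, J = number of coefficients."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2_e = e @ e / (n - 1)                     # residual variance as defined in Chapter 3
    return n * np.log(s2_e) + 2 * J

def forward_stepwise(y, X_full):
    """Greedy forward selection: at each step add the covariate that lowers the AIC most."""
    n, p = X_full.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best_aic = ols_aic(y, intercept)           # start from the intercept-only model
    while remaining:
        scores = [(ols_aic(y, np.hstack([intercept, X_full[:, selected + [j]]])), j)
                  for j in remaining]
        aic, j_best = min(scores)
        if aic >= best_aic:                    # no further improvement -> terminate
            break
        best_aic = aic
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_aic

# purely illustrative toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 3] + rng.normal(size=100)
print(forward_stepwise(y_toy, X_toy))
```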

An alternative to forward stepwise selection is backward stepwise selection, or backward elimination. Here, the full model containing all potential covariates is considered first. At each step the covariate whose removal leads to the greatest improvement of model performance is eliminated. The procedure terminates when no further improvement is possible.

A hybrid stepwise selection approach is a combination of both forward and backward selection. Here, variables are added to the model at each iteration. However, after adding a covariate the algorithm may also remove a covariate that no longer provides an improvement of model performance. In this study the hybrid approach was applied, as it best mimics best-subset selection (James et al., 2013). In best-subset selection all $2^J$ possible models are fitted separately to the data and the best model, according to a predefined criterion, is selected. However, because of the large number of iterations in the simulation studies, best-subset selection was computationally prohibitive when the full set of covariates was considered.
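
For comparison, the sketch below enumerates all $2^J$ candidate models and scores each with the AIC; it illustrates why the exhaustive search becomes prohibitive once the full covariate set and the many simulation iterations are involved. Again, the names and toy data are illustrative assumptions, not part of the study.

```python
from itertools import combinations
import numpy as np

def ols_aic(y, X):
    """Same OLS form of the AIC as in the previous sketch."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return n * np.log(e @ e / (n - 1)) + 2 * J

def best_subset(y, X_full, criterion=ols_aic):
    """Exhaustive search over all 2**J subsets of the columns of X_full."""
    n, p = X_full.shape
    intercept = np.ones((n, 1))
    best = (np.inf, ())
    for k in range(p + 1):                                  # subset sizes 0..J
        for subset in combinations(range(p), k):
            X = np.hstack([intercept, X_full[:, list(subset)]])
            best = min(best, (criterion(y, X), subset))
    return best

# purely illustrative: J = 6 already requires 2**6 = 64 separate model fits
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 6))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 3] + rng.normal(size=100)
print(best_subset(y_toy, X_toy))
```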

Akaike Information Criterion (AIC)

For stepwise selection algorithms a criterion needs to be defined that determines whether adding (or removing) a covariate leads to an improvement (or decline) in model performance. The first criterion used in this study was the Akaike Information Criterion (AIC), defined as

$$\mathrm{AIC} = -2\log(L) + 2J \qquad (4.1)$$

where $J$ is the number of coefficients in the model, and $\log(L)$ is the logarithm of the maximized likelihood function of the estimated model. For OLS the AIC can alternatively be expressed as

$$\mathrm{AIC} = n\log(s_e^2) + 2J,$$

where (as defined in Chapter 3)

$$s_e^2 = (n-1)^{-1}\sum_{k \in S} e_k^2.$$
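
Assuming i.i.d. Gaussian errors, the likelihood form (4.1) and the OLS form above differ only by an additive term that depends on $n$ but not on which covariates enter the model, so the two forms rank candidate models identically. The short sketch below (illustrative names, NumPy assumed) verifies this numerically.

```python
import numpy as np

def gaussian_loglik(y, X):
    """Maximised Gaussian log-likelihood of an OLS fit (ML variance estimate RSS/n)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def aic_loglik(y, X):
    """AIC from the likelihood, eq. (4.1)."""
    return -2 * gaussian_loglik(y, X) + 2 * X.shape[1]

def aic_ols(y, X):
    """OLS form of the AIC with s2_e = RSS / (n - 1)."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2_e = np.sum((y - X @ beta) ** 2) / (n - 1)
    return n * np.log(s2_e) + 2 * J

# The difference between the two forms depends only on n, not on the selected
# covariates, so both versions order candidate models identically:
rng = np.random.default_rng(2)
X_toy = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y_toy = X_toy[:, 1] + rng.normal(size=50)
for cols in ([0, 1], [0, 1, 2, 3]):
    Xs = X_toy[:, cols]
    print(aic_loglik(y_toy, Xs) - aic_ols(y_toy, Xs))       # same value for both subsets
```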

Corrected Akaike Information Criterion (AICc)

The term $2J$ in equation (4.1) penalizes model complexity. However, the AIC does not necessarily lead to the most parsimonious model, and there is a risk of overfitting (Claeskens & Hjort, 2008). A corrected version of the AIC is the AICc proposed by Hurvich & Tsai (1989). The AICc is defined as

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2J(J+1)}{n - J - 1}. \qquad (4.2)$$

The AICc puts a stronger penalty on the number of parameters in the model and has been recommended for small $n$ (Burnham & Anderson, 2002). The same authors suggested always employing the AICc instead of the AIC, as the former converges to the latter when $n$ gets sufficiently large.
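
A one-line implementation of the correction in (4.2) is sketched below, together with a small loop showing that the correction term vanishes as $n$ grows relative to $J$ (illustrative only).

```python
def aicc(aic, n, J):
    """Corrected AIC, eq. (4.2): AIC plus 2J(J+1)/(n - J - 1)."""
    return aic + 2 * J * (J + 1) / (n - J - 1)

# The correction term shrinks towards zero as n grows relative to J,
# so the AICc converges to the AIC for large samples:
for n in (30, 100, 1000):
    print(n, aicc(aic=0.0, n=n, J=5))    # 2.5, ~0.64, ~0.06
```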

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) developed by Schwarz (1978) is closely related to the AIC. The only difference is that for the BIC the number of covariates that enter the model is multiplied by log(n) instead of 2. The BIC is defined as

$$\mathrm{BIC} = -2\log(L) + \log(n)\,J. \qquad (4.3)$$

The BIC, like the AICc, thus puts a stronger penalty on the number of covariates than the AIC, since $\log(n)$ exceeds 2 for all but very small samples.
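
A corresponding sketch of (4.3) is given below; the closing comment records when the BIC penalty per coefficient exceeds the AIC penalty of 2 (NumPy assumed, names illustrative).

```python
import numpy as np

def bic(loglik, n, J):
    """Bayesian Information Criterion, eq. (4.3)."""
    return -2 * loglik + np.log(n) * J

# log(n) exceeds 2 once n > e**2 (about 8 observations), so for any realistic
# sample size the BIC penalises each additional coefficient more than the AIC.
```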

AIC and variance inflation factor (VIF)

When the aim of using a model is prediction only, multicollinearity is often of minor concern (Burnham & Anderson, 2002). However, as has been noted in Section 3.3 in Chapter 3, multicollinearity may affect estimates of precision for some variance estimators. Therefore, variance inflation factors (VIFs) were computed for the sampled data in order to identify highly correlated covariates. The VIF is defined as (Fahrmeir et al., 2013)

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (4.4)$$

where $R_j^2$ is obtained from a regression of $x_j$ onto all other covariates.

In this study, the VIF was first calculated for each covariate using the full model. If any $\mathrm{VIF}_j$ was greater than or equal to 10, the variable with the highest VIF was removed. Next, the model was refitted and the VIFs were computed again for all covariates that remained in the model. The procedure was repeated until no covariate with a VIF of 10 or more remained in the model.

Since the above procedure does not remove variables that are not related to the target variable, the final model was selected using a stepwise procedure based on the AIC (as described above).
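
The screening loop described above might look as follows. This is a sketch under simplifying assumptions: function names are illustrative, NumPy is assumed, and numerical safeguards for exactly collinear columns are omitted. The AIC-based stepwise search would then be run on the covariates that survive the pruning (e.g., with a stepwise routine such as the forward-selection sketch shown earlier).

```python
import numpy as np

def vifs(X):
    """VIF of each column of X, eq. (4.4): regress x_j on all other covariates."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def prune_by_vif(X, names, threshold=10.0):
    """Iteratively drop the covariate with the largest VIF until all VIFs fall below the threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        v = vifs(X)
        if v.max() < threshold:
            break
        drop = int(np.argmax(v))             # covariate with the highest VIF
        X = np.delete(X, drop, axis=1)
        del names[drop]
    return X, names
```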

Best-subset selection and variance inflation factor

The VIF procedure was also combined with best-subset selection. First, the VIF was used to remove highly correlated variables (as described above), and afterwards best-subset selection was used to choose the best model from the set of candidate models.

Mallows' $C_p$ statistic was used to identify the final model. The $C_p$ statistic is computed as

$$C_p = n^{-1}\left(\sum_{k \in S} [y_k - \hat{y}_k]^2 + 2J\,s_e^2\right).$$

Like the AIC, Mallows' $C_p$ puts a penalty on the number of variables that enter the model.
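
A sketch of the $C_p$ computation as written above is given below. Note one assumption: the residual variance $s_e^2$ is taken here to come from the full model containing all remaining covariates, which is the usual convention for Mallows' $C_p$; names are illustrative and NumPy is assumed.

```python
import numpy as np

def mallows_cp(y, X, s2_e):
    """Mallows' Cp as written above: n^{-1} * (RSS of the candidate model + 2*J*s2_e)."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return (rss + 2 * J * s2_e) / n

# s2_e would typically be the residual variance (n-1)^{-1} * RSS of the full
# model (Chapter 3); candidate models from best-subset selection are then
# ranked by their Cp values and the model with the smallest Cp is retained.
```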

Condition number

Silva & Skinner (1997, page 26) propose the following variable selection procedure (which is a modification of the procedure originally developed by Bankier et al. (1992)):

1. Compute the cross-products matrix $\mathbf{CP} = \mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$ considering all the columns initially available (saturated subset).

2. Compute the Hermite canonical form of $\mathbf{CP}$, say $\mathbf{H}$ (see Rao (1973, page 18)), and check for singularity by looking at the diagonal elements of $\mathbf{H}$. Any zero diagonal elements in $\mathbf{H}$ indicate that the corresponding columns of $\mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$ (and $\mathbf{X}^{*}_{kS}$) are linearly dependent on other columns (see Rao (1973, page 27)). Each of these columns is eliminated by deleting the corresponding rows and columns from $\mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$.

3. After removing any linearly dependent columns, the condition number $c = \lambda_{\max}/\lambda_{\min}$ of the reduced $\mathbf{CP}$ matrix is computed, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest of the eigenvalues of $\mathbf{CP}$, respectively. If $c < L$, a specified value, stop and use all the auxiliary variables remaining.

4. Otherwise perform backward elimination as follows. For every $k$, drop the $k$th row and column from $\mathbf{CP}$, and recompute the eigenvalues and the condition number of the reduced matrix. Compute the condition number reductions $r_k = c - c_k$, where $c_k$ is the condition number after dropping the $k$th row and column from $\mathbf{CP}$. Determine $r_{\max} = \max_k(r_k)$ and $k_{\max} = \{k : r_k = r_{\max}\}$ and eliminate the column $k_{\max}$ by deleting the $k_{\max}$th row and column from $\mathbf{CP}$. Set $c = c_{k_{\max}}$ and iterate while $c \geq L$ and $q > 2$, starting each new iteration with the reduced $\mathbf{CP}$ matrix resulting from the previous one.

Note that $\mathbf{X}^{*}_{kS}$ above is similar to $\mathbf{X}_{kS}$ (defined in 3.30) except that a vector of 1's of length $n$ was added as a first column. For $L$, a value of 30 was chosen. A sketch of steps 3 and 4 of this procedure is given below.
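
The sketch below implements steps 3 and 4 of the procedure; step 2, the removal of exactly collinear columns via the Hermite canonical form, is assumed to have been carried out beforehand. Function names and the toy data are illustrative, NumPy is assumed, and the loop stops when the condition number of the reduced cross-products matrix falls below $L = 30$ or only two columns remain.

```python
import numpy as np

def condition_number(cp):
    """Ratio of the largest to the smallest eigenvalue of a symmetric cross-products matrix."""
    eig = np.linalg.eigvalsh(cp)                 # eigenvalues in ascending order
    return eig[-1] / eig[0]

def condition_number_selection(X_star, L=30.0):
    """Backward elimination on CP = X*'X* (steps 3 and 4 of the procedure above)."""
    cp = X_star.T @ X_star
    keep = list(range(cp.shape[0]))              # indices of the remaining columns
    c = condition_number(cp)
    while c >= L and len(keep) > 2:
        # condition number reduction r_k from dropping each remaining row/column k
        cands = []
        for i in range(len(keep)):
            reduced = np.delete(np.delete(cp, i, axis=0), i, axis=1)
            cands.append((c - condition_number(reduced), i))
        _, i_max = max(cands)                    # column giving the largest reduction
        cp = np.delete(np.delete(cp, i_max, axis=0), i_max, axis=1)
        del keep[i_max]
        c = condition_number(cp)
    return keep, c

# purely illustrative example: a design matrix with a leading column of 1's
# and one nearly collinear covariate that should be eliminated
rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 5))
X_star = np.column_stack([np.ones(100), Z, Z[:, 0] + 0.01 * rng.normal(size=100)])
print(condition_number_selection(X_star, L=30.0))
```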