

4.1.2. Stepwise selection

Different stepwise selection procedures were considered as they are widely used in LiDAR applications (e.g., Næsset, 2002; Gobakken et al., 2012).

In stepwise selection a subset of the $J$ available variables is selected from the full set of covariates following statistical selection criteria. One can distinguish between three different approaches: (a) forward stepwise selection, (b) backward stepwise selection, and (c) a hybrid approach.

Forward stepwise regression starts with a model containing no covariates. Then variables are added sequentially. At each step the covariate that provides the greatest additional improvement to the fit is added (James et al., 2013). Forward stepwise selection is a so-called greedy algorithm producing a nested sequence of models (Hastie et al., 2009).

Different criteria such as the AIC or BIC (see below) are used to evaluate improvement (or decline) in model performance. In forward selection the procedure terminates when no further improvement is possible (Fahrmeir et al., 2013).
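
To make the search concrete, the following is a minimal sketch of greedy forward selection driven by the OLS form of the AIC given below. The function names (`ols_aic`, `forward_stepwise`) and the simulated toy data are illustrative assumptions and not taken from the study; NumPy is assumed.

```python
import numpy as np

def ols_aic(y, X):
    """AIC of an OLS fit in the form n*log(s2_e) + 2J, J = number of coefficients."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2_e = e @ e / (n - 1)                     # residual variance as defined in Chapter 3
    return n * np.log(s2_e) + 2 * J

def forward_stepwise(y, X_full):
    """Greedy forward selection: at each step add the covariate that lowers the AIC most."""
    n, p = X_full.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    best_aic = ols_aic(y, intercept)           # start from the intercept-only model
    while remaining:
        scores = [(ols_aic(y, np.hstack([intercept, X_full[:, selected + [j]]])), j)
                  for j in remaining]
        aic, j_best = min(scores)
        if aic >= best_aic:                    # no further improvement -> terminate
            break
        best_aic = aic
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_aic

# purely illustrative toy data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 6))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 3] + rng.normal(size=100)
print(forward_stepwise(y_toy, X_toy))
```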

An alternative to forward stepwise selection is backward stepwise selection, or backward elimination. Here, the full model containing all potential covariates is considered first. At each step the covariate whose removal leads to the greatest improvement of model performance is eliminated. The procedure terminates when no further improvement is possible.

A hybrid stepwise selection approach is a combination of both forward and backward selection. Here, variables are added to the model at each iteration. However, after adding a covariate the algorithm may also remove a covariate that no longer provides an improvement of model performance. In this study the hybrid approach was applied, as it best mimics best-subset selection (James et al., 2013). In best-subset selection all $2^J$ possible models are fitted separately to the data and the best model, according to a predefined criterion, is selected. However, because of the large number of iterations in the simulation studies, best-subset selection was computationally prohibitive when the full set of covariates was considered.
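
For comparison, the sketch below enumerates all $2^J$ candidate models and scores each with the AIC; it illustrates why the exhaustive search becomes prohibitive once the full covariate set and the many simulation iterations are involved. Again, the names and toy data are illustrative assumptions, not part of the study.

```python
from itertools import combinations
import numpy as np

def ols_aic(y, X):
    """Same OLS form of the AIC as in the previous sketch."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return n * np.log(e @ e / (n - 1)) + 2 * J

def best_subset(y, X_full, criterion=ols_aic):
    """Exhaustive search over all 2**J subsets of the columns of X_full."""
    n, p = X_full.shape
    intercept = np.ones((n, 1))
    best = (np.inf, ())
    for k in range(p + 1):                                  # subset sizes 0..J
        for subset in combinations(range(p), k):
            X = np.hstack([intercept, X_full[:, list(subset)]])
            best = min(best, (criterion(y, X), subset))
    return best

# purely illustrative: J = 6 already requires 2**6 = 64 separate model fits
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 6))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 3] + rng.normal(size=100)
print(best_subset(y_toy, X_toy))
```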

Akaike Information Criterion (AIC)

For stepwise selection algorithms a criterion needs to be defined that determines whether adding (or removing) a covariate leads to an improvement (or decline) in model performance. The first criterion used in this study was the Akaike Information Criterion (AIC), defined as

$$\mathrm{AIC} = -2\log(L) + 2J \qquad (4.1)$$

where $J$ is the number of coefficients in the model, and $\log(L)$ is the logarithm of the maximized likelihood function of the estimated model. For OLS the AIC can alternatively be expressed as

$$\mathrm{AIC} = n\log(s_e^2) + 2J,$$

where (as defined in Chapter 3)

$$s_e^2 = (n-1)^{-1}\sum_{k \in S} e_k^2.$$
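
Assuming i.i.d. Gaussian errors, the likelihood form (4.1) and the OLS form above differ only by an additive term that depends on $n$ but not on which covariates enter the model, so the two forms rank candidate models identically. The short sketch below (illustrative names, NumPy assumed) verifies this numerically.

```python
import numpy as np

def gaussian_loglik(y, X):
    """Maximised Gaussian log-likelihood of an OLS fit (ML variance estimate RSS/n)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def aic_loglik(y, X):
    """AIC from the likelihood, eq. (4.1)."""
    return -2 * gaussian_loglik(y, X) + 2 * X.shape[1]

def aic_ols(y, X):
    """OLS form of the AIC with s2_e = RSS / (n - 1)."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2_e = np.sum((y - X @ beta) ** 2) / (n - 1)
    return n * np.log(s2_e) + 2 * J

# The difference between the two forms depends only on n, not on the selected
# covariates, so both versions order candidate models identically:
rng = np.random.default_rng(2)
X_toy = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y_toy = X_toy[:, 1] + rng.normal(size=50)
for cols in ([0, 1], [0, 1, 2, 3]):
    Xs = X_toy[:, cols]
    print(aic_loglik(y_toy, Xs) - aic_ols(y_toy, Xs))       # same value for both subsets
```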

Corrected Akaike Information Criterion (AICc)

The term $2J$ in equation (4.1) penalizes model complexity. However, the AIC does not necessarily lead to the most parsimonious model, and there is a risk of overfitting (Claeskens & Hjort, 2008). A corrected version of the AIC is the AICc proposed by Hurvich & Tsai (1989). The AICc is defined as

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2J(J+1)}{n - J - 1}. \qquad (4.2)$$

The AICc puts a stronger penalty on the number of parameters in the model and has been recommended for small $n$ (Burnham & Anderson, 2002). The same authors suggested always employing the AICc instead of the AIC, as the former converges to the latter when $n$ gets sufficiently large.
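
A one-line implementation of the correction in (4.2) is sketched below, together with a small loop showing that the correction term vanishes as $n$ grows relative to $J$ (illustrative only).

```python
def aicc(aic, n, J):
    """Corrected AIC, eq. (4.2): AIC plus 2J(J+1)/(n - J - 1)."""
    return aic + 2 * J * (J + 1) / (n - J - 1)

# The correction term shrinks towards zero as n grows relative to J,
# so the AICc converges to the AIC for large samples:
for n in (30, 100, 1000):
    print(n, aicc(aic=0.0, n=n, J=5))    # 2.5, ~0.64, ~0.06
```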

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) developed by Schwarz (1978) is closely related to the AIC. The only difference is that for the BIC the number of covariates that enter the model is multiplied by log(n) instead of 2. The BIC is defined as

$$\mathrm{BIC} = -2\log(L) + \log(n)\,J. \qquad (4.3)$$

The BIC, like the AICc, thus puts a stronger penalty on the number of covariates than the AIC, since $\log(n)$ exceeds 2 for all but very small samples.
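
A corresponding sketch of (4.3) is given below; the closing comment records when the BIC penalty per coefficient exceeds the AIC penalty of 2 (NumPy assumed, names illustrative).

```python
import numpy as np

def bic(loglik, n, J):
    """Bayesian Information Criterion, eq. (4.3)."""
    return -2 * loglik + np.log(n) * J

# log(n) exceeds 2 once n > e**2 (about 8 observations), so for any realistic
# sample size the BIC penalises each additional coefficient more than the AIC.
```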

AIC and variance inflation factor (VIF)

When the aim of using a model is prediction only, multicollinearity is often of minor concern (Burnham & Anderson, 2002). However, as has been noted in Section 3.3 in Chapter 3, multicollinearity may affect estimates of precision for some variance estimators. Therefore, variance inflation factors (VIFs) were computed for the sampled data in order to identify highly correlated covariates. The VIF is defined as (Fahrmeir et al., 2013)

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \qquad (4.4)$$

where $R_j^2$ is obtained from a regression of $x_j$ onto all other covariates.

In this study, the VIF was first calculated for each covariate using the full model. If any $\mathrm{VIF}_j$ was greater than or equal to 10, the variable with the highest VIF was removed. Next, the model was refitted and the VIFs were computed again for all covariates that remained in the model. The procedure was repeated until no covariate with a VIF of 10 or more remained in the model.

Since the above procedure does not remove variables that are not related to the target variable, the final model was selected using a stepwise procedure based on the AIC (as described above).
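
The screening loop described above might look as follows. This is a sketch under simplifying assumptions: function names are illustrative, NumPy is assumed, and numerical safeguards for exactly collinear columns are omitted. The AIC-based stepwise search would then be run on the covariates that survive the pruning (e.g., with a stepwise routine such as the forward-selection sketch shown earlier).

```python
import numpy as np

def vifs(X):
    """VIF of each column of X, eq. (4.4): regress x_j on all other covariates."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def prune_by_vif(X, names, threshold=10.0):
    """Iteratively drop the covariate with the largest VIF until all VIFs fall below the threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        v = vifs(X)
        if v.max() < threshold:
            break
        drop = int(np.argmax(v))             # covariate with the highest VIF
        X = np.delete(X, drop, axis=1)
        del names[drop]
    return X, names
```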

Best-subset selection and variance inflation factor

The VIF procedure was also combined with best-subset selection. First, the VIF was used to remove highly correlated variables (as described above), and afterwards best-subset selection was used to choose the best model from the set of candidate models.

Mallows' $C_p$ statistic was used to identify the final model. The $C_p$ statistic is computed as

$$C_p = n^{-1}\left(\sum_{k \in S} [y_k - \hat{y}_k]^2 + 2J\,s_e^2\right).$$

Like the AIC, Mallows' $C_p$ puts a penalty on the number of variables that enter the model.
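
A sketch of the $C_p$ computation as written above is given below. Note one assumption: the residual variance $s_e^2$ is taken here to come from the full model containing all remaining covariates, which is the usual convention for Mallows' $C_p$; names are illustrative and NumPy is assumed.

```python
import numpy as np

def mallows_cp(y, X, s2_e):
    """Mallows' Cp as written above: n^{-1} * (RSS of the candidate model + 2*J*s2_e)."""
    n, J = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return (rss + 2 * J * s2_e) / n

# s2_e would typically be the residual variance (n-1)^{-1} * RSS of the full
# model (Chapter 3); candidate models from best-subset selection are then
# ranked by their Cp values and the model with the smallest Cp is retained.
```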

Condition number

Silva & Skinner (1997, page 26) propose the following variable selection procedure (which is a modification of the procedure originally developed by Bankier et al. (1992)):

1. Compute the cross-products matrix $\mathbf{CP} = \mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$ considering all the columns initially available (saturated subset).

2. Compute the Hermite canonical form of $\mathbf{CP}$, say $\mathbf{H}$ (see Rao (1973, page 18)), and check for singularity by looking at the diagonal elements of $\mathbf{H}$. Any zero diagonal elements in $\mathbf{H}$ indicate that the corresponding columns of $\mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$ (and $\mathbf{X}^{*}_{kS}$) are linearly dependent on other columns (see Rao (1973, page 27)). Each of these columns is eliminated by deleting the corresponding rows and columns from $\mathbf{X}^{*\prime}_{kS}\mathbf{X}^{*}_{kS}$.

3. After removing any linearly dependent columns, the condition number $c = \lambda_{\max}/\lambda_{\min}$ of the reduced $\mathbf{CP}$ matrix is computed, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest of the eigenvalues of $\mathbf{CP}$, respectively. If $c < L$, a specified value, stop and use all the auxiliary variables remaining.

4. Otherwise perform backward elimination as follows. For every $k$, drop the $k$th row and column from $\mathbf{CP}$, and recompute the eigenvalues and the condition number of the reduced matrix. Compute the condition number reductions $r_k = c - c_k$, where $c_k$ is the condition number after dropping the $k$th row and column from $\mathbf{CP}$. Determine $r_{\max} = \max_k(r_k)$ and $k_{\max} = \{k : r_k = r_{\max}\}$ and eliminate the column $k_{\max}$ by deleting the $k_{\max}$th row and column from $\mathbf{CP}$. Set $c = c_{k_{\max}}$ and iterate while $c \geq L$ and $q > 2$, starting each new iteration with the reduced $\mathbf{CP}$ matrix resulting from the previous one.

Note that $\mathbf{X}^{*}_{kS}$ above is similar to $\mathbf{X}_{kS}$ (defined in 3.30) except that a vector of 1's of length $n$ was added as a first column. For $L$, a value of 30 was chosen. A sketch of steps 3 and 4 of this procedure is given below.
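
The sketch below implements steps 3 and 4 of the procedure; step 2, the removal of exactly collinear columns via the Hermite canonical form, is assumed to have been carried out beforehand. Function names and the toy data are illustrative, NumPy is assumed, and the loop stops when the condition number of the reduced cross-products matrix falls below $L = 30$ or only two columns remain.

```python
import numpy as np

def condition_number(cp):
    """Ratio of the largest to the smallest eigenvalue of a symmetric cross-products matrix."""
    eig = np.linalg.eigvalsh(cp)                 # eigenvalues in ascending order
    return eig[-1] / eig[0]

def condition_number_selection(X_star, L=30.0):
    """Backward elimination on CP = X*'X* (steps 3 and 4 of the procedure above)."""
    cp = X_star.T @ X_star
    keep = list(range(cp.shape[0]))              # indices of the remaining columns
    c = condition_number(cp)
    while c >= L and len(keep) > 2:
        # condition number reduction r_k from dropping each remaining row/column k
        cands = []
        for i in range(len(keep)):
            reduced = np.delete(np.delete(cp, i, axis=0), i, axis=1)
            cands.append((c - condition_number(reduced), i))
        _, i_max = max(cands)                    # column giving the largest reduction
        cp = np.delete(np.delete(cp, i_max, axis=0), i_max, axis=1)
        del keep[i_max]
        c = condition_number(cp)
    return keep, c

# purely illustrative example: a design matrix with a leading column of 1's
# and one nearly collinear covariate that should be eliminated
rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 5))
X_star = np.column_stack([np.ones(100), Z, Z[:, 0] + 0.01 * rng.normal(size=100)])
print(condition_number_selection(X_star, L=30.0))
```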