
2.3 Other Methods Suited for Time-to-event Data

In this section, two other methods suited for time-to-event data are briefly introduced. Both are frequently used competitors of boosting approaches when embedded feature selection is needed, and both were used for comparison in this work.

2.3.1 Regularized Regression Methods

The Lasso (Tibshirani, 1996) was proposed as a shrinkage regression model (Hastie et al., 2009, chap. 3) implementing an embedded feature selection. The regression coefficients are penalized with an $L_1$ penalty term

\hat{\beta} = \operatorname*{arg\,max}_{\beta} \left( l(\beta) - \alpha \|\beta\|_1 \right) \qquad (2.58)

with a likelihood function $l(\beta)$ suited for the outcome. While Ridge regression uses an $L_2$ (quadratic) penalty term, the absolute penalty of the Lasso forces most of the entries of $\hat{\beta}$ to be exactly 0 and thereby performs an embedded feature selection. Ridge regression only shrinks the parameters, leaving most of them non-zero; if a feature selection is needed, a cutoff for the parameters has to be defined. On the other hand, if many variables with small effects can be assumed, Ridge regression might be a better choice. As a trade-off, it yields large models that are harder to interpret.
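
As a small illustration of this difference, the following Python sketch fits an $L_1$- and an $L_2$-penalised linear model to simulated data with only five informative variables. It uses scikit-learn, which is not part of this work's R-based setup, and its regularisation parameter alpha plays the role of $\alpha$ in (2.58) only up to the package's scaling and sign conventions.

# Minimal sketch (not the method used in this work): contrast the sparsity of
# an L1-penalised fit with an L2-penalised fit on simulated Gaussian data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                      # only 5 informative variables
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)       # alpha acts as the penalty weight
ridge = Ridge(alpha=0.5).fit(X, y)

print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))   # only a few: embedded selection
print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))   # all of them, merely shrunken

The Lasso fit typically retains only a handful of non-zero coefficients, whereas the Ridge fit shrinks all coefficients without setting any of them exactly to zero.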

Originally, Tibshirani (1996) proposed quadratic programming to solve (2.58) for linear regression models. Tibshirani (1997) extended the idea of the Lasso to Cox proportional hazards models, again based on quadratic programming.

Since the solution for Cox models is much more computationally intensive, Goeman (2010) proposed a solution of the Lasso estimation problem based on gradient ascent optimization. The associated R implementation (Goeman, 2011) was used for comparison in this work.
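
To convey the idea of a gradient-based solution, the sketch below maximises the $L_1$-penalised, tie-free Cox partial log-likelihood by proximal gradient ascent with a soft-thresholding step. It is a didactic numpy sketch under simplifying assumptions (no ties, fixed step size), not Goeman's algorithm or the R implementation used here; the names lasso_cox and soft_threshold are made up for illustration.

import numpy as np

def soft_threshold(z, thr):
    # proximal map of the L1 penalty: shrink towards 0 and set small entries exactly to 0
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def lasso_cox(X, time, event, alpha, lr=0.005, n_iter=5000):
    # maximise l(beta) - alpha * ||beta||_1 for the Cox partial log-likelihood
    # (Breslow type, ties ignored) by proximal gradient ascent
    order = np.argsort(time)                       # sort by observed time
    X, event = X[order], event[order].astype(float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(X @ beta)                       # risk weights exp(x_j' beta)
        risk_w = np.cumsum(w[::-1])[::-1]          # sum of w over the risk set of each i
        risk_xw = np.cumsum((X * w[:, None])[::-1], axis=0)[::-1]
        grad = (event[:, None] * (X - risk_xw / risk_w[:, None])).sum(axis=0)
        beta = soft_threshold(beta + lr * grad, lr * alpha)
    return beta

# illustrative call on simulated data: only the first variable drives the hazard
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 10))
time = rng.exponential(scale=np.exp(-X[:, 0]))
event = rng.random(80) < 0.7                       # roughly 30% censoring
print(np.round(lasso_cox(X, time, event, alpha=5.0), 2))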

2.3.2 Random survival forests

The second method is based on decision trees. Here, the sample space is divided into smaller subspaces based on single variables. The dividing process is hierarchical and thus can be illustrated as a tree.

FIGURE 2.11. Example of a decision tree (adapted from Hastie et al., 2009). The regression tree $T$ divides the space of input samples into several subspaces $S_1, \ldots, S_5$ based on the variables $X_1, \ldots, X_4$ and assigned split points $t_1, \ldots, t_4$ in a hierarchical manner. A new sample with feature vector $x = (x_1, \ldots, x_4)^T$ can now be assigned to one of the subspaces. The prediction of the tree for this sample is then simply the average of the training samples in the given subspace, in this example $S_4$. For a classification tree the prediction $\hat{C}(x)$ would simply be the majority vote of the training samples in $S_4$. Note that not all entries of $x$ influence the decision; in this example the value of $x_2$ is irrelevant for the prediction.

A formal definition can be found in Alpaydin (2010):

Definition 2. Decision trees

A decision tree is a hierarchical model for supervised learning where the local region is identified in a sequence of recursive splits in a smaller number of steps.

The tree is composed of internal decision nodes and terminal leaf nodes.

Each internal decision node implements a test function based on one variable (univariate tree) or several variables (multivariate tree). Such a tree can then be used for prediction: the test sample is assigned to a terminal leaf node (and thereby to a subset of the training samples) based on the test functions of the internal decision nodes. The prediction is simply a majority vote over those training samples (classification tree) or their average (regression tree). Figure 2.11 illustrates the basic principle of a decision tree.
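
As a brief, purely illustrative example (assuming scikit-learn's CART-style tree learner and its bundled breast cancer data, neither of which is part of this work), the snippet below fits a shallow univariate classification tree, looks up the terminal leaf of one test sample and reports the majority-vote prediction and the test accuracy.

# A univariate classification tree: each leaf predicts the majority class
# of the training samples assigned to it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
leaf = tree.apply(X_test[:1])            # index of the terminal leaf for one test sample
print("leaf id:", leaf[0], "prediction:", tree.predict(X_test[:1])[0])
print("test accuracy:", tree.score(X_test, y_test))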


The structure of the tree is not fixed a priori but has to be learned together with the decision functions of the internal nodes. Several approaches have been proposed to learn such a tree based on training data, e.g. the CART (Classification and Regression Trees) algorithm (Breiman et al., 1984) and C4.5 (J. R. Quinlan, 1993).

After learning the structure and the internal test functions, a tree is a simple and easy-to-interpret prediction model learned from the data. This simplicity comes at a price: decision trees usually have a high variance due to an inherent instability. Slight changes in the data can cause a completely different series of splits, and an error at the top of the hierarchy is propagated through the whole structure.
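
This instability is easy to demonstrate. In the sketch below (scikit-learn and simulated data, purely illustrative and not taken from this work), two fully grown regression trees fitted to resampled versions of the same data give noticeably different predictions at the same evaluation points.

# Fit two trees on resampled versions of the same data and compare predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)

grid = rng.standard_normal((10, 5))          # fixed evaluation points
preds = []
for _ in range(2):
    idx = rng.integers(0, 200, 200)          # resample the training data
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(grid))

print(np.round(np.abs(preds[0] - preds[1]), 2))   # often sizeable differences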

A solution for this problem was found by Breiman (2001). Instead of learning one single tree, $B$ trees are trained and their predictions are averaged. Similar to Bagging (bootstrap aggregation, Breiman, 1996), the single trees are trained on bootstrap samples of the original training data, introducing randomization into the data; hence the name of the method: Random forests. Additionally, during the tree growing process, before choosing a split point, $m \leq p$ predictor variables are chosen as candidates for the split. The resulting trees are largely uncorrelated, and averaging them (for large $B$) reduces the variance of the overall prediction model.

The parameters $B$ and $m$ have to be determined a priori. After fitting the trees, the prediction for a sample with feature vector $x$ is given by

\hat{g} = \hat{f}_{rf}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \qquad (2.59)

for a regression problem and by

\hat{g} = \hat{C}_{rf}^{B}(x) = \text{majority vote} \left\{ \hat{C}_b(x) \right\}_{1}^{B} \qquad (2.60)

for a classification problem. Thereby, $\hat{f}_b(x)$ denotes the prediction of the $b$-th regression tree and $\hat{C}_b(x)$ the class prediction of the $b$-th decision tree.
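
To make the roles of $B$ and $m$ concrete, here is a small scikit-learn sketch (again outside this work's R-based setup) in which n_estimators corresponds to $B$ and max_features to $m$, and the hard majority vote of (2.60) is formed by hand. Note that scikit-learn itself averages the trees' class probabilities, which usually, but not always, coincides with the hard vote.

# n_estimators plays the role of B, max_features the role of m
# (number of candidate variables considered at each split).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500,      # B trees on bootstrap samples
                                max_features="sqrt",   # m = sqrt(p) candidates per split
                                random_state=0).fit(X, y)

# hard majority vote over the single trees, as in (2.60)
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print(np.round(votes.mean(axis=0)).astype(int))        # fraction of trees voting class 1, rounded
print(forest.predict(X[:5]))                           # ensemble prediction (probability averaging)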

Random forests perform remarkably well in most situations with little tuning effort (see Hastie et al., 2009, chap. 15 for a comprehensive overview and comparisons to boosting). Additionally, they perform an embedded feature selection and can deal with variables on different scales, making them a good choice for high-dimensional heterogeneous data sets.
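
As a sketch of what this embedded feature selection can look like in practice, a fitted forest can be queried for a variable importance ranking that is then thresholded. The snippet below uses scikit-learn's impurity-based importance on its bundled data set purely for illustration; it is not the selection procedure applied in this work.

# Rank the variables of a fitted forest by importance and keep the top ones.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

ranking = np.argsort(forest.feature_importances_)[::-1]   # most important variables first
print("top 5 variable indices:", ranking[:5])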

Ishwaran et al. (2008) proposed an extension of Random forests suited for right-censored survival data called Random survival forests (RSF). Following the principles of Random forests, a collection of binary decision trees is built from bootstrap samples. For the internal nodes of each tree, a random set of $m$ variables is chosen. The variable and corresponding split point that maximise the survival difference between the two resulting daughter nodes are used for the split. A terminal node is created when no further split can be performed, e.g. when a specified minimum number of unique events is reached. The authors also provide an R implementation of their method (Ishwaran and Kogalur, 2007), which was used in this work.
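
To make the split criterion concrete, the numpy sketch below scores one candidate split by the standardised two-sample log-rank statistic, a common measure of the survival difference between two daughter nodes; an RSF implementation would evaluate such a score for the $m$ candidate variables and their possible split points and pick the maximiser. This is a didactic sketch with made-up names (logrank_split_score) and simulated data, not the implementation of Ishwaran and Kogalur (2007).

import numpy as np

def logrank_split_score(x, time, event, split_point):
    # absolute standardised two-sample log-rank statistic for the split x <= split_point;
    # larger values indicate a larger survival difference between the daughter nodes
    event = event.astype(bool)
    left = x <= split_point
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):               # distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & left).sum()
        d = ((time == t) & event).sum()            # events at time t
        d1 = ((time == t) & event & left).sum()    # events at time t in the left node
        if n > 1:
            obs_minus_exp += d1 - d * n1 / n
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return abs(obs_minus_exp) / np.sqrt(var) if var > 0 else 0.0

# score a few candidate split points of one variable on simulated data
rng = np.random.default_rng(2)
x = rng.standard_normal(200)
time = rng.exponential(scale=np.exp(-x))           # survival depends on x
event = rng.random(200) < 0.7
for c in np.quantile(x, [0.25, 0.5, 0.75]):
    print(round(float(c), 2), round(float(logrank_split_score(x, time, event, c)), 2))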