
3.8 Performance Assessment

3.8.2 K-Fold Cross Validation

The methods described in section 3.8.1 serve as a utility to measure how well a model is adjusted to a data set, but they do not reflect how the model will perform when predicting new cases. It might be possible that a model achieves high accuracy on its training data, but performs poorly on unseen cases. This phenomenon is known as overfitting and takes place when the model starts to follow the irreducible error (3.6.3), or noise, within the set of training samples. To ensure that the model has not been overfitted, it is necessary to validate its performance on an independent test cohort, which has not been seen by the model during training. As the access to unseen samples might be limited, it is common to split the cohort into two halves, a training and a test set. This way, a model can be trained and evaluated independently.
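A minimal sketch of such a hold-out split is given below; it assumes scikit-learn and uses a synthetic data set and an arbitrary classifier as placeholders, not the models discussed later in this work.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 samples, 20 features, binary labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Split the cohort into two halves, one for training and one for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit the model on the training half only ...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ... and estimate its generalization performance on the unseen half.
print("hold-out accuracy: %.3f" % model.score(X_test, y_test))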

This technique comes with two major drawbacks. First, depending on which samples are chosen for training, the difference between training and test performance can be large. Second, only fifty percent of the data is used for training. This might affect the model, as statistical models tend to perform better the larger the sample size. For a detailed discussion, the reader is referred to James et al. [62]. A simple, yet effective method to overcome these flaws is called K-fold cross-validation. As depicted in figure 3.10, K-fold cross-validation splits the available data into k chunks (here 5).

Afterwards, k−1 chunks are used for training and the remaining chunk for validation. This process is repeated k times, each time leaving out a different chunk for testing. Therefore, only a small fraction of the samples is withheld from training at any time. To choose the right k, the bias-variance trade-off from 3.6.3 has to be taken into account. One might argue that choosing k = N, where N is the number of samples, is a good idea. However, apart from the computational burden of training N models, the cross-validation estimate will show a high degree of variance. This is because all N training sets are very similar to each other, while the single left-out sample can be very distinct from the training data. On the other hand, if k is chosen too low, a high bias will be observed, as the model does not have enough data to correct for noise. To summarize, k has to be chosen with care. Empirically, the choices k = 5 and k = 10 have proven to yield test error rates that suffer neither from excessively high bias nor from very high variance [62].
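The resampling scheme itself can be sketched as follows, again assuming scikit-learn with a synthetic data set and an arbitrary classifier as stand-ins.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on the k-1 remaining chunks ...
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # ... and validate on the single held-out chunk.
    scores.append(model.score(X[val_idx], y[val_idx]))

# The cross-validation estimate is the average over the k validation scores.
print("5-fold CV accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))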

Figure 3.10: An example of K-fold cross-validation with k = 5. The data 1, 2, 3, . . . , n is split into five chunks. Afterwards, five sets are created (each shown in blue), each leaving out one chunk (shown in beige). This way, four fifths of the samples can be used for training, while the remaining fifth is used for validation. Figure adapted from James et al. [62].

Chapter 4

Results

First, the results on the test data sets will be discussed. Afterwards, the developed EDL model is applied to a cohort of 1,000 breast cancer patients.

This is done as a proof of principle, to ensure that the model is capable of both labeling the samples according to their subtype and re-identifying known subtype driver alterations. The achieved model accuracy is compared to the accuracy achieved by the well established models. Subsequently, a newly assembled cohort of 500 patients suffering from primary or metastatic prostate cancer is inspected using the EDL. Again, the accuracy is examined in the context of the well established models. For both investigations the cohorts have been assembled using the FirebrowseR software and normalized with the methods established through Web-TCGA, as described in 4.2. Finally, the resulting models are examined with respect to their interpretability (3.6.4).

4.1 Test Data

The data sets introduced in 3.2 have been used by the machine learning community for several years, in some cases decades. They act as benchmarks to compare newly developed models to established ones. Since they are used by a broad spectrum of people, not only within the machine learning community, common obstacles are known and improvements in classification accuracy can be determined easily.

To ensure a fair comparison between the different models, all models except for the multinomial regression have been tuned. For the SVM, the slack parameter C and the radial-kernel parameter γ have been determined using grid search with C ∈ {2⁻², 2⁰, . . . , 2⁶} and γ ∈ {2⁻⁶, 2⁻⁴, . . . , 2²}. For the classification tree, the minimum split parameter has been tuned with msplit ∈ {1, 3, . . . , 19}, indicating the minimum number of samples in a node to be considered for further splitting. Additionally, pruning has been utilized to simplify and potentially improve the tree. The mtry parameter for the random forest model, indicating the number of sampled predictors for each split, has been chosen from {log₂(n), log₁₀(n), √n, n/2}. As the multinomial regression does not have a tuning parameter, it has only been ensured that the model converged. The list length parameter l for the EDL has been tuned by hand, with respect to a prior chosen upfront.
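As an illustration of such a tuning step, the following sketch runs a grid search over C and γ with grids mirroring the ranges stated above; scikit-learn, the RBF-kernel SVC and the synthetic data set are assumptions made for illustration only and do not reflect the actual tool chain of this work.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {
    "C": [2.0 ** e for e in range(-2, 7, 2)],      # 2^-2, 2^0, ..., 2^6
    "gamma": [2.0 ** e for e in range(-6, 3, 2)],  # 2^-6, 2^-4, ..., 2^2
}

# Exhaustive grid search with internal cross-validation over the parameter grid.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)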

With respect to the bias-variance trade-off introduced in 3.6.3, 10-fold cross-validation has been utilized. During each run, each model has been tuned on the k−1 training folds and the final model performance has been assessed on the kth test fold. Kappa and accuracy are given as the mean with the corresponding standard deviation (SD), to reveal potential outliers.
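A sketch of this evaluation scheme is given below; the model, the synthetic data set and the use of scikit-learn are placeholders, and the per-fold tuning step is only indicated by a comment.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

accs, kappas = [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # In the evaluation described above, each model is additionally tuned on
    # the k-1 training folds; here a fixed model stands in for the tuned one.
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    kappas.append(cohen_kappa_score(y[test_idx], pred))

# Report both metrics as mean with the corresponding SD over the ten folds.
print("accuracy: %.3f +/- %.3f" % (np.mean(accs), np.std(accs)))
print("kappa:    %.3f +/- %.3f" % (np.mean(kappas), np.std(kappas)))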

As these data sets are only used to compare the EDL's performance to that of other models, all decision lists are deferred to appendix A.1.