4.1.6 Comparison of the Approaches

This part of the thesis compares the results of the different random-forest-based approaches on the TCGA data for the various patterns of block-wise missingness. Only the best setting of each approach enters the comparison: the block-wise and fold-wise approaches performed best with the ’F-1 Score’ as weight metric, and the single-block approach performed best when using the single feature-block ’CNV’. In ’Pattern 1’ and ’Pattern 3’, ’A’ stands for the ’CNV’ block; in ’Pattern 2’, ’D’ stands for the ’CNV’ block; in ’Pattern 4’, ’B’ stands for the ’CNV’ and ’Mutation’ block.

Pattern 1

Figure 23 shows the boxplots for the average results of the 5-fold cross-validation of the different approaches on the 14 TCGA data sets with block-wise missingness in the training-set according to ’Pattern 1’. In the test-situations that do not contain the feature-block ’A’, the predictive performance is in general much worse than in the test-situations that contain the feature-block ’A’ (in this pattern ’A’ stands for the ’CNV’ block). The single-block and complete-case approaches cannot create predictions for all test-situations, while the fold-wise, block-wise, and imputation approaches can. Regarding the predictive performance, the fold-wise approach has the best median F-1 score in 11/20 test-situations, the block-wise approach in 3/20 test-situations, and the imputation approach in 4/20 test-situations. The single-block approach never has the best median F-1 score, while the complete-case approach has it in only one test-situation. In this setting, the fold-wise approach clearly has the best predictive performance. The second-best approach is the imputation approach, which is only marginally better than the block-wise approach. The single-block and complete-case approaches perform rather poorly compared to the other three approaches and have the additional drawback that they cannot generate predictions for all possible test-situations.
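The counts of the form "best median F-1 score in k/20 test-situations" can be reproduced with a short tally. The following is a minimal sketch, not the code used in the thesis: the function name `count_best_median`, the nested-dictionary layout, and the toy scores are hypothetical, and crediting a tie to no approach is an assumption made only for this illustration.

```python
from statistics import median


def count_best_median(scores):
    """Count, per approach, the test-situations in which it has the highest median.

    scores: {approach: {test_situation: [metric value per data set]}}
    An approach that cannot predict in a test-situation simply has no entry for it.
    """
    # Collect every test-situation that appears for at least one approach.
    situations = set()
    for per_situation in scores.values():
        situations.update(per_situation)

    wins = {approach: 0 for approach in scores}
    for situation in sorted(situations):
        # Median over the data sets, only for approaches that produced predictions.
        medians = {
            approach: median(per_situation[situation])
            for approach, per_situation in scores.items()
            if per_situation.get(situation)
        }
        if not medians:
            continue
        best = max(medians.values())
        winners = [a for a, m in medians.items() if m == best]
        # Assumption of this sketch: a tie is credited to no approach.
        if len(winners) == 1:
            wins[winners[0]] += 1
    return wins


# Hypothetical toy scores (three data sets, two test-situations), not thesis results.
toy_scores = {
    "fold-wise": {"A": [0.8, 0.7, 0.9], "B": [0.4, 0.5, 0.3]},
    "block-wise": {"A": [0.6, 0.7, 0.65], "B": [0.5, 0.6, 0.55]},
    "single-block": {"A": [0.7, 0.6, 0.75]},  # no predictions without block 'A'
}
wins = count_best_median(toy_scores)
print(wins)
```

Storing no entry for a test-situation in which an approach cannot predict keeps the inflexible approaches in the tally without penalising them with an artificial score.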

Summary: The fold-wise approach has by far the best predictive performance and can create predictions for every test-situation. The imputation and block-wise approaches can create predictions for all test-situations as well but have a worse predictive performance than the fold-wise approach. The complete-case and single-block approaches are inflexible and can only provide predictions in certain test-situations, and even in these test-situations they are usually worse than at least one of the remaining approaches.

The results with the balanced accuracy and the MCC as metrics are similar; the corresponding figures A-3 and A-4 can be found in the appendix.
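For reference, the three metrics named in this section can be computed from a confusion matrix as sketched below. The sketch assumes a binary response and uses the standard textbook definitions; the function name `binary_metrics` and the example counts are hypothetical, and returning 0 for an undefined ratio is a convention chosen here, not something taken from the thesis.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """F-1 score, balanced accuracy and MCC from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0

    # F-1 score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Balanced accuracy: mean of sensitivity and specificity.
    balanced_accuracy = (recall + specificity) / 2
    # Matthews correlation coefficient (MCC).
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {"f1": f1, "balanced_accuracy": balanced_accuracy, "mcc": mcc}


# Hypothetical counts for illustration only.
example = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print(example)
```

Unlike the F-1 score, the balanced accuracy and the MCC also take the true negatives into account, which is one reason to check whether a ranking of approaches holds under all three metrics.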

Figure 23: Comparison of the different approaches on the TCGA data with induced block-wise missingness according to pattern 1.

Pattern 2

Figure 24 shows the boxplots for the average results of the 5-fold cross-validation of the different approaches on the 14 TCGA data sets with block-wise missingness in the training-set according to ’Pattern 2’. In the test-situations that do not contain the feature-block ’D’, the predictive performance is in general much worse than in the test-situations that contain the feature-block ’D’ (in this pattern ’D’ stands for the ’CNV’ block). The complete-case approach cannot create predictions for all test-situations, while the other four approaches can. Regarding the predictive performance, the imputation and block-wise approaches each have the best median F-1 score in 8/20 test-situations. The fold-wise, complete-case, and single-block approaches never have the best median F-1 score in any test-situation. In this setting, the imputation and block-wise approaches clearly have the best predictive performance. The fold-wise approach performs poorly in this setting and is comparable to the single-block and complete-case approaches.

Figure 24: Comparison of the different approaches on the TCGA data with induced block-wise missingness according to pattern 2.

Summary: The imputation and block-wise approaches have the best predictive performance and can create predictions for every possible test-situation. The fold-wise and complete-case approaches can create predictions for all test-situations as well but have a worse predictive performance. The single-block approach is inflexible and can only provide predictions in certain test-situations. Regarding the predictive performance, the single-block approach never leads to the best results. The results with the balanced accuracy and the MCC as metrics are similar; the corresponding figures A-5 and A-6 can be found in the appendix.

Pattern 3

Figure 25 shows the boxplots for the average results of the 5-fold cross-validation of the different approaches on the 14 TCGA data sets with block-wise missingness in the training-set according to ’Pattern 3’. In the test-situations that do not contain the feature-block ’A’, the predictive performance is in general much worse than in the test-situations that contain the feature-block ’A’ (in this pattern ’A’ stands for the ’CNV’ block). Only the single-block and complete-case approaches cannot create predictions for all test-situations. Regarding the predictive performance, the block-wise approach has the best median F-1 score in 10/20 test-situations. The imputation approach has the best median F-1 score in four test-situations, while the fold-wise approach has it in only two. The single-block approach can provide predictions only for the test-situations with the feature-block ’A’, and there is no test-situation in which it has the best median F-1 score. The complete-case approach never has the best median F-1 score either.

Summary: The block-wise approach has the best predictive performance and can create predictions for every possible test-situation. The imputation and fold-wise approaches can create predictions for all test-situations as well but have a worse predictive performance than the block-wise approach. The single-block and complete-case approaches are inflexible and can only provide predictions in certain test-situations; regarding the predictive performance, they never lead to the best results. The results with the balanced accuracy and the MCC as metrics are similar; the corresponding figures A-7 and A-8 can be found in the appendix.

Figure 25: Comparison of the different approaches on the TCGA data with induced block-wise missingness according to pattern 3.

Pattern 4

Figure 26 shows the boxplots for the average results of the 5-fold cross-validation of the different approaches on the 14 TCGA data sets with block-wise missingness in the training-set according to ’Pattern 4’. In the test-situations that do not contain the feature-block ’B’, the predictive performance is in general much worse than in the test-situations that contain the feature-block ’B’ (in this pattern ’B’ stands for the ’CNV’ and ’Mutation’ block). Only the single-block and complete-case approaches cannot create predictions for all test-situations. Regarding the predictive performance, the fold-wise approach has the best median F-1 score in four out of the six test-situations. The complete-case approach has the best median F-1 score once. In contrast, the remaining three approaches (block-wise, imputation, and single-block) never have the best median F-1 score in any test-situation. In this setting, the fold-wise approach clearly has the best predictive performance. The single-block approach can provide predictions only for the test-situations that contain the ’CNV’ feature-block.

Figure 26: Comparison of the different approaches on the TCGA data with induced block-wise missingness according to pattern 4.

Summary: The fold-wise approach has the best predictive performance and can create predictions for every possible test-situation. The imputation and block-wise approaches can create predictions for all test-situations as well but never have the best median F-1 score. The single-block and complete-case approaches are inflexible and can only provide predictions in certain test-situations. Regarding the predictive performance, the single-block approach never leads to the best results, while the complete-case approach has the best median F-1 score in one test-situation. The results with the balanced accuracy and the MCC as metrics are similar; the corresponding figures A-9 and A-10 can be found in the appendix.

In the document: A comparison study of prediction approaches for multiple training data sets and test data with block-wise missing values (pages 63-70)