
K-Nearest Neighbor for Patient Similarity-based Health Prediction

K-nearest neighbor (KNN) is one of the simplest machine learning methods; it is among the top 10 data mining algorithms listed by [90]. KNN is another approach for analyzing patient similarity for health prediction. It is characterized by [35] as follows:

• Supervised learning algorithm: KNN predicts the class label of a test instance from labeled training data. From the input patient vectors with class labels, the KNN method can predict the class label for an unseen example (i.e., a new patient).

• Non-parametric learning algorithm: The KNN algorithm has no dependency on parameters fixed in advance. Thus, no assumptions are made about the shape of the decision boundary. However, this property causes a performance reduction on datasets with a large number of features.

• Instance-based learning algorithm: The prediction for a new instance x is made directly from the training instances; that is why KNN is called a lazy learner. It does not build a training model for generalization. Instead, to estimate a class label for a new or test instance (i.e., patient x), the KNN learner compares it to all the training instances and finds the nearest neighbors, which drive the prediction.

The class label for a new instance x is predicted from the training instances. Given a user-defined positive integer k, the algorithm identifies the k nearest neighbors of x, from which the predicted class of x is assigned. Hence, distances or similarities between x and all the training instances are computed; various distance metrics can be used, for instance those discussed in Section 3.2.2.2. To estimate the predicted value for x, a conditional probability for each class j among the k neighbors is calculated as in Equation 4.9 [35]:

\[
\Pr(Y = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j). \tag{4.9}
\]

This gives the fraction of points with class value j, where Y is the class label and x_0 is the test instance whose class label we want to predict. N_0 represents the k nearest neighbors of x_0, and I(·) is an indicator function that returns 1 when its argument is true and 0 otherwise. For each class label value that occurs among the nearest neighbor instances (N_0), the probability Pr is calculated. The predicted class value of x is the class with the highest probability Pr among N_0 (i.e., the class to which the majority of the k instances belong).

An alternative formulation of Equation 4.9 for predicting the class label of a test instance is given by [90] and is called majority voting:

\[
\Pr(Y = j \mid X = x_0) = \operatorname*{arg\,max}_{y_i} \sum_{i \in N_0} I(y_i = j). \tag{4.10}
\]
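To make the prediction step concrete, below is a minimal sketch of Equations 4.9 and 4.10 in Python with NumPy; the toy arrays and the choice of Euclidean distance are illustrative assumptions, not prescribed by [35] or [90]:

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Predict the class of x0 from its k nearest neighbors (Eqs. 4.9/4.10)."""
    # Distances between x0 and all training instances (Euclidean here)
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k nearest neighbors: the set N0
    n0 = np.argsort(dists)[:k]
    neighbor_labels = y_train[n0]
    # Conditional probability of each class j among N0 (Equation 4.9):
    # the fraction of the k neighbors carrying label j
    probs = {j: np.mean(neighbor_labels == j) for j in np.unique(y_train)}
    # Majority vote (Equation 4.10): the class with the highest probability
    return max(probs, key=probs.get), probs

# Toy example: four labeled "patients" and one unseen instance
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])
label, probs = knn_predict(X, y, np.array([1.2, 1.9]), k=3)
print(label, probs)  # class 0 wins with probability 2/3
```

Here the conditional probabilities of Equation 4.9 are computed explicitly, and the majority vote of Equation 4.10 falls out as the class with the highest probability.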

Important Parameters

Some critical choices and issues affect the performance of KNN [90]:

• The value of k: “The choice of K has a drastic effect on the KNN classifier obtained” [35]. A small k gives a classifier with low bias but very high variance; conversely, a large k gives a classifier with high bias and low variance. There is no rule of thumb for the value of k. To avoid tied votes between the binary labels among N_0, it is helpful to choose an odd k. However, the exact value of k depends on the data.

• The approach to combining the class labels: After estimating the probability of each class label occurring in the set of k nearest neighbor instances via Equation 4.9, a decision has to be made to select the predicted class label. The general approach is to take the majority vote (i.e., select the predominant class label). A weakness of this approach arises when the nearest neighbors vary widely in their distances and the closest neighbors are the more reliable indicators of the predicted class value: the most commonly occurring class label will dominate the prediction. Therefore, to ensure that near neighbors have more influence than distant ones, a weight is assigned to each neighbor's contribution or vote: each neighbor's vote is weighted by its distance. The weight factor is w_i = 1/d(x_0, x_i)^2, i.e., the weight of x_i's vote is the reciprocal of the squared distance between the test instance x_0 and x_i. This approach is given by Equation 4.11:

\[
\Pr(Y = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_0} w_i \times I(y_i = j). \tag{4.11}
\]

This approach is less sensitive to the choice of k; a code sketch of the weighted vote is given after this list.

• The choice of distance metric:

Hu et al. [33] examine the effect of the distance method used on the classification performance of KNN. They test four distance functions (Euclidean, Cosine, Chi-square, and Minkowski) on medical datasets. The classification accuracy is tested on three different feature data types: numerical, categorical, and mixed. They find that the Chi-square distance function outperforms the other distance functions over all the data types; in particular, with the mixed data type the other distance functions perform worst. However, there is no silver bullet for choosing the best distance metric: the selection of the best-fitting distance metric should be based on the data.
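Returning to the approaches for combining class labels described above: assuming scikit-learn, the weighted vote can be sketched as follows. Note that the library's built-in weights="distance" option uses the reciprocal distance 1/d, so the squared weighting of Equation 4.11 is passed as a custom callable; the data are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_squared_distance(distances):
    # w_i = 1 / d(x0, x_i)^2, with a small epsilon against division by zero
    return 1.0 / (distances ** 2 + 1e-12)

# weights accepts 'uniform' (plain majority vote), 'distance' (1/d),
# or a callable that maps an array of distances to an array of weights
knn = KNeighborsClassifier(n_neighbors=5, weights=inverse_squared_distance)

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 1, 1])
knn.fit(X, y)
# A plain majority vote among all 5 neighbors would return class 0;
# the distance weighting lets the two close class-1 neighbors dominate
print(knn.predict([[8.0]]))  # -> [1]
```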

Different values of these factors can be tested to find the best choice; finding the lowest test error rate sheds light on the right decision. However, using the test data for this tuning purpose causes overfitting. Hence, cross-validation is one approach to use: a subset of the training dataset is held out for testing.
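One such cross-validated search, assuming scikit-learn's GridSearchCV, is sketched below; the synthetic data stands in for the patient feature matrix, and the candidate values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_neighbors": [1, 6, 11, 16, 21],      # candidate k values
    "metric": ["euclidean", "manhattan"],    # candidate distance metrics
}
# 10-fold cross-validation: each fold of the training data serves once as
# a held-out subset, so the real test set is never touched during tuning
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```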

To predict the class of a test instance x, the KNN classifier identifies the nearest observations to x by measuring the distance. Therefore, as with any distance metric, the scale of the variables is essential in KNN: large-scale variables have a greater effect on the distance measure and hence on the KNN classifier.
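Because of this scale sensitivity, features are commonly standardized before the distance computation. A minimal sketch, assuming scikit-learn's StandardScaler and Pipeline; the two-feature toy data is hypothetical:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Two features on very different scales, e.g. platelet count vs. temperature
X = np.array([[250000, 36.5], [180000, 39.0], [300000, 36.8], [150000, 40.1]])
y = np.array([0, 1, 0, 1])

# Without scaling, the first feature would dominate the Euclidean distance;
# StandardScaler first gives every feature zero mean and unit variance
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[200000, 39.5]]))
```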

In the previous approach (Section 3.2.2.2), to predict an output class of patient x, we need to compute the distance between x and all the other patients in the dataset; the outcome data of similar patients are then used with a predictive model to predict such an output. In KNN, by contrast, after calculating all the similarities or distances between x and all the other patients in the training dataset, a majority vote of the k nearest neighbors decides the predicted value.

Tuning KNN Parameters

We use a dataset of 32,635 patients. For the KNN model, we test different distance metrics and values of k. First, we test the distance metric parameter with 10-fold cross-validation. Different distance metrics are used, all with the same k; in this example, k = 5. Figure 4.17 shows the different performance the KNN model achieves with different distance metrics.

Figure 4.17: KNN with Different Distance Metrics
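This metric comparison can be sketched as follows, assuming scikit-learn; the metric list is limited to built-in options (a Chi-square distance, as tested by Hu et al. [33], would require a custom metric callable), and the synthetic data stands in for the patient dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Same k for every run, as in the experiment above (k = 5)
for metric in ["euclidean", "manhattan", "chebyshev", "cosine"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(knn, X, y, cv=10)  # 10-fold cross-validation
    print(f"{metric:>10}: {scores.mean():.3f} +/- {scores.std():.3f}")
```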

• Main criterion Accuracy: We test the k parameter for a specific distance metric (Euclidean distance) to optimize accuracy. To find the best k, we test with 10-fold cross-validation the values from 1 to 50 in linear steps: 1, 6, 11, 16, 21, 26, 30, 35, 40, 45, 50. Figure 4.18 shows the performance differences of the different k values. As a result, the optimal k is k = 21, which gives 87.02% accuracy.

Then, we test the parameter for the approach to combine the class labels, with the distance metric set to the Euclidean distance and the k value set to k = 21. The only thing we change is the way the predicted class label is decided. We compare the two approaches: the majority vote and the weighted vote. The weighted vote approach is selected by the grid parameter optimization. However, the accuracy of the two approaches was almost the same, 88.84%.

• Main criterion AUC: We search for the optimal k value to optimize AUC. We use the same range and linear steps for k as in the previous test.


Figure 4.18: KNN with Different k Values with Euclidean Distance to Optimize Accuracy

Figure 4.19 shows the performance differences for the different k values. As a result, the optimal k is k = 50, which gives an AUC of 0.777 ± 0.009. The other metrics are: accuracy 88.86%, precision 79.05%, a very low recall of 0.90%, and an F-measure of 1.78%. Imbalanced data affect the ML model's optimal parameter values. For instance, for KNN on a dataset with an imbalanced class distribution, increasing k lowers the prediction of the minority class (i.e., the larger k is, the less likely minority class instances occur among the neighbors, and the more majority class instances appear as neighbors).


Figure 4.19: KNN with Different k Values with Euclidean Distance to Optimize AUC

The last parameter whose tuning effect we test is the approach to combine the class labels. We select the same distance metric (the Euclidean distance) and the same k value, k = 50, with 10-fold cross-validation. The weighted vote approach is selected by the grid parameter optimization. The accuracy of both approaches was almost the same. However, looking at the performance results in more detail, we find differences in AUC and in the Precision and Recall of predicting the positive class (i.e., predicting mortality). The majority vote gives an AUC of 0.777, 70.07% precision, and 0.85% recall. The weighted vote improves the performance, giving an AUC of 0.778, 1.15% recall, and 77.78% precision. The reason lies in a problem present in our dataset: imbalanced class distribution.

In the following chapter, we will discuss this problem in detail. We previously mentioned that the most commonly occurring class label dominates the predicted value. In our dataset, the survival class is the common one. Thus, using the majority vote with a larger k value is not a good choice in our case.

Strengths and Weaknesses

The KNN algorithm is easy to implement. However, a drawback of KNN is that its performance degrades as the number of features increases. “This decrease in performance as the dimension increases is a common problem for KNN” [35].

In a high-dimensional space (i.e., with a large number of features), there are very few neighbors close to any test instance, and there is little difference between the nearest and the farthest neighbors. Therefore, KNN will be slow to find the nearest neighbors. This deterioration in performance is characteristic of non-parametric approaches, which perform poorly with a large number of features. This problem is called the curse of dimensionality.

Another drawback is the need for a large amount of memory, since all the training data must be stored: the prediction decision is based on all the training data instances. Furthermore, the test phase is costly. As a lazy instance-based learner, KNN builds no generalization model; a classification is estimated anew for each test instance. Thus, the classification phase for predicting the class of an instance is computationally expensive, requiring pairwise distance computations against all the training instances, which is costly with a large training dataset. However, some methods exist to avoid pairwise distance computations against all the training instances; their goal is to reduce the computational cost without affecting the classification accuracy.
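For example, space-partitioning index structures such as k-d trees and ball trees prune most of the pairwise distance computations. A sketch assuming scikit-learn, where the index is selected through the algorithm parameter; the random data is illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 8))        # illustrative training data
y = rng.integers(0, 2, size=10000)

# 'ball_tree' (or 'kd_tree') builds an index once at fit time, so a query
# no longer needs a pairwise distance to every training instance;
# 'brute' is the exhaustive baseline described above
knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
knn.fit(X, y)
print(knn.predict(X[:3]))
```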

4.7 Choosing the Optimal ML Model

In the previous sections, we tested the four models while tuning different parameters. In this section, we compare the accuracy of the four models. The highest accuracy under the optimal parameters (found when the main criterion was accuracy) of each model is represented in Figure 4.20.

Figure 4.20 shows that the models have high accuracy. Moreover, they have almost the same predictive accuracy. This suspicious result makes us look more closely at the test results.


Figure 4.20: Compare Models Accuracy

Performance Metric    LR       DT       GBDT     KNN
Accuracy              89.23%   88.35%   88.25%   88.86%
AUC                   0.801    0.756    0.865    0.778
Precision             58.63%   45.07%   47.78%   77.78%
Recall                13.41%   16.36%   49.06%   1.15%

Table 4.1: Compare Models Performance

We retest the four models with the optimal parameters found in the previous sections (when AUC was the main criterion). Besides the accuracy metric, we calculate several other performance metrics. The results are given in Table 4.1 and Figure 4.21.

We find that the models were more successful in predicting survival cases than death cases. The high predictive accuracy was a sign of overall prediction of the majority class, which is the survival case. This situation, where a high accuracy metric is not an indicator of excellent classifier performance, is called the Accuracy Paradox [82]: paradoxically, accuracy is not a good metric for the predictive model here.

The MIMIC-III dataset used comprises 32,635 patients: 28,974 who survived and only 3,661 who died. The ratio of instances of Class-1 (survived patients) to Class-2 (died patients) is 89:11. This problem of imbalanced class distribution causes the classifier to be extremely biased toward the majority class. As a result, the models' high accuracy was obtained by predicting nearly all instances as the majority class: a trivial classifier that predicts survival for every patient would already achieve 28,974/32,635 ≈ 88.8% accuracy. Thus, it reflects the models' accuracy in predicting most of the dominant class instances while discounting the accuracy in predicting the minority class ones. Nevertheless, the minority class (i.e., the died patients) is the positive class, the class of interest (i.e., we focus on predicting this class).

We can conclude from this case that having a highly accurate model is not sufficient indication of a useful model. Valverde-Albacete et al. [82] state that a predictive classifier model with low accuracy may have higher predictive power than a model with high accuracy. In particular, they stress that this applies to highly imbalanced or skewed training data, where the classifier produces a highly accurate result by assigning all cases to the majority class.

Figure 4.21: Compare Models Prediction Performance

For instance, even though DT and KNN have high accuracy, they have a very low Recall (which measures how often a positive class instance is truly predicted as positive). Therefore, we should use other metrics besides accuracy to evaluate our models. Hoens et al. and Chawla [30, 10] state that predictive accuracy is inappropriate when data is imbalanced and recommend alternative metrics to evaluate classifier performance on imbalanced datasets. Hoens et al. [30] recommend balanced accuracy, ROC curves, Precision and Recall, and F-measure. Chawla [10] recommends ROC curves, Precision and Recall, and cost-sensitive measures (cost curves and cost matrices). He et al. [29] state that accuracy is sensitive to the class distribution, while Precision and Recall are not. Therefore, from now on, we will use Recall, Precision, AUC, and F-measure as classification performance metrics. Likewise, considering AUC when selecting the features was a better choice than accuracy. The models' AUC values are shown in Figure 4.22.
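Assuming scikit-learn, these alternative metrics can be computed from a model's predictions as in the sketch below; the short label arrays are purely illustrative:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, balanced_accuracy_score)

# Illustrative ground truth and predictions; 1 = died (the minority,
# positive class of interest), 0 = survived
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.8, 0.4]  # P(died)

print("precision:        ", precision_score(y_true, y_pred))
print("recall:           ", recall_score(y_true, y_pred))
print("F-measure:        ", f1_score(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("AUC:              ", roc_auc_score(y_true, y_score))
```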

Looking at Figure 4.21 and Table 4.1, we find that GBDT has the highest and best performance trade-off across all the metrics (accuracy, AUC, Precision, and Recall). Specifically, for Recall, which we consider a critical metric, GBDT has the highest value. Moreover, compared to DT and KNN, LR has higher AUC and Recall values. DT gives an AUC closest to the random-guessing value and the lowest Recall value. KNN also has low Recall and AUC values in comparison with GBDT and LR.

Figure 4.22: Compare Models AUC

Keep in mind that GBDT and LR give this high performance without any performance optimization regarding the data pre-processing (to solve the imbalanced data problem) and without further feature selection. Therefore, for the next performance-optimization tests, we will consider GBDT and LR, which we expect to give high performance. In contrast, we hypothesize that DT and KNN would not provide significant improvements.

5 Performance Optimization

This chapter presents different approaches for performance optimization. It discusses the effect of normalized and un-normalized data on prediction performance. It shows the practical implementation of different feature selection methods to find the optimal subset of the features; filter, wrapper, and embedded methods are applied. Finally, it proposes an approach for performance optimization by filtering the patients by diagnosis code.

Contents

5.1 Scope of the Chapter
5.2 Data Pre-processing Normalized vs. Un-normalized Data
5.3 Result of Feature Selection Methods
    5.3.1 Filter Selection by Chi Squared
    5.3.2 Forward Selection
    5.3.3 Backward Elimination
    5.3.4 Embedded Feature Selection Method of GBDT
    5.3.5 Summary
5.4 Data Sampling with Patient Filtering by Diagnoses Code
    5.4.1 Filtering the Group of the Highest Occurrence Code
    5.4.2 Filtering the Group of the Highest Mortality Occurrences
    5.4.3 Feature Selection after Filtering by the Diagnoses Code
    5.4.4 Summary

5.1 Scope of the Chapter

In this chapter, we will present the performance optimization step by tuning the accuracy factors. Different approaches are described: further data pre-processing, feature selection, and filtering patients by their diagnosis codes.

For every optimization, we will apply the predictive ML models to test the effect of that optimization on performance.

5.2 Data Pre-processing: Normalized vs. Un-normalized Data