
For the deterministic model variant, the predicted class label is determined as the argmax of the Softmax output. Similarly, for the stochastic model variant, the argmax of the mean Softmax output over several stochastic forward passes dictates the class label. Predictions of both model variants are compared by their error rate on the whole LOO validation set; the results are presented in Appendix A, in Table A.2 for models trained on the MNIST dataset and in Table A.10 for models trained on the CIFAR-10 dataset.
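The two decision rules can be stated compactly. The following is a minimal NumPy sketch; the array shapes, the number of passes T = 50, and all function names are illustrative assumptions, not the implementation used in this work:

```python
import numpy as np

def predict_deterministic(softmax_probs):
    """Deterministic variant: class label is the argmax of a single
    Softmax output of shape (n_instances, n_classes)."""
    return np.argmax(softmax_probs, axis=1)

def predict_stochastic(softmax_probs_per_pass):
    """Stochastic variant: class label is the argmax of the mean
    Softmax output over T stochastic forward passes; input has
    shape (T, n_instances, n_classes)."""
    mean_probs = softmax_probs_per_pass.mean(axis=0)  # average over the T passes
    return np.argmax(mean_probs, axis=1)

# Stand-in outputs: T = 50 passes, 4 instances, 10 classes
rng = np.random.default_rng(0)
raw = rng.random((50, 4, 10))
probs = raw / raw.sum(axis=-1, keepdims=True)  # rows sum to one, like Softmax
labels = predict_stochastic(probs)
```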

Furthermore, a comparison based on subsets was performed; the evaluation setup is visualized in Figure 5.2.

Figure 5.2: Evaluation on subsets of equal size. Left: LOO test set with 892 injected instances, compared with Fisher's exact test. Middle: LOO test set with 210 injected instances (30 repetitions), compared with a t-test or MWU-test. Right: LOO test set without injected instances, compared by Error Rate (ER). Blue dots depict instances with class labels that the model has seen during training. Red dots represent artificially injected instances of class label five. Each circle depicts all instances that the models have been tested on. The rejected subset, which results from applying the threshold, is visualized as the smaller section of the circle using a black separation line.

On the very left, the evaluation on the whole MNIST-LOO validation set with 892 injected instances is shown. The experiment was performed only once for each model. In the middle, a repetitive evaluation on the whole LOO validation set with 210 mutually different injected instances is depicted.

The experiment was performed 30 times, and the mean number of injected and rejected instances is the basis for the statistical t-test or Mann-Whitney-U test (MWU-test)². The third image on the right represents the evaluation setup if no additional instances are injected into the LOO validation set. In this case, a comparison of the error rate is performed.

All subsets were formed by applying a threshold either on the Softmax output value of the argmax class or on the determined predictive uncertainty value. Applying those thresholds yields a set of rejected instances that have low Softmax or high uncertainty values. To ensure a fair comparison, subsets of equal size must be determined. Therefore, the threshold is initially set to a value where no instances are rejected at all and is then adjusted stepwise until the number of rejections reaches a predefined amount. The number of rejected instances is always set to the number of injected instances.
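Stepwise tightening of the threshold until a predefined number of rejections is reached is equivalent, up to ties, to directly rejecting the k worst-scoring instances. A minimal sketch of this equivalence, with hypothetical names and NumPy as an assumption:

```python
import numpy as np

def reject_k(scores, k, reject_low=True):
    """Mark the k rejected instances as a boolean mask.
    scores: per-instance Softmax value of the argmax class (reject the
            lowest, reject_low=True) or predictive uncertainty
            (reject the highest, reject_low=False)."""
    order = np.argsort(scores if reject_low else -scores)
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:k]] = True  # same subset a stepwise threshold search yields
    return mask

# e.g. reject as many instances as were injected (892 for MNIST-LOO):
# rejected = reject_k(max_softmax_values, k=892, reject_low=True)
```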

Figure 5.3 shows the confusion matrix that is used as a basis to compare the model performance when instances of class label five are injected into the LOO validation set. The True Positive (TP) value represents the artificially injected instances that could be rejected by either the deterministic or the stochastic model variant. In the best-case scenario, all injected instances are also rejected.

² Also called Wilcoxon-Mann-Whitney test.

                   rejected    not rejected
injected class        TP            FN
all other classes     FP            TN

Figure 5.3: Evaluation and method comparison based on a confusion matrix.

The False Positive (FP) value represents the number of instances with class labels other than five that are also within the rejected subset. Those instances have class labels that the model has seen during training. The False Negative (FN) value indicates injected instances that could not be rejected due to high Softmax values or low uncertainty estimates. The True Negative (TN) value shows the number of remaining instances of the LOO validation set. The number of rejected instances (rejection-subset size) is forced to be equal to the number of injected instances by adjusting the applied threshold. Since TP + FP = TP + FN then holds, FP equals FN, so Precision (TP/(TP+FP)), Recall (TP/(TP+FN)) and therefore the F1-score are equal as well.

To compare both methods, it is sufficient to look at the TP values or at the TP/FP ratio. The leftmost image in Figure 5.2 shows the scenario where the whole MNIST validation set is used and the rejected subset size is set to 892 instances, which is the total number of instances with class label five within the validation set.

In the case of the CIFAR-10 dataset, the number of rejected instances is set to 1000 because the validation set is uniformly distributed.

In a first step, the complete validation set is used and predictions are collected once. Notably, the rejected subsets may not contain the same instances when the deterministic and stochastic model variants are used to determine them.

There will be some overlap between the two sets, but one set will probably contain instances that are missing in the other. Differences in the TP/FP proportions are tested for statistical significance with Fisher's exact test.
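As a sketch of this comparison with SciPy's fisher_exact, where the TP/FP counts below are made-up placeholders rather than results from this evaluation:

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical rejected-subset composition per model variant (TP + FP = 892)
tp_det, fp_det = 610, 282   # deterministic variant
tp_sto, fp_sto = 655, 237   # stochastic variant

table = np.array([[tp_det, fp_det],
                  [tp_sto, fp_sto]])   # 2x2 contingency table
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
# Equality of the TP/FP ratios is rejected if p_value falls below the
# chosen significance level (e.g. 0.05).
```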

A summary of the results of this evaluation can be found in Table A.4 in Appendix A, which shows all models that were trained on the MNIST-LOO dataset and use the predictive entropy as uncertainty measure. The rejected subset size corresponds to a rejection rate of 8.92%. All stochastic models within this setup could reject more injected instances than the deterministic one, but several exhibit only small differences, and the hypothesized equality of the TP/FP ratios could not be rejected.
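For reference, the predictive entropy used as uncertainty measure is commonly computed from the mean Softmax output over the stochastic forward passes; a minimal sketch, where the epsilon guard for numerical stability is an assumption:

```python
import numpy as np

def predictive_entropy(mean_probs, eps=1e-12):
    """H(y|x) = -sum_c p_c * log(p_c), computed per instance from the
    mean Softmax output (shape: n_instances x n_classes) over the
    stochastic forward passes."""
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
```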

Class label    0     1     2     3     4     5     6     7     8     9
# inst.      5923  6742  5958  6131  5842  5421  5918  6265  5851  5949

Table 5.1: Class distribution among the MNIST training set, comprising 60,000 instances in total.

To reduce the influence of the particular instances, the pool of possible injection candidates was extended by the instances of the training set in the second step. This larger set was split into chunks of size 210 for the MNIST dataset and size 200 for the CIFAR-10 dataset. Those 30 mutually different sets were injected, and the rejected subsets were determined for both model variants of each model. Differences in the mean TP values are tested for statistical significance, either with a t-test or, in cases where the assumption of normally distributed data was rejected but variance homogeneity could still be assumed, with a MWU-test. The results of the significance tests for models trained on the MNIST-LOO dataset using the predictive entropy as uncertainty measure can be found in Table A.3 in Appendix A.
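The test-selection logic can be sketched with SciPy as follows; using Shapiro-Wilk as the normality check and Levene's test for variance homogeneity is an assumption on my part, since the section does not name the preliminary tests:

```python
from scipy.stats import shapiro, levene, ttest_ind, mannwhitneyu

def compare_tp_means(tp_det, tp_sto, alpha=0.05):
    """Compare the 30 TP counts per model variant: t-test if normality
    is not rejected for either sample, otherwise Mann-Whitney-U,
    provided variance homogeneity can still be assumed."""
    _, p_det = shapiro(tp_det)
    _, p_sto = shapiro(tp_sto)
    _, p_var = levene(tp_det, tp_sto)
    if p_det > alpha and p_sto > alpha:
        return ttest_ind(tp_det, tp_sto)          # normality plausible
    if p_var > alpha:
        return mannwhitneyu(tp_det, tp_sto, alternative="two-sided")
    raise ValueError("assumptions of both tests violated")
```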

In the third step, the error rates within the LOO instances are determined and compared. This was done by evaluating the instances that were rejected but have class labels other than five. These results are summarized in Figures 5.13, 5.14, and 5.15 in the case of the MNIST dataset.

Additionally, Error-Reject curves are created for different thresholds. They show results for setups with and without injected instances. An evaluation without any injected instances is shown in the lower half of Figure 5.16 in Section 5.3.3 and in Figure 5.21 in Section 5.3.4 for the MNIST dataset.
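An Error-Reject curve records, for each threshold, the error rate on the retained instances against the rejection rate; the following is a minimal sketch under the assumption that the most uncertain instances are rejected first:

```python
import numpy as np

def error_reject_curve(uncertainty, correct, n_points=50):
    """uncertainty: per-instance predictive uncertainty (NumPy array)
    correct: boolean array, True where the prediction was correct.
    Returns (rejection_rate, error_rate) pairs for swept thresholds."""
    order = np.argsort(-uncertainty)      # most uncertain instances first
    n = len(correct)
    curve = []
    for k in np.linspace(0, n - 1, n_points, dtype=int):
        kept = order[k:]                  # reject the k most uncertain
        curve.append((k / n, 1.0 - correct[kept].mean()))
    return curve
```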