
5.4 Evaluation on the CIFAR-10 Dataset

5.4.1 Model Performance

Figure 5.23 visualizes the predictive entropy of all instances of the full CIFAR-10 validation set using the least complex model structure. Figures 5.24 and 5.25 show the prediction uncertainty for the larger CNN with 10% Dropout and 50% Dropout, respectively. In addition, all figures show, on the left, the mean of the maximal Softmax output value (determined via the stochastic forward passes) versus the maximal Softmax value of the deterministic model.

The only model that shows a statistically significant difference for both experimental setups is the large_cnn_cifar10a with the lowest error rate. The stochastic version could reject 12.4% more instances on average. In both other cases, the stochastic model performed worse than the deterministic version. Differences in the error rates between the deterministic and stochastic models evaluated on the complete LOO validation set are again negligible.

A comparison between the model with a very large Dropout rate (Figure 5.25) and those with lower Dropout rates (Figures 5.23 and 5.24) is interesting. The first plot does not show a concentration of the injected instances with class label five at high Softmax values, as both other plots do. The larger Dropout rate has a beneficial impact here.

Nevertheless, the Softmax value of the deterministic model behaves similarly and already encodes this information.

(a) Mean of the max. Softmax output of the stochastic model versus the max. Softmax value of the deterministic model.

(b) Predictive entropy of the max. Softmax output versus the max. Softmax value of the deterministic model.

Figure 5.24: Large CNN - 10% Dropout applied after the last two inner layers.

(a) Mean of the max. Softmax output of the stochastic model versus the max. Softmax value of the deterministic model.

(b) Predictive entropy of the max. Softmax output versus the max. Softmax value of the deterministic model.

Figure 5.25: Large CNN - 50% Dropout applied after each inner layer.

Chapter 6

Discussion and Conclusion

Two different scenarios will be discussed in the following. First, testing on instances with class labels that the model has also seen during training and, second, testing on instances that have been excluded from the training set but are fed to the networks at test time. These instances are considered to be further away from the training distribution and are expected to yield higher uncertainty values during classification.

To cover the first part, training and testing on the LOO sets was performed and evaluated via the error rates based on the deterministic and stochastic predictions.

Results show that there is only a minor effect on the classification capability of the evaluated model structures, and no clear favorable tendency can be observed. Note that in the experiments only the dropout rate was optimized via grid search. To further improve these results, deterministic regularization (e.g. L2 regularization with weight decay λ) may be used and jointly optimized with the dropout rate (see the sketch below). The dropout rate could also be optimized individually for each layer, which would require more computational effort. However, the presented stochastic models do not provide a considerable improvement, in terms of error rate, over the deterministic model versions. Note also that the differences between the standard dropout technique (weight scaling) and the MC approximation, as presented in [24], were within one standard deviation and very close to each other.
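A minimal sketch of such a joint grid search over the dropout rate and the weight decay λ is shown below, assuming a Keras-style setup; the model structure, grid values, and training budget are illustrative placeholders, not the configuration used in the experiments.

```python
import itertools
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Assumed data preparation (MNIST, flattened and scaled to [0, 1]).
(x_train, y_train), (x_val, y_val) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_val = x_val.reshape(-1, 784).astype("float32") / 255.0

def build_model(dropout_rate, weight_decay):
    """Illustrative dense model; units and depth are placeholders."""
    return keras.Sequential([
        layers.Dense(500, activation="relu", input_shape=(784,),
                     kernel_regularizer=regularizers.l2(weight_decay)),
        layers.Dropout(dropout_rate),
        layers.Dense(10, activation="softmax"),
    ])

# Joint grid over the dropout rate and the weight decay lambda.
results = {}
for p, lam in itertools.product([0.1, 0.3, 0.5, 0.7, 0.9],
                                [1e-5, 1e-4, 1e-3]):
    model = build_model(p, lam)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, epochs=20,
                        validation_data=(x_val, y_val), verbose=0)
    results[(p, lam)] = history.history["val_accuracy"][-1]

best_p, best_lam = max(results, key=results.get)
```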

Even if an improvement could be shown, it is questionable whether a very extensive search for the best dropout rate is of practical use, especially if the model structure is very large and the evaluation of different hyperparameter settings exhibits exponential complexity. One may argue that strategies like Bayesian optimization of the hyperparameters can lift this burden to some extent.

[Figure 6.1 consists of bar plots of the Softmax value over the ten class labels (axes: class label vs. Softmax value), with panel titles 20% / 80%, 50% / 50%, and 100% / 0%.]

Figure 6.1: Exemplary Softmax output values for a classification task with ten different class labels.

Nevertheless, notice that such a strategy may have hyperparameters of its own and may not be easy to use.

The second scenario deals with classifying and detecting instances that the model has not seen during training. The presented results indicate that the uncertainty estimates can be used to increase the number of identified injected instances. Longer training improves the performance of the stochastic model versions because, due to the random creation of sub-networks, it takes more time until all weights are well trained and distributed across the whole network. This effect is also mentioned in [24] and [7].

Another aspect is the influence of the general capability of the trained network. A model with a high error rate is more likely to produce output constellations as shown in the first row of Figure 6.1. Such output distributions imply that classification decisions incur a high risk. A uniform distribution effectively means that there is no preferred class label and the model simply does not know which class label it should predict.

Using the predictive entropy as uncertainty estimate yields the highest possible value in case of a uniform distribution. Note, however, that there are two possible reasons for such a distribution. First, the model may be incapable of properly assigning a class label to an instance that it has seen during training because the features are not clearly separable. Second, the predictive distribution is pushed towards the uniform distribution due to the high uncertainty of the presented instance, which is completely novel to the model and may be far from the training distribution. Instances with high uncertainty estimates are therefore a mixture of both cases, and separating the two causes may be impossible.
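As a minimal sketch of how these quantities are obtained, assuming a Keras-style model whose `training=True` flag keeps Dropout active at test time (the model and its inputs are placeholders):

```python
import numpy as np

def mc_dropout_probs(model, x, n_passes=50):
    """Mean Softmax output over stochastic forward passes.

    Passing training=True keeps Dropout active at prediction time,
    so each pass samples a different sub-network.
    """
    passes = np.stack([model(x, training=True).numpy()
                       for _ in range(n_passes)])
    return passes.mean(axis=0)  # predictive distribution per instance

def predictive_entropy(probs, eps=1e-12):
    """Entropy of the predictive distribution. For ten classes the
    maximum, log(10) ~ 2.30 nats, is reached by the uniform case."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)
```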

Nevertheless, all stochastic models show that there are still several instances that produce very certain decisions although they are far from the training distribution. The model clearly underestimates the uncertainty that should be assigned to such instances. A possible explanation was already given in Chapter 4, via the relationship between a Dropout neural network and the sparse spectrum Gaussian process approximation. It may also be that if instances far from the training distribution share discriminative features with instances seen during training, the predictive distribution is almost equally certain for both. Moreover, the particular choice of non-linear activation function may have a strong influence on the induced distributions over functions and the corresponding uncertainty estimates. In [20], the authors also found it difficult to obtain uncertainty estimates utilizing Dropout as a Bayesian approximation. They propose an alternative attempt to recover uncertainty information by using the nonparametric bootstrap method. Their findings suggest that randomly initialized weights in combination with sub-sampled training sets are sufficient to obtain good GP approximations. Figure 6.2 shows an example of the recovered predictive distribution.
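A rough sketch of that bootstrap idea follows; `build_model` is a hypothetical factory returning a freshly (randomly) initialized Keras-style network, and the member count and training budget are assumptions, not values from [20].

```python
import numpy as np

def bootstrap_ensemble(build_model, x_train, y_train, x_test,
                       n_members=10, epochs=50):
    """Nonparametric bootstrap: every member is trained from a fresh
    random initialization on a training set resampled with
    replacement; the spread of the member predictions serves as an
    uncertainty estimate."""
    rng = np.random.default_rng(seed=0)
    predictions = []
    for _ in range(n_members):
        idx = rng.integers(0, len(x_train), size=len(x_train))
        member = build_model()  # fresh random initialization
        member.fit(x_train[idx], y_train[idx], epochs=epochs, verbose=0)
        predictions.append(member.predict(x_test, verbose=0))
    predictions = np.stack(predictions)
    return predictions.mean(axis=0), predictions.std(axis=0)
```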

In conclusion, if an accurate model with a carefully evaluated use of Dropout is already available, the method can be useful. It is, however, not obvious how the non-linear activation function impacts the uncertainty estimates. Methods that produce uncertainty estimates more closely tied to the strength of the evidence may even be more useful than using Dropout as an approximation.

(a) Gaussian process regression. (b) Approximated predictive distribution with uncertainty estimates using the nonparametric bootstrap.

Figure 6.2: Visualization of the predictive distribution using the nonparametric bootstrap. Figure is taken from [20].

Appendix A

Hyperparameters and Model Performance

Name    Type    Units    Training epochs    Dropout factor

model_11 DenseNN 30-10 10 0.1

model_12 DenseNN 30-10 10 0.3

model_13 DenseNN 30-10 10 0.5

model_14 DenseNN 30-10 10 0.7

model_15 DenseNN 30-10 10 0.9

model_16 DenseNN 30-10 20 0.1

model_17 DenseNN 30-10 20 0.3

model_18 DenseNN 30-10 20 0.5

model_19 DenseNN 30-10 20 0.7

model_110 DenseNN 30-10 20 0.9

model_111 DenseNN 30-10 100 0.1

model_112 DenseNN 30-10 100 0.3

model_113 DenseNN 30-10 100 0.5

model_114 DenseNN 30-10 100 0.7

model_115 DenseNN 30-10 100 0.9

model_21 DenseNN 500-10 10 0.1


model_22 DenseNN 500-10 10 0.3

model_23 DenseNN 500-10 10 0.5

model_24 DenseNN 500-10 10 0.7

model_25 DenseNN 500-10 10 0.9

model_26 DenseNN 500-10 20 0.1

model_27 DenseNN 500-10 20 0.3

model_28 DenseNN 500-10 20 0.5

model_29 DenseNN 500-10 20 0.7

model_210 DenseNN 500-10 20 0.9

model_211 DenseNN 500-10 100 0.1

model_212 DenseNN 500-10 100 0.3

model_213 DenseNN 500-10 100 0.5

model_214 DenseNN 500-10 100 0.7

model_215 DenseNN 500-10 100 0.9

model_31 DenseNN 1000-10 10 0.1

model_32 DenseNN 1000-10 10 0.3

model_33 DenseNN 1000-10 10 0.5

model_34 DenseNN 1000-10 10 0.7

model_35 DenseNN 1000-10 10 0.9

model_36 DenseNN 1000-10 20 0.1

model_37 DenseNN 1000-10 20 0.3

model_38 DenseNN 1000-10 20 0.5

model_39 DenseNN 1000-10 20 0.7

model_310 DenseNN 1000-10 20 0.9

model_311 DenseNN 1000-10 100 0.1

model_312 DenseNN 1000-10 100 0.3

model_313 DenseNN 1000-10 100 0.5

model_314 DenseNN 1000-10 100 0.7

model_315 DenseNN 1000-10 100 0.9


simple_cnn_mnist1 CustomCNN 512-10 5 0.1

simple_cnn_mnist2 CustomCNN 512-10 5 0.3

simple_cnn_mnist3 CustomCNN 512-10 5 0.5

simple_cnn_mnist4 CustomCNN 512-10 5 0.7

simple_cnn_mnist5 CustomCNN 512-10 5 0.9

simple_cnn_mnist6 CustomCNN 512-10 10 0.1

simple_cnn_mnist7 CustomCNN 512-10 10 0.3

simple_cnn_mnist8 CustomCNN 512-10 10 0.5

simple_cnn_mnist9 CustomCNN 512-10 10 0.7

simple_cnn_mnist10 CustomCNN 512-10 10 0.9

simple_cnn_mnist11 CustomCNN 512-10 50 0.1

simple_cnn_mnist12 CustomCNN 512-10 50 0.3

simple_cnn_mnist13 CustomCNN 512-10 50 0.5

simple_cnn_mnist14 CustomCNN 512-10 50 0.7

simple_cnn_mnist15 CustomCNN 512-10 50 0.9

dense_bayes_nn_mnist1 DenseBayesNN 30-10 - -
dense_bayes_nn_mnist2 DenseBayesNN 500-10 - -

Table A.1: Enumeration of all model structures and hyperparameter settings that were trained on the MNIST dataset.

Name    Type    Units    ER [%] (full test set)    ER [%] (LOO)    ER [%] (LOO, stoch.)

model11 DenseNN 30-10 4.34 3.79 3.69

model12 DenseNN 30-10 5.01 4.23 4.37

model13 DenseNN 30-10 6.15 5.38 5.37

model14 DenseNN 30-10 7.56 6.38 6.40

model15 DenseNN 30-10 11.61 10.07 10.77

model16 DenseNN 30-10 3.47 2.94 3.17

model17 DenseNN 30-10 4.37 3.62 3.48

model18 DenseNN 30-10 5.67 4.46 4.77

model19 DenseNN 30-10 7.56 5.93 6.13

model110 DenseNN 30-10 10.49 8.11 9.94

model111 DenseNN 30-10 3.55 2.98 2.95

model112 DenseNN 30-10 4.38 3.58 3.51

model113 DenseNN 30-10 5.23 4.48 4.50

model114 DenseNN 30-10 7.25 5.91 6.05

model115 DenseNN 30-10 10.58 8.94 9.99

model21 DenseNN 500-10 1.82 1.80 1.55

model22 DenseNN 500-10 1.89 1.64 1.71

model23 DenseNN 500-10 1.87 1.75 1.78

model24 DenseNN 500-10 2.09 1.98 2.00

model25 DenseNN 500-10 3.62 3.28 3.49

model26 DenseNN 500-10 1.78 1.70 1.87

model27 DenseNN 500-10 1.54 1.59 1.60

model28 DenseNN 500-10 1.56 1.58 1.52

model29 DenseNN 500-10 1.69 1.69 1.82

model210 DenseNN 500-10 2.87 2.73 2.88

model211 DenseNN 500-10 1.55 1.46 1.47


model212 DenseNN 500-10 1.55 1.38 1.43

model213 DenseNN 500-10 1.48 1.48 1.47

model214 DenseNN 500-10 1.59 1.55 1.57

model215 DenseNN 500-10 2.59 2.40 2.56

model31 DenseNN 1000-10 1.72 1.52 1.49

model32 DenseNN 1000-10 1.58 1.68 1.69

model33 DenseNN 1000-10 1.61 1.65 1.66

model34 DenseNN 1000-10 1.81 1.75 1.76

model35 DenseNN 1000-10 2.72 2.73 2.83

model36 DenseNN 1000-10 1.78 1.62 1.62

model37 DenseNN 1000-10 1.60 1.62 1.60

model38 DenseNN 1000-10 1.57 1.60 1.61

model39 DenseNN 1000-10 1.52 1.50 1.53

model310 DenseNN 1000-10 2.28 2.27 2.26

model311 DenseNN 1000-10 1.47 1.41 1.38

model312 DenseNN 1000-10 1.39 1.30 1.30

model313 DenseNN 1000-10 1.41 1.31 1.36

model314 DenseNN 1000-10 1.37 1.44 1.43

model315 DenseNN 1000-10 1.90 1.92 1.99

simple_cnn_mnist1 CustomCNN 512-10 1.21 1.20 1.16
simple_cnn_mnist2 CustomCNN 512-10 1.43 0.89 0.90
simple_cnn_mnist3 CustomCNN 512-10 1.18 0.79 0.81
simple_cnn_mnist4 CustomCNN 512-10 1.38 1.11 1.12
simple_cnn_mnist5 CustomCNN 512-10 1.72 1.43 1.45
simple_cnn_mnist6 CustomCNN 512-10 1.14 0.83 0.82
simple_cnn_mnist7 CustomCNN 512-10 1.13 0.79 0.79


simple_cnn_mnist8 CustomCNN 512-10 1.64 0.88 0.91
simple_cnn_mnist9 CustomCNN 512-10 0.96 1.04 1.02
simple_cnn_mnist10 CustomCNN 512-10 1.35 1.31 1.34
simple_cnn_mnist11 CustomCNN 512-10 0.83 0.61 0.63
simple_cnn_mnist12 CustomCNN 512-10 0.75 0.56 0.55
simple_cnn_mnist13 CustomCNN 512-10 0.86 0.51 0.49
simple_cnn_mnist14 CustomCNN 512-10 0.86 0.58 0.59
simple_cnn_mnist15 CustomCNN 512-10 1.00 0.76 0.75
dense_bayes_nn_mnist1 DenseBayes 30-10 6.00 4.80 -
dense_bayes_nn_mnist2 DenseBayes 500-10 3.58 3.20 -

Table A.2: Error rates for different model structures and hyperparameter settings. All models are evaluated on the full MNIST test set and the whole LOO test set. Predictions are determined via the argmax of the Softmax output (deterministic models) or of the mean Softmax output (stochastic models).
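For illustration, the two prediction rules from the caption can be written as follows, assuming a Keras-style `model` and test inputs `x_test` (both placeholders):

```python
import numpy as np

# Deterministic prediction: argmax of a single Softmax output.
y_det = np.argmax(model(x_test, training=False).numpy(), axis=-1)

# Stochastic prediction: argmax of the mean Softmax over T passes
# with Dropout kept active (MC Dropout).
T = 50
probs = np.stack([model(x_test, training=True).numpy() for _ in range(T)])
y_stoch = np.argmax(probs.mean(axis=0), axis=-1)
```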

Name    Mean of TP (det. model)    Mean of TP (stoch. model)    Lilliefors p-value 1    Lilliefors p-value 2    Levene test p-value    t-test p-value    MWU test p-value

model11 45.03 52.40 0.20 0.63 0.54 < 1E-3 < 1E-3
model12 35.77 42.80 0.48 0.60 0.04 < 1E-3 < 1E-3
model13 33.40 41.50 0.31 0.15 0.03 < 1E-3 < 1E-3


model14 30.50 34.03 0.34 0.94 0.49 0.093 0.091

model15 28.13 25.07 0.07 0.49 0.40 0.056 0.025

model16 51.10 56.37 0.27 0.48 0.07 0.005 0.011

model17 38.43 44.10 0.19 0.73 0.22 < 1E-3 0.002
model18 36.17 49.03 0.06 0.39 0.02 < 1E-3 < 1E-3
model19 26.93 36.53 0.32 0.76 0.04 < 1E-3 < 1E-3
model110 21.23 28.50 0.19 0.60 0.76 < 1E-3 < 1E-3
model111 44.03 55.57 0.02 0.03 0.02 < 1E-3 < 1E-3
model112 35.70 42.37 0.69 0.16 0.19 < 1E-3 < 1E-3
model113 30.90 40.10 0.01 0.31 0.35 < 1E-3 < 1E-3

model114 36.63 41.03 0.50 0.77 0.28 0.109 0.091

model115 26.80 32.93 0.45 0.51 0.36 < 1E-3 < 1E-3
model21 61.17 72.10 0.39 0.31 0.39 < 1E-3 < 1E-3
model22 50.93 64.47 0.47 0.18 0.01 < 1E-3 < 1E-3
model23 61.93 77.47 0.76 0.26 0.28 < 1E-3 < 1E-3
model24 52.87 65.30 0.80 0.77 0.09 < 1E-3 < 1E-3
model25 35.00 45.30 0.06 0.73 0.03 < 1E-3 < 1E-3

model26 51.37 56.57 0.01 0.26 0.76 1E-3 1E-3

model27 58.77 75.63 0.65 0.09 0.02 < 1E-3 < 1E-3
model28 63.67 81.37 0.46 0.06 0.74 < 1E-3 < 1E-3
model29 51.50 70.80 0.42 0.50 0.53 < 1E-3 < 1E-3
model210 42.20 53.07 0.41 0.36 0.01 < 1E-3 < 1E-3
model211 54.63 61.60 0.20 0.14 0.17 < 1E-3 < 1E-3
model212 61.10 70.70 0.51 0.01 0.71 < 1E-3 < 1E-3
model213 57.33 81.57 0.16 0.20 0.63 < 1E-3 < 1E-3
model214 54.97 74.40 0.56 0.23 0.53 < 1E-3 < 1E-3

model215 37.77 56.13 0.48 0.11 0.04 < 1E-3 < 1E-3

model31 55.03 60.60 0.18 0.43 0.35 1E-3 0.002

model32 56.00 63.50 0.18 0.03 0.47 < 1E-3 < 1E-3
model33 52.40 66.43 0.30 0.04 0.92 < 1E-3 < 1E-3
model34 55.77 63.67 0.03 0.47 0.41 < 1E-3 < 1E-3
model35 46.73 55.30 0.36 0.33 0.02 < 1E-3 < 1E-3

model36 56.63 61.40 0.51 0.48 0.82 0.006 0.006

model37 55.57 63.80 0.03 0.55 0.90 < 1E-3 < 1E-3
model38 50.80 65.83 0.06 0.25 0.89 < 1E-3 < 1E-3
model39 55.30 70.90 0.01 0.18 0.02 < 1E-3 < 1E-3
model310 44.53 57.33 0.06 0.19 0.37 < 1E-3 < 1E-3

model311 58.20 61.33 0.34 0.58 0.95 0.04 0.060

model312 61.23 67.60 0.34 0.02 0.87 < 1E-3 < 1E-3
model313 62.37 78.97 0.52 0.06 0.47 < 1E-3 < 1E-3
model314 59.37 80.13 0.01 0.15 0.08 < 1E-3 < 1E-3
model315 44.53 60.50 0.04 0.06 0.18 < 1E-3 < 1E-3
simple_cnn_mnist1 57.47 58.87 0.48 0.57 0.54 0.475 0.656
simple_cnn_mnist2 70.87 78.80 0.29 0.07 0.78 < 1E-3 < 1E-3
simple_cnn_mnist3 88.77 100.20 0.58 0.95 0.23 < 1E-3 < 1E-3
simple_cnn_mnist4 74.27 82.00 0.85 0.30 0.66 < 1E-3 < 1E-3
simple_cnn_mnist5 54.90 55.80 0.97 0.16 0.26 0.657 0.921
simple_cnn_mnist6 83.43 89.37 0.62 0.81 0.83 < 1E-3 < 1E-3
simple_cnn_mnist7 83.57 99.70 0.18 0.80 0.42 < 1E-3 < 1E-3
simple_cnn_mnist8 80.87 85.90 0.48 0.01 0.70 0.007 0.003
simple_cnn_mnist9 94.17 103.77 0.02 0.08 0.35 < 1E-3 < 1E-3
simple_cnn_mnist10 48.30 52.13 0.04 0.09 0.98 0.040 0.032

simple_cnn_mnist11 92.17 95.70 0.37 0.63 0.82 0.037 0.042
simple_cnn_mnist12 89.73 98.20 0.09 0.09 0.24 < 1E-3 < 1E-3
simple_cnn_mnist13 92.60 107.67 0.04 0.93 0.26 < 1E-3 < 1E-3
simple_cnn_mnist14 94.57 105.80 0.29 0.12 0.16 < 1E-3 < 1E-3
simple_cnn_mnist15 77.17 88.37 0.18 0.53 0.86 < 1E-3 < 1E-3
dense_bayes_nn_mnist1 43.20 47.63 0.04 0.02 0.85 < 1E-3 < 1E-3
dense_bayes_nn_mnist2 45.10 49.70 0.33 0.29 0.55 0.004 0.002

Table A.3: Significance tests on the MNIST dataset using the predictive entropy.

The experiment was performed 30 times, and the second and third columns show the mean number of injected instances that were rejected (TP values). In total, 210 instances were injected into the LOO test set, and the size of the rejected subset was also 210 in each case. The fourth and fifth columns show the p-values of the Kolmogorov-Smirnov test (KS test) with Lilliefors correction, which tests for normality of the data. The sixth column shows the p-value of the Levene test, which tests for homoscedasticity. Normality and homoscedasticity cannot be rejected if the p-value is above the significance level. In an ideal scenario, all TP values are larger for the stochastic model and all tests show a statistically significant difference. Deviations from the ideal case are coloured, using a confidence level of 95% for all statistical tests. If normality of the data can be rejected, but variance homogeneity cannot, the MWU test is used.
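A simplified sketch of this test cascade follows, assuming `tp_det` and `tp_stoch` hold the 30 TP counts per model; the decision logic is condensed (t-test when normality and homoscedasticity both hold, MWU test otherwise):

```python
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

def compare_tp_counts(tp_det, tp_stoch, alpha=0.05):
    """Lilliefors-corrected KS test for normality of both samples,
    Levene test for homoscedasticity, then t-test or MWU test."""
    p_norm_det = lilliefors(tp_det)[1]
    p_norm_stoch = lilliefors(tp_stoch)[1]
    p_levene = stats.levene(tp_det, tp_stoch)[1]
    if min(p_norm_det, p_norm_stoch) > alpha and p_levene > alpha:
        return stats.ttest_ind(tp_det, tp_stoch)[1]        # t-test p-value
    return stats.mannwhitneyu(tp_det, tp_stoch,
                              alternative="two-sided")[1]  # MWU p-value
```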

Name    TP (det. model)    FP (det. model)    TP (stoch. model)    FP (stoch. model)    Fisher's test p-value

model11 408 484 451 441 0.047

model12 349 543 388 504 0.067

model13 314 578 389 503 < 1E-3

model14 274 618 339 553 0.001

model15 308 584 313 579 0.842

model16 415 477 448 444 0.129

model17 354 538 395 497 0.055

model18 316 576 385 507 < 1E-3

model19 259 633 351 541 < 1E-3

model110 237 655 294 598 0.004

model111 420 472 455 437 0.107

model112 346 546 383 509 0.083

model113 325 567 375 517 0.017

model114 326 566 365 527 0.064

model115 277 615 350 542 < 1E-3

model21 501 391 509 383 0.738

model22 454 438 474 418 0.368

model23 487 405 538 354 0.017

model24 459 433 498 394 0.071

model25 355 537 382 510 0.211

model26 465 427 472 420 0.776

model27 499 393 523 369 0.271

model28 520 372 569 323 0.020

model29 459 433 528 364 0.001

model210 385 507 440 452 0.010

model211 472 420 494 398 0.318

model212 483 409 506 386 0.295


model213 486 406 543 349 0.007

model214 461 431 535 357 < 1E-3

model215 374 518 470 422 < 1E-3

model31 464 428 466 426 0.962

model32 465 427 479 413 0.537

model33 467 425 498 394 0.154

model34 476 416 502 390 0.234

model35 418 474 449 443 0.155

model36 488 404 496 396 0.739

model37 465 427 487 405 0.319

model38 477 415 498 394 0.342

model39 457 435 502 390 0.037

model310 403 489 467 425 0.003

model311 485 407 494 398 0.703

model312 476 416 505 387 0.183

model313 501 391 544 348 0.043

model314 473 419 539 353 0.002

model315 421 471 510 382 < 1E-3

simple_cnn_mnist1 458 434 477 415 0.393
simple_cnn_mnist2 587 305 597 295 0.652
simple_cnn_mnist3 592 300 597 295 0.841
simple_cnn_mnist4 571 321 592 300 0.320
simple_cnn_mnist5 498 394 508 384 0.667
simple_cnn_mnist6 614 278 631 261 0.409
simple_cnn_mnist7 599 293 614 278 0.477
simple_cnn_mnist8 586 306 613 279 0.190
simple_cnn_mnist9 641 251 665 227 0.219


simple_cnn_mnist10 465 427 478 414 0.569
simple_cnn_mnist11 586 306 599 293 0.547
simple_cnn_mnist12 610 282 635 257 0.216
simple_cnn_mnist13 643 249 671 221 0.147
simple_cnn_mnist14 641 251 671 221 0.120
simple_cnn_mnist15 583 309 607 285 0.248
dense_bayes_nn_mnist1 394 498 452 440 0.007
dense_bayes_nn_mnist2 426 466 442 450 0.477

Table A.4: Experiment on the full test set of the MNIST dataset. The rejected subset was determined only once for each model, and its size was set to 892 instances. Fisher's exact test was performed to test whether the TP/FP ratio of the deterministic model equals that of the stochastic model. Equality can be rejected if the p-value is below the significance level. Coloured values show cases where the hypothesized equality cannot be rejected, based on a confidence level of 95%.
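For illustration, the p-value of a single row of Table A.4 (here model11) can be reproduced from a 2x2 contingency table; the sketch below uses SciPy:

```python
from scipy.stats import fisher_exact

# TP/FP counts of model11 from Table A.4.
contingency = [[408, 484],   # deterministic model: TP, FP
               [451, 441]]   # stochastic model:    TP, FP
odds_ratio, p_value = fisher_exact(contingency)
print(round(p_value, 3))  # ~0.047, matching the table
```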

Model name    Units    Dropout    Rejection rates in percent
model112 30-10 30% 0, 17.0, 34.0, 51.3, 69.6, 92.5
model213 500-10 50% 0, 6.3, 12.6, 19.0, 26.4, 75.1
model313 1000-10 50% 0, 4.8, 9.5, 14.3, 19.4, 45.6
simple_cnn_mnist13 512-10 50% 0, 5.3, 10.7, 16.1, 21.9, 56.0

Table A.5: Points of equal rejection rates for models evaluated on the full MNIST test set using the variance as uncertainty estimate.

Model name    Units    Dropout    Rel. improvement in percent
model112 30-10 30% 0, 41.6, 67.6, 80.9, 93.5, 87.5
model213 500-10 50% 0, 12.1, 28.5, 45.0, 52.9, 60.0
model313 1000-10 50% 0, 8.2, 16.0, 29.4, 28.0, 48.6
simple_cnn_mnist13 512-10 50% 0, 10.3, 23.1, 47.1, 33.3, 0.0

Table A.6: Relative improvement of the stochastic model for points of equal rejection rates listed in Table A.5, which indicates how many more injected instances the remaining set would contain if the deterministic model were used instead.

Model name    Units    Dropout    Rejection rates in percent
model112 30-10 30% 0, 16.8, 33.6, 50.7, 68.7, 90.7
model213 500-10 50% 0, 6.3, 12.6, 19.0, 26.4, 75.1
model313 1000-10 50% 0, 4.8, 9.5, 14.3, 19.4, 45.6
simple_cnn_mnist13 512-10 50% 0, 5.3, 10.7, 16.1, 21.9, 56.0

Table A.7: Points of equal rejection rates for models evaluated on the full MNIST test set using the predictive entropy as uncertainty estimate.

Model name    Units    Dropout    Rel. improvement in percent
model112 30-10 30% 0, 23.8, 53.6, 70.8, 87.5, 91.7
model213 500-10 50% 0, 8.2, 23.5, 42.6, 46.2, 80.0
model313 1000-10 50% 0, 5.9, 13.6, 27.3, 25.9, 40.0
simple_cnn_mnist13 512-10 50% 0, 5.6, 14.5, 33.3, 28.6, 0.0

Table A.8: Relative improvement of the stochastic model for points of equal rejection rates listed in Table A.7, which indicates how many more injected instances the remaining set would contain if the deterministic model were used instead.

Name    Type    Dense units    Optimizer    Epochs    Dropout
simple_cnn_cifar10 CNN 512-10 Adam 50 0.2
large_cnn_cifar10a [1] CNN 4096-4096-10 SGD 40 0.1
large_cnn_cifar10b [2] CNN 4096-4096-10 Adam 70 0.5

Table A.9: Model structures with hyperparameters that were trained on the CIFAR-10 dataset.

Name    Type    Dense units    ER [%] (full test set)    ER [%] (LOO)    ER [%] (LOO, stoch.)

simple_cnn_cifar10 CNN 512-10 17.62 15.34 15.29
large_cnn_cifar10a CNN 4096-4096-10 11.56 9.11 9.14
large_cnn_cifar10b CNN 4096-4096-10 - 15.96 16.93

Table A.10: Models tested on the full CIFAR-10 test set and the LOO test set. Training the last model on the full dataset was omitted due to time constraints.

[1] Dropout applied after the last two layers with 4096 units.

[2] Dropout applied after each inner layer (also within convolutional layers) and initialized with pre-trained weights.

Name    Mean of TP (det. model)    Mean of TP (stoch. model)    Lilliefors p-value 1    Lilliefors p-value 2    Levene test p-value    t-test p-value    MWU test p-value

simple_cnn_cifar10 14.00 12.47 0.37 0.23 0.49 0.07 0.04
large_cnn_cifar10a 20.73 23.30 0.13 0.13 0.60 0.014 0.023
large_cnn_cifar10b 14.17 11.07 0.04 0.14 0.128 < 1E-3 < 1E-3

Table A.11: Significance test on the CIFAR-10 dataset. All values are averaged over 30 evaluations with 200 mutually exclusive injected instances.

Name    TP (det. model)    FP (det. model)    TP (stoch. model)    FP (stoch. model)    Fisher's test p-value

simple_cnn_cifar10 278 722 303 697 0.237
large_cnn_cifar10a 319 681 360 640 0.026
large_cnn_cifar10b 282 737 249 757 0.105

Table A.12: Experiment on the full test set of the CIFAR-10 dataset with 1000 instances of class five.

List of Figures

1.1 Two different methods to determine model uncertainty. . . 3

2.1 Marginalization of a two dimensional Gaussian density. . . 10

2.2 Posterior distribution for a binary classication task. . . 12

2.3 Decision regions under different cost schemes. . . 14

2.4 Extension to the full posterior predictive distribution. . . 18

3.1 Standard Neural Network and Bayesian Neural Network . . . 24

3.2 Bayesian regression with fixed basis functions. . . 28

3.3 Gaussian process priors and RBF regression. . . 31

3.4 Transformation of a Gaussian density by a non-linear function. . . 33

3.5 Difference between generative and discriminative modeling. . . 37

3.6 Gaussian process regression with different observation noise models. . . 39

4.1 Applied Dropout in a Standard Neural Network . . . 42

4.2 Comparison between standard Dropout and MC Dropout . . . 43

4.3 Influence of Dropout on features. . . 44

4.4 Dropout's effect on sparsity . . . 44

4.5 Drawn functions from the approximate Gaussian process likelihood. . 49

4.6 Bi-modal approximate posterior. . . 52

4.7 Fourier expansion of a neural network output function. . . 57

4.8 Comparison between a sparse spectrum GP and a full GP. . . 59

5.1 Experimental setup during training and testing. . . 66

5.2 Evaluation on subsets of equal size. . . 67

5.3 Evaluation and method comparison based on a confusion matrix. . . 68

5.4 Training of a dense model with 30 hidden units and 70% dropout. . . 71

5.5 Training of a dense model with 500 hidden units and 70% dropout. . 72

5.6 Training of a dense model with 1000 hidden units and 70% dropout. . 73

5.7 Training of a model with 4 convolutional layers and additional 512 dense hidden units and 70% dropout. . . 74

5.8 Variance as uncertainty estimate . . . 77

5.9 Predictive entropy as uncertainty estimate . . . 78

5.10 Mutual-information as uncertainty estimate . . . 79

5.11 Variation-ratio as uncertainty estimate . . . 80

5.12 Performance comparison of models with increasing complexity using the variance as uncertainty estimate. . . 82

5.13 Detailed comparison between the deterministic and stochastic model using the variance as uncertainty estimate and a short training time. . . 83

5.14 Detailed comparison between the deterministic and stochastic model using the variance as uncertainty estimate and a medium training time. . . 85

5.15 Detailed comparison between the deterministic and stochastic model using the variance as uncertainty estimate and a long training time. . . 86

5.16 Detailed comparison between the deterministic and stochastic model using the variance as uncertainty estimate for various threshold values. . . 87

5.17 Performance comparison of models with increasing complexity using the predictive entropy as uncertainty estimate. . . 90

5.18 Detailed comparison between the deterministic and stochastic model using the predictive entropy as uncertainty estimate and a short training time. . . 91

5.19 Detailed comparison between the deterministic and stochastic model using the predictive entropy as uncertainty estimate and a medium training time. . . 92

5.20 Detailed comparison between the deterministic and stochastic model using the predictive entropy as uncertainty estimate and a long training time. . . 93

5.21 Detailed comparison between the deterministic and stochastic model using the predictive entropy as uncertainty estimate for various threshold values. . . 94

5.22 Training of the simple CNN model on the LOO-CIFAR-10 training set. . . 97

5.23 Simple CNN (20% Dropout) trained on the CIFAR-10 dataset. . . 98

5.24 Large CNN (10% Dropout) trained on the CIFAR-10 dataset. . . 99

5.25 Large CNN (50% Dropout) trained on the CIFAR-10 dataset. . . 99

6.1 Exemplary Softmax output for a classification task. . . 102

6.2 Gaussian process approximation using the nonparametric bootstrap. . . 104

List of Tables

3.1 Overview of the Bayesian hierarchy. . . 20

5.1 Class distribution among the MNIST training set comprising 50,000 instances in total. . . 69

5.2 Class distribution among the MNIST validation set comprising 10,000 instances in total. . . 70

A.1 Enumeration of all models that were trained on the MNIST dataset . . . 107

A.2 Error rates for different model structures and hyperparameter settings evaluated on the MNIST test set . . . 110

A.3 Significance tests on the MNIST dataset using the predictive entropy. . . 113

A.4 Experiment on the full test set of the MNIST dataset. . . 116

A.5 Explicit common rejection rates to evaluate the variance as uncertainty measure on the MNIST test set. . . 117

A.6 Relative improvement of the stochastic model for common rejection rates listed in Table A.5. . . 117

A.7 Explicit common rejection rates to evaluate the predictive entropy as uncertainty measure on the MNIST test set. . . 117

A.8 Relative improvement of the stochastic model for common rejection rates listed in Table A.7. . . 117

A.9 Models trained on the CIFAR-10 dataset. . . 118

A.10 Model performance evaluated on the CIFAR-10 dataset. . . 118

A.11 Significance test on the CIFAR-10 dataset for the repetitive experiment. . . 119

A.12 Significance test on the CIFAR-10 dataset for the whole test set. . . 119