
5.3 Evaluation on the MNIST Dataset

5.3.3 Variance as Uncertainty Estimate

Figure 5.12: Performance comparison of models with increasing complexity (from left to right) using the variance as uncertainty estimate. The first row depicts the validation error rate evaluated on the LOO validation set. The second row shows the number of instances of the excluded class that could be rejected correctly by the deterministic model. The third row visualizes the number of correct rejections of the stochastic version of the model. The number of rejected instances (last two rows) represents the average over 30 runs. For each run, 210 different instances of the excluded class have been mixed into the validation set. The first three dense models were evaluated after training periods of 10, 20, and 100 epochs. The last, convolutional model in the rightmost column was trained for 5, 10, and 50 epochs.

Figure 5.13: Detailed comparison of models with increasing complexity (from left to right). The first three dense models (first three columns) have been trained for 10 epochs, the fourth convolutional model (fourth column) for 5 epochs. The first row depicts the validation error rate of the deterministic model evaluated on the LOO validation set. The second row compares the number of correctly rejected instances of the excluded class between the deterministic (solid line) and stochastic (dashed line) model. The third row shows the average error rate within all other classes (all classes except the excluded class) of the remaining instances in the rejected subset.

In total, 210 instances have been rejected by adjusting the threshold for both the deterministic and the stochastic model. The rejected subset therefore consists of a mix of instances of the excluded class and of all other classes that the model was trained on. The number of rejected instances (last two rows) represents the average over 30 runs. For each run, 210 different instances of the excluded class have been mixed into the validation set. Additionally, a statistical test was performed to determine the significance of the difference in the number of correctly rejected instances between both model versions. All decisions are based on a 95% confidence level. Statistically significant differences are shown as orange circles. Black squares depict cases where the test could not be performed due to violations of the preconditions or where the result yielded a statistically insignificant difference.
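As a point of reference for how the variance scores behind these figures are obtained, the following minimal sketch computes the predictive variance via Monte Carlo Dropout. The helper stochastic_predict and the number of forward passes T are illustrative assumptions, not the exact setup of this thesis:

    import numpy as np

    def mc_dropout_uncertainty(stochastic_predict, x, T=50):
        # stochastic_predict: hypothetical helper returning softmax
        # outputs of shape (n_instances, n_classes) with Dropout kept
        # active at inference time (e.g. model(x, training=True) in Keras).
        # T: number of stochastic forward passes (illustrative choice).
        probs = np.stack([stochastic_predict(x) for _ in range(T)])  # (T, n, C)
        mean = probs.mean(axis=0)              # predictive mean, (n, C)
        pred = mean.argmax(axis=1)             # class prediction
        # Variance of the predicted class's probability across the T
        # passes; the thesis may aggregate per-class variances differently.
        var = probs.var(axis=0)[np.arange(len(pred)), pred]
        return pred, var

Instances whose variance exceeds a chosen threshold are then moved into the rejected subset.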

The first observation here is the strong impact of the training time. Comparing the least complex model (first column) with the most complex model (fourth column) reveals that it is not only the classification capability of the deterministic model that influences the performance of the stochastic model. A comparison at a Dropout rate of 50% shows that the deterministic convolutional model, with an error rate of 0.79%, has a much better classification performance than the least complex model, whose error rate is 5.37%. Nevertheless, the stochastic version of the most complex model rejected fewer instances with class label five than its deterministic counterpart. Note, however, that the error rate within the rejected instances with other class labels (third row) shows a different picture: the stochastic version of the less complex model rejects fewer misclassifications. The general performance improvement that can be observed for increased training time in Figures 5.14 and 5.15 further supports this argument.

Figure 5.14 shows the comparison after doubling the initial training time. A general improvement is noticeable for all stochastic models over a larger variety of Dropout rates. The error rate within the instances of the LOO validation set (third row) is still below that of the deterministic models. The gap increases for Dropout rates above 50%. This behaviour is still present for the longest training time shown in Figure 5.15. The number of correct rejections goes down for very large Dropout rates for both model variants, independently of the training time.

Figure 5.15 shows the results for the longest training time. The number of epochs is ten times larger than in the initially evaluated setting. Rejections of the artificially introduced instances are above those of the deterministic models in almost all cases. The error rate within the remaining instances of the rejected subset (third row) is at least as good as that of the deterministic version, or slightly above for moderate Dropout rates. The only exception is the dense model with 30 units (first column). The reason for its poor performance is not solely the low complexity of the model but also the type of uncertainty estimate that was used.

Figures 5.13, 5.14, and 5.15 represent an evaluation on a validation set with a total of 9318 instances and a rejection subset size of 210 instances. This setup corresponds to a rejection rate of 2.25%. Figure 5.16 concludes the evaluation of the variance as uncertainty measure via a comparison using different threshold values.
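To make the fixed rejection budget concrete: with a budget of 210 rejections, the threshold is simply the uncertainty value at the matching quantile of the scores. A minimal sketch under the same illustrative naming as above:

    import numpy as np

    def threshold_for_budget(uncertainty, n_reject=210):
        # Cut-off so that exactly the n_reject most uncertain instances
        # are rejected (ties may push the count slightly above the budget).
        cut = np.sort(uncertainty)[::-1][n_reject - 1]
        reject_mask = uncertainty >= cut
        return cut, reject_mask

    # 210 rejected out of 9318 validation instances:
    # 210 / 9318 = 0.0225..., i.e. the 2.25% rejection rate stated above.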

The first two rows show the evaluation using the complete MNIST validation set containing 892 instances with class label five. The results of the last two rows are based on the LOO validation set without any injected instances.

Figure 5.14: Detailed comparison of models with increasing complexity (from left to right). The first three dense models (first three columns) have been trained for 20 epochs, the fourth convolutional model (fourth column) for 10 epochs. The first row depicts the validation error rate of the deterministic model evaluated on the LOO validation set. The second row compares the mean of the number of correctly rejected instances of the excluded class between the deterministic (solid line) and stochastic (dashed line) model. The third row shows the average error rate within all other classes (all classes except the excluded class) of the remaining instances in the rejected subset. All decisions are based on a 95% confidence level. Statistically significant differences are shown as orange circles. Black squares depict cases where the test could not be performed due to violations of the preconditions or where the result yielded a statistically insignificant difference.

Figure 5.15: Detailed comparison of models with increasing complexity (from left to right). The first three dense models (first three columns) have been trained for 100 epochs, the fourth convolutional model (fourth column) for 50 epochs. The first row depicts the validation error rate of the deterministic model evaluated on the LOO validation set. The second row compares the mean of the number of correctly rejected instances of the excluded class between the deterministic (solid line) and stochastic (dashed line) model. The third row shows the average error rate within all other classes (all classes except the excluded class) of the remaining instances in the rejected subset. All decisions are based on a 95% confidence level. Statistically significant differences are shown as orange circles. Black squares depict cases where the test could not be performed due to violations of the preconditions or where the result yielded a statistically insignificant difference.
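The captions refer to a statistical test with preconditions, but this excerpt does not restate which test is used. A plausible sketch, assuming a normality precondition (Shapiro-Wilk) followed by a two-sided Welch t-test over the 30 runs at the 95% level; the test choice is an assumption, not the thesis' documented procedure:

    from scipy import stats

    def significance_of_difference(det_runs, sto_runs, alpha=0.05):
        # det_runs / sto_runs: correctly rejected counts over the 30 runs.
        # Precondition check (assumed here to be normality of both samples).
        for sample in (det_runs, sto_runs):
            if stats.shapiro(sample).pvalue < alpha:
                return "precondition violated"   # black square in the plots
        # Two-sided Welch t-test at the 95% confidence level.
        p = stats.ttest_ind(det_runs, sto_runs, equal_var=False).pvalue
        return "significant" if p < alpha else "not significant"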

Figure 5.16: Comparison of the models that were trained longest (50 and 100 epochs of training time). All plots show the performance over a large range of different thresholds. Model complexity is increased from left to right. The first model (first column) uses a Dropout rate of 30% and the other three a rate of 50%. The first row depicts the semi-logarithmic Error-Reject trade-off for the deterministic (solid blue line) and stochastic model (dashed orange line) evaluated on the whole validation set, which includes 892 instances of the class that was excluded during training. Lower values represent better performance because the y-axis shows the error rate within the remaining subset. The third row shows the Error-Reject curves evaluated on the LOO validation set without artificially introduced instances of the excluded class. The second row shows the total number of instances with class label five that are still contained in the remaining (not rejected) subset for selected rejection rates. The fourth row depicts the number of incorrectly classified instances that remain, using only the LOO validation set. All explicitly picked value pairs in the second and fourth row share a common rejection rate and are therefore directly comparable. The numbers shown next to the bars indicate the absolute difference in the number of remaining incorrect instances. Positive values indicate better performance of the stochastic model compared to the deterministic one. Numbers in the bottom right plot are omitted but are similar to those of the other three models to the left.

The first and third rows show the absolute error rate within the remaining set versus the rejection rate for a particular threshold. Threshold values are adjusted stepwise in the most fine-grained way, which may result in unequal rejection rates for the individual methods.

An explicit comparison between both model variants for all data points is therefore impossible without interpolation. Nevertheless, several error rates share a common rejection rate, and an excerpt of those points is used to explicitly compare both model variants, as depicted in the second and fourth row. Table A.5 in Appendix A lists all rejection rates for the second row of the figure, and Table A.6 shows the corresponding relative improvement of the stochastic model at those points. The error rate in the first row results from misclassified instances of the LOO validation set plus the misclassifications of the injected instances, which are erroneous by design.
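Conceptually, each Error-Reject curve can be traced by rejecting the k most uncertain instances for every possible k and measuring the error rate among the rest; a sketch with illustrative names:

    import numpy as np

    def error_reject_curve(uncertainty, is_error):
        # One curve point per possible rejection count: reject the k most
        # uncertain instances and measure the error rate among the rest.
        order = np.argsort(uncertainty)[::-1]      # most uncertain first
        errors = np.asarray(is_error, dtype=float)[order]
        n = len(errors)
        rejection_rate = np.arange(n) / n
        remaining_error = np.array([errors[k:].mean() for k in range(n)])
        return rejection_rate, remaining_error

Because each model variant produces its own set of distinct uncertainty values, the resulting rejection rates rarely line up exactly, which is why only points with a shared rejection rate are compared explicitly.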

Notice that in the second row only instances with class label five are considered to keep the results consistent with the previous evaluation setup (comparison of the TP values). The first data point at a rejection rate of zero therefore always contains all 892 instances for both model variants. This is different for the comparison shown in the last row of the plot, which presents the remaining misclassifications using only the LOO validation set. The difference in the number of remaining incorrect instances at a rejection rate of zero depends on the general performance disparity between the deterministic and the stochastic model.
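For the second row, the quantity of interest is simply how many of the 892 injected class-five instances survive the rejection step; a short sketch under the same illustrative naming:

    import numpy as np

    def remaining_excluded_instances(labels, reject_mask, excluded_label=5):
        # Count instances of the excluded class left in the remaining set.
        remaining = ~np.asarray(reject_mask)
        return int(np.sum((np.asarray(labels) == excluded_label) & remaining))

    # At a rejection rate of zero nothing is rejected, so both model
    # variants start at all 892 class-five instances.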

The evaluation reveals that the performance difference is dominated by the influence of the injected instances. The tendency of the stochastic model to identify and reject instances that are further apart from the training distribution also continues for higher rejection rates and yields overall lower error rates within the remaining set. Nevertheless, there is no real difference in the error rates for the presented model structures if only instances with class labels that the model has seen during training are used. Additionally, models with higher error rates struggle if the variance is used as uncertainty estimate, as can be seen in the lower half of the first column.

The stochastic dense model with 30 units, 30% Dropout, and an error rate of 3.57% on the LOO dataset is still not as good as its deterministic counterpart below a rejection rate of 50%.