
Up to this point, several relationships between Standard Neural Networks, Bayesian Neural Networks, and Gaussian processes were discussed. The great versatility and expressiveness of the GP opens up many possibilities. In this section, the conceptual difference between generative and discriminative models will be sketched. Since neural networks are discriminative models, this will help to interpret the modeled uncertainty in the predictive distribution. This will be useful in Chapter 4, where the posterior predictive distribution will be recovered from a set of neural network output functions. The connection to an approximate GP with a constant observation noise model will be established in the first part of the next section, and the resulting uncertainty estimates should therefore be interpreted in this context.

The joint probability distribution p(x, y) models the complete information that was available during training. For the same reason, it is challenging to model, because all necessary details in the data must be captured with adequate precision.

The advantage of generative models is the possibility to determine the evidence p(x) from the joint distribution by marginalization. This allows assessing the strength of the evidence behind a posterior prediction. New, yet unseen instances with low probability p(x) may indicate that those samples lie further from the training distribution, because such instances were only sparsely or not at all present within the training dataset. Identification of such instances is the concern of so-called outlier or novelty detection methods, as described in [2, 22].
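To make the marginalization step concrete, the following sketch discretizes a toy joint distribution and flags test points with low evidence. The grid, class-conditional densities, and threshold are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy.stats import norm

# Toy class-conditional densities p(x|c1), p(x|c2) with equal priors,
# discretized on a grid (illustrative values only).
x = np.linspace(-3.0, 3.0, 601)
prior = np.array([0.5, 0.5])
likelihoods = np.vstack([norm.pdf(x, loc=0.0, scale=0.5),
                         norm.pdf(x, loc=1.0, scale=0.3)])

# Joint p(x, y) = p(x|y) p(y); evidence p(x) by marginalizing over y.
joint = likelihoods * prior[:, None]
evidence = joint.sum(axis=0)

# A simple novelty flag: low evidence suggests an instance lies far from
# the training distribution (the threshold is an arbitrary choice).
threshold = 1e-3
is_novel = evidence < threshold
print(f"fraction of grid flagged as novel: {is_novel.mean():.2f}")
```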

A discriminative model, which determines the classification risk p(y|x) directly, does not provide this kind of information any more. The modeling advantage is apparently a reduction in complexity by omitting the intermediate step of modeling p(x|y). However, this loss of information prevents reasoning about the uncertainty of a decision, because both strong and weak evidential values may lead to the same posterior risk.

Figure 3.5 depicts the difference in encoded information between generative and discriminative models. To illustrate the mentioned surjective projection, consider the region of values x < −0.7 and the region around x = 0. The total probability of observing a value x in the interval ]−∞, −0.7] is roughly 0.0047 %, and p(x|c2) is absolutely negligible in that region. This leads to a posterior distribution of p(y|x) = [1.0, 0.0]. The same posterior value is assigned to the region around x = 0, but with a much stronger evidential basis p(x = 0|c1).
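A short numerical illustration of this surjective mapping: with two assumed class-conditional Gaussians (parameters chosen only to mimic the qualitative shape of Figure 3.5, not the actual values), the posterior is identical at a weak-evidence point and at a strong-evidence point.

```python
import numpy as np
from scipy.stats import norm

def posterior_and_evidence(x, priors=(0.5, 0.5)):
    """Bayes' rule p(y|x) = p(x|y) p(y) / p(x) for two toy classes.

    The Gaussian parameters below are illustrative assumptions.
    """
    lik = np.array([norm.pdf(x, loc=0.0, scale=0.25),   # p(x|c1)
                    norm.pdf(x, loc=2.0, scale=0.25)])  # p(x|c2)
    joint = lik * np.asarray(priors)
    evidence = joint.sum()                               # p(x)
    return joint / evidence, evidence

for x in (-1.0, 0.0):
    post, ev = posterior_and_evidence(x)
    print(f"x={x:+.1f}  p(y|x)={np.round(post, 3)}  p(x)={ev:.2e}")
# Both points yield p(y|x) ~ [1, 0], but the evidence differs by several
# orders of magnitude -- the posterior alone cannot tell them apart.
```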

It is important to recognize that p(x|y) is assumed to represent the true underlying distribution. A probability of zero means that the occurrence of a certain feature value x is strictly impossible (e.g. due to physical constraints). If this is truly the case and all presented instances, during test time or when used in production, also reflect this distribution, the asserted risk will be correct. Misclassification in regions with low evidence will have little effect if such instances really do occur only very rarely. The maximal posterior class probability is therefore also often interpreted as model confidence, which is only valid if the training set and the test set reflect the same distribution3.

To restrict the number of misclassifications, the decision process may include the application of a threshold on the maximal posterior class probability. Classifications beneath the threshold value are rejected to avoid decisions with high risk. This constrains the probability of a misclassification to a certain level.
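A minimal sketch of such a reject option; the threshold value is an arbitrary assumption.

```python
import numpy as np

def classify_with_reject(posterior, threshold=0.9):
    """Return the predicted class index, or None to reject.

    `posterior` is a vector of class probabilities p(y|x); predictions
    whose maximal posterior falls below `threshold` are rejected.
    """
    posterior = np.asarray(posterior)
    if posterior.max() < threshold:
        return None  # risk too high, defer the decision
    return int(posterior.argmax())

print(classify_with_reject([0.55, 0.45]))  # None (rejected)
print(classify_with_reject([0.97, 0.03]))  # 0
```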

However, the training data in real situations will rarely comprise the whole set of all possible feature combinations and their corresponding outputs. A classification decision based on the posterior always minimizes the risk w.r.t. the training data (or identically distributed test data), which is the source of the modeled joint distribution. Low probability values of p(x, y) are therefore either the result of a true association (those values really do occur rarely) or of missing evidence.

If a model is confronted with data that is very dissimilar to the training distribution (not identically distributed), a classification decision may severely underestimate the true misclassification risk. It is desirable to incorporate that knowledge in a discriminative model approach, to be able to reason about a decision. More expressive models, which provide a certain amount of uncertainty information, may help to combine the advantages of both modeling approaches. One may, for example, want to investigate instances with high uncertainty to refine the model by further training, or reject doubtful classifications.

Figure 3.5: Difference between generative and discriminative models. (a) Generative model p(x|y). (b) Discriminative model p(y|x). The prior class probabilities are P(c1) = P(c2) = 0.5. The generative approach, shown on the left, models details in the feature distribution which are not absolutely necessary to determine the posterior p(y|x). This behaviour can best be observed for values of x between −0.7 and 1.3: the additional model complexity of p(x|c1) has no influence on the posterior because of the very low probability values p(x|c2) in that region. Moreover, the posterior is continued for values of x lower than −0.7. The generally low evidence p(x) of the feature x spreads4 the posterior. The content of the figure is based on a visualization in [3].

3Instances of both sets are independent and identically distributed (i.i.d.).

4Normalization of p(x|y) · p(y) by division with p(x).

A very versatile non-parametric model is the Gaussian process, which can be seen as a multidimensional normal distribution that incorporates the whole training set to assess the variability of function values at arbitrary test points. Figure 3.6 shows some intuitive examples that visualize the expressiveness of the GP model. Figure 3.6a shows the probability distribution of the evidence p(x) for a very small training set comprising ten observations. Figure 3.6b models the observed values y via a noise-free GP model5, which leads to very distinct predictive values where observations are obtained. It is simple to express constant observation noise instead of assuming a noise-free model, which retains a minimum, constant amount of uncertainty. This constant Gaussian noise model is expressed by

y = f(x, ω) + ε,   (3.54)

ε ∼ N(0, τ⁻¹I),   (3.55)

with model precision τ = 1/σn². Instead of constant observation noise, a more elaborate heteroscedastic noise model may allow the noise level to be increased individually where necessary, as shown in Figure 3.6c.
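The following is a compact sketch of GP regression under the constant-noise model of Equations 3.54 and 3.55, using the squared exponential kernel with the hyperparameter values from Figure 3.6; the dataset and the precision value are placeholder assumptions.

```python
import numpy as np

def se_kernel(a, b, sigma_f_sq=16.0, length_scale_sq=1.0):
    """Squared exponential kernel k(a,b) = sigma_f^2 exp(-(a-b)^2 / (2 l^2))."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return sigma_f_sq * np.exp(-0.5 * sq_dist / length_scale_sq)

def gp_predict(x_train, y_train, x_test, tau=1e4):
    """GP predictive mean and variance under the constant noise model of
    Eqs. 3.54/3.55, where tau = 1/sigma_n^2 is the model precision."""
    n = len(x_train)
    K = se_kernel(x_train, x_train) + (1.0 / tau) * np.eye(n)
    K_s = se_kernel(x_train, x_test)
    K_ss = se_kernel(x_test, x_test)
    # An explicit inverse is used for clarity; prefer a Cholesky solve in practice.
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov) + 1.0 / tau  # add observation noise 1/tau

# Placeholder dataset: ten noisy observations of a sine function.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3.0, 3.0, 10))
y_train = np.sin(x_train) + rng.normal(0.0, 0.01, 10)
x_test = np.linspace(-5.0, 5.0, 5)
mean, var = gp_predict(x_train, y_train, x_test)
print(np.round(mean, 2), np.round(np.sqrt(var), 2))
```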

Note that a heteroscedastic noise model could also be used to express a different kind of uncertainty, one that is proportional to the probability of the evidence, as shown in Figure 3.6d. However, to obtain that kind of uncertainty, a different objective is necessary; risk minimization will lead to solutions similar to Figures 3.6b or 3.6c.
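Under such a model, the constant noise term (1/τ)·I in the Gram matrix is replaced by a diagonal matrix of per-observation noise variances; continuing the sketch above (the noise values are again placeholders):

```python
# Heteroscedastic variant of the sketch above: per-observation noise
# variances replace the constant term (1/tau) * I in the Gram matrix.
noise_var = np.linspace(0.01, 0.5, len(x_train))  # placeholder noise levels
K_hetero = se_kernel(x_train, x_train) + np.diag(noise_var)
# The predictive mean and covariance then follow exactly as in gp_predict.
```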

The quality of the uncertainty estimates in regions where no observations are available is strongly dependent on the type of covariance kernel and its hyperparameters. Optimization via the marginal likelihood (ML-II) of the GP is a trade-off between model fit and model complexity (values of the covariance matrix). Without a Bayesian treatment of the hyperparameters, this may also lead to overfitted solutions and an overconfident reduction of the uncertainty estimates near observations.
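For reference, this trade-off can be read directly from the standard form of the log marginal likelihood that ML-II maximizes, added here for context, with Ky denoting the covariance matrix including the noise term:

log p(y|X) = −(1/2) yᵀKy⁻¹y − (1/2) log |Ky| − (n/2) log 2π.

The first term rewards data fit, the second penalizes model complexity, and the third is a normalization constant.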

5The variance of the noise σn² is set to 10⁻⁶ to maintain numerical stability of the matrix inversion problem.

Figure 3.6: Visualization of different observation noise models for the Gaussian process model. (a) Evidence p(x). (b) GP p(y|x) with noise-free observations. (c) GP p(y|x) with heteroscedastic noise model. (d) GP p(y|x) with heteroscedastic noise model. The dataset comprises ten observations and the GP model uses a squared exponential kernel with magnitude σf² = 16 and length scale l² = 1. The hyperparameters are not optimized and are chosen for visualization purposes only.

Chapter 4

Dropout as a Bayesian Approximation

The authors of [8] propose an efficient method to obtain predictive uncertainty estimates utilizing popular Stochastic Regularization Techniques (SRTs). They further state that ANNs of arbitrary depth and with arbitrary non-linearities that use Dropout before each weight layer can be interpreted as an approximation to the deep GP model. To briefly cover the mathematical details, Section 4.1 will describe the SRT known as Dropout or Standard Dropout. Section 4.2 will touch upon the relationship to GPs, and Section 4.3 summarizes how to gather approximate uncertainty information for regression and classification.
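As a preview of the procedure detailed in the following sections, a minimal sketch of the idea from [8]: keep dropout active at test time and average T stochastic forward passes to approximate the predictive mean and variance. The tiny network and all parameter values are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_forward(x, W1, W2, p_drop=0.5):
    """One stochastic forward pass with dropout before each weight layer
    (inverted-dropout scaling keeps the expected activations unchanged)."""
    mask_in = (rng.random(x.shape) > p_drop) / (1.0 - p_drop)
    h = np.maximum(W1 @ (x * mask_in), 0.0)   # hidden layer, ReLU
    mask_h = (rng.random(h.shape) > p_drop) / (1.0 - p_drop)
    return W2 @ (h * mask_h)                  # linear output

# Placeholder weights of a toy 1-50-1 regression network.
W1 = rng.normal(0.0, 1.0, (50, 1))
W2 = rng.normal(0.0, 0.3, (1, 50))

x = np.array([0.5])
T = 200  # number of stochastic forward passes
samples = np.array([mc_forward(x, W1, W2) for _ in range(T)])
# Predictive mean and (epistemic) spread; for regression, the constant
# observation noise 1/tau of Eq. 3.55 would be added to the variance.
print("mean:", float(samples.mean()), "std:", float(samples.std()))
```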