A Systematic Evaluation of Efficient Uncertainty Estimation in Neural Networks

Master Thesis
to obtain the academic degree of
Diplom-Ingenieur
in the Master's Program

Submitted by: David Kowanda
Submitted at: Department of Computational Perception
Supervisor: Gerhard Widmer, Univ.-Prof., Dr.
Co-Supervisors: Bernhard Lehner, Dipl.-Ing.; Matthias Dorfer, Univ.-Ass., Dipl.-Ing.

November 2018
Johannes Kepler University Linz
Altenbergerstraße 69, 4040 Linz, Österreich

Abstract

Deep neural networks are powerful discriminative models, which comprise vast amounts of parameters. Efficient training methods therefore often optimize these parameters to provide point estimates of the full posterior distribution. This reduction of complexity comes with a price, namely the loss of uncertainty information. Incorporating uncertainty estimates in discriminative approaches would be advantageous but often implies computational intractability or a significant increase in the number of parameters. The following work reviews different concepts such as Bayesian neural networks and Gaussian processes, which inherently provide the desired properties. Since these methods do not scale well to larger datasets, stochastic regularization methods may be utilized as an efficient alternative to recover model uncertainty from deterministic neural networks. The provided practical evaluation addresses the question whether such approximations provide qualitatively sufficient estimates.


Zusammenfassung

Deep neural networks are powerful discriminative models that are determined by a large number of parameters. These parameters are therefore often optimized in a way that yields only point estimates of the posterior distribution. Such a reduction of complexity has its price, namely the loss of uncertainty information. Incorporating uncertainty into discriminative models would be advantageous, but is often too computationally expensive or leads to a significant increase in the number of parameters. The following work examines several concepts, such as Bayesian neural networks and Gaussian processes, which exhibit many of the desired properties. Since these two methods scale poorly to large amounts of data, stochastic regularization methods can be used as an efficient alternative to approximately recover model uncertainty from deterministic neural networks. The present work investigates whether these approximate solutions provide qualitatively sufficient estimates.


Contents

1 Introduction
  1.1 Thesis Outline

2 Probabilistic Modeling
  2.1 Models for Classification and Regression
  2.2 Interpretation of the Posterior
    2.2.1 Posterior Uncertainty

3 Bayesian Treatment of Uncertainty
  3.1 Model Selection and Model Averaging
  3.2 Bayesian Neural Networks
    3.2.1 Gaussian processes
    3.2.2 Transformation of Random Variables
  3.3 Limitations of Generative and Discriminative Models

4 Dropout as a Bayesian Approximation
  4.1 Dropout
  4.2 Relationship to Gaussian processes
    4.2.1 Neural Network with Dropout
    4.2.2 Gaussian process approximation
    4.2.3 Approximate inference
    4.2.4 Interpretation of the relationship
  4.3 Estimated Uncertainty Measures
    4.3.1 Regression

5 Results of the Evaluation
  5.1 Experimental Setup
  5.2 Evaluation and Method Comparison
  5.3 Evaluation on the MNIST Dataset
    5.3.1 Model Structures and Training
    5.3.2 Visualization of the Predictive Uncertainty
    5.3.3 Variance as Uncertainty Estimate
    5.3.4 Predictive Entropy as Uncertainty Estimate
  5.4 Evaluation on the CIFAR-10 Dataset
    5.4.1 Model Performance

6 Discussion and Conclusion


Chapter 1

Introduction

The general idea of machine learning is to first formalize a given task, and second, to find an adequate description or model that represents the relationships between the formalized quantities, as described in [23]. In its simplest form, the model is a direct mapping between some inputs x and outputs y. However, the modeling process will most often involve some kind of abstraction (generalization) with the goal of identifying the true underlying dependencies. Such an abstraction may also include simplifications to focus on the most influential parts of a relationship.

Whenever a model is learned from limited data, by either identifying the model structure or by tuning parameters of a given structure, the resulting model is inherently uncertain. Modeling decisions about what constitutes an appropriate description of the data can take place on several levels, for example by choosing from different model classes like the Gaussian density, support vector machines, or neural networks. But also within a specific model class, a model is defined by hyper-parameters η and possibly a fixed set of parameters θ that limit its expressiveness. Variabilities in the model class, hyper-parameters, and model parameters will be expressed via distributions over those random variables and will be referred to as model uncertainty in the following. The remaining work will point out how uncertainty can be addressed in models that are used for classification and regression. The focus will be laid on a special class of models, so-called artificial neural networks, and how these models can be extended to also incorporate uncertainty in their parameters. This will eventually lead to Bayesian neural networks and the Gaussian process model.

In the case of classification and regression it is of interest to describe different models not only via distributions over parameters but also via distributions over the resulting predictions p(y|x, θ). Such a distribution over predictions will be called the predictive distribution. Due to tractability issues of analytical solutions, training and inference need to utilize approximations, which gives rise to the idea of approximate uncertainty estimation. In the recent work [8], a technique called Dropout is used to estimate uncertainty in the predictions of neural networks that are used for regression and classification. The last part of the work will be dedicated to a review of Dropout and its relationship to the Gaussian process model.

Originally, Dropout was invented as an efficient method to perform approximate Bayesian model averaging by applying weight scaling that has a regularizing effect on the solution. The novelty of the derivation provided in [8] is to establish a link between Dropout and an approximate Gaussian process model, the so-called sparse Gaussian process. Such a relationship suggests that the predictive distribution determined via Dropout behaves similarly to that of a sparse Gaussian process and can therefore be used to estimate uncertainty in the predictions of a neural network. These uncertainty estimates should then provide information about the model's confidence in the estimated risk. Beyond point predictions of the risk, this may allow detecting instances that are further away from the training distribution. Based on that information, a model may be retrained to refine its predictions for certain instances, or predictions may even be rejected due to high uncertainty in the assigned risk. To evaluate the practical value and possible limitations of this method, the thesis will be concluded by some experiments.
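To make the procedure concrete, the following is a minimal, self-contained sketch of the Monte Carlo dropout idea described above: keep dropout active at test time and average several stochastic forward passes. It is not the thesis code; the toy network, its random weights, the dropout rate, and the number of passes are illustrative assumptions.

```python
# Minimal sketch of Monte Carlo dropout: resample a dropout mask per forward
# pass and use the spread of the outputs as a model uncertainty estimate.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 50)), np.zeros(50)   # input -> hidden (toy weights)
W2, b2 = rng.normal(size=(50, 1)), np.zeros(1)    # hidden -> output
p_drop = 0.5

def stochastic_forward(x):
    """One forward pass with a freshly sampled dropout mask on the hidden layer."""
    h = np.tanh(x @ W1 + b1)
    mask = rng.random(h.shape) > p_drop           # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)                 # inverted dropout scaling
    return h @ W2 + b2

x_star = np.array([[0.3]])                        # a single test input
T = 200
samples = np.stack([stochastic_forward(x_star) for _ in range(T)])

mean = samples.mean(axis=0)                       # approximate predictive mean
var = samples.var(axis=0)                         # model uncertainty estimate
print(mean, var)
```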

1.1 Thesis Outline

Chapter 2 starts with a description of the general principle of probabilistic modeling in the case of regression and classification as an extension to a deterministic model. Section 2.1 will discuss which aspects of the data can be modeled and how the posterior distribution is used to make predictions. Within this section, the true underlying joint distribution p(x, y|θ∗), with the true parameter setting θ∗ that describes the relationships within the observed dataset {x, y}, is assumed to be known exactly. Section 2.2 reviews decision theory, which is concerned with risk minimization. Assuming that the model reflects the true distribution, the risk can be determined exactly. Subsection 2.2.1 extends this idea by considering different parameter settings θ, which represent

Figure 1.1: Two different methods to determine model uncertainty. The left image shows the frequentist view: model uncertainty is induced by uncertainty in the data. Several datasets can be created, for example via re-sampling with replacement. For each dataset, individual parameters θ̂ are estimated. The collection of different parameter estimates induces the so-called sampling distribution of an estimator δ. The image on the right visualizes the Bayesian perspective: the observed data is assumed to be fixed. In contrast to the frequentist view, the uncertainty about the model is expressed via a prior distribution over the parameters p(θ). The knowledge gained from the data D allows deriving the posterior distribution over the model parameters p(θ|D). The first and third row of the right picture show three explicit samples from the prior and posterior, respectively. The model is represented by a Gaussian density function with variable parameters µ and σ², the mean and variance of the Gaussian. As an advantage, the Bayesian derivation allows analytical solutions via the application of Bayes' rule under certain conditions.


different models p(x, y|θ). By introducing a prior distribution over the parameters p(θ), the risk objective can again be solved via integration. On this general level, there is no need to distinguish between model parameters θ and hyper-parameters η because both variable quantities form a specific model, and integration can be extended to η as well. It is only of interest to identify the two different concepts to determine model uncertainty and their impact on the resulting risk objective, as explained and visualized in Figure 1.1. Once a distribution over different model parameters is available, the full posterior predictive distribution can be derived, which represents uncertainty in the posterior risk.

Chapter 3 will cover the Bayesian perspective to treat model uncertainty in more detail. The full Bayesian approach comprises distributions over parameters, hyper-parameters, and even different models. However, analytical solutions are only possible in special cases. Moreover, it is very common to only apply integration over the model parameters, which is necessary to derive the marginal likelihood and the posterior over the model parameters via Bayes' rule. This will become important in Chapter 4, where the relationship between stochastic regularization methods and the Gaussian process model is drawn. Section 3.2 extends the idea of standard neural networks to Bayesian neural networks and discusses the relationship to Gaussian processes in the limit of infinitely large hidden layers. However, in networks whose hidden layers have only a limited number of units, functions that are drawn from the posterior predictive distribution will only be composed of a limited set of basis functions. Further, it is also discussed why the use of adaptive basis functions makes analytic solutions intractable. The relationship between a finite set of units and the sparse Gaussian process model, which is only composed of several frequency components, will then be relevant in Chapter 4. Section 3.3 will conclude the chapter with a comparison of the differences between generative and discriminative models. In this context, the great versatility of the full Gaussian process will be shown via a simple example. Incorporation of a heteroscedastic noise model allows very expressive solutions. This will also be key to understanding that not the Gaussian process model itself is a limiting factor to express uncertainty, but rather that the limitations arise due to the use of a finite amount of basis functions in combination with a constant noise model.


Chapter 4 revisits Dropout in Section 4.1, which was invented as an approximation to Bayesian model averaging. Section 4.2 explains the relationship to the approximate (sparse) Gaussian process. The use of variational inference, combined with a Gaussian mixture distribution as approximation to the posterior distribution over model parameters, allows deriving a similar optimization objective. Subsection 4.2.4 will summarize the implications for uncertainty estimates that are determined by using Dropout as a Bayesian approximation. At this point, the more extensive elaboration provided in Chapter 3 will be useful. Section 4.3 provides different estimated uncertainty measures that can be used for regression and classification tasks.

Chapter 5 shows the results of the practical evaluation of a classification task for two datasets. A comparison over several models is provided to determine the quality of the estimated uncertainty measures. It will turn out that Dropout as approximation may still be of limited practical use and that a fine-grained hyper-parameter evaluation is necessary to even be able to utilize this method. Chapter 6 summarizes the limitations of the method on the basis of the evaluation results.


Chapter 2

Probabilistic Modeling

The following chapter will discuss the extension of deterministic to probabilistic models, in order to be able to incorporate and express uncertainty in observed data. In the first part, the modeled joint distributions are considered to be known in advance. It will be shown how to interpret the posterior distribution and how decisions based on the posterior are assessed by the risk objective. In the second part, model uncertainty will be included and two possible strategies to determine variations in the model are shown.

For machine learning that is concerned with regression, binary classification, or categorical classification, it is often sufficient to abstract the underlying task by a formalization

y = f(x), (2.1)

as explained in [23]. A model describing that relationship may either be non-parametric¹, representing the function values based on all available data points x, or use a parametric form like

y = f(x, ω). (2.2)

The fundamental difference between Equation (2.1) and (2.2) is that the parametric model constrains the total number of parameters to a finite set while the non-parametric model scales with the number of input samples. Examples are a linear model with parameters slope and intercept, or a non-parametric model in the form of a lookup table defining f : x → y.

¹ Except so-called hyperparameters, which control the distribution of the model parameters [3]. In such a case, the data points themselves may be considered as model parameters.


In probabilistic modeling it is inevitable to describe those quantities via probabilities P(X = x), with x as a specific manifestation (event) of a random variable X, probability distributions p_d(X_d) for discrete random variables X_d, or probability densities p_c(X_c) for continuous random variables X_c. Nevertheless, there exist different ways of doing so. Upper case letters and the explicit distinction between densities p_c and distributions p_d are omitted in the following description for the sake of brevity. Depending on the way in which the relationship

p(y, x) = p(x|y) · p(y) = p(y|x) · p(x) (2.3)

is modeled, literature distinguishes generative from discriminative models. A generative model is completely defined by either p(x|y) and p(y) or p(y|x) and p(x), both of which represent the joint probability density as given in Equation (2.3). Such models can be trained in a supervised and unsupervised manner. A discriminative approach omits modeling of the evidence p(x) and directly describes the mapping from inputs x to outputs y via p(y|x). As described in detail by [18], training of conditional probabilities always needs a labeled dataset (supervision) because somebody must provide the division of the data meeting those conditions. Learning unconditioned probabilities like p(x) is the field of unsupervised density estimation. Most examples in this chapter will make use of simple univariate conditional distributions. Note, however, that the joint distribution p(x, y) will always be a multidimensional distribution over at least two dimensions².

Both generative and discriminative models can be non-parametric or parametrized. For example, a simple non-parametric generative model is the Nearest-Neighbor³ classifier. Another generative but parametrized model is the Naive-Bayes⁴ classifier, where each feature dimension is modeled by a one-dimensional Gaussian probability density that is exactly determined by two parameters µ and σ. This simple model can be extended to a single multivariate Gaussian density, which incorporates all pairwise feature correlations via covariances within the covariance

² x ∈ R, y ∈ R → (x, y) ∈ R².
³ The model parameters are the underlying data points; the radius h or the number of neighbors k that control the shape of the distribution are hyperparameters.
⁴ The representation of the data points is reduced to a fixed set of parameters of a probability distribution with a specific form.


matrix Σ⁵, or even be a mixture of several multivariate Gaussian densities forming a Gaussian Mixture Model (GMM). Notice that the use of parametric distributions does not necessarily enforce a parametric model. If single Gaussian densities (kernel functions) are placed above each data point x to describe the complete distribution, the resulting model is a non-parametric one that scales with the number of input points. Examples of discriminative non-parametric models are Support Vector Machines (SVMs) with Radial Basis Function (RBF) kernel or Decision Trees. Linear SVMs and Artificial Neural Networks (ANNs) are examples of discriminative parametric models.

Whenever a parametric approach is chosen, a certain amount of work is shifted from test time to training time to determine and assess appropriate model parameter settings. The way those parameters are estimated differs and will be further described in Section 2.2.1 and Chapter 3.

2.1 Models for Classification and Regression

Let's assume that the joint probability density p(x, y), which describes the probability of occurrence of a pattern with feature value⁶ x and output value y, is given. Further, let K be the number of different unique class labels if the output value is a discrete set of categories. In the case of regression, the output value will be a continuous quantity y ∈ R^D, with D the dimensionality of the output.

Since the generative model defines the joint distribution, Bayes' rule is applied to derive the posterior probability distribution

p(y|x) = p(x, y)/p(x) = (p(x|y) · p(y))/p(x). (2.4)

Equation (2.4) is used to make predictions of the output variable y given some

⁵ All modeled correlations are between feature dimensions and not between all data points, as it is in the case of a Gaussian process.
⁶ In the case of n features instead of just one feature value, x is a vector and an element of the feature space Rⁿ.


(a) Two-dimensional Gaussian density. Ellipses depict points of equal probability density. Blue dots represent 40 samples drawn at random from the multivariate Gaussian. Straight solid lines visualize the direction of the eigenvectors of the covariance matrix. p(x) depicts the marginal density of variable x in the Euclidean basis.

(b) Samples projected onto the axes are distributed according to the marginal probability density in the corresponding reference system. Note, however, that the marginal of a Gaussian density is always Gaussian, independent of the coordinate system, but will differ in its variance.

(c) The resulting marginal distribution p(z) is again Gaussian but only in one dimension. The random variable x is transformed into the random variable z = u_2^T · (x − µ_x) and the new reference system is the eigenbasis (u_1, u_2) of the covariance matrix. The transformation of x into z is a result of the change of the coordinate system.

Figure 2.1: Visualization of the marginalization of a two-dimensional Gaussian density over one dimension. The plots also show some samples drawn according to the joint distribution p(x, y). Imagine a reconstruction of the density from those points. An increase in dimensionality of the feature space leads to a sparse sampling if the number of data points remains constant. To keep the sampling dense, the number of data points needs to be increased exponentially in the number of dimensions. This behaviour is also known as the curse of dimensionality. The figure is based on [3].

feature observation x. Additionally, the probability density of the evidence

p(x) = ∫ p(x|y) · p(y) dy = Σ_{i=1}^{K} p(x|y_i) · P(y_i), (2.5)

which serves as a normalization factor, is derived by the law of total probability. The process of summing (integrating) over latent variables is also called marginalization. The result is independent of the marginalized variable. See an example in Figure 2.1, where a two-dimensional Gaussian density is marginalized over one dimension. The result is again a Gaussian density but only in one dimension.
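As a small illustration, the following hedged sketch evaluates Equation (2.5) for a hypothetical two-class generative model with Gaussian class-conditionals. The priors reuse the values p(y1) = 0.4 and p(y2) = 0.6 from Figure 2.2; all other numbers are made up.

```python
# Evidence p(x) of a two-class generative model via the law of total
# probability: p(x) = sum_i p(x | y_i) * P(y_i).
import numpy as np
from scipy.stats import norm

priors = np.array([0.4, 0.6])                  # P(y1), P(y2)
likelihoods = [norm(loc=-1.0, scale=0.5),       # p(x | y1), illustrative
               norm(loc=1.5, scale=1.0)]        # p(x | y2), illustrative

def evidence(x):
    """Equation (2.5) for discrete class labels."""
    return sum(P * like.pdf(x) for P, like in zip(priors, likelihoods))

print(evidence(0.0))
```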


Note that p(x|y) is a proper probability density in the continuous variable x but not necessarily in y, and is called the likelihood of y in the context of x. This implies ∫ p(x|y) dx = 1 but not necessarily ∫ p(x|y) dy = 1 [3]. In the case of classification, it is also referred to as the class-conditional probability density⁷.

Figure 2.2 visualizes the transformation from the joint probability density to infinitely many posterior distributions for a binary classification task. The generative model is defined by p(x|y) and p(x) (top line). The bottom line shows the derived posteriors and an explicit example for a particular feature constellation x_1. If a discriminative model is used, the posterior is modeled directly, which is also often more intuitive in the case of regression.

2.2 Interpretation of the Posterior

To interpret the posterior, it is necessary to think about how predictions of a model are assessed. This chapter will mostly be based on the work [18] but will also provide an intuitive example, which is presented in [3]. For consistency with Section 2.1, only observable quantities like inputs x ∈ X and outputs y ∈ Y are considered for the moment. In the case of regression, the quality of the model f : X → Y will generally be related to the distance between the model's produced output y′ = f(x) and the true target value y. In classification, the model (classifier) δ : X → A will choose a class label (also called action or decision) a = y_i = δ(x) ∈ A. This decision will either be correct or incorrect, and the severity of a mistake may be addressed by assigning a cost to misclassification. An optimal prediction or classification will be one that minimizes the loss function⁸

ℓ(y, δ(x)), (2.6)

which quantitatively incorporates those quality aspects. In probabilistic modeling, the predictions and decisions will again be based on probability densities and probability distributions describing the relationship between the input and output. Due to the random nature of the variables, the assessment will take place on expectations

⁷ p(x|y) is a density if x is a continuous quantity, or a distribution if x is of discrete nature.
⁸ Without loss of generality, only δ(x) is used in further notations because in regression we can set δ(x) = f(x).


(a) Probability densities of the feature models p(x|y).
(b) Probability density of the evidence p(x).
(c) All possible posterior distributions p(y|x) for p(y_1 = c_1) = 0.4 and p(y_2 = c_2) = 0.6.
(d) One explicit posterior distribution p(y|x_1) for input x_1.

Figure 2.2: Determination of the posterior distribution for a binary classification task applying Bayes' rule. Variations in the input x yield variations in the posteriors p(y|x). Again note that the posterior is a distribution in y and not in x. For each feature value x, an explicit posterior can be derived. The visualization is based on the binary classification example provided in [6].


of the loss

R(p∗, δ) = E_{(x,y)∼p∗}[ℓ] = ∫_x ∫_y ℓ(y, δ(x)) · p∗(x, y) dy dx, (2.7)

with p∗ the true but unknown underlying distribution. The notation (x, y) ∼ p∗ is a shortcut and means that instances of the dataset are distributed according to the distribution p∗. The expected loss in Equation (2.7) is also called the risk or the frequentist risk.

To evaluate the expected loss, assumptions about the joint distribution must be made. The simplest way is to place independent delta-distributions⁹ δ_X(·) over the training set to obtain the empirical distribution

p_emp(x, y) = (1/N) Σ_{i=1}^{N} δ_{x_i}(x) · δ_{y_i}(y). (2.8)

On the basis of the empirical distribution, in combination with Equation (2.7), the empirical risk is derived as

R(p_emp, δ) = (1/N) Σ_{i=1}^{N} ℓ(y_i, δ(x_i)). (2.9)

Evaluation of Equation (2.9) using a 0-1 loss function leads to the misclassification rate. The risk in Equation (2.7) describes the cost of transforming a probability into a decision [6]. To make things more clear, a simple classification example, taken from [3], will be provided next. Assume that a model of the joint probability p(y, x) is given and derived on the basis of some training data. Further, let K be the number of discrete categories or class labels, which makes it possible to replace the integral over y by a sum. The one-dimensional features x are of continuous quantity. Since the true class label for a new instance x is unknown, the best decision will be determined by minimizing the overall expected loss¹⁰

E[ℓ] = Σ_{i∈K} Σ_{k∈K} λ_ik · p(mistake) = Σ_{i∈K} Σ_{k∈K} ( λ_ik · ∫_{R_i} p(y_k, x) dx ), (2.10)

⁹ Notice the difference in notation between the symbols for the delta-distribution δ_X and the decision function δ of the classifier.


(a) Decision regions under a symmetric cost schema. Predictions minimize the error rate by just accounting for the number of misclassifications. The conditional risk risk(y_i = c_1|x) is abbreviated by risk(c_1|x).
(b) Decision regions under an asymmetric cost schema. Predictions of class c_2 have an 80% higher cost if the true and correct class is c_1.

Figure 2.3: Decision regions under different cost schemes can directly be determined via the posterior distribution, as shown in Equation (2.11). The preferred class within a region is determined by the minimum conditional risk and visualized by coloring the area under the curve.

as explained in [3]. R_i denotes the decision region in x in which the classifier will decide for class y_i. The cost λ_ik = λ(y_i, y_k) defines the penalty of deciding for class y_i if the true class is y_k. Note that setting i = k normally will not be considered to be a mistake, but to explicitly account for that, the corresponding penalty λ_ik must be set to zero. Setting the penalty λ_ik = 0 for i = k and λ_ik = 1 for i ≠ k is known as 0-1 loss. Minimizing the weighted probabilities of making a mistake (deciding for the wrong class) will minimize the overall risk under the provided cost setting [6].

As further described in [3], the membership of some feature x to a certain decision region R_i, with corresponding class y_i, can be decided via the minimum of the conditional risk

R_i(x) = argmin_i {risk(y_i|x)} = argmin_i { Σ_{k∈K} λ_ik · p(y_k|x) }. (2.11)

Note the replacement of p(y, x) with p(y|x) · p(x) and the cancellation of the common and constant term p(x). Figure 2.3 visualizes the resulting decision regions for a binary classification task under two different cost schemes. A classifier δ(x) that uses the decision regions determined by Equation (2.11) is called a Bayes estimator or Bayes decision rule [18].
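The decision rule of Equation (2.11) is easy to state in code. The following hedged sketch assumes a hypothetical posterior vector and a cost matrix in which deciding c2 is 80% more expensive when the truth is c1, mirroring Figure 2.3b; all numbers are illustrative.

```python
# Bayes decision rule: pick the class with minimum conditional risk under a
# cost matrix lambda_ik (Equation (2.11)).
import numpy as np

posterior = np.array([0.3, 0.7])        # p(y_k | x) for classes c1, c2 (made up)
cost = np.array([[0.0, 1.0],            # lambda_ik: cost of deciding c_i
                 [1.8, 0.0]])           # when the true class is c_k

cond_risk = cost @ posterior             # risk(y_i | x) = sum_k lambda_ik p(y_k | x)
decision = np.argmin(cond_risk)          # Bayes decision rule
print(cond_risk, "-> decide class", decision + 1)
```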


2.2.1 Posterior Uncertainty

The first observation one can make from Figure 2.2 in Section 2.2 is that the joint distribution p(y, x) and the resulting posterior do not reflect any uncertainty about the model itself. So far, only observable quantities x and y have been considered. For each input x_i, a distinct posterior distribution p(y|x_i) was derived. Different model classes M will most probably yield different posterior solutions. But also within a specific model class, there will be some source of variability, namely the parameters and hyper-parameters that determine the model and therefore the flexibility of the posterior response. Those model parameters represent the hidden internal state of the model, its internal state of belief. To incorporate the dependency on the parameters θ, Equation (2.7) will be adapted to

L(θ∗, δ) = R(p_θ, δ(x)) = E_{(x,y)∼p_θ}[ℓ] = ∫_x ∫_y ℓ(y, δ(x)) · p(x, y|θ∗) dy dx. (2.12)

Equation (2.12) describes the dependency on the hidden state θ∗, which is evaluated on observable quantities y and δ(x). Further, it is assumed that the true underlying distribution p_θ is determined by the true but unknown parameter setting θ∗. Without further assumptions about the true parameters, Equation (2.12) is still unsolvable.

On the one hand, different parameterizations will lead to different posterior solutions p(y|x, θ) that will be more or less reasonable descriptions of the observed data. Some of them may be simpler (underfitted model) and others more complex (overfitted model) than the true underlying relationship between input x and output y. On the other hand, if there is a perfect parameter setting that explains the distribution of the observed data, the only source of variability is the training data itself. The training set can be seen as a set of samples drawn from the true distribution.

This circumstance gives rise to two different ways of addressing uncertainty of the estimated parameters θ̂: the so-called frequentist view in contrast to the Bayesian interpretation. Those differences are covered in detail by [18] but are only summarized briefly in the following. Depending on the source of


uncertainty, two possible expectations can be calculated. First,

R(θ∗, δ) = E_{p(D̃|θ∗)}[L] = ∫_{D̃} L(θ∗, δ(D̃)) · p(D̃|θ∗) dD̃, (2.13)

again called the frequentist expected loss or frequentist risk. Second,

ρ(δ|D, π) = E_{p(θ|D,π)}[L] = ∫_θ L(θ, δ(D)) · p(θ|D, π) dθ, (2.14)

the Bayesian posterior expected loss. Equation (2.13) represents the frequentist perspective. This time, not only the expectation of the loss within a particular observed dataset D is calculated, as already done in Equation (2.12), but the expectation w.r.t. several datasets D̃, sampled from the true distribution, is considered. To obtain uncertainty in the parameters, an estimator¹¹ θ̂ = θ̂(D̃) is applied on several sampled datasets, which induces the sampling distribution of the estimator. The true parameters θ∗ are still unknown. Monte Carlo approximations to estimate the sampling distribution use the empirical distribution obtained by either sampling new data points from p(·|θ̂), using an estimate of the parameters θ̂ = θ̂(D), or by re-sampling the original dataset D directly to obtain a new dataset D̃. The first approach is known as the parametric bootstrap, the latter as the non-parametric bootstrap. Note that putting a prior p(θ∗) over the true parameters makes it possible to integrate Equation (2.13).

A proof is given by [18] and the result

R_B(δ) = ∫_{θ∗} R(θ∗, δ) · p(θ∗) dθ∗, (2.15)

is called the Bayes risk, integrated risk, expected risk, or preposterior risk of an estimator. Optimal decision boundaries can be obtained by minimizing

R_B(δ) = Σ_x ρ(δ(x)|x) · p(x), (2.16)

the posterior expected loss for individual inputs x, which is identical to Equation (2.11).

¹¹ The estimator is usually also denoted by δ(·) but is written as θ̂(·) here to avoid confusion with the decision function.


The Bayesian approach has the advantage of considering the dataset D not as a source of uncertainty but instead as fixed and given. To express uncertainty about the hidden state, a prior distribution p(θ|π) is placed over the parameters. Based on the observed data, the posterior over the parameters p(θ|D, π) is derived and used to determine the posterior expected loss, shown in Equation (2.14).

Once the knowledge about the distribution of the parameters has been gained, it can be used to adjust the posterior predictive distribution for some test points y∗|x∗ by using the law of total probability

p(y∗|x∗, D) = ∫_θ p(y∗|x∗, θ) · p(θ|D) dθ. (2.17)

Equation (2.17) summarizes all possible posterior solutions weighted by the posterior over different parameterizations and can be considered the full posterior predictive distribution. It incorporates the model uncertainty (uncertainty in the parameters) and expresses uncertainty in the risk. Further, placing delta distributions δ_θ̃(·) over the parameters is identical to the case where only observable quantities have been considered.

The replacement of p(θ|D) with δ_θ̃(θ) leads to p(y|x) = p(y|x, θ̃) and is also called a plug-in approximation or a snapshot of the posterior, as shown in Figure 2.2d. Under certain conditions, the posterior predictive distribution can be derived analytically. Figure 2.4 visualizes the prior and full posterior predictive distribution for a Gaussian process model used for regression.
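A hedged sketch of how Equation (2.17) is typically approximated in practice: average the likelihood over Monte Carlo samples of the parameters and compare against the plug-in approximation. The linear-Gaussian model and the synthetic posterior samples below are assumptions for illustration, not the thesis setup.

```python
# Monte Carlo approximation of the full posterior predictive (Equation (2.17)):
# p(y*|x*, D) ~= mean over theta-samples of p(y* | x*, theta).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_samples = rng.normal(loc=2.0, scale=0.3, size=5000)  # stand-in for p(theta|D)
sigma_n = 0.1                                              # observation noise

def predictive_density(y_star, x_star):
    """Average the Gaussian likelihood over posterior parameter samples."""
    return norm.pdf(y_star, loc=theta_samples * x_star, scale=sigma_n).mean()

# Plug-in approximation: replace p(theta|D) with a point mass at its mean.
theta_hat = theta_samples.mean()
plug_in = norm.pdf(1.9, loc=theta_hat * 1.0, scale=sigma_n)
print(predictive_density(1.9, 1.0), plug_in)  # full predictive is broader
```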

So far, two points of view on the topic of model uncertainty were discussed. The resulting posterior predictive distribution accounts for different possible models that are reasonable descriptions of the observed data. The next chapter will go into the details of the Bayesian model approach, which will be the basis for the remaining parts of the work.


(a) Prior solutions without knowledge about the data, purely based on parameter samples drawn from the prior distribution.
(b) Full posterior predictive distribution, derived by Equation (2.17).

Figure 2.4: Prior and full posterior predictive distribution of a Gaussian process model.


Chapter 3

Bayesian Treatment of Uncertainty

As briefly discussed in Chapter 2, the expressiveness and flexibility of a model depend on the parameters θ, hyperparameters η, and the specific model class M. Bayesian inference can be performed on several levels to incorporate knowledge about those variable quantities. The constraint of using a finite amount of parameters lowers the computational demand but limits the expressiveness of the model. Such a limitation is not necessarily a drawback because it may prevent the model from fitting to noise in the data, which likely occurs in high-dimensional feature spaces due to the sparse representation of feature combinations (curse of dimensionality). This chapter will be dedicated to the general Bayesian approach to express model uncertainty by placing distributions over parameters, hyper-parameters, and different model classes. Without simplifications, derivation of the posterior over model parameters, which comprises gained knowledge from observations, is computationally demanding. Parameter tuning in standard neural networks therefore often uses more efficient plug-in approximations like the Maximum-Likelihood or Maximum-A-Posteriori estimate. Section 3.2 will briefly discuss how model uncertainty is introduced in the context of neural networks. Subsection 3.2.1 shows under which conditions such a network will be equivalent to a Gaussian process, and Subsection 3.2.2 touches upon the issue of analytical intractability.

A very intuitive explanation of the Bayesian way of expressing model uncertainty is given in [18], in what is called Bayesian concept learning. The parameters of the model can be interpreted as possible hypotheses h ∈ H that try to explain observations. All possible parameter combinations are part of the hypothesis space H


Method                   Definition
Maximum likelihood       θ̂ = argmax_θ {p(D|θ, η)}
MAP estimation           θ̂ = argmax_θ {p(D|θ, η) · p(θ|η)}
ML-II (Empirical Bayes)  η̂ = argmax_η ∫ p(D|θ, η) · p(θ|η) dθ
MAP-II                   η̂ = argmax_η ∫ p(D|θ, η) · p(θ|η) · p(η) dθ = argmax_η {p(D|η) · p(η)}
Full Bayes               p(θ, η|D) ∝ p(D|θ, η) · p(θ|η) · p(η)

Table 3.1: Overview of different approximations on several levels of the Bayesian hierarchy. Table taken from [18].

that the model is capable of expressing. After observing data, the hypothesis space is reduced to the version space, which contains all hypotheses that are consistent with the data. But the version space may be huge because a lot of very complex parameter settings may yield valid explanations for the observations.

The belief in a hypothesis to explain the data D is expressed by the likelihood p(D|θ, η). The likelihood assigns a probability to hypotheses of the version space in terms of preference. For example, let's consider a regression task where observations y are assumed to be perturbed by Gaussian noise. To encode this knowledge, a Gaussian likelihood may be used, which assigns a low probability value to function values f(x, θ) that are far away from the observations.

However, due to the tendency of the likelihood to choose an overly complex explanation within the version space, the data may be overfitted. The prior distribution p(θ|η) acts as a correction and encodes prior assumptions about the hypothesis space.

The posterior over the parameters is then derived via Bayesian inference and combines the likelihood and the prior by

p(θ|D, η) = (p(D|θ, η) · p(θ|η)) / p(D|η). (3.1)

Equation (3.1) can also be written as

p(θ|D, η) ∝ p(D|θ, η) · p(θ|η). (3.2)


As further described in [18], the posterior will eventually become peaked after observing enough data. The more data there is available, the more dominant the likelihood term becomes. Point estimates are often used as approximations to the likelihood or posterior via

θ̂_ML = argmax_θ {p(D|θ, η)}, (3.3)
θ̂_MAP = argmax_θ {p(θ|D, η)}. (3.4)

If the true parameter setting θ∗ is in the hypothesis space, then the maximum a posteriori estimate θ̂_MAP and the maximum likelihood estimate θ̂_ML will converge upon the same value, which will be the true hypothesis θ∗, in the limit of infinite data. But if the MAP/ML estimates are used as plug-in approximations in the posterior predictive distribution, they may underestimate uncertainty, especially in the case of limited data.

Nonetheless, it may be hard to reason about which internal state of the model is the correct one, especially if the model has several thousands of parameters. The choice of the prior distribution will therefore often be a choice of convenience rather than an informed decision, as also mentioned in [3]. It may be desirable to go further up in the Bayesian hierarchy and define a distribution over the hyperparameters η that control the distribution of the parameters θ|η, known as hierarchical Bayes. The advantage is that the hyperparameter space is often low-dimensional and therefore not so prone to overfitting.

The full Bayesian approach now comprises both latent variables via

p(θ, η|D) = (p(D|θ, η) · p(θ|η) · p(η)) / p(D). (3.5)

Equation (3.5) implies the highest computational burden. The denominator

p(D) = p(D|m) = ∫_η ∫_θ p(D|θ, η) · p(θ|η) · p(η) dθ dη (3.6)

involves integration over all latent quantities. In some cases, analytical integration over the model parameters θ via

p(D|η) = ∫_θ p(D|θ, η) · p(θ|η) dθ (3.7)

is possible. Equation (3.7) is called the marginal likelihood, integrated likelihood, or evidence and plays a central role in Bayesian model selection. After integration over the parameters θ, Equation (3.6) can be written as

p(D) = ∫_η p(D|η) · p(η) dη. (3.8)

Integration over the hyperparameters may be hard, which leads again to approximations. Using a point estimate p(η|D) = δ_η̂(η) with

η̂ = argmax_η {p(D|η)}, (3.9)

replaces the integration with an optimization problem, also known as empirical Bayes or maximum likelihood type-II. Similarly, one can optimize the posterior, which leads to the maximum a posteriori type-II estimate

η̂ = argmax_η {p(D|η) · p(η)}. (3.10)

Note that Equation (3.9) can be obtained from Equation (3.10) by assuming a uniform distribution p(η). A further advantage of Equation (3.9) is that it allows numerical optimization to find the best hyperparameters η̂, instead of using grid search or random search. However, such point estimates may again underestimate uncertainty, similar to the estimates of Equations (3.3) and (3.4). Table 3.1 summarizes all discussed shortcut approximations to the full Bayesian treatment.
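As an illustration of maximum likelihood type-II (Equation (3.9)), the hedged sketch below scores hyperparameters of a Bayesian linear regression model by the log marginal likelihood, using the standard closed-form Gaussian marginal (cf. Equation (3.37), plus observation noise). The data, the polynomial basis functions, and the grid are illustrative assumptions.

```python
# Empirical Bayes via grid search: maximize log p(y | X, eta) with
# y ~ N(0, sigma_w^2 * Phi^T Phi + sigma_n^2 * I).
import numpy as np
from scipy.stats import multivariate_normal
from itertools import product

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 20)
y = np.sin(2 * x) + 0.1 * rng.normal(size=x.size)   # synthetic observations

Phi = np.vstack([np.ones_like(x), x, x**2, x**3])   # fixed basis functions, K = 4

best = None
for s_w, s_n in product([0.1, 0.3, 1.0, 3.0], repeat=2):
    K = s_w**2 * Phi.T @ Phi + s_n**2 * np.eye(x.size)   # marginal covariance
    ll = multivariate_normal(mean=np.zeros(x.size), cov=K).logpdf(y)
    if best is None or ll > best[0]:
        best = (ll, s_w, s_n)
print("log evidence %.2f at sigma_w=%.1f, sigma_n=%.1f" % best)
```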


3.1 Model Selection and Model Averaging

An additional layer in the Bayesian hierarchy can be used to apply Bayesian model selection by deriving the posterior over different models m ∈ M via

p(m|D) = (p(D|m) · p(m)) / Σ_{m∈M} p(m, D). (3.11)

If a flat prior over models p(m) and a flat prior p(η) are applied, the determination of p(D|m) = p(D|θ, η, m) reduces to the determination of the marginal likelihood p(D|η, m). As discussed in [21], integration over the model parameters θ, as shown in Equation (3.7), is the primary difference between a Bayesian treatment and a scheme based on optimization. Bayesian model selection prevents overfitting by an automatic trade-off between model complexity and data fit, an effect known as Bayesian Occam's razor. As explained in detail in Chapter 4, the application of Dropout in a neural network can be interpreted as approximate integration over the model's parameters, which also explains its tendency to avoid overfitting. The best model can be determined by

m̂ = argmax_m {p(m|D)} = argmax_m {p(D|m) · p(m)}. (3.12)

Omitting the integration over the model parameters θ and instead using the maximum likelihood estimate θ̂ = θ̂_ML or the maximum a posteriori estimate θ̂ = θ̂_MAP may again result in overfitting because the most complex parameter setting will yield the greatest likelihood value p(D|θ̂, m).

Instead of picking the best model, one can use the posterior over models to calculate the average prediction via

p(y∗|x∗, D) = Σ_{m∈M} p(y∗|x∗, D, m) · p(m|D), (3.13)

which is known as Bayesian Model Averaging (BMA), according to [18]. As further explained there, BMA is not equal to ensemble learning. The latter is a convex combination of models in a way like

p(y∗|x∗, π) = Σ_{m∈M} π_m · p(y∗|x∗, m). (3.14)

(a) Standard Neural Network (b) Bayesian Neural Network

Figure 3.1: Standard Neural Network and Bayesian Neural Network. Images taken from [4].
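A hedged sketch of Equation (3.13): given per-model predictive distributions and made-up log evidences, the BMA prediction is the posterior-weighted average. Note the contrast with Equation (3.14), where the weights π_m would be free mixture coefficients instead of p(m|D).

```python
# Bayesian model averaging over three hypothetical models and three classes.
import numpy as np

# p(y* | x*, D, m), one row per model m (illustrative values)
per_model_pred = np.array([[0.7, 0.2, 0.1],
                           [0.5, 0.3, 0.2],
                           [0.6, 0.3, 0.1]])
log_evidence = np.array([-10.2, -11.5, -10.8])     # log p(D | m), flat p(m)

w = np.exp(log_evidence - log_evidence.max())      # numerically stable softmax
model_posterior = w / w.sum()                      # p(m | D), Equation (3.11)

bma_pred = model_posterior @ per_model_pred        # Equation (3.13)
print(model_posterior, bma_pred)
```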

3.2 Bayesian Neural Networks

Previous sections described the Bayesian hierarchy and the possibility of efficient approximations. The incorporation of model uncertainty in neural networks allows Bayesian treatment on all discussed levels. The following very simple example will point out how the model parameters can be derived analytically in the case of regression with fixed basis functions. It will also provide insight into the impact of a limited set of basis functions on the uncertainty that is encoded in the posterior predictive distribution.

In a Bayesian Neural Network (BNN), prior distributions over the weights are used instead of scalar model parameters, as is the case in a Standard Neural Network (SNN). BNNs have been extensively studied in [19]. Figure 3.1 visualizes the conceptual difference. The use of prior distributions over the parameters allows treating such networks within the Bayesian framework to derive posterior distributions over the parameters considering the observed data. A simple example can be


constructed in the case of linear regression. The following explanation is based on [21, 3]. Let's assume a one-dimensional input x ∈ R and a one-dimensional output y ∈ R for the moment. To accomplish linear regression, a BNN with one unit and a linear activation function is constructed. Formally, this can be expressed by

f(x, w, b) = w · x + b, (3.15)

with induced prior distributions p(w) = N(µ_w, σ_w²) and p(b) = N(µ_b, σ_b²). In principle, any kind of distribution can be used, but the choice of Gaussian priors will be very convenient because it allows deriving the posterior analytically. In a mathematically equivalent but more compact representation, the weight w and bias b can be combined in a vector to form

f(x, w) = [b, w] · [1, x]^T = w^T · x. (3.16)

The prior distribution of the weight vector is now the following independent multi-dimensional Gaussian

p(w|η) = N([µ_b, µ_w]^T, diag(σ_b², σ_w²)) = N(µ_p, Σ_p). (3.17)

Derivation of the posterior over the model parameters, utilizing Equation (3.1), requires the definition of a likelihood. A usual choice in regression is the Gaussian likelihood, which will be the measure of similarity between an observed value y_i and the constructed linear function f(x, w). This was already discussed in the context of Bayesian concept learning at the beginning of the chapter. The likelihood

p(D|θ, η) = p(y|x, w, η) = N(f(x, w), σ_n² I) (3.18)
= (1/(√(2π) σ_n)) · exp(−(1/2) · (y_i − f(x, w))²/σ_n²), (3.19)

defines an isotropic noise model that expresses the dependency

y = f(x, w) + N(0, σ_n²). (3.20)

The noise level σ_n² is also often replaced by the model precision β = τ = σ_n⁻². Since the likelihood is conjugate to the prior, the posterior is also Gaussian. The analytical derivation of Equation (3.1) is provided by [21, 3] and leads to

p(w|x, y, η) = N(µ_N, Σ_N) (3.21)
µ_N = Σ_N (Σ_p⁻¹ µ_p + σ_n⁻² · x · y_i) (3.22)
Σ_N⁻¹ = Σ_p⁻¹ + σ_n⁻² · x · x^T (3.23)

The derivation of Equation (3.21) for the multidimensional case x ∈ R^Q is analogous. Also, a feature transformation with fixed basis functions¹ (e.g. φ(x) = [1, x, x², x³, . . .]^T) is incorporated easily. Let φ(x) ∈ R^{K×1} be a transformation from the Q-dimensional input space into a K-dimensional feature space. The vector x ∈ R^Q can then simply be replaced by φ(x) ∈ R^K and the new model is f(x, w̃) = w̃^T φ(x).

The posterior predictive distribution is derived similarly to Equation (2.17) and is given by

p(y∗|x∗, x, y, η) = ∫ p(y∗|x∗, w, η) · p(w|x, y, η) dw (3.24)
= N(µ∗, σ∗²). (3.25)

The mean and covariance of the posterior predictive are scalar values for a single predicted point y∗ and are defined by

µ∗ = µ_N^T · φ(x∗), (3.26)
σ∗² = σ_n² + φ(x∗)^T Σ_N φ(x∗). (3.27)

The posterior p(θ|y, η) over the model parameter θ = {w} depends on several hyperparameters η = {µ_b, µ_w, σ_b, σ_w, σ_n²}. Those are subject to Bayesian optimization according to Equation (3.8). There are two great advantages of the Bayesian approach. First, the optimization can be accomplished analytically if the posterior p(D|η) can be written in closed form. Second, Bayesian learning can be performed online, which means incorporating one training point at a time. In the simple example above, Equation (3.21) is evaluated for the first instance of the training set, using the prior mean µ_p and the prior covariance Σ_p. The posterior quantities µ_N, Σ_N are then the priors for the next instance. Figure 3.2a shows an example of linear regression in the input space x ∈ R.
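A minimal sketch of this online update follows, assuming synthetic data from a hypothetical linear model; Equations (3.22) and (3.23) are applied once per observation, with the previous posterior acting as the prior.

```python
# Online Bayesian linear regression: the input is augmented with a constant 1
# so that the bias b is part of the weight vector, as in Equation (3.16).
import numpy as np

rng = np.random.default_rng(4)
sigma_n = 0.2                                  # observation noise std
mu = np.zeros(2)                               # prior mean (mu_b, mu_w)
Sigma = np.eye(2)                              # prior covariance Sigma_p

w_true, b_true = 1.5, -0.5                     # hypothetical ground truth
for _ in range(20):
    x_i = rng.uniform(-1, 1)
    y_i = w_true * x_i + b_true + sigma_n * rng.normal()
    phi = np.array([1.0, x_i])                 # augmented input (1, x)

    Sigma_p_inv = np.linalg.inv(Sigma)                          # old posterior as prior
    Sigma_N_inv = Sigma_p_inv + np.outer(phi, phi) / sigma_n**2 # Equation (3.23)
    Sigma_N = np.linalg.inv(Sigma_N_inv)
    mu = Sigma_N @ (Sigma_p_inv @ mu + phi * y_i / sigma_n**2)  # Equation (3.22)
    Sigma = Sigma_N

print("posterior mean (b, w):", mu.round(2))

# Predictive variance at a test input, Equation (3.27):
phi_star = np.array([1.0, 0.5])
print(sigma_n**2 + phi_star @ Sigma @ phi_star)
```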

As briefly mentioned, Equations (3.22) and (3.23) can be modified to work with transformed inputs φ(x). An example of linear regression with fixed basis functions is shown in Figure 3.2b. The inputs are transformed by three RBFs

φ_i(x) = exp(−(x − c_i)² / (2ℓ²)). (3.28)

The centers c_i = x_i are chosen to coincide with the observations. Figure 3.2b reveals an important aspect of the posterior predictive distribution for a finite set of fixed basis functions: in regions far away from the training points, the influence of each RBF tends to zero. The remaining uncertainty is governed by the observation noise model σ_n², as can be seen from Equation (3.27). This behaviour is also mentioned in [3]. The posterior predictive distribution in Equation (3.24) therefore does not represent a full Gaussian Process (GP) yet, but will converge to one in the limit of infinitely many basis functions K → ∞, as briefly explained in Subsection 3.2.1.

A certain limitation arises in the case of adaptive basis functions. If the BNN incorporates non-linear activation functions σ(·), the dependency on the parameters becomes non-linear as well. Analytical solutions are not available and approximations must be used, as described in Subsection 3.2.2. However, a BNN with one hidden layer and non-linear activation function will also converge to a GP in the limit of infinitely many hidden units, as described by [19]. Some examples are provided in the next subsection. The provided example showed possible shortcomings when the number of basis functions is limited. To provide expressiveness equivalent to a GP, the concept must be extended to an infinite-dimensional space, as will be shown next.

3.2.1 Gaussian processes

The Gaussian process model, loosely speaking, defines a joint distribution in an infinite-dimensional space. Gaussian processes have been studied extensively in [21]. All examples in the previous section used a three-dimensional space. In this section, the set of fixed basis functions is generalized to K dimensions. The major difference


(a) Bayesian linear regression in the input space x.
(b) Bayesian linear regression in the feature space φ(x). Three radial basis functions (K = 3) are used. The center c_i of each function coincides with the corresponding value x_i of the observation. After each update of the mean and covariance, three parameter samples are drawn from the posterior. The dotted lines depict 50 function values, which are calculated via f(x, ŵ) = ŵ^T φ(x) for each parameter sample ŵ. The crosses depict the center points of the RBFs.

Figure 3.2: Bayesian linear regression with fixed basis functions. Three values are observed in total and the predictive distribution is updated stepwise (from left to right). The hyperparameter (length-scale) ℓ², in the case of the RBF, is set to one.


will be that these K dimensions are interpreted as a subspace of an infinite space. Mean vectors and covariance matrices are therefore specific evaluations of a mean and a covariance function. Those functions will then fully determine the Gaussian process. At the end, it is shown under which conditions the equivalence also holds for adaptive basis functions instead of fixed basis functions. But first, let's start with the definition of a Gaussian process.

The probability distribution of a function f(x) is a Gaussian process if for any finite selection of points x⁽¹⁾, x⁽²⁾, . . . , x⁽ᴺ⁾, the marginal density p(f(x⁽¹⁾), f(x⁽²⁾), . . . , f(x⁽ᴺ⁾)) is a Gaussian [17].

The following extension considers several data points and function values jointly and is a summary based on the work [17]. Let X = [x⁽¹⁾, x⁽²⁾, . . . , x⁽ᴺ⁾] ∈ R^{Q×N} be the matrix comprising the N instances x⁽ⁱ⁾ ∈ R^Q of the dataset. The inputs x⁽ⁱ⁾, transformed by the fixed basis functions φ(x⁽ⁱ⁾) ∈ R^K, form the matrix

Φ = Φ(X) = [φ(x⁽¹⁾) φ(x⁽²⁾) . . . φ(x⁽ᴺ⁾)], (3.29)
with elements Φ_{k,i} = φ_k(x⁽ⁱ⁾), k = 1 . . . K, i = 1 . . . N, (3.30)

of dimensionality R^{K×N}. The output of a single-unit BNN that takes K features as input can be described by

f(x⁽ⁱ⁾, w) = w^T · φ(x⁽ⁱ⁾) = Σ_{k=1}^{K} Φ_{k,i} · w_k. (3.31)

For simplicity, the weights w_k are normally distributed with a mean of zero and a variance of σ_w², according to

w ∼ p(w|η) = N(0, σ_w² · I). (3.32)

The individual outputs f⁽ⁱ⁾ that correspond to the input values x⁽ⁱ⁾, with i = 1 . . . N,


can be combined in the vector

y = [f(x⁽¹⁾, w) f(x⁽²⁾, w) . . . f(x⁽ᴺ⁾, w)]^T = (w^T · Φ)^T = Φ^T · w, (3.33)

with dimensionality y ∈ R^{N×1}. The function values, expressed by Equation (3.31),

are a linear combination of Gaussian weights and therefore again Gaussian. It is sufficient to determine the first two moments:

mean[y] = E[y] = Φ^T · E[w] = 0, (3.34)
cov[y] = E[y · y^T] = E[Φ^T w · w^T Φ] (3.35)
       = Φ^T · E[w · w^T] · Φ = σ_w² · Φ^T Φ. (3.36)

The distribution of y is now completely defined and can be written as

y ∼ p(y|η) = N(0, σ_w² Φ^T Φ), (3.37)

with the hyperparameters η = {σ_w²}. Equation (3.37) represents the joint density of the function values, marginalized over the function parameters w. The elements of the covariance matrix can be written as

(σ_w² · Φ^T Φ)_{i,j} = σ_w² · φ(x⁽ⁱ⁾)^T φ(x⁽ʲ⁾) = σ_w² · Σ_{k=1}^{K} φ_k(x⁽ⁱ⁾) · φ_k(x⁽ʲ⁾). (3.38)

The covariance matrix will not have full rank if the number of basis functions K is smaller than the number of data points N. The probability distribution of y will then be confined to a K-dimensional subspace, as explained in [17]. The same was observed in Figure 3.2b, where the number of basis functions was set to K = 3 but the number of predictions was set to M = 50. Far away from the three basis functions, the predictive variance converges to the noise level. In a full Gaussian process, the predictive distribution reverts back to the prior over functions. This is shown in Figure 3.3a, where a full GP is applied to the same data.

According to the definition of a Gaussian process, the function values must be jointly Gaussian for a finite set of arbitrarily many points. The covariance matrix defined in Equation (3.35) can be interpreted as a covariance function k(·, ·) that is evaluated at the N discrete points of the dataset. Such a matrix is known as a Gram matrix.


(a) Full Gaussian process with squared exponential kernel. The same data is used as in Figure 3.2b. The predictive distribution is now a full joint Gaussian of dimensionality M = 50.
(b) Induced smooth prior for a network with 10,000 hidden units and tanh activation function.
(c) Induced Brownian prior for a network with 10,000 hidden units and step activation function (−1/+1).

Figure 3.3: Gaussian process regression with squared exponential kernel and a full covariance matrix in Figure 3.3a. Figures 3.3b and 3.3c are taken from [19] and show induced prior functions for a one-hidden-layer Bayesian neural network with different non-linear activation functions in the hidden layer.

In the case of RBFs, the covariance function is determined in the limit of K → ∞. The basis function for a one-dimensional input x was already defined in Equation (3.28). With uniformly spaced basis functions, the covariance function can be determined by

k(x, x′) = σ_w² · ∫ φ(x)^T φ(x′) dk (3.39)
= σ_w² · ∫ exp(−(x − k)²/(2ℓ²)) · exp(−(x′ − k)²/(2ℓ²)) dk (3.40)
∝ exp(−(x′ − x)²/(4ℓ²)). (3.41)


The result is independent of the integration parameter k and only depends on the points x and x′. The full Gaussian process

f(x) ∼ GP(m(x), k(x, x′)), (3.42)
m(x) = E[f(x)], (3.43)
k(x, x′) = E[(f(x) − m(x)) · (f(x′) − m(x′))], (3.44)

is determined solely by the mean function and kernel function, as described in [21]. The full Gaussian process now provides a consistent framework that can be evaluated at arbitrary points.
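To make the definition tangible, the following hedged sketch evaluates a squared exponential covariance function on a grid (using the common 2ℓ² convention, which differs slightly from the constant in Equation (3.41)) and draws sample functions from the induced joint Gaussian, as in Figures 2.4a and 3.3a. The grid and length-scale are illustrative.

```python
# Sampling functions from a zero-mean GP prior with a squared exponential kernel.
import numpy as np

rng = np.random.default_rng(5)
xs = np.linspace(-3, 3, 100)
ell = 1.0

# Gram matrix: K[i, j] = exp(-(x_i - x_j)^2 / (2 * ell^2))
K = np.exp(-0.5 * (xs[:, None] - xs[None, :])**2 / ell**2)
K += 1e-8 * np.eye(xs.size)          # jitter for numerical stability

# Three sample functions, each a joint draw over all 100 grid points
samples = rng.multivariate_normal(mean=np.zeros(xs.size), cov=K, size=3)
print(samples.shape)                  # (3, 100)
```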

Adaptive basis functions

The extension to adaptive basis functions arises by applying a non-linear activation function to the hidden units of a BNN. Defining a BNN with one hidden layer and non-linear activation σ(·) can be expressed by

f(x, w⁽²⁾, w⁽¹⁾) = Σ_{k=1}^{K} w_k⁽²⁾ · h_k(x, w⁽¹⁾) + w_0⁽²⁾ (3.45)
h_k(x, w⁽¹⁾) = σ( Σ_{i=1}^{I} w_{k,i}⁽¹⁾ · x_i + w_{k,0}⁽¹⁾ ). (3.46)

The fixed basis functions are replaced by K adaptive basis functions h_k(·). The non-linear dependency on the parameters w_{k,i}⁽¹⁾, w_{k,0}⁽¹⁾ of the input layer prevents analytical integration over the parameters. A visualization of a non-linear transformation of a random variable will be provided in the next subsection. However, placing Gaussian distributions over the hidden-to-output weights induces a prior distribution over functions f(x, w). In the limit of K → ∞, this prior converges to a Gaussian process, as shown by [19]. Again, zero-mean Gaussian distributions are used for those weights, so that the expected value of the output h of each hidden unit is also zero. The variance σ_w = c_w · 1/√K, with constant c_w, of the hidden-to-output weights w_k⁽²⁾, w_0⁽²⁾ scales inversely with the number of hidden units. The distribution over functions will be defined by

E[f(x)f(x′)] = σ_{w_0}² + Σ_{i=1}^{K} σ_{w_k}² · E[h_i(x) h_i(x′)] (3.47)
= σ_{w_0}² + c_w · k(x, x′) (3.48)

and will converge, in the limit K → ∞, to a Gaussian process with mean zero and covariance function k(x, x′). The covariance function depends on the type of non-linear activation and the priors over the weights. Note that the individual hidden-to-output weights will converge to zero in the limit of infinitely many hidden units. Figures 3.3b and 3.3c show that different activation functions induce different distributions over BNN output functions.

Figure 3.4: Transformation of a Gaussian density by a non-linear function. The figure is taken from [18].
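The convergence statement can be checked empirically. The sketch below, loosely following Neal's construction, draws random one-hidden-layer tanh networks whose hidden-to-output variance scales as c_w/K and inspects the induced prior over functions; the widths and weight scales are illustrative assumptions.

```python
# Induced function prior of a wide random tanh network (cf. Figure 3.3b).
import numpy as np

rng = np.random.default_rng(6)
K = 10_000                                   # hidden units
xs = np.linspace(-2, 2, 200)[None, :]        # 1-D inputs, shape (1, 200)

def sample_prior_function():
    W1 = rng.normal(0, 5.0, size=(K, 1))     # input-to-hidden weights
    b1 = rng.normal(0, 5.0, size=(K, 1))
    H = np.tanh(W1 @ xs + b1)                # adaptive basis functions h_k(x)
    w2 = rng.normal(0, np.sqrt(1.0 / K), size=(1, K))  # variance c_w / K, c_w = 1
    return (w2 @ H).ravel()                  # one function drawn from the prior

fs = [sample_prior_function() for _ in range(3)]
print(np.std(fs[0]))                          # stays O(1) despite K = 10000 units
```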

3.2.2 Transformation of Random Variables

The previous subsection elaborated on the conceptual difference between Bayesian Neural Networks and the full Gaussian process. The following subsection will discuss why non-linear activation functions make analytical solutions difficult and how approximations can be applied. Chapter 4 will also make intensive use of such methods.

The Gaussian process is a powerful discriminative model that provides uncertainty information via the full posterior predictive distribution over functions and therefore expresses uncertainty in the risk. The qualitative difference between uncertainty in generative and discriminative models will be discussed in Section 3.3. Besides those more general issues, the application of GPs is not free of complications. The most obvious drawback is the computationally demanding inversion of the Gram matrix, which has time complexity O(N³).

The second difficulty is that the use of adaptive basis functions causes non-linear dependencies in the parameters, which lead to transformations of the probability densities. An illustrative example is provided in [18] and shown in Figure 3.4. The random variable x is normally distributed with density

x \sim p(x) = \mathcal{N}(6, 1).   (3.49)

The variable x is transformed by applying a non-linear logistic function of the form

y = f(x) = \frac{1}{1 + \exp(-(x - 5))}.   (3.50)

This will lead to a random variable y, which is distributed according to the transformed density

p_y(y) = p_x(x) \left| \frac{dx}{dy} \right|.   (3.51)
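A minimal sketch of this transformation, using the density from Equation (3.49) and the logistic function from Equation (3.50); the grid bounds are arbitrary choices covering the bulk of p(x).

```python
import numpy as np

def f(x):
    # logistic transformation from Equation (3.50)
    return 1.0 / (1.0 + np.exp(-(x - 5.0)))

def p_x(x, mu=6.0, sigma=1.0):
    # density of x ~ N(6, 1) from Equation (3.49)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(2.0, 10.0, 1000)   # grid over the bulk of p(x)
y = f(x)                           # transformed positions
dy_dx = y * (1.0 - y)              # derivative of the logistic function
p_y = p_x(x) / dy_dx               # Equation (3.51): p_y(y) = p_x(x) |dx/dy|
```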

Integration over the model parameters, as it is performed in Bayesian inference, will mitigate the change in the mode but may be analytically intractable. As an approximation, Monte Carlo methods can be applied.

Monte Carlo Approximation

The transformed density of y = f(X) can be approximated by drawing samples \hat{x} \sim p(X) from the distribution of the random variable X. Applying the function to each sample, f(\hat{x}), will form the empirical distribution of the transformed random variable.


The first two moments can simply be estimated by Monte Carlo integration:

\bar{x} = \mathbb{E}[f(X)] = \int f(x) \cdot p(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(\hat{x}_i),   (3.52)

\mathbb{E}[f(X) f(X)] \approx \frac{1}{N} \sum_{i=1}^{N} (f(\hat{x}_i) - \bar{x})^2.   (3.53)
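A corresponding sketch of Equations (3.52) and (3.53), reusing the logistic example from above; the sample size N is an arbitrary choice.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-(x - 5.0)))

N = 100_000
x_hat = np.random.normal(6.0, 1.0, size=N)   # samples from p(x) = N(6, 1)
y_hat = f(x_hat)                             # push every sample through f

mc_mean = y_hat.mean()                        # estimate of Equation (3.52)
mc_var = ((y_hat - mc_mean) ** 2).mean()      # estimate of Equation (3.53)
```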

3.3 Limitations of Generative and Discriminative Models

Up to this point, several relationships between Standard Neural Networks, Bayesian Neural Networks, and Gaussian processes were discussed. The great versatility and expressiveness of the GP opens up a lot of possibilities. In this section, the conceptual difference between generative and discriminative models will be sketched. Since neural networks are discriminative models, this will help to interpret the modeled uncertainty in the predictive distribution. This will be useful in Chapter 4, where the posterior predictive distribution will be recovered from a set of neural network output functions. The connection to an approximate GP with a constant observation noise model will be established in the first part of the next section, and the resulting uncertainty estimates should therefore be interpreted in this context.

The joint probability distribution p(x, y) models the complete information that was available during training. Therefore, it is also challenging to model because one needs to capture all necessary details in the data with adequate precision.

The advantage of generative models is the possibility to determine the evidence p(x) from the joint distribution by marginalization. This allows assessing the strength of the evidence underlying a posterior prediction. New, yet unseen instances with low probability p(x) may indicate that those samples lie farther from the training distribution, because such instances were only sparsely or not at all present within the training dataset. Identification of such instances is the concern of so-called outlier or novelty detection methods, as described in [2, 22].

A discriminative model, which determines the classification risk p(y|x) directly, does not provide this kind of information any more. The modeling advantage is apparently a reduction in complexity by omitting the intermediate step of modeling p(x|y). However, this loss of information prevents reasoning about the uncertainty of a decision, because both strong and weak evidential values may lead to the same posterior risk.

Figure 3.5 depicts the difference in encoded information between generative and discriminative models. To illustrate the mentioned surjective projection, consider the region of values x < −0.7 and the point x = 0. The total probability of observing a value x in the interval ]−∞, −0.7] is roughly 0.0047 %, and p(x|c_2) is absolutely negligible in that region. This leads to a posterior distribution of p(y|x) = [1.0, 0.0]. The same posterior value is assigned to the region around x = 0, but with a much stronger evidential basis p(x = 0|c_1).

It is important to recognize the assumption that p(x|y) represents the true underlying distribution. A probability of zero means that the occurrence of a certain feature value x is strictly impossible (e.g. due to physical constraints). If this is truly the case, and all instances presented during test time or in production also reflect this distribution, the asserted risk will be correct. Misclassification in regions with low evidence will have little effect if such instances really do occur only very rarely. The maximal posterior class probability is therefore also often interpreted as model confidence, which is only valid if the training set and the test set reflect the same distribution³.

To restrict the number of misclassifications, the decision process may include the application of a threshold on the maximal posterior class probability. Classifications beneath the threshold value are rejected to avoid decisions with high risk. This constrains the probability of misclassification to a certain level.
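The following sketch illustrates both points with two assumed Gaussian class-conditionals (stand-ins, not the exact densities behind Figure 3.5): two inputs receive essentially the same posterior while their evidence p(x) differs by many orders of magnitude, and a simple rejection rule thresholds the maximal posterior probability.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior_and_evidence(x):
    # assumed class-conditionals p(x|c1) = N(0, 0.5^2), p(x|c2) = N(2, 0.5^2)
    lik = np.array([normal_pdf(x, 0.0, 0.5), normal_pdf(x, 2.0, 0.5)])
    prior = np.array([0.5, 0.5])      # P(c1) = P(c2) = 0.5
    joint = lik * prior
    evidence = joint.sum()            # p(x) by marginalization over the classes
    return joint / evidence, evidence

for x in (-3.0, 0.0):
    post, ev = posterior_and_evidence(x)
    # both inputs yield essentially the posterior [1, 0], but the evidence
    # p(x) differs by several orders of magnitude
    print(f"x = {x:+.1f}: posterior = {post.round(4)}, evidence = {ev:.2e}")

def classify_with_reject(x, threshold=0.95):
    # reject classifications whose maximal posterior probability is too low
    post, _ = posterior_and_evidence(x)
    return int(post.argmax()) if post.max() >= threshold else None
```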

However, the training data in real situations will rarely comprise the whole set of all possible feature combinations and their corresponding outputs. A classification decision based on the posterior always minimizes the risk w.r.t. the training data (or identically distributed test data), which is the source of the modeled joint distribution. Low probability values of p(x, y) are therefore either the result of a true association (those values really do occur rarely) or of missing evidence.

If a model is confronted with data that is very dissimilar to the training distribution (not identically distributed), a classification decision may severely underestimate the true misclassification risk. It is desirable to incorporate that knowledge

³ Instances of both sets are independent and identically distributed (i.i.d.). ⁴ Normalization of p(x|y) · p(y) by division by p(x).


(a) Generative model p(x|y). (b) Discriminative model p(y|x).

Figure 3.5: Difference between generative and discriminative models. The prior class probability is P(c_1) = P(c_2) = 0.5. The generative approach, shown on the left, models details in the feature distribution which are not absolutely necessary to determine the posterior p(y|x). This behaviour can be best observed for values of x between −0.7 and 1.3. The additional model complexity of p(x|c_1) has no influence on the posterior because of very low probability values p(x|c_2) in that region. Moreover, the posterior is continued for values x lower than −0.7. The generally low evidence p(x) of the feature x spreads⁴ the posterior.


in a discriminative model approach, to be able to reason about a decision. More expressive models, which provide a certain amount of uncertainty information, may help to combine the advantages of both modeling approaches. One may, for example, want to investigate instances with high uncertainty to refine the model by further training, or reject doubtful classifications.

A very versatile non-parametric model is the Gaussian process, which can be seen as a multidimensional normal distribution that incorporates the whole training set to assess the variability of function values at arbitrary test points. Figure 3.6 shows some intuitive examples that visualize the expressiveness of the GP model. Figure 3.6a shows the probability distribution of the evidence p(x) for a very small training set comprising ten observations. Figure 3.6b models the observed values y via a noise-free GP model⁵, which will lead to very distinct predictive values where observations are obtained. It is simple to express constant observation noise instead of assuming a noise-free model, which will retain a minimum and constant amount of uncertainty. This constant Gaussian noise model is expressed by

y = f(x, \omega) + \epsilon,   (3.54)

\epsilon \sim \mathcal{N}(0, \tau^{-1} I),   (3.55)

with model precision \tau = 1/\sigma_n^2. Instead of constant observation noise, a more elaborate heteroscedastic noise model may allow the noise level to be increased individually where necessary, as shown in Figure 3.6c.

Note that a heteroscedastic noise model could also be used to express a different kind of uncertainty, which is proportional to the probability of the evidence, as shown in Figure 3.6d. However, to obtain that kind of uncertainty, a different objective will be necessary. Risk minimization will lead to solutions similar to Figures 3.6b or 3.6c.

The quality of the uncertainty estimates in regions where no observations are available is strongly dependent on the type of covariance kernel and its hyperparameters. Optimization via the marginal likelihood (ML-II) of the GP will be a trade-off between model fit and model complexity (values of the covariance matrix). Without Bayesian treatment of the hyperparameters, this may also lead to overfitted solutions and an overconfident reduction in uncertainty estimates near observations.

⁵ The variance of the noise σ_n is set to 10⁻⁶ to maintain numerical stability of the matrix inversion problem.


(a) Evidence p(x). (b) GP p(y|x) with noise-free observations. (c) GP p(y|x) with heteroscedastic noise model. (d) GP p(y|x) with heteroscedastic noise model.

Figure 3.6: Visualization of different observation noise models for the Gaussian process model. The dataset comprises ten observations and the GP model uses a squared exponential kernel with magnitude \sigma_f^2 = 16 and length scale \ell^2 = 1.


Chapter 4

Dropout as a Bayesian Approximation

The authors of [8] propose an efficient method to obtain predictive uncertainty estimates utilizing popular Stochastic Regularization Techniques (SRTs). They further state that ANNs of arbitrary depth and with arbitrary non-linearities that use Dropout before each weight layer can be interpreted as an approximation to the deep GP model. To briefly cover the mathematical details, Section 4.1 will describe a SRT known as Dropout or Standard Dropout. Section 4.2 will touch upon the relationship to GPs, and Section 4.3 summarizes how to gather approximate uncertainty information for regression and classification.

4.1 Dropout

Dropout was invented to provide an efficient alternative to BMA. The following section is a short summary of the technique and its interpretation, following the work in [24]. The full Bayesian approach to determine the posterior predictive distribution is to determine the posterior over the parameters θ and derive the regularized solution using Equation (2.17). Since this approach may be computationally very demanding, the authors of [24] propose an approximation that leads to an efficient version during test time.

The idea is to sample subnetwork structures from the original fixed network by disabling (dropping) several units during training, hence the name Dropout. Assuming n units in the original network, Dropout will lead to 2ⁿ possible sparse


(a) Standard Neural Network (b) Dropout applied during training to sample a substructure of the original network.

Figure 4.1: Standard Neural Network with Dropout applied during training. Images taken from [24].

subnetworks. Figure 4.1 shows one possible sampled substructure. To retain computational efficiency, all subnetworks use the same set of parameters (network weights), which are shared during training and combined during test time.
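As a sketch of this sampling step, the following assumes a one-hidden-layer toy network; each Bernoulli mask over the hidden units selects one of the shared-weight subnetworks.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, W2, p_keep=0.5):
    # one stochastic forward pass: the Bernoulli mask over the hidden units
    # selects one of the 2^n possible subnetworks, all of which share W1, W2
    h = np.tanh(x @ W1)
    mask = rng.binomial(1, p_keep, size=h.shape)
    return (h * mask) @ W2

W1, W2 = rng.normal(size=(1, 16)), rng.normal(size=(16, 1))
y = dropout_forward(np.array([[0.3]]), W1, W2)   # one sampled subnetwork's output
```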

They further describe that this can be seen as a Bayesian averaging approach in which each subnetwork, with shared parameters, is weighted with equal probability; in comparison, the BNN weights each parameter setting θ with the posterior p(θ|X, Y). The posterior incorporates the prior distribution and the knowledge from observed training data. Besides this intuitive explanation, a more theoretical derivation is provided by [8], as will be described in the next section.

During test time, the weight averaging is accomplished by rescaling the weights by the probability of retention and disabling Dropout. This method is called Standard Dropout. Interestingly, the authors of [24] also describe that Monte Carlo (MC) model averaging (MC Dropout) would be the more correct but also more expensive way of approximating the BMA approach. Figure 4.2 shows an evaluation on the Modified National Institute of Standards and Technology (MNIST) dataset compared to the Standard Dropout method that performs weight averaging. Their conclusion is that weight averaging yields a pretty good approximation.
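A minimal sketch contrasting the two test-time procedures on a toy network (the weights here are random stand-ins for a trained model, and the architecture is an assumption): Standard Dropout rescales activations by the retention probability, while MC Dropout averages T stochastic forward passes and uses their spread as an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(1, 64))   # stand-in weights of a "trained" toy network
W2 = rng.normal(size=(64, 1))
P_KEEP = 0.5

def standard_dropout(x):
    # weight averaging: Dropout disabled, activations rescaled by p_keep
    return (np.tanh(x @ W1) * P_KEEP) @ W2

def mc_dropout(x, T=100):
    # keep Dropout active at test time and average T stochastic passes;
    # the sample variance serves as an approximate uncertainty estimate
    preds = []
    for _ in range(T):
        mask = rng.binomial(1, P_KEEP, size=(x.shape[0], W1.shape[1]))
        preds.append((np.tanh(x @ W1) * mask) @ W2)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.var(axis=0)

x = np.array([[0.3]])
print(standard_dropout(x))
print(*mc_dropout(x))
```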
