(1)

Machine Learning

Volker Roth

Department of Mathematics & Computer Science University of Basel

(2)

Section 4

Regression

(3)

Regression basics

In regression we assume that a response variable y ∈ ℝ is a noisy function of the input variable x ∈ ℝᵈ:

y = f(x) + η.

We often assume that f is linear, f(x) = wᵗx, and that η has a zero-mean Gaussian distribution with constant variance, η ∼ N(0, σ²).

This can equivalently be written as

p(y|x) = N(µ(x), σ²), with µ(x) = wᵗx. In one dimension: µ(x) = w₀ + w₁x and x = (1, x)ᵗ.

w₀ is the intercept or bias term and w₁ is the slope.

If w₁ > 0, we expect the output to increase as the input increases.

(4)

Least Squares and Maximum Likelihood

Fit n data points (xᵢ, yᵢ) to a model that has d + 1 parameters wⱼ, j = 0, . . . , d.

Notation: x ← (1, x), so that w₀ is the intercept.

Frequentist view: w is an unknown parameter vector, not a random variable.

We assume that the n observations are i.i.d.

Linear model: yᵢ = wᵗxᵢ + ηᵢ, ηᵢ ∼ N(0, σ²).

The observed yᵢ are generated from a normal distribution centered at wᵗxᵢ. The model predicts a linear relationship between the conditional expectation of the observations yᵢ and the inputs xᵢ:

E[yᵢ|xᵢ] = w₀ + w₁xᵢ₁ + · · · + w_d x_id = wᵗxᵢ = f(xᵢ; w).

Note: the expectation operator is linear and E[ηᵢ] = 0.

Regression function = conditional expectation.

(5)

LS and Maximum Likelihood

Likelihood function: conditional probability of all observed yᵢ given their explanation, treated as a function of the model parameters w:

L(w) ∝ ∏ᵢ exp( −(yᵢ − wᵗxᵢ)² / (2σ²) )

Maximizing L = finding the model that best explains the observations:

ŵ = argmax_w L(w) = argmin_w [−L(w)] = argmin_w [−log L(w)] = argmin_w Σᵢ (yᵢ − wᵗxᵢ)²

Least-squares fit = ML solution under a Gaussian error model.

ŵ_MLE minimizes the residual sum of squares

RSS(w) = Σᵢ₌₁ⁿ rᵢ² = Σᵢ₌₁ⁿ [yᵢ − f(xᵢ; w)]² = ‖y − Xw‖².
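As a quick numerical companion (not from the slides), here is a minimal NumPy sketch that generates data from the assumed linear-Gaussian model, solves the normal equations for ŵ, and cross-checks the result with numpy.linalg.lstsq; all sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model y = w^t x + eta, eta ~ N(0, sigma^2)
n, d = 50, 2
sigma = 0.5
w_true = np.array([1.0, 2.0, -3.0])                         # intercept + d slopes (illustrative)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend the 1-column for the intercept
y = X @ w_true + sigma * rng.normal(size=n)

# ML / least-squares estimate via the normal equations: w_hat = (X^t X)^{-1} X^t y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same solution from a numerically more stable solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ w_hat) ** 2)                          # residual sum of squares RSS(w_hat)
print(w_hat, w_lstsq, rss)
```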

(6)

Least squares regression: Geometry

Setting the gradient of RSS to zero:

∂RSS(w)/∂w = ∂/∂w [ yᵗy − 2yᵗXw + wᵗXᵗXw ] = −2Xᵗy + 2XᵗXw = 0

⇒ ŵ = (XᵗX)⁻¹Xᵗy

Xᵗ(y − Xŵ) = Xᵗr̂ = 0.

It follows that Σᵢ₌₁ⁿ Xᵢⱼ r̂ᵢ = 0, ∀ j = 0, 1, . . . , d.

The residual is orthogonal to 1 (j = 0) and to every input dimension X•ⱼ.


Adapted from Fig. 3.2 in (Hastie, Tibshirani, Friedman)

(7)

Least squares regression: Geometry

Adapted from Fig. 3.2 in (Hastie, Tibshirani, Friedman)

The fitted values at the training inputs are

(f̂(x₁), . . . , f̂(xₙ))ᵗ = ŷ = Xŵ = X(XᵗX)⁻¹Xᵗy.

H = X(XᵗX)⁻¹Xᵗ is called the “hat” matrix (it puts the hat on y). The column vectors of X span the column space of X, a subspace of ℝⁿ. Minimizing RSS(w) chooses ŵ such that the residual r is orthogonal to this column space.

The fitted values ŷ are the orthogonal projection of y onto the column space.

(8)

Least squares regression: Algebra

H is the orthogonal projection on the column space of X: HX = X(XᵗX)⁻¹XᵗX = X.

Fundamental theorem of linear algebra: the nullspace of Xᵗ is the orthogonal complement of the column space of X.

M = Iₙ − H is the orthogonal projection on the nullspace of Xᵗ: MX = (Iₙ − H)X = X − X = 0.

H and M are symmetric (Hᵗ = H) and idempotent (MM = M).

The Algebra of Least Squares

H creates fitted values: ŷ = Hy, ŷ ∈ Col(X)

M creates residuals: r̂ = My, r̂ ∈ Null(Xᵗ) ⇔ Xᵗr̂ = 0
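These projection identities are easy to verify numerically. A small sketch, assuming an arbitrary full-column-rank design matrix (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix: projection on Col(X)
M = np.eye(n) - H                           # projection on Null(X^t)

y_hat = H @ y                               # fitted values, in Col(X)
r = M @ y                                   # residuals, in Null(X^t)

print(np.allclose(H @ X, X))                                # H X = X
print(np.allclose(M @ X, np.zeros_like(X)))                 # M X = 0
print(np.allclose(H @ H, H), np.allclose(M @ M, M))         # idempotent
print(np.allclose(X.T @ r, 0))                              # residual orthogonal to every column of X
```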

(9)

Frequentist confidence limits

Recall: yᵢ = f(xᵢ; w) + ηᵢ, with independent Gaussian noise.

In matrix-vector form: y = Xw + η, with η ∼ N(0, σ²Iₙ).

ŵ = (XᵗX)⁻¹Xᵗy
  = (XᵗX)⁻¹XᵗXw + (XᵗX)⁻¹Xᵗη
  = w + (XᵗX)⁻¹Xᵗη

ŵ − w = (XᵗX)⁻¹Xᵗη =: Aη

Linear functions of normals are normal:

η ∼ N(0, σ²Iₙ) ⇒ Aη ∼ N(0, σ²AAᵗ).

Here: A = (XᵗX)⁻¹Xᵗ ⇒ AAᵗ = (XᵗX)⁻¹. Conditioned on X and σ²:

ŵ − w | X, σ² ∼ N(0, σ²(XᵗX)⁻¹).

(10)

Frequentist confidence limits

Distribution completely specified ⇒ confidence limits:

ŵₖ − wₖ ∼ N(0, σ²Sₖₖ),

where Sₖₖ denotes the kth diagonal element of (XᵗX)⁻¹. Thus, both zₖ′ and zₖ = −zₖ′ are standard normal:

zₖ := (wₖ − ŵₖ)/√(σ²Sₖₖ) ∼ N(0, 1)

CDF:

P(zₖ < k_c) = (1/√(2π)) ∫₋∞^{k_c} e^{−t²/2} dt =: Φ(k_c) = 1 − c

Upper limit for wₖ:

P(zₖ < k_c) = P( √(σ²Sₖₖ) zₖ < √(σ²Sₖₖ) k_c )
            = P( wₖ − (wₖ − ŵₖ) > wₖ − √(σ²Sₖₖ) k_c )
            = P( ŵₖ > wₖ − √(σ²Sₖₖ) k_c )
            = P( wₖ < ŵₖ + √(σ²Sₖₖ) k_c ) = 1 − c.

(11)

Frequentist confidence limits


Least-squares fit (red) and two lines with slopes according to upper (lower) 95% confidence limit (green).

(12)

Standard parametric rate

Assume we have estimated the parameters based on n samples:

(ŵₙ − w) ∼ N(0, σ²(XᵗX)⁻¹) = N(0, σ²(XᵗX/n)⁻¹ · 1/n)

√n (ŵₙ − w) ∼ N(0, σ²(XᵗX/n)⁻¹), where XᵗX/n → Σ.

Since for n → ∞, XᵗX/n → Σ = const, this means that ŵₙ converges to w at a rate of 1/√n.

This is a very general result that holds in an asymptotic sense even without assuming normality (central limit theorem).

Due to its universality, it is called the standard parametric rate.
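A tiny simulation can make this rate visible: quadrupling n should roughly halve the estimation error. The model below (one input plus intercept) and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
w_true, sigma = np.array([1.0, 2.0]), 1.0

for n in [100, 400, 1600, 6400]:
    err = []
    for _ in range(200):                        # average over repetitions to reduce noise
        X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
        y = X @ w_true + sigma * rng.normal(size=n)
        w_hat = np.linalg.solve(X.T @ X, X.T @ y)
        err.append(np.linalg.norm(w_hat - w_true))
    # quadrupling n should roughly halve the error (1/sqrt(n) rate)
    print(n, np.mean(err))
```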

(13)

Basis functions

Can be generalized to model non-linear relationships by replacing x with some non-linear function of the inputs, φ(x):

p(y|x) = N(wᵗφ(x), σ²).

Predictions can be based on a linear combination of a set of basis functions φ(x) = {g₀(x), g₁(x), . . . , g_m(x)}, with gᵢ(x): ℝᵈ → ℝ.

Can model the intercept by setting g₀(x) = 1:

f(x; w) = w₀ + w₁g₁(x) + · · · + w_m g_m(x)

⇒ additive models

[Figure: linear (degree 1) and quadratic (degree 2) fits to example data; Fig 1.7 in K. Murphy]
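Since the model stays linear in w, any basis expansion can reuse the ordinary least-squares machinery. A minimal sketch with polynomial basis functions; the target function and noise level are made up for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to the basis (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(4)
x = np.linspace(0, 20, 21)
y_true = 400 + 30 * x - 1.5 * x**2              # hypothetical smooth target
y = y_true + 25 * rng.normal(size=x.size)

for degree in [1, 3, 8]:
    Phi = poly_features(x, degree)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # still ordinary least squares in w
    rss = np.sum((y - Phi @ w_hat) ** 2)
    print(degree, rss)                                 # training RSS shrinks as the degree grows
```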

(14)

Additive models

Examples:

If x ∈ ℝᵈ and m = d + 1, g₀(x) = 1 and gᵢ(x) = xᵢ, i = 1, . . . , d, then f(x; w) = w₀ + w₁x₁ + · · · + w_d x_d.

If x ∈ ℝ, g₀(x) = 1 and gᵢ(x) = xⁱ, i = 1, . . . , m, then f(x; w) = w₀ + w₁x + · · · + w_m xᵐ.

Basis functions can capture various properties of the inputs.

Example: Document analysis

x = text document (collection of words)

gᵢ(x) = 1 if word i appears in the document, 0 otherwise

f(x; w) = w₀ + Σ_{i ∈ words} wᵢ gᵢ(x).

(15)

Additive models cont’d

We can also make predictions by gauging the similarity of examples to prototypes.

For example, our additive regression function could be f(x; w) = w₀ + w₁g₁(x) + · · · + w_m g_m(x), where the basis functions are radial basis functions

g_k(x) = exp( −‖x − x_k‖² / (2σ²) ),

measuring the similarity to the prototypes x_k.

The variance σ² controls how quickly the basis function vanishes as a function of the distance to the prototype.

Training examples themselves could serve as prototypes.
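A possible implementation of such radial basis features, assuming a subset of the training points serves as prototypes; the 1-D data, the sine target, and the bandwidth are purely illustrative.

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """g_k(x) = exp(-||x - x_k||^2 / (2 sigma^2)) for every prototype x_k."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

prototypes = X[::5]                          # use every 5th training point as a prototype
Phi = np.hstack([np.ones((X.shape[0], 1)), rbf_features(X, prototypes, sigma=1.0)])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat.shape)                           # one weight per basis function, plus the intercept
```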

(16)

Additive models cont’d

Can view additive models graphically in terms of units and weights.

[Diagram: the input x feeds m units y₁ = g₁(x), . . . , y_m = g_m(x); their outputs are combined with weights w₁, . . . , w_m to give the prediction f(wᵗy).]

In neural networks the basis functions have adjustable parameters.

(17)

Example: Polynomial regression

[Four panels: polynomial basis functions of degree 1, 3, 8, and 10 fitted to the same data; each panel shows the observations, the true function, and the predicted curve.]

(18)

Complexity and overfitting

With limited training examples our polynomial regression model may achieve zero training error but nevertheless has a large expected error.

training error:   (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ; ŵ))² ≈ 0

expected error:   E_{(x,y)∼p} (y − f(x; ŵ))² ≫ 0

We suffer from over-fitting ⇒ we should reconsider our model (model selection).

We will discuss model selection from a Bayesian perspective first.

A frequentist approach will follow later in the chapter on statistical learning theory.

(19)

Subsection 1 Bayesian Regression

(20)

Bayesian interpretation: priors

Suppose our generative model takes an input x ∈ ℝᵈ and maps it to a real-valued output y according to

p(y|x, w, σ²) = N(y | wᵗx, σ²)

We will keep σ² fixed and only try to estimate w.

Given data D = {(x₁, y₁), . . . , (xₙ, yₙ)}, the likelihood function is

L(w; D) = ∏ᵢ₌₁ⁿ N(yᵢ | wᵗxᵢ, σ²) = ∏ᵢ₌₁ⁿ (1/Z) exp( −(yᵢ − wᵗxᵢ)² / (2σ²) ).

In classical regression we used the maximizing parameters ŵ.

In Bayesian analysis we keep all regression functions, just weighted by their ability to explain the data.

Our knowledge about w after seeing the data is defined by the posterior distribution p(w|D).

(21)

Bayesian regression: Prior and posterior

We specify our prior belief about the parameter values as p(w).

For instance, we could prefer small parameter values:

p(w) = N(w | 0, τ²I)

The smaller τ² is, the smaller the parameter values we prefer prior to seeing the data.

Posterior proportional to prior p(w) times likelihood:

p(w|D) ∝ L(w; D) p(w)

Here the posterior is Gaussian, p(w|D, σ²) = N(w | w_N, V_N), with mean w_N and covariance V_N given by

w_N = (XᵗX + λI)⁻¹Xᵗy,  V_N = σ²(XᵗX + λI)⁻¹,  with λ = σ²/τ².
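These posterior formulas have a direct implementation. A minimal sketch of the w_N, V_N computation, with illustrative data and prior/noise variances:

```python
import numpy as np

def posterior(X, y, sigma2, tau2):
    """Posterior N(w | w_N, V_N) for the conjugate Gaussian model above."""
    lam = sigma2 / tau2
    A = X.T @ X + lam * np.eye(X.shape[1])
    w_N = np.linalg.solve(A, X.T @ y)
    V_N = sigma2 * np.linalg.inv(A)
    return w_N, V_N

rng = np.random.default_rng(6)
n = 25
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
w_true, sigma2, tau2 = np.array([0.5, -1.0]), 0.25, 4.0     # illustrative ground truth and variances
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

w_N, V_N = posterior(X, y, sigma2, tau2)
print(w_N, np.diag(V_N))
```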

(22)

Bayesian regression: Posterior computation

Given variables x ∈ ℝ^{d_x} and y ∈ ℝ^{d_y}, assume a linear Gaussian system:

p(x) = N(x | µ_x, Σ_x)   (prior)
p(y|x) = N(y | Ax + b, Σ_y)   (likelihood)

The posterior is also Gaussian:

p(x|y) = N(x | µ_{x|y}, Σ_{x|y})
Σ_{x|y}⁻¹ = Σ_x⁻¹ + AᵗΣ_y⁻¹A
µ_{x|y} = Σ_{x|y} [ AᵗΣ_y⁻¹(y − b) + Σ_x⁻¹µ_x ].

Gaussian likelihood and Gaussian prior form a conjugate pair.

The normalization constant (denominator in Bayes’ formula) is p(y) = N(y | Aµ_x + b, Σ_y + AΣ_xAᵗ).

(23)

Bayesian regression: Posterior predictive

Prediction of y for a new x: use the posterior as weights for predictions based on the individual w’s. Posterior predictive:

p(y|x, D, σ²) = ∫ p(y|x, w, σ²) p(w|D) dw
              = ∫ N(y | xᵗw, σ²) N(w | w_N, V_N) dw
              = N(y | w_Nᵗx, σ_N²(x)),  with σ_N²(x) = σ² + xᵗV_N x.

The variance of this prediction, σ_N²(x), depends on two terms:

• the variance of the observation noise, σ²
• the variance in the parameters, V_N, which depends on how close x is to the training data D

⇒ error bars get larger as we move away from the training points.
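A short sketch of this posterior predictive formula; the numbers used for w_N, V_N and σ² are illustrative placeholders (for instance, the output of the previous sketch), chosen only to show the predictive standard deviation growing away from the data.

```python
import numpy as np

def predictive(x_new, w_N, V_N, sigma2):
    """Posterior predictive mean and variance sigma_N^2(x) = sigma^2 + x^t V_N x."""
    return w_N @ x_new, sigma2 + x_new @ V_N @ x_new

# Illustrative posterior parameters (e.g. computed as on the previous slide)
w_N = np.array([0.5, -1.0])
V_N = np.array([[0.02, 0.0], [0.0, 0.05]])
sigma2 = 0.25

for x1 in [0.0, 1.0, 5.0]:                   # move away from the (centred) training data
    x = np.array([1.0, x1])                  # leading 1 for the intercept
    mean, var = predictive(x, w_N, V_N, sigma2)
    print(x1, mean, np.sqrt(var))            # predictive standard deviation grows with |x1|
```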

(24)

Bayesian regression: Posterior predictive

By contrast, the plugin approximation uses only the ML parameter estimate, i.e. the degenerate distribution p(w|D, σ²) = δ_ŵ(w):

p(y|x, D, σ²) ≈ ∫ p(y|x, w, σ²) δ_ŵ(w) dw = p(y|x, ŵ, σ²) = N(y | xᵗŵ, σ²).

[Two panels: plugin approximation (MLE) prediction with training data; posterior predictive (known variance) prediction with training data.]

Fig. 7.12 in (K. Murphy). Example with quadratic basis functions: posterior predictive distribution (mean and ±1σ).

(25)

Sampling from posterior predictive

Left: plugin approximation: f(x) = φ(x)ᵗŵ, where φ(x) is the expanded input vector (1, x, x²)ᵗ.

Right: sampled functions φ(x)ᵗw⁽ˢ⁾, where the w⁽ˢ⁾ are samples from the posterior.

[Two panels: functions sampled from the plugin approximation to the posterior; functions sampled from the posterior.]

Fig. 7.12 in (K. Murphy)

(26)

MAP approximation and ridge regression

Posterior proportional to prior p(w) = N(w | 0, τ²I) times likelihood.

The MAP estimate is

w_MAP = argmax_w { log L(w; D) + log p(w) }
      = argmin_w { −log L(w; D) − log p(w) }
      = argmin_w { (1/(2σ²)) Σᵢ (yᵢ − wᵗxᵢ)² + (1/(2τ²)) wᵗw }
      = argmin_w { Σᵢ (yᵢ − wᵗxᵢ)² + (σ²/τ²) wᵗw }
      = argmin_w { Σᵢ (yᵢ − wᵗxᵢ)² + λ wᵗw }

In classical statistics, this is called ridge regression:

w_MAP = w_ridge = (XᵗX + λI)⁻¹Xᵗy.

In regularization theory, this is an example of Tikhonov Regularization.

(27)

Subsection 2 Bayesian model selection

(28)

Example: Polynomial regression

[Four panels: polynomial basis functions of degree 1, 3, 8, and 10 fitted to the same data; each panel shows the observations, the true function, and the predicted curve.]

(29)

Bayesian regression (again)

Suppose our parametrized model F_θ takes an input x ∈ ℝᵈ and maps it to a real-valued output y according to

p(y|x, θ, σ²) = N(y; θᵗx, σ²)

We will keep σ² fixed and only try to estimate θ.

Given data D = {(x₁, y₁), . . . , (xₙ, yₙ)}, define the likelihood

L(θ; D) = ∏ᵢ₌₁ⁿ N(yᵢ; θᵗxᵢ, σ²) = ∏ᵢ₌₁ⁿ (1/Z) exp( −(yᵢ − θᵗxᵢ)² / (2σ²) ).

In classical regression we used the maximizing parameters θ̂.

In Bayesian analysis we keep all regression functions, just weighted by their ability to explain the data.

Knowledge about θ after seeing the data is defined by the posterior p(θ|D).

(30)

Bayesian regression (again)

We specify our prior belief about the parameter values as p(θ).

For instance, we could prefer small parameter values:

p(θ) = N(θ; 0, τ²I)

Small τ² ⇒ small θ preferred prior to seeing the data.

Posterior proportional to prior p(θ) times likelihood:

p(θ|D) ∝ L(θ; D) p(θ)

Normalization constant, a.k.a. marginal likelihood:

p(y|F, X) = ∫ L(θ; D) p(θ|F) dθ,  with L(θ; D) = p(y|θ, X),

depends on model + data, but not on specific parameter values.

(31)

Example: Bayesian regression

Goal: choose among regression model families, specified by different feature mappings x ↦ φ(x).

Example: linear φ₁(x) and quadratic φ₂(x).

The model families we compare are:

F₁: p(y|x, θ₁, σ²) = N(y | θ₁ᵗφ₁(x), σ²)
F₂: p(y|x, θ₂, σ²) = N(y | θ₂ᵗφ₂(x), σ²).

Focusing on p(y|F, X) = ∫ L(θ; D) p(θ) dθ, there are two possibilities:

• F too flexible: the posterior p(θ|D) requires many training examples before it focuses on useful parameter values;
• F too simple: the posterior concentrates quickly, but the predictions remain poor.

Pragmatic choice: Select the family whose marginal likelihood (a.k.a. Bayesian score) is larger.

After seeing data D we would select model F₁ if p(y|F₁, X) > p(y|F₂, X).
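For this conjugate Gaussian setting the marginal likelihood is available in closed form, since y | X ∼ N(0, τ²ΦΦᵗ + σ²Iₙ). A hedged sketch comparing the Bayesian scores of a linear and a quadratic feature map on made-up data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, sigma2, tau2):
    """log p(y | F, X) for y = Phi theta + eta, theta ~ N(0, tau2 I), eta ~ N(0, sigma2 I)."""
    n = Phi.shape[0]
    cov = tau2 * Phi @ Phi.T + sigma2 * np.eye(n)
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

rng = np.random.default_rng(7)
x = rng.uniform(-4, 4, size=30)
y = 0.5 * x**2 - x + rng.normal(scale=1.0, size=30)     # illustrative quadratic ground truth

Phi1 = np.column_stack([np.ones_like(x), x])            # family F1: linear features
Phi2 = np.column_stack([np.ones_like(x), x, x**2])      # family F2: quadratic features

for name, Phi in [("F1", Phi1), ("F2", Phi2)]:
    print(name, log_marginal_likelihood(Phi, y, sigma2=1.0, tau2=4.0))
# The quadratic family should receive the larger Bayesian score on this data.
```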

(32)

[Slides 32–38: example data with polynomial fits (left) and the corresponding log marginal likelihood as a function of the polynomial degree (right).]

(39)

Approximating the marginal likelihood

Problem: In most cases we cannot compute the marginal likelihood in closed form ⇒ approximations are needed.

A specific approximation will lead to the Bayesian Information Criterion (BIC).

Key insight: when computing

p(y|F, X) = ∫ p(y|θ, X) p(θ|F) dθ,

the integrand is a product of two densities ⇒ the integrand itself is an unnormalized density.

Laplace’s approximation uses a clever trick to approximate such integrals...

(40)

Approximation details: Laplace’s Method

Assume the unnormalized density p(θ) has a peak at θ̂. Goal: calculate the normalizing constant

Z_p = ∫ p(θ) dθ

Taylor-expand the logarithm around θ̂:

ln p(θ) ≈ ln p(θ̂) − (c/2)(θ − θ̂)² + · · · ,  where c := −∂²/∂θ² ln p(θ)|_{θ=θ̂}.

(Note that the first-order term vanishes.)

[Sketch: p(θ) and ln p(θ) near the peak θ̂.]

(41)

Laplace’s Method (cont’d)

Approximate p(θ) by the unnormalized Gaussian

Q(θ) := p(θ̂) exp[ −(c/2)(θ − θ̂)² ]

A normalized Gaussian would be:

Q(θ | µ = θ̂, σ²) = (1/Z_Q) exp[ −(θ − θ̂)²/(2σ²) ],  with Z_Q = √(2πσ²) = ∫ exp[ −(θ − θ̂)²/(2σ²) ] dθ.

Approximate Z_p = ∫ p(θ) dθ by

Z_p ≈ ∫ Q(θ) dθ = p(θ̂) ∫ exp[ −(c/2)(θ − θ̂)² ] dθ = p(θ̂) √(2π/c)

(c is the inverse variance).

[Sketch: p(θ) vs. Q(θ) and ln p(θ) vs. ln Q(θ).]

(42)

Laplace’s Method (cont’d)

Multivariate generalization in d dimensions:

second derivative ⇒ Hessian matrix  H_ij = −∂² ln p(θ)/∂θᵢ∂θⱼ |_{θ=θ̂}

Z_p ≈ p(θ̂) ∫ exp[ −(1/2)(θ − θ̂)ᵗ H (θ − θ̂) ] dθ = p(θ̂) √( (2π)ᵈ / |H| ) = p(θ̂) |H/(2π)|^{−1/2},

where the last equation follows from the properties of the determinant: |aM| = aᵈ|M| for M ∈ ℝ^{d×d}, a ∈ ℝ.

Another interpretation: the complicated distribution p(θ) is approximated by a Gaussian centered at the mode θ̂:

p(θ) ≈ N(θ | µ = θ̂, Σ = H⁻¹).
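A one-dimensional illustration of Laplace's method (not from the slides): approximate the normalizer of an unnormalized Gamma-shaped density, for which the exact answer is known, and compare with numerical integration. The shape parameters are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from math import gamma

# Unnormalised density p(theta) = theta^(a-1) * exp(-b*theta) on theta > 0
a, b = 6.0, 2.0
p = lambda t: t**(a - 1) * np.exp(-b * t)

theta_hat = (a - 1) / b                       # mode of p
c = (a - 1) / theta_hat**2                    # c = -d^2/dtheta^2 ln p(theta) at the mode
Z_laplace = p(theta_hat) * np.sqrt(2 * np.pi / c)

Z_numeric, _ = quad(p, 0, np.inf)             # numerical integration of p
Z_exact = gamma(a) / b**a                     # known closed form for this particular example
print(Z_laplace, Z_numeric, Z_exact)          # Laplace comes close to the exact normalizer
```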

(43)

Example: Bayesian logistic regression

Linear logistic regression: the model parameters are simply the weights w.

Likelihood: p(y|x, w) = Ber(y | sigm(wᵗx))

Unfortunately, there is no convenient conjugate prior. Let’s use a standard Gaussian prior: p(w) = N(w | 0, V₀).

Laplace’s approximation of the posterior:

p(w|D) ≈ N(w | ŵ, H⁻¹)

ŵ = argmax_w J[w],  J[w] = log p(y|x, w) (likelihood) + log p(w) (prior)

H = −∇²J(w) |_{w=ŵ}
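A minimal sketch of this Laplace approximation, assuming a few Newton steps suffice to find ŵ; the prior covariance V₀, the data, and the true weights are illustrative.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logreg(X, y, V0):
    """MAP estimate w_map and Laplace posterior N(w | w_map, H^{-1}) for logistic regression."""
    V0_inv = np.linalg.inv(V0)
    w = np.zeros(X.shape[1])
    for _ in range(50):                                       # Newton iterations on J[w]
        mu = sigm(X @ w)
        grad = X.T @ (y - mu) - V0_inv @ w                    # gradient of the log posterior J
        H = X.T @ ((mu * (1 - mu))[:, None] * X) + V0_inv     # H = -Hessian of J at the current w
        w = w + np.linalg.solve(H, grad)
    mu = sigm(X @ w)
    H = X.T @ ((mu * (1 - mu))[:, None] * X) + V0_inv         # -Hessian at the MAP estimate
    return w, np.linalg.inv(H)

rng = np.random.default_rng(8)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < sigm(X @ np.array([0.0, 2.0, -1.0]))).astype(float)   # illustrative labels

w_map, Sigma = laplace_logreg(X, y, V0=10.0 * np.eye(3))
print(w_map, np.sqrt(np.diag(Sigma)))         # posterior mean and marginal standard deviations
```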

(44)

[Panels: training data, log-likelihood contours, the log-unnormalised posterior, and the Laplace approximation to the posterior.]

(45)

Bayesian LOGREG: Approximating the posterior predictive

Posterior ⇒ can compute credible intervals etc.

But in machine learning, interest usually focuses on prediction.

The posterior predictive distribution has the form

p(y|x, D) = ∫ p(y|x, w) p(w|D) dw.

Here (and in most cases), this integral is intractable.

The simplest approximation is the plug-in approximation

p(y = 1|x, D) ≈ p(y = 1|x, w_MAP).

But such a plug-in estimate underestimates the uncertainty.

Better: Monte Carlo approximation

p(y|x, D) ≈ (1/S) Σ_{s=1}^{S} sigm( (w⁽ˢ⁾)ᵗx ),

where w⁽ˢ⁾ ∼ p(w|D) are samples from the Gaussian approximation to the posterior.
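A small sketch of this Monte Carlo approximation, assuming a Gaussian (Laplace) posterior N(w_MAP, Σ) is already available; the posterior parameters and the query point are illustrative placeholders.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_mc(x, w_map, Sigma, S=1000, rng=None):
    """Monte Carlo estimate of p(y=1|x,D) under the Gaussian (Laplace) posterior."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, Sigma, size=S)   # samples w^(s) ~ N(w_map, Sigma)
    return sigm(W @ x).mean()

# Illustrative posterior (e.g. the output of the Laplace sketch above)
w_map = np.array([0.0, 2.0, -1.0])
Sigma = 0.1 * np.eye(3)
x = np.array([1.0, 0.5, -0.5])

print(sigm(w_map @ x))                                           # plug-in estimate
print(predictive_mc(x, w_map, Sigma, rng=np.random.default_rng(9)))   # MC average over samples
```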

(46)

[Three panels: p(y=1|x, wMAP); decision boundaries for sampled w; MC approximation of p(y=1|x).]

(47)

Approximating the marginal likelihood

p(D|F) = ∫ p(D|θ) p(θ|F) dθ
        ≈ p(D|θ̂) p(θ̂|F) |H/(2π)|^{−1/2}   (Laplace)
        ≈ p(D|θ̂) |H/(2π)|^{−1/2}   (flat prior)

log p(D|F) ≈ log p(D|θ̂) − (1/2) log|H| + C,  with θ̂ = θ_MLE in F.

Focus on the last term:

H = Σᵢ₌₁ⁿ Hᵢ,  with Hᵢ = ∇_θ∇_θ log p(Dᵢ|θ).

Let’s approximate each Hᵢ with a fixed matrix H₀:

log|H| = log|nH₀| = log(nᵈ|H₀|) = d log n + log|H₀|.

For model selection, the last term can be dropped, because it is independent of F and n.

log p(D|F) ≈ log p(D|θ̂) − (d/2) log n + C = BIC(F, n|D) + C.
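A hedged sketch of BIC-based model selection for the earlier polynomial-regression example, assuming a known noise variance so that the maximized log-likelihood is just a rescaled RSS; the data-generating function and all constants are made up.

```python
import numpy as np

def gaussian_loglik(y, y_hat, sigma2):
    """Maximised Gaussian log-likelihood for a fixed noise variance sigma2."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - y_hat) ** 2) / sigma2

def bic_score(y, y_hat, d, sigma2):
    """BIC(F, n | D) = log p(D | theta_hat) - (d / 2) * log n."""
    return gaussian_loglik(y, y_hat, sigma2) - 0.5 * d * np.log(y.size)

rng = np.random.default_rng(10)
x = np.linspace(0, 20, 40)
sigma2 = 25.0
y = 400 + 30 * x - 1.5 * x**2 + np.sqrt(sigma2) * rng.normal(size=x.size)   # degree-2 ground truth

for degree in [1, 2, 5, 8]:
    Phi = np.vander(x, degree + 1, increasing=True)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(degree, bic_score(y, Phi @ w_hat, d=degree + 1, sigma2=sigma2))
# The degree-2 family should typically obtain the highest BIC score here.
```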

(48)

Intuitive interpretation of BIC

The Shannon information content of a specific outcome a of a random experiment is

h(a) = −log₂ P(a) = log₂ (1/P(a)).

It measures the “surprise” (in bits): outcomes that are less probable have larger values of surprise.

Information theory: one can find a code so that the number of bits used to encode each symbol a ∈ A is essentially −log₂ P(a).

Here:

−BIC(F, n|D) = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) )  +  (d/2) log₂(n)

(the first sum is the DL of the observations given the model; each term is the surprise of yᵢ).

The sum of surprises of all observations is the description length of the observations given the (most probable) model in F.

(49)

Intuitive interpretation of BIC

Second term: description length of the model. Intuitive explanation:

• The model, i.e. ŵ ∈ ℝᵈ, was estimated based on n samples.
• Can quantize every component into √n levels. Why?
• Remember the standard parametric rate: 1/√n represents the magnitude of the estimation error ⇒ no need for encoding with greater precision.
• Grid of (√n)ᵈ possible values for describing a model.
• We need log₂((√n)ᵈ) = log₂ n^{d/2} = (d/2) log₂ n bits to encode ŵ.

In summary: −BIC = DL(data|model) + DL(model).

Maximizing BIC = minimizing the joint DL of data and model ⇒ Minimum Description Length principle.

(50)

Example: Bayesian logistic regression

Example: polynomial logistic regression, n = 100.

φ₁(x) = (1, x₁, x₂)ᵗ, φ₂(x) = (1, x₁, x₂, (x₁ + x₂)²)ᵗ.

−BIC = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) ) + (d/2) log₂(n)

degree | #(param) | DL(data)   | DL(model)  | BIC score
1      | 3        | 16.36 bits | 9.97 bits  | −26.33
2      | 4        | 15.77 bits | 13.29 bits | −29.06

(51)

Example: Bayesian logistic regression

Example: polynomial logistic regression, n = 100.

φ₁(x) = (1, x₁, x₂)ᵗ, φ₂(x) = (1, x₁, x₂, (x₁ + x₂)²)ᵗ.

−BIC = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) ) + (d/2) log₂(n)

degree | #(param) | DL(data)   | DL(model)  | BIC score
1      | 3        | 58.56 bits | 9.97 bits  | −68.53
2      | 4        | 38.05 bits | 13.29 bits | −51.34

(52)

Subsection 3 Sparse models

(53)

Sparse Models

Sometimes, we have many more dimensions d than training cases n.

The corresponding design matrix X is “short and fat”, rather than “tall and skinny”.

This is called the small n, large d problem.

For example, with gene microarrays, it is common to measure the expression levels of d ≈ 20,000 genes, but to only get n ≈ 100 samples (for instance, from 100 patients).

Q: What is the smallest set of features that can accurately predict the response? (To prevent overfitting, to reduce the cost of building a diagnostic device, or to help with scientific insight into the problem.)

(54)

Bayesian variable selection

Let γⱼ = 1 if feature j is relevant, and let γⱼ = 0 otherwise.

Our goal is to compute the posterior over models

p(γ|D) = exp(−f(γ)) / Σ_{γ′} exp(−f(γ′)),

where f(γ) is the cost function:

f(γ) = −[ log p(D|γ) + log p(γ) ].

For example, suppose we generate n = 20 samples from a d = 10 dimensional linear regression model, yᵢ ∼ N(wᵗxᵢ, σ²), in which K = 5 elements of w are non-zero.

Enumerate all 2¹⁰ = 1024 models and compute p(γ|D) for each one.
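A sketch of this enumeration on illustrative synthetic data. The exact marginal likelihood p(D|γ) would require integrating over the weights (as in Murphy); to keep the sketch short, it is replaced here by the BIC approximation from the previous subsection, together with a uniform prior p(γ). The nonzero indices and all constants are made up.

```python
import itertools
import numpy as np

def log_score(X, y, gamma, sigma2):
    """BIC-style approximation to log p(D | gamma): least-squares fit on the selected columns."""
    cols = np.flatnonzero(gamma)
    n = y.size
    if cols.size == 0:
        rss = np.sum(y ** 2)
    else:
        Xg = X[:, cols]
        w, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ w) ** 2)
    loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2
    return loglik - 0.5 * cols.size * np.log(n)

rng = np.random.default_rng(11)
n, d, sigma2 = 20, 10, 1.0
w_true = np.zeros(d)
w_true[[1, 2, 5, 7, 8]] = rng.normal(scale=3.0, size=5)      # K = 5 relevant features (illustrative)
X = rng.normal(size=(n, d))
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

gammas = list(itertools.product([0, 1], repeat=d))           # all 2^10 = 1024 models
scores = np.array([log_score(X, y, np.array(g), sigma2) for g in gammas])
post = np.exp(scores - scores.max())
post /= post.sum()                                           # approximate p(gamma | D), uniform p(gamma)

incl = (np.array(gammas) * post[:, None]).sum(axis=0)        # marginal inclusion probabilities
print(np.array(gammas[int(np.argmax(post))]))                # MAP model
print(np.round(incl, 2))
```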

(55)

Bayesian variable selection


Fig 13.1 in K. Murphy: Score function f(γ) for all possible models.

(56)

Bayesian variable selection

Fig 13.1 in K. Murphy. Left: Posterior p(model|data) over all 1024 models; the vertical scale has been truncated at 0.1 for clarity. Right: Marginal inclusion probabilities p(γⱼ = 1|D). The true model is {2, 3, 6, 8, 9}.

(57)

Bayesian variable selection

Interpreting the posterior over a large number of models is difficult ⇒ seek summary statistics.

A natural one is the posterior mode, or MAP estimate,

γ̂ = argmax_γ p(γ|D) = argmin_γ f(γ).

However, the mode is often not representative of the full posterior mass. A better summary is the median model, computed using

γ̂ = { j : p(γⱼ = 1|D) > 0.5 }

This requires computing the posterior marginal inclusion probabilities p(γj = 1|D).

(58)

Bayesian variable selection

The above example illustrates the gold standard for variable selection: the problem was small (d = 10), so we were able to compute the full posterior exactly.

Of course, variable selection is most useful in the cases where the number of dimensions is large.

There are 2ᵈ possible models (bit vectors) ⇒ impossible to compute the full posterior in general.

Even finding summaries (MAP, or marginal inclusion probabilities) is intractable ⇒ algorithmic speedups are necessary.

But first, focus on the computation of p(γ|D).
