(1)

Machine Learning

Volker Roth

Department of Mathematics & Computer Science University of Basel

(2)

Section 4

Regression

(3)

Regression basics

In regression we assume that a response variable y ∈ ℝ is a noisy function of the input variable x ∈ ℝᵈ:

y = f(x) + η.

We often assume that f is linear, f(x) = wᵗx, and that η has a zero-mean Gaussian distribution with constant variance, η ∼ N(0, σ²).

This can equivalently be written as

p(y|x) = N(µ(x), σ²), with µ(x) = wᵗx. In one dimension: µ(x) = w₀ + w₁x and x = (1, x)ᵗ.

w₀ is the intercept or bias term and w₁ is the slope.

If w₁ > 0, we expect the output to increase as the input increases.

(4)

Least Squares and Maximum Likelihood

Fit n data points (xᵢ, yᵢ) to a model that has d + 1 parameters wⱼ, j = 0, . . . , d.

Notation: x ← (1, x), so that w₀ is the intercept.

Frequentist view: w is an unknown parameter vector, not a random variable.

We assume that the n observations are i.i.d.

Linear model: yᵢ = wᵗxᵢ + ηᵢ, ηᵢ ∼ N(0, σ²).

The observed yᵢ are generated from a normal distribution centered at wᵗxᵢ. The model predicts a linear relationship between the conditional expectation of the observations yᵢ and the inputs xᵢ:

E[yᵢ|xᵢ] = w₀ + w₁xᵢ₁ + · · · + w_d x_id = wᵗxᵢ = f(xᵢ; w).

Note: the expectation operator is linear and E[ηᵢ] = 0.

Regression function = conditional expectation.

(5)

LS and Maximum Likelihood

Likelihood function: conditional probability of all observed yᵢ given their explanation, treated as a function of the model parameters w:

L(w) ∝ ∏ᵢ exp( −(yᵢ − wᵗxᵢ)² / (2σ²) )

Maximizing L = finding the model that best explains the observations:

ŵ = argmax_w L(w) = argmin_w [−L(w)] = argmin_w [−log L(w)] = argmin_w Σᵢ (yᵢ − wᵗxᵢ)²

Least-squares fit = ML solution under a Gaussian error model.

ŵ_MLE minimizes the residual sum of squares

RSS(w) = Σᵢ₌₁ⁿ rᵢ² = Σᵢ₌₁ⁿ [yᵢ − f(xᵢ; w)]² = ‖y − Xw‖².
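As a quick numerical companion (not from the slides), here is a minimal NumPy sketch that generates data from the assumed linear-Gaussian model, solves the normal equations for ŵ, and cross-checks the result with numpy.linalg.lstsq; all sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model y = w^t x + eta, eta ~ N(0, sigma^2)
n, d = 50, 2
sigma = 0.5
w_true = np.array([1.0, 2.0, -3.0])                         # intercept + d slopes (illustrative)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend the 1-column for the intercept
y = X @ w_true + sigma * rng.normal(size=n)

# ML / least-squares estimate via the normal equations: w_hat = (X^t X)^{-1} X^t y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same solution from a numerically more stable solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ w_hat) ** 2)                          # residual sum of squares RSS(w_hat)
print(w_hat, w_lstsq, rss)
```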

(6)

Least squares regression: Geometry

Setting the gradient of RSS to zero:

∂RSS(w)/∂w = ∂/∂w [ yᵗy − 2yᵗXw + wᵗXᵗXw ] = −2Xᵗy + 2XᵗXw = 0

⇒ ŵ = (XᵗX)⁻¹Xᵗy

Xᵗ(y − Xŵ) = Xᵗr̂ = 0.

It follows that Σᵢ₌₁ⁿ Xᵢⱼ r̂ᵢ = 0, ∀ j = 0, 1, . . . , d.

The residual is orthogonal to 1 (j = 0) and to every input dimension X•ⱼ.


Adapted from Fig. 3.2 in (Hastie, Tibshirani, Friedman)

(7)

Least squares regression: Geometry

Adapted from Fig. 3.2 in (Hastie, Tibshirani, Friedman)

The fitted values at the training inputs are

(f̂(x₁), . . . , f̂(xₙ))ᵗ = ŷ = Xŵ = X(XᵗX)⁻¹Xᵗy.

H = X(XᵗX)⁻¹Xᵗ is called the “hat” matrix (it puts the hat on y). The column vectors of X span the column space of X, a subspace of ℝⁿ. Minimizing RSS(w) chooses ŵ such that the residual r is orthogonal to this column space.

The fitted values ŷ are the orthogonal projection of y onto the column space.

(8)

Least squares regression: Algebra

H is the orthogonal projection on the column space of X: HX = X(XᵗX)⁻¹XᵗX = X.

Fundamental theorem of linear algebra: the nullspace of Xᵗ is the orthogonal complement of the column space of X.

M = Iₙ − H is the orthogonal projection on the nullspace of Xᵗ: MX = (Iₙ − H)X = X − X = 0.

H and M are symmetric (Hᵗ = H) and idempotent (MM = M).

The Algebra of Least Squares

H creates fitted values: ŷ = Hy, ŷ ∈ Col(X)

M creates residuals: r̂ = My, r̂ ∈ Null(Xᵗ) ⇔ Xᵗr̂ = 0
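These projection identities are easy to verify numerically. A small sketch, assuming an arbitrary full-column-rank design matrix (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix: projection on Col(X)
M = np.eye(n) - H                           # projection on Null(X^t)

y_hat = H @ y                               # fitted values, in Col(X)
r = M @ y                                   # residuals, in Null(X^t)

print(np.allclose(H @ X, X))                                # H X = X
print(np.allclose(M @ X, np.zeros_like(X)))                 # M X = 0
print(np.allclose(H @ H, H), np.allclose(M @ M, M))         # idempotent
print(np.allclose(X.T @ r, 0))                              # residual orthogonal to every column of X
```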

(9)

Frequentist confidence limits

Recall: yᵢ = f(xᵢ; w) + ηᵢ, with independent Gaussian noise.

In matrix-vector form: y = Xw + η, with η ∼ N(0, σ²Iₙ).

ŵ = (XᵗX)⁻¹Xᵗy
  = (XᵗX)⁻¹XᵗXw + (XᵗX)⁻¹Xᵗη
  = w + (XᵗX)⁻¹Xᵗη

ŵ − w = (XᵗX)⁻¹Xᵗη =: Aη

Linear functions of normals are normal:

η ∼ N(0, σ²Iₙ) ⇒ Aη ∼ N(0, σ²AAᵗ).

Here: A = (XᵗX)⁻¹Xᵗ ⇒ AAᵗ = (XᵗX)⁻¹. Conditioned on X and σ²:

ŵ − w | X, σ² ∼ N(0, σ²(XᵗX)⁻¹).

(10)

Frequentist confidence limits

Distribution completely specified ⇒ confidence limits:

ŵₖ − wₖ ∼ N(0, σ²Sₖₖ),

where Sₖₖ denotes the kth diagonal element of (XᵗX)⁻¹. Thus, both zₖ′ and zₖ = −zₖ′ are standard normal:

zₖ := (wₖ − ŵₖ)/√(σ²Sₖₖ) ∼ N(0, 1)

CDF:

P(zₖ < k_c) = (1/√(2π)) ∫₋∞^{k_c} e^{−t²/2} dt =: Φ(k_c) = 1 − c

Upper limit for wₖ:

P(zₖ < k_c) = P( √(σ²Sₖₖ) zₖ < √(σ²Sₖₖ) k_c )
            = P( wₖ − (wₖ − ŵₖ) > wₖ − √(σ²Sₖₖ) k_c )
            = P( ŵₖ > wₖ − √(σ²Sₖₖ) k_c )
            = P( wₖ < ŵₖ + √(σ²Sₖₖ) k_c ) = 1 − c.

(11)

Frequentist confidence limits


Least-squares fit (red) and two lines with slopes according to upper (lower) 95% confidence limit (green).

(12)

Standard parametric rate

Assume we have estimated the parameters based on n samples:

(ŵₙ − w) ∼ N(0, σ²(XᵗX)⁻¹) = N(0, σ²(XᵗX/n)⁻¹ · 1/n)

√n (ŵₙ − w) ∼ N(0, σ²(XᵗX/n)⁻¹), where XᵗX/n → Σ.

Since for n → ∞, XᵗX/n → Σ = const, this means that ŵₙ converges to w at a rate of 1/√n.

This is a very general result that holds in an asymptotic sense even without assuming normality (central limit theorem).

Due to its universality, it is called the standard parametric rate.
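A tiny simulation can make this rate visible: quadrupling n should roughly halve the estimation error. The model below (one input plus intercept) and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
w_true, sigma = np.array([1.0, 2.0]), 1.0

for n in [100, 400, 1600, 6400]:
    err = []
    for _ in range(200):                        # average over repetitions to reduce noise
        X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
        y = X @ w_true + sigma * rng.normal(size=n)
        w_hat = np.linalg.solve(X.T @ X, X.T @ y)
        err.append(np.linalg.norm(w_hat - w_true))
    # quadrupling n should roughly halve the error (1/sqrt(n) rate)
    print(n, np.mean(err))
```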

(13)

Basis functions

Can be generalized to model non-linear relationships by replacing x with some non-linear function of the inputs, φ(x):

p(y|x) = N(wᵗφ(x), σ²).

Predictions can be based on a linear combination of a set of basis functions φ(x) = {g₀(x), g₁(x), . . . , g_m(x)}, with gᵢ(x): ℝᵈ → ℝ.

Can model the intercept by setting g₀(x) = 1:

f(x; w) = w₀ + w₁g₁(x) + · · · + w_m g_m(x)

⇒ additive models

[Figure: linear (degree 1) and quadratic (degree 2) fits to example data; Fig 1.7 in K. Murphy]
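Since the model stays linear in w, any basis expansion can reuse the ordinary least-squares machinery. A minimal sketch with polynomial basis functions; the target function and noise level are made up for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to the basis (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(4)
x = np.linspace(0, 20, 21)
y_true = 400 + 30 * x - 1.5 * x**2              # hypothetical smooth target
y = y_true + 25 * rng.normal(size=x.size)

for degree in [1, 3, 8]:
    Phi = poly_features(x, degree)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # still ordinary least squares in w
    rss = np.sum((y - Phi @ w_hat) ** 2)
    print(degree, rss)                                 # training RSS shrinks as the degree grows
```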

(14)

Additive models

Examples:

If x ∈ ℝᵈ and m = d + 1, g₀(x) = 1 and gᵢ(x) = xᵢ, i = 1, . . . , d, then f(x; w) = w₀ + w₁x₁ + · · · + w_d x_d.

If x ∈ ℝ, g₀(x) = 1 and gᵢ(x) = xⁱ, i = 1, . . . , m, then f(x; w) = w₀ + w₁x + · · · + w_m xᵐ.

Basis functions can capture various properties of the inputs.

Example: Document analysis

x = text document (collection of words)

gᵢ(x) = 1 if word i appears in the document, 0 otherwise

f(x; w) = w₀ + Σ_{i ∈ words} wᵢ gᵢ(x).

(15)

Additive models cont’d

We can also make predictions by gauging the similarity of examples to prototypes.

For example, our additive regression function could be f(x; w) = w₀ + w₁g₁(x) + · · · + w_m g_m(x), where the basis functions are radial basis functions

g_k(x) = exp( −‖x − x_k‖² / (2σ²) ),

measuring the similarity to the prototypes x_k.

The variance σ² controls how quickly the basis function vanishes as a function of the distance to the prototype.

Training examples themselves could serve as prototypes.
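A possible implementation of such radial basis features, assuming a subset of the training points serves as prototypes; the 1-D data, the sine target, and the bandwidth are purely illustrative.

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    """g_k(x) = exp(-||x - x_k||^2 / (2 sigma^2)) for every prototype x_k."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

prototypes = X[::5]                          # use every 5th training point as a prototype
Phi = np.hstack([np.ones((X.shape[0], 1)), rbf_features(X, prototypes, sigma=1.0)])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_hat.shape)                           # one weight per basis function, plus the intercept
```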

(16)

Additive models cont’d

Can view additive models graphically in terms of units and weights.

[Diagram: the input x feeds m units y₁ = g₁(x), . . . , y_m = g_m(x); their outputs are combined with weights w₁, . . . , w_m to give the prediction f(wᵗy).]

In neural networks the basis functions have adjustable parameters.

(17)

Example: Polynomial regression

[Four panels: polynomial basis functions of degree 1, 3, 8, and 10 fitted to the same data; each panel shows the observations, the true function, and the predicted curve.]

(18)

Complexity and overfitting

With limited training examples our polynomial regression model may achieve zero training error but nevertheless has a large expected error.

training error:   (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ; ŵ))² ≈ 0

expected error:   E_{(x,y)∼p} (y − f(x; ŵ))² ≫ 0

We suffer from over-fitting ⇒ we should reconsider our model (model selection).

We will discuss model selection from a Bayesian perspective first.

A frequentist approach will follow later in the chapter on statistical learning theory.

(19)

Subsection 1 Bayesian Regression

(20)

Bayesian interpretation: priors

Suppose our generative model takes an input x ∈ ℝᵈ and maps it to a real-valued output y according to

p(y|x, w, σ²) = N(y | wᵗx, σ²)

We will keep σ² fixed and only try to estimate w.

Given data D = {(x₁, y₁), . . . , (xₙ, yₙ)}, the likelihood function is

L(w; D) = ∏ᵢ₌₁ⁿ N(yᵢ | wᵗxᵢ, σ²) = ∏ᵢ₌₁ⁿ (1/Z) exp( −(yᵢ − wᵗxᵢ)² / (2σ²) ).

In classical regression we used the maximizing parameters ŵ.

In Bayesian analysis we keep all regression functions, just weighted by their ability to explain the data.

Our knowledge about w after seeing the data is defined by the posterior distribution p(w|D).

(21)

Bayesian regression: Prior and posterior

We specify our prior belief about the parameter values as p(w).

For instance, we could prefer small parameter values:

p(w) = N(w | 0, τ²I)

The smaller τ² is, the smaller the parameter values we prefer prior to seeing the data.

Posterior proportional to prior p(w) times likelihood:

p(w|D) ∝ L(w; D) p(w)

Here the posterior is Gaussian, p(w|D, σ²) = N(w | w_N, V_N), with mean w_N and covariance V_N given by

w_N = (XᵗX + λI)⁻¹Xᵗy,  V_N = σ²(XᵗX + λI)⁻¹,  with λ = σ²/τ².
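These posterior formulas have a direct implementation. A minimal sketch of the w_N, V_N computation, with illustrative data and prior/noise variances:

```python
import numpy as np

def posterior(X, y, sigma2, tau2):
    """Posterior N(w | w_N, V_N) for the conjugate Gaussian model above."""
    lam = sigma2 / tau2
    A = X.T @ X + lam * np.eye(X.shape[1])
    w_N = np.linalg.solve(A, X.T @ y)
    V_N = sigma2 * np.linalg.inv(A)
    return w_N, V_N

rng = np.random.default_rng(6)
n = 25
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
w_true, sigma2, tau2 = np.array([0.5, -1.0]), 0.25, 4.0     # illustrative ground truth and variances
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

w_N, V_N = posterior(X, y, sigma2, tau2)
print(w_N, np.diag(V_N))
```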

(22)

Bayesian regression: Posterior computation

Given variables x ∈ ℝ^{d_x} and y ∈ ℝ^{d_y}, assume a linear Gaussian system:

p(x) = N(x | µ_x, Σ_x)   (prior)
p(y|x) = N(y | Ax + b, Σ_y)   (likelihood)

The posterior is also Gaussian:

p(x|y) = N(x | µ_{x|y}, Σ_{x|y})
Σ_{x|y}⁻¹ = Σ_x⁻¹ + AᵗΣ_y⁻¹A
µ_{x|y} = Σ_{x|y} [ AᵗΣ_y⁻¹(y − b) + Σ_x⁻¹µ_x ].

Gaussian likelihood and Gaussian prior form a conjugate pair.

The normalization constant (denominator in Bayes’ formula) is p(y) = N(y | Aµ_x + b, Σ_y + AΣ_xAᵗ).

(23)

Bayesian regression: Posterior predictive

Prediction of y for a new x: use the posterior as weights for predictions based on the individual w’s. Posterior predictive:

p(y|x, D, σ²) = ∫ p(y|x, w, σ²) p(w|D) dw
              = ∫ N(y | xᵗw, σ²) N(w | w_N, V_N) dw
              = N(y | w_Nᵗx, σ_N²(x)),  with σ_N²(x) = σ² + xᵗV_N x.

The variance of this prediction, σ_N²(x), depends on two terms:

• the variance of the observation noise, σ²
• the variance in the parameters, V_N, which depends on how close x is to the training data D

⇒ error bars get larger as we move away from the training points.
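A short sketch of this posterior predictive formula; the numbers used for w_N, V_N and σ² are illustrative placeholders (for instance, the output of the previous sketch), chosen only to show the predictive standard deviation growing away from the data.

```python
import numpy as np

def predictive(x_new, w_N, V_N, sigma2):
    """Posterior predictive mean and variance sigma_N^2(x) = sigma^2 + x^t V_N x."""
    return w_N @ x_new, sigma2 + x_new @ V_N @ x_new

# Illustrative posterior parameters (e.g. computed as on the previous slide)
w_N = np.array([0.5, -1.0])
V_N = np.array([[0.02, 0.0], [0.0, 0.05]])
sigma2 = 0.25

for x1 in [0.0, 1.0, 5.0]:                   # move away from the (centred) training data
    x = np.array([1.0, x1])                  # leading 1 for the intercept
    mean, var = predictive(x, w_N, V_N, sigma2)
    print(x1, mean, np.sqrt(var))            # predictive standard deviation grows with |x1|
```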

(24)

Bayesian regression: Posterior predictive

By contrast, the plugin approximation uses only the ML parameter estimate, i.e. the degenerate distribution p(w|D, σ²) = δ_ŵ(w):

p(y|x, D, σ²) ≈ ∫ p(y|x, w, σ²) δ_ŵ(w) dw = p(y|x, ŵ, σ²) = N(y | xᵗŵ, σ²).

[Two panels: plugin approximation (MLE) prediction with training data; posterior predictive (known variance) prediction with training data.]

Fig. 7.12 in (K. Murphy). Example with quadratic basis functions: posterior predictive distribution (mean and ±1σ).

(25)

Sampling from posterior predictive

Left: plugin approximation: f(x) = φ(x)ᵗŵ, where φ(x) is the expanded input vector (1, x, x²)ᵗ.

Right: sampled functions φ(x)ᵗw⁽ˢ⁾, where the w⁽ˢ⁾ are samples from the posterior.

[Two panels: functions sampled from the plugin approximation to the posterior; functions sampled from the posterior.]

Fig. 7.12 in (K. Murphy)

(26)

MAP approximation and ridge regression

Posterior proportional to prior p(w) = N(w | 0, τ²I) times likelihood.

The MAP estimate is

w_MAP = argmax_w { log L(w; D) + log p(w) }
      = argmin_w { −log L(w; D) − log p(w) }
      = argmin_w { (1/(2σ²)) Σᵢ (yᵢ − wᵗxᵢ)² + (1/(2τ²)) wᵗw }
      = argmin_w { Σᵢ (yᵢ − wᵗxᵢ)² + (σ²/τ²) wᵗw }
      = argmin_w { Σᵢ (yᵢ − wᵗxᵢ)² + λ wᵗw }

In classical statistics, this is called ridge regression:

w_MAP = w_ridge = (XᵗX + λI)⁻¹Xᵗy.

In regularization theory, this is an example of Tikhonov Regularization.

(27)

Subsection 2 Bayesian model selection

(28)

Example: Polynomial regression

[Four panels: polynomial basis functions of degree 1, 3, 8, and 10 fitted to the same data; each panel shows the observations, the true function, and the predicted curve.]

(29)

Bayesian regression (again)

Suppose our parametrized model F_θ takes an input x ∈ ℝᵈ and maps it to a real-valued output y according to

p(y|x, θ, σ²) = N(y; θᵗx, σ²)

We will keep σ² fixed and only try to estimate θ.

Given data D = {(x₁, y₁), . . . , (xₙ, yₙ)}, define the likelihood

L(θ; D) = ∏ᵢ₌₁ⁿ N(yᵢ; θᵗxᵢ, σ²) = ∏ᵢ₌₁ⁿ (1/Z) exp( −(yᵢ − θᵗxᵢ)² / (2σ²) ).

In classical regression we used the maximizing parameters θ̂.

In Bayesian analysis we keep all regression functions, just weighted by their ability to explain the data.

Knowledge about θ after seeing the data is defined by the posterior p(θ|D).

(30)

Bayesian regression (again)

We specify our prior belief about the parameter values as p(θ).

For instance, we could prefer small parameter values:

p(θ) = N(θ; 0, τ²I)

Small τ² ⇒ small θ preferred prior to seeing the data.

Posterior proportional to prior p(θ) times likelihood:

p(θ|D) ∝ L(θ; D) p(θ)

Normalization constant, a.k.a. marginal likelihood:

p(y|F, X) = ∫ L(θ; D) p(θ|F) dθ,  with L(θ; D) = p(y|θ, X),

depends on model + data, but not on specific parameter values.

(31)

Example: Bayesian regression

Goal: choose among regression model families, specified by different feature mappings x ↦ φ(x).

Example: linear φ₁(x) and quadratic φ₂(x).

The model families we compare are:

F₁: p(y|x, θ₁, σ²) = N(y | θ₁ᵗφ₁(x), σ²)
F₂: p(y|x, θ₂, σ²) = N(y | θ₂ᵗφ₂(x), σ²).

Focusing on p(y|F, X) = ∫ L(θ; D) p(θ) dθ, there are two possibilities:

• F too flexible: the posterior p(θ|D) requires many training examples before it focuses on useful parameter values;
• F too simple: the posterior concentrates quickly, but the predictions remain poor.

Pragmatic choice: Select the family whose marginal likelihood (a.k.a. Bayesian score) is larger.

After seeing data D we would select model F₁ if p(y|F₁, X) > p(y|F₂, X).
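For this conjugate Gaussian setting the marginal likelihood is available in closed form, since y | X ∼ N(0, τ²ΦΦᵗ + σ²Iₙ). A hedged sketch comparing the Bayesian scores of a linear and a quadratic feature map on made-up data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, sigma2, tau2):
    """log p(y | F, X) for y = Phi theta + eta, theta ~ N(0, tau2 I), eta ~ N(0, sigma2 I)."""
    n = Phi.shape[0]
    cov = tau2 * Phi @ Phi.T + sigma2 * np.eye(n)
    return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

rng = np.random.default_rng(7)
x = rng.uniform(-4, 4, size=30)
y = 0.5 * x**2 - x + rng.normal(scale=1.0, size=30)     # illustrative quadratic ground truth

Phi1 = np.column_stack([np.ones_like(x), x])            # family F1: linear features
Phi2 = np.column_stack([np.ones_like(x), x, x**2])      # family F2: quadratic features

for name, Phi in [("F1", Phi1), ("F2", Phi2)]:
    print(name, log_marginal_likelihood(Phi, y, sigma2=1.0, tau2=4.0))
# The quadratic family should receive the larger Bayesian score on this data.
```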

(32)

[Slides 32–38: example data with polynomial fits (left) and the corresponding log marginal likelihood as a function of the polynomial degree (right).]

(39)

Approximating the marginal likelihood

Problem: In most cases we cannot compute the marginal likelihood in closed form ⇒ approximations are needed.

A specific approximation will lead to the Bayesian Information Criterion (BIC).

Key insight: when computing

p(y|F, X) = ∫ p(y|θ, X) p(θ|F) dθ,

the integrand is a product of two densities ⇒ the integrand itself is an unnormalized density.

Laplace’s approximation uses a clever trick to approximate such integrals...

(40)

Approximation details: Laplace’s Method

Assume the unnormalized density p(θ) has a peak at θ̂. Goal: calculate the normalizing constant

Z_p = ∫ p(θ) dθ

Taylor-expand the logarithm around θ̂:

ln p(θ) ≈ ln p(θ̂) − (c/2)(θ − θ̂)² + · · · ,  where c := −∂²/∂θ² ln p(θ)|_{θ=θ̂}.

(Note that the first-order term vanishes.)

[Sketch: p(θ) and ln p(θ) near the peak θ̂.]

(41)

Laplace’s Method (cont’d)

Approximate p(θ) by the unnormalized Gaussian

Q(θ) := p(θ̂) exp[ −(c/2)(θ − θ̂)² ]

A normalized Gaussian would be:

Q(θ | µ = θ̂, σ²) = (1/Z_Q) exp[ −(θ − θ̂)²/(2σ²) ],  with Z_Q = √(2πσ²) = ∫ exp[ −(θ − θ̂)²/(2σ²) ] dθ.

Approximate Z_p = ∫ p(θ) dθ by

Z_p ≈ ∫ Q(θ) dθ = p(θ̂) ∫ exp[ −(c/2)(θ − θ̂)² ] dθ = p(θ̂) √(2π/c)

(c is the inverse variance).

[Sketch: p(θ) vs. Q(θ) and ln p(θ) vs. ln Q(θ).]

(42)

Laplace’s Method (cont’d)

Multivariate generalization in d dimensions:

second derivative ⇒ Hessian matrix  H_ij = −∂² ln p(θ)/∂θᵢ∂θⱼ |_{θ=θ̂}

Z_p ≈ p(θ̂) ∫ exp[ −(1/2)(θ − θ̂)ᵗ H (θ − θ̂) ] dθ = p(θ̂) √( (2π)ᵈ / |H| ) = p(θ̂) |H/(2π)|^{−1/2},

where the last equation follows from the properties of the determinant: |aM| = aᵈ|M| for M ∈ ℝ^{d×d}, a ∈ ℝ.

Another interpretation: the complicated distribution p(θ) is approximated by a Gaussian centered at the mode θ̂:

p(θ) ≈ N(θ | µ = θ̂, Σ = H⁻¹).
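A one-dimensional illustration of Laplace's method (not from the slides): approximate the normalizer of an unnormalized Gamma-shaped density, for which the exact answer is known, and compare with numerical integration. The shape parameters are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from math import gamma

# Unnormalised density p(theta) = theta^(a-1) * exp(-b*theta) on theta > 0
a, b = 6.0, 2.0
p = lambda t: t**(a - 1) * np.exp(-b * t)

theta_hat = (a - 1) / b                       # mode of p
c = (a - 1) / theta_hat**2                    # c = -d^2/dtheta^2 ln p(theta) at the mode
Z_laplace = p(theta_hat) * np.sqrt(2 * np.pi / c)

Z_numeric, _ = quad(p, 0, np.inf)             # numerical integration of p
Z_exact = gamma(a) / b**a                     # known closed form for this particular example
print(Z_laplace, Z_numeric, Z_exact)          # Laplace comes close to the exact normalizer
```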

(43)

Example: Bayesian logistic regression

Linear logistic regression: the model parameters are simply the weights w.

Likelihood: p(y|x, w) = Ber(y | sigm(wᵗx))

Unfortunately, there is no convenient conjugate prior. Let’s use a standard Gaussian prior: p(w) = N(w | 0, V₀).

Laplace’s approximation of the posterior:

p(w|D) ≈ N(w | ŵ, H⁻¹)

ŵ = argmax_w J[w],  J[w] = log p(y|x, w) (likelihood) + log p(w) (prior)

H = −∇²J(w) |_{w=ŵ}
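A minimal sketch of this Laplace approximation, assuming a few Newton steps suffice to find ŵ; the prior covariance V₀, the data, and the true weights are illustrative.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logreg(X, y, V0):
    """MAP estimate w_map and Laplace posterior N(w | w_map, H^{-1}) for logistic regression."""
    V0_inv = np.linalg.inv(V0)
    w = np.zeros(X.shape[1])
    for _ in range(50):                                       # Newton iterations on J[w]
        mu = sigm(X @ w)
        grad = X.T @ (y - mu) - V0_inv @ w                    # gradient of the log posterior J
        H = X.T @ ((mu * (1 - mu))[:, None] * X) + V0_inv     # H = -Hessian of J at the current w
        w = w + np.linalg.solve(H, grad)
    mu = sigm(X @ w)
    H = X.T @ ((mu * (1 - mu))[:, None] * X) + V0_inv         # -Hessian at the MAP estimate
    return w, np.linalg.inv(H)

rng = np.random.default_rng(8)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < sigm(X @ np.array([0.0, 2.0, -1.0]))).astype(float)   # illustrative labels

w_map, Sigma = laplace_logreg(X, y, V0=10.0 * np.eye(3))
print(w_map, np.sqrt(np.diag(Sigma)))         # posterior mean and marginal standard deviations
```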

(44)

[Panels: training data, log-likelihood contours, the log-unnormalised posterior, and the Laplace approximation to the posterior.]

(45)

Bayesian LOGREG: Approximating the posterior predictive

Posterior ⇒ can compute credible intervals etc.

But in machine learning, interest usually focuses on prediction.

The posterior predictive distribution has the form

p(y|x, D) = ∫ p(y|x, w) p(w|D) dw.

Here (and in most cases), this integral is intractable.

The simplest approximation is the plug-in approximation

p(y = 1|x, D) ≈ p(y = 1|x, w_MAP).

But such a plug-in estimate underestimates the uncertainty.

Better: Monte Carlo approximation

p(y|x, D) ≈ (1/S) Σ_{s=1}^{S} sigm( (w⁽ˢ⁾)ᵗx ),

where w⁽ˢ⁾ ∼ p(w|D) are samples from the Gaussian approximation to the posterior.
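A small sketch of this Monte Carlo approximation, assuming a Gaussian (Laplace) posterior N(w_MAP, Σ) is already available; the posterior parameters and the query point are illustrative placeholders.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_mc(x, w_map, Sigma, S=1000, rng=None):
    """Monte Carlo estimate of p(y=1|x,D) under the Gaussian (Laplace) posterior."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.multivariate_normal(w_map, Sigma, size=S)   # samples w^(s) ~ N(w_map, Sigma)
    return sigm(W @ x).mean()

# Illustrative posterior (e.g. the output of the Laplace sketch above)
w_map = np.array([0.0, 2.0, -1.0])
Sigma = 0.1 * np.eye(3)
x = np.array([1.0, 0.5, -0.5])

print(sigm(w_map @ x))                                           # plug-in estimate
print(predictive_mc(x, w_map, Sigma, rng=np.random.default_rng(9)))   # MC average over samples
```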

(46)

[Three panels: p(y=1|x, wMAP); decision boundaries for sampled w; MC approximation of p(y=1|x).]

(47)

Approximating the marginal likelihood

p(D|F) = ∫ p(D|θ) p(θ|F) dθ
        ≈ p(D|θ̂) p(θ̂|F) |H/(2π)|^{−1/2}   (Laplace)
        ≈ p(D|θ̂) |H/(2π)|^{−1/2}   (flat prior)

log p(D|F) ≈ log p(D|θ̂) − (1/2) log|H| + C,  with θ̂ = θ_MLE in F.

Focus on the last term:

H = Σᵢ₌₁ⁿ Hᵢ,  with Hᵢ = ∇_θ∇_θ log p(Dᵢ|θ).

Let’s approximate each Hᵢ with a fixed matrix H₀:

log|H| = log|nH₀| = log(nᵈ|H₀|) = d log n + log|H₀|.

For model selection, the last term can be dropped, because it is independent of F and n.

log p(D|F) ≈ log p(D|θ̂) − (d/2) log n + C = BIC(F, n|D) + C.
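A hedged sketch of BIC-based model selection for the earlier polynomial-regression example, assuming a known noise variance so that the maximized log-likelihood is just a rescaled RSS; the data-generating function and all constants are made up.

```python
import numpy as np

def gaussian_loglik(y, y_hat, sigma2):
    """Maximised Gaussian log-likelihood for a fixed noise variance sigma2."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - y_hat) ** 2) / sigma2

def bic_score(y, y_hat, d, sigma2):
    """BIC(F, n | D) = log p(D | theta_hat) - (d / 2) * log n."""
    return gaussian_loglik(y, y_hat, sigma2) - 0.5 * d * np.log(y.size)

rng = np.random.default_rng(10)
x = np.linspace(0, 20, 40)
sigma2 = 25.0
y = 400 + 30 * x - 1.5 * x**2 + np.sqrt(sigma2) * rng.normal(size=x.size)   # degree-2 ground truth

for degree in [1, 2, 5, 8]:
    Phi = np.vander(x, degree + 1, increasing=True)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print(degree, bic_score(y, Phi @ w_hat, d=degree + 1, sigma2=sigma2))
# The degree-2 family should typically obtain the highest BIC score here.
```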

(48)

Intuitive interpretation of BIC

The Shannon information content of a specific outcome a of a random experiment is

h(a) = −log₂ P(a) = log₂ (1/P(a)).

It measures the “surprise” (in bits): outcomes that are less probable have larger values of surprise.

Information theory: one can find a code so that the number of bits used to encode each symbol a ∈ A is essentially −log₂ P(a).

Here:

−BIC(F, n|D) = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) )  +  (d/2) log₂(n)

(the first sum is the DL of the observations given the model; each term is the surprise of yᵢ).

The sum of surprises of all observations is the description length of the observations given the (most probable) model in F.

(49)

Intuitive interpretation of BIC

Second term: description length of the model. Intuitive explanation:

• The model, i.e. ŵ ∈ ℝᵈ, was estimated based on n samples.
• Can quantize every component into √n levels. Why?
• Remember the standard parametric rate: 1/√n represents the magnitude of the estimation error ⇒ no need for encoding with greater precision.
• Grid of (√n)ᵈ possible values for describing a model.
• We need log₂((√n)ᵈ) = log₂ n^{d/2} = (d/2) log₂ n bits to encode ŵ.

In summary: −BIC = DL(data|model) + DL(model).

Maximizing BIC = minimizing the joint DL of data and model ⇒ Minimum Description Length principle.

(50)

Example: Bayesian logistic regression

Example: polynomial logistic regression, n = 100.

φ₁(x) = (1, x₁, x₂)ᵗ, φ₂(x) = (1, x₁, x₂, (x₁ + x₂)²)ᵗ.

−BIC = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) ) + (d/2) log₂(n)

degree | #(param) | DL(data)   | DL(model)  | BIC score
1      | 3        | 16.36 bits | 9.97 bits  | −26.33
2      | 4        | 15.77 bits | 13.29 bits | −29.06

(51)

Example: Bayesian logistic regression

Example: polynomial logistic regression, n = 100.

φ₁(x) = (1, x₁, x₂)ᵗ, φ₂(x) = (1, x₁, x₂, (x₁ + x₂)²)ᵗ.

−BIC = Σᵢ₌₁ⁿ ( −log₂ p(yᵢ|xᵢ, ŵ) ) + (d/2) log₂(n)

degree | #(param) | DL(data)   | DL(model)  | BIC score
1      | 3        | 58.56 bits | 9.97 bits  | −68.53
2      | 4        | 38.05 bits | 13.29 bits | −51.34

(52)

Subsection 3 Sparse models

(53)

Sparse Models

Sometimes, we have many more dimensions d than training cases n.

The corresponding design matrix X is “short and fat”, rather than “tall and skinny”.

This is called the small n, large d problem.

For example, with gene microarrays, it is common to measure the expression levels of d ≈ 20,000 genes, but to only get n ≈ 100 samples (for instance, from 100 patients).

Q: What is the smallest set of features that can accurately predict the response? (To prevent overfitting, to reduce the cost of building a diagnostic device, or to help with scientific insight into the problem.)

(54)

Bayesian variable selection

Let γⱼ = 1 if feature j is relevant, and let γⱼ = 0 otherwise.

Our goal is to compute the posterior over models

p(γ|D) = exp(−f(γ)) / Σ_{γ′} exp(−f(γ′)),

where f(γ) is the cost function:

f(γ) = −[ log p(D|γ) + log p(γ) ].

For example, suppose we generate n = 20 samples from a d = 10 dimensional linear regression model, yᵢ ∼ N(wᵗxᵢ, σ²), in which K = 5 elements of w are non-zero.

Enumerate all 2¹⁰ = 1024 models and compute p(γ|D) for each one.
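A sketch of this enumeration on illustrative synthetic data. The exact marginal likelihood p(D|γ) would require integrating over the weights (as in Murphy); to keep the sketch short, it is replaced here by the BIC approximation from the previous subsection, together with a uniform prior p(γ). The nonzero indices and all constants are made up.

```python
import itertools
import numpy as np

def log_score(X, y, gamma, sigma2):
    """BIC-style approximation to log p(D | gamma): least-squares fit on the selected columns."""
    cols = np.flatnonzero(gamma)
    n = y.size
    if cols.size == 0:
        rss = np.sum(y ** 2)
    else:
        Xg = X[:, cols]
        w, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ w) ** 2)
    loglik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2
    return loglik - 0.5 * cols.size * np.log(n)

rng = np.random.default_rng(11)
n, d, sigma2 = 20, 10, 1.0
w_true = np.zeros(d)
w_true[[1, 2, 5, 7, 8]] = rng.normal(scale=3.0, size=5)      # K = 5 relevant features (illustrative)
X = rng.normal(size=(n, d))
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

gammas = list(itertools.product([0, 1], repeat=d))           # all 2^10 = 1024 models
scores = np.array([log_score(X, y, np.array(g), sigma2) for g in gammas])
post = np.exp(scores - scores.max())
post /= post.sum()                                           # approximate p(gamma | D), uniform p(gamma)

incl = (np.array(gammas) * post[:, None]).sum(axis=0)        # marginal inclusion probabilities
print(np.array(gammas[int(np.argmax(post))]))                # MAP model
print(np.round(incl, 2))
```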

(55)

Bayesian variable selection


Fig 13.1 in K. Murphy: Score function f(γ) for all possible models.

(56)

Bayesian variable selection

Fig 13.1 in K. Murphy. Left: Posterior p(model|data) over all 1024 models; the vertical scale has been truncated at 0.1 for clarity. Right: Marginal inclusion probabilities p(γⱼ = 1|D). The true model is {2, 3, 6, 8, 9}.

(57)

Bayesian variable selection

Interpreting the posterior over a large number of models is difficult ⇒ seek summary statistics.

A natural one is the posterior mode, or MAP estimate,

γ̂ = argmax_γ p(γ|D) = argmin_γ f(γ).

However, the mode is often not representative of the full posterior mass. A better summary is the median model, computed using

γ̂ = { j : p(γⱼ = 1|D) > 0.5 }

This requires computing the posterior marginal inclusion probabilities p(γj = 1|D).

(58)

Bayesian variable selection

The above example illustrates the gold standard for variable selection: the problem was small (d = 10), so we were able to compute the full posterior exactly.

Of course, variable selection is most useful in the cases where the number of dimensions is large.

There are 2ᵈ possible models (bit vectors) ⇒ impossible to compute the full posterior in general.

Even finding summaries (MAP, or marginal inclusion probabilities) is intractable ⇒ algorithmic speedups are necessary.

But first, focus on the computation of p(γ|D).
