Machine Learning 2020
Volker Roth
Department of Mathematics & Computer Science University of Basel
18th May 2020
Section 9
Linear latent variable models
Factor analysis
One problem with mixture models: they have only a single discrete latent variable, so each observation can come from only one of K prototypes.
Alternative: continuous latent variable z_i ∈ R^k with a Gaussian prior:
p(z_i) = N(z_i | µ_0, Σ_0)
[Graphical model: z_i (prior parameters µ_0, Σ_0) generates the observation x_i via W, µ, Ψ; plate over the N observations.]
For observations x_i ∈ R^p, we may use a Gaussian likelihood.
As in linear regression, we assume the mean is a linear function of z_i:
p(x_i | z_i, θ) = N(x_i | W z_i + µ, Ψ),
W: factor loading matrix, and Ψ: covariance matrix.
We take Ψ to be diagonal, since the whole point of the model is to "force" z_i to explain the correlation.
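A minimal numpy sketch of this generative process (all dimensions and parameter values below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 2, 5, 1000                       # latent dim., observed dim., sample size (arbitrary)
mu0, Sigma0 = np.zeros(k), np.eye(k)       # prior parameters of p(z_i)
W = rng.normal(size=(p, k))                # factor loading matrix
mu = rng.normal(size=p)                    # observation mean
Psi = np.diag(rng.uniform(0.1, 0.5, p))    # diagonal noise covariance

Z = rng.multivariate_normal(mu0, Sigma0, size=n)           # z_i ~ N(mu0, Sigma0)
noise = rng.multivariate_normal(np.zeros(p), Psi, size=n)
X = Z @ W.T + mu + noise                                   # x_i ~ N(W z_i + mu, Psi)

# Marginally, Cov(x) = W Sigma0 W^t + Psi: the low-rank term carries all
# correlations between coordinates, Psi only adds per-coordinate noise.
print(np.round(np.cov(X.T) - (W @ Sigma0 @ W.T + Psi), 2))  # small for large n
```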
Factor analysis: generative process
Generative process (k = 1, p = 2, diagonal Ψ):
Figure 12.1 in K. Murphy
We take an isotropic Gaussian “spray can” and slide it along the 1d line defined by wzi+µ. This induces a correlated Gaussian in 2d.
Inference of the latent factors
We hope that the latent factors z will reveal something interesting about the data ⟹ compute the posterior over the latent variables:
p(z_i | x_i, θ) = N(z_i | m_i, Σ)
Σ = (Σ_0^{-1} + W^t Ψ^{-1} W)^{-1}
m_i = Σ (W^t Ψ^{-1} (x_i − µ) + Σ_0^{-1} µ_0)
The posterior means m_i are called the latent scores, or latent factors.
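A small helper that evaluates these posterior formulas directly (a sketch; W, µ, the diagonal of Ψ, µ_0 and Σ_0 are assumed to be given):

```python
import numpy as np

def fa_posterior(x, W, mu, Psi_diag, mu0, Sigma0):
    """Posterior p(z | x) = N(m, Sigma) in the factor analysis model."""
    Psi_inv = np.diag(1.0 / Psi_diag)                 # Psi is diagonal
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma = np.linalg.inv(Sigma0_inv + W.T @ Psi_inv @ W)
    m = Sigma @ (W.T @ Psi_inv @ (x - mu) + Sigma0_inv @ mu0)
    return m, Sigma
```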
Example
Example from (Shalizi 2009). p = 11 variables and n = 387 cases describing aspects of cars: engine size, #(cylinders), miles per gallon (MPG), price, etc.
Fit a k = 2 dimensional model. Plot the m_i scores as points in R^2.
Try to understand the "meaning" of the latent factors: project the unit vectors e_1 = (1, 0, . . . , 0), e_2 = (0, 1, 0, . . . , 0), etc. into the low-dimensional space (blue lines).
Horizontal axis represents price (features labeled "dealer" and "retail"), with expensive cars on the right.
Vertical axis represents fuel efficiency (measured in terms of MPG) versus size:
heavy vehicles (less efficient) appear higher up, light vehicles (more efficient) lower down.
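A possible way to reproduce such a fit with scikit-learn's FactorAnalysis (sketch; "cars.csv" is a hypothetical placeholder for the 387 × 11 data matrix):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

cars = np.loadtxt("cars.csv", delimiter=",")   # hypothetical 387 x 11 data matrix
fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(cars)                # latent scores m_i, shape (387, 2)

# Loadings of the original features on the two factors (rows of W^t):
# column j gives the direction in which the unit vector e_j is drawn in the biplot.
loadings = fa.components_                      # shape (2, 11)
```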
Example
[Biplot of the latent scores, Component 1 vs. Component 2 (rotation = none): feature directions Retail, Dealer, Engine, Cylinders, Horsepower, CityMPG, HighwayMPG, Weight, Wheelbase, Length, Width, with example cars such as Porsche 911 GT2, Honda Insight, GMC Yukon XL 2500 SLT, Mercedes-Benz CL600, Kia Sorento LX, Mercedes-Benz G500, Saturn Ion1, Nissan Pathfinder Armada SE.]
Figure 12.2 in K. Murphy
Special Cases: PCA and CCA
Covariance matrix Ψ = σ²I ⟹ (probabilistic) PCA.
Two-view version involving x and y ⟹ CCA.
[Graphical model of the two-view version: a shared latent variable z_i^s and view-specific latents z_i^x, z_i^y generate x_i and y_i via the matrices W_x, B_x and W_y, B_y; plate over N.]
From figure 12.19 in K. Murphy
PCA and dimensionality reduction
Given n data points in p dimensions:
X =
[ − x_1^t − ]
[ − x_2^t − ]
[    ...    ]
[ − x_n^t − ]   ∈ R^{n×p}
Want to reduce the dimensionality from p to k. Choose k directions w_1, . . . , w_k, arrange them as columns in the matrix W:
W = [w_1, w_2, . . . , w_k] ∈ R^{p×k}
For each w_j, compute the similarity z_j = w_j^t x, j = 1, . . . , k.
Project x down to z = (z_1, . . . , z_k)^t = W^t x. How to choose W?
Encoding–decoding model
The projection matrix W serves two functions:
Encode: z = W^t x, z ∈ R^k, z_j = w_j^t x.
- The vectors w_j form a basis of the projected space.
- We will require that this basis is orthonormal, i.e. W^t W = I.
Decode: x̃ = W z = ∑_{j=1}^k z_j w_j, x̃ ∈ R^p.
- If k = p, the above orthonormality condition implies W^t = W^{-1}, and encoding can be undone without loss of information.
- If k < p, we have a least-squares problem: the reconstruction error will be nonzero.
Above we assumed that the origin of the coordinate system is in the sample mean, i.e. ∑_i x_i = 0.
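A toy check of these encode/decode relations for an orthonormal W (sketch; the random W below is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 3
W, _ = np.linalg.qr(rng.normal(size=(p, k)))   # orthonormal columns: W^t W = I_k
x = rng.normal(size=p)

z = W.T @ x                          # encode: z = W^t x
x_tilde = W @ z                      # decode: x~ = W z
err = np.sum((x - x_tilde) ** 2)     # nonzero for k < p; zero (up to rounding) if k = p
```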
Principal Component Analysis (PCA)
In the general case, we want the reconstruction error ||x − x̃|| to be small.
Objective: minimize
min_{W ∈ R^{p×k}: W^t W = I}  ∑_{i=1}^n ||x_i − W W^t x_i||²
Finding the principal components
Projection vectors are orthogonal ⟹ we can treat them separately:
min_{w: ||w|| = 1}  ∑_{i=1}^n ||x_i − w w^t x_i||²

∑_i ||x_i − w w^t x_i||²
 = ∑_{i=1}^n [ x_i^t x_i − 2 x_i^t w w^t x_i + x_i^t w (w^t w) w^t x_i ]      (w^t w = 1)
 = ∑_i [ x_i^t x_i − x_i^t w w^t x_i ]
 = ∑_i x_i^t x_i − w^t ( ∑_{i=1}^n x_i x_i^t ) w
 = const. − w^t X^t X w.
Finding the principal components
Want to maximize w^t X^t X w under the constraint ||w|| = 1.
Can also maximize the ratio J(w) = (w^t X^t X w) / (w^t w) ⟹ Rayleigh quotient.
The optimal projection w is the eigenvector of X^t X with the largest eigenvalue.
Note that we assumed ∑_i x_i = 0. Thus, the columns of X are assumed to sum to zero.
⟹ compute the SVD of the "centered" matrix X = U S V^t ⟹ the right singular vectors v are eigenvectors of X^t X ⟹ they are the principal components.
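A minimal PCA-via-SVD sketch following exactly this recipe (center, SVD, project):

```python
import numpy as np

def pca(X, k):
    """Principal components of an (n x p) data matrix via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)                  # enforce sum_i x_i = 0
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                             # first k right singular vectors = principal components
    Z = Xc @ W                               # scores z_i = W^t x_i
    return W, Z
```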
Eigen-faces [Turk and Pentland, 1991]
p = number of pixels
Each x_i ∈ R^p is a face image
x_i^j = intensity of the j-th pixel in image i
(X^t)_{p×n} ≈ W_{p×k} (Z^t)_{k×n},   with Z^t = [z_1 . . . z_n] (one column z_i per image).
Idea: z_i is a more 'meaningful' representation of the i-th face than x_i.
Can use z_i for nearest-neighbor classification ⟹ much faster when p ≫ k.
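A sketch of such a nearest-neighbor classifier in eigenface space (X_train, y_train, X_test are hypothetical arrays of vectorized face images; k = 50 is an arbitrary choice):

```python
import numpy as np

def nn_classify_eigenfaces(X_train, y_train, X_test, k=50):
    """Classify test faces by the nearest training face in the k-dim eigenface space."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:k].T                                   # eigenfaces (principal components)
    Z_train = (X_train - mean) @ W                 # k-dim codes of the training faces
    Z_test = (X_test - mean) @ W
    dists = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[dists.argmin(axis=1)]           # label of the closest training face
```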
Probabilistic PCA
[Graphical model: prior z_i ~ N(0, I), observation x_i generated via W with isotropic noise Ψ = σ²I and µ = 0; plate over N.]
Assuming Ψ = σ²I and centered data in the FA model ⟹ likelihood
p(x_i | z_i, θ) = N(x_i | W z_i, σ²I).
Probabilistic PCA
(Tipping & Bishop 1999): Maxima of the likelihood are given by
Ŵ = V (Λ − σ²I)^{1/2} R,
where R is an arbitrary orthogonal matrix, the columns of V are the first k eigenvectors of S = (1/n) X^t X, and Λ is the diagonal matrix of the corresponding eigenvalues.
As σ² → 0, we have Ŵ → V, as in classical PCA (for R = Λ^{−1/2}).
Projections z_i ⟹ posterior over the latent factors:
p(z_i | x_i, θ̂) = N(z_i | m̂_i, σ² F̂^{-1}),   F̂ = σ²I + Ŵ^t Ŵ,   m̂_i = F̂^{-1} Ŵ^t x_i
For σ² → 0, z_i → m̂_i and m̂_i → (V^t V)^{-1} V^t x_i = V^t x_i
⟹ orthogonal projection of the data onto the column space of V, as in classical PCA.
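A sketch of this closed-form solution (assumes centered X, R = I, and σ² smaller than the top-k eigenvalues of S):

```python
import numpy as np

def ppca(X, k, sigma2):
    """ML loadings and posterior means of probabilistic PCA (Tipping & Bishop form)."""
    n = X.shape[0]
    S = X.T @ X / n                                 # X is assumed to be centered
    evals, evecs = np.linalg.eigh(S)
    top = np.argsort(evals)[::-1][:k]               # indices of the k largest eigenvalues
    V, Lam = evecs[:, top], np.diag(evals[top])
    W_hat = V @ np.sqrt(Lam - sigma2 * np.eye(k))   # W^ = V (Lambda - sigma^2 I)^{1/2}, R = I
    F_hat = sigma2 * np.eye(k) + W_hat.T @ W_hat
    M = np.linalg.solve(F_hat, W_hat.T @ X.T).T     # posterior means m_i (rows of M)
    return W_hat, M
```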
Multiple Views: CCA
Consider paired samples from different views.
What is the dependency structure between the views?
Standard approach: global linear dependency detected by CCA.
Canonical Correlation Analysis [Hotelling, 1936]
Often, each data point consists of two views:
Image retrieval: for each image, we have
- X: pixels (or other visual features)
- Y: text around the image
Time series:
- X: signal at time t
- Y: signal at time t + 1
Two-view learning: divide features into two sets
- X: features of a word/object, etc.
- Y: features of the context in which it appears
Goal: reduce the dimensionality of the two views jointly.
Find projections such that the projected views are maximally correlated.
CCA vs PCA
[Figure: panels "separate PCA" (one per view) vs. "CCA".]
CCA: Setting
Let X be a random vector in R^{p_x} and Y a random vector in R^{p_y}. Consider the combined (p := p_x + p_y)-dimensional random vector Z = (X, Y)^t. Let its (p × p) covariance matrix Σ_Z be partitioned into blocks according to:
Σ_Z = [ Σ_XX ∈ R^{p_x×p_x}   Σ_XY ∈ R^{p_x×p_y} ]
      [ Σ_YX ∈ R^{p_y×p_x}   Σ_YY ∈ R^{p_y×p_y} ]
Assuming centered data, the blocks of the covariance matrix can be estimated from observed data matrices X ∈ R^{n×p_x}, Y ∈ R^{n×p_y}:
Σ_Z ≈ (1/n) [ X^t X   X^t Y ]
            [ Y^t X   Y^t Y ]
CCA: Setting
Correlation(x, y) = covariance(x, y) / (standard deviation(x) · standard deviation(y)):
ρ = cor(x, y) = cov(x, y) / (σ(x) σ(y)).
Sample correlation (for centered observations):
ρ = ∑_i (x_i − x̄)(y_i − ȳ) / ( √(∑_i (x_i − x̄)²) √(∑_i (y_i − ȳ)²) ) = x^t y / (√(x^t x) √(y^t y)).
Want to find maximally correlated 1d projections x^t a and y^t b.
Projected covariance: cov(x^t a, y^t b) = a^t Σ_XY b   (zero means).
Define c = Σ_XX^{1/2} a, d = Σ_YY^{1/2} b.
Thus, the projected correlation coefficient is:
ρ = ( c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} d ) / ( √(c^t c) √(d^t d) ).
CCA: Setting
By the Cauchy-Schwarz inequality (x^t y ≤ ||x|| · ||y||), we have
c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} d   ≤   ( c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1} Σ_YX Σ_XX^{−1/2} c )^{1/2} ( d^t d )^{1/2},
with H := Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} and G := H H^t. Hence
ρ ≤ (c^t G c)^{1/2} / (c^t c)^{1/2},   i.e.   ρ² ≤ (c^t G c) / (c^t c).
Equality: the vectors d and Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} c are collinear.
Maximum: c is the eigenvector with the maximum eigenvalue of
G := Σ_XX^{−1/2} Σ_XY Σ_YY^{−1} Σ_YX Σ_XX^{−1/2}.
Subsequent pairs are obtained from the eigenvectors with eigenvalues of decreasing magnitude.
Collinearity: d ∝ Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} c.
Transform back to the original variables: a = Σ_XX^{−1/2} c, b = Σ_YY^{−1/2} d.
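A minimal numpy sketch of this eigenvalue formulation (assumes centered data matrices and invertible within-view covariances):

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive-definite matrix."""
    evals, evecs = np.linalg.eigh(S)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def cca(X, Y, k=1):
    """First k canonical direction pairs (a, b) from centered X (n x px) and Y (n x py)."""
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    G = Sxx_is @ Sxy @ np.linalg.inv(Syy) @ Sxy.T @ Sxx_is
    evals, evecs = np.linalg.eigh(G)
    C = evecs[:, np.argsort(evals)[::-1][:k]]   # top-k eigenvectors c of G
    D = Syy_is @ Sxy.T @ Sxx_is @ C             # d collinear to Syy^{-1/2} Syx Sxx^{-1/2} c
    return Sxx_is @ C, Syy_is @ D               # back to the original variables: a, b
```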
Pixels That Sound [Kidron, Schechner, Elad, 2005]
“People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer-vision aided by a single microphone”.
https://webee.technion.ac.il/~yoav/research/pixels-that-sound.html
Probabilistic CCA
(Bach and Jordan 2005): With Gaussian priors p(z) = N(z^s | 0, I) N(z^x | 0, I) N(z^y | 0, I),
the MLE in the two-view FA model is equivalent to classical CCA (up to rotation and scaling).
[Graphical model: shared latent z_i^s and view-specific latents z_i^x, z_i^y generate x_i and y_i via W_x, B_x and W_y, B_y; plate over N.]
From figure 12.19 in K. Murphy
Further connections
If y is a discrete class label ⟹ CCA is (essentially) equivalent to Linear Discriminant Analysis (LDA), see (Hastie et al. 1994).
Arbitrary y ⟹ CCA is (essentially) equivalent to the Gaussian Information Bottleneck (Chechik et al. 2005).
- Basic idea: compress x into a compact latent representation z while preserving information about y.
- Information-theoretic motivation: find the encoding distribution p(z|x) by minimizing
  I(x; z) − β I(z; y),
  where β ≥ 0 is a parameter controlling the trade-off between compression and predictive accuracy.
Arbitrary y, discrete shared latent z^s ⟹ dependency-seeking clustering (Klami and Kaski 2008): find clusters that "explain" the dependency between the two views.