
(1)

Machine Learning 2020

Volker Roth

Department of Mathematics & Computer Science University of Basel

18th May 2020

(2)

Section 9

Linear latent variable models

(3)

Factor analysis

One problem with mixture models: only a single latent variable. Each observation can only come from one of K prototypes.

Alternative: z_i ∈ R^k. Gaussian prior:

p(z_i) = N(z_i | µ_0, Σ_0)

[Graphical model: prior parameters µ_0, Σ_0 → latent z_i → observation x_i, with parameters W, µ, Ψ; plate over i = 1, ..., N.]

For observations x_i ∈ R^p, we may use a Gaussian likelihood.

As in linear regression, we assume the mean is a linear function:

p(x_i | z_i, θ) = N(W z_i + µ, Ψ),

W: factor loading matrix, and Ψ: covariance matrix.

We take Ψ to be diagonal, since the whole point of the model is to “force” z_i to explain the correlation.

(4)

Factor analysis: generative process

Generative process (k = 1, p = 2, diagonal Ψ):

Figure 12.1 in K. Murphy

We take an isotropic Gaussian “spray can” and slide it along the 1d line defined by w z_i + µ. This induces a correlated Gaussian in 2d.
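To make the “spray can” picture concrete, here is a minimal sketch of sampling from the FA generative model in the k = 1, p = 2 setting; the particular values of W, µ and Ψ are assumptions for illustration, not taken from the lecture.

```python
# Sketch: sampling from the factor analysis generative model (toy values).
import numpy as np

rng = np.random.default_rng(0)
k, p, n = 1, 2, 500                  # latent dim, observed dim, sample size
W = np.array([[2.0], [1.0]])         # factor loading matrix (p x k), assumed
mu = np.array([0.5, -1.0])           # offset mu, assumed
Psi = np.diag([0.1, 0.1])            # diagonal noise covariance, assumed

Z = rng.normal(size=(n, k))                            # z_i ~ N(0, I)
E = rng.multivariate_normal(np.zeros(p), Psi, size=n)  # diagonal Gaussian noise
X = Z @ W.T + mu + E                                   # x_i ~ N(W z_i + mu, Psi)

print(np.cov(X, rowvar=False))       # approximately W W^t + Psi: correlated in 2d
```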

(5)

Inference of the latent factors

We hope that the latent factors z will reveal something interesting about the data ⇒ compute the posterior over the latent variables:

p(z_i | x_i, θ) = N(z_i | m_i, Σ_i)

Σ_i = (Σ_0^{-1} + W^t Ψ^{-1} W)^{-1}

m_i = Σ_i (W^t Ψ^{-1} (x_i − µ) + Σ_0^{-1} µ_0)

The posterior means m_i are called the latent scores, or latent factors.
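A minimal sketch of these posterior formulas in code, assuming the model parameters W, µ, Ψ and a prior N(µ_0, Σ_0) are already given (all names are placeholders):

```python
# Sketch: posterior over the latent factors, following the formulas above.
import numpy as np

def fa_posterior(X, W, mu, Psi, mu0, Sigma0):
    """Return the posterior covariance Sigma_i and the latent scores m_i (one row each)."""
    Psi_inv = np.linalg.inv(Psi)
    Sigma_i = np.linalg.inv(np.linalg.inv(Sigma0) + W.T @ Psi_inv @ W)
    # m_i = Sigma_i (W^t Psi^{-1} (x_i - mu) + Sigma_0^{-1} mu_0)
    M = (Sigma_i @ (W.T @ Psi_inv @ (X - mu).T
                    + np.linalg.solve(Sigma0, mu0)[:, None])).T
    return Sigma_i, M
```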

(6)

Example

Example from (Shalizi 2009). p = 11 variables and n = 387 cases describing aspects of cars: engine size, #(cylinders), miles per gallon (MPG), price, etc.

Fit a k = 2 dim model. Plot the m_i scores as points in R^2.

Try to understand the “meaning” of the latent factors: project unit vectors e_1 = (1, 0, ..., 0), e_2 = (0, 1, 0, ..., 0), etc. into the low-dimensional space (blue lines).

Horizontal axis represents price (features labeled “dealer” and “retail”), with expensive cars on the right.

Vertical axis represents fuel efficiency (measured in terms of MPG) versus size: heavy vehicles (less efficient) higher up, light vehicles (more efficient) lower down.

(7)

Example

[Biplot of the two latent factors (Component 1 vs. Component 2, rotation = none): projected feature directions (Retail, Dealer, Engine, Cylinders, Horsepower, CityMPG, HighwayMPG, Weight, Wheelbase, Length, Width) together with individual cars such as the Porsche 911 GT2, Honda Insight, GMC Yukon XL 2500 SLT, Mercedes-Benz CL600, Kia Sorento LX, Mercedes-Benz G500, Saturn Ion1, and Nissan Pathfinder Armada SE.]

Figure 12.2 in K. Murphy

(8)

Special Cases: PCA and CCA

Covariance matrix Ψ = σ²I ⇒ (probabilistic) PCA.

Two-view version involving x and y ⇒ CCA.

[Graphical model for the two-view model: a shared latent z_i^s and view-specific latents z_i^x, z_i^y generate the observations x_i and y_i via loading matrices W_x, B_x and W_y, B_y; plate over N.]

From figure 12.19 in K. Murphy

(9)

PCA and dimensionality reduction

Given n data points in p dimensions:

X = [ − x_1^t −
      − x_2^t −
        ...
      − x_n^t − ]  ∈ R^{n×p}

Want to reduce dimensionality from p to k. Choose k directions w_1, ..., w_k, arrange them as columns in matrix W:

W = [w_1, w_2, ..., w_k] ∈ R^{p×k}

For each w_j, compute the similarity z_j = w_j^t x, j = 1, ..., k.

Project x down to z = (z_1, ..., z_k)^t = W^t x. How to choose W?

(10)

Encoding–decoding model

The projection matrix W serves two functions:

Encode: z = W^t x, z ∈ R^k, z_j = w_j^t x.

▶ The vectors w_j form a basis of the projected space.
▶ We will require that this basis is orthonormal, i.e. W^t W = I.

Decode: x̃ = W z = Σ_{j=1}^k z_j w_j, x̃ ∈ R^p.

▶ If k = p, the above orthonormality condition implies W^t = W^{-1}, and encoding can be undone without loss of information.
▶ If k < p ⇒ least-squares problem: the reconstruction error will be nonzero (see the sketch below).

Above we assumed that the origin of the coordinate system is in the sample mean, i.e. Σ_i x_i = 0.
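A minimal sketch of the encode/decode pair for a given orthonormal W (the matrix itself is assumed to come from PCA or any other choice of directions):

```python
# Sketch: encoding and decoding with an orthonormal projection matrix W.
import numpy as np

def encode(X, W):
    return X @ W              # rows z_i^t = x_i^t W, i.e. z_i = W^t x_i

def decode(Z, W):
    return Z @ W.T            # rows x~_i^t = z_i^t W^t, i.e. x~_i = W z_i

# If k = p and W^t W = I, decode(encode(X, W), W) recovers X exactly;
# if k < p the reconstruction is generally lossy.
```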

(11)

Principal Component Analysis (PCA)

In the general case, we want the reconstruction error ||x − x̃|| to be small.

Objective: minimize

min_{W ∈ R^{p×k}: W^t W = I}  Σ_{i=1}^n ||x_i − W W^t x_i||^2
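The objective can be evaluated directly as a reconstruction error; a small sketch, assuming row-wise data X and orthonormal columns in W:

```python
# Sketch: PCA objective sum_i ||x_i - W W^t x_i||^2.
import numpy as np

def reconstruction_error(X, W):
    X_hat = X @ W @ W.T       # encode with W^t, decode with W
    return np.sum((X - X_hat) ** 2)
```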

(12)

Finding the principal components

Projection vectors are orthogonal ⇒ can treat them separately:

min_{w: ||w|| = 1}  Σ_{i=1}^n ||x_i − w w^t x_i||^2

Σ_i ||x_i − w w^t x_i||^2
  = Σ_{i=1}^n [ x_i^t x_i − 2 x_i^t w w^t x_i + x_i^t w (w^t w) w^t x_i ]     (w^t w = 1)
  = Σ_i [ x_i^t x_i − x_i^t w w^t x_i ]
  = Σ_i x_i^t x_i − w^t (Σ_{i=1}^n x_i x_i^t) w
  = Σ_i x_i^t x_i − w^t X^t X w,     where Σ_i x_i^t x_i is constant in w.

(13)

Finding the principal components

Want to maximize w^t X^t X w under the constraint ||w|| = 1.

Can also maximize the ratio J(w) = (w^t X^t X w) / (w^t w) ⇒ Rayleigh quotient.

Optimal projection w is the eigenvector of X^t X with the largest eigenvalue.

Note that we assumed that Σ_i x_i = 0. Thus, the columns of X are assumed to sum to zero.

⇒ compute the SVD of the “centered” matrix X = U S V^t ⇒ singular vectors v are eigenvectors of X^t X ⇒ they are the principal components.
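A minimal PCA sketch via the SVD of the centered data matrix, as described above (the function name and the choice to return both W and the scores are assumptions):

```python
# Sketch: classical PCA via the SVD of the centered data matrix.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                        # center: columns sum to zero
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                                   # top-k right singular vectors
    Z = Xc @ W                                     # projected scores
    return W, Z
```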

(14)

Eigen-faces [Turk and Pentland, 1991]

p = number of pixels

Each xi ∈Rp is a face image

xji = intensity of the j-th pixel in imagei

(Xt)p×nWp×k (Zt)k×n

| |

z1 . . . zn

| |

Idea: zi more ’meaningful’ representation ofi-th face than xi

Can use zi for nearest-neighbor classification Much faster whenp k.
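A minimal sketch of nearest-neighbor classification in the reduced space; the training labels y_train, the query image x_query and the choice k = 50 are assumptions for illustration:

```python
# Sketch: eigenfaces-style nearest-neighbour classification in the PCA subspace.
import numpy as np

def nn_classify(X_train, y_train, x_query, k=50):
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:k].T                                    # top-k principal components
    Z_train = (X_train - mean) @ W                  # project training faces
    z_query = (x_query - mean) @ W                  # project the query the same way
    nearest = np.argmin(np.linalg.norm(Z_train - z_query, axis=1))
    return y_train[nearest]
```

Distances are computed between k-dimensional codes rather than p-dimensional images, which is where the speedup for p ≫ k comes from.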

(15)

Probabilistic PCA

[Graphical model for PPCA: latent z_i → observation x_i with loading W, noise covariance Ψ = σ²I, and mean µ = 0; plate over N.]

Assuming Ψ = σ²I and centered data in the FA model ⇒ likelihood

p(x_i | z_i, θ) = N(W z_i, σ²I).

(16)

Probabilistic PCA

(Tipping & Bishop 1999): Maxima of the likelihood are given by

Ŵ = V (Λ − σ²I)^{1/2} R,

where R is an arbitrary orthogonal matrix, columns of V: first k eigenvectors of S = (1/n) X^t X, and Λ: diagonal matrix of the corresponding eigenvalues.

As σ² → 0, we have Ŵ → V, as in classical PCA (for R = Λ^{−1/2}).

Projections z_i: posterior over the latent factors:

p(z_i | x_i, θ̂) = N(z_i | m̂_i, σ² F̂^{-1})
F̂ = σ²I + Ŵ^t Ŵ
m̂_i = F̂^{-1} Ŵ^t x_i

For σ² → 0, z_i → m̂_i and m̂_i → (V^t V)^{-1} V^t x_i = V^t x_i

⇒ orthogonal projection of the data onto the column space of V, as in classical PCA.
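A minimal sketch of the Tipping & Bishop solution for Ŵ, treating σ² as given, choosing R = I, and assuming centered data (all names are placeholders):

```python
# Sketch: closed-form PPCA loading matrix W_hat = V (Lambda - sigma^2 I)^{1/2} R.
import numpy as np

def ppca_loadings(X, k, sigma2):
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / Xc.shape[0]                   # sample covariance S = (1/n) X^t X
    evals, evecs = np.linalg.eigh(S)              # ascending eigenvalues
    V = evecs[:, ::-1][:, :k]                     # first k eigenvectors of S
    Lam = np.diag(evals[::-1][:k])                # corresponding eigenvalues
    # assumes sigma2 is smaller than the k-th eigenvalue, so the square root is real
    return V @ np.sqrt(Lam - sigma2 * np.eye(k))  # here R = I
```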

(17)

Multiple Views: CCA

Consider paired samples from different views.

What is the dependency structure between the views?

Standard approach: global linear dependency detected by CCA.

(18)

Canonical Correlation Analysis [Hotelling, 1936]

Often, each data point consists of two views:

Image retrieval: for each image, have the following:
▶ X: Pixels (or other visual features); Y: Text around the image

Time series:
▶ X: Signal at time t
▶ Y: Signal at time t + 1

Two-view learning: divide features into two sets:
▶ X: Features of a word/object, etc.
▶ Y: Features of the context in which it appears

Goal: reduce the dimensionality of the two views jointly.

Find projections such that projected views are maximally correlated.

(19)

CCA vs PCA

[Illustration: applying separate PCA to each view vs. a joint CCA projection of the paired views.]

(20)

CCA vs PCA


(21)

CCA: Setting

Let X be a random vector ∈ R^{p_x} and Y be a random vector ∈ R^{p_y}. Consider the combined (p := p_x + p_y)-dimensional random vector Z = (X, Y)^t. Let its (p×p) covariance matrix be partitioned into blocks according to:

Σ_Z = [ Σ_XX ∈ R^{p_x×p_x} | Σ_XY ∈ R^{p_x×p_y} ]
      [ Σ_YX ∈ R^{p_y×p_x} | Σ_YY ∈ R^{p_y×p_y} ]

Assuming centered data, the blocks in the covariance matrix can be estimated from observed data sets X ∈ R^{n×p_x}, Y ∈ R^{n×p_y}:

Σ_Z ≈ (1/n) [ X^t X | X^t Y ]
            [ Y^t X | Y^t Y ]

(22)

CCA: Setting

Correlation(x, y) = covariance(x, y) / (standard deviation(x) · standard deviation(y)):

ρ = cor(x, y) = cov(x, y) / (σ(x) σ(y)).

Sample correlation:

ρ = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) · √(Σ_i (y_i − ȳ)²) )

  (centered observations)  = x^t y / ( √(x^t x) · √(y^t y) ).

Want to find maximally correlated 1D projections x^t a and y^t b.

Projected covariance: cov(x^t a, y^t b)  (zero means)  = a^t Σ_XY b.

Define c = Σ_XX^{1/2} a,  d = Σ_YY^{1/2} b.

Thus, the projected correlation coefficient is:

ρ = ( c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} d ) / ( √(c^t c) · √(d^t d) ).
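A small numerical check, with assumed toy data and arbitrary directions a, b, that this projected correlation formula agrees with the sample correlation of the projected variables:

```python
# Sketch: verify rho = a^t S_XY b / sqrt(a^t S_XX a * b^t S_YY b) on toy data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)); X -= X.mean(axis=0)   # centered view 1
Y = rng.normal(size=(200, 2)); Y -= Y.mean(axis=0)   # centered view 2
a, b = rng.normal(size=3), rng.normal(size=2)        # arbitrary projection directions

Sxx, Syy, Sxy = X.T @ X / 200, Y.T @ Y / 200, X.T @ Y / 200
rho_formula = (a @ Sxy @ b) / np.sqrt((a @ Sxx @ a) * (b @ Syy @ b))
rho_sample = np.corrcoef(X @ a, Y @ b)[0, 1]
print(np.isclose(rho_formula, rho_sample))           # True
```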

(23)

CCA: Setting

By the Cauchy-Schwarz inequality (x^t y ≤ ||x|| · ||y||), we have

c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} d  ≤  ( c^t Σ_XX^{−1/2} Σ_XY Σ_YY^{−1} Σ_YX Σ_XX^{−1/2} c )^{1/2} · (d^t d)^{1/2},

where H := Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} and G := H H^t. Hence

ρ ≤ (c^t G c)^{1/2} / (c^t c)^{1/2},   i.e.   ρ² ≤ (c^t G c) / (c^t c).

Equality: the vectors d and Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} c are collinear.

Maximum: c is the eigenvector with the maximum eigenvalue of G := Σ_XX^{−1/2} Σ_XY Σ_YY^{−1} Σ_YX Σ_XX^{−1/2}.

Subsequent pairs: use eigenvectors with eigenvalues of decreasing magnitudes.

Collinearity: d ∝ Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} c.

Transform back to original variables: a = Σ_XX^{−1/2} c,  b = Σ_YY^{−1/2} d.
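A minimal sketch of the first canonical pair following this derivation: whiten with Σ_XX^{−1/2} and Σ_YY^{−1/2}, take the top eigenvector of G, and transform back. The helper names are assumptions, and sample covariances stand in for the population ones:

```python
# Sketch: first pair of canonical directions via the eigen-decomposition of G.
import numpy as np

def inv_sqrt(A):
    """Symmetric inverse square root A^{-1/2} via the eigendecomposition."""
    evals, evecs = np.linalg.eigh(A)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def cca_first_pair(X, Y):
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    G = Sxx_is @ Sxy @ np.linalg.inv(Syy) @ Sxy.T @ Sxx_is
    evals, evecs = np.linalg.eigh(G)
    c = evecs[:, -1]                          # eigenvector with the largest eigenvalue
    d = Syy_is @ Sxy.T @ Sxx_is @ c           # collinear direction on the Y side
    d /= np.linalg.norm(d)
    a, b = Sxx_is @ c, Syy_is @ d             # back to the original variables
    return a, b, np.sqrt(evals[-1])           # canonical correlation = sqrt(lambda_max)
```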

(24)

Pixels That Sound [Kidron, Schechner, Elad, 2005]

“People and animals fuse auditory and visual information to obtain robust perception. A particular benefit of such cross-modal analysis is the ability to localize visual events associated with sound sources. We aim to achieve this using computer-vision aided by a single microphone”.

https://webee.technion.ac.il/~yoav/research/pixels-that-sound.html

(25)

Probabilistic CCA

(Bach and Jordan 2005): With Gaussian priors

p(z) = N(z_s | 0, I) N(z_x | 0, I) N(z_y | 0, I),

the MLE in the two-view FA model is equivalent to classical CCA (up to rotation and scaling).

[Graphical model for the two-view FA model: shared latent z_i^s and view-specific latents z_i^x, z_i^y generate x_i and y_i via loading matrices W_x, B_x and W_y, B_y; plate over N.]

From figure 12.19 in K. Murphy

(26)

Further connections

If y is a discrete class label ⇒ CCA is (essentially) equivalent to Linear Discriminant Analysis (LDA), see (Hastie et al. 1994).

Arbitrary y ⇒ CCA is (essentially) equivalent to the Gaussian Information Bottleneck (Chechik et al. 2005).

▶ Basic idea: compress x into a compact latent representation while preserving information about y.
▶ Information-theoretic motivation: find the encoding distribution p(z|x) by minimizing

I(x; z) − β I(z; y),

where β ≥ 0 is some parameter controlling the tradeoff between compression and predictive accuracy.

Arbitrary y, discrete shared latent z_s ⇒ dependency-seeking clustering (Klami and Kaski 2008): find clusters that “explain” the dependency between the two views.
