Machine Learning Kernel-PCA
Dmitrij Schlesinger
WS2014/2015, 08.12.2014
Principal Component Analysis
Problem – (very) high dimensions of feature spaces, especially in Computer Vision:
– A feature is a 5×5 image patch → the feature space is $\mathbb{R}^{25}$
– SIFT: 16 histograms of 8 directions → a vector in $\mathbb{R}^{128}$

Idea: represent the feature space in another basis.
Assumption:
the directions of small variances correspond to noise and can be neglected
Approach:
project the feature space onto a linear subspace so that the variances in the projected space are maximal
A simplified example: the data are centered and the subspace is one-dimensional, i.e. it is represented by a normalized vector $e$ with $\|e\|^2 = 1$. The projection of an $x$ onto $e$ is $\langle x, e\rangle$. Hence, the task is

$$\sum_l \langle x_l, e\rangle^2 \to \max_e \quad \text{s.t. } \|e\|^2 = 1$$

Lagrangian:

$$\sum_l \langle x_l, e\rangle^2 + \lambda\,(\|e\|^2 - 1) \to \min_\lambda \max_e$$
Gradient with respect to $e$:

$$\sum_l 2\langle x_l, e\rangle\, x_l + 2\lambda e = 2\Big(\sum_l x_l \otimes x_l\Big) e + 2\lambda e = 0$$

Setting $\mathrm{cov} = \sum_l x_l \otimes x_l$ and absorbing the sign into $\lambda$ gives

$$\mathrm{cov}\cdot e = \lambda e$$
→ $e$ is an eigenvector and $\lambda$ is the corresponding eigenvalue of the covariance matrix. Which one?

For a pair $e$ and $\lambda$ the corresponding variance is

$$\sum_l \langle x_l, e\rangle^2 = e \cdot \Big(\sum_l x_l \otimes x_l\Big) \cdot e = e \cdot \mathrm{cov} \cdot e = \|e\|^2\, \lambda = \lambda$$

→ choose the eigenvector corresponding to the maximal eigenvalue.
Similar approach: project the feature space onto a subspace so that the summed squared distance between the points and their projections is minimal → the result is the same.
Summary (for higher-dimensional subspaces $\mathbb{R}^m$):

1. Compute the covariance matrix $\mathrm{cov} = \sum_l x_l \otimes x_l$
2. Find all eigenvalues and eigenvectors
3. Sort the eigenvalues in decreasing order
4. Choose the $m$ eigenvectors for the $m$ first eigenvalues
5. The $n\times m$ projection matrix consists of $m$ columns, each one being the corresponding eigenvector
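The five steps above can be sketched in NumPy; the function name and the toy data are illustrative, not from the slides:

```python
# Sketch of the PCA summary above (illustrative names and data).
import numpy as np

def pca_projection_matrix(X, m):
    """X: l x n matrix of centered data points (rows); returns the n x m
    projection matrix whose columns are the m leading eigenvectors."""
    cov = X.T @ X                           # 1. covariance matrix (sum of outer products)
    eigvals, eigvecs = np.linalg.eigh(cov)  # 2. eigenvalues/eigenvectors (cov is symmetric)
    order = np.argsort(eigvals)[::-1][:m]   # 3./4. decreasing order, keep the m first
    return eigvecs[:, order]                # 5. n x m matrix of eigenvector columns

# usage: project centered 3-d points onto their 2 principal directions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.1])
X -= X.mean(axis=0)                  # center the data first
P = pca_projection_matrix(X, 2)
Y = X @ P                            # projected data
print(P.shape, Y.shape)              # (3, 2) (100, 2)
```

Since the eigenvectors are sorted by decreasing eigenvalue, the first projected coordinate carries the largest variance.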
Are projections onto a linear subspace good? Alternatives?
Do it with scalar products
The optimal direction vector can be expressed as a linear combination of the data points, i.e. it is contained in the subspace that is spanned by the data points:
$$e = \sum_i \alpha_i x_i$$
Note: in high-dimensional spaces it may happen that all data points lie in a linear subspace, i.e. do not span the whole space (e.g. when the dimension of the space is higher than the number of data points). This will be important later for the feature spaces.
Why is this so? Proof by contradiction: assume it is not the case – the optimal vector is not contained in the subspace spanned by the data points. Project it onto the subspace: the scalar products $\langle x_l, e\rangle$ do not change while $\|e\|$ decreases, so after renormalization the objective only improves.
Let us make the task a bit more complicated – consider the projections of the direction vector onto all data vectors (instead of the vector itself):
$$\lambda\, e = \mathrm{cov}\cdot e \quad\Leftrightarrow\quad \lambda\,(x_k^T e) = x_k^T \cdot \mathrm{cov}\cdot e \quad \forall k$$

The right side follows from the left one directly.
The opposite direction is less trivial (see the board). It holds only if $e$ can be represented as a linear combination of the $x_i$ – which is just the case here (see the previous slide).
Put it all together:

$$\mathrm{cov} = \frac{1}{l}\sum_j x_j x_j^T, \qquad e = \sum_i \alpha_i x_i, \qquad \lambda\,(x_k^T e) = x_k^T\cdot\mathrm{cov}\cdot e \quad \forall k$$

$$\lambda\,\Big(x_k^T \sum_i \alpha_i x_i\Big) = x_k^T \cdot \frac{1}{l}\sum_j x_j x_j^T \cdot \sum_i \alpha_i x_i \quad \forall k$$

$$\lambda\, l \sum_i \alpha_i\, x_k^T x_i = \sum_i \alpha_i \sum_j x_k^T x_j\, x_j^T x_i \quad \forall k$$

Let $K$ be the matrix of pairwise scalar products: $K_{ij} = x_i^T x_j$. Then (in matrix form)

$$\lambda\, l\, K\alpha = K^2 \alpha \;\Rightarrow\; \lambda\, l\, \alpha = K\alpha$$
The PCA can be expressed using scalar products only !!!
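This equivalence can be checked numerically: the nonzero eigenvalues of $\mathrm{cov} = \frac{1}{l}\sum_j x_j x_j^T$ coincide with the values $\lambda$ solving $\lambda l \alpha = K\alpha$, i.e. with the eigenvalues of $K$ divided by $l$ (a sketch with illustrative data):

```python
# Numerical check (illustrative sketch): the nonzero eigenvalues of
# cov = (1/l) sum_j x_j x_j^T equal the solutions lambda of
# lambda*l*alpha = K*alpha, i.e. the eigenvalues of K divided by l.
import numpy as np

rng = np.random.default_rng(1)
l, n = 5, 3
X = rng.normal(size=(l, n))
X -= X.mean(axis=0)

cov = X.T @ X / l                    # n x n covariance matrix
K = X @ X.T                          # l x l matrix of pairwise scalar products

ev_cov = np.sort(np.linalg.eigvalsh(cov))[::-1]
ev_K = np.sort(np.linalg.eigvalsh(K))[::-1] / l

# the n leading eigenvalues agree (the remaining ones of K are zero)
print(np.allclose(ev_cov, ev_K[:n]))  # True
```

This is the familiar fact that $X^T X$ and $X X^T$ share their nonzero eigenvalues.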
$$\lambda\, l\, \alpha = K\alpha \quad \text{with an unknown vector } \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_l)$$

→ basically the same task: find eigenvalues (this time, however, of the matrix $K$ instead of the covariance matrix $\mathrm{cov}$).

Let the $\alpha$-s be known (already found). At test time, new data points are projected onto $e$; the length of the projection is

$$\langle x, e\rangle = \Big\langle x, \sum_i \alpha_i x_i \Big\rangle = \sum_i \alpha_i \langle x, x_i\rangle$$

⇒ the quantity of interest can be computed using scalar products only; the direction vector is not needed explicitly.
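A quick sanity check of this formula (illustrative names): the projection computed from the $\alpha$-s and scalar products matches the projection onto the leading eigenvector of the covariance matrix, up to the arbitrary sign of an eigenvector:

```python
# Sanity check (illustrative sketch): <x, e> computed from the alphas and
# scalar products equals the projection onto the leading eigenvector of cov.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X -= X.mean(axis=0)
l = X.shape[0]

K = X @ X.T                          # pairwise scalar products
mu, A = np.linalg.eigh(K)            # ascending eigenvalues
alpha = A[:, -1] / np.sqrt(mu[-1])   # normalize so e = sum_i alpha_i x_i has ||e|| = 1

x_new = rng.normal(size=4)
proj_dual = alpha @ (X @ x_new)      # sum_i alpha_i <x_new, x_i>

e = np.linalg.eigh(X.T @ X / l)[1][:, -1]  # leading eigenvector of cov
proj_direct = e @ x_new

print(np.isclose(abs(proj_dual), abs(proj_direct)))  # True
```

The normalization works because $\|e\|^2 = \alpha^T K \alpha$, which equals 1 after dividing the unit eigenvector by $\sqrt{\mu_{\max}}$.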
Using kernels
Now we would like to find a direction vector in the feature space.
Everything remains exactly the same, but with the kernel matrix
$$K_{ij} = \langle \Phi(x_i), \Phi(x_j)\rangle = \kappa(x_i, x_j)$$

The projection onto the optimal direction is

$$\langle \Phi(x), e\rangle = \Big\langle \Phi(x), \sum_i \alpha_i \Phi(x_i)\Big\rangle = \sum_i \alpha_i\, \kappa(x, x_i)$$
→Kernel-PCA
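A minimal Kernel-PCA sketch with a Gaussian kernel, following the recipe above; note that centering in the feature space is omitted here, as on the slides, and all names and data are illustrative:

```python
# Minimal Kernel-PCA sketch with a Gaussian kernel; centering in feature
# space is omitted, as on the slides. Names and data are illustrative.
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))         # input-space data points

# kernel matrix K_ij = kappa(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

mu, A = np.linalg.eigh(K)            # solve lambda*l*alpha = K*alpha
alpha = A[:, -1] / np.sqrt(mu[-1])   # so that ||e|| = 1 in the feature space

# projection of a new point onto the principal direction in feature space
x_new = np.array([0.5, -0.2])
proj = sum(a * gaussian_kernel(x_new, xi) for a, xi in zip(alpha, X))
print(proj)
```

The direction $e$ in the feature space is never formed explicitly; only kernel evaluations $\kappa(x, x_i)$ are needed.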
An illustration
$$\langle \Phi(x), e\rangle = \sum_i \alpha_i\, \kappa(x, x_i)$$

A linear function in the feature space ↔ a non-linear function in the input space.
Literature
Bernhard Schölkopf, Alexander Smola, Klaus-Robert Müller Kernel Principal Component Analysis (1997)
In the paper, the material is presented directly for feature spaces and kernels.
Some further illustrations
[Figure: data in the input space, and Kernel-PCA results with a polynomial kernel and a Gaussian kernel]