Chapter 2
Least squares problems
Linear curve fitting
• Notation: n objects at locations xi ∈ Rp. Every object has measurement yi ∈ R.
• Approximate “regression targets” y as a parametrized function of x.
• Consider a 1-dim problem initially.
• Start with n data points (xi, yi), i = 1, . . . , n.
• Choose d basis functions g0(x), g1(x), . . . , gd−1(x).
• Fitting a line uses two basis functions,
g0(x) = 1 and g1(x) = x. In most cases n ≫ d.
• Fit function = linear combination of basis functions:
f(x; w) = Σj wj gj(x) = w0 + w1x.
• f(xi) = yi exactly is (usually) not possible, so approximate f(xi) ≈ yi
• n residuals are defined by ri = yi − f(xi) = yi − (w0 + w1xi).
[Figure: the three data constraints drawn as lines in the (w1, w0)-plane: w0 = −2w1 − 1, w0 = w1 + 1, w0 = 3w1 − 2.]
Calculus or algebra?
• Quality of fit can be measured by the residual sum of squares
RSS = Σi ri² = Σi [yi − (w0 + w1xi)]².
• Minimizing RSS with respect to w1 and w0 provides the least-squares fit.
• To solve the least squares problem we can
1. set the derivative of RSS to zero (calculus), or
2. solve the over-determined system w0 + w1xi = yi, i = 1, . . . , n (algebra).
• The results you get are...
– mathematically the same, but
– have different numerical properties.
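As a minimal sketch (not from the notes; the data are synthetic), both routes can be tried in NumPy: the calculus route solves the normal equations, the algebra route hands the over-determined system to an orthogonalization-based solver:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)   # noisy line, made-up data

X = np.column_stack([np.ones_like(x), x])   # basis functions g0 = 1, g1 = x

# 1. calculus: solve the normal equations XtX w = Xt y
w_calc = np.linalg.solve(X.T @ X, X.T @ y)

# 2. algebra: solve the over-determined system Xw ≈ y directly
w_alg, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_calc, w_alg)   # mathematically the same result
```

On this well-conditioned toy problem both routes agree to machine precision; the difference only shows up for ill-conditioned X.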
Matrix-vector form
• Write f(x) ≈ y in matrix-vector form for the n observed points as

Xw ≈ y,

where X ∈ Rn×2 has rows (1, xi), w = (w0, w1)t, and y = (y1, y2, . . . , yn)t.
• We minimize the sum of squared errors, which is the squared norm of the residual vector r = y − Xw:
RSS = Σi=1..n (yi − (Xw)i)² = ‖y − Xw‖² = ‖r‖² = rtr.
• RSS = 0 is only possible if all the data points lie on a line.
Basis functions
X has as many columns as there are basis functions. Examples:
• High-dimensional linear functions:
x ∈ Rp, g0(x) = 1 and g1(x) = x1, g2(x) = x2, . . . , gp(x) = xp.
Xi• = (1, xit) (i-th row of X), f(x; w) = wtg(x) = w0 + w1x1 + · · · + wpxp.
• Document analysis: Assume a fixed collection of words:
x = text document, g0(x) = 1,
gi(x) = #(occurrences of i-th word in document),
f(x; w) = wtg(x) = w0 + Σi∈words wi gi(x).
Solution by Calculus
RSS = rtr = (y − Xw)t(y − Xw)
= yty − ytXw − wtXty + wtXtXw
= yty − 2ytXw + wtXtXw.
Minimization: set the gradient (vector of partial derivatives) to zero:
∇w RSS = ∂RSS/∂w = 0.
We need some properties of vector derivatives:
∂(Ax)/∂x = At
∂(xtA)/∂x = A
∂(xtAx)/∂x = Ax + Atx (if A is square)
Normal Equations
∂RSS/∂w = ∂/∂w [yty − 2ytXw + wtXtXw]
= −2Xty + [XtXw + (XtX)tw]
= −2Xty + 2XtXw = 0.

Normal equations: XtXw = Xty.
Could solve this system. But all solution methods based on the normal equations are inherently susceptible to roundoff errors:
k(X) = σmax/σmin, where XtXvi = σi²vi,
k(XtX) = µmax/µmin, where XtXXtXvi = µi²vi.
Since XtXXtXvi = XtXσi²vi = σi⁴vi, we get µi = σi²
⇒ k(XtX) = k²(X).
The algebraic approach will avoid this problem!
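A quick numerical illustration of the squared condition number (a sketch with a made-up, ill-conditioned X):

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up design matrix with two badly scaled columns
X = rng.standard_normal((100, 4)) @ np.diag([1.0, 1.0, 1e-4, 1e-4])

kX = np.linalg.cond(X)
kXtX = np.linalg.cond(X.T @ X)
print(kX, kXtX)        # kXtX is (numerically) kX squared
```

Roughly speaking, forming XtX doubles the number of significant digits lost to roundoff.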
From Calculus to Algebra
∂RSS(w)/∂w = −2Xty + 2XtXw = 0
⇒ Xt(y − Xw) = Xtr = 0 ⇒ r ∈ N(Xt).
• Every Xw is in column space C(X),
residual r is in the orthogonal complement N(Xt) (left nullspace).
• Let ŷ be the orthogonal projection of y on C(X). Then y splits as y = ŷ + r with ŷ ∈ C(X) and r ∈ N(Xt).
[Figure: y projected onto the plane spanned by the columns X[·,1] and X[·,2]; the residual is perpendicular to that plane. Adapted from Fig. 3.2 in (Hastie, Tibshirani, Friedman)]
Algebraic interpretation
• y = ŷ + r with ŷ ∈ C(X) and r ∈ N(Xt). Consider the over-determined systems
Xw = y = ŷ + r (solution impossible, if r ≠ 0),
Xŵ = ŷ (solvable, since ŷ ∈ C(X)!).
• The solution ŵ of Xŵ = ŷ makes the error as small as possible:
‖Xw − y‖² = ‖Xw − (ŷ + r)‖² = ‖Xw − ŷ‖² + ‖r‖².
Reduce ‖Xw − ŷ‖² to zero by solving Xŵ = ŷ and choosing w = ŵ.
The remaining error ‖r‖² cannot be avoided, since r ∈ N(Xt).
XtXŵ = Xtŷ = Xty ⇒ ŵ = (XtX)⁻¹Xty (if XtX invertible).
• The fitted values at the sample points are ŷ = Xŵ = X(XtX)⁻¹Xty.
• H = X(XtX)⁻¹Xt is called the hat matrix (it puts a “hat” on y: ŷ = Hy).
Algebraic interpretation
• Left nullspace N(Xt) is orthogonal complement of column space C(X).
• H is the orthogonal projection on C(X):
HX = X(XtX)⁻¹XtX = X, Hv = 0 for v ∈ N(Xt).
• M = I − H is the orthogonal projection on the nullspace of Xt:
MX = (I − H)X = X − X = 0, Mv = v for v ∈ N(Xt).
• H and M are symmetric (Ht = H, Mt = M) and idempotent (HH = H, MM = M).
The algebra of Least Squares:
H creates fitted values: ŷ = Hy, ŷ ∈ C(X).
M creates residuals: r = My, r ∈ N(Xt).
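The projection algebra can be checked directly (a small synthetic example, assuming XtX is invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))    # synthetic X with independent columns
y = rng.standard_normal(8)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(8) - H                      # projection onto N(Xt)

y_hat = H @ y     # fitted values, in C(X)
r = M @ y         # residuals, in N(Xt)

# symmetric, idempotent, and the residual is orthogonal to the columns of X:
print(np.allclose(H, H.T), np.allclose(H @ H, H), np.allclose(X.T @ r, 0))
```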
Algebraic interpretation
XtX is invertible iff X has linearly independent columns.
Why? XtX has the same nullspace as X:
(i) If a ∈ N(X), then Xa = 0 ⇒ XtXa = 0 ⇒ a ∈ N(XtX).
(ii) If a ∈ N(XtX), then atXtXa = 0 ⇔ ‖Xa‖² = 0, so Xa has length zero ⇒ Xa = 0 ⇒ a ∈ N(X).
Thus, every vector in one nullspace is also in the other one.
So if N(X) = {0}, then XtX ∈ Rd×d has full rank d.
When X has independent columns, XtX is positive definite.
Why? XtX is clearly symmetric and invertible.
To show: All eigenvalues > 0
XtXv = λv ⇒ vtXtXv = λvtv ⇒ λ = ‖Xv‖²/‖v‖² > 0.
SVD for Least-Squares
• Goal: Avoid numerical problems for normal equations:
XtXw = Xty, k(XtX) = k²(X).
• Idea: Apply the SVD directly to Xn×d.
• The squared norm of the residual is
RSS = ‖r‖² = ‖Xw − y‖²
= ‖USVtw − y‖²
= ‖U(SVtw − Uty)‖²
= ‖SVtw − Uty‖².
Last equation: U is orthogonal ⇒ ‖Ua‖² = atUtUa = ata = ‖a‖².
• Minimizing RSS is equivalent to minimizing ‖Sz − c‖², where z = Vtw and c = Uty.
SVD and LS
Recall: Columns ui of Un×n with σi > 0 form a basis of C(X). Remaining columns form basis of N(Xt):
c = Uty has components ci = uity, i = 1, . . . , n. Splitting the sum y = UUty = Σi ci ui into the two blocks of columns gives

y = Σi=1..d ci ui + Σi=d+1..n ci ui,
      ∈ C(X)          ∈ N(Xt)

so (c1, . . . , cd) are the coordinates of y with respect to the basis of C(X), and (cd+1, . . . , cn) those with respect to the basis of N(Xt).
SVD and bases for the 4 subspaces
[Figure: the four fundamental subspaces of A ∈ Rm×n: row space (basis v1, . . . , vr, dim r) and nullspace (vr+1, . . . , vn, dim n − r) in Rn; column space (u1, . . . , ur, dim r) and left nullspace (ur+1, . . . , um, dim m − r) in Rm. Avi = σi ui for i = 1, . . . , r; Av = 0 for v in the nullspace.]
SVD and LS
• ‖r‖² = ‖Sz − c‖² written in blocks: S has the diagonal block diag(σ1, . . . , σd) on top and n − d zero rows below, so

‖r‖² = ‖ (σ1z1 − c1, . . . , σdzd − cd, −cd+1, . . . , −cn) ‖².

• Choosing z so that ‖r‖² is minimal requires zi = ci/σi, i = 1, . . . , d ⇒ r1 = r2 = · · · = rd = 0.
• Unavoidable error: RSS = ‖r‖² = cd+1² + cd+2² + · · · + cn².
• For very small singular values, use zeroing (e.g. set zd = 0). RSS will increase:
one additional term (usually small): RSS′ = cd² + cd+1² + cd+2² + · · · + cn², but often significantly better precision (reduced condition number).
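A sketch of the SVD route with zeroing of tiny singular values (the data, the nearly dependent column, and the truncation threshold are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
X[:, 2] = X[:, 0] + 1e-10 * rng.standard_normal(50)   # nearly dependent column
y = rng.standard_normal(50)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = U.T @ y
keep = s > 1e-8 * s[0]       # zero out tiny singular values (threshold is a choice)
z = np.zeros_like(s)
z[keep] = c[keep] / s[keep]  # z_i = c_i / sigma_i for the retained sigma_i
w = Vt.T @ z

print(np.linalg.norm(w))     # stays bounded despite the near-dependence
```

Without the zeroing step, the 1/σ amplification along the near-null direction would blow the solution up by ten orders of magnitude.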
Classification
Classification: Find class boundaries based on training data {(x1, y1), . . . , (xn, yn)}. Use boundaries to classify new items x∗.
Here, yi is a discrete class indicator (or “label”). Example: Fish-packing plant wants to automate the process of sorting fish on conveyor belt using optical sensing.
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright c2001 by John Wiley & Sons, Inc.
(Duda, Hart, Stork, 2001)
Linear Discriminant Analysis (Ronald Fisher, 1936)
FIGURE 3.5. Projection of the same set of samples onto two different lines in the di- rections marked w. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright c 2001 by John Wiley & Sons, Inc.
(Duda, Hart, Stork, 2001)
Main Idea: Simplify the problem by projecting down to a 1-dim subspace.
Question: How should we select the projection vector, which optimally discriminates between the different classes?
Separation Criterion
• Let my be an estimate of the class mean µy:
my = (1/ny) Σx∈class y x, ny = #(objects in class y).
• Projected samples: x′i = wtxi, i = 1, 2, . . . , n. Projected means:
m̃y = (1/ny) Σx∈class y wtx = wtmy.
• First part of the separation criterion (two-class case):
maxw [wt(m1 − m2)]² = maxw [m̃1 − m̃2]².
• There might still be considerable overlap...
⇒ we should also consider the scatter or variance.
Separation Criterion
Two Gaussians with the same mean distance, but different variances:
Excursion: The multivariate Gaussian distribution
[Figure: contour plot and 3D surface of a bivariate Gaussian density over (x1, x2).]
Probability density function:
p(x; µ, Σ) = (1/√((2π)p |Σ|)) exp(−(1/2)(x − µ)tΣ⁻¹(x − µ)), for x ∈ Rp.
Excursion: The multivariate Gaussian distribution
Covariance
(also written “co-variance”) is a measure of how much two random variables vary together. Can be positive, zero, or negative.
[Figure: scatter plots of bivariate samples illustrating positive, (near-)zero, and negative covariance between x1 and x2.]
Sample covariance matrix Σ̂ = (1/n) Σi=1..n (xi − x̄)(xi − x̄)t, with sample mean x̄ = (1/n) Σi=1..n xi = m. If m = 0 ⇒ Σ̂ = (1/n) XtX.
Separation Criterion
• Assume both classes are Gaussians with the same covariance matrix.
Let ΣW be an estimate of this “within class” covariance matrix:
Σy = (1/ny) Σx∈class y (x − my)(x − my)t, ΣW = 0.5 (Σ1 + Σ2).
• Variance of the projected data:
Σ̃y = (1/ny) Σx∈class y (wtx − m̃y)(wtx − m̃y)t
= (1/ny) Σx∈class y wt(x − my)(x − my)tw = wtΣyw,
Σ̃W = 0.5 (Σ̃1 + Σ̃2) = wtΣWw ∈ R+.
• Strategy: ∆m̃² = (m̃1 − m̃2)² should be large, Σ̃W small.
Separation Criterion
J(w) = ∆m̃² / Σ̃W = [wt (m1 − m2)(m1 − m2)t w] / [wtΣWw] = (wtΣBw)/(wtΣWw), with ΣB := (m1 − m2)(m1 − m2)t.

∂J(w)/∂w = ∂/∂w [(wtΣBw)/(wtΣWw)] = 0:

− (wtΣBw)/(wtΣWw)² · 2ΣWw + 1/(wtΣWw) · 2ΣBw = 0
⇒ (wtΣBw)/(wtΣWw) · (−ΣWw) + ΣBw = 0
⇒ ΣBw = (wtΣBw)/(wtΣWw) · ΣWw =: λ ΣWw.
Separation Criterion
• Let ΣW be non-singular:
ΣW⁻¹ΣB w = λw, with λ = (wtΣBw)/(wtΣWw) = J(w).
Note ΣBw = ∆m ∆mt w ∝ ∆m, with ∆m := m1 − m2.
• Thus, w is an eigenvector of ΣW⁻¹ΣB; the associated eigenvalue is the objective function! Maximum: eigenvector with largest eigenvalue.
• Unscaled solution: ŵ = ΣW⁻¹∆m = ΣW⁻¹(m1 − m2).
• This is the solution of the linear system ΣWw = m1 − m2.
• ΣW is a covariance matrix ⇒ there is an underlying data matrix A such that ΣW ∝ AtA ⇒ potential numerical problems: squared condition number compared to A...
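A sketch of the two-class case on synthetic Gaussian data; here ΣW is estimated as the average of the two class covariances, and w comes from solving the linear system rather than inverting ΣW:

```python
import numpy as np

rng = np.random.default_rng(4)
cov = np.array([[2.0, 1.0], [1.0, 1.5]])
L = np.linalg.cholesky(cov)
X1 = rng.standard_normal((500, 2)) @ L.T + np.array([1.0, 0.0])   # class 1
X2 = rng.standard_normal((500, 2)) @ L.T + np.array([-1.0, 0.0])  # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1.T, bias=True)          # class covariance estimates
S2 = np.cov(X2.T, bias=True)
SW = 0.5 * (S1 + S2)                  # "within class" covariance

w = np.linalg.solve(SW, m1 - m2)      # w_hat = SW^{-1} (m1 - m2)
print((X1 @ w).mean() - (X2 @ w).mean())   # projected means are separated
```

The resulting direction maximizes the Fisher criterion J(w); any naive direction, e.g. m1 − m2 itself, scores no higher.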
Discriminant analysis and least squares
Theorem: The LDA vector ŵLDA = ΣW⁻¹(m1 − m2) coincides (up to scaling) with the solution of the LS problem ŵLS = arg minw ‖Xw − y‖² if
n1 = #(samples in class 1), n2 = #(samples in class 2),
X has rows x1t, x2t, . . . , xnt, y = (y1, y2, . . . , yn)t,
with (1/n) Σi=1..n xi = m = 0 (i.e. origin in the sample mean), and
yi = +1/n1 if xi is in class 1, yi = −1/n2 else ⇒ Σi=1..n yi = 0.
Discriminant analysis and least squares (cont’d)
• “Within” covariance ΣW ∝ Σy Σx∈class y (x − my)(x − my)t.
• “Between” covariance ΣB ∝ (m1 − m2)(m1 − m2)t.
• The sum of both is the “total covariance”: ΣT = ΣB + ΣW, with ΣT ∝ Σi xixit = XtX.
• We know that wLDA ∝ ΣW⁻¹(m1 − m2) ⇒ ΣW wLDA ∝ (m1 − m2).
• Now ΣB wLDA = (m1 − m2)(m1 − m2)t wLDA ⇒ ΣB wLDA ∝ (m1 − m2).
• ΣT wLDA = (ΣB + ΣW) wLDA ⇒ ΣT wLDA ∝ (m1 − m2).
• With XtX = ΣT and Xty = m1 − m2, we arrive at
wLDA ∝ ΣT⁻¹(m1 − m2) = ΣT⁻¹Xty ∝ (XtX)⁻¹Xty = wLS.
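The theorem can be checked numerically (synthetic data; ΣW is taken proportional to the pooled within-class scatter, which leaves the direction unchanged):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 30, 50
X1 = rng.standard_normal((n1, 3)) + 1.0
X2 = rng.standard_normal((n2, 3)) - 1.0
X = np.vstack([X1, X2])
X = X - X.mean(axis=0)                      # center: origin in the sample mean
y = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])

m1 = X[:n1].mean(axis=0)
m2 = X[n1:].mean(axis=0)
SW = ((X[:n1] - m1).T @ (X[:n1] - m1)       # pooled within-class scatter
      + (X[n1:] - m2).T @ (X[n1:] - m2))

w_lda = np.linalg.solve(SW, m1 - m2)
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# the two vectors agree up to scaling:
print(w_lda / np.linalg.norm(w_lda))
print(w_ls / np.linalg.norm(w_ls))
```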
Application Example: Secondary Structure Prediction in Proteins
By Thomas Shafee, https://commons.wikimedia.org/w/index.php?curid=52821069
Short historical Introduction
• Genetics as a natural science started in 1866: Gregor Mendel performed experiments that pointed to the existence of
biological elements called genes.
• Deoxy-ribonucleic acid (DNA) isolated by Friedrich Miescher in 1869.
• 1944: Oswald Avery (and coworkers) identified DNA as the major carrier of genetic material, responsible for inheritance.
Ribose: (simple) sugar molecule; deoxy-ribose: the same sugar with loss of an oxygen atom.
Nucleic acid: overall name for DNA and RNA (large biomolecules). Named for their initial discovery in nucleus of cells, and for presence of phosphate groups (related to phosphoric acid).
By Miranda19983 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=84120486
Short historical Introduction
• 1953, Watson & Crick: 3-dimensional structure of DNA. They inferred the method of DNA replication.
• 2001: first draft of the human genome published by the Human Genome Project and the company Celera.
• Many new developments, such as Next Generation Sequencing, Deep learning etc.
Input Hidden Output
By RE73 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18862884
Base pairs and the DNA
By Madprime (talk ˆA· contribs) - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1848174
• DNA is composed of 4 basic molecules, the nucleotides.
• Nucleotides are identical up to different nitrogen base: organic molecule with a nitrogen atom that has the chemical properties of a base (due to free electron pair at nitrogen atom).
• Each nucleotide contains phosphate, sugar (of deoxy-ribose type), and one of the 4 bases: Adenine, Guanine, Cytosine, Thymine (A, G, C, T).
• Hydrogen bonds between base pairs:
G ≡ C, A = T.
By OpenStax - https://cnx.org/contents/FPtK1zmh@8.25:fEI3C8Ot@10/Preface, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=30131206
The structure of DNA
• The DNA molecule is directional due to the asymmetrical structure of the sugars which constitute the skeleton: each sugar is connected to the strand upstream at its 5th carbon and to the strand downstream at its 3rd carbon.
• A DNA strand goes from 5′ to 3′. The directions of the two complementary DNA strands are reversed to one another (⇒ reverse complement).
Adapted from https://commons.wikimedia.org/w/index.php?curid=30131206
By Zephyris - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15027555
Replication of DNA
Biological process of producing two replicas of DNA from one original DNA molecule.
Cells have the distinctive property of division ⇒ DNA replication is the most essential part of biological inheritance.
Unwinding ⇒ single bases are exposed on each strand.
Pairing requirements are strict ⇒ the single strands are templates for re-forming an identical double helix (up to mutations).
DNA polymerase: enzyme that catalyzes the synthesis of new DNA.
Genes and Chromosomes
• In higher organisms, DNA molecules are packed in chromosomes.
• Genome: total genetic information stored in the chromosomes.
• Every cell contains a complete set of the genome; differences are due to variable expression of genes.
• A gene is a sequence of nucleotides that encodes the synthesis of a gene product.
By Sponk, Tryphon, Magnus Manske,
https://commons.wikimedia.org/w/index.php?curid=20539140
• Gene expression: the process of synthesizing a gene product (often a protein); the cell controls its timing, location, and amount.
The Central Dogma
Wikipedia
Transcription: making of an RNA molecule from DNA template.
Translation: construction of amino acid sequence from RNA.
⇒ Almost no exceptions (one: retroviruses).
Transcription
By Kelvinsong - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=23086203
https://commons.wikimedia.org/w/index.php?curid=9810855
Translation
• mRNA molecules are translated by ribosomes:
Enzyme that links together amino acids.
• Message is read three bases at a time.
• Initiated by the first AUG codon (codon = nucleotide triplet).
• Covalent bonds (=sharing of electron pairs) are made between adjacent amino acids
⇒ growing chain of amino acids (“polypeptide”).
• When a “stop” codon (UAA, UGA, UAG) is encountered, translation stops.
Wikipedia
By Boumphreyfr - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=7200200
The genetic code
Wikipedia
Highly redundant: only 20 (or 21) amino acids are formed from 4³ = 64 possible combinations.
By Dancojocari. https://commons.wikimedia.org/w/index.php?curid=9176441
Proteins
• Linear polymer of amino acids, linked together by peptide bonds.
Average size ≈ 200 amino acids, can be over 1000.
• To a large extent, cells are made of proteins.
• Proteins determine shape and structure of a cell.
Main instruments of molecular recognition and catalysis.
• Complex structure with four hierarchical levels.
1. Primary structure: amino acid sequence.
2. Different regions form locally regular secondary structures like α-helices and β-sheets.
3. Tertiary structure: packing such structures into one or several 3D domains.
4. Several domains arranged in a quaternary structure.
Molecular recognition
Interaction between molecules through noncovalent bonding
Crystal structure of a short peptide L-Lys-D-Ala-D-Ala (bacterial cell wall precursor) bound to the antibiotic vancomycin through hydrogen bonds.
Catalysis
Increasing the rate of a chemical reaction by adding a substance catalyst.
Wikipedia
Protein Structure: primary to quaternary
Durbin et al., Cambridge University Press
Structure is determined by the primary sequence and its physico-chemical interactions in the medium.
Structure determines functionality.
Secondary Structure
Secondary structure: two main types: β-sheet and α-helix
The School of Biomedical Sciences Wiki
Short-range interactions in the AA chain are important for the secondary structure:
the α-helix performs a 100° turn per amino acid ⇒ a full turn after 3.6 AAs.
Formation of a helix mainly depends on interactions in a 4-AA window.
Example: Cytochrome C2 Precursor
Secondary structure (h=helix) amino acid sequence
hhhhhhhhhhh
MKKGFLAAGVFAAVAFASGAALAEGDAAAGEKVSKKCLACHTFDQGGANKVGPNLFGVFE hhhhhhhh hhhhhhhhh hhhhhhhhh
NTAAHKDDYAYSESYTEMKAKGLTWTEANLAAYVKDPKAFVLEKSGDPKAKSKMTFKLTK hhhhhhhhhhhhh
DDEIENVIAYLKTLK
Given: examples of known helices and non-helices in several proteins ⇒ training set.
Goal: predict, mathematically, the existence and position of α-helices in new proteins.
Classification of Secondary Structure
Idea: Use a sliding window to cut the AA chain into pieces. 4 AAs are enough to capture one full turn ⇒ choose a window of size 5.
Decision Problem: Find a function f(·) that predicts, for each substring in a window, the structure:
f(AADTG) = ”Yes”, if the central AA belongs to an α-helix,
           ”No”, otherwise.
Problem: How should we numerically encode a string like AADTG?
Simple encoding scheme: count the number of occurrences of each AA in the window. First-order approximation; neglects the AA's position within the window.
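The encoding scheme might look as follows (a sketch; the 20-letter alphabet and the helper name encode_window are illustrative, not from the notes):

```python
# the 20 standard amino acids, one count per letter
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(window):
    """Count how often each amino acid occurs in the window."""
    return [window.count(aa) for aa in AMINO_ACIDS]

x = encode_window("AADTG")
print(x)   # A counted twice; D, G, T once each; position information is lost
```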
Example
...RAADTGGSDP...
...xxxhhhhhhx...
...xxxhhhhhhx...
...xxxhhhhhhx...
(black =ˆ structure info about central AA; green =ˆ known secondary structure; red =ˆ sliding window)
A  C  D  …  G  …  R  S  T  …  Y  | Label
2  0  1  …  0  …  1  0  1  …  0  | “No”
2  0  1  …  1  …  0  0  1  …  0  | “Yes”
1  0  1  …  2  …  0  0  1  …  0  | “Yes”
…
This is a binary classification problem ⇒ use Linear Discriminant Analysis.
Discriminant Analysis
Consider Xn×d, with n = #(windows) and d = #(AAs) = 20 (or 21), and the n-vector of class indicators y. X holds one count vector per window as rows x1t, x2t, . . . , xnt, e.g.

X = ( 2 0 1 . . . 0 . . .
      2 0 1 . . . 1 . . .
      1 0 1 . . . 2 . . .
      . . . ),
y = (”No”, ”Yes”, ”Yes”, . . .)t.
For the binary class indicators, we use some numerical encoding scheme.
Interpretation with basis functions:
x = sequence of characters from alphabet A,
gi(x) = #(occurrences of letter i in sequence),
f(x; w) = wtg(x) = Σi∈characters wi gi(x).
Discriminant analysis and least squares
Recall: The LDA vector ŵLDA = ΣW⁻¹(m1 − m2) coincides (up to scaling) with the solution of the LS problem ŵLS = arg minw ‖Xw − y‖² if
n1 = #(samples in class 1), n2 = #(samples in class 2),
X has rows x1t, x2t, . . . , xnt, y = (y1, y2, . . . , yn)t,
with (1/n) Σi=1..n xi = m = 0 (i.e. origin in the sample mean), and
yi = +1/n1 if xi in class 1, −1/n2 else ⇒ Σi=1..n yi = 0.
Singular Value Decomposition (SVD)
Recall: SVD for a nonsquare matrix X ∈ Rn×d: X = USVt. Residual sum of squares:
RSS = ‖r‖² = ‖Xw − y‖² = ‖USVtw − y‖² = ‖Sz − c‖², with z = Vtw and c = Uty.
Minimizing ‖r‖² is equivalent to minimizing ‖Sz − c‖²:

minimize ‖r‖² = ‖ (σ1z1 − c1, . . . , σdzd − cd, −cd+1, . . . , −cn) ‖².

We now choose zk so that ‖r‖² is minimal, i.e., for σk > 0:
zk = ck/σk.
Iterative Algorithm
In our problem we have d = 20 (or 21) and n > 10000.
Goal: Use only XtX ∈ Rd×d and Xty ∈ Rd.
Initialize XtX = 0 (zero matrix), Xty = 0. Update, for j = 1 to n:
XtX + xjxjt → XtX,
Xty + xjyj → Xty.
The first update procedure is correct, since
(XtX)ik = Σj=1..n xji xjk ⇒ XtX = Σj=1..n xjxjt.
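The update loop can be sketched as follows (synthetic data; in the application, each xj would be the count vector of one window, and X is kept in memory here only to simulate the stream):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 20))
y = rng.standard_normal(1000)

d = X.shape[1]
XtX = np.zeros((d, d))
Xty = np.zeros(d)
for xj, yj in zip(X, y):           # one pass over the rows
    XtX += np.outer(xj, xj)        # XtX <- XtX + xj xj^t
    Xty += xj * yj                 # Xty <- Xty + xj yj

print(np.allclose(XtX, X.T @ X), np.allclose(Xty, X.T @ y))
```

Only the d×d matrix and the d-vector are stored, regardless of how large n grows.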
Iterative Algorithm
A similar calculation yields the other equation:
(Xty)i = Σj xji yj ⇒ Xty = Σj=1..n xjyj.
One remaining problem: In LDA we assumed that X was centered, i.e. the column sums are all zero. Compute the column sums as:
1tX = [1, 1, . . . , 1] X = n · [m1, m2, . . . , md] = n · mt
⇒ “centered” Xc = X − 1mt = X − (1/n)11tX.
Centering
Xc = X − 1mt = X − (1/n)11tX
XctXc = XtX + (1/n²)Xt1 (1t1) 1tX − (1/n)Xt11tX − (1/n)Xt11tX
= XtX − (1/n)Xt11tX   (using 1t1 = n)
= XtX − n · mmt.
Iteratively update the vector n·m for every xi corresponding to a new window position: initialize n·m = 0 and update n·m ← n·m + xi.
What about Xty? We should have used
Xcty = (X − 1mt)ty = (Xt − m1t)y = Xty − m · 1ty.
But by construction, y is orthogonal to 1 ⇒ 1ty = 0,
so nothing needs to be done!
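The centering identity can be verified numerically (a sketch on synthetic, deliberately uncentered data):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 200, 5
X = rng.standard_normal((n, d)) + 3.0   # deliberately uncentered data
m = X.mean(axis=0)

Xc = X - m                              # explicit centering, for comparison only
lhs = Xc.T @ Xc
rhs = X.T @ X - n * np.outer(m, m)      # uses only XtX and m
print(np.allclose(lhs, rhs))
```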
Iterative Algorithm
Goal: Solution which only requires XctXc ∈ Rd×d and Xcty ∈ Rd alone (and does not use Xc or y explicitly).
We need:
• The matrix V (for computing ŵ = Vz).
Solution: the columns of V are the eigenvectors of XctXc; the corresponding eigenvalues are λi, i = 1, . . . , d ⇒ σi² = λi.
• For the nonzero SVs, we need zi = (Uty)i/σi = σi(Uty)i/σi².
Solution: Xc = USVt ⇒ VtXcty = VtVStUty = StUty
⇒ zi = (Uty)i/σi = (VtXcty)i/σi².
So z and finally wˆ = V z can be computed from XctXc and Xcty alone!
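A sketch of this recipe on synthetic data, using the eigendecomposition of XctXc (np.linalg.eigh) in place of an SVD of Xc:

```python
import numpy as np

rng = np.random.default_rng(8)
Xc = rng.standard_normal((500, 4))
Xc = Xc - Xc.mean(axis=0)               # centered data matrix
y = rng.standard_normal(500)

A = Xc.T @ Xc                           # all we keep: Xc^t Xc ...
b = Xc.T @ y                            # ... and Xc^t y

lam, V = np.linalg.eigh(A)              # lam_i = sigma_i^2, eigenvectors in V
z = (V.T @ b) / lam                     # z_i = (V^t Xc^t y)_i / sigma_i^2
w = V @ z                               # w_hat = V z

w_ref, *_ = np.linalg.lstsq(Xc, y, rcond=None)
print(np.allclose(w, w_ref))
```

Here all σi are nonzero; in practice tiny eigenvalues would be zeroed out as discussed above.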
Least-squares and dimensionality reduction
Given n data points in d dimensions:
X ∈ Rn×d with rows x1t, x2t, . . . , xnt.
Want to reduce dimensionality from d to k. Choose k directions w1, . . . , wk, arranged as columns in the matrix W = [w1 w2 . . . wk] ∈ Rd×k.
Project x ∈ Rd down to z = Wtx ∈ Rk. How to choose W?
Encoding–decoding model
The projection matrix W serves two functions:
• Encode: z = Wtx, z ∈ Rk, zj = wtjx.
– The vectors wj form a basis of the projected space.
– We will require that this basis is orthonormal, i.e. WtW = I.
• Decode: x̃ = Wz = Σj=1..k zjwj, x̃ ∈ Rd.
– If k = d, the above orthonormality condition implies Wt = W−1, and encoding can be undone without loss of information.
– If k < d, the decoded x̃ can only approximate x ⇒ the reconstruction error will be nonzero.
• Note that we did not include an intercept term. Assumption: the origin of the coordinate system is in the sample mean, i.e. Σi xi = 0.
Principal Component Analysis (PCA)
We want the reconstruction error ‖x − x̃‖² to be small.
Objective: minimize, over W ∈ Rd×k with WtW = I,
Σi=1..n ‖xi − WWtxi‖².
Finding the principal components
Projection vectors are orthogonal ⇒ can treat them separately:

minw: ‖w‖=1 Σi=1..n ‖xi − wwtxi‖².

Σi ‖xi − wwtxi‖² = Σi=1..n [xitxi − 2xitwwtxi + xitw (wtw) wtxi]   (wtw = 1)
= Σi [xitxi − xitw wtxi]
= Σi xitxi − Σi wtxi xitw
= Σi xitxi − wt (Σi=1..n xixit) w
= const − wtXtXw.
Finding the principal components
• Want to maximize wtXtXw under the constraint kwk = 1
• Can also maximize the ratio J(w) = (wtXtXw)/(wtw).
• Optimal projection w is the eigenvector of XtX with largest eigenvalue (compare handout on spectral matrix norm).
• We assumed Σi xi = 0, i.e. the columns of X sum to zero.
⇒ compute the SVD of the “centered” matrix Xc;
the column vectors in W are eigenvectors of XctXc ⇒ they are the principal components.
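A sketch of PCA along these lines (synthetic data); it also checks that the total reconstruction error equals the sum of the discarded eigenvalues of XctXc:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, k = 300, 6, 2
Xc = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
Xc = Xc - Xc.mean(axis=0)                   # centered data

lam, V = np.linalg.eigh(Xc.T @ Xc)          # eigenvalues in ascending order
W = V[:, ::-1][:, :k]                       # top-k eigenvectors as columns

Z = Xc @ W                                  # encode: z_i = W^t x_i
X_rec = Z @ W.T                             # decode: x_tilde_i = W z_i

err = np.sum((Xc - X_rec) ** 2)
print(err, lam[:-k].sum())                  # error = sum of discarded eigenvalues
```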
Eigen-faces [Turk and Pentland, 1991]
• d = number of pixels
• Each xi ∈ Rd is a face image
• xij = intensity of the j-th pixel in image i, xi ≈ WWtxi = Wzi.
• In matrix form: (Xt)d×n ≈ Wd×k (Zt)k×n, with Zt = [z1 . . . zn].
Conceptual: We can learn something about the structure of face images.
Computational: Can use zi for efficient nearest-neighbor classification:
much faster when k ≪ d.