Historical start: Microarray data (Golub et al., 1999)

(1)

Regression and classification

Let X be a p-dimensional predictor variable and Y the target variable of interest. Assume a linear model in which

Regression: $Y \in \mathbb{R}$,

$Y = X\beta^* + \varepsilon$,

Classification: $Y \in \{0,1\}$ or $\{-1,1\}$,

$P(Y = 1) = f(X\beta^*)$, where $f(x) = 1/(1 + \exp(-x))$,

for some (sparse) vector $\beta^* \in \mathbb{R}^p$ and noise $\varepsilon \in \mathbb{R}$.

Regression (or classification) is high-dimensional if $p \gg n$, where n is the number of observations.
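The two models above can be simulated in a few lines. This is only a sketch: the sizes n, p, the five non-zero coefficients, the Gaussian design and the unit noise variance are all assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes with p >> n (assumed, not taken from a real data set)
n, p, s = 72, 3000, 5

X = rng.standard_normal((n, p))          # n observations of a p-dimensional predictor
beta_star = np.zeros(p)                  # sparse coefficient vector beta*
beta_star[:s] = [2.0, -1.5, 1.0, -1.0, 0.5]

# Regression: Y = X beta* + eps
eps = rng.standard_normal(n)
y_reg = X @ beta_star + eps

# Classification: P(Y = 1) = f(X beta*) with the logistic function f
prob = 1.0 / (1.0 + np.exp(-(X @ beta_star)))
y_clf = rng.binomial(1, prob)

print(X.shape, np.count_nonzero(beta_star), y_clf[:10])
```

Only s = 5 of the p = 3000 coefficients are non-zero, which is the sparsity the later slides rely on.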

(2)

Historical start: Microarray data (Golub et al., 1999)

Gene expression levels of more than 3000 genes are measured for n = 72 patients, either suffering from acute lymphoblastic leukemia (“X”, 47 cases) or acute myeloid leukemia (“O”, 25 cases). Obtained from Affymetrix oligonucleotide microarrays.

(3)

A look at (a binary version of) the data for a subset of patients and genes.

Here gene 1 is modelled as either on (above-average activity; filled green square) or off (below-average activity; empty square).

[Figure: activity of gene 1 for each person; people grouped as AML, ALL and ?.]

(4)

[Figure: activity of gene 2 for each person; people grouped as AML, ALL and ?.]

(5)

[Figure: activity of gene 20 for each person; people grouped as AML, ALL and ?.]

(6)

[Figure: activity of gene 60 for each person; people grouped as AML, ALL and ?.]

(7)

We have more variables (genes) than observations (patients):

high-dimensional data

(8)

[Figure: red bars for people grouped as AML, ALL and ?.]

Red bars show three types of people:

AML: known to have acute myeloid leukemia
ALL: known to have acute lymphocytic leukemia
?: we don't know which subtype it is

(9)

select first gene 8 times... (non-integer values are also allowed)

select second gene 9 times...

select third gene once...

select fourth gene 4 times...

select fifth gene not at all, sixth gene 7 times...

[Figure at each step: people grouped as AML, ALL and ?.]
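Selecting a gene "8 times" can be read as giving it weight 8 in a linear score. The sketch below uses the weights named on the slides for genes 1 to 6; the on/off activity pattern of the person is made up for illustration.

```python
# Weights for genes 1..6 as named on the slides (0 means "not selected")
weights = [8, 9, 1, 4, 0, 7]

def score(activity, weights=weights):
    """Weighted sum of binary gene activities (1 = on, 0 = off)."""
    return sum(w * a for w, a in zip(weights, activity))

# Hypothetical person: genes 1, 3, 4 and 5 are on, genes 2 and 6 are off
person = [1, 0, 1, 1, 1, 0]
print(score(person))  # 13
```

Non-integer weights work the same way, since the score is just a weighted sum.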

(14)

[Figure, built up over successive slides: people with known type (AML, ALL) and people with unknown type (?). Question posed for the unknown person: type "ALL"?]

(64)

Selecting a small subset of variables

How do we get the best set of 10 genes out of all available variables?

- If we check all possible sets of 10 genes out of 60 genes in total, using a computer that checks a million sets per second, it takes about 20.9 hours ≈ 1 day.

- If we have to select the best set of 10 genes out of 3000 genes, and have a thousand such machines, it takes about 500 × the estimated time since the Big Bang.
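The first figure follows directly from the number of subsets, C(60, 10). The sketch below reproduces it and sets up the same calculation for 3000 genes. The age of the universe (13.8 billion years) is an assumed constant, and the multiple it prints depends on these assumed constants, so it need not match the slide's figure exactly.

```python
import math

def exhaustive_search_seconds(p, k, sets_per_second=1e6, machines=1):
    """Seconds needed to evaluate every size-k subset of p variables."""
    return math.comb(p, k) / (sets_per_second * machines)

# 10 genes out of 60, one machine
hours_60 = exhaustive_search_seconds(60, 10) / 3600
print(f"{hours_60:.1f} hours")  # 20.9 hours

# 10 genes out of 3000, a thousand machines
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600  # assumed value
ratio_3000 = exhaustive_search_seconds(3000, 10, machines=1000) / AGE_OF_UNIVERSE_S
print(f"{ratio_3000:.3g} times the assumed age of the universe")
```

Either way, exhaustive search is out of reach, which motivates the convex relaxations that follow.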

(65)

Basis Pursuit (Chen et al. 99) and Lasso (Tibshirani 96)

Let Y be the n-dimensional response vector and X the n×p-dimensional design.

Basis Pursuit:

$\hat\beta = \arg\min_\beta \|\beta\|_1$ such that $Y = X\beta$.

Lasso:

$\hat\beta^\tau = \arg\min_\beta \|\beta\|_1$ such that $\|Y - X\beta\|_2 \le \tau$.

Equivalent to

$\hat\beta^\lambda = \arg\min_\beta \|Y - X\beta\|_2 + \lambda\|\beta\|_1$.

Combines sparsity (some $\hat\beta$-components are 0) and convexity.
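A minimal sketch of how a Lasso solution can be computed, using proximal gradient descent (ISTA). Assumptions: it minimises the common squared-loss variant (1/(2n))·‖Y − Xβ‖₂² + λ‖β‖₁ rather than the unsquared form written above, and the simulated data, λ = 0.1 and the iteration count are illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise (1/(2n)) * ||y - X b||_2^2 + lam * ||b||_1 by ISTA."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Simulated high-dimensional example (assumed values)
rng = np.random.default_rng(0)
n, p = 50, 200
beta_star = np.zeros(p)
beta_star[:3] = [3.0, -2.0, 1.5]
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)

beta_hat = lasso_ista(X, y, lam=0.1)
print(np.flatnonzero(beta_hat))        # indices of non-zero estimated coefficients
print(np.round(beta_hat[:3], 2))       # estimates on the true support
```

The soft-thresholding step sets small coefficients exactly to zero, which is where the sparsity of the estimate comes from.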

(67)

(68)

(69)

When does it work?

For prediction: oracle inequalities in the sense that

$\|X(\hat\beta - \beta^*)\|_2^2 / n \le c\,\sigma^2 \frac{s \log(p)}{n}$

for some constant c > 0 and noise variance σ² > 0 need the Restricted Isometry Property (Candes, 2006) or the weaker compatibility condition (Geer, 2008). Slower convergence rates are possible with weaker assumptions (Greenstein and Ritov, 2004).

For correct variable selection in the sense that

$P\big(\exists \lambda : \{k : \hat\beta_k^\lambda \ne 0\} = \{k : \beta_k^* \ne 0\}\big) \approx 1$,

need the strong irrepresentable condition (Zhao and Yu, 2006) or the neighbourhood stability condition (NM and Bühlmann, 2006).
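The slide names the irrepresentable condition without stating it. As it is usually written following Zhao and Yu (2006), it asks that the largest absolute entry of X_{S^c}ᵀ X_S (X_Sᵀ X_S)⁻¹ sign(β*_S) stays below 1. The sketch below evaluates that quantity; the Gaussian design, the support and the signs are assumptions for illustration.

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """Largest absolute entry of X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta_S)."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS, XSc = X[:, S], X[:, Sc]
    # Solve (X_S^T X_S) v = sign(beta_S) instead of forming the inverse
    v = np.linalg.solve(XS.T @ XS, np.asarray(signs, dtype=float))
    return float(np.max(np.abs(XSc.T @ (XS @ v))))

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
value = irrepresentable_value(X, support=[0, 1, 2], signs=[1, -1, 1])
print(value)  # the condition holds if this is below 1
```

For strongly correlated columns the value can exceed 1, and the Lasso may then fail to recover the correct support for any λ.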

(70)

Compatibility condition

The usual minimal eigenvalue of the design,

$\min\{\|X\beta\|_2^2 : \|\beta\|_2 = 1\}$,

always vanishes for high-dimensional data with p > n.

Let φ be the (L,S)-restricted eigenvalue (Geer, 2007):

$\phi^2(L,S) = \min\{ s\,\|X\beta\|_2^2 : \|\beta_S\|_1 = 1 \text{ and } \|\beta_{S^c}\|_1 \le L \}$,

where $S = \{k : \beta_k^* \ne 0\}$, $s = |S|$, and $(\beta_S)_k = \beta_k 1\{k \in S\}$.
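The vanishing minimal eigenvalue is easy to see numerically: with p > n the matrix XᵀX has rank at most n, so there is a unit vector β with Xβ = 0. A small sketch with an assumed random design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                       # more variables than observations
X = rng.standard_normal((n, p))

# Eigenvalues of X^T X in ascending order; the smallest is zero up to rounding
eigenvalues = np.linalg.eigvalsh(X.T @ X)
print(eigenvalues[0])

# A unit vector in the null space of X: last right singular vector
_, _, Vt = np.linalg.svd(X)         # Vt has shape (p, p)
beta = Vt[-1]
print(np.linalg.norm(beta), np.linalg.norm(X @ beta))
```

The restricted eigenvalue avoids this problem by minimising only over vectors whose mass is concentrated on S.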

(72)

If φ(L,S) > c > 0 for some L > 1, then we get oracle rates for prediction and convergence of $\|\beta^* - \hat\beta^\lambda\|_1$.

If φ(1,S) > 0, then the following two are identical:

$\arg\min_\beta \|\beta\|_0$ such that $X\beta = X\beta^*$,

$\arg\min_\beta \|\beta\|_1$ such that $X\beta = X\beta^*$.

The latter equivalence otherwise requires the stronger Restricted Isometry Property, which implies that ∃ δ < 1 such that

$\forall b \text{ with } \|b\|_0 \le s: \quad (1-\delta)\|b\|_2^2 \le \|Xb\|_2^2 \le (1+\delta)\|b\|_2^2$,

which can be a useful assumption for random designs X, as in compressed sensing.
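The isometry inequality can be illustrated, though not verified, by sampling. The sketch below draws random s-sparse vectors b and records ‖Xb‖₂²/‖b‖₂² for a Gaussian design scaled by 1/√n. Sampling only probes a few sparse vectors, so it gives no guarantee over all of them; the sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, n_trials = 200, 1000, 5, 1000

# Gaussian design, scaled so that E ||X b||^2 = ||b||^2
X = rng.standard_normal((n, p)) / np.sqrt(n)

ratios = np.empty(n_trials)
for t in range(n_trials):
    b = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)   # random support of size s
    b[idx] = rng.standard_normal(s)
    ratios[t] = np.sum((X @ b) ** 2) / np.sum(b ** 2)

print(ratios.min(), ratios.max())   # both should lie near 1
```

The spread of the ratios around 1 gives a rough feel for the size of δ on the sampled vectors.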

(74)

Applications of linear models

(76)

Medical data

OMOP: Observational Medical Outcomes Project (omop.org)

1. Collect medical information (drugs taken, symptoms diagnosed) for 100,000 patients.

2. In total, about 15,000 drugs and 15,000 distinct symptoms are encoded.

(79)

Try to detect drug-drug interactions or make risk assessments based on medical data:

Is drug A changing the risk of a heart attack if taken together with drug B for patients with a symptom S?

Expanding interactions as new dummy variables quickly generates very high-dimensional data (more than 10¹² interactions of third order).
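The order of magnitude can be checked by counting. The sketch assumes that a third-order interaction means an unordered pair of drugs combined with one symptom, as in the drug A, drug B, symptom S question; other ways of counting give different constants.

```python
import math

n_drugs, n_symptoms = 15_000, 15_000

drug_pairs = math.comb(n_drugs, 2)          # unordered pairs (drug A, drug B)
triples = drug_pairs * n_symptoms           # each pair combined with one symptom S
print(f"{triples:.2e}")  # 1.69e+12
```

With 100,000 patients this gives far more dummy variables than observations.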

(80)

Compressed sensing: one-pixel camera

Images are often sparse after taking a wavelet transformation X: u = Xw, where

w ∈ Rⁿ: original image as n-dimensional vector

X ∈ R^{n×n}: wavelet transformation

u ∈ Rⁿ: vector with wavelet coefficients

(82)

Original wavelet transformation:

u = Xw.

The wavelet coefficients u are often sparse in the sense that u has only a few large entries. Keeping just a few of them allows a very good reconstruction of the original image w.

Let $\tilde u = u\,1\{|u| \ge \tau\}$ be the hard-thresholded coefficients (easy to store).

Then reconstruct the image as $\tilde w = X^{-1}\tilde u$.
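Hard thresholding and reconstruction can be sketched on a one-dimensional signal. Assumptions: an orthonormal Haar matrix stands in for the wavelet transformation X (so X⁻¹ = Xᵀ), and the signal w, its length 64 and the threshold τ = 0.1 are made up for illustration.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar transform matrix of size n x n (n a power of 2)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # coarser-scale rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale differences
    return np.vstack([top, bottom]) / np.sqrt(2.0)

n = 64
t = np.arange(n)
w = np.where(t < 20, 1.0, 4.0) + 0.01 * np.sin(t)   # step plus a small ripple

X = haar_matrix(n)
u = X @ w                                # wavelet coefficients
tau = 0.1
u_tilde = u * (np.abs(u) >= tau)         # hard thresholding
w_tilde = X.T @ u_tilde                  # X is orthonormal, so X^{-1} = X^T

kept = int(np.count_nonzero(u_tilde))
rel_err = np.linalg.norm(w - w_tilde) / np.linalg.norm(w)
print(kept, rel_err)
```

Only a handful of the 64 coefficients survive the threshold, yet the reconstruction stays close to w.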

(83)

Conventional way:

- measure image w with 16 million pixels
- convert to wavelet coefficients u = Xw
- throw away most of u by keeping just the largest coefficients

This is efficient as long as pixels are cheap.

(84)

For situations where pixels are expensive (different wavelengths, MRI) we can do compressed sensing: observe only

$y = \Phi u = \Phi(Xw)$,

where, for $q \ll n$, the matrix $\Phi \in \mathbb{R}^{q \times n}$ has iid entries drawn from N(0,1).

Each entry of the q-dimensional vector y is thus observed through a random transformation of the original image.

(Pseudo) Random Optical Projections

• Binary patterns are loaded into mirror array:

– light reflected towards the lens/photodiode (1)

– light reflected elsewhere (0)

– pixel-wise products summed by lens

• Pseudorandom number generator outputs measurement basis vectors …

Each random mask corresponds to one row of Φ.

Reconstruct u by Basis Pursuit:

$\hat u = \arg\min_{\tilde u} \|\tilde u\|_1$ such that $\Phi\tilde u = y$.

(85)

Observe

$y = \Phi u = \Phi(Xw)$.

Reconstruct wavelet coefficients u by Basis Pursuit:

$\hat u = \arg\min_{\tilde u} \|\tilde u\|_1$ such that $\Phi\tilde u = y$.

For $q \ge s\log(n/s)$, the matrix Φ satisfies with high probability the Restricted Isometry Property, including the existence of a δ < 1 such that (Candes, 2006), for all s-sparse vectors b,

$(1-\delta)\|b\|_2^2 \le \|\Phi b\|_2^2 \le (1+\delta)\|b\|_2^2$.

Hence, if the original wavelet coefficients are s-sparse, we only need to make of order $s\log(n/s)$ measurements to recover u exactly (with high probability)!
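Basis Pursuit is a linear program, so exact recovery can be tried out directly. The sketch below splits ũ into positive and negative parts and calls SciPy's `linprog`. The sizes n, q, s and the sparse vector u are assumptions for illustration, and the wavelet step is left out (u is generated directly).

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, y):
    """Solve min ||u||_1 subject to Phi u = y, writing u = u_plus - u_minus."""
    q, n = Phi.shape
    c = np.ones(2 * n)                       # sum of u_plus and u_minus equals ||u||_1
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
n, q, s = 128, 40, 4                         # q << n measurements, s-sparse signal

u = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
u[support] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(1.0, 3.0, size=s)

Phi = rng.standard_normal((q, n))            # iid N(0,1) measurement matrix
y = Phi @ u                                  # q random projections of u

u_hat = basis_pursuit(Phi, y)
max_err = float(np.max(np.abs(u_hat - u)))
print(max_err)                               # close to zero when recovery succeeds
```

Here 40 measurements are used for a 128-dimensional vector with 4 non-zero entries, comfortably above s·log(n/s) ≈ 14.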

(86)

Rice CI Camera

[Figure: optical path of the camera: object light, lens 1, DMD+ALP board, lens 2, photodiode circuit. Source: dsp.rice.edu/cs/camera]

(88)

Image Acquisition

[Figure: image acquisition with the camera. Source: dsp.rice.edu/cs/camera]

(89)

Mind reading

We can use Lasso-type inference to infer, for a single voxel in the early visual cortex, which stimuli lead to neuronal activity, using fMRI measurements (Nishimoto et al., 2011 at Gallant Lab, UC Berkeley).

Voxel A

Show movies and detect which parts of the image a particular voxel of 100k neurons is sensitive to.

(90)

[Figure: Voxel A, Voxel B, Voxel C]

page 22

December 10, 2012

Back to fMRI problem:

Spatial Locations of Selected Features

[Figure: selected features under CV and ES-CV]

Prediction on Voxels A-C: CV 0.72, ES-CV 0.7

Dots indicate large regression coefficients and thus important regions for a region/voxel in the brain:

- Voxel A is stimulated by activity in the centre-left of the visual field
- Voxel B is stimulated by activity in the top right of the visual field
- Voxel C is stimulated by activity in the very centre of the visual field

(91)

This allows us to forecast brain activity at all voxels, given an image.

[Figure: an image and the predicted activity (?) at Voxel A]

(92)

Given only brain activity, we can reverse the process and ask which image best explains the neuronal activity (given the learned regressions).
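One simple way to reverse the process is to pick, from a set of candidate images, the one whose predicted activity is closest to the observed activity. This is a sketch under that assumption and is not claimed to be the method of Nishimoto et al.; the coefficient matrix B, the candidate images and the noise level are simulated stand-ins.

```python
import numpy as np

def identify_image(activity, B, candidates):
    """Index of the candidate whose predicted activity best matches the observation."""
    predicted = candidates @ B.T                         # (n_candidates, n_voxels)
    errors = np.sum((predicted - activity) ** 2, axis=1)
    return int(np.argmin(errors))

rng = np.random.default_rng(0)
n_voxels, n_pixels, n_candidates = 50, 100, 20

B = rng.standard_normal((n_voxels, n_pixels))            # stand-in for learned regressions
candidates = rng.standard_normal((n_candidates, n_pixels))

true_index = 7                                           # image actually shown (hypothetical)
activity = B @ candidates[true_index] + 0.1 * rng.standard_normal(n_voxels)

best = identify_image(activity, B, candidates)
print(best)
```

Each row of B plays the role of one voxel's regression coefficients, for example as estimated by the Lasso.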

(93)

Top: seen image/movie

Bottom: image reconstructed from brain activity

(94)