(1)

P-values and confidence intervals for high-dimensional problems

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2015

(2)

High-dimensional data

Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

- n = 10525 persons
- genetic information (SNPs): p ≈ 10^6
- 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: number of significant target SNPs per phenotype, plotted against the phenotype index (1-79)]

(3)

... and let's have a look at Nature 496, 398 (25 April 2013)

Challenges in irreproducible research ...

"the complexity of the system and of the techniques ... do not stand the test of further studies"

- "We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data."
- "We will also demand more precise descriptions of statistics, and we will commission statisticians as consultants on certain papers, at the editor's discretion and at the referees' suggestion."
- "Too few budding scientists receive adequate training in statistics and other quantitative aspects of their subject."


(5)

statistics is important...

and its mathematical roots as well!


(7)

P-values for high-dimensional linear models

Y = Xβ^0 + ε

want uncertainty quantification!

goal: statistical hypothesis testing

H_{0,j}: β_j^0 = 0  or  H_{0,G}: β_j^0 = 0 for all j ∈ G ⊆ {1, . . . , p}

background: if we could handle the asymptotic distribution of the Lasso β̂(λ) under the null hypothesis
⇝ could construct p-values; but this is very difficult!

the asymptotic distribution of β̂ has some point mass at zero, ...
Knight and Fu (2000) for p < ∞ and n → ∞


(9)

⇝ standard bootstrapping and subsampling cannot be used

(10)

Low-dimensional projections and bias correction

or: de-sparsifying the Lasso estimator

related work by Zhang and Zhang (2011; publ. 2014)

motivation:

β̂_{LS,j} from projection of Y onto the residuals (X_j − X_{−j} γ̂^(j)_{LS})

projection not well defined if p > n

⇝ use "regularized" residuals from the Lasso on the X-variables: Z_j = X_j − X_{−j} γ̂^(j)_{Lasso}

(11)

using Y = Xβ^0 + ε ⇝

Z_j^T Y = Z_j^T X_j β_j^0 + Σ_{k≠j} Z_j^T X_k β_k^0 + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β_j^0 + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β_k^0  [bias]  +  Z_j^T ε / (Z_j^T X_j)  [noise component]

⇝ de-sparsified Lasso:

b̂_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β̂_{Lasso;k}  [Lasso-estimated bias correction]
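As a sanity check of the formula above, the de-sparsified Lasso can be sketched in a few lines of NumPy. This is a minimal illustration under an independent Gaussian design, with a homemade ISTA solver standing in for a proper Lasso implementation; all function names are hypothetical and this is not the hdi package's code:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

def desparsified_lasso(X, y, lam, lam_node):
    """Bias-corrected coordinates b_j as on the slide."""
    n, p = X.shape
    beta_hat = lasso_ista(X, y, lam)
    resid = y - X @ beta_hat
    b = np.empty(p)
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        # "regularized" residuals Z_j from the nodewise Lasso of X_j on X_{-j}
        gamma = lasso_ista(X_mj, X[:, j], lam_node)
        Z = X[:, j] - X_mj @ gamma
        # b_j = Z'Y/Z'X_j - sum_{k!=j} (Z'X_k/Z'X_j) beta_hat_k
        #     = beta_hat_j + Z'(Y - X beta_hat) / Z'X_j  (algebraically identical)
        b[j] = beta_hat[j] + Z @ resid / (Z @ X[:, j])
    return b

rng = np.random.default_rng(0)
n, p = 200, 20
beta0 = np.zeros(p); beta0[0] = 2.0; beta0[3] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta0 + 0.5 * rng.standard_normal(n)
b = desparsified_lasso(X, y, lam=0.1, lam_node=0.1)
```

The bias correction undoes the shrinkage of the initial Lasso fit, so each b̂_j is roughly centered at β_j^0 even though β̂_Lasso itself is biased towards zero.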

(12)

b̂_j is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless it is "optimal" (see later)

- target: the low-dimensional component β_j^0
- η := {β_k^0; k ≠ j} is a high-dimensional nuisance parameter, sparsely estimated (e.g. with the Lasso)

⇝ exactly as in semiparametric modeling!


(14)

Asymptotic pivot and optimality

Theorem(van de Geer, PB, Ritov & Dezeure, 2013)

√n (b̂_j − β_j^0) ⇒ N(0, σ_ε^2 Ω_jj)  (j = 1, . . . , p)

Ω_jj has an explicit expression ∼ (Σ^{−1})_jj: optimal!
reaching the semiparametric information bound

⇝ asymptotically optimal p-values and confidence intervals, if we assume:

- population Cov(X) = Σ has minimal eigenvalue ≥ M > 0
- sparsity for the regression Y vs. X: s_0 = o(√n / log(p))  "quite sparse"
- sparsity of the design: Σ^{−1} sparse, i.e. sparse regressions X_j vs. X_{−j}: s_j ≤ o(√(n / log(p)))  may not be realistic
- no beta-min assumption!
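The asymptotic pivot turns b̂_j into standard normal inference. A small sketch of the resulting confidence interval and p-value; the estimate, noise level and Ω_jj entry below are made-up inputs for illustration, not output of any real fit:

```python
import math

def normal_ci_and_pvalue(b_j, sigma_eps, omega_jj, n, alpha=0.05):
    """Two-sided CI and p-value for H_{0,j}: beta_j = 0, based on
    sqrt(n) * (b_j - beta_j) ~ N(0, sigma_eps^2 * omega_jj)."""
    se = sigma_eps * math.sqrt(omega_jj / n)
    z = 1.959963984540054  # standard normal 97.5% quantile (alpha = 0.05)
    ci = (b_j - z * se, b_j + z * se)
    # p-value via the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2))) / 2
    pval = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(b_j) / se / math.sqrt(2.0))))
    return ci, pval

# hypothetical numbers: b_hat_j = 0.8, sigma_eps = 1, Omega_jj = 2, n = 200
ci, pval = normal_ci_and_pvalue(b_j=0.8, sigma_eps=1.0, omega_jj=2.0, n=200)
```

In practice σ_ε and Ω_jj are replaced by consistent estimates; the theorem guarantees that the resulting intervals reach the semiparametric information bound.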


(16)

It is optimal!

Cramér-Rao

(17)

Uniform convergence:

√n (b̂_j − β_j^0) ⇒ N(0, σ_ε^2 Ω_jj)  (j = 1, . . . , p)

the convergence is uniform over B(s_0) = {β; ‖β‖_0 ≤ s_0}

⇝ honest tests and confidence regions!

and we can avoid post-model-selection inference (cf. Pötscher and Leeb)

(18)

Simultaneous inference over all components:

√n (b̂ − β^0) ≈ (W_1, . . . , W_p) ∼ N_p(0, σ_ε^2 Ω)

⇝ can construct p-values for H_{0,G} with any G: test statistic max_{j∈G} |b̂_j|, since the covariance structure Ω is known

and can easily do efficient multiple testing adjustment, since the covariance structure Ω is known!
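Because the null covariance Ω is known, the distribution of the max statistic can be simulated directly. A sketch with a toy 3-dimensional equicorrelated Ω and made-up observed statistics (purely illustrative inputs):

```python
import numpy as np

def group_pvalue_max(stat_obs, G, Sigma_W, n_sim=20000, seed=1):
    """Monte-Carlo p-value for H_{0,G} using the max-|W_j| statistic over G,
    where (W_1, ..., W_p) ~ N_p(0, Sigma_W) is the known null law."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(np.zeros(Sigma_W.shape[0]), Sigma_W, size=n_sim)
    null_max = np.abs(W[:, G]).max(axis=1)
    # add-one correction keeps the Monte-Carlo p-value strictly positive
    return (1 + np.sum(null_max >= stat_obs)) / (1 + n_sim)

# toy known covariance structure (equicorrelated, p = 3)
Sigma_W = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.5],
                    [0.5, 0.5, 1.0]])
p_small = group_pvalue_max(stat_obs=4.0, G=[0, 1, 2], Sigma_W=Sigma_W)
p_large = group_pvalue_max(stat_obs=0.1, G=[0, 1, 2], Sigma_W=Sigma_W)
```

The same simulated null sample can serve every group G, which is what makes the multiple testing adjustment over many groups cheap once Ω is known.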

(19)

Alternatives?

- versions of bootstrapping (Chatterjee & Lahiri, 2013)
  ⇝ super-efficiency phenomenon, i.e. non-uniform convergence (Joe Hodges)
  • good for estimating the zeroes (i.e., j ∈ S_0^c with β_j^0 = 0)
  • bad for estimating the non-zeroes (i.e., j ∈ S_0 with β_j^0 ≠ 0)

- multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly in two halves:
  • select variables on the first half
  • p-values using the second half, based on the selected variables
  ⇝ avoids (because of sample splitting) over-optimistic p-values, but potentially suffers in terms of power

(20)

Some further remarks on multiple sample splitting

- if the (generalized linear) model is correct: it "works" for fixed and random design
- in misspecified models: it "works" for random design for the "best projected parameter" (see later)

the theoretical justification assumes the variable screening property:

Ŝ (based on the 1st half-sample) ⊇ S_0

(or a slightly relaxed form (PB and Mandozzi, 2014))

⇝ not nice...

but: the method performs rather well in a broad simulation study (Dezeure, PB, Meier and Meinshausen, 2014)

(21)

... the method performs rather well in a broad simulation study; the heuristic reason:

- B sample splits: p-values P_j^(1), . . . , P_j^(B) for H_{0,j}: β_j^0 = 0
  P_j^(b) = 1 if j ∉ Ŝ^(b); the p-value from the t-test on the 2nd half-sample if j ∈ Ŝ^(b)
- need to aggregate these dependent p-values

a simple rule (Meinshausen, Meier and PB, 2009):

P_j^(aggr) = sample-median(2P_j^(1), . . . , 2P_j^(B))

P_j^(aggr) < 1 ⟺ variable j has been selected in > 50% of the B sample splits

⇝ an important stability property; the method is conservative
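The median-aggregation rule fits in a few lines; a minimal sketch (the function name is hypothetical, and the p-value lists are made-up examples):

```python
import statistics

def aggregate_split_pvalues(pvals):
    """Meinshausen-Meier-Buhlmann rule: the median of the doubled
    per-split p-values, capped at 1."""
    return min(1.0, statistics.median(2.0 * p for p in pvals))

# variable selected (small p-value) in a clear majority of B = 5 splits:
p_stable = aggregate_split_pvalues([0.001, 0.004, 1.0, 0.002, 0.003])
# variable selected in fewer than half of the splits (p-value 1 there):
p_unstable = aggregate_split_pvalues([0.001, 1.0, 1.0, 0.002, 1.0])
```

The doubling pays for the median step; a variable that fails to be selected in at least half of the splits has a median of 2·1, so its aggregated p-value is capped at 1, which is exactly the stability property on the slide.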

(22)

First real data results

where we have collaborated in joint projects

- Motif regression (computational biology), n = 143, p = 196
  with the de-sparsified Lasso and multiple sample splitting:
  one significant single variable at the 5% level with FWER multiple testing adjustment

- Riboflavin production with Bacillus subtilis (genomics), n = 71, p = 4096
  with the de-sparsified Lasso and multiple sample splitting:
  one significant single variable at the 5% level with FWER multiple testing adjustment

surprising?

remember the meaning of β_j^0:
it measures the effect of variable j after adjusting for all other variables...

(23)

Behavioral economics and genetics (with Ernst Fehr, U. Zurich): n = 10525, p ≈ 0.5·10^6

(and 79 response variables, measuring "behavior")

⇝ cannot detect any single variable as significant after standard multiple testing correction

(24)

Hierarchical inference

there is structure!

- 79 response experiments
- 23 chromosomes per response experiment
- groups from hierarchical clustering per chromosome

[Figure: hierarchy tree with a global node, the 79 response experiments, 23 chromosomes per response, and finer clusters below; nodes marked significant or not significant]

(25)

do hierarchical FWER adjustment (Meinshausen, 2008)

[Figure: the same hierarchy tree, with significant and not significant nodes marked]

1. test the global hypothesis
2. if significant: test all single response hypotheses
3. for the significant responses: test all single chromosome hypotheses
4. for the significant chromosomes: test finer groups from hierarchical clustering

⇝ powerful multiple testing with data-dependent adaptation of the resolution level

(26)

input:

- a hierarchy of groups/clusters G ⊆ {1, . . . , p}
- valid p-values for H_{0,G}: β_j^0 = 0 ∀ j ∈ G  vs.  H_{A,G}: β_j^0 ≠ 0 for some j ∈ G

output:

p-values for groups/clusters which control the familywise error rate (FWER = P[at least one false positive/rejection]) with hierarchical constraints:

if H_{0,G} is not rejected ⟹ H_{0,G̃} is not rejected for G̃ lower in the hierarchy/tree

see Meinshausen (2008), and Goeman and Solari (2010) for a general sequential testing principle

(27)

the essential operation is very simple:

P_{G;adj} = P_G · p/|G|,  P_G = p-value for H_{0,G}

P_{G;hier-adj} = max_{D ∈ T: G ⊆ D} P_{D;adj}  ("stop when not rejecting at a node")

- root node: tested at level α
- next two nodes: tested at levels ≈ (αf_1, αf_2), where |G_1| = f_1·p, |G_2| = f_2·p
- at a certain depth in the tree: the sum of the levels ≈ α; on each level of depth: ≈ Bonferroni correction

if the p-values P_G are valid, the FWER is controlled (Meinshausen, 2008):

reject H_{0,G} if P_{G;hier-adj} ≤ α ⟹ P[at least one false rejection] ≤ α
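The two adjustment steps can be sketched directly; a toy three-node hierarchy with made-up raw p-values (not from any real data set, function name hypothetical):

```python
def hier_adjusted_pvalues(tree, raw_p, p_total):
    """Hierarchical adjustment in the style of Meinshausen (2008):
    P_adj(G) = P(G) * p/|G| capped at 1, then maximized over all
    ancestors of G.  `tree` maps each group (a frozenset of variable
    indices) to its parent group, or None for the root."""
    adj = {G: min(1.0, raw_p[G] * p_total / len(G)) for G in tree}
    hier = {}
    for G in tree:
        m, D = adj[G], tree[G]
        while D is not None:  # walk up to the root, taking the running max
            m = max(m, adj[D])
            D = tree[D]
        hier[G] = m
    return hier

# toy hierarchy over p = 4 variables: root -> {1,2} -> {1}
root, G12, G1 = frozenset({1, 2, 3, 4}), frozenset({1, 2}), frozenset({1})
tree = {root: None, G12: root, G1: G12}
raw_p = {root: 0.005, G12: 0.01, G1: 0.004}
hier = hier_adjusted_pvalues(tree, raw_p, p_total=4)
```

Here the root keeps 0.005, {1,2} becomes 0.01·(4/2) = 0.02, and the leaf {1} inherits 0.02 from its parent even though its own adjusted value (0.016) is smaller: a node can never look more significant than the ancestors one had to pass to reach it.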

(28)

optimized procedure:

- using Shaffer's improvement, exploiting logical relations among hypotheses:
  if H_{0,G} is true, all H_{0,G′} are true for G′ ⊆ G
- using additional sequential-type testing principles (i.e. Bonferroni-Holm instead of Bonferroni)

Bonferroni-Holm

Hypotheses to be tested: {1}, {2}

1st step: adjusted p-values 2P_{1}, 2P_{2}
FWER control (no false rejection at all): α/2 + α/2 = α

If one null hypothesis (e.g. H_{1}) is rejected:
do the 2nd step with improved multiplicity: P_{2}
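The Bonferroni-Holm step-down from the example generalizes to m hypotheses; a generic sketch with made-up p-values:

```python
def holm_rejections(pvals, alpha=0.05):
    """Holm step-down: sort the p-values, compare the i-th smallest to
    alpha/(m - i), and stop at the first failure; controls the FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

# two hypotheses as on the slide: the 1st step tests at alpha/2;
# after one rejection, the 2nd is tested at the full level alpha
rej = holm_rejections([0.01, 0.04], alpha=0.05)
```

With plain Bonferroni, 0.04 > 0.05/2 would not be rejected; Holm rejects it because the first rejection frees up the full level α for the remaining hypothesis, which is exactly the "improved multiplicity" of the 2nd step.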


(31)

α-weight distribution with inheritance procedure (Goeman and Finos, 2012)

[Tree of initial α-weights: root {1,2,3,4} at α; children {1,2} and {3,4} at α/2 each; leaves {1} and {2} at α/4, {3} and {4} below {3,4}]


(35)

α-weight distribution with inheritance procedure (Goeman and Finos, 2012)

[Tree after rejecting H_{1}: the weight α/4 of {1} is inherited by its sibling {2}, which is now tested at level α/2; {3,4} keeps α/2]

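The inheritance step depicted above, where a rejected node passes its α-weight on to its sibling, can be illustrated with a toy sketch; this is a simplification of the Goeman-Finos procedure, not a faithful implementation, and all names are hypothetical:

```python
def inherit_weight(weights, rejected, sibling):
    """Toy inheritance step: the alpha-weight of a rejected hypothesis is
    passed on to its sibling, which can then be retested at the larger level."""
    w = dict(weights)
    w[sibling] += w.pop(rejected)  # rejected node leaves; sibling inherits
    return w

# initial alpha-weights as in the tree: {1} and {2} each hold alpha/4
alpha = 0.05
weights = {"{1}": alpha / 4, "{2}": alpha / 4}
# after rejecting H_{1}, its weight is inherited by {2} -> tested at alpha/2
weights = inherit_weight(weights, rejected="{1}", sibling="{2}")
```

Rejections thus make the remaining tests progressively less conservative, while the total α-weight in the tree never exceeds α.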

(37)

the main benefit is not primarily the "efficient" multiple testing adjustment;
it is the fact that we automatically (data-driven) adapt to an appropriate resolution level of the groups

[Figure: significant findings of the single variable method vs. the hierarchical method, across 200 variable indices]

and avoid testing all possible subsets of groups...!!!
which would be a disaster from both a computational and a multiple testing adjustment point of view

(38)

Does this work?

Mandozzi and PB (2014, 2015) provide theory, implementation and empirical results from a simulation study when using the multiple sample splitting method
(using the de-sparsified Lasso is more straightforward)

- fairly reliable type I error control
- reasonable power (clearly better than the single variable testing method)

[Figure: significant findings of the single variable method vs. the hierarchical method, across 200 variable indices]

(39)

an illustration

[Figure: a hierarchical clustering tree over the variable indices 1-30, with the detected nodes highlighted]

S_0 = {5, 29, 11, 18, 3}; one STD (single true detection): {11}; one GTD (group true detection) of cardinality 3: {23, 3, 19}

still OK, potential GTD, false detection!

(46)

A "real" test: GWAS (Buzdugan, Kalisch, Schunk, Fehr and PB, 201x)

motivation: find significant associations in the behavioral economics data

next step: validate the hierarchical inference methodology on a much better studied problem

The Wellcome Trust Case Control Consortium (2007)

- 7 major diseases
- after missing data handling:
  2934 control cases
  about 1700-1800 diseased cases (depending on the disease)
  approx. 380'000 SNPs per individual

(47)

Crohn’s disease

small groups:

SNP group size  chrom.  gene        p-value  hit
7               1       IL23R       0.018    yes
1               2       ATG16L1     7·10^-6  yes
44              5       intergenic  0.009    yes
6               10      LINC01475   0.042    yes
3               10      ZNF365      0.030    yes
1               16      NOD2        2·10^-4  yes
1               18      intergenic  0.040    yes

some single SNPs are found to be significant!

"hit": a SNP (in the group) is found by the WTCCC or by WTCCC replication studies

(48)

large groups:

SNP group size  chrom.  p-value
3622            1       0.036
7571            2       0.003
18161           3       0.001
6948            4       0.028
16144           5       0.007
8077            6       0.005
12624           6       0.019
13899           7       0.027
15434           8       0.031
18238           9       0.003
4972            10      0.036
14419           11      0.013
11900           14      0.006
2965            19      0.037
9852            20      0.032
4879            21      0.009

most chromosomes exhibit significant associations; no further resolution to finer groups

(49)

Bipolar disease

only large groups/clusters are found as significant

⇝ that's "OK"...

(50)

Behavioral economics and genomewide association, with Ernst Fehr, University of Zurich

- n = 1525 probands (all students!)
- m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
- p ≈ 0.5·10^6 SNPs (the same SNPs per response)

model: multivariate linear model

Y_{n×m} = X_{n×p} β_{p×m} + ε_{n×m}
(responses)  (SNP data)        (error)

⇝ perform hierarchical inference (of course...)

(51)

number of significant SNP parameters per response

[Figure: number of significant target SNPs per phenotype, plotted against the phenotype index (1-79)]

response 40 has the most significant groups of SNPs; I cannot tell more at the moment...

(52)

Software

R-package hdi (Meier, 2013) contains

- de-sparsified Lasso, Ridge projection method, multiple sample splitting, stability selection
- hierarchical inference

(53)

Conclusions

key concepts for high-dimensional statistics:

- sparsity of the underlying regression vector
  • a sparse estimator is optimal for prediction/estimation
  • non-sparse estimators are optimal for uncertainty quantification

due to near collinearity of a few covariables (which is to be expected with p ≫ n)
⇝ inference for single variables is often ill-posed

hierarchical inference is a good way to address these issues

(54)

in view of (yet) uncheckable assumptions ⇝

confirmatory high-dimensional inference remains an interesting challenge

(55)

Thank you!

References:

- Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
- Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.
- Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.
- van de Geer, S., Bühlmann, P. and Ritov, Y. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics 42, 1166-1202.
- Dezeure, R., Bühlmann, P., Meier, L. and Meinshausen, N. (2014). High-dimensional inference: confidence intervals, p-values and R-software hdi. Preprint arXiv:1408.4026.
