Motivation Statistical setup Refined statistics Real data

How to analyze many contingency tables simultaneously?

Thorsten Dickhaus

Humboldt-Universität zu Berlin

Beuth Hochschule für Technik Berlin, 31.10.2012

Outline

Motivation: Genetic association studies

Statistical setup

Refined statistical inference methods

Real data example

Reference:

Dickhaus, T., Straßburger, K., Schunk, D., Morcillo, C., Illig, T., and Navarro, A. (2012): How to analyze many contingency tables simultaneously in genetic association studies. SAGMB 11, Article 12.

What is a SNP (single nucleotide polymorphism)?

Bi-allelic SNPs: Exactly two possible alleles

Locus     1  2  3  4  ...  i  ...  M
Tom (m)   A  A  G  T  ...  A  ...  G
Tom (p)   A  A  G  T  ...  A  ...  C
Andrew    A  A  G  C  ...  A  ...  C
          A  A  G  C  ...  G  ...  C
Rachel    A  A  G  C  ...  G  ...  G
          A  A  G  T  ...  G  ...  G

Contingency table layout in association studies

Assume a bi-allelic marker (SNP) at a particular locus and a binary phenotype of interest, e.g., a disease status.

Genotype         A1A1   A1A2   A2A2   Σ
Phenotype 1      x1,1   x1,2   x1,3   n1.
Phenotype 0      x2,1   x2,2   x2,3   n2.
Absolute count   n.1    n.2    n.3    N

In case of allelic tests:

Genotype         A1     A2     Σ
Phenotype 1      x1,1   x1,2   n1.
Phenotype 0      x2,1   x2,2   n2.
Absolute count   n.1    n.2    N

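Formalized association test problem

Multiple test problem with system of hypotheses

$$\mathcal{H} = (H_j : 1 \le j \le M), \quad \text{where } H_j : \text{Genotype}_j \perp \text{Phenotype},$$

with two-sided alternatives $K_j$.

Abbreviated notation (one particular position):

$$\mathbf{n} = (n_{1.}, n_{2.}, n_{.1}, n_{.2}, n_{.3}) \in \mathbb{N}^5 \quad \text{or} \quad \mathbf{n} = (n_{1.}, n_{2.}, n_{.1}, n_{.2}) \in \mathbb{N}^4,$$

$$\mathbf{x} = \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix} \in \mathbb{N}^{2 \times 3} \quad \text{or} \quad \mathbf{x} = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{pmatrix} \in \mathbb{N}^{2 \times 2}.$$

In both cases, the probability of observing $\mathbf{x}$ given $\mathbf{n}$ is under the null given by

$$f(\mathbf{x} \mid \mathbf{n}) = \frac{\prod_{n \in \mathbf{n}} n!}{N! \, \prod_{x \in \mathbf{x}} x!}.$$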
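A minimal Python sketch of this null probability, assuming the table is passed as a list of rows of non-negative integers. The function name `table_probability` is hypothetical and not taken from the talk; exact rational arithmetic is used so that no rounding occurs.

```python
from fractions import Fraction
from math import factorial, prod


def table_probability(x):
    """Null probability f(x | n) of table x given its marginals n."""
    row_sums = [sum(row) for row in x]
    col_sums = [sum(col) for col in zip(*x)]
    total = sum(row_sums)  # N
    # Numerator: product of the factorials of all marginal counts.
    numerator = prod(factorial(m) for m in row_sums + col_sums)
    # Denominator: N! times the product of the factorials of all cell counts.
    denominator = factorial(total) * prod(factorial(c) for row in x for c in row)
    return Fraction(numerator, denominator)


print(table_probability([[1, 2], [3, 4]]))  # 1/2
```

For the 2 × 2 example the value agrees with the hypergeometric probability C(4,1)·C(6,2)/C(10,3) = 60/120.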

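Tests for association of marker and phenotype

(i) Chi-squared test

$$Q(\mathbf{x}) = \sum_r \sum_s \frac{(x_{rs} - e_{rs})^2}{e_{rs}}, \quad \text{where } e_{rs} = n_{r.} n_{.s} / N.$$

Resulting "exact" (non-asymptotic) p-value:

$$p_Q(\mathbf{x}) = \sum_{\tilde{\mathbf{x}}} f(\tilde{\mathbf{x}} \mid \mathbf{n}),$$

with summation over all $\tilde{\mathbf{x}}$ with marginals $\mathbf{n}$ such that $Q(\tilde{\mathbf{x}}) \ge Q(\mathbf{x})$.

(Local) level $\alpha$ test: $\varphi_Q(\mathbf{x}) = \mathbf{1}_{\{p_Q(\mathbf{x}) \le \alpha\}}$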

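Tests for association of marker and phenotype

(ii) Tests of Fisher-type

$$p_{\mathrm{Fisher}}(\mathbf{x}) = \sum_{\tilde{\mathbf{x}}} f(\tilde{\mathbf{x}} \mid \mathbf{n}),$$

with summation over all $\tilde{\mathbf{x}}$ with marginals $\mathbf{n}$ such that $f(\tilde{\mathbf{x}} \mid \mathbf{n}) \le f(\mathbf{x} \mid \mathbf{n})$.

Corresponding level $\alpha$ test: $\varphi_{\mathrm{Fisher}}(\mathbf{x}) = \mathbf{1}_{\{p_{\mathrm{Fisher}}(\mathbf{x}) \le \alpha\}}$

$\varphi_Q(\mathbf{x})$ and $\varphi_{\mathrm{Fisher}}(\mathbf{x})$ keep the (local) significance level $\alpha$ conservatively for any sample size $N$.

In other words: $p_Q(X) \ge_{\mathrm{st}} U$ and $p_{\mathrm{Fisher}}(X) \ge_{\mathrm{st}} U$ under the null, $U \sim \mathrm{UNI}[0,1]$, i.e., both p-values are stochastically not smaller than a uniform variate.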
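Both exact p-values can be sketched in Python by enumerating every table that shares the observed marginals. This is an illustration under stated assumptions, not the implementation used in the talk: all function names are hypothetical, the enumeration is written for tables with two rows (2 × 2 or 2 × 3), and every marginal count is assumed to be positive so that the expected counts are nonzero. Brute-force enumeration is only practical for moderate sample sizes.

```python
from fractions import Fraction
from math import factorial, prod


def table_probability(x):
    """Null probability f(x | n) of table x given its marginals n."""
    rows = [sum(r) for r in x]
    cols = [sum(c) for c in zip(*x)]
    numerator = prod(factorial(m) for m in rows + cols)
    denominator = factorial(sum(rows)) * prod(factorial(c) for r in x for c in r)
    return Fraction(numerator, denominator)


def chi_squared(x):
    """Statistic Q(x); assumes every marginal count is positive."""
    rows = [sum(r) for r in x]
    cols = [sum(c) for c in zip(*x)]
    total = sum(rows)
    q = Fraction(0)
    for r, row in enumerate(x):
        for s, cell in enumerate(row):
            expected = Fraction(rows[r] * cols[s], total)
            q += (cell - expected) ** 2 / expected
    return q


def tables_with_margins(rows, cols):
    """Yield every 2 x C table of non-negative counts with the given marginals."""
    def fill(top, remaining):
        k = len(top)
        if k == len(cols) - 1:
            # The last cell of the first row is forced by the row total.
            if 0 <= remaining <= cols[k]:
                full_top = top + [remaining]
                yield [full_top, [c - t for c, t in zip(cols, full_top)]]
            return
        for value in range(min(remaining, cols[k]) + 1):
            yield from fill(top + [value], remaining - value)
    yield from fill([], rows[0])


def exact_p_values(x):
    """Return (p_Q(x), p_Fisher(x)) by summing f over the qualifying tables."""
    rows = [sum(r) for r in x]
    cols = [sum(c) for c in zip(*x)]
    q_obs, f_obs = chi_squared(x), table_probability(x)
    p_q = p_fisher = Fraction(0)
    for t in tables_with_margins(rows, cols):
        f_t = table_probability(t)
        if chi_squared(t) >= q_obs:
            p_q += f_t
        if f_t <= f_obs:
            p_fisher += f_t
    return p_q, p_fisher


print(exact_p_values([[3, 0], [1, 6]]))  # both p-values equal 1/30 here
```

Exact fractions avoid spurious ties or non-ties caused by floating-point rounding when comparing $Q$ or $f$ values across tables.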

Estimating the proportion of informative SNPs

(References: Schweder and Spjøtvoll, 1982; Storey et al., 2004)
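The slides do not reproduce the estimator itself. As a hedged sketch, a commonly cited form of the Storey-type estimator counts the p-values above a tuning parameter λ and rescales by M(1 − λ), with a +1 in the numerator. The function name `storey_pi0`, the default λ = 0.5 and the cap at 1 are assumptions made here for illustration.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimate the proportion pi_0 of true null hypotheses (capped at 1).

    Assumed form: (#{p_j > lam} + 1) / (M * (1 - lam)).
    """
    m = len(p_values)
    exceed = sum(1 for p in p_values if p > lam)
    return min(1.0, (exceed + 1) / (m * (1.0 - lam)))


# Hypothetical p-values: eight small ones and two above lam = 0.5.
print(storey_pi0([0.01] * 8 + [0.7, 0.9]))
```

The idea is that p-values above λ stem mostly from true null hypotheses, whose p-values are (approximately) uniform on [0, 1].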

Caveat: Storey's method does not work for discrete p-values $p_Q(X)$ and $p_{\mathrm{Fisher}}(X)$

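Discreteness: Realized randomized p-values

Definition:

Statistical model $(\Omega, \mathcal{A}, (P_\vartheta)_{\vartheta \in \Theta})$ given

Two-sided test problem $H: \{\vartheta = \vartheta_0\}$ versus $K: \{\vartheta \ne \vartheta_0\}$

Discrete test statistic: $X \sim P_\vartheta$ with values in $\Omega$

$U \sim \mathrm{UNI}[0,1]$, stochastically independent of $X$

A realized randomized p-value for testing $H$ versus $K$ is a measurable mapping $p^r: \Omega \times [0,1] \to [0,1]$ with

$$P_{\vartheta_0}(p^r(X, U) \le t) = t \quad \text{for all } t \in [0,1].$$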

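Realized randomized p-values based on $p_Q(X)$ and $p_{\mathrm{Fisher}}(X)$

Lemma:

Based upon the chi-squared and Fisher-type testing strategies, corresponding realized randomized p-values can be calculated as

$$p^r_Q(\mathbf{x}, u) = p_Q(\mathbf{x}) - u \sum_{\tilde{\mathbf{x}} : Q(\tilde{\mathbf{x}}) = Q(\mathbf{x})} f(\tilde{\mathbf{x}} \mid \mathbf{n}),$$

$$p^r_{\mathrm{Fisher}}(\mathbf{x}, u) = p_{\mathrm{Fisher}}(\mathbf{x}) - u \, \gamma \, f(\mathbf{x} \mid \mathbf{n}),$$

where $u$ denotes the realization of $U \sim \mathrm{UNI}[0,1]$, stochastically independent of $X$, and $\gamma \equiv \gamma(\mathbf{x}) = |\{\tilde{\mathbf{x}} : f(\tilde{\mathbf{x}} \mid \mathbf{n}) = f(\mathbf{x} \mid \mathbf{n})\}|$.

We propose realized randomized p-values for estimating $\pi_0$. For final decision making, their non-randomized counterparts should be used (Reproducibility!).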
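The lemma can be illustrated for the allelic 2 × 2 case, where all tables with the observed marginals are indexed by the top-left cell k. The sketch below is an assumption-laden illustration: the function name is hypothetical, all marginals are assumed positive, and it uses the fact that for 2 × 2 tables $Q$ is an increasing function of $|k - e_{11}|$, so that distance serves as a stand-in for $Q$.

```python
import random
from fractions import Fraction
from math import comb


def randomized_p_values(x, u):
    """Realized randomized p-values (p^r_Q, p^r_Fisher) for a 2 x 2 table."""
    (a, b), (c, d) = x
    n1, m1, total = a + b, a + c, a + b + c + d

    def f(k):
        # Hypergeometric null probability of the table with top-left cell k.
        return Fraction(comb(m1, k) * comb(total - m1, n1 - k), comb(total, n1))

    def q(k):
        # Stand-in for Q: for 2 x 2 tables Q increases with |k - e_11|.
        return abs(Fraction(k) - Fraction(n1 * m1, total))

    support = range(max(0, n1 + m1 - total), min(n1, m1) + 1)
    p_q = sum(f(k) for k in support if q(k) >= q(a))
    tie_mass = sum(f(k) for k in support if q(k) == q(a))
    p_fisher = sum(f(k) for k in support if f(k) <= f(a))
    gamma = sum(1 for k in support if f(k) == f(a))
    return p_q - u * tie_mass, p_fisher - u * gamma * f(a)


random.seed(1)  # fixed seed only to make the illustration repeatable
print(randomized_p_values([[3, 0], [1, 6]], random.random()))
```

With u = 0 the function returns the non-randomized p-values, which are the ones recommended above for the final decision.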

Effective number of tests

A thought experiment

Assume markers indexed by $I = \{1, \ldots, M\}$ can be divided into disjoint groups with indices in subsets $I_g \subset I$, $g \in \{1, \ldots, G\}$.

Let $\varphi = (\varphi_i, i \in I)$ and assume that for each $g \in \{1, \ldots, G\}$ and for any pair $i, j \in I_g$ the identity $\{\varphi_i = 1\} = \{\varphi_j = 1\}$ holds. Then, "effectively" only one single test is performed in each subgroup. Denoting $i(g) = \min I_g$ for $g = 1, \ldots, G$, it holds

$$\mathrm{FWER}_\vartheta(\varphi) = P_\vartheta\Bigl( \bigcup_{g=1}^{G} \bigcup_{i \in I_0 \cap I_g} \{\varphi_i = 1\} \Bigr) \le P_\vartheta\Bigl( \bigcup_{g=1}^{G} \{\varphi_{i(g)} = 1\} \Bigr).$$

Consequently, multiplicity correction in this extreme scenario only has to be done with respect to $G \ll M$.

Bonferroni-type adjustment $\alpha/G$ would be valid!

Effective number of tests

Cheverud-Nyholt method and beyond

$$M_{\mathrm{eff.}} = 1 + \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{M} (1 - r_{ij}^2).$$

The numbers $r_{ij}$ are measures of correlation among markers $i$ and $j$ and can typically be obtained from linkage disequilibrium (LD) matrices.

More sophisticated methods exist in the literature, e.g.:

simpleM by X. Gao et al. (2008)

$K_{\mathrm{eff.}}$ by Moskvina and Schmidt (2008)

All rely on the correlation structure reflected by the $r_{ij}$'s.
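The Cheverud-Nyholt formula is a one-liner once the correlations are available. A minimal sketch, assuming the $r_{ij}$ are supplied as a square list of lists with ones on the diagonal (the function name is hypothetical):

```python
def effective_number_of_tests(r):
    """M_eff = 1 + (1/M) * sum_i sum_j (1 - r_ij^2) for a square matrix r."""
    m = len(r)
    return 1.0 + sum(1.0 - r[i][j] ** 2 for i in range(m) for j in range(m)) / m


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
all_ones = [[1.0] * 3 for _ in range(3)]
print(effective_number_of_tests(identity))  # 3.0: uncorrelated markers
print(effective_number_of_tests(all_ones))  # 1.0: perfectly correlated markers
```

The two extreme cases recover M for uncorrelated markers and 1 for perfectly correlated markers, matching the thought experiment with G = M and G = 1.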

Our proposed data analysis workflow

1. Compute realized randomized p-values $p^r(\mathbf{x}_j, u_j)$ and non-randomized versions $p(\mathbf{x}_j)$, $j = 1, \ldots, M$.

2. Estimate the proportion $\pi_0$ of uninformative SNPs by $\hat{\pi}_0$.

3. Determine the effective number of tests $M_{\mathrm{eff.}}$ by utilizing correlation values obtained from an appropriate LD matrix of the $M$ SNPs.

4. For a pre-defined FWER level $\alpha$, determine the list of associated markers by performing the multiple test $\varphi = (\varphi_j, j = 1, \ldots, M)$, where $\varphi_j(\mathbf{x}_j) = \mathbf{1}_{\{p(\mathbf{x}_j) \le t\}}$ with $t = \alpha / (M_{\mathrm{eff.}} \cdot \hat{\pi}_0)$.
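Step 4 of the workflow can be sketched as follows, assuming the non-randomized p-values, the effective number of tests and $\hat{\pi}_0$ have already been computed. The function name and the example p-values are hypothetical; the values 16.73 and 0.4545 are the ones reported in the real data example of this talk.

```python
def associated_markers(p_values, m_eff, pi0_hat, alpha=0.05):
    """Return the threshold t = alpha / (m_eff * pi0_hat) and rejected indices."""
    threshold = alpha / (m_eff * pi0_hat)
    # Indices are 0-based positions in p_values, not SNP identifiers.
    rejected = [j for j, p in enumerate(p_values) if p <= threshold]
    return threshold, rejected


t, hits = associated_markers([0.001, 0.0066, 0.02], m_eff=16.73, pi0_hat=0.4545)
print(round(t, 4), hits)  # 0.0066 [0]
```

The second hypothetical p-value, 0.0066, is not rejected because the unrounded threshold is slightly below 0.0066.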

Real data example: Herder et al. (2008)

Replication study

Herder, C. et al. (2008). Variants of the PPARG, IGF2BP2, CDKAL1, HHEX, and TCF7L2 genes confer risk of type 2 diabetes independently of BMI in the German KORA studies. Horm. Metab. Res. 40, 722–726.

Data:

M = 44 SNPs on ten different genes (N ≈ 1900 study participants)

"Results" section:

"...(conservative) Bonferroni correction for 10 genes..."

Authors' claim:

Threshold t = 0.005 for raw marginal p-values controls the FWER at α = 5%

Herder et al. (2008): Data re-analysis

LD information:

Taken from the HapMap project (population 'CEU')

Estimated effective number of tests:

$M_{\mathrm{eff.}} = 40.63$ (Cheverud-Nyholt method), $K_{\mathrm{eff.}} = 16.73$ (Moskvina-Schmidt method).

Estimated proportion of uninformative SNPs:

$\hat{\pi}_0 = 0.4545$ (Storey et al., 2004)

Resulting threshold according to our method:

$$t = \alpha / (K_{\mathrm{eff.}} \cdot \hat{\pi}_0) = \alpha / (16.73 \cdot 0.4545) = \alpha / 7.604 = 0.0066.$$

In conclusion:

Our proposed method confirms the authors' heuristic argumentation and endorses their scientific claims.

Future research goals

Effective number of tests for continuous response

Effective number of tests for FDR control

Adaptive estimation of effective numbers of tests

Statistical methodology for confirmatory functional studies (fMRI data)

Hierarchical multiple testing methods for (auto-) correlated data (time series)
