
Supplementary Note

1 Properties for the SUPERGNOVA framework

1.1 Statistical model for global genetic covariance

We follow the random-design, random-effects model used by LDSC[2] to construct phenotypes. Suppose we sample two cohorts for two different phenotypes, with sample sizes $n_1$ and $n_2$, respectively. Assume the two GWAS share the same set of $m$ SNPs. We measure phenotype 1 in cohort 1 and phenotype 2 in cohort 2, and we assume all $m$ SNPs are associated with both traits. We model the phenotype vectors as
\[
\phi_1 = X\beta + \epsilon, \qquad \phi_2 = Y\gamma + \delta,
\]
where $X$ and $Y$ are standardized random genotype matrices with dimensions $n_1 \times m$ and $n_2 \times m$; $\beta$ and $\gamma$ are vectors of standardized genotype effect sizes; and $\epsilon$ and $\delta$ are vectors of residuals representing environmental effects and non-additive genetic effects.

Each row of $X$ and $Y$ represents the standardized genotypes of an individual in the corresponding GWAS. By standardized genotypes, we mean the genotype of each SNP is normalized to mean zero and variance one. We assume the genotypes of different individuals are independent. Due to linkage disequilibrium (LD), genotypes of different SNPs are correlated. We denote the LD matrix by $V$; that is, $\mathrm{cov}(X_{i_1\cdot}) = V = \mathrm{cov}(Y_{i_2\cdot})$ for any $1 \le i_1 \le n_1$ and $1 \le i_2 \le n_2$. We define the LD score of variant $j$ as $l_j := \sum_k V_{jk}^2$ and assume it is bounded by a generic constant $M$, i.e. $l_j < M$ for $j \in \{1, 2, \ldots, m\}$. We suppose that $(\beta^T, \gamma^T)^T$ follows a multivariate normal distribution with mean zero and covariance matrix

\[
\mathrm{Var}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = \frac{1}{m}\begin{pmatrix} h_1^2 I_m & \rho I_m \\ \rho I_m & h_2^2 I_m \end{pmatrix}.
\]

We define $h_1^2$ and $h_2^2$ as the heritabilities of trait 1 and trait 2, respectively, and $\rho$ as the genetic covariance between the two traits. The genetic correlation $r$ is defined as $\rho/\sqrt{h_1^2 h_2^2}$. In practice, two GWAS often share a subset of samples. Without loss of generality, we assume $0 \le n_s \le \min\{n_1, n_2\}$ samples are shared by the two GWAS and that the first $n_s$ samples in each study are shared, i.e. the first $n_s$ rows of $X$ and $Y$ are identical. To account for the non-genetic correlation introduced by sample overlap, we assume $\epsilon$ and $\delta$ follow a multivariate normal distribution with covariance

\[
\mathrm{Cov}[\epsilon_i, \delta_j] = \begin{cases} \rho_e, & 1 \le i = j \le n_s \\ 0, & \text{otherwise.} \end{cases}
\]
The variances of $\epsilon$ and $\delta$ are
\[
\mathrm{Var}[\epsilon] = (1 - h_1^2) I_{n_1}, \qquad \mathrm{Var}[\delta] = (1 - h_2^2) I_{n_2},
\]
so that

\[
\begin{aligned}
\mathrm{Var}[\phi_1] &= \mathrm{Var}[X\beta] + \mathrm{Var}[\epsilon] = E\left[X\beta\beta^T X^T\right] + (1 - h_1^2) I_{n_1} \\
&= \frac{h_1^2}{m} E\left[XX^T\right] + (1 - h_1^2) I_{n_1} = h_1^2 I_{n_1} + (1 - h_1^2) I_{n_1} = I_{n_1},
\end{aligned}
\]

and similarly $\mathrm{Var}[\phi_2] = I_{n_2}$. We assume genotypes, effect sizes, and environmental effects are independent of each other.
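As a quick sanity check on this construction, the model can be simulated directly. The sketch below is illustrative only: it assumes an identity LD matrix ($V = I$) and arbitrary parameter values ($h_1^2 = h_2^2 = 0.5$, $\rho = 0.4$), draws standardized genotypes and effect sizes as specified above, and confirms numerically that the phenotype variance is close to 1 and the empirical covariance of the genetic components is close to $\rho$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 200                    # individuals and SNPs (illustrative)
h1_sq, h2_sq, rho = 0.5, 0.5, 0.4   # heritabilities and genetic covariance

# Standardized genotypes with V = I (no LD between SNPs).
f = rng.uniform(0.05, 0.5, m)       # MAFs of common variants
X = (rng.binomial(2, f, (n, m)) - 2 * f) / np.sqrt(2 * f * (1 - f))

# (beta_j, gamma_j) ~ N(0, [[h1_sq, rho], [rho, h2_sq]] / m), independent across j.
eff = rng.multivariate_normal([0, 0], np.array([[h1_sq, rho], [rho, h2_sq]]) / m, m)
beta, gamma = eff[:, 0], eff[:, 1]

g1, g2 = X @ beta, X @ gamma                        # genetic components
phi1 = g1 + rng.normal(0, np.sqrt(1 - h1_sq), n)    # phenotype 1

print(round(float(np.var(phi1)), 2))          # close to 1
print(round(float(np.cov(g1, g2)[0, 1]), 2))  # close to rho = 0.4
```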

Genetic covariance $\rho$ is the covariance of the genetic components. For an individual with standardized genotype $G$, denoting $g_1$ and $g_2$ as the genetic components for trait 1 and trait 2, respectively, we have:

\[
\mathrm{Cov}[g_1, g_2] = E\left[G^T\beta\gamma^T G\right] = E\left[E\left[G^T\beta\gamma^T G \mid G\right]\right] = E\left[G^T E\left[\beta\gamma^T\right] G\right] = \frac{\rho}{m} E\left[G^T G\right] = \rho.
\]

We define the genetic correlation as the genetic covariance normalized by the heritabilities:
\[
r_g = \frac{\mathrm{Cov}[g_1, g_2]}{\sqrt{\mathrm{Var}[g_1]\,\mathrm{Var}[g_2]}} = \frac{\rho}{\sqrt{h_1^2 h_2^2}}.
\]

1.2 Statistical model for local genetic covariance

In this section, we generalize the statistical framework above for local genetic covariance. We assume φ1 and φ2 follow additive linear models:

\[
\phi_1 = \sum_{i=1}^{I} X_i\beta_i + \epsilon, \qquad \phi_2 = \sum_{i=1}^{I} Y_i\gamma_i + \delta,
\]

where $X_i$ and $Y_i$ are the standardized genotypes and $\beta_i$ and $\gamma_i$ are the effect sizes of SNPs in region $i$. We assume SNPs from different regions are independent and use $V_i$ to denote the LD matrix in region $i$. We assume the effect sizes $(\beta_i^T, \gamma_i^T)^T$ follow a multivariate normal distribution:

\[
\begin{pmatrix} \beta_i \\ \gamma_i \end{pmatrix} \sim N\left(0,\; \frac{1}{m_i}\begin{pmatrix} h_{1i}^2 I_{m_i} & \rho_i I_{m_i} \\ \rho_i I_{m_i} & h_{2i}^2 I_{m_i} \end{pmatrix}\right),
\]

and effect sizes of SNPs from different regions are independent. The environmental covariance $\rho_e$ for local genetic covariance is defined in the same way as for global genetic covariance. The variances of $\epsilon$ and $\delta$ are
\[
\mathrm{Var}[\epsilon] = \left(1 - \sum_{i=1}^{I} h_{1i}^2\right) I_{n_1}, \qquad \mathrm{Var}[\delta] = \left(1 - \sum_{i=1}^{I} h_{2i}^2\right) I_{n_2},
\]
so that $\mathrm{Var}(\phi_1) = I_{n_1}$ and $\mathrm{Var}(\phi_2) = I_{n_2}$.
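The key step in the local derivation that follows is that $E\left[G_i^T E\left[\beta_i\gamma_i^T\right] G_i\right] = (\rho_i/m_i)\,\mathrm{tr}(V_i) = \rho_i$, because a standardized LD matrix has unit diagonal. A minimal numerical sketch (hypothetical region sizes, $\rho_i$ values, and AR(1)-style LD blocks) shows that the regional contributions simply add:

```python
import numpy as np

# Hypothetical local parameters: 3 regions with sizes m_i and covariances rho_i.
sizes = [10, 25, 40]
rhos = [0.02, -0.01, 0.05]

def ar1_ld(m, r=0.6):
    """AR(1)-style correlation matrix as a stand-in LD matrix V_i."""
    idx = np.arange(m)
    return r ** np.abs(idx[:, None] - idx[None, :])

# Cov[g1, g2] = sum_i (rho_i / m_i) * tr(V_i); tr(V_i) = m_i (unit diagonal).
total = sum(rho / m * np.trace(ar1_ld(m)) for m, rho in zip(sizes, rhos))
print(round(total, 6))  # equals sum of rho_i = 0.06
```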

The covariance of the genetic components is the sum of the local genetic covariances. Denote $G_i$ as the standardized genotype in region $i$ for an individual. We have
\[
\begin{aligned}
\mathrm{Cov}[g_1, g_2] &= E\left[\left(\sum_{i=1}^{I} G_i^T\beta_i\right)\left(\sum_{i=1}^{I}\gamma_i^T G_i\right)\right] = \sum_{i=1}^{I} E\left[G_i^T\beta_i\gamma_i^T G_i\right] = \sum_{i=1}^{I} E\left[E\left[G_i^T\beta_i\gamma_i^T G_i \mid G_i\right]\right] \\
&= \sum_{i=1}^{I} E\left[G_i^T E\left[\beta_i\gamma_i^T\right] G_i\right] = \sum_{i=1}^{I}\frac{\rho_i}{m_i} E\left[G_i^T G_i\right] = \sum_{i=1}^{I}\rho_i.
\end{aligned}
\]
Local genetic correlation is defined by:
\[
r_{ig} = \frac{\rho_i}{\sqrt{h_{1i}^2 h_{2i}^2}}.
\]

1.3 Covariance of z scores

In genome-wide association studies (GWAS), summary statistics are more accessible than individual-level genotype data due to privacy and data-sharing concerns. For a GWAS of a quantitative trait, the z score of a single SNP $j$ is given by:

\[
z_j = \frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)},
\]
where $\hat\beta_j$ is the estimated coefficient from the marginal linear regression between the trait and SNP $j$, and $\mathrm{se}(\hat\beta_j)$ is the corresponding standard error. In practice, we approximate the z scores by $z_{1j} = X_{\cdot j}^T\phi_1/\sqrt{n_1}$ and $z_{2j} = Y_{\cdot j}^T\phi_2/\sqrt{n_2}$.

1.3.1 Global

We derive the variance–covariance matrix of $(z_1^T, z_2^T)^T$ in this section. It is sufficient to show that
\[
\mathrm{Cov}(z_1, z_2) = \frac{\sqrt{n_1 n_2}\,\rho}{m} V^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} V, \qquad (1)
\]
where $\rho_t = \rho + \rho_e$, because $\mathrm{Var}(z_1)$ and $\mathrm{Var}(z_2)$ can then be derived from (1). In fact, to compute $\mathrm{Var}(z_1)$, we can assume trait 1 and trait 2 are from the same study, and hence we have
\[
\mathrm{Var}(z_1) = \mathrm{Cov}(z_1, z_1) = \frac{\sqrt{n_1 n_1}\, h_1^2}{m} V^2 + \frac{n_1\left(h_1^2 + 1 - h_1^2\right)}{\sqrt{n_1 n_1}} V = \frac{n_1 h_1^2}{m} V^2 + V.
\]
Similarly, $\mathrm{Var}(z_2) = (n_2 h_2^2/m)\, V^2 + V$. To prove (1), we begin with the following proposition.

Proposition 1. Assume $\Gamma_1, \Gamma_2 \overset{i.i.d.}{\sim} B(1, p_\Gamma)$; $\Pi_1, \Pi_2 \overset{i.i.d.}{\sim} B(1, p_\Pi)$; $\mathrm{Cor}(\Gamma_1, \Pi_1) = \mathrm{Cor}(\Gamma_2, \Pi_2) = r_{\Gamma\Pi}$; $\Gamma_1$ and $\Pi_2$ are independent; $\Gamma_2$ and $\Pi_1$ are independent. We have:

(1).
\[
E\left[(\Gamma_1 + \Gamma_2 - 2p_\Gamma)^2(\Pi_1 + \Pi_2 - 2p_\Pi)^2\right] = 4\left(r_{\Gamma\Pi}^2 + 1\right) p_\Pi(1 - p_\Pi)\, p_\Gamma(1 - p_\Gamma) + 2 r_{\Gamma\Pi}\sqrt{p_\Pi(1 - p_\Pi)}\sqrt{p_\Gamma(1 - p_\Gamma)}\,(1 - 2p_\Pi)(1 - 2p_\Gamma).
\]

(2). Additionally, if $\Lambda_1, \Lambda_2 \overset{i.i.d.}{\sim} B(1, p_\Lambda)$; $\mathrm{Cor}(\Gamma_1, \Lambda_1) = \mathrm{Cor}(\Gamma_2, \Lambda_2) = r_{\Gamma\Lambda}$; $\mathrm{Cor}(\Pi_1, \Lambda_1) = \mathrm{Cor}(\Pi_2, \Lambda_2) = r_{\Pi\Lambda}$; $E[(\Pi_1 - p_\Pi)(\Gamma_1 - p_\Gamma)(\Lambda_1 - p_\Lambda)] = E[(\Pi_2 - p_\Pi)(\Gamma_2 - p_\Gamma)(\Lambda_2 - p_\Lambda)] = \Delta$; $\Lambda_1$ is independent of $\Gamma_2$ and $\Pi_2$; and $\Lambda_2$ is independent of $\Gamma_1$ and $\Pi_1$, then:
\[
E\left[(\Gamma_1 + \Gamma_2 - 2p_\Gamma)(\Lambda_1 + \Lambda_2 - 2p_\Lambda)(\Pi_1 + \Pi_2 - 2p_\Pi)^2\right] = 4\left(r_{\Gamma\Pi} r_{\Pi\Lambda} + r_{\Gamma\Lambda}\right) p_\Pi(1 - p_\Pi)\sqrt{p_\Gamma(1 - p_\Gamma)}\sqrt{p_\Lambda(1 - p_\Lambda)} + 2(1 - 2p_\Pi)\,\Delta.
\]

We note that the genotype of SNP $j$ before standardization is the sum of two independent identically distributed Bernoulli variables with success probability equal to the minor allele frequency $f_j$ of SNP $j$, where $0.05 < f_j < 0.5$; we only consider common variants. Under Hardy–Weinberg equilibrium, for any $1 \le \tau \le n_1$, we approximate $X_{\tau j}$ by
\[
X_{\tau j} := \frac{S_{1\tau j} + S_{2\tau j} - 2f_j}{\sqrt{2 f_j(1 - f_j)}},
\]
where $S_{1\tau j}$ and $S_{2\tau j}$ represent the allelic dosages of SNP $j$ on each chromosome. The approximation of the genotypes in $Y$ is defined in the same way.

By Proposition 1, for two different SNPs $j$ and $\zeta$, we have
\[
E\left[X_{\tau j}^2 X_{\tau\zeta}^2\right] = V_{j\zeta}^2 + 1 + \frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}}.
\]
For SNP $k$, $k \ne j, \zeta$, denoting $\Delta_{jk\zeta} = E\left[(S_{\alpha\tau j} - f_j)(S_{\alpha\tau k} - f_k)(S_{\alpha\tau\zeta} - f_\zeta)\right]$, $\alpha = 1, 2$, we have
\[
E\left[X_{\tau j} X_{\tau k} X_{\tau\zeta}^2\right] = V_{j\zeta} V_{k\zeta} + V_{jk} + \frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}},
\]
where $V_{j\zeta}$ represents the entry in the $j$-th row and $\zeta$-th column of the LD matrix $V$; in other words, $V_{j\zeta}$ is the correlation between the genotypes of SNP $j$ and SNP $\zeta$. $V_{jk}$ and $V_{k\zeta}$ are defined in the same way.

Now, we prove (1). We have
\[
\begin{aligned}
E\left[z_1 z_2^T\right] &= E\left[\frac{X^T\phi_1}{\sqrt{n_1}}\cdot\frac{\phi_2^T Y}{\sqrt{n_2}}\right] = E\left[\frac{X^T(X\beta + \epsilon)}{\sqrt{n_1}}\cdot\frac{(Y\gamma + \delta)^T Y}{\sqrt{n_2}}\right] \\
&= E\left[\frac{X^T X\beta\gamma^T Y^T Y}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X^T\epsilon\delta^T Y}{\sqrt{n_1 n_2}}\right] \\
&= E\left[E\left[\frac{X^T X\beta\gamma^T Y^T Y}{\sqrt{n_1 n_2}}\,\middle|\, X, Y\right]\right] + E\left[E\left[\frac{X^T\epsilon\delta^T Y}{\sqrt{n_1 n_2}}\,\middle|\, X, Y\right]\right] \\
&= E\left[\frac{X^T X\, E\left[\beta\gamma^T\right] Y^T Y}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X^T E\left[\epsilon\delta^T\right] Y}{\sqrt{n_1 n_2}}\right] \\
&= \frac{\rho}{m\sqrt{n_1 n_2}}\, E\left[X^T X Y^T Y\right] + \frac{\rho_e}{\sqrt{n_1 n_2}}\, E\left[\sum_{i=1}^{n_s} X_{i\cdot}^T Y_{i\cdot}\right] \\
&= \frac{\rho}{m\sqrt{n_1 n_2}}\, E\left[X^T X Y^T Y\right] + \frac{\rho_e n_s}{\sqrt{n_1 n_2}}\, V. \qquad (2)
\end{aligned}
\]

To complete the calculation, we next compute $E\left[X_{\cdot j}^T X Y^T Y_{\cdot k}\right]$. First, we denote $N = \{(\tau, \eta) : \tau = \eta = 1, 2, \ldots, n_s\}$. For $j = k$, we have
\[
\begin{aligned}
E\left[X_{\cdot j}^T X Y^T Y_{\cdot j}\right] &= E\left[\sum_{\zeta=1}^{m}\left(\sum_{\tau=1}^{n_1} X_{\tau j} X_{\tau\zeta}\right)\left(\sum_{\eta=1}^{n_2} Y_{\eta j} Y_{\eta\zeta}\right)\right] = \sum_{\zeta=1}^{m}\sum_{\tau=1}^{n_1}\sum_{\eta=1}^{n_2} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta j} Y_{\eta\zeta}\right] \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta j} Y_{\eta\zeta}\right] + \sum_{(\tau,\eta)\notin N} E\left[X_{\tau j} X_{\tau\zeta}\right] E\left[Y_{\eta j} Y_{\eta\zeta}\right]\right) \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j}^2 X_{\tau\zeta}^2\right] + (n_1 n_2 - n_s) V_{j\zeta}^2\right) \\
&= \sum_{\zeta=1}^{m}\left(n_s\left[V_{j\zeta}^2 + 1 + \frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}}\right] + (n_1 n_2 - n_s) V_{j\zeta}^2\right) \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s + n_s\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s + o(n_s m) \approx n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s. \qquad (3)
\end{aligned}
\]

We explain the approximation in (3) here. In the derivation below, $C$ represents a generic constant whose value may differ from place to place. As mentioned above, we only consider common variants in this supplementary note, which means $0.05 < f_j < 0.5$ for any SNP $j$. So, for any SNPs $j$ and $\zeta$, there exists a constant $C$ such that
\[
\frac{(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} < C.
\]
By the Cauchy–Schwarz inequality, we have
\[
\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} < C\sum_{\zeta=1}^{m} V_{j\zeta} \le C\sqrt{m\sum_{\zeta=1}^{m} V_{j\zeta}^2} = C\sqrt{m\, l_j}. \qquad (4)
\]
So, since $l_j < M$, we have
\[
n_s\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} \le C\, n_s\sqrt{m\, l_j} = o(n_s m).
\]

For $j \ne k$, we have
\[
\begin{aligned}
E\left[X_{\cdot j}^T X Y^T Y_{\cdot k}\right] &= E\left[\sum_{\zeta=1}^{m}\left(\sum_{\tau=1}^{n_1} X_{\tau j} X_{\tau\zeta}\right)\left(\sum_{\eta=1}^{n_2} Y_{\eta k} Y_{\eta\zeta}\right)\right] = \sum_{\zeta=1}^{m}\sum_{\tau=1}^{n_1}\sum_{\eta=1}^{n_2} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta k} Y_{\eta\zeta}\right] \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta k} Y_{\eta\zeta}\right] + \sum_{(\tau,\eta)\notin N} E\left[X_{\tau j} X_{\tau\zeta}\right] E\left[Y_{\eta k} Y_{\eta\zeta}\right]\right) \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau k} X_{\tau\zeta}^2\right] + (n_1 n_2 - n_s) V_{j\zeta} V_{k\zeta}\right) \\
&= \sum_{\zeta=1}^{m}\left(n_s\left[V_{j\zeta} V_{k\zeta} + V_{jk} + \frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}}\right] + (n_1 n_2 - n_s) V_{j\zeta} V_{k\zeta}\right) \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk} + n_s\sum_{\zeta=1}^{m}\frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}} \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk} + o(n_s m) \approx n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk}. \qquad (5)
\end{aligned}
\]

We note that $\Delta_{jk\zeta}$ is bounded by the corresponding LD entries by definition, so the approximation in (5) is derived from the same argument as in (4).

Combining (3) and (5), we have
\[
E\left[X^T X Y^T Y\right] \approx n_1 n_2 V^2 + m n_s V. \qquad (6)
\]
Substituting (6) into (2) completes the calculation of $\mathrm{Cov}(z_1, z_2)$.

1.3.2 Local

We assume SNPs from different local regions are independent. In other words, we assume the LD matrix takes the block-diagonal structure
\[
V = \begin{pmatrix} V_1 & & & \\ & V_2 & & \\ & & \ddots & \\ & & & V_I \end{pmatrix},
\]
where $I$ is the number of local regions.

Based on the derivations for global genetic covariance above, we generalize the results to calculate the covariance of local z scores, $\mathrm{Cov}(z_{1i}, z_{2i})$. We have
\[
\begin{aligned}
E\left[z_{1i} z_{2i}^T\right] &= E\left[\frac{X_i^T\phi_1}{\sqrt{n_1}}\cdot\frac{\phi_2^T Y_i}{\sqrt{n_2}}\right] = E\left[\frac{X_i^T\left(\sum_{\nu=1}^{I} X_\nu\beta_\nu + \epsilon\right)}{\sqrt{n_1}}\cdot\frac{\left(\sum_{\nu=1}^{I} Y_\nu\gamma_\nu + \delta\right)^T Y_i}{\sqrt{n_2}}\right] \\
&= \sum_{\nu=1}^{I} E\left[\frac{X_i^T X_\nu\beta_\nu\gamma_\nu^T Y_\nu^T Y_i}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X_i^T\epsilon\delta^T Y_i}{\sqrt{n_1 n_2}}\right] \\
&= E\left[\frac{X_i^T X_i\beta_i\gamma_i^T Y_i^T Y_i}{\sqrt{n_1 n_2}}\right] + \sum_{\nu\ne i} E\left[\frac{X_i^T X_\nu\beta_\nu\gamma_\nu^T Y_\nu^T Y_i}{\sqrt{n_1 n_2}}\right] + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \\
&= \frac{\rho_i}{m_i}\cdot\frac{E\left[X_i^T X_i Y_i^T Y_i\right]}{\sqrt{n_1 n_2}} + \sum_{\nu\ne i}\frac{\rho_\nu}{m_\nu} E\left[\frac{X_i^T E\left[X_\nu Y_\nu^T\right] Y_i}{\sqrt{n_1 n_2}}\right] + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \\
&\approx \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} V_i^2 + \frac{n_s\rho_i}{\sqrt{n_1 n_2}} V_i + \sum_{\nu\ne i}\frac{n_s\rho_\nu}{\sqrt{n_1 n_2}} V_i + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \qquad (7) \\
&= \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} V_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} V_i.
\end{aligned}
\]
Here, $\rho_t = \sum_{i=1}^{I}\rho_i + \rho_e$. The approximation in (7) follows from (6).

1.4 Variance of $\tilde z_{1ij}\tilde z_{2ij}$

Assume the eigendecomposition of $V_i$ is $V_i = U_i\Sigma_i U_i^T$. Denote $\tilde z_{1i} = U_i^T z_{1i}$ and $\tilde z_{2i} = U_i^T z_{2i}$. Then we have
\[
\begin{pmatrix} \tilde z_{1i} \\ \tilde z_{2i} \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} \frac{n_1 h_{1i}^2}{m_i}\Sigma_i^2 + \Sigma_i & \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i}\Sigma_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\Sigma_i \\ \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i}\Sigma_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\Sigma_i & \frac{n_2 h_{2i}^2}{m_i}\Sigma_i^2 + \Sigma_i \end{pmatrix}\right). \qquad (8)
\]
We order the eigenvalues in $\Sigma_i$ by their values and denote them by $w_{i1} \le w_{i2} \le \ldots \le w_{im_i}$. Let $\tilde z_{1ij}$ and $\tilde z_{2ij}$ denote the $j$-th elements of $\tilde z_{1i}$ and $\tilde z_{2i}$, respectively. We use the following proposition to obtain $\mathrm{Var}(\tilde z_{1ij}\tilde z_{2ij})$.

Proposition 2. Assume
\[
\begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} \sigma_1^2 & \rho_0 \\ \rho_0 & \sigma_2^2 \end{pmatrix}\right).
\]
Then $\mathrm{Var}(\xi_1\xi_2) = \rho_0^2 + \sigma_1^2\sigma_2^2$.

By Proposition 2, we have
\[
\mathrm{Var}[\tilde z_{1ij}\tilde z_{2ij}] = \left(\frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} w_{ij}^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}\right)^2 + \left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right). \qquad (9)
\]

1.5 Variance of $\hat\rho_i$

From (8), we have
\[
E[\tilde z_{1ij}\tilde z_{2ij}] = \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} w_{ij}^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}.
\]
We obtain the estimate of $\rho_i$ by a weighted linear regression of $\tilde z_{1ij}\tilde z_{2ij} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}$ on $w_{ij}^2$. We use LD score regression[2] to obtain the estimate of $\frac{n_s\rho_t}{\sqrt{n_1 n_2}}$, which is denoted by $\widehat{n_s\rho_t}/\sqrt{n_1 n_2}$. The weights are given by the reciprocal of the variance in (9). In practice, we use $\left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right)$ to approximate the variance of $\tilde z_{1ij}\tilde z_{2ij}$. For notational simplicity, we denote $q_{ij}^2 = \left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right)$ and $\eta_{ij} = \tilde z_{1ij}\tilde z_{2ij} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}$. We note $\mathrm{Var}(\eta_{ij}) \approx q_{ij}^2$. We use only the $K_i$ largest eigenvalues in the estimation of local genetic covariance to reduce noise.

1.5.1 Theoretical and empirical variance of $\hat\rho_i$

From the weighted linear regression, the estimate of local genetic covariance is given by
\[
\hat\rho_i = \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}. \qquad (10)
\]

So, the theoretical variance of $\hat\rho_i$ is given by
\[
\mathrm{Var}(\hat\rho_i) = \frac{m_i^2}{n_1 n_2}\,\mathrm{Var}\left(\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right) = \frac{m_i^2}{n_1 n_2}\cdot\frac{\sum_{j=1}^{K_i}\mathrm{Var}(\eta_{ij})\, w_{ij}^4/q_{ij}^4}{\left(\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2\right)^2} = \frac{m_i^2/(n_1 n_2)}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}.
\]
We assume the variance of $\eta_{ij}$ is proportional to $q_{ij}^2$ and write it as $q_{ij}^2\sigma_{\eta i}^2$, although the true value of $\sigma_{\eta i}^2$ is 1. In other words, we treat the variance of $\eta_{ij}$ as unknown and use the weighted linear regression procedure to estimate $\sigma_{\eta i}^2$; the estimate is denoted by $\hat\sigma_{\eta i}^2$. Multiplying the theoretical variance by $\hat\sigma_{\eta i}^2$, we obtain the empirical variance of $\hat\rho_i$:
\[
\widehat{\mathrm{Var}}(\hat\rho_i) = \mathrm{Var}(\hat\rho_i)\,\hat\sigma_{\eta i}^2. \qquad (11)
\]
We obtain $\hat\sigma_{\eta i}^2$ from the residuals of the regression model:
\[
\hat\sigma_{\eta i}^2 = \left[\sum_{j=1}^{K_i}\frac{\eta_{ij}^2}{q_{ij}^2} - \frac{\left(\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2\right)^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right]\bigg/(K_i - 1). \qquad (12)
\]

Substituting (12) into (11), we have
\[
\widehat{\mathrm{Var}}(\hat\rho_i) = \frac{m_i^2/(n_1 n_2)}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\cdot\left[\sum_{j=1}^{K_i}\frac{\eta_{ij}^2}{q_{ij}^2} - \frac{\left(\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2\right)^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right]\bigg/(K_i - 1). \qquad (13)
\]
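Note that $\hat\sigma_{\eta i}^2$ in (12) is exactly the weighted residual sum of squares of the fitted no-intercept regression divided by $K_i - 1$. A short numerical identity check with hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 25
w = rng.uniform(0.5, 3.0, K)                 # hypothetical eigenvalues
q_sq = (2.0 * w**2 + w) * (1.5 * w**2 + w)   # hypothetical weights q_ij^2
eta = 0.1 * w**2 + rng.normal(0, np.sqrt(q_sq))

# Eq. (12): closed form for the estimated noise scale.
num = np.sum(eta * w**2 / q_sq)
den = np.sum(w**4 / q_sq)
sigma_sq = (np.sum(eta**2 / q_sq) - num**2 / den) / (K - 1)

# Same quantity as the weighted residual sum of squares of the fitted slope.
slope = num / den
rss = np.sum((eta - slope * w**2) ** 2 / q_sq) / (K - 1)
print(np.isclose(sigma_sq, rss))  # True
```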

1.5.2 Approximation of $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$

To compensate for the noise introduced by the estimation of $\widehat{n_s\rho_t}/\sqrt{n_1 n_2}$ in LD score regression[2], we add $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$ to the total variance of $\hat\rho_i$, which is then given by
\[
\mathrm{Var}(\hat\rho_i) = \mathrm{Var}\left(E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right]\right) + E\left[\mathrm{Var}\left(\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right)\right].
\]
The estimate of $E\left[\mathrm{Var}\left(\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right)\right]$ is given by the empirical variance in (13). In this section, we derive the expression of $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$. By (10), we have
\[
\begin{aligned}
E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right] &= \frac{m_i}{\sqrt{n_1 n_2}}\, E\left[\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\,\middle|\,\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right] = \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} E\left[\eta_{ij} \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right] w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} \\
&= \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i}\left(E[\tilde z_{1ij}\tilde z_{2ij}] - \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}} w_{ij}\right) w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} \\
&= \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} E[\tilde z_{1ij}\tilde z_{2ij}]\, w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} - \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}. \qquad (14)
\end{aligned}
\]
The first term in (14) is constant. So, $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$ is given by
\[
\mathrm{Var}\left(E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right]\right) = \mathrm{Var}\left(\frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right) = \frac{m_i^2}{n_1 n_2}\left(\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right)^2\mathrm{Var}\left(\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right),
\]
where $\mathrm{Var}\left(\widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right)$ is the variance of the intercept estimated by LD score regression.

2 Consistency of SUPERGNOVA and GNOVA

The estimates of SUPERGNOVA are approximately consistent with those of GNOVA[4] when the number of SNPs $m$ is large. The GNOVA estimate corrected for sample overlap is given by
\[
\hat\rho_{gnova} = \frac{m\left(\sqrt{n_1 n_2}\, z_1^T z_2 - m n_s\rho_t\right)}{n_1 n_2\sum_{j=1}^{m} l_j}. \qquad (15)
\]

Without loss of generality, we demonstrate the equivalence between SUPERGNOVA and GNOVA for global genetic covariance estimation, using all the eigenvalues of the LD matrix. By (10), the SUPERGNOVA estimator for global genetic covariance is given by

\[
\hat\rho = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{m}\eta_j w_j^2/q_j^2}{\sum_{j=1}^{m} w_j^4/q_j^2},
\]
where $q_j$, $w_j$, and $\eta_j$ are the global analogues of their local counterparts. We assume the per-SNP heritability is small, so that we can approximate $q_j^2$ by $w_j^2$. The SUPERGNOVA estimator can then be written as:

\[
\begin{aligned}
\hat\rho &= \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{m}\left(\tilde z_{1j}\tilde z_{2j} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_j\right)}{\sum_{j=1}^{m} w_j^2} = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\tilde z_1^T\tilde z_2 - \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\sum_{j=1}^{m} w_j}{\sum_{j=1}^{m} w_j^2} \\
&= \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{z_1^T U U^T z_2 - \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\,\mathrm{tr}(V)}{\mathrm{tr}(V^2)} = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{z_1^T z_2 - \frac{m n_s\rho_t}{\sqrt{n_1 n_2}}}{\sum_{j=1}^{m} l_j} \\
&= \frac{m\left(\sqrt{n_1 n_2}\, z_1^T z_2 - m n_s\rho_t\right)}{n_1 n_2\sum_{j=1}^{m} l_j},
\end{aligned}
\]
which is exactly the GNOVA estimator given in (15).

3 Liability threshold model for SUPERGNOVA

In this section, we demonstrate that the SUPERGNOVA framework can be used in ascertained case/control studies. Let $P_k$ denote the sample prevalence of $\phi_k$ in study $k$ for $k = 1, 2$. Directly following LDSC[2], we compute z scores from the liability model as
\[
z_{kj} = \frac{\sqrt{n_k P_k(1 - P_k)}\,\left(\hat p_{cas,kj} - \hat p_{con,kj}\right)}{\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}}, \qquad (16)
\]
where $\hat p_{kj}$ denotes the allele frequency of SNP $j$ in the entire sample of study $k$, and $\hat p_{cas,kj}$ and $\hat p_{con,kj}$ denote the allele frequencies in the case and control samples of study $k$, respectively. We assume the MAF of a SNP differs little between the two studies and equals the population MAF, i.e. $\hat p_{1j} = \hat p_{2j} = f_j$, where $f_j$ is the MAF of SNP $j$; this holds because we assume an infinitesimal model in which the contribution of a single SNP is small. We also assume that sample selection for study $k$ is based only on phenotype $k$; otherwise, $\hat p_{cas,kj}$ would be a biased estimate of $p_{cas,kj}$.

In the liability threshold (probit) model, the binary trait is determined by a continuous liability $\psi$, i.e. $\phi = \mathbb{1}[\psi > \tau]$, where $\tau$ is the liability threshold, given by $\tau = \Phi^{-1}(1 - K)$ with $\Phi$ the standard normal cdf and $K$ the population prevalence.
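For concreteness, the liability threshold, and the observed-scale conversion factor that appears later in Proposition 5, can be computed directly; the prevalences below are purely illustrative.

```python
from statistics import NormalDist

K1, K2 = 0.01, 0.05      # population prevalences (assumed for illustration)
P1, P2 = 0.5, 0.5        # sample prevalences in the ascertained studies

nd = NormalDist()
# Liability thresholds: tau_k = Phi^{-1}(1 - K_k).
tau1, tau2 = nd.inv_cdf(1 - K1), nd.inv_cdf(1 - K2)

# Observed-scale conversion factor from Proposition 5: rho_obs = rho * factor.
factor = (nd.pdf(tau1) * nd.pdf(tau2)
          * (P1 * (1 - P1) * P2 * (1 - P2)) ** 0.5
          / (K1 * (1 - K1) * K2 * (1 - K2)))
print(round(tau1, 3))  # 2.326
```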

We first state the theory for genome-wide genetic covariance. In analogy with the quantitative-trait model, we model the continuous liabilities by
\[
\psi_1 = X\beta + \epsilon, \qquad \psi_2 = Y\gamma + \delta,
\]
where the distributions of $X$, $Y$, $\beta$, $\gamma$, $\epsilon$, and $\delta$ and the definitions of heritability and genetic covariance are the same as in the quantitative-trait model. Since we assume an additive model and Hardy–Weinberg equilibrium, following the arguments in LDSC[2], we can state the proofs in terms of haploid genotypes without loss of generality. We denote the 0–1 coded genotype matrices by $G$ and $H$ for notational simplicity, so that $X_{ij} = (G_{ij} - f_j)/\sqrt{f_j(1 - f_j)}$ and $Y_{ij} = (H_{ij} - f_j)/\sqrt{f_j(1 - f_j)}$, where $f_j$ is the MAF of SNP $j$ when $X$ and $Y$ are haploid standardized genotype matrices. Since we assume an infinitesimal model, the effect of a single SNP is small and the MAF of a SNP differs little between the two studies. We define
\[
\begin{aligned}
p_{cas,1j} &= P[G_{ij} = 1 \mid \phi_{1i} = 1], & p_{con,1j} &= P[G_{ij} = 1 \mid \phi_{1i} = 0], \\
p_{cas,2j} &= P[H_{ij} = 1 \mid \phi_{2i} = 1], & p_{con,2j} &= P[H_{ij} = 1 \mid \phi_{2i} = 0]
\end{aligned}
\]
as the allele frequencies of SNP $j$ in cases and controls for study 1 and study 2, respectively, where $i$ represents a generic individual in the corresponding study.

Proposition 3. Under the liability model above, we have
\[
E[\hat p_{cas,1j}] - E[\hat p_{con,1j}] = 0, \qquad E[\hat p_{cas,2j}] - E[\hat p_{con,2j}] = 0.
\]

Proof. The distribution of $\psi_{1i}$ given $G_{ij}$ and $\beta$ is $\psi_{1i} \mid (G_{ij}, \beta) \sim N\left(X_{ij} V_j^T\beta,\, 1\right)$. The variance is 1 because we assume an infinitesimal model and the marginal heritability explained by a single SNP is small. So, we have
\[
P(\phi_{1i} = 1 \mid X_{ij}, \beta) = P(\psi_{1i} > \tau_1 \mid X_{ij}, \beta) = 1 - \Phi\left(\tau_1 - X_{ij} V_j^T\beta\right) \approx K_1 + \phi(\tau_1)\, X_{ij} V_j^T\beta, \qquad (17)
\]
where $\phi$ is the standard normal density; the approximation is a Taylor expansion at $\tau_1$. By the results in (17), we have
\[
\begin{aligned}
E[\hat p_{cas,1j}] &= E\left[\frac{\sum_{\phi_{1i}=1}\mathbb{1}\{G_{ij} = 1\}}{n_1 P_1}\right] = \frac{n_1 P_1\, E\left[\mathbb{1}\{G_{ij} = 1\} \mid \phi_{1i} = 1\right]}{n_1 P_1} = P[G_{ij} = 1 \mid \phi_{1i} = 1] \\
&= \frac{P[\phi_{1i} = 1 \mid G_{ij} = 1]\, P[G_{ij} = 1]}{P[\phi_{1i} = 1]} = \frac{f_j}{K_1}\, P[\phi_{1i} = 1 \mid G_{ij} = 1] = \frac{f_j}{K_1}\, E\left[P(\phi_{1i} = 1 \mid G_{ij} = 1, \beta)\right] \\
&= \frac{f_j}{K_1}\, E\left[K_1 + \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right] = f_j.
\end{aligned}
\]
We compute $E[\hat p_{con,1j}]$, $E[\hat p_{cas,2j}]$, and $E[\hat p_{con,2j}]$ with the same argument, which completes the proof.

By Proposition 3, we have
\[
E[z_{kj}] = E\left[\frac{\sqrt{n_k P_k(1 - P_k)}\,\left(\hat p_{cas,kj} - \hat p_{con,kj}\right)}{\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}}\right] \approx \frac{\sqrt{n_k P_k(1 - P_k)}\, E\left[\hat p_{cas,kj} - \hat p_{con,kj}\right]}{E\left[\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}\right]} = 0. \qquad (18)
\]
The approximation in (18) ignores an $O(1/n)$ term by moving the expectation into the numerator and the denominator.

By Lemma 2 in the supplementary note of LDSC[2], we have
\[
\begin{aligned}
E[p_{cas,1j} - p_{con,1j}] &= 0, \qquad E[p_{cas,2j} - p_{con,2j}] = 0, \\
\mathrm{Var}[p_{cas,1j} - p_{con,1j}] &= \frac{f_j(1 - f_j)\,\phi(\tau_1)^2 h_1^2}{m K_1^2(1 - K_1)^2}\, l_j, \qquad \mathrm{Var}[p_{cas,2j} - p_{con,2j}] = \frac{f_j(1 - f_j)\,\phi(\tau_2)^2 h_2^2}{m K_2^2(1 - K_2)^2}\, l_j, \\
\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2j} - p_{con,2j}] &= \frac{f_j(1 - f_j)\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, l_j.
\end{aligned}
\]
To generalize the theory to SUPERGNOVA, we want to compute $\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2\zeta} - p_{con,2\zeta}]$ for $j \ne \zeta$, which is equivalent to $E[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})]$.

Proposition 4. Under the liability model above, we have
\[
E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})\right] = \frac{\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, V_j^T V_\zeta.
\]

Proof. Following the results in Proposition 3, we have
\[
p_{cas,1j} \mid \beta = P(G_{ij} = 1 \mid \phi_{1i} = 1, \beta) = \frac{P(\phi_{1i} = 1 \mid G_{ij} = 1, \beta)\, P(G_{ij} = 1)}{P(\phi_{1i} = 1)} = \frac{f_j}{K_1}\left(K_1 + \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right).
\]
With the same arguments, we have
\[
\begin{aligned}
p_{con,1j} \mid \beta &= \frac{f_j}{1 - K_1}\left(1 - K_1 - \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right), \\
p_{cas,2j} \mid \gamma &= \frac{f_j}{K_2}\left(K_2 + \phi(\tau_2)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\gamma\right), \\
p_{con,2j} \mid \gamma &= \frac{f_j}{1 - K_2}\left(1 - K_2 - \phi(\tau_2)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\gamma\right).
\end{aligned}
\]

Thus, we have
\[
\begin{aligned}
&E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})\right] = E\left[E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta}) \mid \beta, \gamma\right]\right] \\
&= E\left[E[p_{cas,1j} \mid \beta]\, E[p_{cas,2\zeta} \mid \gamma]\right] + E\left[E[p_{con,1j} \mid \beta]\, E[p_{con,2\zeta} \mid \gamma]\right] \\
&\quad - E\left[E[p_{cas,1j} \mid \beta]\, E[p_{con,2\zeta} \mid \gamma]\right] - E\left[E[p_{con,1j} \mid \beta]\, E[p_{cas,2\zeta} \mid \gamma]\right] \\
&= E\left[\left(f_j + \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{K_1}\right)\left(f_\zeta + \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{K_2}\right)\right] \\
&\quad + E\left[\left(f_j - \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{1 - K_1}\right)\left(f_\zeta - \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{1 - K_2}\right)\right] \\
&\quad - E\left[\left(f_j + \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{K_1}\right)\left(f_\zeta - \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{1 - K_2}\right)\right] \\
&\quad - E\left[\left(f_j - \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{1 - K_1}\right)\left(f_\zeta + \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{K_2}\right)\right] \\
&= \frac{\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, V_j^T V_\zeta.
\end{aligned}
\]

We now give $\mathrm{Cov}(z_1, z_2)$ for binary traits, which can be seen as a generalization of Proposition 2 of LDSC[2].

Proposition 5. Under the liability model above, we have
\[
\mathrm{Cov}(z_1, z_2) = \frac{\sqrt{n_1 n_2}}{m}\,\rho_{obs}\, V^2 + \sqrt{n_1 n_2 P_1(1 - P_1) P_2(1 - P_2)}\left(\frac{N_{cas,cas}}{N_{cas,1} N_{cas,2}} + \frac{N_{con,con}}{N_{con,1} N_{con,2}} - \frac{N_{cas,con}}{N_{cas,1} N_{con,2}} - \frac{N_{con,cas}}{N_{con,1} N_{cas,2}}\right) V,
\]
where $\rho_{obs}$ denotes the covariance on the observed scale:
\[
\rho_{obs} = \rho\left(\frac{\phi(\tau_1)\phi(\tau_2)\sqrt{P_1(1 - P_1) P_2(1 - P_2)}}{K_1(1 - K_1) K_2(1 - K_2)}\right),
\]
$N_{a,k}$ denotes the number of samples with phenotype status $a$ in study $k$, and $N_{a,b}$ denotes the number of overlapping samples with status $a$ in study 1 and status $b$ in study 2.

Proof. By Proposition 3, we have $E[z_1] = E[z_2] = 0$, so $\mathrm{Cov}(z_{1j}, z_{2\zeta}) = E[z_{1j} z_{2\zeta}]$. We have
\[
E[z_{1j} z_{2\zeta}] = E\left[\frac{\sqrt{n_1 n_2 P_1 P_2(1 - P_1)(1 - P_2)}\,\left(\hat p_{cas,1j} - \hat p_{con,1j}\right)\left(\hat p_{cas,2\zeta} - \hat p_{con,2\zeta}\right)}{\sqrt{\hat p_{1j}\left(1 - \hat p_{1j}\right)\hat p_{2\zeta}\left(1 - \hat p_{2\zeta}\right)}}\right].
\]
By omitting an error term of order $O(1/n)$, we have
\[
E[z_{1j} z_{2\zeta}] \approx \frac{\sqrt{n_1 n_2 P_1 P_2(1 - P_1)(1 - P_2)}\, E\left[E\left[\left(\hat p_{cas,1j} - \hat p_{con,1j}\right)\left(\hat p_{cas,2\zeta} - \hat p_{con,2\zeta}\right) \mid \beta, \gamma\right]\right]}{\sqrt{f_j f_\zeta(1 - f_j)(1 - f_\zeta)}}. \qquad (19)
\]
The approximation in (19) follows the arguments in (18). After conditioning on $\beta$ and $\gamma$, the only source of variation in the sample allele frequencies $\hat p_{cas,1}$, $\hat p_{con,1}$, $\hat p_{cas,2}$, and $\hat p_{con,2}$ is sampling error. We can write $\hat p_{cas,1j}\hat p_{cas,2\zeta} = (p_{cas,1j} + \eta)(p_{cas,2\zeta} + \nu)$, where $\eta$ and $\nu$ denote
