
Supplementary Note

1 Properties for the SUPERGNOVA framework

1.1 Statistical model for global genetic covariance

We follow the random-design, random-effects model used by LDSC[2] to construct phenotypes. Suppose we sample two cohorts for two different phenotypes, with sample sizes $n_1$ and $n_2$, respectively. Assume the two GWAS share the same set of $m$ SNPs. We measure phenotype 1 in cohort 1 and phenotype 2 in cohort 2, and we assume all $m$ SNPs are associated with both traits. We model the phenotype vectors as
\[
\phi_1 = X\beta + \epsilon, \qquad \phi_2 = Y\gamma + \delta,
\]
where $X$ and $Y$ are standardized random genotype matrices with dimensions $n_1 \times m$ and $n_2 \times m$; $\beta$ and $\gamma$ are vectors of standardized genotype effect sizes; and $\epsilon$ and $\delta$ are vectors of residuals representing environmental effects and non-additive genetic effects.

Each row of $X$ and $Y$ represents the standardized genotypes of an individual in the corresponding GWAS. By standardized genotypes, we mean the genotype of each SNP is normalized to mean zero and variance one. We assume the genotypes of different individuals are independent. Due to linkage disequilibrium (LD), genotypes of different SNPs are correlated. We denote the LD matrix by $V$; that is, $\mathrm{cov}(X_{i_1\cdot}) = V = \mathrm{cov}(Y_{i_2\cdot})$ for any $1 \le i_1 \le n_1$ and $1 \le i_2 \le n_2$. We define the LD score of variant $j$ as $l_j := \sum_k V_{jk}^2$ and assume it is bounded by a generic constant $M$, i.e. $l_j < M$ for $j \in \{1, 2, \ldots, m\}$. We suppose that $(\beta^T, \gamma^T)^T$ follows a multivariate normal distribution with mean zero and covariance matrix

\[
\mathrm{Var}\begin{pmatrix} \beta \\ \gamma \end{pmatrix} = \frac{1}{m}\begin{pmatrix} h_1^2 I_m & \rho I_m \\ \rho I_m & h_2^2 I_m \end{pmatrix}.
\]

We define $h_1^2$ and $h_2^2$ as the heritabilities of trait 1 and trait 2, respectively, and $\rho$ as the genetic covariance between the two traits. The genetic correlation $r$ is defined as $\rho/\sqrt{h_1^2 h_2^2}$. In practice, two GWAS often share a subset of samples. Without loss of generality, we assume $0 \le n_s \le \min\{n_1, n_2\}$ samples are shared by the two GWAS and that the first $n_s$ samples in each study are shared, i.e. the first $n_s$ rows of $X$ and $Y$ are identical. To account for the non-genetic correlation introduced by sample overlap, we assume $\epsilon$ and $\delta$ follow a multivariate normal distribution with covariance

\[
\mathrm{Cov}[\epsilon_i, \delta_j] = \begin{cases} \rho_e, & 1 \le i = j \le n_s \\ 0, & \text{otherwise.} \end{cases}
\]
The variances of $\epsilon$ and $\delta$ are
\[
\mathrm{Var}[\epsilon] = (1 - h_1^2) I_{n_1}, \qquad \mathrm{Var}[\delta] = (1 - h_2^2) I_{n_2},
\]
so that

\[
\begin{aligned}
\mathrm{Var}[\phi_1] &= \mathrm{Var}[X\beta] + \mathrm{Var}[\epsilon] = E\left[X\beta\beta^T X^T\right] + (1 - h_1^2) I_{n_1} \\
&= \frac{h_1^2}{m} E\left[XX^T\right] + (1 - h_1^2) I_{n_1} = h_1^2 I_{n_1} + (1 - h_1^2) I_{n_1} = I_{n_1},
\end{aligned}
\]

and similarly $\mathrm{Var}[\phi_2] = I_{n_2}$. We assume genotypes, effect sizes, and environmental effects are independent of each other.
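As a quick sanity check on this construction, the model can be simulated directly. The sketch below is illustrative only: it assumes an identity LD matrix ($V = I$) and arbitrary parameter values ($h_1^2 = h_2^2 = 0.5$, $\rho = 0.4$), draws standardized genotypes and effect sizes as specified above, and confirms numerically that the phenotype variance is close to 1 and the empirical covariance of the genetic components is close to $\rho$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 200                    # individuals and SNPs (illustrative)
h1_sq, h2_sq, rho = 0.5, 0.5, 0.4   # heritabilities and genetic covariance

# Standardized genotypes with V = I (no LD between SNPs).
f = rng.uniform(0.05, 0.5, m)       # MAFs of common variants
X = (rng.binomial(2, f, (n, m)) - 2 * f) / np.sqrt(2 * f * (1 - f))

# (beta_j, gamma_j) ~ N(0, [[h1_sq, rho], [rho, h2_sq]] / m), independent across j.
eff = rng.multivariate_normal([0, 0], np.array([[h1_sq, rho], [rho, h2_sq]]) / m, m)
beta, gamma = eff[:, 0], eff[:, 1]

g1, g2 = X @ beta, X @ gamma                        # genetic components
phi1 = g1 + rng.normal(0, np.sqrt(1 - h1_sq), n)    # phenotype 1

print(round(float(np.var(phi1)), 2))          # close to 1
print(round(float(np.cov(g1, g2)[0, 1]), 2))  # close to rho = 0.4
```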

Genetic covariance $\rho$ is the covariance of the genetic components. For an individual with standardized genotype $G$, denoting $g_1$ and $g_2$ as the genetic components for trait 1 and trait 2, respectively, we have:

\[
\mathrm{Cov}[g_1, g_2] = E\left[G^T\beta\gamma^T G\right] = E\left[E\left[G^T\beta\gamma^T G \mid G\right]\right] = E\left[G^T E\left[\beta\gamma^T\right] G\right] = \frac{\rho}{m} E\left[G^T G\right] = \rho.
\]

We define the genetic correlation as the genetic covariance normalized by the heritabilities:
\[
r_g = \frac{\mathrm{Cov}[g_1, g_2]}{\sqrt{\mathrm{Var}[g_1]\,\mathrm{Var}[g_2]}} = \frac{\rho}{\sqrt{h_1^2 h_2^2}}.
\]

1.2 Statistical model for local genetic covariance

In this section, we generalize the statistical framework above for local genetic covariance. We assume φ1 and φ2 follow additive linear models:

\[
\phi_1 = \sum_{i=1}^{I} X_i\beta_i + \epsilon, \qquad \phi_2 = \sum_{i=1}^{I} Y_i\gamma_i + \delta,
\]

where $X_i$ and $Y_i$ are the standardized genotypes and $\beta_i$ and $\gamma_i$ are the effect sizes of SNPs in region $i$. We assume SNPs from different regions are independent and use $V_i$ to denote the LD matrix in region $i$. We assume the effect sizes $(\beta_i^T, \gamma_i^T)^T$ follow a multivariate normal distribution:

\[
\begin{pmatrix} \beta_i \\ \gamma_i \end{pmatrix} \sim N\left(0,\; \frac{1}{m_i}\begin{pmatrix} h_{1i}^2 I_{m_i} & \rho_i I_{m_i} \\ \rho_i I_{m_i} & h_{2i}^2 I_{m_i} \end{pmatrix}\right),
\]

and effect sizes of SNPs from different regions are independent. The environmental covariance $\rho_e$ for local genetic covariance is defined in the same way as for global genetic covariance. The variances of $\epsilon$ and $\delta$ are
\[
\mathrm{Var}[\epsilon] = \left(1 - \sum_{i=1}^{I} h_{1i}^2\right) I_{n_1}, \qquad \mathrm{Var}[\delta] = \left(1 - \sum_{i=1}^{I} h_{2i}^2\right) I_{n_2},
\]
so that $\mathrm{Var}(\phi_1) = I_{n_1}$ and $\mathrm{Var}(\phi_2) = I_{n_2}$.
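The key step in the local derivation that follows is that $E\left[G_i^T E\left[\beta_i\gamma_i^T\right] G_i\right] = (\rho_i/m_i)\,\mathrm{tr}(V_i) = \rho_i$, because a standardized LD matrix has unit diagonal. A minimal numerical sketch (hypothetical region sizes, $\rho_i$ values, and AR(1)-style LD blocks) shows that the regional contributions simply add:

```python
import numpy as np

# Hypothetical local parameters: 3 regions with sizes m_i and covariances rho_i.
sizes = [10, 25, 40]
rhos = [0.02, -0.01, 0.05]

def ar1_ld(m, r=0.6):
    """AR(1)-style correlation matrix as a stand-in LD matrix V_i."""
    idx = np.arange(m)
    return r ** np.abs(idx[:, None] - idx[None, :])

# Cov[g1, g2] = sum_i (rho_i / m_i) * tr(V_i); tr(V_i) = m_i (unit diagonal).
total = sum(rho / m * np.trace(ar1_ld(m)) for m, rho in zip(sizes, rhos))
print(round(total, 6))  # equals sum of rho_i = 0.06
```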

The covariance of the genetic components is the sum of the local genetic covariances. Denote $G_i$ as the standardized genotype in region $i$ for an individual. We have
\[
\begin{aligned}
\mathrm{Cov}[g_1, g_2] &= E\left[\left(\sum_{i=1}^{I} G_i^T\beta_i\right)\left(\sum_{i=1}^{I}\gamma_i^T G_i\right)\right] = \sum_{i=1}^{I} E\left[G_i^T\beta_i\gamma_i^T G_i\right] = \sum_{i=1}^{I} E\left[E\left[G_i^T\beta_i\gamma_i^T G_i \mid G_i\right]\right] \\
&= \sum_{i=1}^{I} E\left[G_i^T E\left[\beta_i\gamma_i^T\right] G_i\right] = \sum_{i=1}^{I}\frac{\rho_i}{m_i} E\left[G_i^T G_i\right] = \sum_{i=1}^{I}\rho_i.
\end{aligned}
\]
Local genetic correlation is defined by:
\[
r_{ig} = \frac{\rho_i}{\sqrt{h_{1i}^2 h_{2i}^2}}.
\]

1.3 Covariance of z scores

In genome-wide association studies (GWAS), summary statistics are more accessible than individual-level genotype data due to privacy and data-sharing concerns. For a GWAS of a quantitative trait, the z score of a single SNP $j$ is given by:

\[
z_j = \frac{\hat\beta_j}{\mathrm{se}(\hat\beta_j)},
\]
where $\hat\beta_j$ is the estimated coefficient from the marginal linear regression between the trait and SNP $j$, and $\mathrm{se}(\hat\beta_j)$ is the corresponding standard error. In practice, we approximate the z scores by $z_{1j} = X_{\cdot j}^T\phi_1/\sqrt{n_1}$ and $z_{2j} = Y_{\cdot j}^T\phi_2/\sqrt{n_2}$.

1.3.1 Global

We derive the variance–covariance matrix of $(z_1^T, z_2^T)^T$ in this section. It is sufficient to show that
\[
\mathrm{Cov}(z_1, z_2) = \frac{\sqrt{n_1 n_2}\,\rho}{m} V^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} V, \qquad (1)
\]
where $\rho_t = \rho + \rho_e$, because $\mathrm{Var}(z_1)$ and $\mathrm{Var}(z_2)$ can then be derived from (1). In fact, to compute $\mathrm{Var}(z_1)$, we can assume trait 1 and trait 2 are from the same study, and hence we have
\[
\mathrm{Var}(z_1) = \mathrm{Cov}(z_1, z_1) = \frac{\sqrt{n_1 n_1}\, h_1^2}{m} V^2 + \frac{n_1\left(h_1^2 + 1 - h_1^2\right)}{\sqrt{n_1 n_1}} V = \frac{n_1 h_1^2}{m} V^2 + V.
\]
Similarly, $\mathrm{Var}(z_2) = (n_2 h_2^2/m)\, V^2 + V$. To prove (1), we begin with the following proposition.

Proposition 1. Assume $\Gamma_1, \Gamma_2 \overset{i.i.d.}{\sim} B(1, p_\Gamma)$; $\Pi_1, \Pi_2 \overset{i.i.d.}{\sim} B(1, p_\Pi)$; $\mathrm{Cor}(\Gamma_1, \Pi_1) = \mathrm{Cor}(\Gamma_2, \Pi_2) = r_{\Gamma\Pi}$; $\Gamma_1$ and $\Pi_2$ are independent; $\Gamma_2$ and $\Pi_1$ are independent. We have:

(1).
\[
E\left[(\Gamma_1 + \Gamma_2 - 2p_\Gamma)^2(\Pi_1 + \Pi_2 - 2p_\Pi)^2\right] = 4\left(r_{\Gamma\Pi}^2 + 1\right) p_\Pi(1 - p_\Pi)\, p_\Gamma(1 - p_\Gamma) + 2 r_{\Gamma\Pi}\sqrt{p_\Pi(1 - p_\Pi)}\sqrt{p_\Gamma(1 - p_\Gamma)}\,(1 - 2p_\Pi)(1 - 2p_\Gamma).
\]

(2). Additionally, if $\Lambda_1, \Lambda_2 \overset{i.i.d.}{\sim} B(1, p_\Lambda)$; $\mathrm{Cor}(\Gamma_1, \Lambda_1) = \mathrm{Cor}(\Gamma_2, \Lambda_2) = r_{\Gamma\Lambda}$; $\mathrm{Cor}(\Pi_1, \Lambda_1) = \mathrm{Cor}(\Pi_2, \Lambda_2) = r_{\Pi\Lambda}$; $E[(\Pi_1 - p_\Pi)(\Gamma_1 - p_\Gamma)(\Lambda_1 - p_\Lambda)] = E[(\Pi_2 - p_\Pi)(\Gamma_2 - p_\Gamma)(\Lambda_2 - p_\Lambda)] = \Delta$; $\Lambda_1$ is independent of $\Gamma_2$ and $\Pi_2$; and $\Lambda_2$ is independent of $\Gamma_1$ and $\Pi_1$, then:
\[
E\left[(\Gamma_1 + \Gamma_2 - 2p_\Gamma)(\Lambda_1 + \Lambda_2 - 2p_\Lambda)(\Pi_1 + \Pi_2 - 2p_\Pi)^2\right] = 4\left(r_{\Gamma\Pi} r_{\Pi\Lambda} + r_{\Gamma\Lambda}\right) p_\Pi(1 - p_\Pi)\sqrt{p_\Gamma(1 - p_\Gamma)}\sqrt{p_\Lambda(1 - p_\Lambda)} + 2(1 - 2p_\Pi)\,\Delta.
\]

We note that the genotype of SNP $j$ before standardization is the sum of two independent identically distributed Bernoulli variables with success probability equal to the minor allele frequency $f_j$ of SNP $j$, where $0.05 < f_j < 0.5$; we only consider common variants. Under Hardy–Weinberg equilibrium, for any $1 \le \tau \le n_1$, we approximate $X_{\tau j}$ by
\[
X_{\tau j} := \frac{S_{1\tau j} + S_{2\tau j} - 2f_j}{\sqrt{2 f_j(1 - f_j)}},
\]
where $S_{1\tau j}$ and $S_{2\tau j}$ represent the allelic dosages of SNP $j$ on each chromosome. The approximation of the genotypes in $Y$ is defined in the same way.

By Proposition 1, for two different SNPs $j$ and $\zeta$, we have
\[
E\left[X_{\tau j}^2 X_{\tau\zeta}^2\right] = V_{j\zeta}^2 + 1 + \frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}}.
\]
For SNP $k$, $k \ne j, \zeta$, denoting $\Delta_{jk\zeta} = E\left[(S_{\alpha\tau j} - f_j)(S_{\alpha\tau k} - f_k)(S_{\alpha\tau\zeta} - f_\zeta)\right]$, $\alpha = 1, 2$, we have
\[
E\left[X_{\tau j} X_{\tau k} X_{\tau\zeta}^2\right] = V_{j\zeta} V_{k\zeta} + V_{jk} + \frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}},
\]
where $V_{j\zeta}$ represents the entry in the $j$-th row and $\zeta$-th column of the LD matrix $V$; in other words, $V_{j\zeta}$ is the correlation between the genotypes of SNP $j$ and SNP $\zeta$. $V_{jk}$ and $V_{k\zeta}$ are defined in the same way.

Now, we prove (1). We have
\[
\begin{aligned}
E\left[z_1 z_2^T\right] &= E\left[\frac{X^T\phi_1}{\sqrt{n_1}}\cdot\frac{\phi_2^T Y}{\sqrt{n_2}}\right] = E\left[\frac{X^T(X\beta + \epsilon)}{\sqrt{n_1}}\cdot\frac{(Y\gamma + \delta)^T Y}{\sqrt{n_2}}\right] \\
&= E\left[\frac{X^T X\beta\gamma^T Y^T Y}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X^T\epsilon\delta^T Y}{\sqrt{n_1 n_2}}\right] \\
&= E\left[E\left[\frac{X^T X\beta\gamma^T Y^T Y}{\sqrt{n_1 n_2}}\,\middle|\, X, Y\right]\right] + E\left[E\left[\frac{X^T\epsilon\delta^T Y}{\sqrt{n_1 n_2}}\,\middle|\, X, Y\right]\right] \\
&= E\left[\frac{X^T X\, E\left[\beta\gamma^T\right] Y^T Y}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X^T E\left[\epsilon\delta^T\right] Y}{\sqrt{n_1 n_2}}\right] \\
&= \frac{\rho}{m\sqrt{n_1 n_2}}\, E\left[X^T X Y^T Y\right] + \frac{\rho_e}{\sqrt{n_1 n_2}}\, E\left[\sum_{i=1}^{n_s} X_{i\cdot}^T Y_{i\cdot}\right] \\
&= \frac{\rho}{m\sqrt{n_1 n_2}}\, E\left[X^T X Y^T Y\right] + \frac{\rho_e n_s}{\sqrt{n_1 n_2}}\, V. \qquad (2)
\end{aligned}
\]

To complete the calculation, we next compute $E\left[X_{\cdot j}^T X Y^T Y_{\cdot k}\right]$. First, we denote $N = \{(\tau, \eta) : \tau = \eta = 1, 2, \ldots, n_s\}$. For $j = k$, we have
\[
\begin{aligned}
E\left[X_{\cdot j}^T X Y^T Y_{\cdot j}\right] &= E\left[\sum_{\zeta=1}^{m}\left(\sum_{\tau=1}^{n_1} X_{\tau j} X_{\tau\zeta}\right)\left(\sum_{\eta=1}^{n_2} Y_{\eta j} Y_{\eta\zeta}\right)\right] = \sum_{\zeta=1}^{m}\sum_{\tau=1}^{n_1}\sum_{\eta=1}^{n_2} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta j} Y_{\eta\zeta}\right] \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta j} Y_{\eta\zeta}\right] + \sum_{(\tau,\eta)\notin N} E\left[X_{\tau j} X_{\tau\zeta}\right] E\left[Y_{\eta j} Y_{\eta\zeta}\right]\right) \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j}^2 X_{\tau\zeta}^2\right] + (n_1 n_2 - n_s) V_{j\zeta}^2\right) \\
&= \sum_{\zeta=1}^{m}\left(n_s\left[V_{j\zeta}^2 + 1 + \frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}}\right] + (n_1 n_2 - n_s) V_{j\zeta}^2\right) \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s + n_s\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s + o(n_s m) \approx n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta}^2 + m n_s. \qquad (3)
\end{aligned}
\]

We explain the approximation in (3) here. In the derivation below, $C$ represents a generic constant whose value may differ from place to place. As mentioned above, we only consider common variants in this supplementary note, which means $0.05 < f_j < 0.5$ for any SNP $j$. So, for any SNPs $j$ and $\zeta$, there exists a constant $C$ such that
\[
\frac{(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} < C.
\]
By the Cauchy–Schwarz inequality, we have
\[
\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} < C\sum_{\zeta=1}^{m} V_{j\zeta} \le C\sqrt{m\sum_{\zeta=1}^{m} V_{j\zeta}^2} = C\sqrt{m\, l_j}. \qquad (4)
\]
So, since $l_j < M$, we have
\[
n_s\sum_{\zeta=1}^{m}\frac{V_{j\zeta}(1 - 2f_j)(1 - 2f_\zeta)}{2\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}} \le C\, n_s\sqrt{m\, l_j} = o(n_s m).
\]

For $j \ne k$, we have
\[
\begin{aligned}
E\left[X_{\cdot j}^T X Y^T Y_{\cdot k}\right] &= E\left[\sum_{\zeta=1}^{m}\left(\sum_{\tau=1}^{n_1} X_{\tau j} X_{\tau\zeta}\right)\left(\sum_{\eta=1}^{n_2} Y_{\eta k} Y_{\eta\zeta}\right)\right] = \sum_{\zeta=1}^{m}\sum_{\tau=1}^{n_1}\sum_{\eta=1}^{n_2} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta k} Y_{\eta\zeta}\right] \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau\zeta} Y_{\eta k} Y_{\eta\zeta}\right] + \sum_{(\tau,\eta)\notin N} E\left[X_{\tau j} X_{\tau\zeta}\right] E\left[Y_{\eta k} Y_{\eta\zeta}\right]\right) \\
&= \sum_{\zeta=1}^{m}\left(\sum_{(\tau,\eta)\in N} E\left[X_{\tau j} X_{\tau k} X_{\tau\zeta}^2\right] + (n_1 n_2 - n_s) V_{j\zeta} V_{k\zeta}\right) \\
&= \sum_{\zeta=1}^{m}\left(n_s\left[V_{j\zeta} V_{k\zeta} + V_{jk} + \frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}}\right] + (n_1 n_2 - n_s) V_{j\zeta} V_{k\zeta}\right) \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk} + n_s\sum_{\zeta=1}^{m}\frac{(1 - 2f_\zeta)\,\Delta_{jk\zeta}}{2 f_\zeta(1 - f_\zeta)\sqrt{f_j(1 - f_j)}\sqrt{f_k(1 - f_k)}} \\
&= n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk} + o(n_s m) \approx n_1 n_2\sum_{\zeta=1}^{m} V_{j\zeta} V_{k\zeta} + m n_s V_{jk}. \qquad (5)
\end{aligned}
\]

We note that $\Delta_{jk\zeta}$ is bounded by the corresponding LD entries by definition, so the approximation in (5) is derived from the same argument as in (4).

Combining (3) and (5), we have
\[
E\left[X^T X Y^T Y\right] \approx n_1 n_2 V^2 + m n_s V. \qquad (6)
\]
Substituting (6) into (2) completes the calculation of $\mathrm{Cov}(z_1, z_2)$.

1.3.2 Local

We assume SNPs from different local regions are independent. In other words, we assume the LD matrix takes the block-diagonal structure
\[
V = \begin{pmatrix} V_1 & & & \\ & V_2 & & \\ & & \ddots & \\ & & & V_I \end{pmatrix},
\]
where $I$ is the number of local regions.

Based on the derivations for global genetic covariance above, we generalize the results to calculate the covariance of local z scores, $\mathrm{Cov}(z_{1i}, z_{2i})$. We have
\[
\begin{aligned}
E\left[z_{1i} z_{2i}^T\right] &= E\left[\frac{X_i^T\phi_1}{\sqrt{n_1}}\cdot\frac{\phi_2^T Y_i}{\sqrt{n_2}}\right] = E\left[\frac{X_i^T\left(\sum_{\nu=1}^{I} X_\nu\beta_\nu + \epsilon\right)}{\sqrt{n_1}}\cdot\frac{\left(\sum_{\nu=1}^{I} Y_\nu\gamma_\nu + \delta\right)^T Y_i}{\sqrt{n_2}}\right] \\
&= \sum_{\nu=1}^{I} E\left[\frac{X_i^T X_\nu\beta_\nu\gamma_\nu^T Y_\nu^T Y_i}{\sqrt{n_1 n_2}}\right] + E\left[\frac{X_i^T\epsilon\delta^T Y_i}{\sqrt{n_1 n_2}}\right] \\
&= E\left[\frac{X_i^T X_i\beta_i\gamma_i^T Y_i^T Y_i}{\sqrt{n_1 n_2}}\right] + \sum_{\nu\ne i} E\left[\frac{X_i^T X_\nu\beta_\nu\gamma_\nu^T Y_\nu^T Y_i}{\sqrt{n_1 n_2}}\right] + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \\
&= \frac{\rho_i}{m_i}\cdot\frac{E\left[X_i^T X_i Y_i^T Y_i\right]}{\sqrt{n_1 n_2}} + \sum_{\nu\ne i}\frac{\rho_\nu}{m_\nu} E\left[\frac{X_i^T E\left[X_\nu Y_\nu^T\right] Y_i}{\sqrt{n_1 n_2}}\right] + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \\
&\approx \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} V_i^2 + \frac{n_s\rho_i}{\sqrt{n_1 n_2}} V_i + \sum_{\nu\ne i}\frac{n_s\rho_\nu}{\sqrt{n_1 n_2}} V_i + \frac{n_s\rho_e}{\sqrt{n_1 n_2}} V_i \qquad (7) \\
&= \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} V_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} V_i.
\end{aligned}
\]
Here, $\rho_t = \sum_{i=1}^{I}\rho_i + \rho_e$. The approximation in (7) follows from (6).

1.4 Variance of $\tilde z_{1ij}\tilde z_{2ij}$

Assume the eigendecomposition of $V_i$ is $V_i = U_i\Sigma_i U_i^T$. Denote $\tilde z_{1i} = U_i^T z_{1i}$ and $\tilde z_{2i} = U_i^T z_{2i}$. Then we have
\[
\begin{pmatrix} \tilde z_{1i} \\ \tilde z_{2i} \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} \frac{n_1 h_{1i}^2}{m_i}\Sigma_i^2 + \Sigma_i & \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i}\Sigma_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\Sigma_i \\ \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i}\Sigma_i^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\Sigma_i & \frac{n_2 h_{2i}^2}{m_i}\Sigma_i^2 + \Sigma_i \end{pmatrix}\right). \qquad (8)
\]
We order the eigenvalues in $\Sigma_i$ by their values and denote them by $w_{i1} \le w_{i2} \le \ldots \le w_{im_i}$. Let $\tilde z_{1ij}$ and $\tilde z_{2ij}$ denote the $j$-th elements of $\tilde z_{1i}$ and $\tilde z_{2i}$, respectively. We use the following proposition to obtain $\mathrm{Var}(\tilde z_{1ij}\tilde z_{2ij})$.

Proposition 2. Assume
\[
\begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} \sigma_1^2 & \rho_0 \\ \rho_0 & \sigma_2^2 \end{pmatrix}\right).
\]
Then $\mathrm{Var}(\xi_1\xi_2) = \rho_0^2 + \sigma_1^2\sigma_2^2$.

By Proposition 2, we have
\[
\mathrm{Var}[\tilde z_{1ij}\tilde z_{2ij}] = \left(\frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} w_{ij}^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}\right)^2 + \left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right). \qquad (9)
\]

1.5 Variance of $\hat\rho_i$

From (8), we have
\[
E[\tilde z_{1ij}\tilde z_{2ij}] = \frac{\sqrt{n_1 n_2}\,\rho_i}{m_i} w_{ij}^2 + \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}.
\]
We obtain the estimate of $\rho_i$ by a weighted linear regression of $\tilde z_{1ij}\tilde z_{2ij} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}$ on $w_{ij}^2$. We use LD score regression[2] to obtain the estimate of $\frac{n_s\rho_t}{\sqrt{n_1 n_2}}$, which is denoted by $\widehat{n_s\rho_t}/\sqrt{n_1 n_2}$. The weights are given by the reciprocal of the variance in (9). In practice, we use $\left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right)$ to approximate the variance of $\tilde z_{1ij}\tilde z_{2ij}$. For notational simplicity, we denote $q_{ij}^2 = \left(\frac{n_1 h_{1i}^2}{m_i} w_{ij}^2 + w_{ij}\right)\left(\frac{n_2 h_{2i}^2}{m_i} w_{ij}^2 + w_{ij}\right)$ and $\eta_{ij} = \tilde z_{1ij}\tilde z_{2ij} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_{ij}$. We note $\mathrm{Var}(\eta_{ij}) \approx q_{ij}^2$. We use only the $K_i$ largest eigenvalues in the estimation of local genetic covariance to reduce noise.

1.5.1 Theoretical and empirical variance of $\hat\rho_i$

From the weighted linear regression, the estimate of local genetic covariance is given by
\[
\hat\rho_i = \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}. \qquad (10)
\]

So, the theoretical variance of $\hat\rho_i$ is given by
\[
\mathrm{Var}(\hat\rho_i) = \frac{m_i^2}{n_1 n_2}\,\mathrm{Var}\left(\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right) = \frac{m_i^2}{n_1 n_2}\cdot\frac{\sum_{j=1}^{K_i}\mathrm{Var}(\eta_{ij})\, w_{ij}^4/q_{ij}^4}{\left(\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2\right)^2} = \frac{m_i^2/(n_1 n_2)}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}.
\]
We assume the variance of $\eta_{ij}$ is proportional to $q_{ij}^2$ and write it as $q_{ij}^2\sigma_{\eta i}^2$, although the true value of $\sigma_{\eta i}^2$ is 1. In other words, we treat the variance of $\eta_{ij}$ as unknown and use the weighted linear regression procedure to estimate $\sigma_{\eta i}^2$; the estimate is denoted by $\hat\sigma_{\eta i}^2$. Multiplying the theoretical variance by $\hat\sigma_{\eta i}^2$, we obtain the empirical variance of $\hat\rho_i$:
\[
\widehat{\mathrm{Var}}(\hat\rho_i) = \mathrm{Var}(\hat\rho_i)\,\hat\sigma_{\eta i}^2. \qquad (11)
\]
We obtain $\hat\sigma_{\eta i}^2$ from the residuals of the regression model:
\[
\hat\sigma_{\eta i}^2 = \left[\sum_{j=1}^{K_i}\frac{\eta_{ij}^2}{q_{ij}^2} - \frac{\left(\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2\right)^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right]\bigg/(K_i - 1). \qquad (12)
\]

Substituting (12) into (11), we have
\[
\widehat{\mathrm{Var}}(\hat\rho_i) = \frac{m_i^2/(n_1 n_2)}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\cdot\left[\sum_{j=1}^{K_i}\frac{\eta_{ij}^2}{q_{ij}^2} - \frac{\left(\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2\right)^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right]\bigg/(K_i - 1). \qquad (13)
\]
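Note that $\hat\sigma_{\eta i}^2$ in (12) is exactly the weighted residual sum of squares of the fitted no-intercept regression divided by $K_i - 1$. A short numerical identity check with hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 25
w = rng.uniform(0.5, 3.0, K)                 # hypothetical eigenvalues
q_sq = (2.0 * w**2 + w) * (1.5 * w**2 + w)   # hypothetical weights q_ij^2
eta = 0.1 * w**2 + rng.normal(0, np.sqrt(q_sq))

# Eq. (12): closed form for the estimated noise scale.
num = np.sum(eta * w**2 / q_sq)
den = np.sum(w**4 / q_sq)
sigma_sq = (np.sum(eta**2 / q_sq) - num**2 / den) / (K - 1)

# Same quantity as the weighted residual sum of squares of the fitted slope.
slope = num / den
rss = np.sum((eta - slope * w**2) ** 2 / q_sq) / (K - 1)
print(np.isclose(sigma_sq, rss))  # True
```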

1.5.2 Approximation of $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$

To compensate for the noise introduced by the estimation of $\widehat{n_s\rho_t}/\sqrt{n_1 n_2}$ in LD score regression[2], we add $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$ to the total variance of $\hat\rho_i$, which is then given by
\[
\mathrm{Var}(\hat\rho_i) = \mathrm{Var}\left(E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right]\right) + E\left[\mathrm{Var}\left(\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right)\right].
\]
The estimate of $E\left[\mathrm{Var}\left(\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right)\right]$ is given by the empirical variance in (13). In this section, we derive the expression of $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$. By (10), we have
\[
\begin{aligned}
E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right] &= \frac{m_i}{\sqrt{n_1 n_2}}\, E\left[\frac{\sum_{j=1}^{K_i}\eta_{ij} w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\,\middle|\,\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right] = \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} E\left[\eta_{ij} \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right] w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} \\
&= \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i}\left(E[\tilde z_{1ij}\tilde z_{2ij}] - \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}} w_{ij}\right) w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} \\
&= \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} E[\tilde z_{1ij}\tilde z_{2ij}]\, w_{ij}^2/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2} - \frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}. \qquad (14)
\end{aligned}
\]
The first term in (14) is constant. So, $\mathrm{Var}\left(E\left[\hat\rho_i \mid \widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right]\right)$ is given by
\[
\mathrm{Var}\left(E\left[\hat\rho_i \,\middle|\, \frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right]\right) = \mathrm{Var}\left(\frac{m_i}{\sqrt{n_1 n_2}}\cdot\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right) = \frac{m_i^2}{n_1 n_2}\left(\frac{\sum_{j=1}^{K_i} w_{ij}^3/q_{ij}^2}{\sum_{j=1}^{K_i} w_{ij}^4/q_{ij}^2}\right)^2\mathrm{Var}\left(\frac{\widehat{n_s\rho_t}}{\sqrt{n_1 n_2}}\right),
\]
where $\mathrm{Var}\left(\widehat{n_s\rho_t}/\sqrt{n_1 n_2}\right)$ is the variance of the intercept estimated by LD score regression.

2 Consistency of SUPERGNOVA and GNOVA

The estimates of SUPERGNOVA are approximately consistent with those of GNOVA[4] when the number of SNPs $m$ is large. The GNOVA estimate corrected for sample overlap is given by
\[
\hat\rho_{gnova} = \frac{m\left(\sqrt{n_1 n_2}\, z_1^T z_2 - m n_s\rho_t\right)}{n_1 n_2\sum_{j=1}^{m} l_j}. \qquad (15)
\]

Without loss of generality, we demonstrate the equivalence between SUPERGNOVA and GNOVA for global genetic covariance estimation, using all the eigenvalues of the LD matrix. By (10), the SUPERGNOVA estimator for global genetic covariance is given by

\[
\hat\rho = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{m}\eta_j w_j^2/q_j^2}{\sum_{j=1}^{m} w_j^4/q_j^2},
\]
where $q_j$, $w_j$, and $\eta_j$ are the global analogues of their local counterparts. We assume the per-SNP heritability is small, so that we can approximate $q_j^2$ by $w_j^2$. The SUPERGNOVA estimator can then be written as:

\[
\begin{aligned}
\hat\rho &= \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\sum_{j=1}^{m}\left(\tilde z_{1j}\tilde z_{2j} - \frac{n_s\rho_t}{\sqrt{n_1 n_2}} w_j\right)}{\sum_{j=1}^{m} w_j^2} = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{\tilde z_1^T\tilde z_2 - \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\sum_{j=1}^{m} w_j}{\sum_{j=1}^{m} w_j^2} \\
&= \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{z_1^T U U^T z_2 - \frac{n_s\rho_t}{\sqrt{n_1 n_2}}\,\mathrm{tr}(V)}{\mathrm{tr}(V^2)} = \frac{m}{\sqrt{n_1 n_2}}\cdot\frac{z_1^T z_2 - \frac{m n_s\rho_t}{\sqrt{n_1 n_2}}}{\sum_{j=1}^{m} l_j} \\
&= \frac{m\left(\sqrt{n_1 n_2}\, z_1^T z_2 - m n_s\rho_t\right)}{n_1 n_2\sum_{j=1}^{m} l_j},
\end{aligned}
\]
which is exactly the GNOVA estimator given in (15).

3 Liability threshold model for SUPERGNOVA

In this section, we demonstrate that the SUPERGNOVA framework can be used in ascertained case/control studies. Let $P_k$ denote the sample prevalence of $\phi_k$ in study $k$ for $k = 1, 2$. Directly following LDSC[2], we compute z scores from the liability model as
\[
z_{kj} = \frac{\sqrt{n_k P_k(1 - P_k)}\,\left(\hat p_{cas,kj} - \hat p_{con,kj}\right)}{\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}}, \qquad (16)
\]
where $\hat p_{kj}$ denotes the allele frequency of SNP $j$ in the entire sample of study $k$, and $\hat p_{cas,kj}$ and $\hat p_{con,kj}$ denote the allele frequencies in the case and control samples of study $k$, respectively. We assume the MAF of a SNP differs little between the two studies and equals the population MAF, i.e. $\hat p_{1j} = \hat p_{2j} = f_j$, where $f_j$ is the MAF of SNP $j$; this holds because we assume an infinitesimal model in which the contribution of a single SNP is small. We also assume that sample selection for study $k$ is based only on phenotype $k$; otherwise, $\hat p_{cas,kj}$ would be a biased estimate of $p_{cas,kj}$.

In the liability threshold (probit) model, the binary trait is determined by a continuous liability $\psi$, i.e. $\phi = \mathbb{1}[\psi > \tau]$, where $\tau$ is the liability threshold, given by $\tau = \Phi^{-1}(1 - K)$ with $\Phi$ the standard normal cdf and $K$ the population prevalence.
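For concreteness, the liability threshold, and the observed-scale conversion factor that appears later in Proposition 5, can be computed directly; the prevalences below are purely illustrative.

```python
from statistics import NormalDist

K1, K2 = 0.01, 0.05      # population prevalences (assumed for illustration)
P1, P2 = 0.5, 0.5        # sample prevalences in the ascertained studies

nd = NormalDist()
# Liability thresholds: tau_k = Phi^{-1}(1 - K_k).
tau1, tau2 = nd.inv_cdf(1 - K1), nd.inv_cdf(1 - K2)

# Observed-scale conversion factor from Proposition 5: rho_obs = rho * factor.
factor = (nd.pdf(tau1) * nd.pdf(tau2)
          * (P1 * (1 - P1) * P2 * (1 - P2)) ** 0.5
          / (K1 * (1 - K1) * K2 * (1 - K2)))
print(round(tau1, 3))  # 2.326
```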

We first state the theory for genome-wide genetic covariance. In analogy with the quantitative-trait model, we model the continuous liabilities by
\[
\psi_1 = X\beta + \epsilon, \qquad \psi_2 = Y\gamma + \delta,
\]
where the distributions of $X$, $Y$, $\beta$, $\gamma$, $\epsilon$, and $\delta$ and the definitions of heritability and genetic covariance are the same as in the quantitative-trait model. Since we assume an additive model and Hardy–Weinberg equilibrium, following the arguments in LDSC[2], we can state the proofs in terms of haploid genotypes without loss of generality. We denote the 0–1 coded genotype matrices by $G$ and $H$ for notational simplicity, so that $X_{ij} = (G_{ij} - f_j)/\sqrt{f_j(1 - f_j)}$ and $Y_{ij} = (H_{ij} - f_j)/\sqrt{f_j(1 - f_j)}$, where $f_j$ is the MAF of SNP $j$ when $X$ and $Y$ are haploid standardized genotype matrices. Since we assume an infinitesimal model, the effect of a single SNP is small and the MAF of a SNP differs little between the two studies. We define
\[
\begin{aligned}
p_{cas,1j} &= P[G_{ij} = 1 \mid \phi_{1i} = 1], & p_{con,1j} &= P[G_{ij} = 1 \mid \phi_{1i} = 0], \\
p_{cas,2j} &= P[H_{ij} = 1 \mid \phi_{2i} = 1], & p_{con,2j} &= P[H_{ij} = 1 \mid \phi_{2i} = 0]
\end{aligned}
\]
as the allele frequencies of SNP $j$ in cases and controls for study 1 and study 2, respectively, where $i$ represents a generic individual in the corresponding study.

Proposition 3. Under the liability model above, we have
\[
E[\hat p_{cas,1j}] - E[\hat p_{con,1j}] = 0, \qquad E[\hat p_{cas,2j}] - E[\hat p_{con,2j}] = 0.
\]

Proof. The distribution of $\psi_{1i}$ given $G_{ij}$ and $\beta$ is $\psi_{1i} \mid (G_{ij}, \beta) \sim N\left(X_{ij} V_j^T\beta,\, 1\right)$. The variance is 1 because we assume an infinitesimal model and the marginal heritability explained by a single SNP is small. So, we have
\[
P(\phi_{1i} = 1 \mid X_{ij}, \beta) = P(\psi_{1i} > \tau_1 \mid X_{ij}, \beta) = 1 - \Phi\left(\tau_1 - X_{ij} V_j^T\beta\right) \approx K_1 + \phi(\tau_1)\, X_{ij} V_j^T\beta, \qquad (17)
\]
where $\phi$ is the standard normal density; the approximation is a Taylor expansion at $\tau_1$. By the results in (17), we have
\[
\begin{aligned}
E[\hat p_{cas,1j}] &= E\left[\frac{\sum_{\phi_{1i}=1}\mathbb{1}\{G_{ij} = 1\}}{n_1 P_1}\right] = \frac{n_1 P_1\, E\left[\mathbb{1}\{G_{ij} = 1\} \mid \phi_{1i} = 1\right]}{n_1 P_1} = P[G_{ij} = 1 \mid \phi_{1i} = 1] \\
&= \frac{P[\phi_{1i} = 1 \mid G_{ij} = 1]\, P[G_{ij} = 1]}{P[\phi_{1i} = 1]} = \frac{f_j}{K_1}\, P[\phi_{1i} = 1 \mid G_{ij} = 1] = \frac{f_j}{K_1}\, E\left[P(\phi_{1i} = 1 \mid G_{ij} = 1, \beta)\right] \\
&= \frac{f_j}{K_1}\, E\left[K_1 + \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right] = f_j.
\end{aligned}
\]
We compute $E[\hat p_{con,1j}]$, $E[\hat p_{cas,2j}]$, and $E[\hat p_{con,2j}]$ with the same argument, which completes the proof.

By Proposition 3, we have
\[
E[z_{kj}] = E\left[\frac{\sqrt{n_k P_k(1 - P_k)}\,\left(\hat p_{cas,kj} - \hat p_{con,kj}\right)}{\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}}\right] \approx \frac{\sqrt{n_k P_k(1 - P_k)}\, E\left[\hat p_{cas,kj} - \hat p_{con,kj}\right]}{E\left[\sqrt{\hat p_{kj}\left(1 - \hat p_{kj}\right)}\right]} = 0. \qquad (18)
\]
The approximation in (18) ignores an $O(1/n)$ term by moving the expectation into the numerator and the denominator.

By Lemma 2 in the supplementary note of LDSC[2], we have
\[
\begin{aligned}
E[p_{cas,1j} - p_{con,1j}] &= 0, \qquad E[p_{cas,2j} - p_{con,2j}] = 0, \\
\mathrm{Var}[p_{cas,1j} - p_{con,1j}] &= \frac{f_j(1 - f_j)\,\phi(\tau_1)^2 h_1^2}{m K_1^2(1 - K_1)^2}\, l_j, \qquad \mathrm{Var}[p_{cas,2j} - p_{con,2j}] = \frac{f_j(1 - f_j)\,\phi(\tau_2)^2 h_2^2}{m K_2^2(1 - K_2)^2}\, l_j, \\
\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2j} - p_{con,2j}] &= \frac{f_j(1 - f_j)\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, l_j.
\end{aligned}
\]
To generalize the theory to SUPERGNOVA, we want to compute $\mathrm{Cov}[p_{cas,1j} - p_{con,1j},\, p_{cas,2\zeta} - p_{con,2\zeta}]$ for $j \ne \zeta$, which is equivalent to $E[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})]$.

Proposition 4. Under the liability model above, we have
\[
E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})\right] = \frac{\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, V_j^T V_\zeta.
\]

Proof. Following the results in Proposition 3, we have
\[
p_{cas,1j} \mid \beta = P(G_{ij} = 1 \mid \phi_{1i} = 1, \beta) = \frac{P(\phi_{1i} = 1 \mid G_{ij} = 1, \beta)\, P(G_{ij} = 1)}{P(\phi_{1i} = 1)} = \frac{f_j}{K_1}\left(K_1 + \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right).
\]
With the same arguments, we have
\[
\begin{aligned}
p_{con,1j} \mid \beta &= \frac{f_j}{1 - K_1}\left(1 - K_1 - \phi(\tau_1)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\beta\right), \\
p_{cas,2j} \mid \gamma &= \frac{f_j}{K_2}\left(K_2 + \phi(\tau_2)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\gamma\right), \\
p_{con,2j} \mid \gamma &= \frac{f_j}{1 - K_2}\left(1 - K_2 - \phi(\tau_2)\sqrt{\frac{1 - f_j}{f_j}}\, V_j^T\gamma\right).
\end{aligned}
\]

Thus, we have
\[
\begin{aligned}
&E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta})\right] = E\left[E\left[(p_{cas,1j} - p_{con,1j})(p_{cas,2\zeta} - p_{con,2\zeta}) \mid \beta, \gamma\right]\right] \\
&= E\left[E[p_{cas,1j} \mid \beta]\, E[p_{cas,2\zeta} \mid \gamma]\right] + E\left[E[p_{con,1j} \mid \beta]\, E[p_{con,2\zeta} \mid \gamma]\right] \\
&\quad - E\left[E[p_{cas,1j} \mid \beta]\, E[p_{con,2\zeta} \mid \gamma]\right] - E\left[E[p_{con,1j} \mid \beta]\, E[p_{cas,2\zeta} \mid \gamma]\right] \\
&= E\left[\left(f_j + \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{K_1}\right)\left(f_\zeta + \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{K_2}\right)\right] \\
&\quad + E\left[\left(f_j - \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{1 - K_1}\right)\left(f_\zeta - \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{1 - K_2}\right)\right] \\
&\quad - E\left[\left(f_j + \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{K_1}\right)\left(f_\zeta - \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{1 - K_2}\right)\right] \\
&\quad - E\left[\left(f_j - \frac{\phi(\tau_1)\sqrt{(1 - f_j) f_j}\, V_j^T\beta}{1 - K_1}\right)\left(f_\zeta + \frac{\phi(\tau_2)\sqrt{(1 - f_\zeta) f_\zeta}\, V_\zeta^T\gamma}{K_2}\right)\right] \\
&= \frac{\sqrt{f_j(1 - f_j)}\sqrt{f_\zeta(1 - f_\zeta)}\,\phi(\tau_1)\phi(\tau_2)\,\rho}{m K_1(1 - K_1) K_2(1 - K_2)}\, V_j^T V_\zeta.
\end{aligned}
\]

We now give $\mathrm{Cov}(z_1, z_2)$ for binary traits, which can be seen as a generalization of Proposition 2 of LDSC[2].

Proposition 5. Under the liability model above, we have
\[
\mathrm{Cov}(z_1, z_2) = \frac{\sqrt{n_1 n_2}}{m}\,\rho_{obs}\, V^2 + \sqrt{n_1 n_2 P_1(1 - P_1) P_2(1 - P_2)}\left(\frac{N_{cas,cas}}{N_{cas,1} N_{cas,2}} + \frac{N_{con,con}}{N_{con,1} N_{con,2}} - \frac{N_{cas,con}}{N_{cas,1} N_{con,2}} - \frac{N_{con,cas}}{N_{con,1} N_{cas,2}}\right) V,
\]
where $\rho_{obs}$ denotes the covariance on the observed scale:
\[
\rho_{obs} = \rho\left(\frac{\phi(\tau_1)\phi(\tau_2)\sqrt{P_1(1 - P_1) P_2(1 - P_2)}}{K_1(1 - K_1) K_2(1 - K_2)}\right),
\]
$N_{a,k}$ denotes the number of samples with phenotype status $a$ in study $k$, and $N_{a,b}$ denotes the number of overlapping samples with status $a$ in study 1 and status $b$ in study 2.

Proof. By Proposition 3, we have $E[z_1] = E[z_2] = 0$, so $\mathrm{Cov}(z_{1j}, z_{2\zeta}) = E[z_{1j} z_{2\zeta}]$. We have
\[
E[z_{1j} z_{2\zeta}] = E\left[\frac{\sqrt{n_1 n_2 P_1 P_2(1 - P_1)(1 - P_2)}\,\left(\hat p_{cas,1j} - \hat p_{con,1j}\right)\left(\hat p_{cas,2\zeta} - \hat p_{con,2\zeta}\right)}{\sqrt{\hat p_{1j}\left(1 - \hat p_{1j}\right)\hat p_{2\zeta}\left(1 - \hat p_{2\zeta}\right)}}\right].
\]
By omitting an error term of order $O(1/n)$, we have
\[
E[z_{1j} z_{2\zeta}] \approx \frac{\sqrt{n_1 n_2 P_1 P_2(1 - P_1)(1 - P_2)}\, E\left[E\left[\left(\hat p_{cas,1j} - \hat p_{con,1j}\right)\left(\hat p_{cas,2\zeta} - \hat p_{con,2\zeta}\right) \mid \beta, \gamma\right]\right]}{\sqrt{f_j f_\zeta(1 - f_j)(1 - f_\zeta)}}. \qquad (19)
\]
The approximation in (19) follows the arguments in (18). After conditioning on $\beta$ and $\gamma$, the only source of variation in the sample allele frequencies $\hat p_{cas,1}$, $\hat p_{con,1}$, $\hat p_{cas,2}$, and $\hat p_{con,2}$ is sampling error. We can write $\hat p_{cas,1j}\hat p_{cas,2\zeta} = (p_{cas,1j} + \eta)(p_{cas,2\zeta} + \nu)$, where $\eta$ and $\nu$ denote
