

3.1.4 Testing for association

For the data given in table 3.1, comprising $n_{1\cdot}$ cases and $n_{0\cdot}$ controls, an association analysis tests the null hypothesis that the genetic variant and the disease occur independently of each other. This null hypothesis can be examined with any method suited to dichotomous outcome data, e.g. $\chi^2$ tests or logistic regression models. In general, the $\chi^2$ test of independence checks whether two categorical variables are independent by comparing the observed frequencies of all combinations of variable outcomes with the frequencies expected under no association. For our genetic data, several alternative $\chi^2$ tests can be distinguished.

The test can be allele based or genotype based; the latter offers several variants that correspond to different genetic modes of inheritance. Comparing all three genotype groups directly, we can calculate the test statistic

$$\chi^2_G = \sum_{d=0,1;\; g=0,1,2} \frac{(n_{dg}-e_{dg})^2}{e_{dg}},$$

with the expected counts calculated as $e_{dg} = n_{d\cdot}n_{\cdot g}/n$. Under the null hypothesis of independence, this test statistic is asymptotically $\chi^2$ distributed with 2 degrees of freedom (df). Assuming a dominant or recessive mode of inheritance yields specific alternative hypotheses that compare only two genotype groups, obtained by collapsing the heterozygotes with one of the homozygous genotypes. The test statistic assuming a dominant model is given by

$$\chi^2_{dom} = \sum_{d=0,1}\left[\frac{\left(n_{d0}-\frac{n_{\cdot 0}n_{d\cdot}}{n}\right)^2}{\frac{n_{\cdot 0}n_{d\cdot}}{n}} + \frac{\left((n_{d1}+n_{d2})-\frac{(n_{\cdot 1}+n_{\cdot 2})n_{d\cdot}}{n}\right)^2}{\frac{(n_{\cdot 1}+n_{\cdot 2})n_{d\cdot}}{n}}\right],$$

which can be simplified to

$$\chi^2_{dom} = \frac{n\,\bigl(n_{10}(n_{01}+n_{02}) - n_{00}(n_{11}+n_{12})\bigr)^2}{n_{1\cdot}n_{0\cdot}n_{\cdot 0}(n_{\cdot 1}+n_{\cdot 2})}. \tag{3.2}$$

The corresponding statistic when assuming a recessive model is

$$\chi^2_{rec} = \frac{n\,\bigl((n_{10}+n_{11})n_{02} - (n_{00}+n_{01})n_{12}\bigr)^2}{n_{1\cdot}n_{0\cdot}(n_{\cdot 0}+n_{\cdot 1})n_{\cdot 2}}.$$

Under the null hypothesis of no association, both statistics are asymptotically $\chi^2$ distributed with 1 df. As already mentioned, we can also test for association based on the alleles rather than the genotypes. Counting every allele doubles the sample size and yields $2n_{\cdot 0}+n_{\cdot 1}$ wildtype alleles and $2n_{\cdot 2}+n_{\cdot 1}$ mutant alleles. Among cases and controls, the wildtype alleles number $2n_{10}+n_{11}$ and $2n_{00}+n_{01}$, and the mutant alleles $2n_{12}+n_{11}$ and $2n_{02}+n_{01}$, respectively.

Plugging these numbers into the general formula for a $\chi^2$ test and simplifying results in the test statistic

$$\chi^2_{all} = \frac{2n\,\bigl((2n_{10}+n_{11})(n_{01}+2n_{02}) - (2n_{00}+n_{01})(n_{11}+2n_{12})\bigr)^2}{2n_{1\cdot}\,2n_{0\cdot}\,(2n_{\cdot 0}+n_{\cdot 1})(n_{\cdot 1}+2n_{\cdot 2})},$$

which is again asymptotically $\chi^2$ distributed with 1 df. Furthermore, because it is biologically plausible that the number of risk alleles influences disease occurrence, the Armitage trend test is often used. This test distinguishes all three possible genotypes but assumes a trend in the effects with an increasing level of the risk factor. The trend statistic is given by

$$\chi^2_{tr} = \frac{\left(\sum_{g=0}^{2} w_g\,(n_{0g}n_{1\cdot} - n_{1g}n_{0\cdot})\right)^2}{\frac{n_{0\cdot}n_{1\cdot}}{n}\left(\sum_{g=0}^{2} w_g^2\, n_{\cdot g}(n-n_{\cdot g}) - 2\sum_{g=0}^{2}\sum_{h=g+1}^{2} w_g w_h\, n_{\cdot g}n_{\cdot h}\right)},$$

with weights $w = (w_0, w_1, w_2)$ that can be chosen to fit different association models. The statistic is $\chi^2$ distributed with 1 df under the null hypothesis of no association. In GWAS, a linear trend with an increasing number of minor alleles is often assumed, denoted as additive effect. Therefore, we use $w = (0,1,2)$, and the test statistic simplifies to

$$\chi^2_{tr} = \frac{n\,\bigl(n(n_{11}+2n_{12}) - n_{1\cdot}(n_{\cdot 1}+2n_{\cdot 2})\bigr)^2}{n_{1\cdot}n_{0\cdot}\,\bigl(n(n_{\cdot 1}+4n_{\cdot 2}) - (n_{\cdot 1}+2n_{\cdot 2})^2\bigr)}. \tag{3.3}$$

In general, these weights are used not only when the trend is linear but also when the change is assumed to be monotonic (Clarke et al., 2011). Weights of $w = (0,0,1)$ would correspond to a recessive model, $w = (0,1,1)$ to a dominant one. The advantage of the allele based approach is the doubling of the sample size. However, genotype based tests should generally be preferred because they are robust to deviations from HWE, while the allele based test is only valid under the assumption of HWE. In addition, the genotype based analysis is the biologically more plausible variant. The test should be chosen according to the assumed biological function and mode of inheritance of a genetic variant. The trend test is suggested when no biological knowledge exists, because it often reaches the highest power ("locally optimal"). When sparse cells (expected count less than 5) occur, Fisher's exact test should be used instead of a $\chi^2$ test.
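The test statistics of this section can be evaluated directly from the six cell counts of the 2x3 genotype table. The following is a minimal sketch in plain Python, not the implementation used in this work: the function names `association_tests` and `p_value` and the example counts are purely illustrative assumptions, and the p-values rely on the closed-form tail probabilities of the $\chi^2$ distribution for 1 and 2 df.

```python
import math


def association_tests(cases, controls):
    """Chi-square association statistics for a 2x3 genotype table.

    cases    = (n10, n11, n12): case counts for genotypes 0, 1, 2
    controls = (n00, n01, n02): control counts for genotypes 0, 1, 2
    """
    n10, n11, n12 = cases
    n00, n01, n02 = controls
    n1_, n0_ = sum(cases), sum(controls)            # row totals n1., n0.
    c0, c1, c2 = n10 + n00, n11 + n01, n12 + n02    # column totals n.0, n.1, n.2
    n = n1_ + n0_

    # Genotype test (2 df): sum over all six cells of (observed - expected)^2 / expected.
    chi2_g = 0.0
    for row, row_total in ((cases, n1_), (controls, n0_)):
        for observed, col_total in zip(row, (c0, c1, c2)):
            expected = row_total * col_total / n
            chi2_g += (observed - expected) ** 2 / expected

    # Dominant model (1 df), closed form (3.2).
    chi2_dom = (n * (n10 * (n01 + n02) - n00 * (n11 + n12)) ** 2
                / (n1_ * n0_ * c0 * (c1 + c2)))
    # Recessive model (1 df).
    chi2_rec = (n * ((n10 + n11) * n02 - (n00 + n01) * n12) ** 2
                / (n1_ * n0_ * (c0 + c1) * c2))
    # Allele based test (1 df): every subject contributes two alleles.
    chi2_all = (2 * n * ((2 * n10 + n11) * (n01 + 2 * n02)
                         - (2 * n00 + n01) * (n11 + 2 * n12)) ** 2
                / (2 * n1_ * 2 * n0_ * (2 * c0 + c1) * (c1 + 2 * c2)))
    # Trend test with weights w = (0, 1, 2), closed form (3.3).
    chi2_tr = (n * (n * (n11 + 2 * n12) - n1_ * (c1 + 2 * c2)) ** 2
               / (n1_ * n0_ * (n * (c1 + 4 * c2) - (c1 + 2 * c2) ** 2)))

    return {"genotype": chi2_g, "dominant": chi2_dom, "recessive": chi2_rec,
            "allelic": chi2_all, "trend": chi2_tr}


def p_value(chi2, df):
    """Upper tail probability of a chi-square distribution with 1 or 2 df."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df = 1 and df = 2 are covered by this sketch")


# Made-up counts, for illustration only.
stats = association_tests(cases=(30, 50, 20), controls=(50, 40, 10))
for name, value in stats.items():
    df = 2 if name == "genotype" else 1
    print(f"{name:10s} chi2 = {value:.4f}  p = {p_value(value, df):.4f}")
```

The sketch assumes that no marginal total is zero; with sparse cells, an exact test is the better choice, as noted above.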

When additional variables, e.g. age or sex, should be considered in the analysis, a logistic regression model offers a good alternative by including them as covariates. In general, a regression model describes the influence of one or more risk factors $X_1, \dots, X_K$ on an outcome measure $Y$ by an equation of the form

$$f(Y) = \alpha + \beta_1 X_1 + \dots + \beta_K X_K + \varepsilon$$

with intercept $\alpha$, regression coefficients $\beta_1, \dots, \beta_K$ and $\varepsilon \sim N(0, \sigma^2)$. Given a quantitative phenotype, e.g. blood pressure, as outcome $Y$ and only one influencing genetic risk factor $G$, the model reduces to a simple linear regression of the form

$$Y = \alpha + \beta G + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2),$$

which is often rewritten as

$$E(Y \mid G) = \alpha + \beta G.$$

The equation describes which value of $Y$ is expected given a particular $G$, with residuals $\varepsilon = Y - E(Y \mid G)$. Given a dichotomous outcome variable, in our context affected ($D=1$) and unaffected ($D=0$) with respect to a disease of interest, we can no longer model the outcome directly by a linear equation. Therefore, a logit transformation, that is the logarithm of the odds of disease, has to be used for $D$, resulting in a logistic regression model of the form

$$\ln\left(\frac{P(D=1 \mid G)}{P(D=0 \mid G)}\right) = \alpha + \beta G. \tag{3.4}$$

Hence, we can obtain the expected probability of being affected given genotype $G$ as

$$E(D \mid G) = P(D=1 \mid G) = \frac{\exp(\alpha+\beta G)}{1+\exp(\alpha+\beta G)}.$$

The Armitage trend test uses the same information as a logistic regression model with one regression variable for the genotype coded 0, 1 and 2. Furthermore, the logistic regression coefficients are directly related to the odds ratios measuring the strength of association by

$$OR_{het} = \frac{\exp(\alpha+\beta)}{\exp(\alpha)} = \exp(\beta) \quad\text{and}\quad OR_{hom} = \frac{\exp(\alpha+2\beta)}{\exp(\alpha)} = \exp(2\beta). \tag{3.5}$$

In this regression model $OR_{hom} = OR_{het}^2$; hence, it is based on the assumption of a multiplicative allele effect. When multiplicativity should not be assumed, two dummy variables $G_{het}$ and $G_{hom}$ for the heterozygous and homozygous genotype can be used instead:

$$\ln\left(\frac{P(D=1 \mid G_{het}, G_{hom})}{P(D=0 \mid G_{het}, G_{hom})}\right) = \alpha + \beta_{het}G_{het} + \beta_{hom}G_{hom},$$

with $G_{het} = 1$ and $G_{hom} = 0$ for $G=1$, $G_{het} = 0$ and $G_{hom} = 1$ for $G=2$, and both equal to 0 for $G=0$. Then $OR_{het} = \exp(\beta_{het})$ and $OR_{hom} = \exp(\beta_{hom})$. We will come back to the connection between regression coefficients and OR in the context of GxE interaction in section 3.1.5 and for the derivation of our approach in chapter 6. As mentioned before, the advantage of a regression model over a $\chi^2$ test is that other influencing factors can be included in the analysis as well. These can be, e.g., sex and age or other confounders, such as additional genetic or environmental factors, that need to be adjusted for. By including these additional factors $X_1, \dots, X_K$, the model expands to a multiple logistic regression model of the form

$$\ln\left(\frac{P(D=1 \mid G, X_1, \dots, X_K)}{P(D=0 \mid G, X_1, \dots, X_K)}\right) = \alpha + \beta_G G + \beta_1 X_1 + \dots + \beta_K X_K.$$

From this model, odds ratios adjusted for the covariates can be calculated as shown before. The regression coefficients of a logistic regression model can be estimated by the principle of maximum likelihood. In general, the maximization cannot be performed analytically since no closed-form solution is available, so an iterative approach for numerical optimization, such as the Newton-Raphson algorithm, has to be used (Faraway, 2006). The genetic factor influences the disease when the corresponding estimated coefficient, e.g. $\hat\beta_G$, is significantly different from 0. This can be tested by dividing the estimate by its estimated standard deviation (standard error) $\hat\sigma_{\beta_G}$ and using a Wald test, assuming a normal distribution under the null hypothesis of no effect. A stratified analysis of case-control data according to categorical covariates is possible as well, e.g. with the Cochran-Mantel-Haenszel test (Agresti, 2002). Moreover, non-parametric methods exist, which we will not elaborate on since we concentrate on parametric approaches.
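To make the estimation step concrete, the following sketch fits the single-covariate model $\ln(P(D=1 \mid G)/P(D=0 \mid G)) = \alpha + \beta G$ by Newton-Raphson iterations and applies a Wald test to $\hat\beta$. It is only an illustration under simplifying assumptions: one covariate, no safeguards against separation or a singular information matrix, and made-up data. The names `fit_logistic` and `wald_test` are hypothetical and do not refer to any particular software package.

```python
import math


def _score_and_information(alpha, beta, g, d):
    """Score vector and Fisher information for logit P(D=1|G) = alpha + beta*G."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for gi, di in zip(g, d):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * gi)))
        w = p * (1.0 - p)
        s0 += di - p                 # derivative of the log-likelihood w.r.t. alpha
        s1 += (di - p) * gi          # derivative of the log-likelihood w.r.t. beta
        i00 += w
        i01 += w * gi
        i11 += w * gi * gi
    return s0, s1, i00, i01, i11


def fit_logistic(g, d, max_iter=50, tol=1e-12):
    """Return (alpha_hat, beta_hat, se_beta) obtained by Newton-Raphson."""
    alpha = beta = 0.0
    for _ in range(max_iter):
        s0, s1, i00, i01, i11 = _score_and_information(alpha, beta, g, d)
        det = i00 * i11 - i01 * i01
        # Newton step: (alpha, beta) += information^{-1} * score (2x2 inverse written out).
        step_alpha = (i11 * s0 - i01 * s1) / det
        step_beta = (i00 * s1 - i01 * s0) / det
        alpha += step_alpha
        beta += step_beta
        if abs(step_alpha) + abs(step_beta) < tol:
            break
    # Standard error of beta from the inverse information at the final estimate.
    _, _, i00, i01, i11 = _score_and_information(alpha, beta, g, d)
    se_beta = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return alpha, beta, se_beta


def wald_test(beta, se_beta):
    """Wald statistic and two-sided p-value under a standard normal null distribution."""
    z = beta / se_beta
    return z, math.erfc(abs(z) / math.sqrt(2.0))


# Made-up data: 50 cases (30 carriers) followed by 50 controls (15 carriers).
g = [1] * 30 + [0] * 20 + [1] * 15 + [0] * 35
d = [1] * 50 + [0] * 50
alpha_hat, beta_hat, se_hat = fit_logistic(g, d)
z, p = wald_test(beta_hat, se_hat)
print(f"beta = {beta_hat:.4f}, OR = {math.exp(beta_hat):.2f}, z = {z:.2f}, p = {p:.4f}")
```

With a binary covariate the model is saturated, so the fitted $\exp(\hat\beta)$ coincides with the odds ratio of the corresponding 2x2 table, which provides a simple check of the iteration. The same function accepts a genotype coded 0, 1 and 2.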

3.1.5 Gene x environment interactions and gene-environment associations

A mathematical definition of a GxE interaction

In the following we restrict our attention to a dominant genetic risk factor (carrier/non-carrier of the susceptibility allele) and a binary environmental factor (exposed/unexposed), since this is the setting used in our GxE interaction methods. The disease risks for the different risk factor combinations are given in table 3.2.

Table 3.2: Disease risks for individuals with different combinations of genetic and environmental risk factor

                                environmental factor
                                exposed       unexposed
genetic factor   carrier        $r_{11}$      $r_{10}$
                 non-carrier    $r_{01}$      $r_{00}$

For the definition of a gene x environment interaction in a statistical sense, the principle of conditional independence (Dawid, 1979; Jakulin and Bratko, 2004) is used. In general, two factors X and Y are called conditionally independent with respect to a third factor Z if and only if

P(X, Y|Z) = P(X|Z)P(Y|Z). (3.6)

An equivalent way to express this relationship is given by

P(X|Y, Z) = P(X|Z). (3.7)

In terms of gene x environment interactions, conditional independence is present when the effect of one of the risk factors (genetic or environmental) on the disease risk is the same across strata defined by the other risk factor. The absence of this independence is called interaction. Hence, a gene x environment interaction in a statistical sense is observed when the effect of the environmental factor on disease risk differs depending on the underlying genotype, or when the genotype effect on disease risk differs depending on the environmental exposure (Ottman, 1996).

However, whether an interaction in the sense of this definition exists depends on the scale on which the disease risks are measured. Two scales are commonly used: additive and multiplicative. Based on a cohort, an interaction on an additive scale is present when $r_{11} - r_{10} \neq r_{01} - r_{00}$, while an interaction on a multiplicative scale is present when $r_{11}/r_{10} \neq r_{01}/r_{00}$. In terms of relative risks, an interaction on an additive scale is given when $RR_{11} \neq RR_{01} + RR_{10} - 1$, and $RR_{11} \neq RR_{01}RR_{10}$ represents an interaction assuming a multiplicative model. Here, $RR_{10} = r_{10}/r_{00}$ denotes the relative risk for unexposed carriers with unexposed non-carriers of the susceptibility allele as the reference group, $RR_{01} = r_{01}/r_{00}$ the relative risk for exposed non-carriers and $RR_{11} = r_{11}/r_{00}$ that for exposed carriers.

Table 3.3: Data for an unmatched case-control study with a binary genetic and environmental factor

            genetic factor: carrier         genetic factor: non-carrier
            environmental factor            environmental factor
            exposed       unexposed         exposed       unexposed         Total
cases       $n_{111}$     $n_{110}$         $n_{101}$     $n_{100}$         $N_{cases}$
controls    $n_{011}$     $n_{010}$         $n_{001}$     $n_{000}$         $N_{controls}$

If we take a look at the possible kinds of biological interactions listed in section 2.4.3, we notice that interactions of type M1, M2 and M4 express themselves in statistical interactions as defined above on both scales. A relationship of the three factors of type M5 is not reflected in a statistical interaction at all, neither on an additive nor on a multiplicative scale. In this case, only the frequency of the joint occurrence of the genetic and environmental factor is influenced, but the effect of the genetic factor on the disease risk stays the same across strata of the environmental factor and vice versa. A biological interaction as described by model M3 may, but need not, manifest itself in a statistical interaction, and the scale of the risk measure plays an important role. For instance, when we consider a multistage process like initiation or promotion in cancer, two factors that act at the same stage fit a risk model on an additive scale, and an interaction results in a departure from additivity. When the two factors act at different stages, a multiplicative model fits better, and an interaction can be observed as a deviation from multiplicativity (Rothman et al., 1980; Siemiatycki and Thomas, 1981). Hence, both scales can be adequate depending on the underlying pathophysiological model and mechanism (Koopman, 1977; Kupper and Hogan, 1978; Ottman, 1996; Walter and Holford, 1978). The choice of scale may also be guided by the goal of the investigation, which leads to a preference for a multiplicative scale when the causes of disease are to be revealed (Rothman et al., 1980). For our further investigations, we used the definition of GxE interaction on a multiplicative scale, which is the commonly used and adequate one in a case-control study.
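The scale dependence of the interaction definition can be illustrated numerically. The sketch below evaluates both criteria from the four disease risks $r_{11}, r_{10}, r_{01}, r_{00}$; the function name `interaction_measures` and the risk values are made up for illustration.

```python
def interaction_measures(r11, r10, r01, r00):
    """Departure from additivity and from multiplicativity on the relative risk scale.

    Returns (additive_departure, multiplicative_ratio):
      additive_departure   = RR11 - (RR10 + RR01 - 1), zero without additive interaction
      multiplicative_ratio = RR11 / (RR10 * RR01), one without multiplicative interaction
    """
    rr11, rr10, rr01 = r11 / r00, r10 / r00, r01 / r00
    additive_departure = rr11 - (rr10 + rr01 - 1.0)
    multiplicative_ratio = rr11 / (rr10 * rr01)
    return additive_departure, multiplicative_ratio


# Risks that are exactly multiplicative (RR11 = RR10 * RR01 = 6) ...
add_a, mult_a = interaction_measures(r11=0.06, r10=0.02, r01=0.03, r00=0.01)
# ... and risks that are exactly additive (RR11 = RR10 + RR01 - 1 = 4).
add_b, mult_b = interaction_measures(r11=0.04, r10=0.02, r01=0.03, r00=0.01)

print(f"multiplicative risks: additive departure = {add_a:.3f}, ratio = {mult_a:.3f}")
print(f"additive risks:       additive departure = {add_b:.3f}, ratio = {mult_b:.3f}")
```

In the first setting there is no interaction on the multiplicative scale but a positive one on the additive scale; in the second setting there is no interaction on the additive scale but a less than multiplicative joint effect.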

Measures and testing of GxE interactions and G-E associations

In this section we restrict ourselves to the case-control study design, since the data in our applications and our simulation studies are based on it. Therefore, assume in the following that we have an unmatched case-control study for the disease D with a binary environmental exposure E and a binary genetic factor G. The data can be presented in a 2x4 table as given in table 3.3. The entries $n_{dge}$ denote the number of cases ($d=1$) and controls ($d=0$) that are carriers ($g=1$) or non-carriers ($g=0$) of the susceptibility allele and exposed ($e=1$) or unexposed ($e=0$) to the environmental factor. The observed cell counts for cases $n_1 = (n_{111}, n_{110}, n_{101}, n_{100})$ and controls $n_0 = (n_{011}, n_{010}, n_{001}, n_{000})$ can be viewed as realizations from two independent multinomial distributions

$$n_1 \sim MN(N_{cases}, p_1) \quad\text{and}\quad n_0 \sim MN(N_{controls}, p_0),$$

where $p_1 = (p_{111}, p_{110}, p_{101}, p_{100})$ and $p_0 = (p_{011}, p_{010}, p_{001}, p_{000})$ are the cell probabilities of the underlying case-control population.

It is known that relative risks (RR) cannot be calculated from studies with a case-control design. However, we can calculate odds ratios (OR), with unexposed non-carriers as reference group, given by

$$\begin{aligned}
OR_g &= \frac{p_{000}\,p_{110}}{p_{010}\,p_{100}} && \text{genetic main effect,}\\
OR_e &= \frac{p_{000}\,p_{101}}{p_{001}\,p_{100}} && \text{environmental main effect,}\\
OR_{ge} &= \frac{p_{000}\,p_{111}}{p_{011}\,p_{100}} && \text{joint effect of gene and environment.}
\end{aligned} \tag{3.8}$$

Under a multiplicative model with no interaction effect, we have $OR_{ge} = OR_g OR_e$, and hence the interaction effect can be measured by the interaction parameter

$$\Psi = \frac{OR_{ge}}{OR_g\,OR_e}, \tag{3.9}$$

with

$$\Psi \begin{cases} > 1 & \text{positive interaction effect (more than multiplicative),}\\ = 1 & \text{no interaction effect (multiplicative),}\\ < 1 & \text{negative interaction effect (less than multiplicative).} \end{cases}$$

It should be noted that when a GxE interaction exists, it expresses itself not only in the interaction itself but also in dependent distributions of genetic factor and exposure within the cases and within the controls. In the presence of a positive interaction, exposure and genetic susceptibility factor occur together more often than expected in cases and less often in controls; the contrary holds in the presence of a negative interaction.

The dependency of a genetic and an environmental factor is called gene-environment association (G-E) and can be measured by OR as well, with

$$OR_{cases} = \frac{p_{100}\,p_{111}}{p_{110}\,p_{101}} \ \text{ for cases,} \qquad OR_{controls} = \frac{p_{000}\,p_{011}}{p_{010}\,p_{001}} \ \text{ for controls.} \tag{3.10}$$

In the absence of G-E association in the corresponding group, $OR_{cases} = 1$ and $OR_{controls} = 1$, respectively. A departure from one indicates an association.

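The exact relationship between $\Psi$ and the stratified G-E association measures $OR_{cases}$ and $OR_{controls}$ can be derived by simply rearranging the formula for $\Psi$, resulting in

$$\Psi = \frac{OR_{ge}}{OR_g\,OR_e} = \frac{\dfrac{p_{000}\,p_{111}}{p_{011}\,p_{100}}}{\dfrac{p_{000}\,p_{110}}{p_{010}\,p_{100}} \cdot \dfrac{p_{000}\,p_{101}}{p_{001}\,p_{100}}} = \frac{\dfrac{p_{100}\,p_{111}}{p_{110}\,p_{101}}}{\dfrac{p_{000}\,p_{011}}{p_{010}\,p_{001}}} = \frac{OR_{cases}}{OR_{controls}}. \tag{3.11}$$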
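Relation (3.11) can be checked numerically for any set of positive cell probabilities. The sketch below computes the odds ratios of (3.8) and (3.10) together with $\Psi$; the dictionary layout keyed by $(d, g, e)$, the function name `odds_ratios` and the probabilities are illustrative assumptions only.

```python
def odds_ratios(p):
    """Odds ratios from cell probabilities (or counts) p[(d, g, e)].

    d: 1 = case, 0 = control; g: 1 = carrier; e: 1 = exposed.
    """
    or_g = p[0, 0, 0] * p[1, 1, 0] / (p[0, 1, 0] * p[1, 0, 0])    # genetic main effect
    or_e = p[0, 0, 0] * p[1, 0, 1] / (p[0, 0, 1] * p[1, 0, 0])    # environmental main effect
    or_ge = p[0, 0, 0] * p[1, 1, 1] / (p[0, 1, 1] * p[1, 0, 0])   # joint effect
    or_cases = p[1, 0, 0] * p[1, 1, 1] / (p[1, 1, 0] * p[1, 0, 1])      # G-E association in cases
    or_controls = p[0, 0, 0] * p[0, 1, 1] / (p[0, 1, 0] * p[0, 0, 1])   # G-E association in controls
    return {
        "OR_g": or_g, "OR_e": or_e, "OR_ge": or_ge,
        "OR_cases": or_cases, "OR_controls": or_controls,
        "Psi": or_ge / (or_g * or_e),
    }


# Made-up cell probabilities; each group sums to one.
p = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.20, (1, 0, 1): 0.25, (1, 0, 0): 0.25,   # cases
    (0, 1, 1): 0.10, (0, 1, 0): 0.20, (0, 0, 1): 0.30, (0, 0, 0): 0.40,   # controls
}
result = odds_ratios(p)
print(f"Psi = {result['Psi']:.4f}")
print(f"OR_cases / OR_controls = {result['OR_cases'] / result['OR_controls']:.4f}")
```

Both printed values agree, as stated by (3.11); here $OR_{cases} > 1$ and $OR_{controls} < 1$, the pattern described above for a positive interaction.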

According to this formula, we can distinguish two different situations in which $\Psi = 1$ and hence no interaction occurs, namely when no G-E association is present at all,

$$OR_{cases} = OR_{controls} = 1,$$

or when the G-E association is present with exactly the same magnitude in cases and controls,

$$OR_{cases} = OR_{controls} \neq 1.$$

The latter is a population-based G-E association (section 2.4.3), which occurs when a correlation between G and E exists in the whole population to the same degree, independently of disease status. On the other hand, when a G-E association is caused only by an underlying interaction effect, $OR_{cases}$ and $OR_{controls}$ depart from 1 in different directions. Hence, while a population-based G-E association can be observed to the same extent in cases and controls, an interaction is present when the G-E association differs between the two groups.

With decreasing prevalence of the disease, the departure of $OR_{controls}$ from 1 due to an interaction effect gets weaker, and $OR_{controls}$ reduces to one under the rare disease assumption (Schmidt and Schaid, 1999). Thus, for a rare disease the interaction effect is only reflected in the association within the cases, $OR_{cases}$.
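The effect of disease prevalence on the G-E association among controls can be made tangible with a small calculation. The sketch below rests on assumptions that are not part of the text: G and E are independent in the source population, and the disease risk follows a hypothetical parametrization $r_{ge} = r_{00}\,RR_g^{\,g}\,RR_e^{\,e}\,\theta^{\,ge}$ with interaction factor $\theta$. The function name `ge_association_by_status` and all numerical values are illustrative.

```python
def ge_association_by_status(r00, rr_g, rr_e, theta, pg, pe):
    """Return (OR_cases, OR_controls) for a population with independent G and E.

    r00        baseline risk of unexposed non-carriers
    rr_g, rr_e relative risks of the genetic and the environmental factor alone
    theta      hypothetical interaction factor on the relative risk scale
    pg, pe     population frequencies of carriers and of exposed subjects
    """
    cases, controls = {}, {}
    for g in (0, 1):
        for e in (0, 1):
            freq = (pg if g else 1.0 - pg) * (pe if e else 1.0 - pe)
            risk = r00 * rr_g ** g * rr_e ** e * theta ** (g * e)
            # Unnormalized cell weights suffice because odds ratios are scale invariant.
            cases[g, e] = freq * risk
            controls[g, e] = freq * (1.0 - risk)
    or_cases = cases[0, 0] * cases[1, 1] / (cases[1, 0] * cases[0, 1])
    or_controls = controls[0, 0] * controls[1, 1] / (controls[1, 0] * controls[0, 1])
    return or_cases, or_controls


for r00 in (0.1, 0.01, 0.001):
    or_cases, or_controls = ge_association_by_status(
        r00, rr_g=2.0, rr_e=2.0, theta=2.0, pg=0.3, pe=0.4)
    print(f"r00 = {r00:<6} OR_cases = {or_cases:.3f}  OR_controls = {or_controls:.3f}")
```

Under these assumptions $OR_{cases}$ equals $\theta$ for every baseline risk, while $OR_{controls}$ lies below one and moves towards one as the baseline risk decreases.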

Since odds ratios are closely connected to logistic regression, with the coefficients of the regression model corresponding to the logarithm of the respective odds ratios, we can test an interaction effect by such a logistic regression model

$$\operatorname{logit} P(D=1 \mid g, e) = \log\left(\frac{P(D=1 \mid g, e)}{P(D=0 \mid g, e)}\right) = \alpha + \beta_e e + \beta_g g + \beta_{ge}\, g e. \tag{3.12}$$

The regression coefficient $\beta_g$ is a measure of the genetic main effect, $\beta_e$ measures the environmental main effect and $\beta_{ge}$ is a measure of the interaction effect between G and E.

The regression coefficients are related to the odds ratios and the interaction parameter $\Psi$ by

$$\beta_g = \log(OR_g), \qquad \beta_e = \log(OR_e), \qquad \beta_{ge} = \log(\Psi). \tag{3.13}$$

A regression coefficient of 0 corresponds to no effect, a coefficient $> 0$ indicates a positive effect and a coefficient $< 0$ a negative one. The G-E associations stratified by disease status can be measured by logistic regression as well:

$$\operatorname{logit} P(E=1 \mid D=1, g) = \log\left(\frac{P(E=1 \mid D=1, g)}{P(E=0 \mid D=1, g)}\right) = \alpha_{cases} + \beta_{cases}\, g, \tag{3.14}$$

$$\operatorname{logit} P(E=1 \mid D=0, g) = \log\left(\frac{P(E=1 \mid D=0, g)}{P(E=0 \mid D=0, g)}\right) = \alpha_{controls} + \beta_{controls}\, g, \tag{3.15}$$

with $\beta_{cases} = \log(OR_{cases})$, $\beta_{controls} = \log(OR_{controls})$ and

$$\beta_{cases} - \beta_{controls} = \log\left(\frac{OR_{cases}}{OR_{controls}}\right) \overset{(3.11)}{=} \log(\Psi) = \beta_{ge}. \tag{3.16}$$

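The regression coefficients and hence the OR can be estimated from the data by their maximum likelihood estimators (MLE), given by

$$\begin{aligned}
\hat\beta_g &= \log\left(\frac{n_{000}\,n_{110}}{n_{010}\,n_{100}}\right), &
\hat\beta_e &= \log\left(\frac{n_{000}\,n_{101}}{n_{001}\,n_{100}}\right),\\
\hat\beta_{cases} &= \log\left(\frac{n_{100}\,n_{111}}{n_{110}\,n_{101}}\right), &
\hat\beta_{controls} &= \log\left(\frac{n_{000}\,n_{011}}{n_{010}\,n_{001}}\right)
\end{aligned} \tag{3.17}$$

and

$$\hat\beta_{ge} \overset{(3.16)}{=} \hat\beta_{cases} - \hat\beta_{controls} = \log\left(\frac{n_{001}\,n_{100}\,n_{010}\,n_{111}}{n_{011}\,n_{110}\,n_{101}\,n_{000}}\right). \tag{3.18}$$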

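These MLE asymptotically follow approximate normal distributions (Le, 1991; Mukherjee et al., 2008)

$$\begin{aligned}
\hat\beta_g &\sim N(\beta_g, \sigma_g^2), &
\hat\beta_e &\sim N(\beta_e, \sigma_e^2),\\
\hat\beta_{cases} &\sim N(\beta_{cases}, \sigma_{cases}^2), &
\hat\beta_{controls} &\sim N(\beta_{controls}, \sigma_{controls}^2),\\
\hat\beta_{ge} &\sim N(\beta_{ge}, \sigma_{ge}^2)
\end{aligned} \tag{3.19}$$

with variance estimators given by

$$\begin{aligned}
\hat\sigma_g^2 &= \sum_{d=0,1}\sum_{g=0,1}\frac{1}{n_{dg0}}, &
\hat\sigma_e^2 &= \sum_{d=0,1}\sum_{e=0,1}\frac{1}{n_{d0e}},\\
\hat\sigma_{cases}^2 &= \sum_{g=0,1}\sum_{e=0,1}\frac{1}{n_{1ge}}, &
\hat\sigma_{controls}^2 &= \sum_{g=0,1}\sum_{e=0,1}\frac{1}{n_{0ge}},\\
\hat\sigma_{ge}^2 &= \sum_{d=0,1}\sum_{g=0,1}\sum_{e=0,1}\frac{1}{n_{dge}} = \hat\sigma_{cases}^2 + \hat\sigma_{controls}^2.
\end{aligned} \tag{3.20}$$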

The classical case-control test for gene x environment interactions simply tests the interaction coefficient $\beta_{ge}$ with null hypothesis $H_0: \beta_{ge} = 0$. Because of its approximate normal distribution, the case-control test statistic corresponds to a standardized normal test for $\beta_{ge}$, obtained by dividing the estimate $\hat\beta_{ge}$ by its estimated standard deviation $\hat\sigma_{ge}$, resulting in

$$Z_{cc} = \frac{\hat\beta_{ge}}{\hat\sigma_{ge}} = \frac{\hat\beta_{cases} - \hat\beta_{controls}}{\sqrt{\hat\sigma_{cases}^2 + \hat\sigma_{controls}^2}}.$$

Under the null hypothesis of no interaction, $\beta_{ge} = \beta_{cases} - \beta_{controls} = 0$, and this test statistic asymptotically follows a standard normal distribution. Furthermore, we have that

$$\beta_{cases} = \beta_{controls} = 0$$

when genotype and environmental factor are independent of each other; given a population-based G-E association,

$$\beta_{cases} = \beta_{controls} \neq 0$$

holds.
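The case-control test can be computed directly from the eight cell counts of table 3.3 by combining (3.17), (3.18) and (3.20). The following sketch is illustrative only: the function name `case_control_interaction_test`, the dictionary layout keyed by $(d, g, e)$ and the counts are assumptions, and all cells are required to be non-zero.

```python
import math


def case_control_interaction_test(n):
    """Classical case-control test for GxE interaction.

    n[(d, g, e)] holds the count of subjects with disease status d,
    carrier status g and exposure e; all eight counts must be positive.
    Returns (beta_ge, z_cc, two_sided_p).
    """
    # Log odds ratios of the G-E association within cases and within controls (3.17).
    beta_cases = math.log(n[1, 0, 0] * n[1, 1, 1] / (n[1, 1, 0] * n[1, 0, 1]))
    beta_controls = math.log(n[0, 0, 0] * n[0, 1, 1] / (n[0, 1, 0] * n[0, 0, 1]))

    # Variance estimators (3.20): sums of reciprocal cell counts.
    var_cases = sum(1.0 / n[1, g, e] for g in (0, 1) for e in (0, 1))
    var_controls = sum(1.0 / n[0, g, e] for g in (0, 1) for e in (0, 1))

    beta_ge = beta_cases - beta_controls                      # (3.18)
    z_cc = beta_ge / math.sqrt(var_cases + var_controls)
    p = math.erfc(abs(z_cc) / math.sqrt(2.0))                 # two-sided normal p-value
    return beta_ge, z_cc, p


# Made-up counts in the layout of table 3.3.
counts = {
    (1, 1, 1): 60, (1, 1, 0): 40, (1, 0, 1): 50, (1, 0, 0): 50,   # cases
    (0, 1, 1): 20, (0, 1, 0): 40, (0, 0, 1): 60, (0, 0, 0): 80,   # controls
}
beta_ge, z_cc, p = case_control_interaction_test(counts)
print(f"beta_ge = {beta_ge:.4f}, Psi = {math.exp(beta_ge):.2f}, Z_cc = {z_cc:.3f}, p = {p:.4f}")
```

For these counts the estimated interaction parameter is $\hat\Psi = \exp(\hat\beta_{ge}) = 2.25$, the same value that the direct formula (3.18) gives.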

3.2 Genome-wide association studies (GWAS)

3.2.1 Genetic epidemiological study types

In genome-wide studies the whole genome is systematically examined by using numerous genetic markers distributed across the complete genetic information to find genes involved in disease development. The counterpart to the exploratory genome-wide approach is the hypothesis-driven candidate gene study. Candidate gene studies focus on analyzing only genes or regions already known or expected to be involved in disease etiology. Candidates can come from other, e.g. experimental, studies, from knowledge in other species, or from information about functional relations of genes with the disease. For autoimmune diseases, for example, the HLA system on chromosome 6 is known as the most important candidate region. While candidate gene studies can be successful when good candidates are known, a genome-wide search is the method of choice when too little is known about the biological and biochemical processes of the disease and hence prior knowledge about potentially involved genes is inadequate. Furthermore, even when good candidates are known, genome-wide studies can find additional, previously unexpected genes. Genome-wide studies are independent of pathophysiological hypotheses and therefore keep all possibilities open.

As already mentioned, two different genetic principles can be used to find genes contributing to disease development: linkage and association. Linkage studies are successful in finding rare variants with high penetrance that strongly increase disease risk. In contrast, association studies have higher power for finding common variants with reduced penetrance and low to moderate risk effects, involved in a more complicated interplay of numerous genetic and non-genetic factors. They allow a finer mapping of potential disease causing factors, while linkage analyses can only identify a coarse region. Since association studies can be performed on a population basis, recruitment is simpler than for families. However, population approaches are more prone to confounding, e.g. by population stratification, possibly leading to false positive results. Before the 21st century, only linkage studies were possible genome-wide.

Genome-wide linkage studies were tremendously successful in identifying genes underlying monogenic diseases, which are characterized by rare occurrence, high penetrance and large relative risk (Hirschhorn, 2005; Thomas et al., 2005; Thomas, 2006). Major genes involved in clearly Mendelian subtypes of complex diseases showed similar properties and were detected as well, but beyond these, success in complex diseases was limited (Altmüller et al., 2001). For those factors of complex diseases that are involved in a complicated interplay of multiple genetic and environmental factors (Wang et al., 2005), the power was much too low due to incomplete penetrance and relatively small effects (Cardon and Bell, 2001; Hirschhorn, 2005; Risch and Merikangas, 1996; Risch, 2000; Tabor et al., 2002; Thomas, 2006). Association studies, on the other hand, which were only possible as a candidate approach at that time, failed due to an imperfect understanding of the fundamental biology of complex diseases and hence a lack of ability to pick good candidate genes (Pearson and Manolio, 2008; Sham and Cherny, 2010). Although candidate gene association studies revealed many susceptibility genes, the replication rate was low (Patterson and Cardon, 2005; Sham and Cherny, 2010; Todd, 2006; Zondervan, 2010). Reasons for that may be the overestimation of the ability to select adequate candidates and too lenient thresholds for claiming an association (Khoury and Wacholder, 2009; Wacholder et al., 2004). In a review of 2002, Hirschhorn et al.
