Theory and Some Pitfalls of Genetic Association Analysis

From the document: Genetics of Restless Legs Syndrome (pages 40-44)

1 Introduction

1.11 Theory and Some Pitfalls of Genetic Association Analysis

A naïve genetic association study can easily lead to false positive and false negative results. Thus, very often, association analyses have to be modified and corrected according to the aims and the setup of a study.

1.11.1 Rare and Common Variants in Complex Traits

In contrast to Mendelian traits, a complex trait can arise from the summed effects of many genes [300]. Two hypotheses try to explain complex diseases [19]: the “common disease – common variant” (CDCV) hypothesis [17] and the “common disease – rare variant” (CDRV) hypothesis [19, 301]. They argue that common complex diseases might be explained by genetic variants with either high allele frequencies [17] or low allele frequencies [301], respectively. In particular, the CDRV hypothesis assumes that complex diseases might be mainly caused by loci that show a rather high rate of mutation into susceptibility alleles with mildly deleterious effects and large allelic heterogeneity; as a consequence, the susceptibility alleles at such a locus might be quite common in sum [301].

Association studies were shown to be superior to linkage studies for mapping disease susceptibility loci in complex diseases [302]. Many complex diseases were tested successfully for associations with common genetic variants [303, 304]. However, most of these studies could explain only a small proportion of the diseases’ heritability with common genetic variants, which raised the problem of the “missing heritability” [305]. Interestingly, genetic studies on lipid levels showed an overlap between the genes detected by common variant association mapping of the complex trait of lipid levels and the genes involved in Mendelian dyslipidemia [306-308]. So it is possible that a complex disease or trait could be explained by both common variants of low effect and rarer variants of moderate effect [309]. This has since been confirmed [310]. At one extreme, very rare variants could cause a Mendelian form of the disease when the variant’s effect size is high [309].

1.11.2 Bias in Association Analysis

A genetic association between disease status and a genetic marker can be tested using the χ² test (with 2 degrees of freedom), which compares the distributions of categorical counts between cases and controls [311]. For biallelic markers, the categories can be, for example, the number of minor alleles in a genotype (0, 1, or 2).
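The test described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `genotype_chi2` and the example counts are hypothetical, and the sketch assumes that every genotype category contains at least one sample. For 2 degrees of freedom, the p value of the χ² statistic has the closed form exp(-χ²/2), so no statistics library is needed.

```python
import math

def genotype_chi2(cases, controls):
    """Pearson chi-square test on a 2x3 genotype table (2 degrees of freedom).

    cases, controls: counts of samples with 0, 1, and 2 minor alleles.
    Assumes every genotype column has a non-zero total.
    """
    table = [list(cases), list(controls)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if genotype and disease status were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 2 df.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

# Hypothetical counts, for illustration only.
chi2, p_value = genotype_chi2(cases=(60, 30, 10), controls=(40, 40, 20))
print(round(chi2, 3), round(p_value, 4))
```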

These genetic association studies face the problem of biases between cases and controls (or biases in the samples that contribute to a quantitative trait association study), which can introduce false positive results [312]. The main biases are population stratification/structure [312], cryptic relatedness [313], and technical genotyping artifacts [312]. Population stratification is a mismatch of cases and controls due to different proportions of ethnic groups or different fractions of ancestry from distinct ancestral populations [314]. Cryptic relatedness refers to the likely higher relatedness within the case group than within the control group, because the cases share a genetic disease [313].

Many methods were developed to address these problems in genome-wide association studies. One method is genomic control (GC) for case-control setups, which proposed λGC as a measure of population heterogeneity (cryptic relatedness, different proportions of subpopulations in cases and controls) [313]. The value of λGC can be estimated as the median of the χ² statistics divided by 0.675² [313]. (An approach based on the mean of the statistics was also proposed [315].) In particular, λGC is approximately the variance of the distribution of the absolute χ statistics, which is approximately normal under the null hypothesis [313]. This variance is greater than 1 in the case of population heterogeneity [313]. A couple of assumptions are made to estimate λGC from SNP data:

The markers should be biallelic [313], independent of each other [313], and possibly independent of the group of markers that are to be tested for association [314]. The resulting λGC can be used to correct the observed χ² statistics and thus to correct for the detected population heterogeneity [313]. However, if samples are not independent, e.g. in the case of relatedness, then the correction leads to a loss of power [313]. Of note, λGC can be locus-specific when loci differ in mutation rate or selection [313]. The λGC can also be rescaled to be comparable between studies of different sample sizes [316].
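The median-based estimate of λGC and the correction of the test statistics described above can be sketched as follows. The function name `genomic_control` and the example statistics are hypothetical. Dividing each χ² statistic by λGC is assumed here as the form of the correction, and the divisor 0.675² is taken from the text.

```python
import statistics

def genomic_control(chi2_stats, median_divisor=0.675 ** 2):
    """Estimate lambda_GC from 1-df chi-square statistics and rescale them.

    lambda_GC = median(chi2) / 0.675^2; the corrected statistics are the
    observed statistics divided by lambda_GC (assumed form of the correction).
    """
    lam = statistics.median(chi2_stats) / median_divisor
    corrected = [x / lam for x in chi2_stats]
    return lam, corrected

# Hypothetical, inflated test statistics for illustration only.
observed = [0.1, 0.3, 0.91125, 2.0, 5.0]
lam, corrected = genomic_control(observed)
print(round(lam, 3))  # a value above 1 suggests population heterogeneity
```

After the correction, the median of the rescaled statistics equals 0.675² by construction.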

As an alternative, the association analysis might be stratified by the population structure (if detected) [317-320].

A GWAS genotype dataset is a multidimensional dataset with each marker being a dimension. Principal component analysis (PCA) tries to transform these dimensions of interrelated variables into a set of far fewer dimensions that retain most of the variation of the dataset [321]. These principal components can be used to explain a large proportion (ideally most) of the genotypic variance of the GWAS dataset. As the genotypes can reflect population stratification, the principal components can capture (parts of) the stratification and can be used as covariates in an association analysis to correct for (parts of) the stratification [322]. However, the principal components are calculated from all sources of genotypic variance in the dataset and thus might also reflect technical biases [312], familial relationships [322] or LD structures [323]. Of note, PCA is applicable when the sample size is below the number of markers, and it can also be applied to multiallelic markers [322]. But it is not suitable for correcting for population structure in the presence of closely related individuals [322]. A similar approach is multidimensional scaling (MDS). In principle, it transforms a (genetic) similarity (or dissimilarity) matrix of the individuals into a coordinate system in which each individual has a set of coordinates and the distances between the individuals reflect the distances from the similarity matrix [321].
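As a rough illustration of how a leading principal component separates two groups of individuals, the following sketch applies power iteration to the individual-by-individual similarity matrix of centered genotypes. This is an assumption-laden toy: the function name `first_principal_component`, the fixed start vector, and the genotype values are hypothetical, and real GWAS software uses far more efficient and carefully normalized procedures.

```python
import math

def first_principal_component(genotypes, iterations=200):
    """Return a unit-length vector proportional to the PC1 score of each individual.

    genotypes: list of individuals, each a list of minor-allele counts (0, 1, 2).
    Assumes the genotypes are not identical across all individuals.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker (column) at its mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n similarity matrix between individuals (usable when n < m).
    gram = [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]
    # Power iteration converges to the leading eigenvector of the matrix.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical genotypes: three individuals from each of two subpopulations.
genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],
]
pc1 = first_principal_component(genotypes)
print([round(x, 2) for x in pc1])
```

In this toy dataset, the first three individuals receive scores of one sign and the last three scores of the opposite sign. Such scores could then enter an association model as a covariate.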

Some tests do not need these corrections. For example, the transmission test for linkage disequilibrium (TDT) is a family-based test that tests for linkage and association, and it is robust to population structure [96]. However, tests like the TDT have disadvantages with respect to their statistical power [324].

Recently, methods were developed to use mixed models with case-control data (e.g. GMMAT [48]). The principles were established in the field of animal breeding research [325]. In mixed models, fixed and random effects are implemented in one statistical model, where the random effects can be modeled from genetic markers [326]. The inclusion of random effects was shown to correct for population structure as well as cryptic and familial relatedness [326], and it is now computationally feasible [327].

1.11.3 Hardy Weinberg Equilibrium

The ratio of genotypes remains constant for a locus without selection in a large population with random mating [328] (Hardy-Weinberg equilibrium, HWE). Thus the transmission of any allele is equally probable. In that case, the distribution of genotypes can be inferred from the allele frequencies [328]. If the inferred (expected) genotype counts differ from the observed genotype counts, then the assumption that the transmission of the alleles is equally probable does not hold. Such a deviation from HWE might hint at selection on an allele, e.g. due to a disease association [329], or at a genotyping artifact [330]. Thus, HWE should be carefully examined in genetic association datasets.
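The comparison of observed and expected genotype counts can be sketched as a simple χ² goodness-of-fit test. The function name `hwe_chi2` and the counts are hypothetical, and the sketch assumes a biallelic, polymorphic marker with large counts; an exact test would be a common alternative for small samples. With three genotype classes and one estimated allele frequency, 1 degree of freedom remains.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df).

    Assumes a biallelic marker at which both alleles are observed.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    # Expected genotype counts inferred from the allele frequencies.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, chi2, p_value

# Hypothetical counts: a marker in HWE and one with a heterozygote deficit.
print(hwe_chi2(25, 50, 25)[1:])
print(hwe_chi2(40, 20, 40)[1:])
```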

1.11.4 Genetic Association Tests

In a traditional case-control study, the samples can be stratified by genotype and disease status in a 2x3 contingency table [50]. The proportion of cases differs between the genotype categories in the case of a real association (alternative hypothesis, H1), or it is constant in the case of no association (null hypothesis, H0) [50]. The proportions are genotype-specific disease penetrances [50]. Their ratios are specific for a genetic mode of inheritance and are termed genotype relative risks (GRR) [50]. For example, the denominator can be the penetrance of the homozygous AA genotype category, and GRRs are obtained for the two other genotype categories AB and BB: γAB and γBB, respectively [50]. The two GRRs have the following relations in the different genetic modes of inheritance: in a dominant model γAB = γBB and γAB > 1; in a recessive model γAB < γBB and γAB = 1; in an additive model 2γAB - 1 = γBB (which can be inferred from a linear increase of the GRR) and γBB > 1; in a multiplicative (codominant) model γAB² = γBB and γBB > 1; in an overdominant model γAB > γBB and γBB = 1 [50].
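The relations between the two GRRs can be made concrete with a small helper that returns a pair (γAB, γBB) for each mode of inheritance. The function name `genotype_relative_risks` and the parameterization by a single risk value `gamma` > 1 are assumptions made for illustration; they are one convenient way to satisfy the relations listed above, not the only one.

```python
def genotype_relative_risks(model, gamma):
    """Return (gamma_AB, gamma_BB) relative to genotype AA for a risk gamma > 1."""
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if model == "dominant":        # gamma_AB = gamma_BB > 1
        return gamma, gamma
    if model == "recessive":       # gamma_AB = 1 < gamma_BB
        return 1.0, gamma
    if model == "additive":        # 2 * gamma_AB - 1 = gamma_BB
        return gamma, 2.0 * gamma - 1.0
    if model == "multiplicative":  # gamma_AB ** 2 = gamma_BB
        return gamma, gamma ** 2
    if model == "overdominant":    # gamma_AB > gamma_BB = 1
        return gamma, 1.0
    raise ValueError("unknown model: " + model)

for model in ("dominant", "recessive", "additive", "multiplicative", "overdominant"):
    print(model, genotype_relative_risks(model, 1.5))
```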

A genotype contingency table may be used directly for a χ² test or Fisher's exact test, and it has reasonable power for different genetic models [311]. However, it is more powerful to count and test the alleles of cases and controls (instead of the individual genotypes) in a 2x2 contingency table, which assumes independent alleles (and thus HWE) [331]. As an alternative to the χ² test, scores can be assigned to the genotype categories, and then a Cochran-Armitage trend test (CATT) can be performed [331-333]. This test has superior power compared to the χ² test, even for different genetic models (additive, multiplicative, dominant, recessive) [334]. The additive model is asymptotically equivalent to the multiplicative model and does not suffer from a substantial loss of power if another genetic model is the true underlying one [334]. But if the inheritance is overdominant, then the Armitage trend test will fail [311]. Furthermore, the CATT is robust to departures from HWE [334].
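A minimal sketch of the trend test with additive scores (0, 1, 2) is given below. The function name `cochran_armitage_trend` and the counts are hypothetical; the statistic is written in its usual closed form for a 2xk table and is referred to a χ² distribution with 1 degree of freedom. The second example uses a made-up overdominant pattern in which the heterozygotes are enriched among the cases, which shows how the trend statistic can vanish in that situation.

```python
import math

def cochran_armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2xk table (1 degree of freedom)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = sum(controls)
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    # chi2 = N (N sum(x r) - R sum(x n))^2 / (R S (N sum(x^2 n) - sum(x n)^2))
    numerator = n * (n * sum_xr - n_cases * sum_xn) ** 2
    denominator = n_cases * n_controls * (n * sum_x2n - sum_xn ** 2)
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts with a roughly linear trend across genotypes.
print(cochran_armitage_trend((60, 30, 10), (40, 40, 20)))
# Hypothetical overdominant pattern: the trend statistic is zero.
print(cochran_armitage_trend((20, 60, 20), (40, 20, 40)))
```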

Logistic regression is another approach to the genetic association analysis of binary traits such as RLS. In a logistic regression, the binary phenotypes are transformed using the logit function, which calculates the logarithm of the odds of being diseased for each genotype category [311]. These values are modeled by a linear model with regression parameters for each genotype category (general form) [311]. After fitting the model, a likelihood ratio test with 2 degrees of freedom is performed of the full, alternative model, which is determined by the regression parameters, against the reduced, null model, which has equal regression parameters for all genotype categories; this test is equivalent to the Pearson χ² test [311]. If a genetic model is specified, then the power to detect an association due to this genetic model can be increased, e.g. by specifying an additive model, which can be tested, for example, with a score test with 1 degree of freedom that is equivalent to the CATT [50, 311]. This approach also requires fitting only the null model and thus saves computation time [48]. If the alternative model is fitted as well, which has to be done for each genetic locus separately, then a Wald test can be performed [48], which can approximate the score test [50]. Logistic regression has the advantage that it can incorporate covariates and interactions into the model [311]. Random effects can also be incorporated, which turns the regression framework into a generalized linear mixed model [48]. Of note, the significance of an association can also be tested approximately using a linear mixed model instead of a generalized linear mixed model [327].
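The likelihood ratio test of the general genotype model can be sketched without an iterative fitting routine, under the assumption that there are no covariates: in that case the maximum-likelihood fit of the full model simply reproduces the observed proportion of cases in each genotype category, and the null model reproduces the overall proportion of cases. The function names and counts below are hypothetical, and every genotype category is assumed to be non-empty.

```python
import math

def _binomial_loglik(cases, totals, probs):
    """Log-likelihood of case counts given per-category case probabilities."""
    loglik = 0.0
    for r, n, p in zip(cases, totals, probs):
        if r > 0:
            loglik += r * math.log(p)
        if n - r > 0:
            loglik += (n - r) * math.log(1.0 - p)
    return loglik

def genotype_lrt(cases, controls):
    """Likelihood ratio test, general genotype model vs. null model (2 df)."""
    totals = [r + s for r, s in zip(cases, controls)]
    # Full model: one log-odds parameter per genotype category.
    full_probs = [r / n for r, n in zip(cases, totals)]
    # Null model: equal log-odds for all genotype categories.
    null_prob = sum(cases) / sum(totals)
    ll_full = _binomial_loglik(cases, totals, full_probs)
    ll_null = _binomial_loglik(cases, totals, [null_prob] * len(totals))
    lrt = 2.0 * (ll_full - ll_null)
    p_value = math.exp(-lrt / 2.0)  # chi-square survival function, 2 df
    return lrt, p_value

# Hypothetical counts, for illustration only.
print(genotype_lrt((60, 30, 10), (40, 40, 20)))
```

For these counts the statistic is close to, but not identical with, the Pearson χ² statistic of the same table, which is what one would expect from two tests that agree only for large samples.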

In a genome-wide association study (GWAS), many SNPs are tested against the same phenotype. Each test has a prespecified alpha level, which is the probability of a false positive association (type-1 error, i.e. a decision for the alternative hypothesis although the null hypothesis is true) [311]. The alpha level is often set at 5% [311]. However, if many tests of the same family are performed at the alpha level of 5%, then the probability of observing at least one false positive is higher than 5% [311]. Thus a new alpha level has to be defined, which can be approximated by the Bonferroni correction of dividing the alpha level by the number of (independent) tests [311]. However, the method is conservative and can be replaced by empirically estimating the type-1 error rate (at the cost of computational time) [311]. Other alternatives are adaptive approaches that sort the p values and reject the null hypotheses in a step-down procedure with less stringent criteria based on the number of already rejected null hypotheses [335]. Another method is the exact correction for multiple testing using Sidak’s method, which is the basis of the Bonferroni approximation [336]. In a GWAS, the genome-wide significance threshold should be set below 1E-07 to keep the alpha level at 0.05, as found in a study focusing on populations of European origin [337]; another study suggested a threshold of 7.2E-08 [338]. In contrast to this correction of the family-wise error rate (FWER), the false discovery rate (FDR) approach enriches a list of associations with true positives and accepts a proportion of false positives [339].
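The thresholds discussed above can be compared numerically. The sketch below computes the Bonferroni and Sidak thresholds for a hypothetical number of one million independent tests and adds a Holm-style step-down procedure as one well-known example of the adaptive approaches; whether this is the exact procedure meant by reference [335] is an assumption. All function names and p values are illustrative.

```python
import math

def bonferroni_threshold(alpha, m):
    """Per-test alpha level: alpha divided by the number of tests."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Exact per-test level 1 - (1 - alpha)**(1/m) for independent tests."""
    # log1p/expm1 keep precision when m is very large.
    return -math.expm1(math.log1p(-alpha) / m)

def holm_reject(p_values, alpha=0.05):
    """Holm-style step-down: compare sorted p values to alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p value that fails its threshold
    return reject

m = 1_000_000  # hypothetical number of independent tests
print(bonferroni_threshold(0.05, m), sidak_threshold(0.05, m))
print(holm_reject([0.005, 0.015, 0.03, 0.04]))
```

In the small example, a plain Bonferroni threshold of 0.05/4 would reject only the smallest p value, whereas the step-down procedure also rejects the second one.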

1.11.5 Genetic Association Tests for Rare Variants

A typical individual genome has 4.1 million to 5 million variants, and almost all of them are SNPs (single nucleotide polymorphisms) [340]. Most of these variants (90%) are rare, so there are only a few observations of the alternative alleles in the population [340]. In contrast, only a small proportion of SNPs (10%) are common, i.e. observed in more than 5% of haploid genomes (MAF > 5%, MAF = minor allele frequency) [340].

But association testing is underpowered for single rare variants [341] and might also be conservative, e.g. when the Wald test is applied [342]. Thus a variety of tests were designed to combine information from multiple variants of interest into one association test. They can be grouped into 5 classes: burden tests, adaptive burden tests, variance-component tests, omnibus tests and EC tests [343].

The burden test can be performed in a simple way: the minor alleles of interest are counted for each individual and regressed against the phenotype (burden of rare variants, BRV) [12]. Other tests in the same spirit are, e.g., CAST (cohort allelic sum test) [16], CMC (combined multivariate and collapsing) [21] and WST (weighted-sum test) [104]. All burden-like tests have high power when all variants are causal and their effects have the same direction [344].
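The simple burden approach can be sketched as follows: the minor alleles are summed per individual, and the resulting burden is tested against a binary phenotype. The 1-df score test of a logistic regression without covariates is used here as one possible choice of test, which is an assumption rather than the method of any cited reference. The function name `burden_score_test` and the tiny dataset are hypothetical; with so few samples the asymptotic p value is only illustrative.

```python
import math

def burden_score_test(genotypes, phenotypes):
    """Sum rare minor alleles per individual and test the burden (1 df).

    genotypes: per-individual lists of minor-allele counts at the variants
    of interest; phenotypes: 0 (control) or 1 (case).
    """
    burden = [sum(g) for g in genotypes]
    n = len(burden)
    y_bar = sum(phenotypes) / n
    b_bar = sum(burden) / n
    # Score statistic of a logistic model with the burden as only predictor.
    score = sum(b * (y - y_bar) for b, y in zip(burden, phenotypes))
    variance = y_bar * (1.0 - y_bar) * sum((b - b_bar) ** 2 for b in burden)
    chi2 = score ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return burden, chi2, p_value

# Hypothetical data: four cases followed by four controls, three rare variants.
genotypes = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1],
             [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
print(burden_score_test(genotypes, phenotypes))
```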

Variance-component gene-level tests can be much more powerful than burden-like tests when the variants have different directions of effect and only a subset of the markers is causal [92]. A widely used test is SKAT (sequence kernel association test) [92]. SKAT performs a score test for each variant and sums the squared scores over a region of interest [92]. Thus SKAT is computationally efficient, because the null model (including the covariates) has to be fitted only once [92]. Other tests are special cases of SKAT [92], e.g. the SSU [95] or C-alpha [344].
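The difference between summing variant scores before squaring (burden-like) and squaring them before summing (SKAT-like) can be illustrated with a toy example in which one variant is enriched in cases and another in controls. The sketch computes only the two statistics for a null model without covariates and with flat weights; the evaluation of their significance is omitted, and the function names and data are hypothetical.

```python
def variant_scores(genotypes, phenotypes):
    """Per-variant score: sum over individuals of genotype * (y - mean(y))."""
    n = len(phenotypes)
    y_bar = sum(phenotypes) / n
    n_variants = len(genotypes[0])
    return [
        sum(g[j] * (y - y_bar) for g, y in zip(genotypes, phenotypes))
        for j in range(n_variants)
    ]

def burden_and_skat_statistics(genotypes, phenotypes, weights=None):
    """Return (burden-like statistic, SKAT-like Q) for a region of variants."""
    scores = variant_scores(genotypes, phenotypes)
    if weights is None:
        weights = [1.0] * len(scores)  # flat weights, an assumption
    # Burden-like: weighted scores are summed first, then squared.
    burden_stat = sum(w * s for w, s in zip(weights, scores)) ** 2
    # SKAT-like: weighted scores are squared first, then summed.
    skat_q = sum((w * s) ** 2 for w, s in zip(weights, scores))
    return burden_stat, skat_q

# Hypothetical data: variant 1 occurs in cases only, variant 2 in controls only.
genotypes = [[1, 0], [1, 0], [0, 0], [0, 0],
             [0, 1], [0, 1], [0, 0], [0, 0]]
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
print(burden_and_skat_statistics(genotypes, phenotypes))
```

In this example the opposite effects cancel in the burden-like statistic, while the SKAT-like statistic remains positive.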

Adaptive burden tests are two-step tests that select or weight variants based on initial tests prior to the actual association test [343], e.g. aSum (adaptive sum test) [8], EREC (estimated regression coefficient) [33], VT (variable threshold test) [101] and KBAC (kernel-based adaptive cluster) [69].

Most of these tests calculate empirical p values and need considerable computational time [343]. As an advantage, their power can be comparable to that of variance-component and omnibus tests [345].

Variance-component tests and burden tests can be combined in an omnibus test, e.g. by the Fisher method [346]. This approach needs an empirical evaluation of the significance and thus computational time [346]. But it can be superior in terms of power to another omnibus test, SKAT-O [346]. The SKAT-O test sums the test statistics from its burden-like test and its SKAT, and it does not require permutations [93, 347].
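The Fisher combination and its empirical evaluation can be sketched as follows. The combined statistic is -2 times the sum of the logarithms of the p values. Because the burden and variance-component p values come from the same data, the statistic is compared here with statistics from permuted datasets instead of a χ² reference distribution. The function names, the p values, and the list of permuted statistics are hypothetical, and the (k + 1) / (B + 1) form of the empirical p value is a common convention assumed here.

```python
import math

def fisher_statistic(p_values):
    """Fisher combination: -2 * sum of log p values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def empirical_p_value(observed, permuted):
    """Share of permuted statistics at least as large as the observed one."""
    exceed = sum(1 for s in permuted if s >= observed)
    return (exceed + 1) / (len(permuted) + 1)

# Hypothetical p values of a burden test and a variance-component test.
observed = fisher_statistic([0.01, 0.04])
# Hypothetical combined statistics from nine phenotype permutations.
permuted = [1.2, 3.4, 16.0, 0.5, 2.2, 7.9, 4.1, 0.9, 5.5]
print(round(observed, 3), empirical_p_value(observed, permuted))
```

In practice, many more permutations would be needed to resolve small p values, which is the computational cost mentioned above.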

In principle, the EC test sums the exponentials of quadratic variant scores over a region of interest; it is powerful when only a very small proportion of the variants is causal, but it requires permutations to evaluate the significance [31].
