
3.2 Genome-wide association studies (GWAS)

3.2.3 Data quality checks

SNP chips allow researchers to interrogate hundreds of thousands of SNPs across the human genome (Weale, 2010) with the goal of identifying true association signals in a sea of false positive results (Christensen and Murray, 2007). Good data quality is essential to avoid false positive results and to draw accurate conclusions from the analysis (Neale and Purcell, 2008). On the one hand, quality assurance during study conduct is necessary, ensuring a good study design, good sampling protocols, good-quality DNA and adequate protocols for DNA extraction and preparation (Weale, 2010).

On the other hand, an additional exploratory data quality control is the indispensable first step of a GWAS analysis and serves to evaluate the genotyping performance (Neale and Purcell, 2008; Thomas, 2010c). The process of getting from the chip signals to the genotype of each SNP, called genotype calling, is described in appendix A.2. The assignment of a genotype to a SNP according to the corresponding chip signal is denoted as a call. Genotyping errors (miscalls) as well as missing data (no-calls) can occur. Factors influencing the quality of genotyping are, for example, the concentration, contamination or possible degradation of the input DNA, failures or degeneration of the chip arrays, as well as differences in sample preparation (e.g. different laboratories) and plating errors (Teo, 2008; Weale, 2010). As long as wrong and missing genotypes occur at random and affect cases and controls equally, they will lead to some loss of power and bias of effect estimates, but not to an increase in the type I error (Bickeböller and Fischer, 2007; Thomas et al., 2005). However, problems occur when the genotype quality differs with the phenotype because cases and controls are not genotyped in an identical manner, e.g. on separate days, on separate plates, by separate laboratory assistants or even in different laboratories (Bickeböller and Fischer, 2007; Hirschhorn, 2005). This may lead to systematically different missing values and misclassifications in cases and controls and hence to bias and spurious associations (Thomas, 2010c). To avoid these problems, it is recommended to plate cases and controls together (same laboratory assistant on the same day under the same conditions) so that at least plate effects are evenly distributed.

Special attention is essential when controls from a predefined reference database are used (Weale, 2010). In addition, genotyping errors and missing values may not be randomly distributed among the different genotypes of a SNP but rather be over-represented in a particular genotype (Weale, 2010). By identifying SNPs of bad quality as well as individuals that do not fulfill several quality criteria, and by excluding them from the analysis, inflated type I and type II errors can be avoided (Thomas, 2010c). In addition, reducing the number of SNPs decreases the multiple testing burden and therefore leads to higher power for the remaining SNPs (Weale, 2010). Criteria for filtering out SNPs are a low call rate, i.e. a low proportion of successfully called genotypes (e.g. <95%), a low MAF (e.g. <5%) and strong deviations from HWE (in GWAS: p < 10^-7). Persons should be excluded from the analysis when their genotypes are missing for many SNPs (e.g. individual call rate <90%) or when they show an excess of heterozygous or homozygous genotypes. When the reported sex is not the same as the sex determined by the X chromosome, an incorrect alignment of genetic and phenotypic data cannot be ruled out and it is recommended to remove the person prior to the analysis. Furthermore, relatedness as well as population outliers and stratification are usually investigated in the quality control step of GWAS and can lead to further exclusions of individuals. For the interested reader, more detailed information about the different quality criteria listed above is given in appendix A.2. However, since population stratification plays an important role not only in quality control but also in the analysis of GWAS data and in this thesis, we will consider it in the following section.
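The SNP-level filters named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `snp_qc`, the genotype coding (0, 1, 2 copies of one allele, `None` for a no-call) and the use of a one-degree-of-freedom χ² goodness-of-fit test for HWE are choices made for this sketch; the default thresholds are the example values given in the text, and dedicated GWAS software often uses an exact HWE test instead.

```python
import math

def snp_qc(genotypes, min_call_rate=0.95, min_maf=0.05, hwe_alpha=1e-7):
    """Quality summary for one SNP.

    genotypes: list with one entry per individual, coded as the number of
    copies of one allele (0, 1 or 2) or None for a no-call (assumed coding).
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return {"call_rate": 0.0, "maf": None, "hwe_p": None, "keep": False}

    n = len(called)
    observed = [called.count(k) for k in (0, 1, 2)]
    freq = (observed[1] + 2 * observed[2]) / (2 * n)   # frequency of the counted allele
    maf = min(freq, 1.0 - freq)

    if maf == 0.0:
        hwe_p = 1.0                                    # monomorphic SNP: no HWE test possible
    else:
        # Genotype counts expected under Hardy-Weinberg equilibrium.
        expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Upper tail probability of a chi-square distribution with 1 df.
        hwe_p = math.erfc(math.sqrt(chi2 / 2.0))

    keep = call_rate >= min_call_rate and maf >= min_maf and hwe_p >= hwe_alpha
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p, "keep": keep}
```

In a case-control study the HWE filter is frequently applied to controls only; that choice is not made in the text above and is therefore left to the caller, who decides which individuals are passed in.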

3.2.4 Analysis of genome-wide association studies

Although GWAS analyses can build on valuable lessons learned from candidate gene association and linkage studies (Pearson and Manolio, 2008), they also brought new technological, practical and statistical challenges (Thomas, 2010c). Managing the enormous amount of data was one of the first practical issues (Neale and Purcell, 2008; Thomas, 2010c), requiring large computing capacities with respect to CPU time and storage (Ziegler et al., 2008). Sophisticated statistical and bioinformatical tools for analyzing and interpreting the data became necessary (De Bakker et al., 2005; Clayton and Leung, 2007; Falush et al., 2003; Marchini et al., 2006, 2007; Price et al., 2006; Pritchard et al., 2000; Scheet and Stephens, 2006; Stephens et al., 2001; Teo et al., 2007; Wellcome Trust Case Control Consortium (WTCCC), 2007), and skills from computer science became essential. Due to the high number of SNPs, the quality control checks as well as the subsequent association analyses need to be performed in a highly automated way.

In this section we will outline the most important steps in the analysis of a genome-wide association study. We will start with a short description of the single step analysis methods that in general form the first step in a GWA analysis, as in the lung cancer studies of chapter 7. In addition, the pathway based methods illustrated in chapter 5 are based on such results. After this, different methods to correct for the most important confounder in genetic epidemiological studies, population stratification, are explained. In our lung cancer application in chapter 7, a corresponding adjustment for a particular study is shown in detail. Furthermore, the two most common graphical representations of GWAS results are presented. Finally, we will outline how the significance of association signals in GWAS is assigned and how their validity is judged, since this is the main challenge in GWAS. The hierarchical method that we address in this thesis tries to improve GWAS at exactly this point of assessing significance. The applications presented in chapter 7 were carried out within the scope of a consortium (Amos, 2007; International Agency for Research on Cancer (IARC), 2012), with the aim of a good validation of results.

Single SNP association tests

For the association analysis, the standard first step in GWAS is to perform simple single SNP association tests (Bickeböller and Fischer, 2007). Most commonly, an additive model or a trend test is used (Pearson and Manolio, 2008). In addition, since the underlying genetic model is unknown, it has become common practice to perform tests for all three standard models (dominant, additive and recessive) and to use the maximum of the test statistics. The resulting multiplicity can be adjusted for by permutation methods (Freidlin et al., 2002) or by a conditional test that takes the correlations of the statistics into account (Ziegler et al., 2008).
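As an illustration of such a single SNP test, the following sketch computes the Cochran-Armitage trend statistic from a 2x3 table of genotype counts and takes the maximum over the three genetic models. The function names, the scores assigned to the models and the example counts are assumptions made for this sketch, not material from the cited references.

```python
import math

# Genotype scores for 0, 1 and 2 copies of the risk allele (assumed coding).
MODEL_WEIGHTS = {
    "additive": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}

def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 df) for genotype counts.

    cases, controls: counts of individuals with 0, 1 and 2 risk alleles.
    """
    n = [r + s for r, s in zip(cases, controls)]
    total = sum(n)
    n_cases = sum(cases)
    n_controls = total - n_cases
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * k for w, k in zip(weights, n))
    sum_w2n = sum(w * w * k for w, k in zip(weights, n))
    denom = n_cases * n_controls * (total * sum_w2n - sum_wn ** 2)
    if denom == 0:
        return 0.0  # no variation in genotype score or in case-control status
    return total * (total * sum_wr - n_cases * sum_wn) ** 2 / denom

def p_value_1df(chi2):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def max_statistic(cases, controls):
    """Maximum of the trend statistics over the three genetic models."""
    return max(trend_chi2(cases, controls, w) for w in MODEL_WEIGHTS.values())
```

Note that `p_value_1df` is only valid for a single, pre-specified model. The maximum over the three models does not follow a χ² distribution with one degree of freedom, which is why the permutation or conditional adjustment mentioned above is needed.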

Furthermore, since confounders play an important role in population-based association studies, taking them into account is important to avoid spurious associations. Confounders can already be considered in the study design and recruitment, e.g. by choosing homogeneous groups. Alternatively, they can be integrated in the analysis by stratification or adjustment, e.g. in a regression model. One of the main confounders in population-based association studies is population stratification, also called "confounding by ethnicity" (Ziegler et al., 2008). We will consider this phenomenon in more detail in the next section.

Population stratification

Population stratification is a population heterogeneity that arises when multiple populations or subgroups, defined by ethnicity or geographic origin, are involved in a population-based association study, where the disease prevalence differs between the subgroups and the frequencies of the genetic marker alleles and the LD patterns between the markers vary (Cavalli-Sforza et al., 1994; Dawson et al., 2002; Hirschhorn, 2005; Jorde et al., 1994; Patil et al., 2001; Phillips et al., 2003; Shifman et al., 2003; Teo, 2008; Watkins et al., 1994; Zavattari et al., 2000). When population structures remain undetected and are not accounted for in the analysis, this poses a serious problem, since the variation in disease rates across the groups together with the different allele frequencies can inflate the test statistics (Ziegler et al., 2008). This may lead to spurious associations (Palmer and Cardon, 2005). If, for example, cases tend to be over-sampled from one of these groups, all alleles more common in that group will appear to be associated with the disease. The problem of population stratification was widely debated even before the GWAS era (Cardon and Palmer, 2003; Freedman et al., 2004; Thomas and Witte, 2002; Thomas et al., 2005; Wacholder et al., 2000).

The simplest method to account for population stratification, the genomic control approach (Devlin and Roeder, 1999; Devlin et al., 2001), corrects for the stratification without identifying the sample structure. Since population stratification leads to an overdispersion of the test statistics, the degree of inflation and hence the extent of population heterogeneity can be estimated (Devlin and Roeder, 1999). To this end, χ² association test statistics are calculated for all SNPs and their median is compared with the theoretical median of the distribution expected under the null hypothesis of no association. Since the fraction of false positive results is expected to be increased, the quotient of the observed and the expected χ² median, denoted as inflation factor λ, is expected to be >1 (Devlin and Roeder, 1999). All test statistics are then corrected for the inflation by dividing them by the inflation factor, resulting in an adjusted test of association.

In a study by Nelis et al. (2009) on the genetic structure in Europe, inflation factors for the comparison of 19 samples from 16 European countries were calculated. Between Southern and Northern Germany (KORA and POPGEN cohorts) they obtained a λ of 1.08, with both cohorts being close to the European HapMap population (1.06 and 1.07, respectively). The largest genetic distance was observed between Spain and Kuusamo, located in the middle of Finland (4.21), with these having inflation factors of 1.34 and 2.89 relative to the European HapMap population. For the different HapMap populations, inflation factors of 21.56 between the African and Asian populations, a slightly smaller one between the African and European populations, 13.27 between the European and Asian populations and 1.77 between the two Asian populations were calculated. While an inflation factor of less than 1.05 is still acceptable, for higher factors a correction is recommended (Aulchenko, 2010).
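The genomic control calculation described above can be sketched as follows, assuming χ² statistics with one degree of freedom. The function name `genomic_control` is hypothetical, and bounding the correction factor at 1, so that statistics are never inflated when the estimated λ falls below 1, is a common convention assumed here rather than something stated in the text.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df: the squared 75% quantile
# of the standard normal distribution (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_control(chi2_stats):
    """Estimate the inflation factor and return corrected statistics.

    chi2_stats: 1-df chi-square association statistics, one per SNP.
    """
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Assumption of this sketch: never correct with a factor below 1.
    divisor = max(lam, 1.0)
    corrected = [x / divisor for x in chi2_stats]
    return lam, corrected
```

Because the same divisor is applied to every SNP, the ranking of the SNPs is unchanged; only the p-values derived from the corrected statistics become larger.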

Another possible strategy is to first identify the underlying population structure by determining the genetic similarity between the individual participants and then to correct for this particular structure. For this purpose, the SNP data should be pruned first so that only SNPs without strong LD among them remain. A measure typically used to express the genotypic similarity between two individuals is the kinship coefficient. The kinship coefficient is defined as the probability that an allele randomly chosen at a particular locus from one individual is identical by descent (IBD) with an allele selected at the same locus from the other individual. Two alleles are IBD when they are copies of the same ancestral allele. The kinship coefficients for all pairs of individuals are collected in the kinship matrix, which can be used as part of the model for the correlations of the outcomes in a random effects model (Yu et al., 2006). Alternatively, a principal component analysis (PCA) (Patterson et al., 2006; Price et al., 2006; Tian et al., 2008; Tiwari et al., 2008) can be performed on a distance matrix calculated as 0.5 minus the kinship matrix. The leading eigenvectors, the principal components (PC), can be extracted and describe informative "axes of ancestry". These axes can be represented graphically so that different populations can be identified.

Furthermore, to correct for the population stratification that may exist in the data, the axes can be used as covariates in the subsequent association analysis (Patterson et al., 2006; Price et al., 2006; Tian et al., 2008; Tiwari et al., 2008; Weale, 2010).
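The idea of an axis of ancestry can be made concrete with a toy sketch. Note the swap: instead of the distance matrix derived from kinship coefficients described above, the sketch uses the covariance of standardized genotypes between individuals, in the spirit of Price et al. (2006), and extracts only the first axis by power iteration. The function name, the standardization by sqrt(p(1-p)) and the toy data are assumptions; real analyses rely on dedicated software and on LD-pruned genome-wide data.

```python
import math

def first_ancestry_axis(genotypes, n_iter=100):
    """First principal axis of the individual-by-individual covariance matrix.

    genotypes: one list per individual with allele counts (0, 1, 2) per SNP,
    without missing values (assumed for simplicity).
    Returns one coordinate per individual (unit length, arbitrary sign).
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0                      # allele frequency of SNP j
        if p <= 0.0 or p >= 1.0:
            continue                        # skip monomorphic SNPs
        sd = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / sd for g in col])
    if not columns:
        raise ValueError("no polymorphic SNPs")

    # Covariance between individuals, averaged over SNPs.
    gram = [[sum(c[a] * c[b] for c in columns) / len(columns)
             for b in range(n)] for a in range(n)]

    # Power iteration from a deterministic starting vector.
    v = [float(i) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("starting vector has no component along any axis")
        v = [x / norm for x in w]
    return v

# Toy data: four SNPs differ completely between two groups of four
# individuals, a fifth SNP varies independently of group membership.
toy = [[2, 2, 2, 2, 1], [2, 2, 2, 2, 0], [2, 2, 2, 2, 1], [2, 2, 2, 2, 0],
       [0, 0, 0, 0, 1], [0, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]
axis = first_ancestry_axis(toy)
```

In this toy example the first four individuals receive coordinates of one sign and the last four of the opposite sign, so the axis separates the two groups. In an association analysis, such coordinates would enter the regression model as covariates.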

The number of PC axes to consider in the analysis can be determined by testing them for statistical significance (Weale, 2010). However, PCA detects and corrects not only for correlations due to ancestry, but for any source of correlation in the data.

Lab errors, e.g. systematic genotyping artifacts, and many high-effect causal SNPs in a case-control study can be picked up as well (Weale, 2010). To clarify the source of correlation, it is possible to include external reference populations in the analysis and to see how the study individuals cluster relative to these. This can be done with PCA, but another approach, denoted as STRUCTURE (Pritchard et al., 2000), also addresses this very issue. In STRUCTURE, the study sample is compared with reference populations and each individual is assigned to one of these populations (Pritchard et al., 2000).

Besides determining population membership, the method can also identify outliers, migrants and admixed individuals who do not clearly belong to one of the distinct populations.

Based on the knowledge obtained, a stratified analysis can be performed. When the ancestry of all individuals is already known from the reported geographic location or ethnicity, a stratified approach can be conducted as well. The genomic control approach has the disadvantage that a constant multiplicative factor is used to correct each SNP test statistic. This assumes that the existing population structure has a uniform influence across the whole genome (Teo, 2008). Hence, the method fails when the stratification affects certain SNPs more than the average (Teo, 2008). In comparison, the PCA correction is adjusted to each SNP individually, e.g. by the magnitude of SNP variation along each axis of ancestry, and hence corrects not only for false positive but also for false negative results. PCA has been shown to be more powerful than genomic control or structured association analysis. Furthermore, it is fast to implement, intuitive and appealing (Price et al., 2006), so that using PCA axes as covariates is the preferred method for handling population stratification in large genetic studies. Nevertheless, PCA and a stratified analysis have to be treated with caution when cases and controls come from different source populations. In that case the covariates could absorb all the variation between cases and controls, including true association effects.

In the study of Nelis et al. (2009) the genetic structure within Europe showed a clear correlation with the geographic location, with the first two PCs representing the genetic diversity from northwest to southeast. Steffens et al. (2006) investigated the genetic substructure in the German population and observed that only a minor degree of population substructure exists (Ziegler et al., 2008). Nevertheless, the larger the sample size of a GWAS, the more susceptible the study is to confounding from finer levels of population differences (Teo, 2008), and hence the greater the potential bias from stratification (Freedman et al., 2004; Marchini et al., 2004).

Visualization of GWAS results

Two popular graphical representations of GWAS results are the Manhattan plot and the Quantile-Quantile plot (QQ-plot). The Manhattan plot is a type of scatter plot that allows the display of a high number of data points, as they arise in genome-wide association studies. It provides a visual summary of the association test results for the examined SNPs and clearly highlights (regions of) significant markers. The plot displays the negative logarithm (−log10) of the p-values of the single SNPs on the y-axis as a function of the chromosomal location on the x-axis. For visual effect, the different chromosomes are shown as blocks of different colors. Since the strongest associations have the smallest p-values, their −log10 values will be greatest, so that SNPs with significant p-values stand out. In figure 3.1 a Manhattan plot for a meta-analysis of four different lung cancer GWAS is shown. We can see a clear peak formed by numerous neighboring SNPs on chromosome 6, as is expected for a truly associated region. Furthermore, several genome-wide significant loci close to each other show up on chromosome 15, pointing to another true positive hit. On chromosomes 5, 10 and 21 we can see some more SNPs that reach genome-wide significance. Since these are only single SNPs standing out, they rather indicate false positive results than true associations.

Figure 3.1: Manhattan plot for a meta-analysis of four different lung cancer GWAS: Central Europe lung cancer GWAS of the International Agency for Research on Cancer (IARC, Prof. Brennan), Toronto lung cancer GWAS of the University of Toronto and the Samuel Lunenfeld Research Institute (SLRI, Prof. Hung), Texas genome-wide lung cancer study conducted by the M.D. Anderson Cancer Center (MDACC, Prof. Amos, Prof. Spitz) of the University of Texas and UK lung cancer GWAS at the Institute of Cancer Research (ICR, Prof. Houlston) [with kind permission of Prof. Amos]

The Quantile-Quantile plot (figure 3.2) is a useful tool to check the quality of the data on the one hand and to assess the number and strength of the observed associations on the other hand (Pearson and Manolio, 2008). To this end, the distribution of the association test statistics across all SNPs expected under the null hypothesis of no association (x-axis) is compared to the values observed in the data (y-axis) (Pearson and Manolio, 2008). In GWAS it is assumed that the vast majority of the genotyped SNPs is not associated with the disease. Hence, their test statistics follow the null distribution and only a minor deviation from the diagonal in the QQ-plot should be observed. Only a handful of values that deviate in the upper tail of the distribution may represent SNPs with strong evidence for a true association (Pearson and Manolio, 2008). For diseases highly associated with SNPs in a heavily genotyped region, such as rheumatoid arthritis associated with the HLA region on chromosome 6p21 (Pearson and Manolio, 2008), stronger deviations can be observed. However, large deviations of the observed values indicate consistent differences between cases and controls across the whole genome. This systematic bias in the data can be due to, for example, relatedness, population stratification or genotyping artifacts (Pearson and Manolio, 2008). By filtering the data according to the different quality criteria listed in the previous section and by correcting for population stratification, this type of bias can be avoided. Other confounders, such as smoking in lung cancer, can inflate the distribution as well when they are not considered in the analysis. In figure 3.2 a QQ-plot for a lung cancer GWAS that is composed of six different central and eastern European countries is shown to demonstrate the effect of population stratification and smoking as important confounding factors. For this purpose, results from three different logistic regression models for the genetic markers are compared: including only sex as a covariate, including sex and country, and including sex, country and smoking. We can clearly see that the inflation of results is reduced by including both confounders in the analysis. This is also reflected by the corresponding inflation factors of λGC = 1.042, λGC = 1.022 and λGC = 1.013.

Figure 3.2: QQ-plot for three different regression models analyzing the Central Europe lung cancer GWAS of the International Agency for Research on Cancer (IARC, Prof. Brennan). The study comprises individuals from six different central and eastern European countries. The first regression model involved only sex as a covariate, model 2 includes the country by five dummy variables. The last regression model additionally involved smoking status (never-ever) and packyears. [with kind permission of Prof. Brennan]
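The coordinates behind both plots are simple to derive from a list of p-values. The sketch below only computes the points and leaves the drawing to any plotting library. The function names, the input format and the plotting position (i − 0.5)/m for the expected p-values are assumptions of this sketch, the QQ-plot is shown on the −log10 p-value scale, which is one common variant, and all p-values are assumed to be strictly positive.

```python
import math

def qq_points(p_values):
    """Expected versus observed -log10(p) under the null hypothesis.

    Under the null, p-values are uniform on (0, 1), so the i-th smallest of
    m p-values is compared with the plotting position (i - 0.5) / m.
    """
    m = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i + 0.5) / m), -math.log10(p))
            for i, p in enumerate(observed)]

def manhattan_points(snps):
    """Cumulative genomic position versus -log10(p).

    snps: list of (chromosome number, base-pair position, p-value) tuples.
    Chromosomes are laid out side by side, using the largest observed
    position on each chromosome as its length.
    """
    length = {}
    for chrom, pos, _ in snps:
        length[chrom] = max(length.get(chrom, 0), pos)
    offset, running = {}, 0
    for chrom in sorted(length):
        offset[chrom] = running
        running += length[chrom]
    return [(offset[chrom] + pos, -math.log10(p)) for chrom, pos, p in snps]
```

Points from `qq_points` that lie on the diagonal correspond to SNPs behaving as expected under the null hypothesis; a systematic shift above the diagonal over the whole range corresponds to the inflation discussed above.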

Assigning significance of association signals and validity

The major challenge in GWAS is not the type of test to use but the number of tests.

Performing hundreds of thousands of tests brings a high computational and statistical multiple testing burden. In addition, this can be complicated by having multiple phenotypes, by considering further modifiers to the basic analysis (Neale and Purcell, 2008) or by testing within subgroups. In a study of 500,000 SNPs with one single SNP test conducted for each of them, 25,000 false positive results are expected at a nominal significance level of 5% if none of the SNPs is truly associated. Therefore, assessing the significance of the SNPs is an important issue in GWAS. Numerous methods to address the multiple testing problem were developed long before the GWAS era. These methods can be divided into methods controlling the family-wise error rate (FWER) and methods controlling the false discovery rate (FDR).
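The arithmetic behind these numbers is shown in the short sketch below. The Bonferroni threshold is added as one well-known example of an FWER-controlling adjustment; it is an illustration chosen for this sketch and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when all null hypotheses are true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, fwer=0.05):
    """Per-test significance level that keeps the FWER at or below `fwer`."""
    return fwer / n_tests

n_snps = 500_000
false_positives = expected_false_positives(n_snps, 0.05)   # about 25,000
per_test_level = bonferroni_threshold(n_snps)               # about 1e-7
```

With 500,000 tests, a family-wise error rate of 5% thus requires single-test p-values of roughly 10^-7 or smaller.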

The FWER is the probability of making at least one false positive result and the classical methods for controlling the FWER are characterized by simple p-value adjustment
