Practical issues in pathway based GWAS analysis

Although pathway based methods enjoy increasing popularity in the area of genome-wide association studies, their application is still in its infancy and presents several challenges. Different gene set sizes and gene lengths, as well as the strong correlation of SNPs due to the LD patterns and the presence of overlapping genes may lead to bias. Methodological issues start with assigning the single SNPs to genes and genes to gene sets as well as summarizing marker information on a gene level. Gene set analysis methods also include the construction of the test statistics and finally the assessment of statistical significance to whole gene set by considering the potential sources of bias (Wanget al.,2011). Moreover, when prioritization methods are used a really important issue is how to code the covariates for the gene set information.

Except for the last point mentioned, the listed issues are not the central focus of our work, but we had to deal with them in our applications and make decisions how to solve them. Therefore, we will take a closer look at these different critical steps in pathway based GWAS analyses in the following. We will process the different issues one by one, present typical possibilities and illustrate our decision.

So far, these logistical aspects that accompany the pathway based methods demand subjective choices since more intensive research on these critical points is still necessary to come to an informed decision (Tintle et al., 2009a). Some of the challenges have been explored in greater detail in other areas before and we can take advantage of the lessons learned and use the available information. However, others are still inadequately investigated. Since the coding of the pathway information given as covariates for gene or SNP prioritization methods is a more central point of our work, it will be part of our applications in section 5.4. A graphical overview about the different steps in the analysis flow is given in figure5.2.

Figure 5.2: Steps in a pathway based genome-wide association study analysis 5.3.1 SNP to gene assignment

In comparison to gene set analysis in gene expression studies, one challenge of pathway based methods in GWAS is that we do not have one test statistic per gene but several test results for single SNPs within and close to a gene (Penget al., 2010). Hence, as a first step in a pathway based analysis, SNPs have to be assigned to genes to relate the data to the pathway information. This holds in particular for our applications.

Several data bases, such as the National Center for Biotechnology Information (NCBI) (2012) data base, the USCS genome browser (Kentet al.,2002) or the Ensembl genome browser (Flicek et al., 2012), provide gene annotation information. This information can include start and end position for the genes, SNP positions as well as direct SNP to gene assignments. The information can be obtained by downloading correspond-ing annotation files or extractcorrespond-ing information from the data base e.g. by R (Melville, 2011) or programming languages for data management (SQL). In addition, Affymetrix (http://www.affymetrix.com) and Illumina (http://www.illumina.com) directly provide annotation files for their SNP arrays, including SNP position and the nearest gene. Since different names can occur for the same gene, caution is necessary to ensure a consistent notation.

When assigning SNPs to genes it is possible to restrict to those markers within gene coding regions. SNPs that can be mapped to the coding region of several genes are as-signed to all of them. However, restricting to the coding part does not cover regulatory regions or consider LD. Therefore, the usual strategy to assign SNPs to genes is map SNPs not directly located within a coding sequence to the nearest gene within a certain distance window of +/- X kb down- and upstream. Thereby, the core gene part as well as the boundary regions containing regulatory units are covered (Tintle et al., 2009a;

Wang et al., 2010, 2011). The reasonable window size is still unclear and the implica-tions of different sizes are still unknown (Tintle et al.,2009a). Different distances were proposed such as 500kb (Wang et al., 2007, 2010), 200kb (Perry et al., 2009), 100kb (Wang et al., 2010), 20kb (Jia et al., 2010), 10kb (Wang et al., 2010) and 5kb (Chen et al.,2010).

Assigning the SNPs to genes still poses problems for the analysis. So far, there is no exact definition of a gene and the corresponding gene positions vary between different databases and over time. Assigning SNPs to more than one gene induces correlation be-tween them and may result in bias. Furthermore, SNPs may be located in a regulatory region of a gene although this is not the nearest one, resulting in a wrong assignment.

In gene set analysis methods, all SNPs that are not assigned to any gene because they are not close to any one are excluded from further analysis. This can be a severe loss of information. This especially holds when a small window size is used since it may affect several hundreds of thousands of SNPs. A wide window size on the contrary allows the assignment of numerous irrelevant SNPs without any effect to contribute to a gene set and may dilute its potential single strengths (Wang et al., 2011).

To improve gene set analyses in GWAS one may use more sophisticated strategies for SNP assignment. These may for example consider information about exact LD patterns around a gene (Bushet al.,2009; Honget al.,2009) or regulatory units from expression studies (Wang et al., 2011). Veyrieras (Veyrieras et al., 2008) estimated that most ge-netic variants that influence gene expression are located within 20 KB around the gene (Wanget al., 2011).

For our application presented in section 5.4, we assigned SNPs to genes using the cor-responding SNP annotation file available from Illumina upon request. Each SNP was assigned to the nearest gene within +/- 500kb. This relatively large distance guaranteed to miss no regulatory or LD region of the gene, even though that was accompanied by assigning many non-relevant SNPs to the gene as well. By using 500kb the number of SNPs not assigned to a gene and hence not considered at all in the gene set analysis is reduced.

5.3.2 Gene-based test statistic

Since pathway based methods in GWAS are often performed based on genes rather than SNPs, gene level summary measures of association have to be obtained from the sets of underlying SNPs. Although the HBP in our application parts was based on the SNP level, we performed the GSEA on the gene level so that this issue represents a practical aspect of our work. The best strategy for the reduction of the SNP-level information within each gene is still disputed and represents a whole area of research not pursued here.

In the following assume that we have a geneG and N_M_(G) assigned SNPs with ordered p-values p_M₍₁₎_(G) ≤ · · · ≤ p_M₍

NM(G))(G) (ascending) and corresponding test statistics

t_M₍₁₎_(G) ≥ · · · ≥ t_M₍

NM(G))(G) (descending) from a single SNP analysis. A very simple

approach to summarize the single SNP signals at the gene level is to represent each gene by its most significant SNP (maximum SNP statistic, Sidak’s combination test, Peng et al. (2010)) p_G(Sidak) = p_M₍₁₎_(G) and t_G(Sidak) = t_M₍₁₎_(G). According to Simes’s (1986) combination method we could also chose p_G(Simes) = min_jnN_M_(G)p_M

(j)(G)

(Peng et al.,2010). Simes procedure was proposed as a more powerful alternative to the Bon-ferroni correction for multiple testing. For J hypotheses H₍₁₎, . . . , H_(J) with p-values p₍₁₎ ≤ p₍₂₎ ≤ · · · ≤ p_(J), H_(j), j=1,. . . ,J, is rejected if p_(j)(Simes) = ^{J p}_j^(j) ≤ α. This method is simple to apply and in particular advantageous over Bonferroni given highly correlated test statistics. The combination method uses the minimum of these Simes corrected p-values p_(j)(Simes), j=1,. . . ,J, for the joint test of the whole set of hypotheses (Simes,1986).

However, for both gene summary methods the loss of information is huge, they do not consider the correlation between the SNPs due to strong LD within a gene and are

susceptible to genotyping errors. Several weakly informative markers within one gene will not be detected (Sohns et al., 2009). Nevertheless, the maximum SNP statistic is often used due to its simplicity.

Instead of picking the result of only one particular SNP to represent the gene, alterna-tively the results of all or a subset of most significant SNPs assigned to a gene could be combined in one statistic (Chen et al., 2010; Dudbridge and Koeleman, 2003; Hoh et al.,2001;Yu et al., 2009; Zaykin et al., 2002). This may be done by Fisher’s combi-nation method (equation5.2) or other methods known for meta-analysis. However, the assumption of independence between the different p-values is not fulfilled. Depending on the number of SNPs involved, the resulting gene level statistics are distributed with different degrees of freedom. Thus they are not directly comparable to each other. An alternative may be a joint association test such as a multiple regression.

In general, the different gene size and hence different number of SNPs assigned per gene poses a problem and may result in a potential bias. Larger genes are more likely to have larger test statistics and more significant p-values, leading to favor large genes and discriminate small genes (Tintle et al., 2009a). This again leads to privilege gene sets with many large genes rather than gene sets primarily involving small genes. Therefore, an adjustment of the gene level measures is recommended, e.g. captured by permutation methods that are discussed in section5.3.4.

In a comparison ofBallard et al.(2010) among seven multi-marker association tests, in-cluding the maximum SNP statistic, principal component regression (Gaudermanet al., 2007; Wang and Abbott, 2008) demonstrates to be the most powerful (Wang et al., 2011). Alternatively, the analysis can be performed on SNP level directly using SNP sets instead of gene sets.

In our application given in section 5.4, the maximum test statistic among all SNPs of each gene was chosen to represent the gene test statistic for the GSEA since it is the method most commonly used so far and simple to apply. Our HBP was based on the SNP level.

5.3.3 Pathway information

The most prominent component for pathway based analysis is the biological informa-tion used. The quality of gene set analysis results is highly related to the quality of the underlying gene set annotations.

Information about biological pathways can be found in different publicly available databases. Since these sources provide different knowledge about gene sets, the choice of gene set information is a challenge and has to be made arbitrarily. So far, there is no consistent information and only little advice how to correctly choose the most rele-vant accurate prior information (Tintle et al., 2009a). Despite the high availability of gene set information, the pathway description is not yet sufficient and complete due to our incomplete knowledge about the human genes and their relationships (Sohnset al., 2009; Wang et al., 2011). Various human genes are not well understood or uncharac-terized and cannot be mapped to pathways (Wang et al., 2010). Another problem is that genes in general function in multiple ways and therefore appear several times in different pathways. This overlap of gene sets results in redundant information among the sets (Wang et al.,2011), leading to problems in the pathway analysis.

For our analysis presented in this chapter, we used a file provided byWanget al. (2007) in the GenGen-package combining different Web resources (Wang, 2008). This collec-tion involves informacollec-tion from 2076 pathways and gene sets from Biocarta, KEGG and the Gene Ontology (January 14^th 2008). Some more information about the gene set databases mentioned in this thesis can be found in the appendixA.3.

5.3.4 Significance Assessment

One important issue in gene score resampling (GSR) methods is the calculation of the empirical distribution of the score under the null hypothesis. The empirical distribution is obtained by recalculating the gene set scores multiple times by a permutation procedure. Disease labels as well as SNP or gene-level statistics can be used as permutation units (Tintle et al., 2009a).

When a phenotype permutation (sample randomization) test procedure is used, the case-control status is randomly shuffled keeping the total number of cases and controls fixed (Sohns et al., 2009; Tintle et al., 2009a). SNP statistics, gene-level statistics and gene set score are recalculated. In the SNP and gene permutation method, the SNP or gene-level statistics are shuffled across the genome and gene set scores are calculated based on these reallocated statistics (SNP or gene randomization). Hence, using gene randomization, random gene sets of the same size are generated for comparison. A comparison of advantages and disadvantages of the different permutation methods is given the appendix B.1.

For each gene set, an empirical distribution of the corresponding score is obtained by the permutations to represent the null distribution. The background distribution established by phenotype permutations represents the null hypothesis underlying self-contained GSA methods, that no SNP and hence gene is associated with the disease. In practice however, some susceptibility SNPs will occur (Wang et al., 2010).

SNP and gene randomization permutation represent the compatible null hypothesis of no enrichment as in competitive testing (Ballard et al., 2009; Tintle et al., 2009a,b).

Typically, the permutation strategy is chosen ignoring the compatibility with the corresponding null hypothesis of the used gene set method (no association vs. no enrichment) (Wang et al., 2011). This may lead to some bias.

Based on the empirical distribution for a gene set, a nominal p-value can be simply calculated by the fraction of permutation scores for this set that are equal or larger than the original score. Furthermore, since in GWAS numerous gene sets are tested, permutation results may be used for a FWER or FDR method to correct for multiple testing (Curtis et al., 2005). This affects GSR as well as ORA methods.

In the following we will present an example of a phenotype permutation method to assign significance to GSEA scores and calculate the FDR and FWER in more detail.

This approach is used in our practical applications to avoid bias due to gene size and keep the correlation structures in the data.

Sample-label permutation procedure in GSEA

This sample-label permutation procedure for GSEA in the context of genome-wide as-sociation studies was proposed by Wang in 2007. Assume that we calculated the en-richment score ES(S_k) for each of N_S examined gene sets S_k, k=1,. . . ,N_S. For each

permutation b = 1, . . . B, we randomly assign the original disease labels to the sam-ples, recalculate the SNP and gene-level statistics and recomputed the enrichment score ES(S_k, b) for each gene set to obtain the null distribution of ES(S_k). The nominal p-value is given by

p_S_k_(nom) =

b=1,...,B

(ES(S_k, b)≥ES(S_k))

B (5.8)

Since gene sets of varying size are not necessarily directly comparable to each other (Wanget al.,2007), the permutations can be used to adjust for these differences between gene sets. Therefore, we can calculate the mean and variance of the permutation scores per gene set by ˆµ_S_k =PB

b=1ES(S_k, b) and ˆσ²_S

k = _B−1¹ PB

b=1(ES(S_k, b)−µˆ_S_k)² and each enrichment score is normalized by subtracting the corresponding mean and dividing by the standard deviation

N ES(S_k) = ES(Sk)−µˆS_k

σ_Sk , N ES(S_k, b) = ES(Sk, b)−µˆS_k

σ_Sk , b= 1, . . . , B.

To control the FWER, we can calculate the p-value by comparing the true NES of the gene set of interest NES(S_k) with the highest NES score over all gene sets per permutation

p_S_k_(FWER) =

b=1,...,B

( max

l=1,...,NS

(N ES(S_l, b))≥N ES(S_k))

B . (5.9)

To control the FDR, the distribution of allN ES(S_l, b) over all gene setsS_l, l= 1, . . . , N_S and permutationsb= 1, . . . , B is used to estimate a FDRq-value for a givenN ES(S_k).

This is given by

q_S_k_(FDR) =

# b= 1, ..., B l= 1, ..., N_S

(N ES(S_l, b)≥N ES(S_k)) ,

(BNS)

l=1,...,NS

(N ES(S_l)≥N ES(S_k)) N_S

. (5.10)

Both FDR and FWER were calculated in our applications.

5.4 Analysis of the NARAC data for Genetic Analysis

Im Dokument The Empirical Hierarchical Bayes Approach for Pathway Integration and Gene-Environment Interactions in Genome-Wide Association Studies (Seite 99-105)