• Keine Ergebnisse gefunden

5.4 Analysis of the NARAC data for Genetic Analysis Workshop 16

5.4.5 Results

(5.12) Within the brackets, we have the ratio of the number of duplicated elements within the lists to the total number of elements. The preceding factor is a normalization factor, so that we have an index of 0 when all lists are different and no element occurs twice (worst case), while an index of 1 indicates that all lists have exactly the same elements (best case). In total, the index describes the observed number of duplicated elements in relation to the maximal possible number of duplicated elements. Given five different lists of genes (including initial results) or four lists of gene sets, we can make pairwise comparisons, as well as compare three, four or even all five of the lists at once (#lists

= 2,3,4,5). The number of elements per list equals 100 for the genes and 20 for the gene sets. Comparing only two lists with each other, the index gives the proportion of elements in list two that do occur in list one as well. Note, when comparing more than 2 lists, the index may be larger than for the corresponding subsets of lists.

5.4.5 Results

By single SNP analysis of the Rheumatoid Arthritis (RA) data, we observed 334 SNPs with genome-wide significance. This large number of associated SNPs with very small p-values is a specialty of the data that results from the important role of the HLA region in RA development. The significant SNPs belong to 90 different genes including 81 genes of the HLA region. 153 of the 876 examined gene sets involve at least one of these 90 genes. A Manhattan plot of the initial single SNP results is given in figure5.5.

Taking a look at non-HLA genes known for an association with Rheumatoid Arthritis (table 5.3), we found only PTPN22 (rank 55), C5 (rank 78) and TRAF1 (rank 82) within the top 100 genes.

To reduce the number of pathways for the analysis to 100, we started selecting all pathways that involve the top gene, then added the ones that include the gene on rank two and so on, until we reached 100 selected pathways. We processed the top 75 genes to reach that final number. Only two of these gene sets were without genes from the HLA region. This leads to the preference towards HLA in our analyses.

I (GSEA): With the gene set enrichment method alone 47 gene sets reached a FDR<0.05 (that is nearly half of the considered 100 pathways) and still 20 of these had a FWER<0.05. These presented our top 20 gene sets used for the method comparison.

Figure 5.5: Manhattan plots of single SNP results (initial GWAS).

left plot: includes all results; right plot: zoom into the left plot to get a better impression of the non-HLA results

All gene sets had a nominal p < 0.05. To obtain 100 top genes, we started extracting the leading edge subset of the best ranked set and added the genes of the LES for the next best ranked gene set until we reached 100 genes. For this, the LES of gene set on rank 20 is not considered since the top 100 genes were already filled.

IV (HBP+GSEA): When the HBP preceded the GSEA we observed 7 gene sets with a FDR < 0.05 and 3 of these with FWER <0.05. The two best ranked pathways involved 68 different genes in their LES that were used for the top 100 gene selection.

Thus only an additional 32 LES genes from the pathway on rank position 3 were considered, although its LES even involved 67 genes.

II (HBP) and III (HBP+HBP): For the two pure hierarchical Bayes strategies the SNP ranking was changed only slightly by the external gene set information. Taking a closer look at the estimated hyperparameters, we can see that for both strategies 49 of the 100 gene sets involved as covariates in the hierarchical model had positive beta-regression-coefficients in the prior probability model of association. A positive regression coefficient represents an up-ranking of all SNPs involved in the corresponding gene set. Since each of the 100 pathways incorporated in the model involves at least one highly significant gene due to the strategy used for selection of these 100 gene sets, the high number of positive regression coefficients is not surprising. Based on the re-ranked SNP lists of strategies II (HBP) and III (HBP+HBP), a new ranking list on the gene level was obtained by using the maximum posterior probability as a representative for each gene. For comparison purpose, the top 100 genes of these lists were picked.

Comparison of most promising genes

The figures of 5.6 show the comparison of the top 100 genes of all 4 strategies with the initial analysis and each other. In the left plot, the initial ranks of the top 100 genes for each of the the different methods is plotted on the y-axis. The figure on the right displays a Venn-Diagram representing the top 100 gene overlap. As a general overlap-index for all four strategies we obtain II,II,III,IV(genes) = 0.51.

The gene list we obtained by strategy II (HBP) is nearly the same as the one we get

Figure 5.6: Top 100 genes after applying GSEA, HBP, HBP+HBP and HBP+GSEA.

left: Comparison of the top 100 genes with initial ranking positions. The top 100 genes per method were ordered by initial GWAS rank. right: Venn diagrams for the overlap of top 100 genes between our four strategies

by simple single SNP test. Only 2 genes from original rank 112 and 127 are replaced by two other genes and occurred newly in the top 100. The same behavior holds for strategy III (HBP+HBP). The list of top genes is almost identical to the one from strategy II (III,III(genes) = 0.89) and stayed almost unchanged compared with the initial list (III,III,initial(genes) = 0.94, III,initial(genes) = 0.98, IIII,initial(genes) = 0.89). 11 new genes appeared in the top 100 lists that were initially ranked between rank 229 and 586. Taking a look at the non-HLA genes known for an association with RA (table 5.3), PTPN22, CTLA4, CD28, CD40, PRKCQ and PTPRC reached mentionable higher ranks for strategy III (HBP+HBP) than in the original ranking (table 5.3). For strategy II, no such improvement is observed. Note that for strategies II and III 72 and 82 genes respectively had a posterior probability of 100% and even 3,423 and 6,922 genes had posterior probabilities >80%. Thus the HBP only strategies yielded nearly no change on the gene level. This supports Lewinger et al.’s conclusion that the approach is not helpful if highly significant associations occur. Because of the high impact of the HLA genes on chromosome 6p21 and very low p-values occurring for SNPs of these genes, the HBP approach cannot change the results by much. The leading edge subsets of the GSEA strategies highlighted many new genes in comparison to the top 100, primarily HLA genes, of the initial single SNP tests. Only 25 and 16 of the initial top 100 list are also included in the top 100 genes of I (GSEA) and IV (HBP+GSEA) with an overlap index ofII,IV,initial(genes) = 0.31 (II,initial(genes) = 0.25 ,IIV,initial(genes) = 0.16). The newly occurring genes in the leading edge subsets had initially ranks of up to 10,789 and 21,361, respectively.

Additionally, the LES gene lists of the two GSEA methods also differ remarkable, with an overlap index of II,IV(genes) = 0.37. They have only 37 genes in common. As shown in table 5.3, from the non-HLA genes known to be associated with RA, 11 genes belong to the LES of the significant pathways (FDR <0.05) of GSEA. For HBP+GSEA, only four of these plus one more were found in the LES.

Summarizing promising genes, the HBP only methods reveal the extraordinary

impor-Figure 5.7: Venn diagrams for the overlap of top 20 pathways between our strategies

tance of the HLA region with more than 89 HLA genes retained in the top 100 lists.

The ranking of genes changed only slightly compared to the initial list. In contrast, apart from only 29 and 19 HLA genes, respectively, the GSEA strategies include many additional other genes, that could be a new starting point to identify yet unknown factors influencing the disease.

Comparison of most promising gene sets

In total, we find 51 different gene sets in the four top 20 lists of the different strategies, with an overall overlap-index of I(gene sets)

I,II,III,IV = 0.48. Two gene sets, GO0002460: the adaptive immune response gene set (66 genes) and hsa04612: the antigen processing and presentation gene set (61 genes), appear in all four lists. Of the 61 genes in hsa04612, 22 are from the HLA region, while GO0002560 contains only 3 HLA genes.

Both sets have 3 genes in common (HLA-DMA, CD74 and LTA). Since the non-HLA genes in the latter gene set are involved in the activation or inhibition of immune reactions, this set is a reasonable candidate for RA. We can see a tendency ofhsa04612 to reach higher ranks with the GSEA methods (third and fourth rank), as with the strategies with a final HBP step (rank 11 and 17). On the contrary,GO0002460 ranked between 11th and 19th rank in all four strategies.

Figure 5.7 shows the overlap of the top 20 gene sets per strategy in a Venn diagram.

Strategy II yields essentially different results to the others, with list-to-list overlap indices of I(gene sets)

I,II , I(gene sets)

II,III , I(gene sets)

II,IV ≤ 0.25. The remaining strategies comprise 38 different gene sets, have an overlap-index of I(gene sets)

I,III,IV = 0.53, (I(gene sets)

I,III = 0.35, I(gene sets)

I,IV = 0.55, I(gene sets)

III,IV = 0.55) and share 5 gene sets. The latter are composed of 11 to 63 genes and have pairwise no more than 5 genes in common. Comparing the strategies involving GSEA (I and IV) we have an overlap-index of I(gene sets)

I,IV = 0.5, while strategies II and III – both compassing HBP only have an overlap-index of I(gene sets)

II,III = 0.15. Hence, this indicates a more robust ranking of gene sets by GSEA than HBP, although strategy I and IV even share 11 sets of their top 20.

Although the top 100 gene list of both strategies with HBP as last step (II and III) are almost identical, their top 20 gene sets diverge considerable. Furthermore, it has to be mentioned that the order of gene sets for the HBP only methods (II and III) is

Table 5.3: Non-HLA genes known for an association with Rheumatoid Arthritis (Raychaudhuri, 2010; Amos et al., 2009)

INITIAL I: GSEA II: HBP III: HBP+HBP IV: HBP+GSEA

gene rank LES gene rank gene rank LES

include exclude include exclude include exclude include include

HLA HLA HLA HLA HLA HLA HLA HLA

PTPN22 55 1 24,38 25 1 28

Begovichet al.(2004)

C5 78 4 17,19,20,25, * 80 2 75 *,

Plengeet al.(2007) 30,34,42,49

TRAF1 82 6 83 3 98

Plengeet al.(2007)

BLK 167 65 183 6187 273

Gregersenet al.(2009)

CTLA4 358 246 1 382 8452 219 2,*

Plengeet al.(2005)

CD28 400 286 1,6,8,15, * 407 67 97 1,2,*

Raychaudhuriet al.(2009) 25,31,38

CD40 489 374 1,6,8,15, * 510 3372 129 2,*

Raychaudhuriet al.(2008) 19,30,31,38

TNFAIP3 570 455 45 548 3835 679

Plengeet al.(2007)

IL2RB 618 503 809 5439 1168

Bartonet al.(2008)

PRKCQ 625 510 31,38 603 2342 260 *

Bartonet al.(2008)

REL 644 529 677 10746 711

Gregersenet al.(2009)

PRDM1 827 710 885 4409 1235

Raychaudhuriet al.(2009)

AFF3 1751 1633 1821 2497 1282

Bartonet al.(2009)

PTPRC 1765 1647 19,24,30, * 1647 398 391 2,*

Raychaudhuriet al.(2009) 31,38,45

IL21 2517 2398 31,38 2694 3684 880 *

Zhernakovaet al.(2007)

CD2 3579 3458 24,31 3895 3073 1011 *

Raychaudhuriet al.(2009)

IGSF2 3605 3484 25 * 3692 6212 3395

Raychaudhuriet al.(2009)

TAGAP 3674 3553 4084 954 3951

Raychaudhuriet al.(2009)

STAT4 4633 4511 4748 3310 4652

Remmerset al.(2007)

CCL21 4957 4835 4538 11506 5647

Raychaudhuriet al.(2008)

CD58 5718 5595 4023 4575 6627

Raychaudhuriet al.(2009)

IL2RA 6620 6496 8357 798 4616 *

Thomsonet al.(2007)

IL2 8858 8733 8849 12306 1302 1

Zhernakovaet al.(2007)

TNFSF14 10669 10542 9957 11365 13669

Raychaudhuriet al.(2008)

KIF5A 11263 11135 11793 11919 12079

Bartonet al.(2008)

TRAF6 12186 12058 12234 14275 6199

Raychaudhuriet al.(2009)

RAG1 12216 12088 16628 1074 17017

Raychaudhuriet al.(2009)

FCGR2A 12256 12128 13650 10713 13361

Raychaudhuriet al.(2009)

For initial single SNPs analysis, HBP (II) and HBP+HBP (III) the corresponding gene ranks are given. For GSEA (I) and HBP+GSEA (IV) the ranks of significant gene sets (FDR0.05) are given that involve these genes in their leading edge subset (LES). Genes belonging to the LES of sets significant according to the nominal p-value are marked by *. For the GESA excluding the HLA region, none of the gene sets is significant according to FDR.

only build by the estimated regression coefficients (equation of4.19) that represent the increase or decrease of the prior probability for each SNP involved in the corresponding gene set. The standard deviation of these estimates is neglected. However, for the strategies with GSEA (I and IV), the order of the gene sets is based on gene set significance. In general, taking a closer look at the gene sets that show up in the lists, of both strategies with HBP as last step (II and III) are almost identical (III,III(genes) = 0.89), their top 20 gene sets diverge considerable. Many of them have biological plausibility.

For the biological interested reader, the lists of the different top 20 gene set lists are given in appendix table B.1.

To see how the methods behave when we do not have such a huge number of very high signals, we repeated the simple HBP and GSEA analyses excluding the HLA region. Based on the recommendation of Lebrec et al. (2009), we excluded all genes in the region between the genes MOG and KIFC1. These comprise 128 genes and 1,336 underlying SNPs. After the exclusion, 9 genome-wide significant genes remained, including PTPN22, TRAF1 and C5 (table 5.3). For the HBP, taking a look at the estimates for the posterior expectations of the noncentrality parameters that represent the strength of association (equation 4.23), PTPN22, C5 and TRAF1 were located on the three top ranks. C5 and TRAF1 were up-ranked in comparison to the initial ranking. Their corresponding posterior estimates were given by 2.21, 1.70 and 1.67 (table 5.3). The estimates for all β parameters of the logistic regression model from level 3 (equation 4.19) were positive, representing an increase of prior probability for SNPs involved in any of the pathways. The basic prior probability from that model for a SNP in none of the pathways was estimated by 0.9. Hence, all SNPs reached a posterior probability to be associated of at least 90%. As it is not plausible that everything is associated with the disease, results have to be considered with caution. In comparison, for the analysis with the HLA region included, the basic prior probability of a SNP involved in none of the genes sets was given by 0.175 and varied for the other SNPs between 0.015 and 0.76 depending on the corresponding pathway membership and SNP information. For GSEA, none of the considered pathways reached a FDR or FWER < 0.05. Hence, GSEA was no help to select candidate genes for further investigations when HLA is excluded.

5.5 Comparison with other results from the Genetic Analysis