7.5 Analysis of main effects integrating pathway information

7.5.6 Comparison of top genes between studies

As mentioned in section 7.4, we used EM for SNP re-ranking after the analysis with HBP for pathway integration. Figure 7.12 compares the rankings according to the three different posterior quantities $PM_i$, $EM^+_i$ and $EM_i$, $i = 1, \ldots, N_M$. For all studies and both pathway models, $PM_i$ and $EM_i$ were highly correlated with each other, while $EM^+_i$ had a much lower similarity with both. This supports the observation of Lewinger et al. (2007) in their simulation studies that $PM_i$ and $EM_i$ show similar power while $EM^+_i$ performs worse. Although the results presented in the following are based on $EM_i$, they are also representative of the ranking according to $PM_i$ because of the high correlation.

When comparing the initial regression gene ranking with the gene ranking after integration of pathway information, we observe a gain in consistency between the studies.

In figure 7.13, the pairwise comparisons of rankings between studies for HBP and initial regression are contrasted graphically in list comparison plots. Considering M1, the concordance between CE-IARC, GLC and MDACC is clearly increased. SLRI

Figure 7.12: List comparison plots of gene rankings according to the different HBP posterior quantities $EM^+_i$, $PM_i$ and $EM_i$. The y-axis shows the proportion of common genes for a particular number of top genes given on the x-axis.

(a) Pathway model 1 (b) Pathway model 2

Figure 7.13: List comparison plots of gene rankings between different studies for HBP (EM) and initial regression results. The y-axis shows the proportion of common genes for a particular number of top genes given on the x-axis. The darker points indicate a significant overlap.

Table 7.8: Number of overlapping top 100 genes between studies and models contrasted for initial regression results (upper triangle) and HBP (lower triangle)

Upper triangle: initial regression analysis. Lower triangle: Hierarchical Bayes Prioritization (HBP).

                 GLC         CE-IARC     MDACC       SLRI
                 M1    M2    M1    M2    M1    M2    M1    M2
GLC      M1      -     32    2     1     5     4     3     3
GLC      M2      12    -     1     0     1     1     2     2
CE-IARC  M1      2     8     -     56    3     3     2     1
CE-IARC  M2      52    2     5     -     3     3     1     0
MDACC    M1      11    4     13    8     -     81    1     1
MDACC    M2      3     8     32    2     34    -     2     1
SLRI     M1      0     4     11    0     20    10    -     34
SLRI     M2      60    3     1     55    13    2     0     -

is the only study behaving differently. For SLRI and GLC, similar consistencies are observed with initial regression and HBP. Comparing the SLRI ranking with CE-IARC and MDACC, an improvement in the top ranks is observed, while the similarity gets worse with increasing rank. For M2, all pairs show a gain in consistency using HBP. CE-IARC and SLRI are particularly noteworthy, with at least 40% overlap over the whole gene ranking list.
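The quantity plotted in the list comparison plots, the proportion of common genes among the top n genes of two rankings, can be sketched as follows. This is a minimal illustration only: the function name `overlap_curve` and the toy gene identifiers are hypothetical, and the sketch does not reproduce the significance shading used in figure 7.13.

```python
def overlap_curve(ranking_a, ranking_b, max_n=None):
    """Proportion of genes shared by the top n of two rankings, for n = 1..max_n.

    Both rankings are lists of gene identifiers, best-ranked gene first,
    and are assumed to contain no duplicates.
    """
    if max_n is None:
        max_n = min(len(ranking_a), len(ranking_b))
    curve = []
    for n in range(1, max_n + 1):
        # Genes that appear among the top n of both lists.
        common = set(ranking_a[:n]) & set(ranking_b[:n])
        curve.append(len(common) / n)
    return curve


# Toy rankings (hypothetical gene names, not taken from the studies).
study_1 = ["g1", "g2", "g3", "g4"]
study_2 = ["g2", "g1", "g5", "g3"]
curve = overlap_curve(study_1, study_2)
```

Each entry of `curve` corresponds to one point of the plot: the x-value is the number of top genes n, the y-value the shared proportion.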

In table 7.8, the overlap of the top 100 genes between the different studies is shown for the initial ranking and for the ranking based on the posterior quantity. Again, we see a clear increase in consistency through pathway integration. For the initial results, 3 overlapping genes were found for CE-IARC and MDACC; for all other combinations we had only 1 or even no gene in common. After the integration of pathway information and re-ranking of the SNPs, the number of common genes between the different studies increased for most combinations, to up to 60 genes. In particular, a high number of common genes is observed for all pairwise comparisons of GLC M1, CE-IARC M2 and SLRI M2, as well as for MDACC M1 with CE-IARC M2 and SLRI M1.

Between the two models per study, a higher number of common genes was observed with initial regression. A possible reason, as mentioned before, may be that the top results for M1, for all studies except MDACC, are not necessarily related to lung cancer directly but possibly to the unconsidered confounding factor smoking.

While this already affects the top genes for the initial regression, the effect may be increased further by the additional pathway information used for the HBP. This even stronger emergence of smoking-related genes for M1 leads to fewer common genes between M1 and M2 per study.

Considering the top 100 genes for each of the HBP analyses, we observe 521 different genes. Of these, 112 occur in 2 different studies, 52 occur in 3 studies, and 4 are in the top 100 for at least one model of all four studies. For the initial regression, 577 different genes were observed considering the top 100 genes for each analysis. Only 20 of these occur in 2 different studies. None occurred in 3 or 4 studies.
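The counts above can be obtained by tallying, for every gene, the number of studies in which it reaches the top 100 under at least one model. The following is a minimal sketch under the assumption that "occur in 2 different studies" means exactly two. The function name `genes_by_study_count` and the toy data are hypothetical.

```python
from collections import Counter


def genes_by_study_count(top_lists):
    """Tally distinct genes by the number of studies they appear in.

    top_lists maps (study, model) to the list of top genes of that analysis.
    A gene counts for a study if it is in the top list of at least one model.
    Returns a Counter: number of studies -> number of distinct genes.
    """
    studies_per_gene = {}
    for (study, _model), genes in top_lists.items():
        for gene in genes:
            studies_per_gene.setdefault(gene, set()).add(study)
    return Counter(len(studies) for studies in studies_per_gene.values())


# Toy example with two studies, two models each (hypothetical gene names).
toy_lists = {
    ("A", "M1"): ["g1", "g2"],
    ("A", "M2"): ["g1", "g3"],
    ("B", "M1"): ["g1", "g4"],
    ("B", "M2"): ["g2", "g5"],
}
tally = genes_by_study_count(toy_lists)
n_distinct = sum(tally.values())
```

Here `n_distinct` plays the role of the total number of different genes, and `tally[k]` the number of genes found in exactly k studies.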

Gene set enrichment analysis

Table 7.9 shows the total number of LES genes extracted from the pathways with a nominal p ≤ 0.05 in the GSEA for each of the analyses. While GLC, CE-IARC and SLRI had around 300 to 400 LES genes for each of the analyses, for the MDACC study

Table 7.9: Number of LES genes of GSEA for the different studies

                 GLC    CE-IARC    MDACC    SLRI
model 1          304    301        47       321
model 2          284    423        94       343
model 1 ∩ 2      108    204        48       66
model 1 ∪ 2      481    521        94       599

Table 7.10: Number of overlapping LES genes of GSEA between the different studies and corresponding p-values for the overlap in brackets

               GLC /             GLC /             GLC /             CE-IARC /         CE-IARC /          MDACC /
               CE-IARC           MDACC             SLRI              MDACC             SLRI               SLRI
model 1        12 (0.8599)       3 (0.4507)        99 (2.2·10^-54)   17 (6.7·10^-11)   44 (8.1·10^-10)    3 (0.4870)
model 2        38 (0.0001)       10 (0.0151)       24 (0.0396)       17 (0.0003)       70 (1.1·10^-16)    14 (0.0010)
model 1 ∪ 2    44 (8.1·10^-10)   66 (7.2·10^-21)   47 (1.3·10^-06)   75 (1.7·10^-29)   204 (3.6·10^-179)  70 (1.1·10^-16)

only 47 (M1) and 94 (M2) LES genes occurred. A reason for the identification of fewer LES genes may be the lack of never smokers in the MDACC data.

In the same table we furthermore see the number of intersecting LES genes between M1 and M2 per study. The intersection is highly significant for all studies. This is not surprising, since the same data underlie the slightly different analyses. Particularly noticeable is MDACC, where the LES genes for M1 are a subset of those for M2. This result fits well with our very similar HBP pathway rankings for both models of MDACC and with our hypothesis that the results are much more similar than for the other studies because of the missing never smokers.

In table 7.10, the overlap of LES genes between the different studies is given. Measured against the total number of genes occurring within the considered 234 pathways, the overlap of LES genes for model 1 is significant between GLC and SLRI, SLRI and CE-IARC, and CE-IARC and MDACC. For model 2, the overlap is significant between all pairs of studies. The same holds for the combined sets of model 1 and model 2. When comparing the LES genes of M1 of one study with those of M2 of another study, the overlap is also significant for most combinations (9 out of 12). In total, 1354 different LES genes occurred. Of these, 299 were LES genes for at least 2 different studies, and 40 were found in 3 different studies. Two genes were identified in the LES of all four studies.
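The text does not state which test produced the overlap p-values in table 7.10. One common choice for this kind of question is a one-sided hypergeometric test, in which the universe is the set of all genes within the considered pathways. The sketch below illustrates that choice as an assumption and is not a reconstruction of the actual computation. The function name `overlap_p_value` and all numbers in the example are hypothetical.

```python
from math import comb


def overlap_p_value(n_universe, n_a, n_b, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: total number of genes that could be drawn (assumed known)
    n_a, n_b:   sizes of the two gene sets being compared
    n_overlap:  observed number of genes common to both sets
    """
    total = comb(n_universe, n_b)
    # Sum the probabilities of all overlaps at least as large as observed.
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, min(n_a, n_b) + 1)
    )
    return tail / total


# Toy numbers only: two sets of 5 genes each drawn from a universe of 10 genes.
p_complete = overlap_p_value(10, 5, 5, 5)  # both sets identical
p_trivial = overlap_p_value(10, 5, 5, 0)   # any overlap at all
```

Exact integer arithmetic is used until the final division, which keeps very small tail probabilities from being lost to rounding in intermediate steps.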

SUMSTAT method

To evaluate the results of SUMSTAT on the gene level, we built for each analysis a new ranking list containing only those genes occurring in the corresponding significant pathways. The top 100 genes of each of the new lists were then compared between the different studies. Table 7.11 shows the number of common top genes between the different analyses. We clearly see that the number of common top genes is larger than for the initial regression results (shown in the upper part of table 7.8). However, the consistency achieved by HBP exceeds that achieved by SUMSTAT. In

Table 7.11: Number of overlapping top 100 genes between studies and models contrasted for initial regression results (upper triangle) and SUMSTAT (lower triangle)

Upper triangle: initial regression analysis. Lower triangle: SUMSTAT.

                 GLC         CE-IARC     MDACC       SLRI
                 M1    M2    M1    M2    M1    M2    M1    M2
GLC      M1      -     32    2     1     5     4     3     3
GLC      M2      21    -     1     0     1     1     2     2
CE-IARC  M1      6     2     -     56    3     3     2     1
CE-IARC  M2      5     3     55    -     3     3     1     0
MDACC    M1      10    6     6     8     -     81    1     1
MDACC    M2      9     9     8     8     79    -     2     1
SLRI     M1      11    3     5     5     4     9     -     34
SLRI     M2      4     4     3     3     2     1     43    -

total, 537 different genes occurred in the 8 different top 100 gene lists. Of these, 43 were found for two and only 11 for three different studies. No gene was identified in the top 100 for all four studies.
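The gene-level evaluation of SUMSTAT described above amounts to restricting a gene ranking to the members of the significant pathways and taking the first 100 entries. A minimal sketch, assuming the ranking is an ordered list and pathway membership is available as a mapping. The function name `restricted_top_genes`, the argument names and the toy data are hypothetical.

```python
def restricted_top_genes(ranking, significant_pathways, pathway_genes, k=100):
    """Top k genes of a ranking, restricted to genes in significant pathways.

    ranking:              gene identifiers, best-ranked gene first
    significant_pathways: identifiers of the pathways found significant
    pathway_genes:        mapping pathway identifier -> set of member genes
    """
    allowed = set()
    for pathway in significant_pathways:
        allowed.update(pathway_genes[pathway])
    # Keep the original order of the ranking, drop genes outside the pathways.
    return [gene for gene in ranking if gene in allowed][:k]


# Toy data (hypothetical pathway and gene names).
toy_ranking = ["g1", "g2", "g3", "g4", "g5"]
toy_pathways = {"P1": {"g2", "g5"}, "P2": {"g3"}, "P3": {"g1"}}
top_two = restricted_top_genes(toy_ranking, ["P1", "P2"], toy_pathways, k=2)
```

The resulting lists can then be compared pairwise between analyses, for example by intersecting them as sets, to obtain overlap counts of the kind reported in table 7.11.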