

[Figure 2.2: panel (a) plots log10(p-values) against SNPs; panel (b) plots the odds ratio against SNPs.]

Figure 2.2: Data set with one disease-specific and one disease-unspecific pattern. The data set contains 500 cases and 500 controls with 10,000 SNPs. a) p-values of the SNPs, b) odds ratios of the SNPs.

2.2 Genome-wide association data

In this thesis, we will test our algorithms on two different GWA data sets: a coronary artery disease data set and an inflammatory bowel disease data set.

2.2.1 Coronary artery disease data set

The coronary artery disease (CAD) data set originates from the German MI family study II [33]. This data set was included in the large CARDIoGRAM meta-analysis [26] (see Chapter 1). It contains 2506 individuals, of which 1222 are cases and 1284 are controls; 1643 are males and 1284 are females. All individuals are Caucasians of European origin. All cases had a validated myocardial infarction before the age of 60 years. Approximately half of the cases have a positive family history of CAD. The control individuals are population-based. The total number of SNPs is 478,040. Missing values were imputed [34].

We filtered the data with standard quality criteria: we removed SNPs with MAF < 1% and SNPs that deviate from Hardy-Weinberg equilibrium (HWE) with a p-value below $10^{-4}$. Because missing values were imputed, the data contain no missing values, and we did not need to apply a threshold to restrict their number. After quality filtering, the total number of SNPs is 432,189.
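The two quality criteria can be illustrated with a minimal Python sketch. It assumes genotypes coded as 0, 1 or 2 copies of one allele and uses the asymptotic chi-square test for HWE for brevity; the test actually used for filtering is not specified here, and all function names are hypothetical.

```python
import math

MAF_MIN = 0.01    # SNPs with MAF below 1% are removed
HWE_P_MIN = 1e-4  # SNPs with an HWE p-value below 10^-4 are removed


def minor_allele_frequency(genotypes):
    """MAF of one SNP; genotypes are allele counts (0, 1 or 2) per individual."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def hwe_p_value(genotypes):
    """Chi-square test (1 degree of freedom) for deviation from HWE.

    Assumption: asymptotic test chosen for simplicity, not necessarily
    the test applied to the data sets described in the text.
    """
    n = len(genotypes)
    observed = [genotypes.count(g) for g in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)
    if p in (0.0, 1.0):  # monomorphic SNP: no deviation can be measured
        return 1.0
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2))


def passes_quality_filter(genotypes):
    """True if the SNP survives both the MAF and the HWE criterion."""
    return (minor_allele_frequency(genotypes) >= MAF_MIN
            and hwe_p_value(genotypes) >= HWE_P_MIN)
```

A SNP with genotype counts 25/50/25 is in perfect equilibrium and passes, whereas a SNP without any heterozygotes (50/0/50) is rejected by the HWE criterion.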


As we discussed in the previous section, SNPs that lie only a few base pairs apart are generally inherited together. These SNPs carry the same information and are hence to a certain degree redundant. In addition, some combinations of SNPs, which need not lie close to each other, occur more often than we would expect by chance. Such associations between SNPs are referred to as linkage disequilibrium (LD). To obtain a more manageable data set, we reduced the number of SNPs by applying LD pruning. This removes much of the redundant information, since groups of SNPs in LD that lie close together on the chromosome, and hence have a similar inheritance pattern, are now represented by only a few SNPs.

We LD-pruned the data set using the PLINK software with the “indep” option. LD pruning in PLINK is based on the variance inflation factor (VIF) and recursively removes SNPs within a sliding window [35]. The VIF is given by $1/(1 - R^2)$, where $R^2$ is the measure of LD (see Chapter 3).
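To make the VIF criterion easier to read, the definition can be solved for $R^2$. The numerical value is only an illustration of how a VIF threshold translates into an $R^2$ threshold:

```latex
\[
\mathrm{VIF} = \frac{1}{1 - R^2}
\quad\Longleftrightarrow\quad
R^2 = 1 - \frac{1}{\mathrm{VIF}},
\qquad \text{e.g.}\quad \mathrm{VIF} = 2 \;\Rightarrow\; R^2 = 0.5 .
\]
```

Since the VIF increases monotonically with $R^2$ on $[0, 1)$, removing SNPs whose VIF exceeds a threshold is equivalent to removing SNPs whose $R^2$ exceeds the corresponding value.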

We used the standard settings: a window size of 50 SNPs, a window shift of 5 SNPs in each step, and a VIF threshold of 2. After LD pruning, 118,247 SNPs remain. For cross-validation, we divided the data into five sets with equal numbers of cases and controls; each cross-validation fold consists of 488 individuals.
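One way to obtain five folds with equal numbers of cases and controls is sketched below. This is an assumption about the procedure, not a description of the actual implementation: individuals are shuffled, each fold receives the same number of cases and controls, and surplus individuals are left out. The function name and the identifiers are hypothetical.

```python
import random


def balanced_folds(case_ids, control_ids, n_folds=5, seed=0):
    """Split individuals into folds with equal numbers of cases and controls.

    Assumption: individuals that do not fit into the balanced folds
    are simply not assigned to any fold.
    """
    rng = random.Random(seed)
    cases, controls = list(case_ids), list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)
    # Largest per-class count that every fold can receive
    per_class = min(len(cases), len(controls)) // n_folds
    folds = []
    for k in range(n_folds):
        lo, hi = k * per_class, (k + 1) * per_class
        folds.append((cases[lo:hi], controls[lo:hi]))
    return folds


# Hypothetical identifiers with the CAD class sizes (1222 cases, 1284 controls)
cad_cases = [f"case_{i}" for i in range(1222)]
cad_controls = [f"control_{i}" for i in range(1284)]
cad_folds = balanced_folds(cad_cases, cad_controls)
print([len(c) + len(d) for c, d in cad_folds])  # [488, 488, 488, 488, 488]
```

With 1222 cases, each fold receives 244 cases and 244 controls, which matches the fold size of 488 individuals.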

A closer look at the data

Figure 2.3 plots the p-values and ORs for all SNPs, ordered by base pair position. Alternate chromosomes are shaded light and dark. Thirteen SNPs are genome-wide significant at a p-value threshold of $10^{-8}$; this threshold is corrected for multiple testing. The MAF of these SNPs lies between around 1% and 15%, and the ORs range from 0.1061 to 19.89; these are relatively large effect sizes. Four of the significant SNPs have an MAF slightly larger than 5%, and the ORs for these SNPs range from 0.31 to 2.56. The SNPs that have been validated in the literature as associated with CAD are not among the SNPs that are significant in this data set, and they have smaller effect sizes. This may indicate that the effect sizes for some of the SNPs found to be significant here are artefacts. Note also that the search for significant SNPs was performed on the LD-pruned data set; this may explain why the significant SNPs did not include those previously reported to be associated with CAD.

[Figure 2.3: panel (a) plots log10 p-values against SNPs; panel (b) plots the OR against SNPs.]

Figure 2.3: CAD data set. Alternate chromosomes are shaded light and dark. a) p-values, b) odds ratios.

2.2.2 Inflammatory bowel disease data set

The inflammatory bowel disease (IBD) data set was obtained from the public database dbGaP, run by the NCBI [36,37]. This data set was included in the large Crohn’s disease meta-analysis published in [38]. The data set contains 1760 individuals, of which 813 are cases and 947 are controls. Approximately half of the data set consists of Crohn’s disease (CD) patients and controls of non-Jewish European ancestry; the other half consists of individuals of Jewish ancestry. The total number of SNPs is 317,503. We applied the same quality criteria as for the CAD data set: MAF < 1% and a maximum deviation from HWE of $10^{-4}$. Unlike the CAD data set, the IBD data set contains missing values that have not been replaced by imputed values. Therefore, we also applied a missingness threshold of 2% per SNP. Since some of the individuals have a high missing rate, we additionally applied a missingness threshold of 6% per individual.
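The two missingness thresholds can be sketched as follows. The order of the two steps (SNPs first, then individuals), the inclusive comparison, and the encoding of missing genotypes as `None` are assumptions made for illustration; the function names are hypothetical.

```python
def missing_rate(values):
    """Fraction of missing entries (None) in a list of genotypes."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def filter_missing(matrix, max_snp_missing=0.02, max_ind_missing=0.06):
    """Return the indices of individuals and SNPs that pass both thresholds.

    matrix[i][j] is the genotype of individual i at SNP j (None if missing).
    Assumption: SNPs are filtered first; individuals are then judged on the
    remaining SNPs only.
    """
    n_snps = len(matrix[0])
    keep_snps = [
        j for j in range(n_snps)
        if missing_rate([row[j] for row in matrix]) <= max_snp_missing
    ]
    reduced = [[row[j] for j in keep_snps] for row in matrix]
    keep_inds = [
        i for i, row in enumerate(reduced)
        if missing_rate(row) <= max_ind_missing
    ]
    return keep_inds, keep_snps
```

Applying the per-individual threshold after the per-SNP threshold avoids discarding individuals whose missing values are concentrated in SNPs that are removed anyway.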

The data set remaining after quality filtering contains 1719 individuals, of which 789 are cases and 930 are controls. The total number of SNPs is 292,162. No LD pruning was applied to this data set.

For cross-validation, we divided the data into five sets with equal numbers of cases and controls; each cross-validation fold consists of 314 individuals.

A closer look at the data

Figure 2.4 plots the p-values and ORs for all SNPs, ordered by base pair position. Alternate chromosomes are shaded light and dark. Five SNPs are genome-wide significant at a p-value threshold of $10^{-8}$; this threshold is corrected for multiple testing. The ORs of these SNPs range from 0.29 to 1.69, with an MAF of 5% to 36%. The significant SNPs lie within two genes, NOD2 and IL23R. Both are known susceptibility genes for IBD [39].

Figure 2.4: IBD data set. Alternate chromosomes are shaded light and dark. a) p-values, b) odds ratios.
