
7.3 Preprocessing of the data

7.3.1 Quality Control

As a first step of our analysis, we conducted a systematic quality control at each study center separately, using comparable quality criteria (section 3.2.3). In the following we outline the procedure for the German Lung Cancer Study (GLC).

Of the 514 cases and 488 controls selected for the GLC study, genotyping failed for two cases. Twelve further individuals (8 cases, 4 controls) were excluded because genotypes were missing for more than 10% of the 561,466 SNPs. Of the remaining 988 persons, 966 had a call rate above 95%, and the overall genotyping rate was 99.4%. We then checked the sex of the individuals based on X-chromosomal information. For one case, the determined sex did not agree with the reported one, and we excluded this person. Two cases showed a low rate of heterozygous genotypes compared with the other individuals, and two controls showed an excess of heterozygotes; these four were excluded as well. As a next step, cryptic relatedness between participants was investigated by determining pairwise similarities. We identified 3 case pairs as duplicates or monozygotic twins and 17 pairs of second degree relatives. For each of the identical pairs the smoking status agreed, and we removed one individual of each pair at random. The second degree pairs involved 13 different persons, most of them part of a complex network of relatedness, as shown in figure 7.1. Of this group, we removed as few individuals as possible (2 cases and 4 controls) so that no second degree relatives remained in the sample.
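Removing as few individuals as possible so that no related pair remains can be viewed as finding a smallest set of persons that touches every related pair. The following is a minimal sketch of that idea, not the procedure actually used for the GLC: it assumes the related pairs are given as tuples of study IDs (the IDs in the usage note are hypothetical) and simply tries all subsets in order of increasing size, which is feasible only for small groups such as the 13 persons here.

```python
from itertools import combinations


def minimal_removal_set(pairs):
    """Return a smallest set of individuals whose removal breaks every pair.

    pairs: iterable of (id_a, id_b) tuples, one per related pair.
    Exhaustive search over subsets of increasing size, so the first
    subset that covers all pairs is guaranteed to be minimal in size.
    """
    pairs = list(pairs)
    nodes = sorted({person for pair in pairs for person in pair})
    for size in range(len(nodes) + 1):
        for candidate in combinations(nodes, size):
            removed = set(candidate)
            # every pair must lose at least one of its two members
            if all(a in removed or b in removed for a, b in pairs):
                return removed
    return set(nodes)
```

For a hypothetical chain of relatives A-B, B-C, C-D, D-E, the call `minimal_removal_set([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])` returns the two individuals B and D. Several minimal sets may exist; this sketch returns the first one in sorted order and does not balance cases against controls.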

As a last step of quality control for the individuals, we performed a principal component analysis (PCA) on a subset of nearly 100,000 SNPs to assess population structure and identify ethnic outliers (section 3.2.4). For this purpose we used the software EIGENSOFT (Price et al., 2006). Since we restricted our analyses to Caucasians, six individuals (4 cases, 2 controls) with a non-Caucasian self-reported ethnicity (Arabs, Asians) were removed in advance. In a previous analysis, STRUCTURE had been applied to the data set to assign the samples to European, African or Asian ancestry, with HapMap Phase II data as reference sample (International HapMap Consortium, 2005). Two controls with 40% African background were identified. These were the two controls that deviated strongly from the sample heterozygosity distribution, as mentioned above, and they had already been excluded. Furthermore, one of the self-reported Arabs was clearly identified. We also discovered that some of the cases and controls with an Eastern European or Russian background carry a low percentage of Asian background (up to 16%). A graphical presentation is given in figure 7.2. The SNP subset used for EIGENSOFT was obtained by selecting markers from the whole set such that no high LD remained between the chosen markers. Additionally, non-autosomal and monomorphic SNPs were removed. The principal component axes were tested for statistical significance with the Tracy-Widom statistic (Tracy and Widom, 1992). Our PCA yielded 20 eigenvectors with a p-value ≤ 0.05, of which 17 even had p-values ≤ 10^-7. Figure 7.3 shows plots of the first 8 principal components, with the individuals colored according to their self-reported origin. All four plots show a main core cluster comprising most of the individuals, with some outliers in different directions.
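The marker selection for the PCA described above (autosomal, polymorphic, no high LD between chosen markers) could be sketched as a greedy pruning pass. This is an illustration only: the r^2 threshold of 0.2 and the window of 50 markers are assumptions, since the text does not state the values used, and genotypes are assumed to be complete and coded as 0, 1 or 2 copies of one allele.

```python
AUTOSOMES = set(range(1, 23))


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def select_pca_markers(snps, max_r2=0.2, window=50):
    """Greedy LD pruning.

    snps: list of (name, chromosome, genotypes) in map order.
    max_r2 and window are assumed values, not taken from the text.
    """
    kept = []
    for name, chrom, genotypes in snps:
        if chrom not in AUTOSOMES:
            continue  # drop non-autosomal SNPs
        if len(set(genotypes)) < 2:
            continue  # drop monomorphic SNPs
        # compare only with recently kept markers on the same chromosome
        neighbours = [k for k in kept[-window:] if k[1] == chrom]
        if all(r_squared(genotypes, k[2]) <= max_r2 for k in neighbours):
            kept.append((name, chrom, genotypes))
    return [name for name, _, _ in kept]
```

A marker is kept only if it is weakly correlated with all recently kept markers on the same chromosome, so the result depends on the order in which markers are visited.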

[Figure 7.1: network plot of the second degree relatives; the edge labels (similarity values) lie between 0.21 and 0.31.]

Figure 7.1: Overview of the 2nd degree relatives in the GLC. The nodes of this plot represent the individuals, denoted by their study ID. The edges represent the relatedness between individuals, labeled with the corresponding similarity measure. For genetically identical individuals this measure equals 1, while values close to 0.5 indicate first degree relatives and values close to 0.25 indicate second degree relatives.

We repeated the PCA using an iterative procedure integrated in EIGENSTRAT to automatically remove outliers. In the first iteration 18 persons were removed, and 15 more in a second iteration. This reduced the number of significant eigenvectors (p-value < 0.05) to 4, of which 3 were highly significant. In 4 further iterations, 3, 6, 1 and 1 more individuals were removed before the outlier removal terminated; 3 PC axes were still significant. When checking the reported ethnicities of the identified outliers, we found that several were of East European or Russian origin, as can also be seen in figure 7.3. We decided to remove the outliers from the first two iterations and to use the first 4 principal components (PCs) in subsequent analyses where possible. These four PCs are displayed in figure 7.4. Individuals are colored according to the originating study (LUCY, Heidelberg, KORA). We see no major genetic differences between these groups. In total, the final sample for the analyses comprised 935 individuals (467 cases and 468 controls).
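The logic of an iterative outlier removal along principal component axes can be illustrated as follows. This is a simplified sketch and not the EIGENSOFT implementation: it keeps the PC scores fixed, whereas the actual procedure recomputes the components after each removal round, and the threshold of 6 standard deviations and the iteration limit are assumptions that are not stated in the text.

```python
from statistics import mean, stdev


def remove_pc_outliers(scores, n_sd=6.0, max_iter=5):
    """Iteratively drop individuals far from the mean on any PC axis.

    scores: dict mapping individual ID -> list of PC scores.
    n_sd, max_iter: assumed settings, not taken from the text.
    Returns (remaining scores, list of removed IDs per iteration).
    """
    remaining = dict(scores)
    removed_per_iteration = []
    for _ in range(max_iter):
        if len(remaining) < 3:
            break
        n_axes = len(next(iter(remaining.values())))
        flagged = set()
        for axis in range(n_axes):
            column = [v[axis] for v in remaining.values()]
            centre, spread = mean(column), stdev(column)
            if spread == 0:
                continue
            flagged |= {
                ident for ident, v in remaining.items()
                if abs(v[axis] - centre) > n_sd * spread
            }
        if not flagged:
            break  # no further outliers: the procedure terminates
        removed_per_iteration.append(sorted(flagged))
        for ident in flagged:
            del remaining[ident]
    return remaining, removed_per_iteration
```

Because mean and standard deviation are recomputed after each round, individuals that were masked by more extreme outliers can be flagged in a later iteration, which is consistent with removals continuing over several rounds as reported above.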

Subsequently, SNPs were filtered according to their proportion of missing genotypes, minor allele frequency (MAF) and deviation from Hardy-Weinberg equilibrium (HWE) within the controls. We removed 7,889 SNPs with more than 5% missing genotypes, 23,778 SNPs with a MAF < 1% and 405 SNPs with a HWE p-value < 10^-7 within controls. Furthermore, 2,728 heterozygous haploid genotypes (SNPs on the X or Y chromosome in men) were detected and set to missing. Finally, after frequency and genotype pruning, 529,730 SNPs remained.
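The three SNP filters can be expressed compactly in code. The sketch below is an illustration under stated assumptions and not the PLINK implementation: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and the HWE test is a simple one degree of freedom chi-square test swapped in for brevity (PLINK offers an exact test). The function names are hypothetical.

```python
import math


def snp_stats(genotypes):
    """Return (missing rate, minor allele frequency) for one SNP."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    freq = sum(called) / (2 * len(called))
    return missing_rate, min(freq, 1 - freq)


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic: no deviation can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # survival function of the chi-square distribution with 1 df
    return math.erfc(math.sqrt(chi2 / 2))


def passes_snp_qc(all_genotypes, control_genotypes,
                  max_missing=0.05, min_maf=0.01, min_hwe_p=1e-7):
    """Apply the thresholds from the text to a single SNP."""
    missing_rate, maf = snp_stats(all_genotypes)
    if missing_rate > max_missing or maf < min_maf:
        return False
    counts = [control_genotypes.count(g) for g in (0, 1, 2)]
    return hwe_chisq_p(*counts) >= min_hwe_p
```

Missingness and MAF are evaluated on all individuals, while the HWE test uses the controls only, as in the text.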

Figure 7.2: Assignment of the GLC individuals to Caucasian, Asian and African genetic background, represented by the HapMap Phase II reference populations CEU, HCB and YRI, using the population structure software STRUCTURE (Pritchard et al., 2000).

All quality control procedures, except for the identification of outliers and population structure, were performed with the GWAS software PLINK. More detailed information on the motivation for the different filter criteria and the corresponding usage of PLINK can be found in appendix A.2. An overview of the thresholds used for the quality filtering process is given in table 7.2.

Table 7.2: Quality criteria used for our TRICL GWAS analyses

SNP specific quality checks
  Call rate                                ≥ 95%
  Minor allele frequency                   ≥ 1%
  Hardy-Weinberg equilibrium in controls   p(HWE, controls) ≥ 10^-7

Individual specific quality checks
  Call rate                                ≥ 90%
  Sex mismatch                             female F < 0.2 and male F > 0.8
  Heterozygosity                           [mean F +/- 6 standard deviations of F]
  Cryptic relatedness                      proportion of alleles IBD < 0.20
  Population outliers                      Caucasian ancestry, |PLINK's nearest neighbor Z score| < 4
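The individual-level criteria in table 7.2 that are based on the inbreeding coefficient F can be illustrated with a short sketch. It assumes F is computed from the observed and expected numbers of homozygous genotypes, as in PLINK's heterozygosity report; the function names and the input format are hypothetical.

```python
from statistics import mean, stdev


def inbreeding_f(obs_hom, exp_hom, n_nonmissing):
    """F = (observed hom. - expected hom.) / (N - expected hom.)."""
    return (obs_hom - exp_hom) / (n_nonmissing - exp_hom)


def heterozygosity_outliers(f_by_id, n_sd=6.0):
    """IDs whose autosomal F lies outside mean +/- n_sd standard deviations."""
    values = list(f_by_id.values())
    centre, spread = mean(values), stdev(values)
    lower, upper = centre - n_sd * spread, centre + n_sd * spread
    return sorted(i for i, f in f_by_id.items() if not lower <= f <= upper)


def sex_mismatch(reported_sex, x_chromosome_f):
    """True if the X-chromosomal F contradicts the reported sex.

    Thresholds from table 7.2: females F < 0.2, males F > 0.8.
    """
    if reported_sex == "F":
        return not x_chromosome_f < 0.2
    if reported_sex == "M":
        return not x_chromosome_f > 0.8
    raise ValueError("reported_sex must be 'F' or 'M'")
```

A strongly negative F corresponds to an excess of heterozygotes and a strongly positive F to a deficit, so one two-sided interval covers both kinds of exclusion described for the GLC.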

Since we did not strictly prescribe how outliers should be identified, the methods varied between the studies. For the Central Europe study, STRUCTURE was used, defining population outliers as individuals with an ancestry probability of being Caucasian < 80%. MD Anderson used the outlier detection diagnostic in PLINK (absolute value of the nearest neighbor Z score > 4).

For the SLRI, MDACC and CE-IARC studies, 331, 1,150 and 1,901 lung cancer cases, 499, 1,134 and 2,503 controls, and 314,072, 312,452 and 310,045 SNPs, respectively, remained for the analysis after excluding subjects and markers based on the different quality criteria.

Figure 7.3: Principal component analysis of GLC. Plots of the first 8 principal components with outliers included. The different colors represent the different reported ethnicities.

Figure 7.4: Principal component analysis of GLC. Plots of the first 4 principal components with outliers excluded. The different colors represent the three underlying studies (legend: KORA controls, LUCY cases, DKFZ cases).