
6.1.2 Quality Control

6.1.3 Genome-wide association study

A genome-wide association study (GWAS) analyzes genetic variation across the genome to identify alleles that commonly appear together with a specific phenotype, e.g., a disease.

The GWAS can be calculated using PLINK, a program that provides a comprehensive selection of analyses for genomic studies (Purcell, 2006; Purcell et al., 2007). Selected markers, e.g., SNPs, are chosen for examination; most of them are localized in non-protein-coding regions, such as introns or regions between genes. A possible association with a specific phenotype is evaluated, as shown in Fig. 6.2 on the next page. On the left side of the figure, there is no significant difference in the phenotype between the different genotype groups, and the slope of the fitted line is rather flat.

On the right side of the figure, the homozygous major-allele genotype (AA) shows a significantly different manifestation of the phenotype than the homozygous minor-allele genotype (aa). Since the heterozygous phenotype (aA) lies directly in the middle between the two homozygotes, this SNP's effect would be well described by a dosage model, indicated by the steeper fitted line. For the GWAS, these associations were computed for all known SNPs on the genome, each yielding a p-value that indicates the significance of the finding.
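The dosage model described above can be illustrated with a minimal sketch. It is not PLINK's implementation: it fits an ordinary least-squares line of the phenotype on the allele count, omits the covariates, and uses a normal approximation for the p-value instead of the t-distribution. The function name `additive_association` and the toy values are assumptions made for illustration only.

```python
import math

def additive_association(dosages, phenotypes):
    """Fit phenotype = intercept + beta * dosage by ordinary least squares.

    dosages: minor-allele count per individual (0 = AA, 1 = aA, 2 = aa).
    Returns (beta, standard_error, two_sided_p); the p-value uses a normal
    approximation, which is a simplification for small samples.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # slope of the fitted line
    intercept = mean_y - beta * mean_x
    residual_ss = sum(
        (y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes)
    )
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, se, p

# Hypothetical toy data: three individuals per genotype group.
dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2]
flat = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 0.9, 1.1, 1.0]   # no association
steep = [1.0, 1.2, 0.8, 2.1, 1.9, 2.0, 2.9, 3.1, 3.0]  # additive effect

beta_flat, se_flat, p_flat = additive_association(dosages, flat)
beta_steep, se_steep, p_steep = additive_association(dosages, steep)
```

In the flat example the slope is close to zero and the p-value close to 1; in the steep example the heterozygous group lies midway between the homozygous groups and the slope is close to 1 with a very small p-value.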

The p-value represents the probability of finding a test statistic at least as extreme as the observed one if the null hypothesis is true. It ranges between 1, indicating no significance, and 0, indicating high significance. The lower the p-value, the less likely it is to find such data under the null hypothesis, and the stronger the evidence in favor of the alternative hypothesis. In standard clinical trials, the threshold is commonly set to 0.05 (5 % probability) for single tests; for genome-wide analyses it is set to $5 \cdot 10^{-8}$ to correct for multiple testing (Barsh et al., 2012). If the p-value falls below the threshold, the result can be assessed as significant and the null hypothesis rejected. For example, when having normally distributed data (with mean = 0 and standard deviation = 1,

6.1. TUM 1 Dataset 49

Figure 6.2: Association between genotype and phenotype (y-axis: phenotype). The figure on the left shows no association, whereas the figure on the right shows a significant association between genotype and phenotype.

for which the 95 % confidence interval lies between −1.96 and +1.96, as shown in Fig. 6.3 on the following page), the probability of a significant result above the 95 % confidence interval in a one-sided test is indicated by the red area in Fig. 6.3, which equals 2.5 % of the total area under the bell curve. If we also include the area to the left of −1.96 (again 2.5 % of the area under the curve, AUC), i.e., all values outside the confidence interval, corresponding to a two-sided test, we find 5 % in the marked region of the tails.
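The tail areas for the standard normal example can be checked numerically. The following sketch uses only the complementary error function from the Python standard library; the function names are chosen for illustration. The last line shows a commonly cited rationale for the genome-wide threshold, a Bonferroni-style correction of 0.05 under the assumption of roughly one million independent tests; that number of tests is an assumption and is not stated in this section.

```python
import math

def upper_tail(z):
    """One-sided p-value: P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_sided(z):
    """Two-sided p-value: P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))

print(round(upper_tail(1.96), 4))  # 0.025, the red area in the right tail
print(round(two_sided(1.96), 4))   # 0.05, both tails together

# Assumed: about one million independent tests across the genome.
genome_wide_threshold = 0.05 / 1_000_000
```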

Using these data, the GWAS was performed on the measured antibody titer against interferon-β as well as on the normalized antibody titer. Both calculations were of interest because the two sets of antibody titer values are distributed differently, as shown in Fig. 6.1 on page 45. The covariates sex and age at sampling (AAS) were included, as well as the components C1 to C5 from the multi-dimensional scaling (MDS) analysis of the identity-by-state matrix, which capture genotype similarity within a population. Figure 6.4 shows an example of a scatter plot of the MDS components C1 to C5.

The GWAS on the normalized phenotype data yielded some low p-values; nevertheless, no SNP reached the genome-wide significance threshold. Our results showed 390 SNPs with p-values between $10^{-5}$ and $10^{-6}$; the strongest association was found for SNP rs8051893 on chromosome 16, with a p-value of $3.515 \cdot 10^{-7}$. This SNP is localized in intron 1 of the WFDC1 gene, which is considered a tumor suppressor gene. Within this thesis, all information on the localization and function of genes is retrieved from the Database of Single Nucleotide Polymorphisms, dbSNP (Bethesda, 2005; Sherry et al., 2001).

The top SNP of the second GWAS, rs697296, with our phenotype being the

Figure 6.3: Interpretation of the p-value (x-axis: x, y-axis: p). The confidence interval is the area between the two vertical lines marked at −1.96 and 1.96. The red area displays the area under the curve (AUC) of 2.5 %, showing the significant domain.

SNP        allele 1  allele 2  frequency allele 1  info score  β      SE    p-value
rs8051893  C         T         0.5925              0.8885      0.373  0.07  $3.515 \cdot 10^{-7}$

Table 6.4: Top SNP from GWAS with normalized antibody titer of the TUM 1 dataset.

measured antibody titer (‘AB’) and the covariates sex, age at sampling (‘AAS’) and C1 to C5 again being included, showed a p-value of $4.141 \cdot 10^{-7}$.

SNP       allele 1  allele 2  frequency allele 1  info score  β        SE    p-value
rs697296  C         T         0.4169              0.9230      15.6294  3.02  $4.141 \cdot 10^{-7}$

Table 6.5: Top SNP from GWAS with measured antibody titer (‘AB’) of the TUM 1 dataset.

Although this SNP is not localized directly within a gene, dbSNP reports its position close to the PRICKLE gene on chromosome 3. Its p-value is not significant, but when this particular SNP was added to the 386 SNPs already present in the PRICKLE gene, the SVM prediction model yielded a higher correlation between measured and predicted antibody titer than without SNP rs697296. The r-value increased from 0.471 to 0.473 through the effect of only one additional SNP. For further details, see chapter 5.

A Manhattan plot provides an overview of the genome-wide p-values obtained from the GWAS. The genomic position of each SNP across all chromosomes is displayed along the x-axis, and the negative logarithm of the p-value along the y-axis. Each point represents the p-value calculated at the position of one SNP. In Fig. 6.5 on the next page, the top SNP on chromosome 3, which has the lowest p-value, is clearly recognizable.
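How the coordinates of such a plot can be derived is sketched below. This is an illustrative sketch, not the plotting code used for Fig. 6.5: the function name, the toy positions and the base-10 logarithm are assumptions, and the value $10^{-5}$ for the suggestive line is a common convention that is not stated in this section. The two small p-values are the top results from Tables 6.4 and 6.5.

```python
import math

def manhattan_coordinates(results):
    """Convert (chromosome, position, p_value) records into plot coordinates.

    Chromosomes are laid out side by side along the x-axis; the y value is
    -log10(p), so smaller p-values appear higher in the plot.
    """
    # The largest observed position per chromosome serves as its plotted length.
    chrom_length = {}
    for chrom, pos, _ in results:
        chrom_length[chrom] = max(chrom_length.get(chrom, 0), pos)

    # Cumulative offset at which each chromosome starts on the x-axis.
    offsets, running = {}, 0
    for chrom in sorted(chrom_length):
        offsets[chrom] = running
        running += chrom_length[chrom]

    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]

SUGGESTIVE = -math.log10(1e-5)   # assumed value for the suggestive line
GENOME_WIDE = -math.log10(5e-8)  # genome-wide threshold named in the text

# Hypothetical positions; only the two small p-values come from the tables.
toy = [(1, 100, 0.2), (1, 500, 1e-3), (3, 50, 4.141e-7), (16, 80, 3.515e-7)]
points = manhattan_coordinates(toy)
```

In this toy example both top SNPs lie above the suggestive line but below the genome-wide line, which matches the statement that no SNP reached genome-wide significance.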


Figure 6.4: Scatter plot of the MDS analysis showing the components C1 to C5.

Another way of interpreting the distribution of p-values in our study is the quantile-quantile plot (QQ plot), in which the distributions of two variables are plotted against each other for comparison. If all points lie along the diagonal, the two distributions are equal. Here, the observed p-values of our data are compared with the p-values expected under the null hypothesis, both on the negative logarithmic scale, as shown in Fig. 6.6 on page 53. Slight deviations can be accepted, but approximate agreement between observed and expected p-values is an important requirement for continuing the study.
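The comparison of expected and observed p-values can be sketched as follows. The sketch assumes that p-values are uniformly distributed on (0, 1) under the null hypothesis and uses (i − 0.5)/n as the expected value of the i-th smallest of n p-values, which is one common convention among several. The function name and the toy data are illustrative, and p-values of exactly zero are not handled.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ plot.

    Under the null hypothesis p-values are uniform on (0, 1), so the i-th
    smallest of n p-values is expected near (i - 0.5) / n.
    """
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((i - 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed, start=1)
    ]

# Toy p-values that follow the uniform expectation exactly.
uniform = [(i - 0.5) / 10 for i in range(1, 11)]
pairs = qq_points(uniform)

# Every point of this toy example lies on the diagonal.
on_diagonal = all(abs(expected - obs) < 1e-12 for expected, obs in pairs)
```

Points above the diagonal at the upper end of such a plot would correspond to SNPs with smaller p-values than expected by chance.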

Having carefully prepared our data and still not found any significant SNPs correlating with our phenotype, we searched for more advanced techniques that improve prediction beyond the single-SNP results and also include possible interactions. One technique that allows this kind of analysis is machine learning with support vector machines (SVMs). We continued our project with the intention of creating an SVM model that takes SNP interactions into account, see chapter 5.

Figure 6.5: Manhattan plot of the TUM 1 dataset with the p-values obtained from the GWAS. The green line indicates suggestive significance, the red line genome-wide significance.


Figure 6.6: QQ plot of the TUM 1 dataset.