• Keine Ergebnisse gefunden

Screening for RLS Associated Rare Variants and Genes Using the Human Exome

Im Dokument Genetics of Restless Legs Syndrome  (Seite 51-55)

3 Methods

3.1 Screening for RLS Associated Rare Variants and Genes Using the Human Exome

To gain new insights into the contribution of rare genetic variants to RLS, rare SNP genotype data was used from the Human Exome BeadChip genotyping array of a German/Austrian case-control cohort. The data was subject to a quality control, a single variant and gene level association analyses.

3.1.1 Quality Control of the Genotyping Data

Raw genotype data was obtained from the Illumina Human Exome BeadChip 12v1_A (“ExomeChip”) from Prof. Winkelmann for RLS cases (unpublished data, of German/Austrian descent) and from KORA [354, 355] and HNR [55, 356] for population based German controls. All participants provided informed written consent.

PLINK [59]v1.07 and R [4]2.15 and 3.0 were used to perform the quality control.

HNR [55, 356] samples’ (2,486 individuals) initial 247,870 markers were called using the CHARGE cluster file 1.0 [357] and the Illumina GenomeStudio V2011.1V (“Genotyping” GenomeStudio Module 1.9.4 and “Illumina Genome Viewer” GenomeStudio Module 1.9.0). Low quality markers were removed [357]. Call rate filters were applied sequentially (individuals’ genotyping rate < 80%, markers’ genotyping rate < 80%, individuals’ genotyping rate < 98%, markers’ genotyping rate

< 98%). A sex check was done based on 126 markers on the Y chromosome (non-PAR) and the respective genotype call rate; a call rate ≥ 50% was defined as a male property.

For RLS samples, a raw genotype dataset was obtained for 3,758 individuals (primary RLS cases) and 80 HapMap [358-363] controls. The CHARGE cluster file [357] version 1.0 was used as before to perform the calling on initial 247,870 markers. Samples and markers were filtered at call rates of 80% and 98%, as before. A sex check was performed based on 126 markers on the Y chromosome at a call rate threshold of 50%, as before.

The genotypes provided by KORA [354, 355] were called as before for initial 2,921 individuals and 237,982 markers, which had been subject to a KORA [354, 355] internal primary QC. Samples and markers were filtered at call rates 80% and 98%, as before. A sex check was provided by the collaborator.

The post-QC datasets were merged. Call rate filters were applied (individuals’ genotyping rate < 98%, markers’ genotyping rate < 98%). An autosomal dataset was obtained.

To check for putative sample contaminations, the data was split at MAF 1%, and a call rate QC was repeated as done before, and the sample-wise heterozygosity was calculated as a ratio of the number of observed heterozygous genotypes and the total number of observed genotypes. Samples were removed that showed an excess heterozygosity (with being more extreme than the mean ± 4 sd) in at least one of the two datasets.

A relatedness check was done: The autosomal dataset was cleaned for suspicious samples from the heterozygosity check, and markers were removed that were not on the Human Exome BeadChip v1.1 [357]. Markers were removed with MAF < 1%, HWE violation in controls (p ≤ 1E-06) and in long range LD regions [364]. The data was pruned for LD (PLINK [59] command “--indep-pairwise 100 5 0.2”). PI_HAT was calculated using PLINK [59], which estimates the pairwise proportion of alleles shared by IBD [59]. Samples were removed, which were overrepresented in all pairs with PI_HAT > 0.075. HapMap [358-363] samples were removed as well as duplicates (PI_HAT > 0.8, the better genotyped duplicate was kept). Related individuals were determined (PI_HAT > 0.09375,

3 Methods

between third and fourth degree relatives) and those were flagged that would lead to a maximum of unrelated individuals after a putative removal and that had a lower genotype call rate (as a secondary criteria).

The genome-wide dataset was cleaned from suspicious samples of the heterozygosity check.

Markers were removed that violated the HWE in controls (p ≤ 1E-07). Samples were removed that showed over-relatedness, were duplicated or were HapMap [358-363] samples (80 HapMap samples). Duplicated markers were removed based on the better respective call-rate. Four markers were removed that were known for having bad variant callings. Samples were removed with missing age, and only polymorphic autosomal markers were kept. This dataset was suitable for further downstream association analyses.

To enable genetic similarity estimations, the dataset was further pruned: Markers were removed with MAF < 0.1%, markers in long range LD regions [364] and markers in LD (PLINK [59] command

“-indep-pairwise 100 5 0.2”). Two MDS coordinates were calculated for visual inspection of population structures using PLINK v1.90b3.32 [42, 59, 365] and only unrelated individuals.

3.1.2 Single Variant Association Analysis

A fast genome-wide mixed model single variant association analysis was done with the association analysis dataset from the quality control section using FaST-LMM v2.06 [37]. To estimate genetic similarities, the dataset for genetic similarity estimations was used during the analysis. The analysis options were set to not normalizing genotypes and to excluding variants in a range 2 Mb around the test SNP from genetic similarity estimations (to avoid proximal contamination [366, 367]). Age and sex were included as covariates. The association signals were validated for p ≤ 0.001 using the same dataset and GMMAT v0.07 [48], and odds ratios were obtained. The genotype calling cluster plots were inspected to exclude false positive association signals.

The FaST-LMM [37, 366, 367] analysis was repeated with conditioning on the lead SNPs from the unconditioned analysis for the published RLS associated genomic regions [245] but only for individuals with non-missing genotypes in the conditioning SNPs.

3.1.3 Gene Level Association Analysis

The disease trait was transformed using GRAMMAR [49]. Therefore, polygenic residuals were obtained from a linear mixed model (R [4] package GenABEL [368] v.1.8-0) that random effects were based on an IBS matrix [49] from the dataset for genetic similarity estimations, and the fixed effects were age and sex. Additional covariates were added for a conditional analysis on the RLS associated genome-wide significant lead SNPs from the single variant analysis, and missing genotype covariates were mean imputed.

For gene level analysis, markers were kept with MAF < 5% in cases or controls from the association analysis dataset. Only genes were considered with more than two markers (based on the Illumina’s annotation file v1 [369, 370]).

The associations were evaluated between the polygenic residuals and the groups of rare variants (gene based grouping) based on empirical p values of SKAT [92], burden [371] and combination tests [92, 346, 372] (10,000, 110,000 or 1,110,000 permutations) using R [4]. To obtain these p values, the SKAT [92] and burden tests (BRV [12], a sum of rare alleles for each individual) were applied to permutations of the polygenic residuals and two respective p values were obtained for each permutation. (A genotype imputation was applied to the test to address differential missingness between cases and controls [12].) Thus, for each gene, two (correlated) vectors of test statistics

3.1 Screening for RLS Associated Rare Variants and Genes Using the Human Exome BeadChip

were obtained (t statistic for the burden test via a linear model [371] and Q statistic for the SKAT [92]). For each type of gene level test, an empirical p value was obtained (Equation 1, Equation 2).

Equation 1

=∑ ≥ + 1

+ 1

pSKAT = empirical p value for SKAT

I Qk ≥ Qo = function assessing SKAT statistics from kth permuted polygenic residuals equal/larger than the SKAT test statistic from the original vector of polygenic residuals

returning 1 or 0 K = number of permutations

Equation 2

345678=∑ 9:; ≥ :;< + 1 + 1

pBURDEN = empirical p value for burden of rare variants

I tk² ≥ to² = function assessing squared burden test statistics from kth permuted polygenic residuals equal/larger than the squared burden test statistic from the original vector of

polygenic residuals returning 1 or 0 K = number of permutations

To obtain empirical p values for combinations of SKAT and burden tests, the minimum-p and Fisher method was applied [346]. The empirical test statistics were used to calculate their rank-based p values, which were combined to a new test statistic using the minimum-p or Fisher method for each permutation, respectively (Equation 3, Equation 4, Equation 5, Equation 6) [346].

Equation 3

, =EFGH

rank Qk = rank of SKAT statistic Q from the kth permutation

Equation 4

345678, =EFGH |: |

rank |tk| = rank of absolute burden test statistic t from the kth permutation

Equation 5

JKL M75, = −2 log9 345678, < − 2log ,

3 Methods

Equation 6

JPQR, = min 345678, , ,

A test statistic was also obtained for the original observed p values from the SKAT and burden test using the Fisher and minimum-p method (Equation 7, Equation 8,) and its statistical significance was assessed using themselves and the results from Equation 5 and Equation 6 as realizations of the null hypothesis, respectively (Equation 9, Equation 10) [346]:

Equation 7

JKL M75= −2 log 345678 − 2log

Equation 8

JPQR = min 345678,

Equation 9

KL M75=∑ 9JKL M75, ≥ JKL M75< + 1 + 1

Equation 10

PQR =∑ 9JPQR, ≥ JPQR< + 1 + 1

In analysis “A” and on a genome-wide scale, the analysis was performed without weights or with CADD score (v1.0) [15] based weighting. Therefore, the raw scores from CADD 1.0 [15] were extracted for all variants of the association analysis dataset and transformed to range from 0 to 1:

the minimum was subtracted from the raw scores and the resulting values were divided by their maximum. Only genes were included in the analysis that had at least two annotated different variants. Primary candidate genes were selected by a p value threshold that was based on the effective number of null hypothesis: For each gene, the smallest p value was obtained from the burden test results of the first 10,000 permutations of the polygenic residuals. The result was a vector of p values. This vector of p values was corrected for multiple testing using the Sidak approach [336] and a range of independent null hypothesis from 1 to the total number of genes (Equation 11).

Equation 11

SSSSST = 1 − 1 −P SSSSTU R

SSSSSTP = vector of p values corrected for multiple testing SSSSTU = vector of p values not corrected for multiple testing

n = number of independent null hypothesis

Im Dokument Genetics of Restless Legs Syndrome  (Seite 51-55)