• Keine Ergebnisse gefunden

MIPseq of a Case-Control Cohort Using the Rebalanced MIPs Pool

Im Dokument Genetics of Restless Legs Syndrome  (Seite 64-68)

3 Methods

3.2 Targeted Sequencing of RLS Candidate Genes

3.2.4 MIPseq of a Case-Control Cohort Using the Rebalanced MIPs Pool

Seqpower [411] and a European population model [412, 413] were used to evaluate the power of gene level association tests with MIPseq data. A variety of scenarios was set up for an analytical power estimation that compared the cumulative MAF between cases and controls (sample size of 1,500 or 10,000). The parameters were chosen as follows: The prevalence of RLS, 5% to 10% [87], was set as baseline effect. The proportion of causal detrimental variants was set to 75% or 100% as a MEIS1 study observed 75% of 13 non-synonymous variants causing a phenotype in zebra fish [267].

The results from the same study and Manolio et al. [309] hinted to an OR of 3 to 4 as an upper bound for the effect size of detrimental variants. The lower bound was set to 1. The proportion of missing (non-sequenced and non-callable) sites was set to 25%. The proportion of missing genotypes was set to 1%. This was estimated by the application of a SNP call rate QC [59] (violation of HWE at p

≤ 0.001, missingness ≥ 20%, 10X coverage) using the HiSeq 2500 data of the rebalanced MIPs pool.

The proportion of non-causal variants was set to 25% (randomly assigned, 100 replications). The power analysis was based on variants with MAF ≤ 5% and genes having at least 2 variants. The proportion of false genotype calls was set to 1%, which was based on general concordance observations in Illumina NGS applications (IHG NGS facility, personal communication, 2015). No protective variants were considered and all other parameters were used with standard settings [414].

3.2.4.2 Capacity Estimation for HiSeq 4000 Sequencers

The HiSeq 2500 data from the rebalancing was used to simulate a 372-plex MIPseq library of cases and controls on a HiSeq 4000 sequencer for a whole range of sequencing lane numbers (see

3.2 Targeted Sequencing of RLS Candidate Genes

appendix with R [4] functions) (clusters per lane on a HiSeq 4000 = 3,125,000 [415], average proportion of off-target reads = 0.22, 25X coverage as threshold for calling variants). The average number of MIPs’ targets suitable for calling was calculated and the optimum number of lanes determined to fulfill a cost-efficient well powered MIPseq experiment.

3.2.4.3 Preparation and Sequencing of MIPseq Multiplex Libraries

MIPseq libraries were created for a cohort of 704 German RLS cases and 752 KORA [354, 355]

controls. The gDNA was stored in 96-well plates with 88 or 94 samples, respectively. In addition, two QC-control DNA samples (1 male, 1 female) were replicated on each plate. MIP target capture, exonuclease treatment and PCR/qPCR were done as before using the rebalanced MIPs pool, and one case and KORA control plate were processed simultaneously. All participants provided informed written consent.

Post-PCR solutions were pooled by PCR plate with 10 µL per sample, which contained impurities (e.g.

primers, by-products). They were removed using a protocol that was based on a double purification with AMPure XP beads: 400 µL PCR pool were mixed by vortexing with 360 µL of warm (RT) AMPure XP beads suspension (0.9 fold of the PCR product volume) and incubated to bind the DNA to the magnetic beads (RT, 10 min). Then the beads were attached to the wall of the reaction tube using a magnetic rack (RT, 5 min). The liquid phase was discarded. The attached beads were washed twice with 700 µL of fresh 80% ethanol by gently inverting the reaction tube in the magnetic rack (RT, 30 s). The ethanol was removed completely. The attached beads with DNA were dried by leaving the lid open (4 min to 6 min, RT). Then 100 µL low-TE buffer were added and the DNA was eluted by vortexing, followed by incubation on a magnetic rack (1 min, RT). The eluate was purified again by using a similar protocol, but with 90 µL of warm AMPure XP beads suspension for DNA binding and 40 µL of low-TE buffer for elution. (During the balancing of the MIPs pool, the post-PCR pools had a lower volume and the volume of AMPure XP bead suspension was kept at a ratio of 0.9, however, the elution volume was not scaled down accordingly, but was at least 20 µL, to keep the elution efficiency high [416]). The purified pools were run on a Bioanalyzer with Agilent DNA 1000 chips for QC, and they quantified using the QUBIT 2.0 system with the QUBIT dsDNA HS assay kit (1 µL sample) according to manufacturers’ protocols. Four pools (2x RLS cases, 2x KORA [354, 355]

controls) were pooled in equimolar amounts (with respect to the pools’ sample size) into MIPseq libraries. MIPseq libraries were submitted to the Helmholtz Zentrum München IHG NGS facility for sequencing on Illumina HiSeq 4000 machines (6 lanes per MIPseq multiplex library), and the output was preprocessed as described before.

3.2.4.4 Analysis of MIPseq Data from a Large Case-Control Cohort 3.2.4.4.1 Technical Quality Control

If the output from different HiSeq 4000 runs was similar, then the output quantity between MIPseq libraries was to be similar as well. Therefore, demultiplexed reads were counted per MIPseq library and compared (χ2 test, R[4]). Furthermore, the same reads were counted per individual and compared between MIPseq libraries (ANOVA, R [4]).

The called variants were filtered at coverage ≥ 25X, being bi-allelic SNVs and called at least once in the cases, controls or replicated quality control samples (vcftools 0.1.12 [99] and PLINK [59]). The individuals’ missingness of genotypes was calculated for further QC purposes (using PLINK [59]).

For each individual, the MIP performance curve was determined (similar to the balancing and rebalancing curve): The on-target reads were stratified by MIP and counted, and the counts were sorted and then weighted by the total number of the individual’s on-target read counts. The result

3 Methods

was an individual, relative MIPs performance curve (RMPC). The lower and upper quartiles were determined as a comparable representation of its shape (RMPC25 and RMPC75, respectively).

Parameters were assessed that might explain the performance of MIPseq multiplex libraries and possible batch effects. Therefore, several individuals’ parameters were regressed against each other (linear regression or logistic regression, respectively, using R [4]) to estimate the pairwise associations: total number of reads, total number of mapped reads, total number of on-target reads, ratio of on-target reads with respect to total number of reads, proportion of missing genotypes, upper/lower quartiles from the relative MIPs’ performance curves (RMPC25 and RMPC75), and the case-control status. Furthermore, measures of R² were estimated (Spearman’s or Nagelkerke’s R², respectively, using R [4] and the package fmsb v0.5.2, [417]) to assess how much of the variance is explained by the respective factors.

To validate some of the associations, additional tests were performed: First, the RMPCs were further compared between cases and controls. Therefore, the upper/lower quartiles from the relative MIPs’

performance curves (RMPC25 and RMPC75) were stratified by case-control status. Within these groups, the quantiles were classified as “high” or “low” based on the overall median of the respective quantiles, and χ2 tests were applied (R [4]). Second, if the pooling of plates into MIPseq libraries was correct, then each individual should have been equally represented in a MIPseq library, most importantly when stratified by originating from either case or control plates. To test this, the contributions were calculated as proportions of each plate to its MIPseq multiplex library (adjusted for the plates’ samples sizes and based on the demultiplexed reads). The proportions were summed for case and control plates for the respective MIPseq multiplex library, and differences in the resulting two distributions were assessed using a paired t-test. Third, the on-target ratios were compared between cases and controls by a Fligner-Killeen test of homogeneity of variances and a t-test (assuming non-equal variances) (as implemented in R [4]).

For qualitative visualization, for each individual, mean and median MIPs’ targets’ coverages were calculated as ½ of the read counts, based on the MIPs that worked at least once in a case or control sample. They were averaged for cases and controls. Furthermore, for each MIP, the 2.5%, 50% and 97.5% quantiles of target coverage were calculated and plotted for cases, controls, and combined cases and controls.

Pairwise concordances were calculated between replicated QC-control DNA samples using the SNV genotype data (after refiltering for called genotypes ≥ 25X coverage) and R [4].

3.2.4.4.2 Quality Control for Association Analysis of MIPseq Data

Quality control steps were performed using PLINK v1.07 [59], PLINK 2 [42] (v1.90b3.32, [365]), R [4]

and vcftools (0.1.12, [99]).

The called SNVs of at least 25X coverage were used, and cases and controls were kept. A call rate filter of 50% was applied to samples and SNVs. SNVs were further filtered for HWE violation in controls (PLINK, p ≤ 1E-06). The dataset was subset to polymorphic SNV with MAF ≤ 0.05 for analysis purposes (PLINK [59] & R [4]).

To correct for population stratification, genome-wide post-QC SNP data was obtained for cases and controls from Prof. Winkelmann (unpublished, Affimetrix AxiomChip, including 6,228 RLS cases and 10,992 controls) and KORA [354, 355] (cohort S3F3: Omni Express & Omni 2.5 chip with 4,086 controls, cohort S4F4: Affimetrix AxiomChip with 3,788 controls), respectively. The three datasets were subject to a secondary QC to minimize batch effects from genotyping and primary QC: SNVs with a call rate below 98% and HWE violation (p < 0.001) were removed. The datasets were pruned

3.2 Targeted Sequencing of RLS Candidate Genes

to consistent SNPs and merged. After merging, rare SNVs (MAF < 5%), non-autosomal SNVs and SNVs in long-range LD regions [364, 418] were removed and the remaining SNPs LD pruned (PLINK [59] command “—indep –pairwise 50 5 0.2”).

The genotyping chip data and the MIPseq data were pruned to the sample intersection, for which age and sex covariates were available (the better genotyped sample was kept in case of genotyping duplicates).

The concordance was determined [419] between the MIPseq genotype data and the ExomeChip genotype data that was used for the gene level association analysis before (see section 3.1.3, page 26) from the markers and individuals that were shared between both datasets. The non-reference genotype concordance was calculated for each marker and plotted against the MQ value (mapping quality) of the markers reads in the MIPseq data.

Based on the genotyping chip data, pairs of related individuals were determined (PLINK 2 v1.90b3.32 [42, 365], “genome” command, PI_HAT ≥ 0.09375, between third and fourth degree relatives), and the worse genotyped individuals removed. Then the first two principal components were calculated (PC1, PC2) using PLINK 2 v1.90b3.32 [42, 365]. The principal components were used to filter population outliers (more extreme than the mean of PC1 or PC2 ± 3 sd of PC1 or PC2, respectively).

After removal of population outliers and in an iterative loop, the principal components were recalculated and additional outliers were determined and removed until no outlier was left. The last set of principal components was used as a final set of PC covariates for the later association analysis with the remaining individuals.

3.2.4.4.3 Gene Level Association Analysis

For gene level association analysis, the SNV loci of the MIPseq data were assigned to gene sets: The target regions from the MIP design were used as a template (including any exon, 20 bp off any exon for exon-intron boundaries, and 500 bp upstream of any transcriptional start site), and all SNVs from the MIPseq dataset were mapped to the gene sets.

The MIPseq dataset’s individuals were filtered for the worse genotyped related individuals and the population outliers. Monomorphic SNVs were removed. The analysis was restricted to variants with MQ > 10 and to genes with at least two variants.

A SKAT-O [93] analysis was performed on the MIPseq data including age, sex and the first two principal components as covariates. The default small sample size adjustment was applied to SKAT-O [347]. As implemented in the package, monomorphic SNVs and SNVs with high missingness rate (cut-off at 15%) were excluded from the analysis, and further missing genotypes were mean imputed to reduce the inflation of the alpha error due to differential missingness [12]. To weight SNVs in the analysis, a linear weighted kernel was applied. The SNV’s weights were calculated based on the beta density distribution and the MAF (Beta(MAF, 1, 25)²) [92].

3.2.4.4.4 Single Variant Association Analysis

A single variant association test was performed with the data before the removal of related individuals and the PCA-based outlier pruning. The software GMMAT (v.07, [48]) was used, which can adjust for population substructure and cryptic relatedness by a generalized linear mixed model.

Therefore, a genetic relationship matrix was needed. It was calculated based on the SNPs that were used for PCA calculations using GEMMA (v0.94.1, [44]). Using GMMAT [48], a null model was fitted to the case-control status as the phenotype, and age and sex were used as fixed covariates and genetic relations as random effects. A score test was applied to all variants. SNVs with p below 1E-04 were further analyzed with a full regression using GMMAT’s Wald test function [48], and effect size

3 Methods

estimates were obtained. All variants were annotated using the CADD annotation [15] (database 1.3) and dbSNP [403, 420] (human 9606 GRCH37p13, build142). Further annotation was obtained from the Helmholtz Zentrum München NGS pipeline, which was based on dbNSFP3.0a [94].

Im Dokument Genetics of Restless Legs Syndrome  (Seite 64-68)