Variant discovery workflow - Methods and Implementation

2 Methods and Implementation

2.13 Variant discovery workflow

The discovery of the variants in the form of mutation sites and expression difference between two samples (Normal vs. Disease) is conducted by following the workflow depicted in the Figure 3.

The fasta files generated from the RNA-seq forms the input, which then proceeds into the mutation sites analysis workflow (see section 2.13.1) and subsequently into the expression analysis workflow (see section 2.13.2). Final output of this step is the generation of two files: one output file contains the mutation sites and the second one the expression difference specific to the disease state.

Figure 3: Variant discovery workflow

2.13.1 MUTATION SITES ANALYSIS WORKFLOW:

To discover mutation sites in both the nuclear and mitochondrial genomes, specific workflows were designed. In this section a general overview of the variants

workflow is briefly discussed.

INPUT FILES (NORMAL AND DISEASE)

MUTATION SITES ANALYSIS

WORKFLOW EXPRESSION ANALYSIS

WORKFLOW

OUTPUT, DISEASE SPECIFIC MUTATION AND EXPRESSION

EXPRESSION FILES

In each experiment, the workflow (Figure 4) begins with the FASTQ files of each sample as the input file. Reads may be single- or paired-end depending on the design of the experiment. The input file then enters individually into the nuclear genome variant analysis workflow (see section 2.13.1.1) and mitochondrial genome variant analysis workflow (see section 2.13.1.2), the results of which are

subsequently manipulated by several analytical processes. Finally, an output file with the list of mutations and its annotations giving information about the gene harboring the mutation, as well as its potential effect on the protein function is returned.

After obtaining the mutation sites and its annotations for both samples (usually Normal vs. Disease), disease-specific data is extracted by comparing it with the normal data.

FASTQ FILE

NUCLEAR GENOME VARIANT ANALYSIS

WORKFLOW

MITOCHONDRIAL GENOME VARIANT

ANALYSIS WORKFLOW

ANNOTATED VARIANTS FILE

INPUT FILES

ANALYTICAL PROCESSES

OUTPUT FILES

Figure 4: Overview of variants analysis workflow depicting the levels of data handling (input, manipulation and output).

2.13.1.1 NUCLEAR GENOME VARIANT ANALYSIS WORKFLOW:

For each sample in the experiment, identification of variants in RNA-seq data is achieved by following the best practice recommendations provided by GATK ( available at:

https://www.broadinstitute.org/gatk/guide/best-practices?bpm=RNAseq . accessed on July 21, 2015).

Module I. Pre-processing RNA-seq reads:

This module pre-processes raw RNA-seq reads, which includes removal of adapter sequences and quality control. This is achieved by processing the raw RNA-seq reads using Trim Galore (see section 2.4) with selecting default adapter sequence for illumina reads and a quality cutoff score of 20 on phred scale.

The processed FastQ reads move over to module II, and here they are mapped to the reference genome.

Module II. Map to the Reference genome:

This module maps pre-processed RNA-seq reads to the human genome version 19.

This is accomplished by the STAR aligner (see section 2.5.2) using the 2-PASS approach, wherein the first pass (1-PASS) identifies the splice junctions and then guides the second pass (2-PASS) towards the final alignment.

MarkDuplicates (see section 4.3a), a command line tool belonging to the Picard tool (see section 4.3) is used to remove duplicate reads.

The final alignment file in BAM format further pre-processed (module III) for

subsequent variant detection using the variant calling and filtering module (module

Module III. Pre-processing alignment file:

Module III pre-processes the alignment file for variant calling by removing

overhanging reads in intronic regions. This is accomplished by the tool Split’N’Trim (see section 2.8.1) available in the GATK (see section 2.8) package.

Furthermore, the mapping quality value of the alignment file is adjusted using a tool called ReassignOneMappingQualityFilter (see section 2.8.2) offered in the GATK package.

Finally, local realignment around indels is performed (see section 2.8.3) and base score recalibration is done using the tool BaseRecalibrator (see section 2.8.4) provided by GATK package.

The alignment file in BAM format is next submitted to varant calling and filtering (module IV).

Module IV. Variant calling and filtering:

In this module the critical task of variant calling (see section 2.8.5) is performed.

Identified variants are subsequently filtered to obtain the bona fide variants present in a sample.

Filtering of the variants is done using the SNPiR pipeline (see section 2.9). In this step, false calls are removed in repetitive regions of the genome, homopolymer regions, around splice junctions and known RNA editing sites. Finally, one is left with the most likely, genomic SNPs.

Module V. Annotation of filtered variants:

The final annotation of identified variants is achieved by the Oncotator tool (see section 2.10).

2.13.1.2 MITOCHONDRIAL GENOME VARIANT ANALYSIS:

In this section, the prediction of variants in the form of homoplasmic or

heteroplasmic sites in the mitochondrial genome is described. The workflow to decode heteroplasmic sites in mitochondrial genome was adapted from the work of [Goto et al., 2011] and was integrated in a Galaxy workflow [available at:

https://usegalaxy.org/u/aun1/p/heteroplasmy, accessed on July 21, 2015], which is an open, web based platform to do biomedical research [Giardine et al., 2005].

The workflow is divided into several modules as detailed below.

Module I. Pre-processing RNA-seq reads:

This module pre-processes raw RNA-seq reads, which includes removal of adapter sequences and quality control. Trim Galore (see section 2.4) was used with selecting default adapter sequence for illumina reads and a quality cutoff score of 20 on phred scale.

The processed FastQ reads are in the next step mapped to the reference genome (module II).

Module II. Map to the Reference genome:

The filtered reads are mapped to the human mitochondrial revised Cambridge reference sequence [Andrews et al., 1999]. Since the mitochondrial genome is a circular molecule, the genome has to be extended on both ends by the length of the input reads -1. Reads are mapped to this extended genome using the BWA aligner (see section 2.5.1) with the default settings.

The alignment file in BAM format is further used for calling variant sites (module III).

Module III. Segregating the alignment file:

In this module reads mapped to the plus strand are separated from reads mapped to the minus strand of the genome.

Both resulting read files are investigated for variant sites with a minimum coverage of 100 reads and the base quality value of 20. Only variant sites passing this filter are retained for further processing.

Module IV. Applying conservative filter settings:

For mitochondrial variant calling, a conservative variant detection setting of at least 5 % coverage a and base quality value of 20 on the phred scale is applied to call the variants. Variant calling is done for all reads in plus and minus direction separately.

Module V. Calling variant sites:

Variants present in the plus and minus strand files are merged and those sites which are common to both the files are chosen and retained as bona fide variant sites. This process eliminates the variants present on only strand, which are moste likely due to technical issues during sequencing.

The variant sites are further manually analysed for their hetero- and homoplasmic nature. Further annotation information is extracted for the variants by mapping them on MitoWheel (see section 2.11).

2.13.2 EXPRESSION ANALYSIS WORKFLOW:

Differential gene expression analysis for two conditions (Normal and Disease) is done in a workflow described below:

Module I. Pre-processing RNA-seq reads:

Pre-processing of reads is done as described in section 2.13.1.1 Module I,

Module II. Map to the Reference genome:

Pre-processed RNA-seq reads for the 2 conditions to be compared are individually mapped to the reference genome using TopHat2 (see section 2.5.3) with default settings. The alignment files is then submitted to transcriptome assembly and merging (module III).

Module III. Transcriptome assembly and merging:

For both conditions, Cufflinks assembles full length transcripts for all the genes expressed and provides an expression value in form of a metric called FPKM (see section 2.12).

The transcriptomes of both conditions are merged by Cuffmerge (see section 2.12.1) and further annotated using a provided annotation file.

Module IV. Differential expression of genes:

Keeping the merged transcriptomes as background gene models, Cuffdiff (see section 2.12.2) calculates the expression difference between two conditions and provides the statistical significance of the expression change between two samples.

Im Dokument Mitochondria and disease (Seite 57-64)