Preprocessing - Bioinformatic methods for eukaryotic RNA-Seq-based promoter identification

5.4 Results

5.4.1 Preprocessing

that CpG islands may also be used to distinguish between two promoter classes, since one class exhibits a much higher CpG density than the second one. A CpG analysis was integrated into the bioinformatics pipeline, based on thecpgplottool and part of the Staden bioinformatics software suite [Staden, 1996; Larsen et al., 1992].

It is known that the -1/+1 position has a preference for pyrimidine/purine (PyPu) dinucleotides [Carninci et al., 2006], which, at least in parts is triggered by the INR motif located around this position and exhibits a strong PyPu peak in its centre. These dinucleotide peaks are also captured by an additional filter, together with general statistics concerning nucleotide usage within core the promoter region and beyond, as these pattern might reveal effects of base pair composition on promoter type and shape.

(a) Sequencing quality of the background library. Generally, reads achieve reasonable quality scores of not less than 31, although base 51 includes outliers reaching down to phred 20.

(b) Sequencing quality of the peak library. Reads achieve qualities comparable to the background library. Bases starting at 49 however show decreasing quality values for outliers.

Figure 5.8: Quality overview of sequencing runs for background (5.8a) and peak library (5.8b) produced produced by the FastQC quality control tool. The x-axis shows the base position within the read, the y-axis indicates the phred quality score.

Peak library Background library

Yield 3.43 Gb 1.24 Gb

Read length 51 nt 51 nt

Read length (clipped) 46 nt 48 nt

No. of Reads 111,980,314 44,686,574

Discarded reads (quality filtered) 565,212 418,503

Remaining reads 111,415,102 44,121,362

Table 5.2: Results of the RNA sequencing run conducted on a Illumina HiSeq 2000 mashine. The remaining reads in the last row are the data source for all further analyses.

Illumina sequencing. Indeed, a fraction of the reads showed residual linker pat-terns, whereupon the whole read was removed instead of only removing the linker sequences from the read. A removal of adapter sequences yields very short remain-ing reads due to the initial read length of only 50 nt and was therefore considered infeasible. Adapters were found in 565,212 and 418,503 reads belonging to peak and background library (Table 5.2). These numbers were within the expected range (below 1%) and were removed.

5.4.1.1 Evaluation of different read mapping tools

The basis for all following analyses, TSS identification as well as promoter analyses is a correct mapping of as many reads as possible onto a chosen reference genome.

Only if reads are mapped back to their correct original position a precise location of transcription start sites is possible

miRNA filtering As this work’s focus are protein coding genes synthesised by RNA polymerase II, other non-coding mRNA types, such as miRNA should be fil-tered out prior to further steps. miRBase²³ [Kozomara and Griffiths-Jones, 2014], a central, curated registry of miRNA stemloops and mature miRNA sequences was selected a sources for miRNA reference data. The 20th database release, pub-lished in June 2013, contains 24,521 precursor miRNAs and 30,424 mature miRNA products. In order to maximise filter efficiency, both stemloops and mature se-quences were converted for use as read mapping reference. This mapping resulted in 2,074,740 (1.86%) peak reads and 1,329,483 (3.00%) background reads that pro-duced significant mappings for the miRBase reference. Those reads were excluded from further processing.

23http://www.mirbase.org

Read mapping against CHO-K1 and Chinese hamster references Initially, read mapping was performed with the Bowtie (version 0.12.7) [Langmead et al., 2009]

read mapper. The achieved mapping rates however, were not satisfactory (Figure 5.9). Since the RNA samples used within this work originate from a CHO-K1 cell line, a logical choice for a reference genome was the published CHO-K1 draft genome sequence [Xu et al., 2011]. Howsoever, with trimmed and filtered reads Bowtie was only able to map 47.06 % of peak reads and 13.39 % of all background reads to the reference genome. Low mapping rates as those reported by Bowtie may indicate a general problem in sequencing, such as contaminations, sequencing artefacts or large ratio of spliced RNAs in the sample, which cannot be mapped with simple mapping software such as Bowtie.

Figure 5.9: Graphical representation of the different mapping approaches evalu-ated. CH is used as abbreviation for Chinese hamster, CHO represents the CHO-K1 draft genome. The continuous increase of mapped reads and the decrease of unmapped reads is easily visible from approach to approach. Mapping rates for each tool in format peak CH/peak CHO, background CH/background CHO:

Bowtie: 47.06 %/39.10 %, 28.33 %/13.09 %; TopHat2/Bowtie: 60.29 %/49.09 %, 37.25 %/19.92 %; TopHat/Bowtie2: 63.82 %/53.08 %, 37.77 %/21.57 %; Bowtie2:

80.81 %/76.20 %, 75.02 %/66.72 %.

In a first step a random sample of reads was extracted from both libraries and screened for possible contaminations with BLAST [Altschul et al., 1990] against the nt database. Most of the reads had their best hit within the CHO-K1 genome.

However, a significant amount of reads did not score any hit in CHO-K1 but in

mouse, rat or human instead. No contaminations such as Escherichia coli or PhiX could be confirmed. Further analysis of reads scoring hits in rat and mouse showed perfect hits, suggesting that those reads are indeed from a CHO cell but probably located in non-assembled areas of the CHO genome, which still has draft status.

Since the CH draft genome is now publicly available [Brinkrolf et al., 2013; Lewis et al., 2013] a second read mapping with identical setup was employed to assess the effect on mapping ratios.

A direct comparison between both mappings is given in first section of four columns of Figure 5.9. Although the number of reads successfully mapped in-creases by 8 % to 47.06 % for peak reads and another 15 % increase to 28.33 % is visible for background reads both rates are still below 50 % and require fur-ther improvement to increase the yield of recovered reads. As pointed out earlier, it is possible that a certain amount of reads could not be mapped due to splice sites within the reads. Although the read length of only 50 bp makes this scenario unlikely, read mapping was conducted with TopHat2 (version 2.0.4) [Kim et al., 2013], a splicing site aware transcriptome mapping extension that is based on the Bowtie mapper family, to assess the influence of splice sites on the mapping. Again, CHO-K1 and CH provide the reference indices, while peak and background reads of both libraries were mapped separately. Two setups differing in the underlying Bowtie version (0.12.7 and 2.0.0-beta6) were prepared to evaluate potential im-provements added in Bowtie2. Analysis of TopHat2’s output showed promising results already for the Bowtie backend, yielding about 11.5 % gain for peak and 7.5 % for background. However, differences between both setups do not exceed 4 % for the peak library and 2 % respectively for background library. In a last mapping setup, Bowtie2 was employed using default parameters with the exception of the so called “localmode”. This mode adds the possibility to clip 5’ and/or 3’ end of the read in order to obtain suitable alignments. In some cases, this ability might be able to recover additional, otherwise not mapping reads. Prior to sequencing, the first cDNA strand was synthesised using random N6 primers (see Section 5.2.2).

Indeed, a recent study by Hansen et al. [2010] showed cases of biases in Illumina transcriptome sequencing which were caused by N6 random priming during library preparation. Due to the possibility of clipping non-aligning regions of the read the percentage of mapped reads may increase significantly. The difference between the best TopHat2/Bowtie2 run (Figure 5.9, third group of columns) and the Bowtie2 run with localmode enabled (Figure 5.9, last group) is easily visible and resulted in mapping rates over 80 % for peak reads and 75 % for background reads (given the CH reference), while both mappings with CHO-K1 reference performed about 4 % and 9 % inferior compared to the CH run. Overall, the mapping results seem to verify the assumptions on potentially biased reads produced by the RNA sequenc-ing setup. Further analyses of mappsequenc-ings generated by Bowtie2 showed a significant prevalence for soft clipping events at the 5’ end of the reads (56.38 % vs 18.20 % for background reads, 46.54 % vs 37.75 % for peak reads, based on the CH mapping), which correlates with the 5’ bias suggested by Hansen et al. [2010].

Im Dokument Bioinformatic methods for eukaryotic RNA-Seq-based promoter identification (Seite 118-123)