
2. Materials and Methods

3.1. RNA-seq Data Analysis Pipeline

3.1.1. Gene Annotation

3.1.2.7. Alignment Accuracy

Finally, the mapping accuracy of the four alignment tools was tested with the aid of three simulated 100 bp RNA-seq samples (Chapter 2.1.1). For this purpose, four coefficients were calculated: the fraction of reads for which every single base was mapped correctly (match), the fraction of reads that do not align completely correctly but overlap the correct mapping position by at least one base (partly match), the fraction of reads that map to a position other than the true one (no match), and the fraction of reads that could not be aligned at all (not aligned). The performance of the four aligners on each of the three simulated samples is shown in Figure 3.11. In terms of perfectly placed reads, all three split-read aligners perform equally well on the standard in silico datasets (Simulated Read Samples 1 and 2), with consistently more than 93% of reads falling into this category. For the sample with higher polymorphism rates (Simulated Read Sample 3), the differences are more conspicuous: while GEM aligns the largest share of reads (92%) to the correct position, STAR and TopHat lag behind and achieve 88% and 80%, respectively. For expression quantification it is vital that reads can be assigned to the correct gene. Taking the match and partly match categories together, GEM and STAR yield comparable numbers for each of the three samples. Furthermore, the distribution of the number of overlapping bases for partly matched alignments revealed that most of them agree closely with the true mapping position, with an average of 93%, 94% and 93% overlapping bases per partly matched read for STAR, GEM and TopHat, respectively (Figure 3.12). BWA-MEM, in contrast, yields by far the lowest fraction of perfectly placed reads but returns a high number of partly matched ones; on average, however, only 64 of their bases overlap the correct position. TopHat, again, shows the highest fraction of reads that could not be aligned, and STAR reports the fewest misplaced reads.
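The four read categories described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the thesis: it assumes that the true and the reported placement of each read are available as sets of (chromosome, coordinate) pairs, and the names `Position`, `span`, `classify_read` and `category_fractions` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical representation: one (chromosome, coordinate) pair per aligned base.
Position = Tuple[str, int]
CATEGORIES = ("match", "partly match", "no match", "not aligned")


def span(chrom: str, start: int, length: int) -> Set[Position]:
    """Helper for toy data: the positions covered by an ungapped alignment."""
    return {(chrom, start + offset) for offset in range(length)}


def classify_read(truth: Set[Position], aligned: Optional[Set[Position]]) -> str:
    """Assign one read to one of the four categories."""
    if not aligned:                 # None or empty set: the aligner reported nothing
        return "not aligned"
    if aligned == truth:            # every base sits at its true position
        return "match"
    if truth & aligned:             # at least one base overlaps the true placement
        return "partly match"
    return "no match"               # aligned, but entirely elsewhere


def category_fractions(
    pairs: Iterable[Tuple[Set[Position], Optional[Set[Position]]]]
) -> Dict[str, float]:
    """Fraction of reads per category for (truth, aligned) pairs."""
    counts = Counter(classify_read(truth, aligned) for truth, aligned in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to classify")
    return {category: counts[category] / total for category in CATEGORIES}
```

Comparing position sets treats a spliced read like any other read, since the intron simply does not contribute positions. The size of `truth & aligned` is the number of overlapping bases for a partly matched read.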

3. Results

[Figure 3.10: pairwise scatter-plot matrix of gene read counts for BWA-MEM, GEM, STAR and TopHat. Correlation values shown in the upper right panels: 0.986, 0.983, 0.978, 0.995, 0.992, 0.997.]

Figure 3.10.: Comparison of the gene read counts of one sample that was aligned with all four aligners. In the lower left part of the plot, logarithmized read counts are plotted against each other. The upper right part shows the Spearman correlation coefficient for each pair of aligner-specific read count sets, and the diagonal shows the density plots of the logarithmized read counts of each aligner.
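The pairwise comparison summarized in Figure 3.10 can be reproduced in outline as follows. This is a sketch under stated assumptions: the pseudocount of 1 before taking log2 and the function names are illustrative choices, not taken from the thesis, and Spearman's coefficient is computed here as the Pearson correlation of average ranks.

```python
import math
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(counts: Dict[str, List[int]]) -> Dict[Tuple[str, str], float]:
    """Spearman correlation of log2(count + 1) for every pair of aligners."""
    names = list(counts)
    logged = {name: [math.log2(c + 1) for c in counts[name]] for name in names}
    return {
        (a, b): spearman(logged[a], logged[b])
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    }
```

Because the logarithm is monotonic, the Spearman coefficient is the same for raw and logarithmized counts. The transformation matters only for the scatter and density panels.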

Using the numbers illustrated in Figure 3.11, alignment quality can be evaluated in terms of precision and recall, as was done in Lindner and Friedel, 2012[132], where precision and recall are calculated as:

$$\text{precision} = \frac{TP}{TP + FP} \tag{3.1}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{3.2}$$

[Figure 3.11: three bar-chart panels, (a) Simulated Read Sample 1, (b) Simulated Read Sample 2, (c) Simulated Read Sample 3. Axis: fraction of mapped reads (0.00 to 1.00). Categories: match, partly match, no match, not aligned.]

Figure 3.11.: Alignment accuracy of the four tested alignment tools on simulated read data. Reads were grouped into four categories, i.e. match, partly match, no match and not aligned. Results are shown for each data set separately.

[Figure 3.12: three panels, (a) Simulated Read Sample 1, (b) Simulated Read Sample 2, (c) Simulated Read Sample 3. y-axis: cumulative fraction of partly matched reads. Legend (aligner): BWA-MEM, GEM, STAR, TopHat.]

Figure 3.12.: Cumulative distribution of the number of overlapping bases across partly matched reads. The y-axis gives the fraction of partly matched reads with at most x bases overlapping the correct mapping position, where x is the value on the x-axis.
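The cumulative distribution shown in Figure 3.12 can be computed as sketched below. The function names and the default read length of 100 (matching the simulated 100 bp reads) are illustrative assumptions, and the input is taken to be one overlap count per partly matched read.

```python
from bisect import bisect_right
from typing import List, Sequence


def cumulative_fraction(overlaps: Sequence[int], read_length: int = 100) -> List[float]:
    """Entry x is the fraction of reads with at most x overlapping bases."""
    if not overlaps:
        raise ValueError("no partly matched reads given")
    ordered = sorted(overlaps)
    n = len(ordered)
    # bisect_right counts how many sorted values are <= x.
    return [bisect_right(ordered, x) / n for x in range(read_length + 1)]


def mean_overlap(overlaps: Sequence[int]) -> float:
    """Average number of overlapping bases per partly matched read."""
    if not overlaps:
        raise ValueError("no partly matched reads given")
    return sum(overlaps) / len(overlaps)
```

A curve that stays near zero until x approaches the read length indicates that most partly matched reads are almost correctly placed. A curve that rises early indicates many reads with only short overlaps.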

For that purpose, alignments have to be assigned to one of the three categories, namely true positive (TP), false positive (FP) and false negative (FN):

• TP: All alignments that perfectly or partly match the correct mapping position belong to this category. Partly matched alignments were included, first, because assignment of a read to the correct gene is paramount and, second, because of their high agreement with the correct mapping position (Figure 3.12).

• FP: Alignments that were mapped to a wrong position are classified as FP.

• FN: All reads that could not be mapped to the correct position, i.e. unaligned reads as well as incorrectly aligned ones, are included in this category.

Thus, as in Lindner and Friedel, 2012[132], the sum of TP and FN is equal to the total number of analyzed reads, since every read is either placed correctly (TP) or not (FN).
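The category definitions above translate directly into precision and recall. The following sketch assumes the four category counts of one aligner on one sample are given as integers; the function name and the toy numbers in the usage note are hypothetical and are not values from the thesis.

```python
from typing import Tuple


def precision_recall(
    match: int, partly_match: int, no_match: int, not_aligned: int
) -> Tuple[float, float]:
    """Precision and recall from the four read-category counts."""
    tp = match + partly_match      # perfectly or partly correct placements
    fp = no_match                  # aligned, but to a wrong position
    fn = no_match + not_aligned    # every read not placed correctly
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # tp + fn equals the total number of reads
    return precision, recall
```

For example, with 900 matched, 50 partly matched, 20 misplaced and 30 unaligned reads, recall is 950/1000 and precision is 950/970. Misplaced reads count as both FP and FN, so unaligned reads lower only recall, whereas misplaced reads lower both scores.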

Resulting precision and recall values are shown in Table 3.1. Overall, each of the four aligners yields relatively high precision as well as recall scores. However, TopHat achieves lower recall (about 0.95 on Samples 1 and 2 and 0.81 on Sample 3), which is due to its higher number of unaligned reads.

| Sample | BWA-MEM precision | BWA-MEM recall | GEM precision | GEM recall | STAR precision | STAR recall | TopHat precision | TopHat recall |
|---|---|---|---|---|---|---|---|---|
| Sample 1 | 0.9722 | 0.9712 | 0.9789 | 0.9772 | 0.9876 | 0.9844 | 0.9813 | 0.9505 |
| Sample 2 | 0.9692 | 0.9685 | 0.9844 | 0.9829 | 0.9861 | 0.9832 | 0.9803 | 0.9484 |
| Sample 3 | 0.9698 | 0.969 | 0.9868 | 0.9831 | 0.9874 | 0.9825 | 0.9821 | 0.8114 |

Table 3.1.: Performance of the tested alignment tools on the three simulated RNA-seq samples. The table shows the respective precision and recall scores for each of the four aligners.

3.1.3. Quality Control

All QC metrics provided by the pipeline are calculated after the alignment step, separately for each sample, using the alignment information stored in the BAM files (Chapter 2.3.2). Quality metrics for RNA-seq data were evaluated and established in the course of the GEUVADIS RNA-seq project and published in a companion paper by 't Hoen et al., 2013[239]. Some of them, as well as other quality metrics, are implemented in the pipeline presented here. In the following, four of them, namely

i. sequencing depth
ii. exonic rate
iii. rRNA rate
iv. XIST and Y-chromosome gene expression
