• Keine Ergebnisse gefunden

Full-text minute-space index

5. Comparative genomics on single nucleotide level: SARUMAN 89

5.2. Previous approaches

5.2.3. Full-text minute-space index

As mentioned, the methods for the creation and usage of a BWT described in the previous section represent a naive approach, e.g., one obvious optimization is that the first column has not to be stored actually, but it can be substituted if one counts the number of occurrences of each character within Gduring the creation of the BWT. As the first row is lexicographically sorted this information is sufficient to virtually recreate the first column.

The theoretical basis for the efficient use of Burrows-Wheeler transformations for indexing and compression of text and pattern search was introduced by Ferragina and Manzini (Ferragina and Manzini, 2000), who described a compressed BWT that uses only O(Hk(G) +o(1)) bits of memory per input symbol, where Hk is the k-th order empirical entropy of the text. This compression allows to keep an index of human genome size (approx. 3 Gb) in only 1.3 gigabytes of memory and can be created in linear time. In 2005 Ferragina and Manzini proposed the so-called “full-text minute-space index” (FM index) (Ferragina and Manzini, 2005).

The FM index is a compressed full-text substring index based on the BWT, it allows fast text search in the compressed index. It allows to identify the number

98 Chapter 5. Comparative genomics on single nucleotide level: SARUMAN

Figure 5.5.: Schematic representation of the exact text search with Burrows-Wheeler transformations. A given search string P to be found in a sequence G is processed from the end to the start. The block in the first column (FC) showing the last letter of the search string is the start point of the text search. For each row of this block the last column (LC) is checked if the character is identical to the second to last ofP. For all matching cases, in this example only one after the first itera-tion, the LF mapping is applied to get the next block of FC position and the search step in LC is repeated. This is done until the complete search pattern is processed or there are no valid characters left. If the complete pattern is processed the position of the search string in G can be obtained from a compressed suffix position index that is stored during the creation of the BWT.

as well as the positions of occ occurrences of a text pattern of length v within the compressed text in O(v + logu) time. Thus it is perfectly suited for short read mapping. Since the introduction of the FM index it is extensively studied in the field of bioinformatics and applied by different short read alignment tools. In the spring of 2009 three of the most popular short read mapping tools were published, all based on BWT indices: Bowtie, BWA, and SOAP2. They will be presented in the following paragraphs together with CUSHAW, a recent approach that can use graphics hardware to accelerate the alignments.

5.2.3.1. Bowtie and Bowtie2

Bowtie, published in March 2009 by Langmead et al. (2009), was the first short read matching software to use the FM index developed by Ferragina and Manzini (2005).

Bowtie makes some constraints to achieve a sufficient computing time. If the best

5.2. Previous approaches 99 match for a given query sequence (read) is an inexact match, Bowtie can not guar-antee to find the highest quality alignment (Langmead et al., 2009). Furthermore in default settings Bowtie reports only one alignment per read. If all alignments are required Bowtie gets significantly slower. The biggest drawback of Bowtie is that it does not support insertions or deletions (indels) within the alignments. This makes the tool unsuitable for all short read mapping applications where accuracy is crucial, e.g., in re-sequencing where frame-shifts would be lost.

To overcome this limitation, Langmead and Salzberg (2012) developed Bowtie2.

Bowtie2 combines the BWT-based FM index with a seed-and-extend strategy. The fast and memory efficient FM index is used to find short exact matches in a seeding step. Ungapped seed alignments found by the FM index approach are prioritized, and good scoring seed alignments are extended into full alignments by a dynamic programming approach in an SIMD implementation. Due to this dynamic pro-gramming step the accuracy of Bowtie2 is significantly higher compared to the predecessor.

5.2.3.2. SOAP2

SOAP2 (Li et al., 2009b), the successor of the “Short Oligonucloetide Alignment Program” SOAP (Li et al., 2008), uses a BWT compressed index as well. The algorithmic strategy of SOAP2 includes an indexing strategy that creates a hash table of the suffixes of the BWT index with a given seed size, e.g. with 413 = 226 entries for 13mers. This double indexing strategy allows the rapid identification of match positions for substrings of the given length and of exact read matches in general. For inexact matching that allows mismatches SOAP2 uses a “split-read strategy”. This strategy splits the a read into e + 1 segments if e mismatches are allowed, such that according to the pigeonhole principle it is guaranteed that at least one segment has an exact match. For non-matching segments the BWT decoding steps have to be executed.

Due to its double indexing approach SOAP2 is one of the fastest short read aligners, but at the same time the method is not suited for the detection of indels in the reads.

Although Li et al. (2009b) state that the split-read strategy works for mismatches as well as indels, in reality SOAP2 misses the vast majority of reads with indels (see evaluation in Section 5.3.3.2). Thus, SOAP2 obviously sacrifices some accuracy to gain a higher alignment speed.

SOAP3 as the third version of SOAP was published in 2012 (Liu et al., 2012a) and further accelerates the SOAP2 algorithm by the usage of GPU programming. This way SOAP3 achieves a 5-fold to 28-fold speedup above SOAP2 (according to the evaluation in (Liu et al., 2012a)). Unfortunately, the provided binaries of SOAP3 could not be executed on three different test systems, thus it could not be included in the evaluation.

100 Chapter 5. Comparative genomics on single nucleotide level: SARUMAN 5.2.3.3. BWA

BWA (Li and Durbin, 2009) is another BWT based short read mapping approach that uses an FM-index-like data structure. The unique selling point of BWA is that it has a much better support for reads with indels than Bowtie or SOAP2. BWA uses a heap-like data structure to mimic a breadth-first-search in a prefix trie to find all hits with a given maximal number of differences. The algorithm described in (Li and Durbin, 2009) guarantees to find all alignment positions of a read within a reference up to a defined error threshold, but the BWA developers made various modifications to it. These modifications brought a further speed-up, but at the same time added a heuristic aspect. To limit the search space of this trie search an array D is calculated that stores an estimated lower bound of differences up to the defined position. Based on this distance array certain paths in the prefix trie can be omitted. Furthermore only a limited number of mismatches is allowed in a defined seed sequence at the beginning of the read (default 32 bp).

The described strategy allows BWA to map short reads to a reference sequence with high speed and accuracy, and it is the first BWT-based short read matching application that has a proper indel support. But due to the described speed-up efforts the completeness of the underlying algorithm is forgone.

5.2.3.4. CUSHAW and CUSHAW2

Another recently published BWT-based short read aligner is CUSHAW (Liu et al., 2012b). CUSHAW again is based on the FM index data structure as proposed by Ferragina and Manzini, but the short read mapping is accelerated using graphics hardware via the CUDA framework. The algorithmic approach of CUSHAW is quite similar to BWA, but instead of using a breadth-first-search (BFS) on a prefix trie, CUSHAW uses a depth-first-search (DFS) as a BFS would be too memory consuming to be executed on a graphics card. Furthermore, instead of a prefix trie a complete four-ary tree of all permutations of the read sequence is used. Like BWA, CUSHAW uses a bounded search approach to minimize the search space, but the algorithm in its present form does not support insertions or deletions. To support a more exact alignment CUSHAW2 is available from the same developer team. CUSHAW2 does not support GPU usage via CUDA, but instead supports gapped alignments. Both versions will be evaluated in Section 5.3.3.2.

5.3. SARUMAN: Semiglobal Alignment of short