
5. Comparative genomics on single nucleotide level: SARUMAN

5.2. Previous approaches

5.2.1. Seed based approaches

An established and popular approach for efficient approximate string matching is to split the process into two steps. In the first step, small exact matches of substrings of the query and the template are searched using exact string matching techniques; such exact partial hits are called “seeds”. In the second step, these seeds are used as starting points for an extension of the alignment, e.g. using dynamic programming approaches. There are numerous seeding strategies that differ in the number of required seed matches, the use of contiguous or spaced seeds, or the seed length, but most solutions are based on one of the following two principles, or combinations thereof: the q-gram lemma and the pigeonhole principle. The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams, that is, substrings of length q, where $t = \max(|P|, |S|) - q + 1 - q \cdot e$.

This bound follows because a single error can destroy up to q overlapping q-grams, so e errors destroy at most $q \cdot e$ of them. For non-overlapping q-grams one error can destroy only the q-gram in which it is located, which makes the pigeonhole principle applicable. The pigeonhole principle states that, if n objects (errors) are to be allocated to m containers (segments), then at least one container must hold no fewer than $\lceil n/m \rceil$ objects. Similarly, at least one container must hold no more than $\lfloor n/m \rfloor$ objects. If $n < m$, it follows that $\lfloor n/m \rfloor = 0$, which means that at least one container (segment) has to be empty (free of errors). Moreover, if $n < m$ this holds for at least $m - n$ segments.

Seed-based approaches are widely used in all fields of sequence comparison; for example, the most popular alignment tool, BLAST, uses a seed-and-extend algorithm. Consequently, the problem of aligning short sequencing reads to long reference sequences has been addressed by several seed-based alignment tools. In the following, three of the most well-known dedicated short read alignment solutions that use seed-and-extend strategies are presented.
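As an illustration of the q-gram lemma stated above, the following minimal Python sketch counts the q-grams shared by two strings and compares the count with the lower bound t. The function names and the example sequences are hypothetical and chosen only for this illustration; shared q-grams are counted with multiplicity.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Multiset of all overlapping substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(p: str, s: str, q: int) -> int:
    """Number of q-grams common to p and s, counted with multiplicity."""
    return sum((qgrams(p, q) & qgrams(s, q)).values())


def qgram_threshold(p: str, s: str, q: int, e: int) -> int:
    """Lower bound t = max(|P|, |S|) - q + 1 - q*e from the q-gram lemma."""
    return max(len(p), len(s)) - q + 1 - q * e


# Hypothetical example: s differs from p by a single substitution (e = 1).
p = "ACGTTGCAAGCT"
s = "ACGTTGGAAGCT"
q = 3
print(shared_qgrams(p, s, q), ">=", qgram_threshold(p, s, q, e=1))
```

In this example the single substitution destroys exactly q = 3 overlapping q-grams, so the number of shared q-grams meets the bound with equality. A filter built on the lemma would discard any candidate pair whose shared count falls below t.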

5.2.1.1. SHRiMP

SHRiMP, the SHort Read Mapping Package (Rumble et al., 2009), relies on three algorithmic principles: spaced seeds (Califano and Rigoutsos, 1993), q-gram filters (Rasmussen et al., 2006), and an accelerated Smith-Waterman implementation using SIMD techniques. Spaced seeds are a variation of classical exact-matching seeds that allow mismatches at defined positions of the seed (for a good description of spaced seeds see Ilie and Ilie, 2007). Spaced seeds are often represented as strings of “1” and “*”, where “1” denotes a position where a matching base is required, while “*” denotes a wildcard where a match or mismatch is allowed. As an example, the spaced seed “111**1**111” requires matches at positions 1-3, 6, and 9-11; it has a length of 11 and a weight of 7, where the weight is the number of required matches in a spaced seed. Usually several spaced seeds of identical weight are used at the same time; SHRiMP provides several default sets of spaced seeds, which are shown in Table 5.1. Q-gram filters, as introduced by Rasmussen et al. (2006), require multiple seed hits in close proximity to identify possible matches. SHRiMP requires two spaced seed hits per 40 bp window to define a possible match.
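The way a spaced seed such as “111**1**111” tolerates mismatches at its wildcard positions can be sketched in a few lines of Python. This is only an illustration of the seed concept, not SHRiMP's actual implementation; the function names and the sequences are hypothetical, and the window-based q-gram filter (two hits per 40 bp) is not modeled.

```python
def seed_key(window: str, seed: str) -> str:
    """Keep only the bases at the required ('1') positions of the seed."""
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


def seed_weight(seed: str) -> int:
    """Weight of a spaced seed: the number of required match positions."""
    return seed.count("1")


def spaced_seed_hits(read: str, reference: str, seed: str) -> list:
    """All (read_pos, ref_pos) pairs whose windows agree at every '1' position."""
    span = len(seed)
    # Index the reference by the masked content of each window.
    index = {}
    for j in range(len(reference) - span + 1):
        index.setdefault(seed_key(reference[j:j + span], seed), []).append(j)
    # Look up every masked read window in that index.
    hits = []
    for i in range(len(read) - span + 1):
        for j in index.get(seed_key(read[i:i + span], seed), []):
            hits.append((i, j))
    return hits


seed = "111**1**111"
reference = "TTACGTACGTTAGCAA"
# Hypothetical read: equals reference[2:13] except at two wildcard positions.
read = "ACGAACGATAG"
print(seed_weight(seed), spaced_seed_hits(read, reference, seed))
```

Although the read differs from the reference window in two bases, both differences fall on “*” positions, so the seed still reports a hit at reference offset 2. A contiguous seed of the same length would miss it.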

Such possible matches are verified by a vectorized Smith-Waterman implementation comparable to that of Farrar (2007) that can rapidly compute the maximum alignment scores using an SIMD implementation. This SIMD implementation has a three- to five-fold speed-up compared to unvectorized implementations. For candidates with a sufficient alignment score the actual alignments are computed in a final step.

Table 5.1.: Spaced seeds: Default spaced seeds used by SHRiMP for seed weights from ten to twelve. Defaults are available for seed weights up to 18.

weight   default spaced seeds
10       “11111**11111”, “1111**11***1111”, “1111**1**1**1**111”, “111**1***1****1**1111”
11       “1111**1111111”, “11111**11***1111”, “1111**1**1***1**1111”, “111**11**1****1**1**111”
12       “1111*1111*1111”, “1111*111**1****1111”, “1111****11**11*1111”
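The score-only verification step described for SHRiMP can be illustrated with a plain, unvectorized Smith-Waterman recurrence. The sketch below is a scalar Python version for illustration only, not the SIMD implementation used by SHRiMP or Farrar (2007); the scoring values (match 2, mismatch -1, linear gap penalty -2) are assumptions made for this example.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b under a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
```

Only two rows of the matrix are kept, because the maximum score alone is enough to decide whether a candidate deserves a full alignment. A vectorized implementation computes the same values but processes several cells per instruction.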

5.2.1.2. PASS

PASS, a “Program to Align Short Sequences” (Campagna et al., 2009), uses a classical seed-based genome indexing approach in its first algorithmic step: an index of spaced seed words is created for the reference sequence, and reads are scanned for seed words that also occur in the genome index. In a second step PASS tries to extend these initial seeds to full-length hits. For this purpose it uses precomputed score tables (PSTs) for all possible short sequences of a given length aligned to each other with defined alignment metrics. For fast access these PSTs are stored in main memory, allowing the rapid comparison of two sequence fragments flanking an initial seed hit. The execution time therefore rises linearly with the read length, as the number of PST queries depends on the read length; at the same time, the use of PSTs makes the runtime of PASS independent of the number of allowed gaps/errors, as no alignments have to be computed. A drawback of PASS is that the PSTs consume a lot of memory and can be applied only up to a very limited length: a PST for 8 bp sequences needs approximately 4 gigabytes of memory, and a 9 bp PST would need almost 70 gigabytes.
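The memory figures quoted for the PSTs can be reproduced with a short calculation. The sketch below assumes that a PST stores one entry for every ordered pair of sequences over the four-letter DNA alphabet and that each entry occupies one byte; both are assumptions made for this illustration and are not taken from the PASS documentation.

```python
def pst_entries(k: int, alphabet_size: int = 4) -> int:
    """Number of ordered pairs of k-mers, i.e. cells in a full score table."""
    return (alphabet_size ** k) ** 2


for k in (8, 9):
    entries = pst_entries(k)
    # Assumption: one byte per table entry.
    print(f"k={k}: {entries:,} entries, about {entries / 1e9:.1f} GB")
```

Under these assumptions the table grows by a factor of 16 for every additional base, which is consistent with the jump from roughly 4 to almost 70 gigabytes and explains why PSTs are only practical for very short fragments.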

5.2.1.3. mrFAST & mrsFAST

Two further seed-based short read mapping algorithms are mrFAST (Alkan et al., 2009) and mrsFAST (Hach et al., 2010), where mrsFAST supports only substitutions, while mrFAST also supports indels. Both methods use a classical seed-and-extend approach. Based on the number of allowed errors e, each read is partitioned into $\lceil \mathrm{readlength}/(e+1) \rceil$ non-overlapping segments of length k, called k-mers. An index of all such k-mers in all reads is generated, and a second index is created for all overlapping k-mers found in the reference sequence, storing the positions of the k-mers within the reference.

By the pigeonhole principle there has to be at least one k-mer from a read that has a counterpart in the reference if this read has a match in the reference sequence with not more than e errors. By comparing the read and the reference index, such seed pairs can be found and extended to full alignment hits in a final step. While this algorithm is a standard approach in string matching, the novelty of mrFAST and mrsFAST is their implementation as a “cache-oblivious” algorithm: both tools use a recursive divide-and-conquer technique to split the compute-intensive all-against-all comparison of the two indices into smaller sub-problems that fit into the CPU cache. This results in a significantly more efficient execution of the calculation.
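The pigeonhole-based seed-and-extend idea described above can be sketched in Python as follows. This is a simplified illustration restricted to substitutions (as in mrsFAST) and does not reproduce the cache-oblivious index comparison of the actual tools. All function names and sequences are hypothetical, and the sketch assumes a fixed k-mer length k chosen so that a read contains more than e non-overlapping k-mers.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every overlapping k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equally long strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str, index: dict, k: int, e: int) -> list:
    """Reference start positions where the read fits with at most e substitutions."""
    n_seeds = len(read) // k
    if n_seeds <= e:
        raise ValueError("pigeonhole argument needs more than e non-overlapping k-mers")
    hits = set()
    for s in range(n_seeds):
        offset = s * k
        # Seed step: exact lookup of one non-overlapping k-mer of the read.
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                # Extension step: verify the full read at the implied position.
                if hamming(read, reference[start:start + len(read)]) <= e:
                    hits.add(start)
    return sorted(hits)


reference = "TTGACCAGTACGGATCCATG"
read = "CCAGTTCGGATC"  # reference[4:16] with one substitution
index = build_index(reference, k=4)
print(map_read(read, reference, index, k=4, e=1))
```

With e = 1 and three non-overlapping 4-mers, at least two seeds of the example read are error-free, so the true position is found through an exact lookup and confirmed by the verification step. With e = 0 the same candidate is rejected during verification.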