Multi-Read Alignment

Figure 7.1: A target genome is copied and randomly broken into small fragments. In paired-end sequencing, each fragment is sequenced from both ends. The sequenced regions are called reads and the two reads belonging to the same fragment are called a mate-pair.

the library.

Once a genome has been sequenced, an assembler program is used to reconstruct the target genome from the set of reads. This set can contain thousands or even several millions of sequenced reads. There are two main assembly strategies: (1) de novo assembly and (2) reference-guided assembly.

The term de novo sequence assembly refers to the assembly of a genome from the raw read data without the help of an already sequenced reference genome. Classical de novo genome assemblers follow a three-phase protocol: overlap phase, layout phase and consensus phase. In the overlap phase, every read is compared to every other read, and an overlap graph of the reads is computed from the putative overlaps. The overlap graph contains true overlaps as well as spurious overlaps introduced by sequencing errors, repeats or random alignments. The layout phase identifies a subgraph in the overlap graph that defines a consistent layout of the reads; that is, conflicting overlaps are heuristically resolved. The resulting resolved overlap graph determines an approximate placement of each read. Given such an approximate layout, a multi-read aligner computes the consensus sequence as well as a multi-read alignment. Consequently, the multi-read alignment problem is quite distinct from the alignment problems discussed so far, since it has to deal with a huge number of short reads that overlap only by a few bases.
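The overlap phase can be illustrated with a minimal sketch. It is not the procedure of any particular assembler: it only considers exact suffix-prefix overlaps, whereas real overlappers tolerate sequencing errors and avoid the quadratic all-against-all comparison. The function names and the `min_len` threshold are assumptions made for this illustration.

```python
def longest_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """Compare every read to every other read (hypothetical helper).

    Returns a dict mapping a directed edge (i, j) to the overlap length
    between the suffix of reads[i] and the prefix of reads[j].
    """
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = longest_overlap(a, b, min_len)
            if length:
                edges[(i, j)] = length
    return edges


# Toy reads, invented for this example.
toy_reads = ["GATTGAG", "TGAGACT", "GACTGTA"]
toy_edges = overlap_graph(toy_reads)
```

For the three toy reads, the graph contains the edges (0, 1) and (1, 2), each with an overlap of four bases, which already fixes a unique layout. With repeats or sequencing errors, additional spurious edges would appear and the layout phase would have to resolve them.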

In reference-guided assembly projects, we encounter a very similar multi-read alignment problem. In this scenario, the raw read data is mapped to a close relative whose genomic sequence is already available. For this reason, this kind of assembly is also called template assembly, comparative assembly or, in the case of the same organism, resequencing. The strength of this approach is that the overlap and layout

Figure 7.2: A newly sequenced genome with an unknown insertion with respect to a reference genome. The mapped reads (black lines) can be used to infer the layout of the mate pairs (gray lines). Mate pairs are indicated by arrows pointing to each other and the connecting, dotted line in-between them. From this inferred layout a multi-read alignment can be computed.

phases are unnecessary. The weakness, however, is that we have no placement information for unmapped reads, except possibly mate-pair information. Consequently, one needs to keep the number of unmapped reads small by choosing an appropriate reference genome.

A great variety of tools has been designed specifically for mapping short reads. Examples are MAQ (Li et al., 2008a), SOAP (Li et al., 2008b), Bowtie (Langmead et al., 2009) and RazerS (Weese et al., 2009), the latter from the SeqAn library. Almost all programs use a two-step protocol: (1) a filtration algorithm is applied in order to identify candidate regions that possibly contain a match and (2) these candidate regions are verified for true matches. Filtration methods are based on single (Kent, 2002; Ma et al., 2002) or multiple seeds (Li et al., 2003), the pigeonhole principle (Navarro and Raffinot, 2002; Li et al., 2008a,b), or counting lemmas using (gapped) q-grams (Burkhardt et al., 1999; Rasmussen et al., 2005). Verification methods encompass semi-global alignment algorithms (Myers, 1999) or local alignment algorithms (Smith and Waterman, 1981).
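The two-step protocol can be sketched as follows. This is an illustration only, not the implementation of any of the tools named above. It assumes the pigeonhole principle as the filter (a read with at most k errors, cut into k + 1 pieces, contains at least one piece without an error) and a plain quadratic semi-global edit distance as the verifier. All function names are hypothetical.

```python
def semi_global_distance(read, window):
    """Smallest edit distance between `read` and any substring of `window`."""
    prev = [0] * (len(window) + 1)      # the match may start anywhere for free
    for i, r in enumerate(read, 1):
        cur = [i]                       # read prefix aligned to nothing
        for j, w in enumerate(window, 1):
            cur.append(min(prev[j - 1] + (r != w),  # match or mismatch
                           prev[j] + 1,             # base missing in window
                           cur[j - 1] + 1))         # extra base in window
        prev = cur
    return min(prev)                    # the match may end anywhere for free


def map_read(read, reference, k):
    """Return sorted (begin, end, distance) candidate windows with <= k errors."""
    pieces = k + 1
    if len(read) < pieces:
        raise ValueError("read is too short for k + 1 non-empty pieces")
    size = len(read) // pieces
    candidates = set()
    # Step 1, filtration: exact hits of each piece define candidate regions.
    for p in range(pieces):
        start = p * size
        end = len(read) if p == pieces - 1 else start + size
        seed = read[start:end]
        hit = reference.find(seed)
        while hit != -1:
            begin = max(0, hit - start - k)
            stop = min(len(reference), hit - start + len(read) + k)
            candidates.add((begin, stop))
            hit = reference.find(seed, hit + 1)
    # Step 2, verification: keep only regions that really contain a match.
    matches = []
    for begin, stop in sorted(candidates):
        distance = semi_global_distance(read, reference[begin:stop])
        if distance <= k:
            matches.append((begin, stop, distance))
    return matches
```

For example, mapping the read `CGTTAGGCTAAC` against the toy reference `TTGACCGATAGGCTAACGTTGCA` with k = 1 yields a single verified window with one error. The first piece of the read contains the mismatch and produces no hit, but the second piece matches exactly and triggers the verification.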

Given the final set of mapped reads, however, we can infer the alignment of the reads to one another only indirectly, from their mapping to the reference. This implies that we cannot infer a correct multi-read alignment in novel insertions that are not present in the reference sequence. For small insertions, we might encounter reads bridging the insertion, but for large-scale insertions we can only use anchored mate-pairs where exactly one read of the pair mapped to the reference (see Figure 7.2).

Column    ... 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 ...
Consensus   G  A  T  T  G  A  G  A  C  T  G  T  A  −  C  T  G  A  T  C
←Read1      G  A  T  T  A  A  G  A  C
→Read2         A  T  T  G  A  G  A  C  T  G  T  A  −  C  T  −  A
←Read3               T  G  A  G  −  C  T  G  C  A  T  C  T  G  A  T
←Read4                  G  A  G  A  C  T  G  T  A  −  C  T
→Read5                     A  G  −  C  T  G  C  A  −  C  T  G  A  A  C
→Read6                        G  A  C  T  G  T  A  −  C  T  G  A
→Read7                        G  −  C  T  G  C  A  −  C  T  G  A  T  C

Figure 7.3: A multi-read alignment showing seven reads. The read orientation is indicated by the arrow in front of the read name. The top row shows the consensus sequence. The consensus letter in each column is the most frequent letter, with ties broken arbitrarily. By iterating through the alignment column by column, one can identify sequencing errors in columns 13, 22 and 25, and putative polymorphisms in columns 16 and 20.

Current mate-pair libraries, however, produce mate-pairs with widely varying insert sizes. Libraries with an insert-size deviation of about 10% are the rule rather than the exception.
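The effect of this deviation on the placement of an unmapped mate can be made concrete with a small sketch. It rests on assumptions not stated in the text: the insert size is taken to be the full fragment length, both reads have the same length, and the two reads of a pair point toward each other as in Figure 7.2. The function name and its parameters are hypothetical.

```python
def mate_start_range(anchor_start, read_len, insert_size,
                     rel_dev=0.10, anchor_forward=True):
    """Interval of plausible start positions for the unmapped mate.

    Positions are offsets in the layout relative to the anchored read.
    `rel_dev` is the assumed relative deviation of the insert size.
    """
    shortest = round(insert_size * (1 - rel_dev))
    longest = round(insert_size * (1 + rel_dev))
    if anchor_forward:
        # The anchor opens the fragment; the mate covers its last read_len bases.
        return (anchor_start + shortest - read_len,
                anchor_start + longest - read_len)
    # The anchor closes the fragment; the mate covers its first read_len bases.
    anchor_end = anchor_start + read_len
    return (anchor_end - longest, anchor_end - shortest)
```

With a 36-base forward read anchored at position 1000 and a library insert size of 2000, the call `mate_start_range(1000, 36, 2000)` returns `(2764, 3164)`. A deviation of 10% therefore leaves an uncertainty of 400 bases in the layout position of the mate, far more than the read length itself.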

In summary, we encounter two multi-read alignment scenarios in de novo and reference-guided sequence assembly projects. In the first scenario, the reads have quite accurate layout positions. This corresponds to a situation where, for instance, all reads could be mapped and we are unsure only about small insertions. In the second scenario, however, we have a number of unmapped reads whose positions can be inferred only with the help of mate-pair information. In de novo assembly projects, the accuracy of the read layout largely depends on the assembler's layout module, so we might encounter both situations here. To address both scenarios, we designed, developed and experimentally verified two algorithms for multi-read alignment: a ReAligner algorithm (Anson and Myers, 1997) for accurate layout positions and a robust graph-based multi-read alignment algorithm for inaccurate layout positions. Both algorithms are described in detail in the next two sections.

To conclude this overview, we also want to explain the desired properties of a final multi-read alignment. In contrast to protein alignments, multi-read alignments are only rarely inspected manually. Quite often, researchers process such large-scale alignments automatically with tools that call SNPs, detect genomic variations or separate haplotypes. For this reason, it is highly beneficial if the alignments are so accurate that a simple column-based consensus calling is possible. A very small example highlighting this important property is shown in Figure 7.3. The illustrated multi-read alignment allows the distinction of sequencing errors from true variations.

For instance, the 'A' in column 13, the 'T' in column 22 and the '−' in column 25 are most likely sequencing errors because they are not supported by any other read overlapping this position. Columns 16 and 20, however, are more likely to be true polymorphisms since both variants, 'A/−' and 'C/T', are supported by a number of reads. Note that sequencing errors as well as true polymorphisms can introduce spurious overlaps, which is one of the main obstacles that need to be overcome by a consensus method.
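The column-based reasoning above can be written down as a minimal sketch. It is an illustration under stated assumptions, not the consensus method developed in the following sections. A minority letter seen in exactly one read is reported as a likely sequencing error, and a minority letter seen in at least `min_support` reads as a putative polymorphism. The threshold `min_support`, the function name and the toy alignment are invented for this example, and blanks mark positions not covered by a read.

```python
from collections import Counter


def call_columns(rows, min_support=2):
    """Column-wise majority consensus plus a simple column classification.

    Returns (consensus, error_columns, polymorphic_columns), with 0-based
    column indices. Ties are broken by first occurrence among the rows.
    """
    width = max(len(row) for row in rows)
    consensus, errors, polymorphisms = [], [], []
    for col in range(width):
        counts = Counter(row[col] for row in rows
                         if col < len(row) and row[col] != " ")
        if not counts:
            consensus.append(" ")       # no read covers this column
            continue
        ranked = counts.most_common()
        consensus.append(ranked[0][0])
        minority = [count for _, count in ranked[1:]]
        if any(count >= min_support for count in minority):
            polymorphisms.append(col)
        elif minority:
            errors.append(col)
    return "".join(consensus), errors, polymorphisms


# Toy alignment (not the one in Figure 7.3).
toy_rows = [
    "ACGTACG ",
    "ACGAACGT",
    " CGTCCGT",
    "  GTCCGT",
    "ACGTACGT",
]
toy_result = call_columns(toy_rows)
```

On the toy alignment, the consensus is `ACGTACGT`, column 3 is reported as a likely sequencing error (a single 'A' against four 'T') and column 4 as a putative polymorphism ('A' in three reads, 'C' in two). Such a simple rule only works if the alignment itself is accurate, which is exactly the property demanded of a final multi-read alignment.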