• Keine Ergebnisse gefunden

Algorithmic components

7.3 Graph-based Multi-Read Alignment

7.3.1 Algorithmic components

Algorithm 3 ReAlignment loop

1: score=Score(A)

2: repeat

3: score_old =score

4: ReAlign all reads and create A0

5: score=Score(A0)

6: until score≥score_old

or BandedGotoh ) and reintegrates the read in the multi-read alignment. The ac-tual bandwidth b and the used dynamic programming algorithm are congurable parameters. Optionally, the user can specify to include a reference sequence. This sequence is not realigned but used to create the prole stringP.

Pairwise overlap alignments

The assumed input of the algorithm is a set of reads with their estimated begin and end positions. These estimated layout positions can, for example, stem from an assembler's layout module or they are inferred from mate pair information as shown previously in Figure 7.2 on page 91. Especially for the second case, the positions tend to deviate strongly from the true positions depending on the standard deviation of the used paired-end library. If the deviation is small the algorithm can be congured to use the positions of two given reads directly for estimating the alignment diagonal of a banded overlap alignment with ane gap costs (Gotoh, 1982). As mentioned previously, a banded overlap alignment initializes the dynamic programming matrix with zeros and computes only a band of size b around the estimated alignment diagonal wherebis the bandwidth. Besides the accuracy of the layout positions, the choice of b depends on two parameters: the sequencing error rate and the length of the overlap. Hence, the baseline for b is a congurable parameter that is adjusted linearly based on the length of the overlap. Usually, it is unnecessary to compute all possible pairwise overlaps, especially for deep coverage sequencing projects. For that reason we provide a parameter that adjusts how many overlaps are computed per given read. The more error-prone the reads are, the more overlaps one should compute per read. If the deviation is large or in other words, the initial positions are unreliable then we also support a window based computation of pairwise overlaps.

That is, all pairwise overlap alignments of reads lying in a user-dened window are computed with a standard dynamic programming algorithm.

Subsequent to the computation of overlap alignments, we select all overlaps of signicant quality and length. Similar to the bandwidth, both parameters are adapt-able from the command line. The selected overlaps are then subdivided into un-gapped alignments (segment matches) as explained previously in Chapter 6.

Alignment graph construction

We reuse the multiple segment match renement algorithm (Rausch et al., 2008a) introduced in Chapter 6 to subdivide overlapping segment matches into distinct sub-matches so that no segment match begins or ends within another segment match.

We then construct the alignment graph by dening a vertex for each sequence seg-ment and an edge for each segseg-ment match. The weight of this edge depends on the

Figure 7.5: The alignment on the left shows globally related sequences whereas the one on the right shows a simplied multi-read alignment. Note that the direction of the alignment is solely dependent on the edges.

quality of the segment match. Note that the alignment graph is also suitable to represent partially overlapping sequences as shown in Figure 7.5.

Consistency extension

We once again apply the triplet extension (Notredame et al., 2000) to incorporate multiple alignment information in the pairwise edges. However, we do not compute a full triplet extension because in case of hundreds or even thousands of reads this would be too expensive. We limit the triplet extension to a reweighting of the existing edges but we do not insert new edges as described before for the protein multiple alignment.

Graph-based progressive alignment

In the last step, the consistency-enhanced alignment graph is used to progressively align the reads according to a guide tree. This guide tree is constructed from the overlap alignment scores using a sparse distance matrix and the fast UPGMA algo-rithm (Sokal and Michener, 1958). A sparse distance matrix is used instead of an ordinary matrix because for each read only c other reads are expected to overlap wherecis the assembly coverage. The guide tree ensures that the best quality over-laps are aligned rst whereas the dicult and error-prone overover-laps caused by reads with many sequencing errors come in late, when partial alignments along subtrees are already quite large and xed. The progressive alignment itself is completely independent of the nature of the sequences. Given an input alignment graph, it builds a multiple alignment along the guide tree simply by aligning strings of

ver-tices as explained before. There is one notable exception: In case of a multi-read alignment the proles will have only about c vertices at each position where c is again the coverage. Thus, the amount of required memory depends on the coverage and the source sequence length. It is, however, largely independent of the num-ber of reads. This is a key distinction of our method to current multiple sequence alignment tools where the proles grow linearly with the number of sequences. In a nal post-processing step we compute the consensus sequence and convert the nal prole into a multi-read alignment.