
Alignment Data Structures

For diverse alignment tasks, ranging from ordinary protein sequence alignments to multi-read alignments of thousands of reads, specialized data structures are required.

The two main reasons are that (1) no single alignment representation fits the needs of all alignment tasks and that (2) different representations allow either more efficient access to different parts of the alignment or more efficient storage of large-scale alignments.

strength of this specialization is the efficient storage of long gaps.

3. The SumlistGaps specialization stores pairs of integers representing the length of a contiguous sequence of non-gaps and the size of that non-gap plus the preceding gap characters.

The details of the Align data structure can be found in the SeqAn book (Gogol-Döring and Reinert, 2009).
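The pair encoding described for the SumlistGaps specialization can be illustrated with a minimal sketch. It is not SeqAn code: the function names `encode_gap_pairs` and `decode_gap_pairs` are hypothetical, and the handling of trailing gaps (an extra pair with an empty non-gap block) is an assumption made only so that the round trip is lossless.

```python
from typing import List, Tuple

GAP = "-"


def encode_gap_pairs(row: str) -> List[Tuple[int, int]]:
    """Encode a gapped row as (non_gap_length, non_gap_length + preceding_gaps) pairs."""
    pairs: List[Tuple[int, int]] = []
    gaps = 0  # gap characters seen since the last non-gap block was closed
    run = 0   # length of the current non-gap block
    for ch in row:
        if ch == GAP:
            if run > 0:
                # A gap ends the current non-gap block: emit its pair.
                pairs.append((run, run + gaps))
                run, gaps = 0, 0
            gaps += 1
        else:
            run += 1
    if run > 0:
        pairs.append((run, run + gaps))
    elif gaps > 0:
        # Assumption of this sketch: trailing gaps become an empty block.
        pairs.append((0, gaps))
    return pairs


def decode_gap_pairs(pairs: List[Tuple[int, int]], sequence: str) -> str:
    """Rebuild the gapped row from the pairs and the ungapped sequence."""
    out = []
    pos = 0
    for non_gap, total in pairs:
        out.append(GAP * (total - non_gap))       # preceding gap characters
        out.append(sequence[pos:pos + non_gap])   # the non-gap block itself
        pos += non_gap
    return "".join(out)


pairs = encode_gap_pairs("--ACG-TT")
restored = decode_gap_pairs(pairs, "ACGTT")
```

In this sketch a gap of any length costs a single integer, which is consistent with the idea that long gaps can be stored compactly.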

4.1.2 Alignment graphs

The alignment graph introduces an additional abstraction layer for the representation of an alignment. Instead of storing the actual aligned sequence characters, as the Align data structure and the FragmentStore do, it represents an alignment of n sequences as an n-partite graph, as shown in Figure 4.1. Vertices represent non-overlapping sequence segments, edges represent ungapped aligned sequence segments, and gaps are implicitly represented by the topology of the graph. For example, the GCT G vertex in Figure 4.1 has no incident edges (degree zero) and is thus aligned to gaps in all other sequences. The alignment graph is a very compact and versatile description of an alignment. Large-scale alignments can be stored efficiently since a long segment is represented by a single vertex. Furthermore, the extent and direction of an alignment are completely defined by the alignment edges. That is, the graph formulation is equally suitable for aligning globally related sequences and for aligning thousands of reads where only subsets are related by mutual overlaps (see Figure 4.2). The properties of an alignment graph G are:

• For a set $S = \{S^0, S^1, \ldots, S^{n-1}\}$ of $n$ sequences, the alignment graph $G = (V = V^0 \cup V^1 \cup \ldots \cup V^{n-1}, E)$ is an $n$-partite graph.

• Each vertex $v^i_p \in V^i$ represents a sequence segment in $S^i$ of arbitrary length. We also say that $v^i_p$ covers all positions of the segment. For instance, $v^i_p$ might cover the sequence segment $S^i_{u_1,u_2} = s^i_{u_1} s^i_{u_1+1} \ldots s^i_{u_2-1}$.

• Every position in $S^i = s^i_0 s^i_1 \ldots s^i_{|S^i|-1}$ is covered by one and only one vertex $v^i_p \in V^i$.

• Three integers are associated with each vertex: (1) the identifier of the sequence it belongs to, (2) the beginning of the segment, and (3) the length of the segment.

Figure 4.1: An alignment graph and the corresponding alignment matrix for three reads. Vertices represent non-overlapping sequence segments, edges represent ungapped aligned sequence segments, and gaps are implicitly represented by the topology of the graph.

Figure 4.2: The alignment on the left shows globally related sequences whereas the one on the right shows a simplified multi-read alignment. The direction of the alignment solely depends on the alignment edges.

• An edge $e = \{v^i_p, v^j_q\} \in E$ with $i \neq j$ indicates that vertex $v^i_p$ can be aligned with vertex $v^j_q$. In other words, the substring of $S^i$ covered by $v^i_p$ can be aligned without gaps to the substring of $S^j$ covered by $v^j_q$.

• The benefit of aligning $v^i_p$ with $v^j_q$ is given by an edge weight $w_e$.
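The properties listed above can be made concrete with a small sketch. This is not the SeqAn alignment graph class: `Vertex`, `AlignmentGraph`, and all method names are hypothetical, and the equal-length check on edges is an assumption derived from the statement that edges represent ungapped aligned segments.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Vertex:
    seq_id: int  # (1) identifier of the sequence the segment belongs to
    begin: int   # (2) beginning of the segment
    length: int  # (3) length of the segment


class AlignmentGraph:
    """Illustrative n-partite alignment graph over a list of sequences."""

    def __init__(self, sequences: List[str]) -> None:
        self.sequences = sequences
        self.vertices: List[Vertex] = []
        self.edges: Dict[FrozenSet[int], float] = {}  # {u, v} -> weight

    def add_vertex(self, seq_id: int, begin: int, length: int) -> int:
        self.vertices.append(Vertex(seq_id, begin, length))
        return len(self.vertices) - 1

    def add_edge(self, u: int, v: int, weight: float) -> None:
        a, b = self.vertices[u], self.vertices[v]
        if a.seq_id == b.seq_id:
            raise ValueError("edges must connect different sequences (i != j)")
        if a.length != b.length:
            # Assumption of this sketch: ungapped segments have equal length.
            raise ValueError("ungapped aligned segments must have equal length")
        self.edges[frozenset((u, v))] = weight

    def degree(self, u: int) -> int:
        return sum(1 for e in self.edges if u in e)

    def label(self, u: int) -> str:
        v = self.vertices[u]
        return self.sequences[v.seq_id][v.begin:v.begin + v.length]

    def covers_all_positions(self) -> bool:
        """Check that every position is covered by exactly one vertex."""
        for seq_id, seq in enumerate(self.sequences):
            segments = sorted((v.begin, v.length)
                              for v in self.vertices if v.seq_id == seq_id)
            pos = 0
            for begin, length in segments:
                if begin != pos:  # hole or overlap
                    return False
                pos += length
            if pos != len(seq):
                return False
        return True


g = AlignmentGraph(["ACGTGCTG", "ACGT"])
v0 = g.add_vertex(0, 0, 4)  # covers "ACGT" in sequence 0
v1 = g.add_vertex(0, 4, 4)  # covers "GCTG" in sequence 0
v2 = g.add_vertex(1, 0, 4)  # covers "ACGT" in sequence 1
g.add_edge(v0, v2, 4.0)
```

Here `v1` has degree zero, so its segment is implicitly aligned to gaps in the other sequence; no gap characters are stored anywhere.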

Besides representing actual alignments, the graph can also be used to store arbitrary match information, as illustrated in Figure 4.3. This is convenient, for instance, for storing multiple overlapping local alignments as computed by the Waterman and Eggert algorithm (Waterman and Eggert, 1987).

Figure 4.3: A general alignment graph of two sequences with weighted match information. Only a subset of the edges can be realized in an alignment.
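Why only a subset of the edges can be realized is easiest to see for two sequences: two ungapped matches fit into the same pairwise alignment only if they neither overlap nor cross. The following sketch checks this condition. The tuple layout `Match` and the function names are hypothetical, and the check is limited to the two-sequence case; it is not the procedure used in the text to select edges.

```python
from itertools import combinations
from typing import Iterable, Tuple

# (begin in sequence 0, begin in sequence 1, segment length, weight)
Match = Tuple[int, int, int, float]


def compatible(a: Match, b: Match) -> bool:
    """True if both ungapped matches can appear in one pairwise alignment."""
    if a[0] > b[0]:
        a, b = b, a
    # a starts first in sequence 0, so it must end before b begins
    # in sequence 0 and in sequence 1 (no overlap, no crossing).
    return a[0] + a[2] <= b[0] and a[1] + a[2] <= b[1]


def is_realizable(matches: Iterable[Match]) -> bool:
    """True if all matches are pairwise compatible."""
    return all(compatible(x, y) for x, y in combinations(list(matches), 2))


colinear = [(0, 0, 3, 3.0), (4, 5, 2, 2.0)]
crossing = colinear + [(6, 1, 2, 1.0)]
```

The list `colinear` can be realized as a whole, whereas `crossing` cannot: its last match lies to the right of the others in the first sequence but to the left of the second match in the second sequence, so at least one edge has to be dropped.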

4.1.3 Fragment store

The FragmentStore alignment data structure targets the large-scale storage of multi-read alignments occurring in de novo sequence assembly and resequencing projects.

It was developed together with David Weese. Its main strength is the efficient storage of the alignment of a short read to a large contig or reference sequence. The FragmentStore uses a number of subclasses to store all the additional alignment information required in such projects, such as mate-pair information and fragment library characteristics. It also supports the storage of alignments using clipped sequences. Before explaining the data structure in depth, we want to illustrate the main features by means of a very small example.

Contig − C TA C − − A C G G − − −

→Read1 C T C A C G − A C G

←Read2 ATA C − − A C a a

→Read3 A C T G A

→Read4 g g C TA C − − A C G G C C T g g

The first row in the multi-read alignment is the putative consensus sequence. Underneath the consensus are four aligned reads. Each aligned read has an orientation, shown as an arrow preceding the read name. Clipped sequence characters are shown in lower-case letters. Not shown in this example are mate pairs, mapping quality information, or multiple contigs. The design of the FragmentStore is database oriented. There is one table, called a store, for each of the required elements, namely a read store, a mate-pair store, a library store, a contig store, an aligned read store, and an annotation store. To link the information in the different tables, each read, mate pair, library, and contig has an associated id. This id is used to index the corresponding table, except for the aligned read store, which has no such index id. Hence, the aligned read store is the only store that can be arbitrarily sorted. This is very convenient, for instance, to efficiently find all reads belonging to a contig or to enumerate all reads in increasing order of their alignment position. Nevertheless, each element of the aligned read store has a unique id. Although this unique id cannot be used as an index into the aligned read store, it can be used to associate additional information with the aligned read, such as a mapping quality or annotation data.

| Category         | Characteristics                                  | Storage        |
| Directed graph   | Edges are directed, e1 = (u, v) ≠ (v, u) = e2    | Adjacency list |
| Undirected graph | Edges are undirected, e = {u, v}                 | Adjacency list |
| Automaton        | Directed edges labeled with characters           | Edge table     |
| WordGraph        | Directed edges labeled with sequences            | Edge table     |
| Tree             | Directed edges with parent links, rooted graph   | Adjacency list |
| HMM              | Hidden Markov model using a directed graph       | Adjacency list |

Table 4.1: Listing of available graph types.
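The database-oriented layout of the FragmentStore described above can be sketched in a few lines. This is a simplified illustration, not the SeqAn FragmentStore interface: the class `FragmentStoreSketch`, its field names, and the explicit `forward` flag are hypothetical, and only the read, contig, and aligned read stores are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlignedRead:
    id: int          # unique id, NOT a position in the aligned read store
    read_id: int     # index into the read store
    contig_id: int   # index into the contig store
    begin_pos: int   # alignment start on the contig
    forward: bool    # orientation (explicit flag is an assumption)


@dataclass
class FragmentStoreSketch:
    read_store: List[str] = field(default_factory=list)      # indexed by read id
    contig_store: List[str] = field(default_factory=list)    # indexed by contig id
    aligned_read_store: List[AlignedRead] = field(default_factory=list)
    mapping_quality: Dict[int, int] = field(default_factory=dict)  # keyed by aligned read id
    _next_aligned_id: int = 0

    def add_read(self, seq: str) -> int:
        self.read_store.append(seq)
        return len(self.read_store) - 1

    def add_contig(self, seq: str) -> int:
        self.contig_store.append(seq)
        return len(self.contig_store) - 1

    def add_alignment(self, read_id: int, contig_id: int,
                      begin_pos: int, forward: bool = True) -> int:
        aligned_id = self._next_aligned_id
        self._next_aligned_id += 1
        self.aligned_read_store.append(
            AlignedRead(aligned_id, read_id, contig_id, begin_pos, forward))
        return aligned_id

    def sort_by_position(self) -> None:
        # Safe to reorder: no other store indexes into this one by position.
        self.aligned_read_store.sort(key=lambda a: (a.contig_id, a.begin_pos))


store = FragmentStoreSketch()
contig = store.add_contig("CTACACGG")
ids = [store.add_alignment(store.add_read(seq), contig, pos)
       for seq, pos in [("ACGG", 4), ("CTAC", 0), ("ACAC", 2)]]
store.mapping_quality[ids[0]] = 37   # extra data attached via the unique id
store.sort_by_position()
```

After sorting, the aligned reads appear in increasing order of their alignment position, while the mapping quality recorded under `ids[0]` still refers to the same aligned read because it is keyed by the unique id rather than by list position.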