Approaches of MSA Algorithms - Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enr

According to Pevsner (2009), five different approaches to multiple sequence alignment exist. These approaches are discussed briefly below. Other authors, such as Chuong B. Do and Katoh (2008), divide the approaches into different groups. Some algorithms can be assigned to more than one group, so a strict separation is nearly impossible. Additionally, many of these algorithms are variants of one another: the basic ideas are the same, but their realization differs.

2.4.1 Exact Approach

In principle, the optimal alignment for any set of sequences could be computed with dynamic programming. In practice, however, such an exact approach is not feasible in terms of time or space for more complex problems (cf. Michael S. Waterman, 1995). As stated already, the required computational time is O(2^N L^N) for N sequences of length L. Therefore, these approaches can only be used for a small number of sequences. An exact computation of smaller problems is possible, but to obtain a biologically and evolutionarily meaningful alignment, longer sequences are necessary. Only if these preconditions are met can real similarities or differences in sequence, structure or function be derived (cf. Gusfield, 1997). Carrillo and Lipman (1988) described this problem and introduced the so-called Carrillo-Lipman bound. This is why heuristics are used, for example in the so-called progressive strategies, but these heuristic approaches cannot guarantee to find an optimal solution (cf. Michael S. Waterman, 1995).
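To get a feeling for how quickly the O(2^N L^N) bound grows, the following minimal sketch counts the work of a full dynamic-programming lattice. The function name `dp_cost` and the exact counting convention (a lattice of (L+1)^N cells, each with up to 2^N - 1 predecessor cells) are assumptions made for illustration only.

```python
def dp_cost(num_sequences: int, length: int) -> int:
    """Rough operation count for exact MSA by dynamic programming.

    The lattice has (length + 1) ** num_sequences cells, and each cell looks
    at up to 2 ** num_sequences - 1 predecessor cells (every non-empty subset
    of the sequences can advance by one position).
    """
    cells = (length + 1) ** num_sequences
    predecessors = 2 ** num_sequences - 1
    return cells * predecessors


if __name__ == "__main__":
    # Illustrative only: sequences of length 100.
    for n in (2, 3, 5, 10):
        print(n, dp_cost(n, 100))
```

Even for sequences of moderate length, every additional sequence multiplies the count by roughly 2L, which is why the exact approach is limited to very few sequences.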

2.4.2 Progressive Approach

One of the oldest and most widely used methods for MSA is the so-called progressive approach. The basic concept was first introduced by Hogeweg and Hesper (1984) and refined by Feng and Doolittle (1987). It is an extension of the pairwise alignment algorithm, and all progressive solutions have a few basic steps in common.

The first step is to calculate pairwise sequence alignment scores between all the sequences, which are saved in a distance matrix. From this matrix, a first phylogenetic tree (the so-called "guide tree") or phylogenetic star (the so-called "guide star") is calculated by cluster analysis (for example with the Neighbor-Joining algorithm or UPGMA). Depending on the chosen grouping method, the result is referred to as a star alignment or a tree alignment. Once such a guide tree or star is available, the algorithm starts with the most similar sequence pair and, as the name implies, progressively aligns one sequence per step with the existing alignment. The new sequence to be added is chosen by means of the guide tree or guide star. The algorithm stops after adding the last sequence to the alignment (cf. Thompson, Plewniak, and Poch, 1999a and Robert C. Edgar, 2004b).

An itemized overview of those algorithms can be given as follows (Mount, 2004):

1. Compute the pairwise alignments for all sequences with dynamic programming (more exact) or heuristics (faster)

2. Create a phylogenetic guide tree or guide star

3. Use the guide tree or star to identify the next sequence (take the closest sequence)

4. Align the new sequence to each of the previous sequences

5. Return to step 3 or stop
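The first three steps of this overview can be sketched as follows. This is a simplified illustration, not the procedure of any particular tool: the scoring parameters, the conversion from scores to distances, and the helper names `nw_score`, `distance_matrix` and `addition_order` are all assumptions, and the "guide" is reduced to a greedy nearest-neighbour order instead of a real Neighbor-Joining or UPGMA tree.

```python
from itertools import combinations


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Step 1: turn pairwise scores into distances (higher score, smaller distance)."""
    dist = {}
    for i, j in combinations(range(len(seqs)), 2):
        upper = max(len(seqs[i]), len(seqs[j]))  # bound on the score with match=1
        dist[i, j] = dist[j, i] = upper - nw_score(seqs[i], seqs[j])
    return dist


def addition_order(seqs):
    """Steps 2-3, simplified: start with the closest pair, then keep adding
    the sequence that is closest to any sequence already chosen."""
    dist = distance_matrix(seqs)
    first = min(combinations(range(len(seqs)), 2), key=lambda pair: dist[pair])
    order = list(first)
    remaining = set(range(len(seqs))) - set(order)
    while remaining:
        nxt = min(remaining, key=lambda r: min(dist[r, c] for c in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For the toy input `["ACGT", "ACGA", "TTTT", "ACGT"]`, the two identical sequences are joined first and the most divergent sequence `"TTTT"` is added last. Step 4, the actual alignment of each new sequence to the growing alignment, is deliberately left out of this sketch.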

This one-by-one technique has the advantage of being very fast (even faster than the adaptation of pairwise alignments to multiple sequences, which does not scale to a higher number of sequences and becomes very slow), combined with reasonable sensitivity. Another advantage of such an approach is the (relative) simplicity of implementing such algorithms. Last but not least, the number of gaps is kept within limits because the first alignment starts with the most similar sequences. This is a necessary property, because gaps cannot be removed again by this method (cf. Robert C. Edgar, 2004b). In general, progressive alignment methods are compromises between required time and accuracy (cf. Thompson et al., 1999a).

Progressive algorithms also have many disadvantages. A major one is the reliance on a good alignment of the first two sequences. If there are errors in this first alignment, those errors are propagated throughout the rest of the MSA (cf. Thompson et al., 1999a). In addition, if a gap is included, it is penalized in steps one and four, which results in an over-alignment (in other words, the output consists of groups without gaps and blocks with many gaps) (cf. Robert C. Edgar, 2004b). The next disadvantage is that these algorithms are greedy and do not guarantee an optimal solution, that is, the most accurate alignment. Additionally, the whole alignment changes if the order of the sequences is changed; the results strongly depend on the initial order. Another major disadvantage is that the guide tree is only an approximation of the real phylogenetic tree, because the real tree is often unknown (cf. Thompson et al., 1999a).

Attempts to overcome these disadvantages are semi-progressive approaches (e.g. in PSAlign (cf. Sze, Lu, & Yang, 2006)) or iterative algorithms. A few further extensions of the progressive approach exist to counteract those problems, such as a weight function (e.g. in ClustalW) (cf. Thompson, Higgins, & Gibson, 1994).

2.4.3 Iterative Approach

Barton and Sternberg (1987) presented the first iterative concept. The basic idea behind iterative algorithms is the assumption that an already existing suboptimal solution can be modified in some way so that the resulting solution is optimal. Iterative approaches first compute an alignment with a progressive alignment strategy (which yields a non-optimal solution, as described in the previous section). Afterwards, the alignment is modified until the solution converges and no further improvement is possible. This corrects one of the disadvantages of progressive alignment strategies, namely the dependence on the initial alignment and the propagation of its errors throughout the rest of the MSA (cf. Lassmann & Sonnhammer, 2005a).

The reiteration of the MSA starts with a pairwise realignment of sequences within subgroups and leads to a realignment of the subgroups. The subgroups are selected either randomly or with further information, for example the sequence relations known from the guide tree. Basically, the iterative approach is an optimization method, which can be combined with other (machine learning) approaches, such as genetic algorithms and Hidden Markov Models. Furthermore, iterative algorithms are often subdivided into deterministic methods (e.g. round-robin, or single-type, double-type, or tree-dependent partitioning) and stochastic methods (e.g. simulated annealing or evolutionary algorithms) (cf. Thompson et al., 1999a). The disadvantages of iterative approaches are the computational time and the required memory. Furthermore, one disadvantage is inherited from optimization methods: the process can get trapped in local minima. Last but not least, the parallelization of iterative algorithms remains difficult (cf. Mount, 2004). For those reasons, other algorithms are considered, such as consistency-based or structure-based approaches.
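The refinement loop described above can be illustrated with a very small deterministic hill-climbing sketch. It is not the method of Barton and Sternberg or of any specific program: the sum-of-pairs objective, its parameters, the move set (swapping a gap with a neighbouring residue) and the names `sp_score`, `neighbours` and `refine` are assumptions chosen to keep the example short.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score over all columns; gap-gap pairs score 0."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            else:
                total += match if a == b else mismatch
    return total


def neighbours(alignment):
    """All alignments reachable by swapping one gap with an adjacent residue."""
    for r, row in enumerate(alignment):
        for i in range(len(row) - 1):
            if (row[i] == "-") != (row[i + 1] == "-"):
                swapped = row[:i] + row[i + 1] + row[i] + row[i + 2:]
                yield alignment[:r] + [swapped] + alignment[r + 1:]


def refine(alignment):
    """Apply the best improving move until no move improves the score."""
    current, score = list(alignment), sp_score(alignment)
    while True:
        best = max(neighbours(current), key=sp_score, default=None)
        if best is None or sp_score(best) <= score:
            return current, score  # converged (possibly only a local optimum)
        current, score = best, sp_score(best)
```

Starting from `["AC-GT", "ACG-T", "ACG-T"]`, the loop shifts the misplaced gap in the first row and then stops, because no single move improves the score any further. The stopping rule also shows why such methods can get stuck: an alignment that would need two coordinated moves to improve is never reached.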

2.4.4 Consistency-Based Approach

The goal of consistency-based algorithms is to find the multiple alignment which agrees the most with all possible optimal pairwise alignments. This principle fits well with biological observations. It is solvable with exact methods or heuristics. At its core, the consistency-based approach uses the law of implication (A → B ∧ B → C ⇒ A → C) and turns it around.

For better comprehension, an example is given: three sequences, called x, y and z, are given. A pairwise alignment x − z and a further pairwise alignment z − y are computed. A match between position i of x and position k of z occurs together with a match between position k of z and position j of y (x_i = z_k ∧ z_k = y_j). Then, in a pairwise alignment of x and y, position i in x has to match position j in y (x_i = y_j).

Consistency-based methods use this principle backwards. An alignment of two sequences x and y takes the alignments with z into account (as an indication) in order to guide the pairwise alignment between x and y in the right direction, similar to the single steps in a progressive alignment. These methods adjust the score for a match x_i = y_j and thus incorporate information from a third sequence into the pairwise alignment. Consistency-based alignments follow the pattern "prevention is better than cure" (cf. C. B. Do, Mahabhashyam, Brudno, & Batzoglou, 2005).
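The transitivity argument and its "backwards" use can be made concrete with a small sketch. Pairwise alignments are represented here simply as dictionaries from positions in one sequence to matched positions in the other; this representation, the unit base weight, the `bonus` parameter and the function names are assumptions for illustration and do not reproduce the scoring of any specific consistency-based program.

```python
def implied_matches(x_to_z: dict, z_to_y: dict) -> dict:
    """Compose two pairwise alignments through the intermediate sequence z.

    x_to_z maps a position i in x to its matched position k in z;
    z_to_y maps a position k in z to its matched position j in y.
    The result maps i to j wherever both matches exist.
    """
    return {i: z_to_y[k] for i, k in x_to_z.items() if k in z_to_y}


def consistency_weights(direct_xy: dict, x_to_z: dict, z_to_y: dict, bonus: int = 1) -> dict:
    """Give each directly aligned pair (i, j) weight 1, plus a bonus if the
    alignments through z imply the same match."""
    via_z = implied_matches(x_to_z, z_to_y)
    return {
        (i, j): 1 + (bonus if via_z.get(i) == j else 0)
        for i, j in direct_xy.items()
    }
```

With `x_to_z = {0: 0, 1: 1, 2: 3}` and `z_to_y = {0: 0, 1: 2, 3: 3}`, the implied matches are `{0: 0, 1: 2, 2: 3}`. A direct alignment `{0: 0, 1: 1, 2: 3}` then receives the higher weight for the pairs (0, 0) and (2, 3), which z supports, but not for (1, 1), which z contradicts.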

The consistency-based approach also has a disadvantage. As long as all pairwise alignments are done properly, the theoretical concept works well. Figure 2.2 shows an example of a good alignment on the left side. But as soon as one of the pairwise alignments exhibits an inconsistency, the underlying idea no longer works. Figure 2.2 shows this problem on the right side.

Figure 2.2: Problems with inconsistencies in consistency-based algorithms (Cédric Notredame, Higgins, & Heringa, 2000). [Left panel: Pairwise Alignments 1 (Seq1, Seq2), 2 (Seq1, Seq3) and 3 (Seq2, Seq3) combine into one multiple alignment of Seq1, Seq2, Seq3; fully consistent ⇔ more reliable. Right panel: the candidate multiple alignments conflict with Pairwise Alignment 3 or with Pairwise Alignment 2; partly consistent ⇔ less reliable.]

Kececioglu (1993) recognized this problem and therefore mapped the MSA problem to a graph problem (which is unfortunately also NP-complete) and called it the "Maximum Weight Trace Problem". It is based on an alignment graph, where the nodes represent the symbols of the sequences and the edges represent pairwise alignments. These edges are weighted by quality. To find a fitting alignment, a subset of edges with maximum weight is needed that still forms a meaningful trace (which means it can be realized as an alignment without conflicts).

2.4.5 Structure-Based Approach

Structures diverge more slowly than sequences (cf. Chothia & Lesk, 1986). This biological effect is used in structure-based approaches. Such alignments are usually specific to proteins, but RNA sequences can be aligned with this approach as well. The algorithms enrich the sequences with information about the secondary and tertiary structure of the proteins (or RNAs). With this structural information, they have a larger amount of position-specific information, which is used for aligning specific positions and which leads to an improved result (cf. Simossis & Heringa, 2005). R. F. Smith and T. F. Smith (1992) were among the first to recognize this approach.

The inclusion of structural information is de facto the only feature which all state-of-the-art algorithms have in common. All other aspects depend on the realizations and implementations. For example, the point at which the information is incorporated differs considerably among the particular algorithms. Some programs add the information in a first step (for example 3D-Coffee), whereas others enrich the sequences after the first step of the alignment (like CE) (cf. Shindyalov & Bourne, 1998). Some algorithms apply the structural information during the computation of the alignment (such as 3D-Coffee/Expresso or PRALINE, Simossis and Heringa, 2005, see sections 3.4.2 and 3.5), whereas other algorithms compute the alignment without this structural information and use the information afterwards to optimize the alignment (e.g. CE, Shindyalov and Bourne, 1998). Others again (like MAFFT, Katoh, Misawa, Kuma, and Miyata, 2002, see section 3.5) perform a hybrid approach in which the additional information is not incorporated into the original sequences. Instead, enriched homologous sequences are added to the original set, further alignment steps are done with the additional information (to order those sequences the right way), and at last the additional sequences are removed again and the alignment is computed in a final step (cf. Simossis & Heringa, 2005).

In such approaches, the measures themselves differ. They range from the root-mean-square deviation (RMSD) and modified RMSDs such as iRMSD (intra-molecular RMSD (Armougom, Moretti, Keduas, & Notredame, 2006), as used in the special modes 3D-Coffee and Expresso of T-Coffee (Cedric Notredame, 2016b)) and tRMSD ("a structure based clustering method using the iRMSD to drive the clustering" (Cedric Notredame, 2016b)) to methods which use APDB (O'Sullivan et al., 2003) or a TM-score rotation matrix (used in TM-align, Zhang and Skolnick, 2005).
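As a reference point for the simplest of these measures, the plain RMSD can be sketched as follows. The sketch assumes that both structures are already superimposed and that the i-th coordinates of both lists belong to aligned (equivalent) atoms; the function name `rmsd` and the input format are illustrative assumptions, and the modified variants (iRMSD, tRMSD) are not covered.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumes the structures are already superimposed and that the i-th points
    of both lists are the aligned (equivalent) atoms.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

In practice, an optimal superposition has to be computed before this formula is applied, which is one of the points where structural aligners differ.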

Last but not least, it should be mentioned that structural approaches can be classified by the order of the alignment steps they perform (horizontal-first methods, vertical-first methods, or consensus-first methods) (cf. Ma & Wang, 2014).

Obviously, this field of research is large, which is why this subsection cannot give an exhaustive overview. A thorough and up-to-date compendium (and comparison) of structural alignment algorithms is given in Ma and Wang (2014), where 22 structural aligners are listed and explained.

Alignments which are constructed with structural information have one advantage in common: they are able to detect more distant relationships between sequences, whereas standard procedures are unable to do so. "As a result, the cases that benefit the most are those that evolution has changed so extensively (<30% identity) that the homology (common ancestry) between them is almost undetectable when compared directly." (Simossis & Heringa, 2005)

The biggest disadvantage of this approach is that the corresponding structural information must be available (obtained with X-ray crystallography, NMR spectroscopy, or dual polarisation interferometry (cf. Cross et al., 2003)). Determining such structures is expensive, which amplifies this disadvantage. For this reason, the structures are often not available, and thus the use of secondary or tertiary structure information (and hence of the algorithms as a whole) is in part impossible. In addition, the structure-based alignment divides the scientific community. On the one side are those who consider this approach the non plus ultra, like Y. M. Huang and Bystroff (2006): "So far, the most satisfying way of aligning remote homologues has been to use structural information whenever possible". On the other side are those who are more critical of this development, as the statement of Armougom, Moretti, Keduas, and Notredame (2006) shows: "The use of structural information, however, carries its own peril, and while the sequence analysis community tends to consider structure based alignments as unambiguous and unquestionable gold standards, a closer look reveals a much less clear cut situation."

In document: Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enrico Bonatesta ; Mag. Dipl.-Ing. Christoph Horejš-Kainrath (pages 40-48)