MUSCLE - Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enrico Bonatesta ; Mag. D

MUSCLE is an iterative MSA algorithm. According to EMBL-EBI (2016a), it is an accurate MSA tool that is suitable for medium-sized alignments and has advantages with proteins. The algorithm is written in C++.

The algorithm can be divided into three stages. The first stage computes a draft progressive alignment, the second stage computes an improved progressive alignment, and the third stage refines the previous alignments. An overview of the algorithm is given in figure 3.1. We will refer to this figure repeatedly while explaining the algorithm in detail.

Figure 3.1: All three stages of the MUSCLE algorithm (Robert C. Edgar, 2004b)

3.3 MUSCLE

3.3.1 Stage 1 – Draft Progressive

Similarity and Distance Measure

In a first step, a similarity and distance measure is computed. In stage 1, no alignments are available yet. Therefore, MUSCLE uses k-mer counting to compute similarity.
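The idea of k-mer counting can be illustrated with a minimal Python sketch. It assumes the similarity is the number of k-mers shared by both sequences divided by the number of k-mers in the shorter sequence, and that a distance is obtained as one minus this similarity. The function names, the default k = 3, and the use of the uncompressed alphabet are assumptions made for this illustration and do not reproduce the MUSCLE source code.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(seq_x, seq_y, k=3):
    """Shared k-mers divided by the number of k-mers in the shorter sequence."""
    shorter = min(len(seq_x), len(seq_y))
    if shorter < k:
        return 0.0  # no k-mer fits into the shorter sequence
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    # A k-mer occurring n times in X and m times in Y is shared min(n, m) times.
    shared = sum(min(n, counts_y[kmer]) for kmer, n in counts_x.items())
    return shared / (shorter - k + 1)


def kmer_distance(seq_x, seq_y, k=3):
    """Assumed conversion of the similarity into a distance."""
    return 1.0 - kmer_similarity(seq_x, seq_y, k)
```

No alignment is needed for this measure, which is why it can serve as a first approximation before any MSA exists.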

Tree construction

A binary tree is constructed either by neighbor-joining or by UPGMA. For UPGMA, three linkage measures are available: average linkage, minimum linkage, and a weighted mixture of minimum and average linkage. Robert C. Edgar (2004b) states that UPGMA delivers slightly better benchmark scores and has lower time complexity. Therefore, UPGMA is used by default. The output of this part of the algorithm is a binary tree TREE1.
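The following Python sketch shows UPGMA with average linkage on a small distance matrix. The representation of the tree as nested tuples, the function name `upgma`, and the input format are assumptions chosen for this illustration. For minimum linkage, the weighted average in the update step would be replaced by the minimum of the two distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a binary tree (nested tuples) by average-linkage clustering.

    names: list of leaf labels
    dist:  dict mapping a pair (a, b) of labels to their distance
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for other in sizes:
            # Average linkage: size-weighted mean of the old distances.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (root,) = sizes
    return root
```

Each pass through the loop joins two clusters, so n sequences yield a binary tree with n - 1 internal nodes.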

Progressive alignment

For aligning profiles, a scoring function must be defined for a pair of profile positions, that is, a pair of MSA columns, analogous to a substitution matrix for residues. A common scoring function, used in ClustalW (Thompson et al., 1994) and MAFFT (Katoh et al., 2002), is the profile sum-of-pairs (PSP) score:

$$\mathrm{PSP}^{xy} = \sum_{i} \sum_{j} f_i^x f_j^y \log \frac{p_{ij}}{p_i p_j} \tag{3.1}$$

This function is the sequence-weighted sum of substitution matrix scores for each pair of letters, selecting one from each column. The variables i and j range over the amino acids, p_i and p_j are the background probabilities of amino acids i and j, and p_ij is the joint probability of i and j being aligned to each other. Finally, f_i^x and f_j^y are the observed frequencies of amino acid i in column x and of amino acid j in column y of the corresponding profiles X and Y.
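As a small worked example, the PSP score of two columns can be computed directly from the double sum over letter pairs. The sketch below assumes a toy two-letter alphabet with invented probabilities. The function name `psp_score` and all numbers are hypothetical and are not taken from a PAM or VTML matrix.

```python
import math


def psp_score(freq_x, freq_y, joint, background):
    """Profile sum-of-pairs score of two alignment columns.

    freq_x, freq_y: letter -> observed frequency in column x or y
    joint:          (i, j) -> joint probability of i aligned with j
    background:     letter -> background probability
    """
    total = 0.0
    for i, f_i in freq_x.items():
        for j, f_j in freq_y.items():
            log_odds = math.log(joint[(i, j)] / (background[i] * background[j]))
            total += f_i * f_j * log_odds
    return total


# Toy probabilities for a two-letter alphabet (illustrative values only).
BACKGROUND = {"A": 0.5, "B": 0.5}
JOINT = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
```

With these toy values, two columns that both contain only the letter A receive a positive score, while a column of A against a column of B receives a negative score.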

MUSCLE implements equation 3.1 based on the 200 PAM matrix and the 240 PAM VTML matrix. Additionally, MUSCLE provides another function, which is called the log-expectation (LE) score. The LE score is derived from the log-average (LA) function, invented by von Öhsen and Zimmer (2001).

Figure 3.2: Simple binary tree for progressive alignment (Robert C. Edgar, 2004a)

The alignment is done at each node of the binary tree in prefix order, that is, the children of a node are aligned before the node itself. Figure 3.2 shows a simple alignment of four sequences. Indels are marked with a shaded background. The output of stage 1 is a first draft multiple alignment, which can be found in the root node. This multiple alignment is referred to as MSA1.
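The order in which the nodes are processed can be sketched in Python as a recursive traversal that handles both children before their parent. The tree is assumed to be given as nested tuples of sequence names. The function `placeholder_align` is a hypothetical stand-in that only records which groups are merged; a real implementation would perform a profile-profile alignment at this point.

```python
def progressive_align(tree, align_profiles):
    """Return the result stored at the root of a nested-tuple guide tree."""
    if isinstance(tree, str):
        return [tree]  # a leaf holds a single, still unaligned sequence
    left, right = tree
    left_profile = progressive_align(left, align_profiles)    # children first
    right_profile = progressive_align(right, align_profiles)
    return align_profiles(left_profile, right_profile)        # then the parent


merge_log = []


def placeholder_align(left_profile, right_profile):
    """Stand-in only: log the merge and pool the sequence names."""
    merge_log.append((tuple(left_profile), tuple(right_profile)))
    return left_profile + right_profile
```

For a tree of four sequences such as `(("s1", "s2"), ("s3", "s4"))`, the two pairs are merged first and the root is merged last, so the root ends up holding all four sequences.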

3.3.2 Stage 2 – Improved Progressive

In the first stage, k-mers were used for measuring similarity and distance, which leads to a suboptimal tree. Now, in the second stage, aligned sequences are available. Hence, a more accurate method for computing similarity and distance is used. Another difference from the first stage is that this stage is iterative.

Similarity measure and distance measure

In stage 1, the similarity was measured by using k-mers as a first approximation. Now, in stage 2, alignments are used to compute similarity. This is done by computing a global alignment to get the fractional identity D.
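A minimal Python sketch of the fractional identity of two aligned sequences follows. It assumes that D is the share of identical residue pairs among all columns in which neither sequence has a gap. The choice of this denominator and the function name are assumptions made for illustration.

```python
def fractional_identity(aligned_s, aligned_t, gap="-"):
    """Share of identical residues among the gap-free aligned positions."""
    if len(aligned_s) != len(aligned_t):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_s, aligned_t) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0  # no residue of s is aligned to a residue of t
    return sum(a == b for a, b in pairs) / len(pairs)
```

For example, the aligned pair `AC-GT` and `ACTGA` has four gap-free columns, three of which are identical.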

Tree construction and tree comparison

The binary tree is constructed in the same way as described in stage 1. The output is a new binary tree TREE2. In case the current tree has a decreased number of changed nodes compared to the previous iteration or to stage 1, a new iteration is started. Otherwise, progressive alignment, as described in the following paragraph, is done. Tree comparison is done in prefix order. The nodes in tree TREE1 are compared to those of TREE2 by using a lookup table with internal node IDs. If the child nodes share the same IDs, the ID from one parent node is assigned to the parent node of the other tree.
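The comparison with a lookup table can be sketched in Python as follows. The sketch assumes that both trees are nested tuples of sequence names, that leaves use their names as IDs, and that an internal node of TREE2 counts as changed when no node of TREE1 has the same pair of child IDs. The function names and this exact definition of a changed node are assumptions made for illustration.

```python
from itertools import count


def build_lookup(tree, table, ids):
    """Give every internal node of TREE1 an ID, keyed by its child IDs."""
    if isinstance(tree, str):
        return tree  # leaves use their sequence name as ID
    left = build_lookup(tree[0], table, ids)
    right = build_lookup(tree[1], table, ids)
    node_id = next(ids)
    table[frozenset((left, right))] = node_id
    return node_id


def changed_nodes(tree, table, changed):
    """Collect the internal nodes of TREE2 that have no counterpart in TREE1."""
    if isinstance(tree, str):
        return tree
    left = changed_nodes(tree[0], table, changed)
    right = changed_nodes(tree[1], table, changed)
    # Children are visited first, so their IDs are known at this point.
    node_id = table.get(frozenset((left, right)))
    if node_id is None:
        changed.append(tree)  # branching order differs below this node
    return node_id
```

Because a changed node returns no ID, every ancestor of a changed node is reported as changed as well, which matches the need to re-align the whole path up to the root.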

Progressive alignment

This step is similar to progressive alignment in stage 1. The only difference is that this step is only done for subtrees whose branching orders changed relative to TREE1.

3.3.3 Stage 3 – Refinement

In this stage, refinement is done by partitioning the tree. Like stage 2, this stage is iterative.

Choice of bipartition

TREE2 from stage 2 is divided into two disjoint subsets by deleting one edge. Regrettably, the authors of MUSCLE do not mention how to choose the edge that should be deleted. After that step, we have two subtrees.
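Deleting an edge splits the sequences into the leaves below the edge and all remaining leaves. The following Python sketch enumerates these bipartitions for a tree given as nested tuples. The order in which the edges are visited is an arbitrary assumption of this sketch, and the function names are hypothetical.

```python
def leaves(tree):
    """List the sequence names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def bipartitions(tree):
    """Yield (inside, outside) leaf lists, one pair per edge of the tree."""
    all_leaves = leaves(tree)

    def walk(node):
        if isinstance(node, str):
            return
        for child in node:
            # Deleting the edge above child separates its leaves from the rest.
            inside = leaves(child)
            outside = [name for name in all_leaves if name not in inside]
            yield inside, outside
            yield from walk(child)

    yield from walk(tree)
```

In a rooted binary tree, the two edges leaving the root produce the same split with the sides swapped, so one of them could be skipped.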

Profile extraction

From the two subtrees, two profiles, that is, two multiple sequence alignments, are extracted, one at each new root node. Columns without residues (only indels and gaps) are discarded. The output of this step is two different profiles X and Y.
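Extracting a profile and discarding its empty columns can be sketched in a few lines of Python. The sketch assumes that the current alignment is stored as a dictionary from sequence name to aligned row and that gaps are written as `-`. The function name `extract_profile` is hypothetical.

```python
def extract_profile(alignment, members, gap="-"):
    """Restrict an alignment to some sequences and drop gap-only columns.

    alignment: dict mapping sequence name -> aligned row (equal lengths)
    members:   names of the sequences that belong to one subtree
    """
    rows = {name: alignment[name] for name in members}
    length = len(next(iter(rows.values())))
    # Keep a column only if at least one selected row has a residue there.
    keep = [
        x for x in range(length)
        if any(row[x] != gap for row in rows.values())
    ]
    return {name: "".join(row[x] for x in keep) for name, row in rows.items()}
```

Columns that consist of gaps only appear because the removed sequences were the only ones with residues at those positions.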

Re-alignment

The two profiles X and Y are re-aligned using profile alignment, as described in stage 2.

Accept/reject

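Finally, a sum-of-pairs (SP) score of the multiple alignment is computed for sequences s and t in columns x:

$$\mathrm{SP} = \sum_{x} \sum_{s} \sum_{t>s} S(s[x], t[x]) \tag{3.2}$$

Here, S(s[x], t[x]) is the score of the pair of symbols that sequences s and t carry in column x. Gaps are scored with the affine gap penalty g + λe, with gap opening penalty g, gap length λ, and gap extension penalty e. If the SP score increases, the new alignment is accepted; otherwise it is rejected and the previous alignment is kept.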
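The SP score with affine gap penalties can be sketched in Python as a sum over all sequence pairs. The sketch assumes that g and e are given as positive numbers that are subtracted from the score, that columns in which both sequences of a pair have a gap are ignored for that pair, and that the substitution score is passed in as a function. These conventions and all names are assumptions made for illustration.

```python
from itertools import combinations


def gap_penalty(length, g, e):
    """Affine penalty g + length * e for one run of gaps."""
    return g + length * e


def pair_score(s, t, subst, g, e, gap="-"):
    """Score one pair of aligned rows: substitutions minus gap penalties."""
    score = 0.0
    run_owner, run_len = None, 0  # row holding the current gap run
    for a, b in zip(s, t):
        if a == gap and b == gap:
            continue  # this column says nothing about the pair
        owner = "s" if a == gap else "t" if b == gap else None
        if run_owner is not None and owner != run_owner:
            score -= gap_penalty(run_len, g, e)  # a gap run has ended
            run_len = 0
        if owner is None:
            score += subst(a, b)
        else:
            run_len += 1
        run_owner = owner
    if run_owner is not None:
        score -= gap_penalty(run_len, g, e)  # run reaching the last column
    return score


def sp_score(rows, subst, g, e):
    """Sum the pair scores over all pairs of rows (t > s)."""
    return sum(pair_score(s, t, subst, g, e) for s, t in combinations(rows, 2))
```

A refinement step would compute this score before and after the re-alignment and compare the two values.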

Stage 3 is repeated until all edges have been chosen for deletion or a user-defined number of iterations has been reached.

Modes

MUSCLE allows three different modes:

• MUSCLE: includes all three stages

• MUSCLE-prog: includes stages 1 and 2 with default options

• MUSCLE-fast: includes only stage 1 with the fastest options

From the document Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enrico Bonatesta ; Mag. Dipl.-Ing. Christoph Horejš-Kainrath (pages 54-58)