Heuristic Approximations - Advanced stochastic protein sequence analysis


3.1.2 Heuristic Approximations

FASTA

In 1988 William Pearson and David Lipman presented a program suite for improved biological sequence comparison [Pea88]. These tools (FASTA, FASTP, and LFASTA) can be used for analyzing both protein and DNA sequences. In the following, the basic version of the heuristic alignment method for protein sequences – FASTA – is explained.

Generally, FASTA is an approximation of the Smith-Waterman algorithm for local sequence alignments. Based on a four-stage approach, local high-scoring alignments are created, starting from exact short sub-sequence matches, through maximal scoring ungapped extensions, up to the final identification of gapped alignments most likely to be homologous to the query sequence. Note that in the original publication of the FASTA algorithm [Pea88], the single steps were not explicitly named. Rainer Merkl and Stephan Waack introduced comprehensible names in their description of the algorithm (cf. [Mer03, pp. 123ff]), which are adopted for clarity in the explanations given here.

1. The major speedup of the alignment calculation is gained in the first stage of the FASTA approach – hashing. Here, for all k-tuples of the query sequence starting at position i within the input, the starting positions j of exactly matching k-tuples within particular database entries are searched. These pairs (i, j) are called hot-spots and are obtained very efficiently by hashing techniques. In order to retrieve the ten best diagonal sequences within a (hypothetical) DP matrix, the relative positions of all hot-spots are evaluated using a simple scoring scheme. The limitation to the first ten diagonals is part of the FASTA heuristic. The diagonals extracted now contain mutually supporting word matches without gaps, which serve as seeds for further processing.

2. In the second step – scoring 1 – the ten maximally scored diagonal sequences are processed further. Within the diagonal sequences, optimum local ungapped alignments are obtained using a PAM or BLOSUM scoring scheme. Here, exact word matches from the first step are extended, possibly joining several seed matches. The alignments produced here are called initial regions.

3. In the third step – scoring 2 – the algorithm attempts to join the ungapped initial regions obtained in the second step into gapped alignments, taking gap costs into account. The resulting alignment scores, ordered from one to n, are called initn.

4. In the final phase of FASTA – alignment – the highest scoring candidate matches in a database search are realigned by means of the conventional Smith-Waterman algorithm. Here, the evaluation of the DP matrix is restricted to a small corridor around the initial region of step two scored with init1, producing the final alignment and the appropriate score.
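The hashing stage described in step 1 can be sketched as follows. This is a minimal illustration, not the original FASTA implementation: the function names `find_hot_spots` and `best_diagonals` are hypothetical, and ranking diagonals by their plain number of hot-spots is an assumed stand-in for the "simple scoring scheme" mentioned above.

```python
from collections import defaultdict

def find_hot_spots(query, target, k=2):
    """Return all pairs (i, j) with query[i:i+k] == target[j:j+k]."""
    index = defaultdict(list)              # k-tuple -> start positions in the query
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hot_spots = []
    for j in range(len(target) - k + 1):   # one hash lookup per target position
        for i in index.get(target[j:j + k], ()):
            hot_spots.append((i, j))
    return hot_spots

def best_diagonals(hot_spots, top=10):
    """Rank diagonals (offset j - i) by hot-spot count (assumed toy scoring)."""
    counts = defaultdict(int)
    for i, j in hot_spots:
        counts[j - i] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]                    # keep only the best diagonals

spots = find_hot_spots("HEAGAWGHEE", "PAWHEAE", k=2)
diagonals = best_diagonals(spots)
print(spots)      # [(4, 1), (0, 3), (7, 3), (1, 4)]
print(diagonals)  # [(3, 2), (-4, 1), (-3, 1)]
```

In this toy example the diagonal with offset 3 collects two mutually supporting hot-spots and would be passed on as the best seed for the subsequent scoring stages.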

The principles of the FASTA approach can easily be summarized graphically. In figure 3.6 the four phases of the algorithm are illustrated.

[Figure 3.6: four sketches (1-4) of the DP matrix spanned by the sequences s1 and s2, with the labels k-tuple, init1, initn, and opt]

Figure 3.6: The four phases of the FASTA algorithm (adopted from [Mer03, p.125]): 1) Determination of positions of identical sub-strings (k-tuples) and scores for diagonals; 2) Determination of locally best scoring diagonals – best: init1; 3) Merging of locally optimal alignments – score: initn; 4) Final sequence alignment in small corridor of DP matrix around init1 – score: opt.
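The corridor restriction of the fourth phase can be illustrated with a banded Smith-Waterman score computation. The sketch below is an assumption-laden simplification: the function name `banded_smith_waterman`, the linear gap cost, and the match/mismatch scores (used in place of a PAM or BLOSUM matrix) are chosen for illustration only.

```python
def banded_smith_waterman(s1, s2, diagonal, width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only DP cells (i, j) with
    |(j - i) - diagonal| <= width, i.e. a corridor around one diagonal."""
    n, m = len(s1), len(s2)
    # Cells outside the corridor are never written and keep the value 0,
    # which is also the restart value of a local alignment.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        lo = max(1, i + diagonal - width)
        hi = min(m, i + diagonal + width)
        for j in range(lo, hi + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # (mis)match
                          H[i - 1][j] + gap,       # gap in s2
                          H[i][j - 1] + gap)       # gap in s1
            best = max(best, H[i][j])
    return best

print(banded_smith_waterman("AAAACCCC", "AAAAGCCCC", diagonal=0, width=0))  # 13
print(banded_smith_waterman("AAAACCCC", "AAAAGCCCC", diagonal=0, width=1))  # 14
```

With a corridor of width 0 only the ungapped diagonal is evaluated; widening it by one cell already allows the single gap that yields the better score, while the number of evaluated cells stays proportional to the sequence length times the corridor width.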

BLAST

The second important technique for approximation of the Smith-Waterman algorithm for local sequence alignments is the Basic Local Alignment Search Tool – BLAST of Stephen Altschul and colleagues [Alt90]. In fact, in the last decade BLAST has become the major tool for sequence analysis and most experimental evaluations in molecular biology research these days start with a BLAST run on one of the large sequence databases.¹

The basic idea of this heuristic approximation is the extension of high-scoring matches of short sub-strings of the query sequence. Similar to the initial step of FASTA described in the previous section, BLAST starts with the localization of short sub-sequences contained in both the query sequence and the database sequences which produce significant scores.

Such pairs are called segment-pairs or hits. Based on these hits, locally optimal pairs of sequences are searched containing one hit – so-called High-Scoring Segment-Pairs (HSP).

The boundaries of HSPs are determined in such a way that extensions or shortenings of the string would decrease the score.
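The boundary criterion for HSPs can be made concrete with a small sketch of an ungapped extension. It is a simplified illustration under stated assumptions: the function name `extend_hit` is hypothetical, a match/mismatch score replaces a real substitution matrix, and the extension simply scans to the sequence ends instead of stopping early as a production implementation would.

```python
def extend_hit(s1, s2, i, j, w, match=2, mismatch=-1):
    """Extend the hit s1[i:i+w] / s2[j:j+w] along its diagonal without gaps.

    Returns (score, start_in_s1, start_in_s2, length) of the best scoring
    segment pair that contains the hit."""
    def score(a, b):
        return match if a == b else mismatch

    def best_extension(pairs):
        # Best cumulative gain over all prefixes of 'pairs' (empty prefix: 0).
        best_gain, best_len, running = 0, 0, 0
        for length, (a, b) in enumerate(pairs, start=1):
            running += score(a, b)
            if running > best_gain:
                best_gain, best_len = running, length
        return best_gain, best_len

    core = sum(score(a, b) for a, b in zip(s1[i:i + w], s2[j:j + w]))
    right_gain, right_len = best_extension(zip(s1[i + w:], s2[j + w:]))
    left_gain, left_len = best_extension(zip(reversed(s1[:i]), reversed(s2[:j])))
    return (core + left_gain + right_gain,
            i - left_len, j - left_len,
            left_len + w + right_len)

print(extend_hit("XXACGTXX", "YYACGTYY", 3, 3, 2))  # (8, 2, 2, 4)
```

The returned segment is extended only as far as the score keeps improving, so no further extension of the string increases the score.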

Retrieving high-scoring alignments of a query sequence from a database is performed in a multi-stage approach. Initially, all sub-strings consisting of w residues (w ≪ N, M) are extracted from the query sequence. Based on these so-called w-mers, all further steps are performed with respect to all database entries. As in the description of the FASTA algorithm, the names of the particular BLAST steps are not part of the original publication but adopted from [Mer03, p.129f].

• In the first step of BLAST – localization of hits – the database entry is inspected for high-scoring matches of all w-mers of the query sequence. Note that BLAST does not require exact matches in the first stage, only significant scores.

• Based on the hits extracted in the first phase, in the second stage of BLAST – identification of HSPs – pairs of hits located on the same diagonal of a (hypothetical) Smith-Waterman matrix with a distance shorter than A are obtained. This distance can be measured as the difference between the positions of the first symbols of the two w-mers. Both hits are extended to an HSP, and if the score of an HSP exceeds a threshold S_g, an extension with gaps is initiated.

• Starting from a residue pair (the so-called seed), the alignment is extended in both directions by means of standard DP. Here, only those cells of the DP matrix are considered where the calculated score is higher than the current maximum score minus a threshold X_g. Compared to the FASTA algorithm, the matrix area evaluated by BLAST is dynamically adjusted.

• In the final output stage, the resulting alignment containing gaps is returned if its E-value, calculated from the alignment score, is below the given threshold.

¹ Due to the overwhelming success of the tool and the resulting importance of BLAST alignments, in the common speech of molecular biologists even the original meaning of the word 'blast' (→ blow up with explosive [Swa86]) seems to have been extended towards 'perform an alignment of the query sequence against a database' ...
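The first BLAST step, in which hits need not be exact matches, can be sketched by expanding every w-mer of the query into its neighborhood of similarly scoring words. The names `neighborhood` and `locate_hits`, the score threshold parameter, and the match/mismatch scoring (instead of a real substitution matrix) are assumptions made for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    # Assumed stand-in for a PAM/BLOSUM lookup.
    return 2 if a == b else -1

def neighborhood(word, threshold, score=toy_score, alphabet=AMINO_ACIDS):
    """All words of len(word) whose position-wise score against 'word'
    reaches the threshold."""
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }

def locate_hits(query, target, w=3, threshold=3, score=toy_score):
    """Return (i, j) where target[j:j+w] is a neighbor of query[i:i+w]."""
    lookup = {}                            # neighbor word -> query positions
    for i in range(len(query) - w + 1):
        for neighbor in neighborhood(query[i:i + w], threshold, score):
            lookup.setdefault(neighbor, []).append(i)
    hits = []
    for j in range(len(target) - w + 1):   # one lookup per target position
        for i in lookup.get(target[j:j + w], ()):
            hits.append((i, j))
    return sorted(hits)

print(locate_hits("HEAG", "PHEKG"))               # [(0, 1), (1, 2)]
print(locate_hits("HEAG", "PHEKG", threshold=6))  # []
```

With the threshold set to the score of an exact match (6 for w = 3 in this toy scoring) no hits remain, whereas the lower threshold also accepts words with one substituted residue.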

The BLAST algorithm is likewise best summarized graphically, as shown in figure 3.7.

[Figure 3.7: two sketches (1, 2) of the DP matrix spanned by the sequences s1 and s2, with the labels seed, HSP, gapped alignment, and matrix cells evaluated]

Figure 3.7: Illustration of the BLAST algorithm (adopted from [Mer03, p.130]): 1) Localization of hits (marked by '+') and extension to High-Scoring Segment-Pairs (HSP) if the distance of two hits on the same diagonal is < A; 2) Calculation of a gapped alignment (conjunction of HSPs across gaps) for HSPs with a score > S_g. The starting point for the alignment is the pair of residues designated by seed. Only those cells of the DP matrix are considered whose scores differ from the maximum by no more than X_g (area shaded grey in the right sketch).
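The pairing criterion of the first sketch in the figure, two hits on the same diagonal with a distance below A, can be expressed compactly. The function name `two_hit_pairs` is hypothetical, and the sketch assumes that only neighboring hits on a diagonal are paired and that overlaps between hits are not checked.

```python
from collections import defaultdict

def two_hit_pairs(hits, max_distance):
    """Pair neighboring hits on the same diagonal whose start positions
    differ by less than max_distance (the parameter A in the text)."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)       # group hits by diagonal offset
    pairs = []
    for diagonal, starts in sorted(by_diagonal.items()):
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            if 0 < second - first < max_distance:
                pairs.append(((first, first + diagonal),
                              (second, second + diagonal)))
    return pairs

hits = [(0, 2), (5, 7), (30, 32), (4, 1)]
print(two_hit_pairs(hits, max_distance=10))  # [((0, 2), (5, 7))]
```

Only the two hits at positions 0 and 5 on the diagonal with offset 2 are close enough; the hit at position 30 and the isolated hit on the other diagonal would not trigger an extension.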

From the document: Advanced stochastic protein sequence analysis (pages 42-45)