• Keine Ergebnisse gefunden

Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment

N/A
N/A
Protected

Academic year: 2022

Aktie "Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment"

Copied!
7
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

P e r g a m o n

Appl. Math. Lett. Vol. 10, No. 2, pp. 67-73, 1997 Copyright©1997 Elsevier Science Ltd Printed in Great Britain. All rights reserved 0893-9659/97 $17.00 + 0.00 PII: S0893-9659(97)00013-X

Improving the Divide-and-Conquer Approach to Sum-of-Pairs

Multiple Sequence Alignment

J . S T O Y E

Research Center for Interdisciplinary Studies on Structure Formation (FSPM) University of Bielefeld

Postfach 10 01 31, D-33501 Bielefeld, Germany stoye~mathematik, uni-bielefeld, de

S. W . P E R R E Y

Department of Mathematics, Massey University Palmerston North, N e w Zealand

S. W. Perrey©massey. ac. nz

A. W . M. DRESS

Research Center for Interdisciplinary Studies on Structure Formation (FSPM) University of Bielefeld

Postfach 10 01 31, D-33501 Bielefeld, Germany dress~mathematik, uni-bielefeld, de

(Received May 1996; accepted August 1996)

A b s t r a c t - - W e consider the problem of multiple sequence alignment: given k sequences of length at most n and a certain scoring function, find an alignment that minimizes the corresponding "sum of pairs" distance score.

We generalize the divide-and-conquer technique described in [1,2], and present new ideas on how to use efficient search strategies for saving computer memory and accelerating the procedure for three or more sequences. Resulting running times and memory usage are shown for several test cases.

K e y w o r d s - - - M u l t i p l e sequence alignment, Dynamic programming, Divide-and-conquer.

1. I N T R O D U C T I O N

M u l t i p l e s e q u e n c e a l i g n m e n t is a n i m p o r t a n t p r o b l e m in c o m p u t a t i o n a l m o l e c u l a r biology, a n d m a n y a l g o r i t h m s have b e e n p r e s e n t e d in this area of research (for a r e c e n t c o m p a r i s o n see [3]).

Since t h e p r o b l e m of c o m p u t i n g o p t i m a l a l i g n m e n t s w i t h respect to t h e "sum-of-pairs" c r i t e r i o n a n d m o s t of its v a r i a n t s are N P - h a r d [4], m a n y a p p r o x i m a t i v e a l g o r i t h m s have b e e n p r o p o s e d (e.g., [5-8]), U n f o r t u n a t e l y , a l m o s t all of these m e t h o d s e i t h e r e x h i b i t a p r o h i b i t i v e c o m p u t a t i o n a l c o m p l e x i t y or yield biologically u n p l a u s i b l e results. W i t h o u r a l g o r i t h m , we t r y to c o n t r i b u t e t o w a r d s i m p r o v i n g this s i t u a t i o n .

The authors wish to thank R. Giegerich and U. T6nges for helpful comments on an earlier version of this paper.

Part of this work is supported by the German Ministry for Education, Science, Research, and Technology (BMBF) under Grant Number 01 IB 301 B4.

Typeset by ~4A/~-TEX 67

(2)

68 J. STOYE et al.

2. T H E P R O B L E M

Let us consider a finite alphabet ,4, k sequences $1, s 2 , . . . , 8k over ,4 of length n l , n2 . . . . , nk, respectively, and an additional letter, say ' - ' , not contained in ,4, which symbolizes gaps. An alignment of sl, s 2 , . . . , sk is given by a k x N matrix M = (mij)l<_i<_k,l<_j<_N for some N <

~-~i=1 ni, with entries k mij E ,4 (.J { - } subject to the following constraints: it does not contain any column consisting of gaps only, and for each i = 1 , 2 , . . . , k, the row ( m i l , m i 2 , . . . , miN) reproduces the sequence si upon eliminating all of its gap letters.

The weighted sum of pairs multiple sequence alignment problem can now be described as follows (cf. [9]): given 81, S 2 , . . . ,

Sk,

and given a scoring function D : (,4 tA {_})2 __, R, defined on all possible pairs of letters, find an optimal alignment M, i.e., an alignment t h a t minimizes

w(M) := ~ ~P,q" Z D(mpj,mqj) ,

l<p<q<k

j = l

where the Otp,q are sequence dependent (nonnegative) weight factors reflecting, e.g., phylogenetic relationship. The resulting score is also denoted by Wop(Sl,..., sk).

It is well known t h a t the multiple sequence alignment problem can be solved optimally by dynamic programming [10,11] with running time proportional to 2 k • I-Ii=l ni, searching for a k shortest alignment path in a directed graph with I-L=1 ni vertices. Faster variants and a number k of speedups of dynamic programming have, therefore, been proposed (e.g., [9,12]). Unfortunately, most of these approaches still need exceedingly large computing time and computer memory when applied to more than, say, six protein sequences of average length 300.

3. T H E B A S I C A L G O R I T H M

Our algorithm, previously described in [1,2] for the restricted case of three sequences, works according to the well-known divide-and-conquer principle: each of the given k sequences is being cut at an appropriately chosen site somewhere near to its midpoint, this way reducing the original alignment problem to the two subproblems of aligning the two resulting groups of prefix and suffix subsequences, respectively. These will be handled by the same procedure in a recursive manner.

The recursion stops when the remaining subsequences are short enough (e.g., shorter t h a n a pregiven threshold L) to be aligned by a standard procedure--in our implementation, we use SSA [13,1a].

T h e m a i n problem is to find a tuple of ideal slicing sites (cl,c2,...,ck) so that the simple concatenation of the two optimal alignments of the prefixes z Sl(< ci), s2(< c2),..., sk(< ck) and the suffixes Sl(>

Cl), 82(> C2),.-.,

8k(~ >

Ck)

forms an optimal alignment of the original sequences.

Obviously, for any fixed site Cl (1 < cl _< nl), there exists a ( k - 1)-tuple (c2(~1),... ,ck(~l)), such t h a t (51,c2(~x),...,ck(~l)) forms a k-tuple of ideal slicing sites. Unfortunately, finding these points requires searching the whole k-dimensional hypercube, requiring as much time as the standard dynamic programming procedure. So, of course, this is not the method of choice.

Instead, our algorithm tries to find so-called C-optimal slicing sites t h a t are based on pairwise sequence comparisons, only. More precisely, we use the dynamic programming procedure which we apply to all pairs of sequences (sp, sq). The resulting score matrices for pairwise alignment give rise to additional cost matrices

Csp,Sq

lap, Cq] "= Wop

(Sp(~

Cp),

8q(~

Cq)) + Wop

(8p(> Cp), 8q(>

Cq)) - Wop (Sp,

8q),

IHere, sp (<_ cp) denotes the prefix subsequence of sp with indices running from 1 to cp, and s n (> cp) denotes the suffix subsequence of sp with indices running from cp + 1 to np, i _< p _< k.

(3)

Divide-and-Conquer Approach 69 which contain the additional charge imposed by forcing the alignment path to run through a particular vertex (Cp, Cq) (1 <_ p < q <_ k). The calculation of C~,,s~ can be performed by computing forward and reverse matrices in a similar way as it is described in [15,16], respectively.

Note t h a t there exists, for every fixed 5p, at least one slicing site Cq(Sp) with Cs,,sq[Sp, cq(~p)] = 0. This follows from the facts t h a t the vertices on an optimal pairwise alignment path are precisely those with no additional cost, and that every alignment path meets at least once every position of the two sequences.

To search for a good k-tuple of slicing sites, we t r y to estimate the multiple additional cost imposed by forcing the multiple alignment path of the sequences through the particular vertex (cl, c 2 , . . . , ck) in the whole (k-dimensional) hypercube associated with the corresponding align- ment problem. To this end, we use a weighted sum of additional costs over all projections (Cp, cq) as such an estimate: we put

C(C1,C2,'",Ck) :---- ~ °lp,q'Csp,sq[Cp, Cq], l<p<q<_k

where the Olp,q a r e the same sequence dependent weight factors as above.

Our proposition, is now that C-optimal slicing sites, i.e., (Cl, c 2 , . . . , ck) that minimize C(Cl, c2, . . . . ck), result in very good, if not optimal multiple alignments. In conclusion, this leads to the following general procedure.

A l g o r i t h m d ~ c-.align (81, S 2 , . . . , 8k, L) I f m a x { n l , n 2 , . . . , n k } ~_ L,

t h e n return the optimal alignment of sl, s 2 , . . . , sk (using, e.g., HSh);

e l s e return the concatenation of

d Sic-align ( s i ( < Cl), s2(_ < c ~ ) , . . . , sk(_< ca), L), and d ~ c-align (sl (> C1 ), 82 ( > C2) . . . . , 8k(> Ck) , L ) , where (Cl, c 2 , . . . , Ck) := calc-cut(sl, s 2 , . . . , sk).

In the following section, we describe how to realize calc-cut, which computes a k-tuple of C-optimal slicing sites.

4 . E F F I C I E N T L Y C A L C U L A T I N G T H E S L I C I N G S I T E S

In a naive implementation, the search calc-cut for C-optimal slicing sites (Cl, c 2 , . . . , ck) needs time O(k2n 2 + n k - 1 ) , where n := max{n1, n 2 , . . . , nk}: the computation of all pairwise additional cost matrices takes O(k2n 2) time and, for given cl, all possible combinations of c 2 , . . . , ck have to be checked to find the tuple t h a t minimizes C in (..0(nk-1).

We reduce this running time and the required memory (@(k2n 2) for the naive version) by the following approach: we precalculate an estimation C for C(cl, c 2 , . . . , ck), which allows us to prune the search space enormously. Because the multiple additional cost C(cl, c2 . . . . ,ck) is a sum of nonnegative numbers ap,q. Csp,sq [%, Cq], it is possible to exclude a tuple of slicing sites (Cl, c 2 , . . . , ck), whenever one of the summands O~p,q • C s p , s q [Cp, Cq] is larger than the minimum found so far. In particular, for fixed Cl, any Cp with C~l,q • Cs~,s, [cl, cp] > C can never lead to a smaller sum C.

With this in mind, a tuple of C-optimal slicing sites can be calculated as follows.

F u n c t i o n calc-cut ( sl , s2, . . . , Sk ) 1. Fix cx := [(n1/2)].

^

2. Calculate and save columns col~q[j] := Csl,sq[~l,j] (2 < q < k, 1 < j < nq).

3. Locate slicing sites c2, . . . , gk such t h a t col el 1,q[~q] = 0 (2 < q < k).

(4)

7 0 J . S T O Y E et al.

4. Calculate the estimate

a:=

l<_p<q<k 2<p<q<k

where to calculate the entries Csp,Sq[Sp, 5q] (2 < p < q < k) only the memory for one column col ep~q of order O(n) is needed at a time, since no part of the matrices C~p,sq has to be saved.

Cl "

5. Calculate lower and upper bounds lq and Uq such t h a t ~l,q • col 1,q~] >- C, for all j < lq, and for all j > Uq (2 _< q < k). The intermediate segment eol~l,q[lq],... ,col~lq[Ua] forms the relevant part of each column col ~l,q.

6. Given these bounds, compute and save the relevant parts of the matrices Cs.,sq, defined by Csp,sq[i,j] with lp < i < up and lq < j <_ uq.

7. Search for better slicing sites (~l,C2(~z),...,ck(Ol)) within the relevant parts of the columns col ~l,q and of the matrices Cs~,sq. Thereby, the sum C can be computed step by step and the search can be stopped, if an intermediate result is larger t h a n C. Dur- ing this search, with decreasing values of C, the relevant part of the columns col ~l,q can possibly be further reduced, decreasing the search space even more.

Obviously, the worst case time and space complexity of this approach still remain O(k2n2-{ - n k- 1) and (9(k2n2), respectively, for the (very improbable) case where the bounds lp and Up can never be increased or decreased, respectively. But for biological sequence families, the effect is enormous:

for calculating the first tuple of slicing sites in the recursion, which takes far the longest time of all cut-site computations, the reduction from n to the length r := maXp=2 . . . k{Up -- Ip + 1} of the longest of the remaining relevant parts of the columns, is usually greater than 100 : 1 for small k, yielding memory savings for the matrices of at least four orders of magnitude, and reducing the expected time and space complexity to O(k2n 2 + r k-l) and O(kn + k2r2), respectively. More detailed results concerning exact running times and memory usage are presented in Section 6.

5. F U R T H E R I M P R O V E M E N T S

In the actual implementation of d~c-align, the program MSA is called for every single group of subsequences containing at least one subsequence of length at most L. It would be more elegant to calculate all the slicing sites first (which is possible without the knowledge of any of the subalignments) and then call MSA only once with the extra information concerning the slicing sites. Although HSh is able to handle fixed parts of the alignment of length one or longer, it still needs to be extended so as to accept fixed slicing sites, which from this point of view, can be seen as fixing an alignment of zero length. Upon our request, the authors of MSA are working on such an extension [17].

Another advantage expected from this approach, is that, with such a feature, MSh's ability of dealing with biologically more reasonable quasinatural gap costs [18] and with a score function, which does not penalize gaps at the beginning and the end of the alignment, will be carried over automatically to our procedure.

To improve the quality of the alignments further, we propose a windowing approach to correct the obtained alignments in the proximity of division sites. We suggest to choose a window width W depending on the threshold L. Laying such a window across each slicing site, one may search for an optimal realignment of the corresponding subsequences, again using for instance MSh.

6. R E S U L T S

To determine exact time and memory usage, our algorithm has been applied to several sets of sequences obtained by a stochastical mutation process on random sequences of different lengths n and alphabet size 20 denoting the set of amino acids. The sequences have a palrwise identity

(5)

D i v i d e - a n d - C o n q u e r A p p r o a c h 7 1

between 15 and 25 percent. All results shown are averages over ten different sequence sets. As scoring function D, we used the PAM-250 distance m a t r i x [19] with positive entries between 0 and 25, and g a p p e n a l t y 15.

From earlier m e a s u r e m e n t s [2], we obtained a value of L := 40 for the stopping criterion to g u a r a n t e e near-to-optimal results: the score of the alignments calculated in these tests differs by at m o s t 1 percent from the score of the optimal alignment (in cases where we were able to check this with HSh). T h e values of the weight factors O~p,q are calculated from the pairwise optimal alignment scores Wop(Sp, Sq), according to the formula given in [2].

Figures 1 and 2 show plots of running time and m e m o r y usage, respectively, for sets of k = 3 , . . . , 9, sequences with different average length n on a Silicon Graphics workstation with a M I P S R4000 CPU.

225

200

175

150

125 E

100

Q .

75

5 0

2 5

I k = 9 - - k = 8 . . . k = 7 . . . k = 6 ...

k = 5 . . . k = 4 . . . k = 3 . . .

/

i I i t

i S,, ..o..

. j t , : l - . , , " ' . . o . . . .

I 4 m

250 5 0 0 7 5 0 1000 1250 1500 1750 2 0 0 0

n

F i g u r e 1. R u n n i n g t i m e s o f d #~ c-align.

5 i i I I I I I

k=9 - -

4.5 ] k=8 . . .

/

k=7 . . .

k=6 ...

4 k=5 . . .

k=4 . . .

~

3 . 5 k = 3 . . .

==g 2.5 i

~ 2 .

1

0 . 5

0 i i , l

0 2 5 0 5 0 0 7 5 0 1000 1250 1500 1750 2 0 0 0

n

F i g u r e 2. M e m o r y u s a g e o f d ~ c-align.

As can be seen from these results, very fast calculation of high-quality alignments with rather m o d e r a t e m e m o r y usage is possible with d ~ c-align for up to seven sequences. In particular, the near-to-linear growth of running time for sets with three and four sequences allows us to align

(6)

72 J. STOYE et al.

t h r e e sequences of l e n g t h u p t o 9000, a n d four sequences of l e n g t h u p t o 6000 w i t h i n 10 m i n u t e s , as f u r t h e r m e a s u r e m e n t s have shown. I n c o n t r a s t , t h e c o m p a r a t i v e l y long r u n n i n g t i m e for eight a n d m o r e sequences d e p e n d s (in a d d i t i o n to longer MSh-runs) o n t h e s m a l l e r r e d u c t i o n ( r / n ) i n t h e s e cases (see T a b l e 1), since t h e m u l t i p l e a d d i t i o n a l cost C is a s u m of (k2) s u m m a n d s , of w h i c h o n l y o n e is t e s t e d a g a i n s t 0 w h e n d e t e r m i n i n g t h e b o u n d s of t h e r e l e v a n t regions.

Table 1. Ratio of relevant part to original size of columns col el averaged over 100 1 , q ' sets of k -- 3 , . . . , 9, sequences.

k 3 4 5 6 7 8 9

average - r 0 . 0 0 9 0 . 0 2 3 0 . 0 7 3 0 . 1 7 0 0 . 2 4 2 0 . 3 8 9 0.608

n

F i n a l l y , we have i n v e s t i g a t e d w h e t h e r t h e a l i g n m e n t scores i m p r o v e u p o n u s i n g t h e w i n d o w i n g a p p r o a c h m e n t i o n e d i n Section 5. T a b l e 2 shows t h e relative error of o u r score for four sets of r a n d o m sequences w i t h different k a n d n (i.e., t h e p e r c e n t a g e of t h e o p t i m a l score b y which t h e score o b t a i n e d b y o u r a l g o r i t h m exceeds t h e o p t i m a l one), for different values of L a n d W .

Table 2. Effect of windowing

k n L W - - 0 W = I L 4

60 0.00% 0.00%

3 I00 40 0.12% 0.00%

20 0.47% 0.35%

60 0.21% 0.18%

3 300 40 0.21% 0.18%

20 0.36% 0.35%

60 0.28% 0.12%

4 100 40 0.28% 0.18%

20 0.34% 0.18%

60 0.41% 0.41%

5 100 40 0.46% 0.42%

20 0.69% 0.69%

on alignments obtained with d ~4 c-align.

w =

!L

2

o.0o%

0.0o%

0.25%

0.18%

0.18%

0.32%

0.12%

0.17%

0.18%

0.41%

0.42%

0.50%

W = L 0.00%

O.00%

0.25%

0.08%

0.17%

0.17%

0.00%

0.09%

0.17%

0.41%

0.42%

0.50%

R E F E R E N C E S

1. A.W.M. Dress, G. Fiillen and S.W. Perrey, A divide and conquer approach to multiple alignment, In Proc. of the Third Conference on Intelligent Systems for Molecular Biology, I S M B 95, Menlo Park, CA, pp. 107-113, AAAI Press, (1995).

2. U. T6nges, S.W. Perrey, J. Stoye and A.W.M. Dress, A general method for fast multiple sequence alignment, G e n t 172 (GC33-GC41), (1996).

3. M.A. McClure, T.K. Vasi and W.M. Fitch, Comparative analysis of multiple protein-sequence alignment methods, Mol. Biol. Evol. 11 (4), 571-592, (1994).

4. L. Wang and T. Jiang, On the complexity of multiple sequence alignment, J. Comp. Biol. 1 (4), 337-348, (1994).

5. D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math.

Biol. 55 (i), 141-154, (1993).

6. J. Kececioglu, The maximum weight trace problem in multiple sequence alignment, In Proc. of the ~ th Syrup.

on Combinatorial Pattern Matching, L N C S 684, pp. 106-119, (1993).

7. J.D. Thompson, D.C. Higgins and T.J. Gibson, CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucl. Acids Res. 22 (22), 4673-4680, (1994).

8. V. Bafna, E.L. Lawler and P.A. Pevzner, Approximation algorithms for multiple sequence alignment, In Proc.

of the 5 th Syrup. on Combinatorial Pattern Matching, L N C S 807, pp. 43-53, (1994).

9. H. Carillo and D. Lipman, The multiple sequence alignment problem in biology, S I A M J. Appl. Math. 48 (5), 1073-1082, (1988).

10. S.B. Needleman and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48, 443-453, (1970).

(7)

Divide-and-Conquer Approach 73 11. M.S. Waterman, Introduction to computational biology, In Maps Sequences and Genomes, Chapman • Hall,

London, UK, (1995).

12. M. Murata, J.S. Richardson and J.L. Sussman, Simultaneous comparison of three protein sequences, Proc.

Natl. Acad. Sci. USA 82, 3073-3077, (1985).

13. D.J. Lipman, S.F. Altschul and J.D. Kececioglu, A tool for multiple sequence alignment, Proc. Natl. Aead.

Sci. USA 86, 4412-4415, (1989).

14. S.K. Gupta, J.D. Kececioglu and A.A. Sch~ffer, Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment, J. Comp. Biol. 2 (3), 459-472, (1995).

15. E.W. Myers and W. Miller, Optimal alignments in linear space, C A B I O S 4 (1), 11-17, (1988).

16. D. Naor and D.L. Brutlag, On near-optimal alignments of biological sequences, J. Comp. Biol. 1 (4), 349-366, (1994).

17. A.A. Sch~ffer, Personal communication.

18. S.F. Altschul, Gap costs for multiple sequence alignment, J. Theor. Biol. 138, 297-309, (1989).

19. M.O. Dayhoff, R.M. Schwartz and B.C. Orcutt, A model of evolutionary change in proteins, In Atlas of Protein Sequences and Structure, (Edited by M.O. Dayhoff), Volume 5, Suppl. 3, pp. 345-352, National Biomedical Research Foundation, Washington, DC, (1978).

Referenzen

ÄHNLICHE DOKUMENTE

Using fluorescence correlation spectroscopy and theory, it is shown that the cooperative diffusion coefficient of the matrix polystyrene chains can be measured by

This is a place where children and families, who both share things in common and are also different in many ways, come together.. Every child comes to the daycare centre with

The fidelity of the hybridization reac- tions can be improved significantly and as an example of a set of 24 words of 16-mers we show that the optimal set has unique

Hammerschmidt (Hrsg.): Proceedings of the XXXII Intemational Congress for Asian and North African Studies, Hamburg, 25th-30th Augusl 1986 (ZDMG-Suppl.. century locally

The existence of pseudopotentials is considered in [3], furthermore the fact th at the Liouville equation cannot be solved by inverse scattering methods. [2]

Because the morbidity rates and survival probabilities are relatively constant in different regions, it is possible to use the estimation of the morbidity rates and/or

Motivated by the problem that rate stationarity of the underlying processes is crucial to many statis- tical analysis techniques, the multiple filter test (MFT) tests the

As for the conductivity sensor, the result of calibration shows that a set of coefficient for the conversion from the frequency to the conductivity decided at the time of the