
Nucleic Acids Research, 1994, Vol. 22, No. 22 4673-4680

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

Julie D. Thompson, Desmond G. Higgins+ and Toby J. Gibson*

European Molecular Biology Laboratory, Postfach 102209, Meyerhofstrasse 1, D-69012 Heidelberg, Germany

Received July 12, 1994; Revised and Accepted September 23, 1994

ABSTRACT

The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.

INTRODUCTION

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; as an essential prelude to molecular evolutionary analysis. The rate of appearance of new sequence data is steadily increasing and the development of efficient and accurate automatic methods for multiple alignment is, therefore, of major importance. The majority of automatic multiple alignments are now carried out using the 'progressive' approach of Feng and Doolittle (1). In this paper, we describe a number of improvements to the progressive multiple alignment method which greatly improve the sensitivity without sacrificing any of the speed and efficiency which makes this approach so practical. The new methods are made available in a program called CLUSTAL W, which is freely available and portable to a wide variety of computers and operating systems.

In order to align just two sequences, it is standard practice to use dynamic programming (2). This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all amino acids or nucleotides [e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)] and penalties for insertions or deletions of different lengths. Attempts at generalising dynamic programming to multiple alignments are limited to small numbers of short sequences (5). For much more than eight or so proteins of average length, the problem is uncomputable given current computer power. Therefore, all of the methods capable of handling larger problems in practical timescales make use of heuristics. Currently, the most widely used approach is to exploit the fact that homologous sequences are evolutionarily related. One can build up a multiple alignment progressively by a series of pairwise alignments, following the branching order in a phylogenetic tree (1). One first aligns the most closely related sequences, gradually adding in the more distant ones. This approach is sufficiently fast to allow alignments of virtually any size. Further, in simple cases, the quality of the alignments is excellent, as judged by the ability to correctly align corresponding domains from sequences of known secondary or tertiary structure (6). In more difficult cases, the alignments give good starting points for further automatic or manual refinement.

This approach works well when the data set consists of sequences of different degrees of divergence. Pairwise alignment of very closely related sequences can be carried out very accurately. The correct answer may often be obtained using a wide range of parameter values (gap penalties and weight matrix). By the time the most distantly related sequences are aligned, one already has a sample of aligned sequences which gives important information about the variability at each position. The positions of the gaps that were introduced during the early alignments of the closely related sequences are not changed as new sequences are added. This is justified because the placement of gaps in alignments between closely related sequences is much more accurate than between distantly related ones. When all of the sequences are highly divergent (e.g. less than ~25-30% identity between any pair of sequences), this progressive approach becomes much less reliable.

*To whom correspondence should be addressed

+Present address: European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK

© 1994 Oxford University Press

There are two major problems with the progressive approach: the local minimum problem and the choice of alignment parameters. The local minimum problem stems from the 'greedy' nature of the alignment strategy. The algorithm greedily adds sequences together, following the initial tree. There is no guarantee that the global optimal solution, as defined by some overall measure of multiple alignment quality (7,8), or anything close to it, will be found. More specifically, any mistakes (misaligned regions) made early in the alignment process cannot be corrected later as new information from other sequences is added. This problem is frequently thought of as mainly resulting from an incorrect branching order in the initial tree. The initial trees are derived from a matrix of distances between separately aligned pairs of sequences and are much less reliable than trees from complete multiple alignments. In our experience, however, the real problem is caused simply by errors in the initial alignments. Even if the topology of the guide tree is correct, each alignment step in the multiple alignment process may have some percentage of the residues misaligned. This percentage will be very low on average for very closely related sequences but will increase as sequences diverge. It is these misalignments which carry through from the early alignment steps that cause the local minimum problem. The only way to correct this is to use an iterative or stochastic sampling procedure (e.g. 7,9,10). We do not directly address this problem in this paper.

The alignment parameter choice problem is, in our view, at least as serious as the local minimum problem. Stochastic or iterative algorithms will be just as badly affected as progressive ones if the parameters are inappropriate: they will arrive at a false global minimum. Traditionally, one chooses one weight matrix and two gap penalties (one for opening a new gap and one for extending an existing gap) and hopes that these will work well over all parts of all the sequences in the data set. When the sequences are all closely related, this works. The first reason is that virtually all residue weight matrices give most weight to identities. When identities dominate an alignment, almost any weight matrix will find approximately the correct solution. With very divergent sequences, however, the scores given to non-identical residues will become critically important; there will be more mismatches than identities. Different weight matrices will be optimal at different evolutionary distances or for different classes of proteins.

The second reason is that the range of gap penalty values that will find the correct or best possible solution can be very broad for highly similar sequences (11). As more and more divergent sequences are used, however, the exact values of the gap penalties become important for success. In each case, there may be a very narrow range of values which will deliver the best alignment. Further, in protein alignments, gaps do not occur randomly (i.e. with equal probability at all positions). They occur far more often between the major secondary structural elements of α-helices and β-strands than within (12).

The major improvements described in this paper attempt to address the alignment parameter choice problem. We dynamically vary the gap penalties in a position- and residue-specific manner. The observed relative frequencies of gaps adjacent to each of the 20 amino acids (12) are used to locally adjust the gap opening penalty after each residue. Short stretches of hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil regions and the gap opening penalties are locally reduced in these stretches. In addition, the locations of the gaps found in the early alignments are also given reduced gap opening penalties. It has been observed in alignments between sequences of known structure that gaps tend not to be closer than roughly eight residues on average (12). We increase the gap opening penalty within eight residues of existing gaps.

The two main series of amino acid weight matrices that are used today are the PAM series (3) and the BLOSUM series (4). In each case, there is a range of matrices to choose from. Some matrices are appropriate for aligning very closely related sequences where most weight by far is given to identities, with only the most frequent conservative substitutions receiving high scores. Other matrices work better at greater evolutionary distances where less importance is attached to identities (13). We choose different weight matrices, as the alignment proceeds, depending on the estimated divergence of the sequences to be aligned at each stage.

Sequences are weighted to correct for unequal sampling across all evolutionary distances in the data set (14). This down-weights sequences that are very similar to other sequences in the data set and up-weights the most divergent ones. The weights are calculated directly from the branch lengths in the initial guide tree (15). Sequence weighting has already been shown to be effective in improving the sensitivity of profile searches (15,16).

In the original CLUSTAL programs (17-19), the initial guide trees, used to guide the multiple alignment, were calculated using the UPGMA method (20). We now use the Neighbour-Joining method (21) which is more robust against the effects of unequal evolutionary rates in different lineages and which gives better estimates of individual branch lengths. This is useful because it is these branch lengths which are used to derive the sequence weights. We also allow users to choose between fast approximate alignments (22) or full dynamic programming for the distance calculations used to make the guide tree.

The new improvements dramatically improve the sensitivity of the progressive alignment method for difficult alignments involving highly diverged sequences. We show one very demanding test case of over 60 SH3 domains (23) which includes sequence pairs with as little as 12% identity and where there is only one exactly conserved residue across all of the sequences. Using default parameters, we can achieve an alignment that is almost exactly correct, according to available structural information (24). Using the program in a wide variety of situations, we find that it will normally find the correct alignment in all but the most difficult and pathological of cases.

MATERIAL AND METHODS

The basic alignment method

The basic multiple alignment algorithm consists of three main stages: (i) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; (ii) a guide tree is calculated from the distance matrix; (iii) the sequences are progressively aligned according to the branching order in the guide tree. An example using 7 globin sequences of known tertiary structure (25) is given in Figure 1.

The distance matrix/pairwise alignments

In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method (22). This allows very large numbers of sequences to be aligned, even on a microcomputer. The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2-4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix. These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded).

[Figure 1: image not reproduced. Sequence weights shown on the rooted tree: Hbb_Human 0.221, Hbb_Horse 0.225, Hba_Human 0.194, Hba_Horse 0.203, Myg_Phyca 0.411, Glb5_Petma 0.398, Lgb2_Luplu 0.442.]

Figure 1. The basic progressive alignment procedure, illustrated using a set of 7 globins of known tertiary structure. The sequence names are from Swiss Prot (38): Hba_Horse: horse α-globin; Hba_Human: human α-globin; Hbb_Horse: horse β-globin; Hbb_Human: human β-globin; Myg_Phyca: sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin; Lgb2_Luplu: lupin leghaemoglobin. In the distance matrix, the mean number of differences per residue is given. The unrooted tree shows all branch lengths drawn to scale. In the rooted tree, all branch lengths (mean number of differences per residue along each branch) are given as well as weights for each sequence. In the multiple alignment, the approximate positions of the 7 α-helices common to all 7 proteins are shown. This alignment was derived using CLUSTAL W with default parameters and the PAM (3) series of weight matrices.

Both of these scores are initially calculated as per cent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site. We do not correct for multiple substitutions in these initial distances. In Figure 1 we give the 7 x 7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.
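The conversion from per cent identity to distance can be sketched in a few lines. This is a minimal illustration only: the function names and the two short aligned strings are hypothetical, and the gap character is assumed to be "-".

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Per cent identity over compared positions.

    Columns where either sequence has a gap are excluded, following the
    description in the text.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = [(x, y) for x, y in zip(aligned_a, aligned_b)
                if x != gap and y != gap]
    if not compared:
        return 0.0
    identical = sum(1 for x, y in compared if x == y)
    return 100.0 * identical / len(compared)


def identity_to_distance(pid: float) -> float:
    """Divide by 100 and subtract from 1.0 (no multiple-substitution correction)."""
    return 1.0 - pid / 100.0


# Hypothetical pairwise alignment: 9 compared columns, 8 identities.
pid = percent_identity("VLSPADK-TNV", "VLSAADKTTN-")
distance = identity_to_distance(pid)
```

With these inputs, 8 of the 9 compared columns are identical, so the distance is 1/9 differences per site.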

The guide tree

The trees used to guide the final multiple alignment process are calculated from the distance matrix of step 1 using the Neighbour-Joining method (21). This produces unrooted trees with branch lengths proportional to estimated divergence along each branch. The root is placed by a 'mid-point' method (15) at a position where the means of the branch lengths on either side of the root are equal. These trees are also used to derive a weight for each sequence (15). The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch. In the example in Figure 1, the leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442, which is equal to the length of the branch from the root to it. The human β-globin (Hbb_Human) gets a weight consisting of the length of the branch leading to it that is not shared with any other sequences (0.081) plus half the length of the branch shared with the horse β-globin (0.226/2) plus one quarter the length of the branch shared by all four haemoglobins (0.061/4) plus one fifth the branch shared between the haemoglobins and myoglobin (0.015/5) plus one sixth the branch leading to all the vertebrate globins (0.062/6). This sums to a total of 0.221. In contrast, in the normal progressive alignment algorithm, all sequences would be equally weighted. The rooted tree with branch lengths and sequence weights for the 7 globins is given in Figure 1.
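The branch-sharing rule can be sketched as a short recursion. This is an illustrative sketch, not the program's code: the tuple-based tree encoding, the function names and the three-leaf toy tree are assumptions. Only the Hbb_Human branch lengths are taken from the text.

```python
def leaf_names(node):
    """Names of all sequences below a node; node = (leaf_name or [children], branch_length)."""
    sub, _length = node
    if isinstance(sub, str):
        return [sub]
    return [name for child in sub for name in leaf_names(child)]


def sequence_weights(node, inherited=0.0, weights=None):
    """Each branch contributes length / (number of sequences sharing it) to every leaf below it."""
    if weights is None:
        weights = {}
    sub, length = node
    share = inherited + length / len(leaf_names(node))
    if isinstance(sub, str):
        weights[sub] = share
    else:
        for child in sub:
            sequence_weights(child, share, weights)
    return weights


# Hypothetical rooted tree: A hangs off the root; B and C share a branch of length 0.1.
toy_tree = ([("A", 0.4), ([("B", 0.2), ("C", 0.3)], 0.1)], 0.0)
toy_weights = sequence_weights(toy_tree)

# Hbb_Human path quoted in the text: (branch length, number of sequences sharing it).
hbb_human_path = [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)]
hbb_human_weight = sum(length / n for length, n in hbb_human_path)
```

In the toy tree A keeps its full branch (0.4), while B and C each receive half of the shared branch (0.25 and 0.35). The Hbb_Human sum comes to about 0.22, in line with the 0.221 quoted above; the small difference presumably reflects rounding of the printed branch lengths.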

Progressive alignment

The basic procedure at this stage is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree. You proceed from the tips of the rooted tree towards the root. In the globin example in Figure 1 you align the sequences in the following order: human vs. horse β-globin; human vs. horse α-globin; the 2 α-globins vs. the 2 β-globins; the myoglobin vs. the haemoglobins; the cyanohaemoglobin vs. the haemoglobins plus myoglobin; the leghaemoglobin vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Each step consists of aligning two existing alignments or sequences. Gaps that are present in older alignments remain fixed.
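The order of alignment steps follows from a post-order walk of the rooted guide tree, sketched below. The nested-tuple encoding and the function name are assumptions for illustration; the topology is the one implied by the globin example in the text. A post-order walk gives one valid tips-to-root order, and the program's actual traversal may differ in detail.

```python
def alignment_order(node, steps=None):
    """Walk a rooted binary tree from the tips towards the root.

    Returns (sequences below node, list of (left group, right group) merge steps).
    A leaf is a sequence name; an internal node is a (left, right) tuple.
    """
    if steps is None:
        steps = []
    if isinstance(node, str):
        return [node], steps
    left, right = node
    left_group, _ = alignment_order(left, steps)
    right_group, _ = alignment_order(right, steps)
    steps.append((left_group, right_group))
    return left_group + right_group, steps


# Topology of the 7-globin guide tree as described in the text (branch lengths omitted).
globin_tree = (
    (
        (
            (("Hbb_Human", "Hbb_Horse"), ("Hba_Human", "Hba_Horse")),
            "Myg_Phyca",
        ),
        "Glb5_Petma",
    ),
    "Lgb2_Luplu",
)

_, steps = alignment_order(globin_tree)
for left_group, right_group in steps:
    print(" + ".join(left_group), "vs.", " + ".join(right_group))
```

The six printed steps follow the order given in the text, ending with the leghaemoglobin against all the rest.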

In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions (see the section on gap penalties below for modifications to this rule). In order to calculate the score between a position from one sequence or alignment and one from another, the average of all the pairwise weight matrix scores from the amino acids in the two sets of sequences is used, i.e. if you align 2 alignments with 2 and 4 sequences respectively, the score at each position is the average of 8 (2 x 4) comparisons. This is illustrated in Figure 2. If either set of sequences contains one or more gaps in one of the positions being considered, each gap versus a residue is scored as zero.

The default amino acid weight matrices we use are rescored to have only positive values. Therefore, this treatment of gaps treats the score of a residue versus a gap as having the worst possible score. When sequences are weighted (see Improvements to progressive alignment, below), each weight matrix value is multiplied by the weights from the 2 sequences, as illustrated in Figure 2.

[Figure 1 panels: pairwise alignment; calculate distance matrix; unrooted Neighbor-joining tree; rooted NJ tree (guide tree) and sequence weights; progressive alignment; align following the guide tree. Distance matrix as far as it can be recovered from the figure (? marks entries that are not recoverable):]

                  1     2     3     4     5     6
    Hbb_Human   1
    Hbb_Horse   2  .17
    Hba_Human   3  .59   .60
    Hba_Horse   4  .59   .59   .13
    Myg_Phyca   5  .77   .77   .75   .75
    Glb5_Petma  6  .81   .82   .73   .74    ?
    Lgb2_Luplu  7  .87   .86   .86   .88    ?    .90

[Figure 2: image not reproduced. Sequences 1-4 form the first section of alignment and sequences 5-6 the second.]

Without sequence weights:

$$\mathrm{Score} = \frac{M(t,v) + M(t,i) + M(l,v) + M(l,i) + M(k,v) + M(k,i) + M(k,v) + M(k,i)}{8}$$

With sequence weights $W_n$:

$$\mathrm{Score} = \frac{M(t,v) W_1 W_5 + M(t,i) W_1 W_6 + M(l,v) W_2 W_5 + M(l,i) W_2 W_6 + M(k,v) W_3 W_5 + M(k,i) W_3 W_6 + M(k,v) W_4 W_5 + M(k,i) W_4 W_6}{8}$$

Figure 2. The scoring scheme for comparing two positions from two alignments. Two sections of alignment with 4 and 2 sequences respectively are shown. The score of the position with amino acids T, L, K, K versus the position with amino acids V and I is given with and without sequence weights. M(X,Y) is the weight matrix entry for amino acid X versus amino acid Y. Wn is the weight for sequence n.

[Figure 3 plot residue. Vertical axis: gap opening penalty, plotted along a section of the globin alignment.]
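The position-versus-position score described above can be sketched as follows. The scoring function toy_matrix uses made-up all-positive values rather than a real PAM or BLOSUM table, and the example weights are hypothetical. The sketch also assumes that gap comparisons stay in the denominator, which is how "scored as zero" is read here.

```python
from itertools import product


def toy_matrix(x, y):
    """Hypothetical all-positive scores; NOT values from PAM or BLOSUM."""
    return 10.0 if x == y else 2.0


def column_score(col_a, col_b, weights_a=None, weights_b=None,
                 matrix=toy_matrix, gap="-"):
    """Average of all pairwise scores between two alignment columns.

    Each pairwise score is multiplied by the weights of the two sequences
    involved; a residue versus a gap contributes zero.
    """
    weights_a = weights_a or [1.0] * len(col_a)
    weights_b = weights_b or [1.0] * len(col_b)
    total = 0.0
    for (x, wx), (y, wy) in product(zip(col_a, weights_a), zip(col_b, weights_b)):
        if x == gap or y == gap:
            continue  # gap versus residue is scored as zero
        total += matrix(x, y) * wx * wy
    return total / (len(col_a) * len(col_b))


# Column T, L, K, K (sequences 1-4) against column V, I (sequences 5-6), as in Figure 2.
unweighted = column_score("TLKK", "VI")
weighted = column_score("TLKK", "VI", [1.0, 0.5, 0.5, 0.5], [1.0, 0.25])
with_gap = column_score("TLK-", "VI")
```

With the toy scores every comparison is a mismatch worth 2, so the unweighted average of the 8 comparisons is 2.0, and replacing one residue with a gap lowers it to 1.5.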

Improvements to progressive alignment

All of the remaining modifications apply only to the final progressive alignment stage. Sequence weighting is relatively straightforward and is already widely used in profile searches (15,16). The treatment of gap penalties is more complicated. Initial gap penalties are calculated depending on the weight matrix, the similarity of the sequences and the length of the sequences. Then, an attempt is made to derive sensible local gap opening penalties at every position in each prealigned group of sequences that will vary as new sequences are added. The use of different weight matrices as the alignment progresses is novel and largely by-passes the problem of initial choice of weight matrix. The final modification allows us to delay the addition of very divergent sequences until the end of the alignment process, when all of the more closely related sequences have already been aligned.

Sequence weighting

Sequence weights are calculated directly from the guide tree. The weights are normalised such that the biggest one is set to 1.0 and the rest are all less than 1.0. Groups of closely related sequences receive lowered weights because they contain much duplicated information. Highly divergent sequences without any close relatives receive high weights. These weights are used as simple multiplication factors for scoring positions from different sequences or prealigned groups of sequences. The method is illustrated in Figure 2. In the globin example in Figure 1, the two α-globins get down-weighted because they are almost duplicate sequences (as do the two β-globins); they receive a combined weight of only slightly more than if a single α-globin was used.
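The normalisation step is a single division by the largest weight. The sketch below uses only the two raw weights quoted in the text for the globin example; the function name is illustrative.

```python
def normalise_weights(raw):
    """Scale weights so that the largest becomes 1.0."""
    largest = max(raw.values())
    return {name: weight / largest for name, weight in raw.items()}


# Raw tree-derived weights quoted in the text for two of the seven globins.
raw = {"Lgb2_Luplu": 0.442, "Hbb_Human": 0.221}
normalised = normalise_weights(raw)
```

Here the leghaemoglobin, the most isolated sequence, is scaled to 1.0 and the human β-globin to 0.5.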

Initial gap penalties

Initially, two gap penalties are used: a gap opening penalty (GOP), which gives the cost of opening a new gap of any length, and a gap extension penalty (GEP), which gives the cost of every item in a gap. Initial values can be set by the user from a menu. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on the following factors.

Figure 3. The variation in local gap opening penalty is plotted for a section of alignment. The initial gap opening penalty is indicated by a dotted line. Two hydrophilic stretches are underlined. The lowest penalties correspond to the ends of the alignment, the hydrophilic stretches and the two positions with gaps. The highest values are within 8 residues of the two gap positions. The rest of the variation is caused by the residue specific gap penalties (12).

Dependence on the weight matrix. It has been shown (16,28) that varying the gap penalties used with different weight matrices can improve the accuracy of sequence alignments. Here, we use the average score for two mismatched residues (i.e. off-diagonal values in the matrix) as a scaling factor for the GOP.

Dependence on the similarity of the sequences. The per cent identity of the two (groups of) sequences to be aligned is used to increase the GOP for closely related sequences and decrease it for more divergent sequences on a linear scale.

Dependence on the lengths of the sequences. The scores for both true and false sequence alignments grow with the length of the sequences. We use the logarithm of the length of the shorter sequence to increase the GOP with sequence length. Using these three modifications, the initial GOP calculated by the program is:

$$\mathrm{GOP} \rightarrow \{\mathrm{GOP} + \log[\min(N, M)]\} \times (\text{average residue mismatch score}) \times (\text{per cent identity scaling factor})$$

where N, M are the lengths of the two sequences.

Dependence on the difference in the lengths of the sequences. The GEP is modified depending on the difference between the lengths of the two sequences to be aligned. If one sequence is much shorter than the other, the GEP is increased to inhibit too many long gaps in the shorter sequence. The initial GEP calculated by the program is:

$$\mathrm{GEP} \rightarrow \mathrm{GEP} \times [1.0 + |\log(N/M)|]$$

where N, M are the lengths of the two sequences.
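The two initial penalty formulas translate directly into code. This sketch makes several assumptions: the logarithm is taken as the natural logarithm (the text does not state the base), the per cent identity scaling factor is passed in as a ready-made number because its exact form is not given, and the example penalty values are arbitrary.

```python
import math


def initial_gop(gop, n, m, mean_mismatch_score, identity_scale):
    """GOP -> {GOP + log[min(N, M)]} * (average mismatch score) * (identity scaling factor)."""
    return (gop + math.log(min(n, m))) * mean_mismatch_score * identity_scale


def initial_gep(gep, n, m):
    """GEP -> GEP * [1.0 + |log(N/M)|]; unchanged when both sequences have equal length."""
    return gep * (1.0 + abs(math.log(n / m)))


# Arbitrary illustrative values: user GOP of 10, lengths 100 and 150, neutral scaling factors.
gop = initial_gop(10.0, 100, 150, mean_mismatch_score=1.0, identity_scale=1.0)
gep_equal = initial_gep(0.05, 100, 100)
gep_unequal = initial_gep(1.0, 200, 100)
```

Because of the absolute value, the GEP adjustment depends only on the ratio of the two lengths, not on which sequence is the longer one.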

Position-specific gap penalties

In most dynamic programming applications, the initial gap opening and extension penalties are applied equally at every position in the sequence, regardless of the location of a gap, except for terminal gaps which are usually allowed at no cost. In CLUSTAL W, before any pair of sequences or prealigned groups of sequences are aligned, we generate a table of gap opening penalties for every position in the two (sets of) sequences. An example is shown in Figure 3. We manipulate the initial gap opening penalty in a position-specific manner, in order to make gaps more or less likely at different positions.


The local gap penalty modification rules are applied in a hierarchical manner. The exact details of each rule are given below. Firstly, if there is a gap at a position, the gap opening and gap extension penalties are lowered; the other rules do not apply. This makes gaps more likely at positions where there are already gaps. If there is no gap at a position, then the gap opening penalty is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. Finally, at any position within a run of hydrophilic residues, the penalty is decreased. These runs usually indicate loop regions in protein structures. If there is no run of hydrophilic residues, the penalty is modified using a table of residue-specific gap propensities (12). These propensities were derived by counting the frequency of each residue at either end of gaps in alignments of proteins of known structure. An illustration of the application of these rules from one part of the globin example in Figure 1 is given in Figure 3.

Lowered gap penalties at existing gaps. If there are already gaps at a position, then the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by a half. The new gap opening penalty is calculated as:

$$\mathrm{GOP} \rightarrow \mathrm{GOP} \times 0.3 \times \frac{\text{no. of sequences without a gap}}{\text{no. of sequences}}$$

Increased gap penalties near existing gaps. If a position does not have any gaps but is within 8 residues of an existing gap, the GOP is increased by:

$$\mathrm{GOP} \rightarrow \mathrm{GOP} \times \left\{2 + \frac{(8 - \text{distance from gap}) \times 2}{8}\right\}$$

Reduced gap penalties in hydrophilic stretches. Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The residues that are to be considered hydrophilic may be set by the user but are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any position, there are no gaps and any of the sequences has such a stretch, the GOP is reduced by one third.

Residue-specific penalties. If there is no hydrophilic stretch and the position does not contain any gaps, then the GOP is multiplied by one of the 20 numbers in Table 1, depending on the residue. If there is a mixture of residues at a position, the multiplication factor is the average of all the contributions from each sequence.
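The four position-specific rules can be combined into one function that returns the local GOP for a single alignment column. This is a sketch under stated assumptions: the rules are treated as mutually exclusive in the order given in the text, "within 8 residues" is read as a distance of 8 or less, and the function and parameter names are illustrative. The residue factors are the Table 1 values and the hydrophilic set is the stated default. The halving of the GEP at existing gaps is not shown.

```python
# Residue-specific gap modification factors from Table 1.
GAP_FACTORS = {
    "A": 1.13, "C": 1.13, "D": 0.96, "E": 1.31, "F": 1.20,
    "G": 0.61, "H": 1.00, "I": 1.32, "K": 0.96, "L": 1.21,
    "M": 1.29, "N": 0.63, "P": 0.74, "Q": 1.07, "R": 0.72,
    "S": 0.76, "T": 0.89, "V": 1.25, "Y": 1.00, "W": 1.23,
}
HYDROPHILIC = set("DEGKNQPRS")  # default hydrophilic residues named in the text


def hydrophilic_positions(seq, run=5):
    """Indices of residues lying inside a run of at least `run` hydrophilic residues."""
    positions, start = set(), None
    for i, residue in enumerate(seq + "#"):  # sentinel closes a run at the end
        if residue in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run:
                positions.update(range(start, i))
            start = None
    return positions


def local_gop(gop, column, distance_to_gap=None,
              in_hydrophilic_stretch=False, gap="-"):
    """Local gap opening penalty for one column of a (pre)aligned group."""
    n_gaps = column.count(gap)
    if n_gaps > 0:  # rule 1: existing gap at this position
        return gop * 0.3 * (len(column) - n_gaps) / len(column)
    if distance_to_gap is not None and distance_to_gap <= 8:  # rule 2: near a gap
        return gop * (2.0 + (8 - distance_to_gap) * 2.0 / 8.0)
    if in_hydrophilic_stretch:  # rule 3: reduced by one third
        return gop * 2.0 / 3.0
    # rule 4: average of the residue-specific factors in the column
    return gop * sum(GAP_FACTORS[r] for r in column) / len(column)
```

For example, with an initial GOP of 10, a column "A-AA" gives 2.25, a gap-free column four residues from a gap gives 30.0, and a column of one alanine and one glycine gives 8.7.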

Weight matrices

Two main series of weight matrices are offered to the user: the Dayhoff PAM series (3) and the BLOSUM series (4). The default is the BLOSUM series. In each case, there is a choice of matrix ranging from strict ones, useful for comparing very closely related sequences to very 'soft' ones that are useful for comparing very distantly related sequences. Depending on the distance between the two sequences or groups of sequences to be compared, we switch between 4 different matrices. The distances are measured directly from the guide tree. The ranges of distances and tables used with the PAM series of matrices are: 80-100%: PAM20, 60-80%: PAM60, 40-60%: PAM120, 0-40%: PAM350. The range used with the BLOSUM series is: 80-100%: BLOSUM80, 60-80%: BLOSUM62, 30-60%: BLOSUM45, 0-30%: BLOSUM30.
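The matrix switching amounts to a table lookup, sketched below. Two points are assumptions rather than statements from the text: the percentages are read as per cent identity (high values select the strict matrices), and a value falling exactly on a boundary is assigned to the upper range. The function and table names are illustrative.

```python
# (lower bound in per cent, matrix name), highest range first; ranges as listed in the text.
MATRIX_RANGES = {
    "PAM": [(80, "PAM20"), (60, "PAM60"), (40, "PAM120"), (0, "PAM350")],
    "BLOSUM": [(80, "BLOSUM80"), (60, "BLOSUM62"), (30, "BLOSUM45"), (0, "BLOSUM30")],
}


def choose_matrix(percent, series="BLOSUM"):
    """Pick the weight matrix for two (groups of) sequences from their similarity."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    for lower, name in MATRIX_RANGES[series]:
        if percent >= lower:
            return name
    raise AssertionError("unreachable: the last range starts at 0")


early_step = choose_matrix(85)         # closely related pair
late_step = choose_matrix(45, "PAM")   # more divergent groups, PAM series
```

Early alignment steps between close relatives would therefore use a strict matrix, and later steps involving divergent groups a softer one.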

Divergent sequences

The most divergent sequences (most different on average from all of the other sequences) are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first. This may give a better chance of correctly placing the gaps and matching weakly conserved positions against the rest of the sequences. A choice is offered to set a cut-off (default is 40% identity or less with any other sequence) that will delay the alignment of the divergent sequences until all of the rest have been aligned.
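The delay rule can be sketched as a filter over the pairwise identities. The reading assumed here is that a sequence is delayed when its identity to every other sequence is at or below the cut-off. The sequence names and identity values are made up for illustration.

```python
def split_divergent(names, pid, cutoff=40.0):
    """Split sequences into those aligned first and those delayed to the end.

    pid maps frozenset({a, b}) to the per cent identity of the pair a, b.
    """
    delayed = [
        a for a in names
        if all(pid[frozenset((a, b))] <= cutoff for b in names if b != a)
    ]
    first = [a for a in names if a not in delayed]
    return first, delayed


# Hypothetical data set: seq3 is at most 35% identical to anything else.
names = ["seq1", "seq2", "seq3"]
pid = {
    frozenset(("seq1", "seq2")): 85.0,
    frozenset(("seq1", "seq3")): 30.0,
    frozenset(("seq2", "seq3")): 35.0,
}
first, delayed = split_divergent(names, pid)
```

In this example seq1 and seq2 are aligned first and seq3 is added only after the rest have been aligned.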

Software and algorithms

Dynamic programming

The most demanding part of the multiple alignment strategy, in terms of computer processing and memory usage, is the alignment of two (groups of) sequences at each step in the final progressive alignment. To make it possible to align very long sequences (e.g. dynein heavy chains at ~5,000 residues) in a reasonable amount of memory, we use the memory efficient dynamic programming algorithm of Myers and Miller (26). This sacrifices some processing time but makes very large alignments practical in very little memory. One disadvantage of this algorithm is that it does not allow different gap opening and extension penalties at each position. We have modified the algorithm so as to allow this and the details are described in a separate paper (27).

Menus/file formats

Six different sequence input formats are detected automatically and read by the program: EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF (30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL format alignments. The last three formats allow users to read in complete alignments (e.g. for calculating phylogenetic trees or for addition of new sequences to an existing alignment). Alignment output may be requested in standard CLUSTAL format (self-explanatory blocked alignments) or in formats compatible with the GDE, PHYLIP (31) or GCG (30) packages. The program offers the user the ability to calculate Neighbour-Joining phylogenetic trees from existing alignments with options to correct for multiple hits (32,33) and to estimate confidence levels using a bootstrap resampling procedure (34). The trees may be output in the 'New Hampshire' format that is compatible with the PHYLIP package (31).

Alignment to an alignment

Profile alignment is used to align two existing alignments (either of which may consist of just one sequence) or to add a series of new sequences to an existing alignment. This is useful because one may wish to build up a multiple alignment gradually, choosing different parameters manually or correcting intermediate errors as the alignment proceeds. Often, just a few sequences cause misalignments in the progressive algorithm and these can be removed from the process and then added at the end by profile alignment. A second use is where one has a high quality reference alignment and wishes to keep it fixed while adding new sequences automatically.

Portability/availability

The full source code of the package is provided free to academic users. The program will run on any machine with a full ANSI conforming C compiler. It has been tested on the following hardware/software combinations: Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX. The source code and documentation are available by E-mail from the EMBL file server (send the words HELP and HELP SOFTWARE on two lines to the internet address: Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from FTP.EMBL-Heidelberg.DE. Queries may be addressed by E-mail to Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE.

RESULTS AND DISCUSSION

Alignment of SH3 domains

The ~60 residue SH3 domain was chosen to illustrate the performance of CLUSTAL W, as there is a reference manual alignment (23) and the fold is known (24). SH3 domains, with a minimum similarity below 12% identity, are poorly aligned by progressive alignment programs such as CLUSTAL V and PILEUP: neither program can generate the correct blocks corresponding to the secondary structure elements.

Figure 4 shows an alignment generated by CLUSTAL W of the example set of SH3 domains. The alignment was generated in two steps. After progressive alignment, five blocks were produced, corresponding to structural elements, with gaps inserted exclusively in the known loop regions. The β-strands in blocks 1, 4 and 5 were all correctly superposed. However, four sequences in block 2 and one sequence in block 3 were misaligned by 1-2 residues (underlined in Figure 4). A second progressive alignment of the aligned sequences, including the gaps, improved this alignment: a single misaligned sequence, H_P55, remains in block 2 (boxed in Figure 4), while block 3 is now completely aligned. This alignment corrects several errors (e.g. P85A, P85B and FUS1) in the manual alignment (23).

The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly, in a practical application involving divergent sequences, the initial progressive alignment is likely to be a good but not perfect approximation to the correct alignment. The alignment quality can be improved in a number of ways. If the block structure of the alignment appears to be correct, realignment of the alignment will usually improve most of the misaligned blocks: the existing gaps allow the blocks to 'float' cheaply to a locally optimal position without disturbing the rest of the alignment. Remaining sequences which are doubtfully aligned can then be individually tested by profile alignment to the remainder: the misaligned H_P55 SH3 domain can be correctly aligned by profile (with GOP ≤ 8). The indel regions in the final alignment can then be manually cleaned up: usually the exact alignment in the loop regions is not determinable, and may have no meaning in structural terms. It is then desirable to have a single gap per structural loop. CLUSTAL W achieved this for two of the four SH3 loop regions (Figure 4).

If the block structure of the alignment appears suspect, greater intervention by the user may be required. The most divergent sequences, especially if they have large insertions (which can be discerned with the aid of dot matrix plots), should be left out of the progressive alignment. If there are sets of closely related sequences that are deeply diverged from other sets, these can be separately aligned and then merged by profile alignment.

Incorrectly determined sequences, containing frameshifts, can also confound regions of an alignment: these can be hard to detect but sometimes they have been grouped within the excluded divergent sequences: then they may be revealed when they are individually compared to the alignment as having apparently nonsense segments with respect to the other sequences.

Table 1. Pascarella and Argos residue-specific gap modification factors

A  1.13    M  1.29
C  1.13    N  0.63
D  0.96    P  0.74
E  1.31    Q  1.07
F  1.20    R  0.72
G  0.61    S  0.76
H  1.00    T  0.89
I  1.32    V  1.25
K  0.96    Y  1.00
L  1.21    W  1.23

The values are normalised around a mean value of 1.0 for H. The lower the value, the greater the chance of having an adjacent gap. These are derived from the original table of relative frequencies of gaps adjacent to each residue (12) by subtraction from 2.0.
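As a rough illustration of how the factors in Table 1 might be used, the sketch below stores them in a dictionary and scales a gap opening penalty for one alignment column by the mean factor of the residues in that column. The averaging rule, the neutral factor for empty columns, the function names `column_gap_factor` and `scaled_gop`, and the example penalty of 10.0 are assumptions made for illustration; they are not taken from the CLUSTAL W source code.

```python
# Residue-specific gap modification factors copied from Table 1.
GAP_FACTORS = {
    "A": 1.13, "C": 1.13, "D": 0.96, "E": 1.31, "F": 1.20,
    "G": 0.61, "H": 1.00, "I": 1.32, "K": 0.96, "L": 1.21,
    "M": 1.29, "N": 0.63, "P": 0.74, "Q": 1.07, "R": 0.72,
    "S": 0.76, "T": 0.89, "V": 1.25, "Y": 1.00, "W": 1.23,
}


def column_gap_factor(column):
    """Mean Table 1 factor over the residues in one alignment column.

    Gap characters and unknown symbols are ignored. A column with no
    recognised residue gets the neutral factor 1.0 (an assumption).
    """
    factors = [GAP_FACTORS[r] for r in column.upper() if r in GAP_FACTORS]
    return sum(factors) / len(factors) if factors else 1.0


def scaled_gop(base_gop, column):
    """Scale a base gap opening penalty by the column's mean factor."""
    return base_gop * column_gap_factor(column)
```

With a base penalty of 10.0, a column of glycines gives a reduced penalty of about 6.1, while a column of isoleucines gives an increased penalty of about 13.2, in line with the statement that lower values mark residues more likely to have an adjacent gap.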

Finding the best alignment

In cases where all of the sequences in a data set are very similar (e.g. no pair less than 35% identical), CLUSTAL W will find an alignment which is difficult to improve by eye. In this sense, the alignment is optimal with regard to the alternative of manual alignment. Mathematically, this is vague and can only be put on a more systematic footing by finding an objective function (a measure of multiple alignment quality) that exactly mirrors the information used by an 'expert' to evaluate an alignment. Nonetheless, if an alignment is impossible to improve by eye, then the program has achieved a very useful result.
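The rule of thumb given here (no pair less than 35% identical) can be checked on an existing alignment. The sketch below is one simple way to do so: it counts identical residues over the columns where both sequences have a residue and divides by the number of such columns. That choice of denominator is an assumption, since several definitions of percent identity are in use, and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent identity of two aligned sequences of equal length.

    Only columns where neither sequence has a gap are counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def min_pairwise_identity(alignment):
    """Lowest percent identity over all pairs of aligned sequences."""
    return min(percent_identity(a, b) for a, b in combinations(alignment, 2))
```

A data set whose `min_pairwise_identity` is well above 35 would fall into the easy category described in this paragraph; a lower value suggests that the more careful procedures discussed next may be needed.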

In more difficult cases, as more divergent sequences are included, it becomes increasingly difficult to find good alignments and to evaluate them. What we find with CLUSTAL W is that the basic block-like structure of the alignment (corresponding to the major secondary structure elements) is usually recovered, with some of the most divergent sequences misaligned in small regions. This is a very useful starting point for manual refinement, as it helps define the major blocks of similarity. The problem sequences can be removed from the analysis and realigned to the rest of the sequences automatically or with different parameter settings. An examination of the tree used to guide the alignment will usually show which sequences will be most unreliably placed (those that branch off closest to the root and/or those that align to other single sequences at a very low level of sequence identity rather than align to a group of prealigned sequences). Finally, one can simply iterate the multiple alignment process by feeding an output alignment back into CLUSTAL W and repeating the multiple alignment process (using the same or different parameters). The SH3 domain alignment in Figure 4 was derived in this way by 2 passes using default parameters. In the second pass, the local gap penalties are dominated by the placement of the initial major gap positions. The alignment will either remain unchanged or will converge rapidly (after 1 or 2 extra passes) on a better solution. If the placement of the initial gaps is approximately correct but some of the sequences are locally misaligned, this works well.
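The iteration described here amounts to a fixed-point loop: realign the previous output until it stops changing or a pass limit is reached. The sketch below shows only that control flow. The `realign` argument is a hypothetical callable standing in for one CLUSTAL W pass over an existing alignment; nothing here calls the real program, and `max_passes` is an assumed safety limit.

```python
def iterate_alignment(alignment, realign, max_passes=3):
    """Feed an alignment back into `realign` until it stops changing.

    `realign` is a hypothetical function that takes an alignment (a list
    of equal-length strings) and returns a new alignment.
    Returns the final alignment and the number of passes that changed it.
    """
    changed_passes = 0
    for _ in range(max_passes):
        new_alignment = realign(alignment)
        if new_alignment == alignment:
            break  # converged: a further pass would give the same result
        alignment = new_alignment
        changed_passes += 1
    return alignment, changed_passes
```

Stopping on the first unchanged pass reflects the observation that the alignment either remains unchanged or converges within one or two extra passes; the pass limit guards against a realignment step that never settles.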

Comparison with other methods

Recently, several papers have addressed the problem of position-specific parameters for multiple alignment. In one case (35), local gap penalties are increased in α-helical and β-strand regions when the 3-D structures of one or more of the sequences are known. In a second case (36), a hidden Markov model was used to estimate position-specific gap penalties and residue substitution weight matrices when large numbers of examples of a protein domain were known. With CLUSTAL W, we attempt to derive the same information purely from the set of sequences to be aligned. Therefore, we can apply the method to any set of sequences. The success of this approach will depend on the number of available sequences and their evolutionary relationships. It will also depend on the decision making process during multiple alignment (e.g. when to change weight matrix) and the accuracy and appropriateness of our parameterisation. In the long term, this can only be evaluated by exhaustive testing of sets of sequences where the correct alignment (or parts of it) are known from structural information. What is clear, however, is that the modifications described here significantly improve the sensitivity of the progressive multiple alignment approach. This is achieved with almost no sacrifice in speed and efficiency.

Figure 4. CLUSTAL W alignment of a set of SH3 domains taken from Musacchio et al. (23). Secondary structure assignments for the solved Spectrin (24) and Fyn (39) domains are according to DSSP (40). The alignment was generated in two steps using default parameters. After full multiple alignment, the aligned sequences were realigned. Segments which were correctly aligned in the second pass are underlined. The single misaligned segment in H_P55 and the misaligned residue in H_NCK/2 are boxed. The sequences are coloured to illustrate significant features. All G (orange) and P (yellow) are coloured. Other residues matching a frequent occurrence of a property in a column are coloured: hydrophobic = blue; hydrophobic tendency = light blue; basic = red; acidic = purple; hydrophilic = green; unconserved = white. The alignment figure was prepared with the GDE sequence editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, EMBL).

There are several areas where further improvements in sensitivity and accuracy can be made. Firstly, the residue weight matrices and gap settings can be made more accurate as more and more data accumulate, while matrices for specific sequence types can be derived [e.g. for transmembrane regions (37)].

Secondly, stochastic or iterative optimisation methods can be used to refine initial alignments (7,9,10). CLUSTAL W could be run with several sets of starting parameters and in each case, the alignments refined according to an objective function. The search for a good objective function that takes into account the sequence- and position-specific information used in CLUSTAL W is a key area of research. Finally, the average number of examples of each protein domain or family is growing steadily. It is not only important that programs can cope with the large volumes of data that are being generated, they should be able to exploit the new information to make the alignments more and more accurate.
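One widely used objective function is the sum-of-pairs score, which the sketch below uses to pick the best of several candidate alignments, for example ones produced with different starting parameters. This is an illustration only: the text does not specify an objective function, and the scoring values (match +1, mismatch 0, residue against gap -1) and the function names are assumptions chosen for simplicity. They do not use a real substitution matrix, sequence weights or the position-specific gap penalties of CLUSTAL W.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=0, gap_penalty=-1, gap="-"):
    """Unweighted sum-of-pairs score of a multiple alignment.

    `alignment` is a list of equal-length strings. Every pair of symbols
    in every column contributes to the total score.
    """
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == gap and y == gap:
                continue  # gap against gap contributes nothing
            if x == gap or y == gap:
                score += gap_penalty
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


def best_alignment(candidates):
    """Return the candidate alignment with the highest sum-of-pairs score."""
    return max(candidates, key=sum_of_pairs)
```

A score of this kind treats all sequences and positions equally, which is exactly the limitation noted in this paragraph: a more useful objective function would have to take sequence- and position-specific information into account.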

Globally optimal alignments (according to an objective function) may not always be possible, but the problem may be avoided if sufficiently large volumes of data become available. CLUSTAL W is a step in this direction.

ACKNOWLEDGEMENTS

Numerous people have offered advice and suggestions for improvements to earlier versions of the CLUSTAL programs. D.H. wishes to apologise to all of the irate CLUSTAL V users who had to live with the bugs and lack of facilities for getting trees in the New Hampshire format. We wish to specifically thank Jeroen Coppieters who suggested using a series of weight matrices and Steven Henikoff for advice on using the BLOSUM matrices. We are grateful to Rein Aasland, Peer Bork, Ariel Blocker and Bertrand Seraphin for providing challenging alignment problems. T.G. and J.T. thank Kevin Leonard for support and encouragement. Finally, we thank all of the people who have been involved with various CLUSTAL programs over the years, namely Paul Sharp, Rainer Fuchs and Alan Bleasby.

REFERENCES

1. Feng, D.-F. and Doolittle, R.F. (1987) J. Mol. Evol. 25, 351-360.

2. Needleman, S.B. and Wunsch, C.D. (1970) J. Mol. Biol. 48, 443-453.

3. Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352. NBRF, Washington.

4. Henikoff, S. and Henikoff, J.G. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.

5. Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989) Proc. Natl. Acad. Sci. USA 86, 4412-4415.

6. Barton, G.J. and Sternberg, M.J.E. (1987) J. Mol. Biol. 198, 327-337.

7. Gotoh, O. (1993) CABIOS 9, 361-370.

8. Altschul, S.F. (1989) J. Theor. Biol. 138, 297-309.

9. Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992) Nucleic Acids Res. 20, 2511-2516.

10. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Science 262, 208-214.

11. Vingron, M. and Waterman, M.S. (1993) J. Mol. Biol. 234, 1-12.

12. Pascarella, S. and Argos, P. (1992) J. Mol. Biol. 224, 461-471.

13. Collins, J.F. and Coulson, A.F.W. (1987) In Nucleic Acid and Protein Sequence Analysis, A Practical Approach (Bishop, M.J. and Rawlings, C.J., eds), chapter 13, pp. 323-358.

14. Vingron, M. and Sibbald, P.R. (1993) Proc. Natl. Acad. Sci. USA 90, 8777-8781.

15. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CABIOS 10, 19-29.

16. Lüthy, R., Xenarios, I. and Bucher, P. (1994) Protein Sci. 3, 139-146.

17. Higgins, D.G. and Sharp, P.M. (1988) Gene 73, 237-244.

18. Higgins, D.G. and Sharp, P.M. (1989) CABIOS 5, 151-153.

19. Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) CABIOS 8, 189-191.

20. Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. W.H. Freeman, San Francisco.

21. Saitou, N. and Nei, M. (1987) Mol. Biol. Evol. 4, 406-425.

22. Bashford, D., Chothia, C. and Lesk, A.M. (1987) J. Mol. Biol. 196, 199-216.

23. Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992) FEBS Lett. 307, 55-61.

24. Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992) Nature 359, 851-855.

25. Bashford, D., Chothia, C. and Lesk, A.M. (1987) J. Mol. Biol. 196, 199-216.

26. Myers, E.W. and Miller, W. (1988) CABIOS 4, 11-17.

27. Thompson, J.D. (1994) CABIOS submitted for publication.

28. Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981) J. Mol. Evol. 18, 38-46.

29. Pearson, W.R. and Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448.

30. Devereux, J., Haeberli, P. and Smithies, O. (1984) Nucleic Acids Res. 12, 387-395.

31. Felsenstein, J. (1989) Cladistics 5, 164-166.

32. Kimura, M. (1980) J. Mol. Evol. 16, 111-120.

33. Kimura, M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.

34. Felsenstein, J. (1985) Evolution 39, 783-791.

35. Smith, R.F. and Smith, T.F. (1992) Protein Engng 5, 35-41.

36. Krogh, A., Brown, M., Mian, S., Sjölander, K. and Haussler, D. (1994) J. Mol. Biol. 235, 1501-1531.

37. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994) FEBS Lett. 339, 269-275.

38. Bairoch, A. and Boeckmann, B. (1992) Nucleic Acids Res. 20, 2019-2022.

39. Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and Wierenga, R.K. (1993) EMBO J. 12, 2617-2624.

40. Kabsch, W. and Sander, C. (1983) Biopolymers 22, 2577-2637.