• Keine Ergebnisse gefunden

Principles of Sequence Alignment

2.5 Summary

3.1.1 Principles of Sequence Alignment

In the following the fundamentals of Dynamic Programming are briefly summarized with respect to the application of protein sequence analysis. First, the general scheme which produces scored alignments for two strings is introduced. Since the general meaning of sequence data to be aligned is rather important, several different scoring techniques, i.e.

definitions of costs for the particular edit operations, exist. Mostly they are summarized in scoring matrices defining the substitution costs for exchanging symbols (amino acids).

Thus, the principles of such scoring schemes are outlined. Although DP techniques guar-antee optimal alignments scored depending on the substitution scheme actually used, they cannot provide the general decision whether two sequences are related. Even completely

unrelated sequences (e.g. random data) will be aligned optimally in the mathematical sense.

The final classification result based on alignment scores needs to be extracted by means of the analysis of statistical significance of the score. Thus, in the third part of this section, the scoring process is briefly discussed in a more statistical manner.

Dynamic Programming

In his survey of the principles of Dynamic Programming, Sean Eddy gives an interesting explanation of the etymology of the term ’Dynamic Programming’ [Edd04]. The technique itself was formalized in the early 1950s by Richard Bellman, who was working as a math-ematician at RAND Corporation on optimal decision processes. He was searching for an impressive name that would shield his work from US Secretary of Defense Charles Wil-son, who apparently was rather negatively minded concerning mathematical research. Since Bellmann’s work involved time series and planning, he opted for ’dynamic’ and ’program-ming’ which initially had nothing to do with computers. He also liked the impossibility of using the term ’dynamic’ in a pejorative sense. Bellman figured ’Dynamic Programming’

was something not even a congressman could object to. For obvious reasons, the explana-tions regarding Dynamic Programming given in this thesis are focused on its applicaexplana-tions to bioinformatics tasks implying string alignments. However, the original work of Richard Bellman is much more universal [Bel57].

The trivial solution of the alignment problem is the scoring of all possible alignments for two given strings and the selection by means of the optimal, that means the maximal score. Unfortunately, this is computationally not feasible, since there are approximately 22N different alignments for two sequences of lengthN [Edd04]. For sequences consisting of different numbers of residues the situation gets even worse. Thus, more sophisticated methods were developed. The general goal of all DP based techniques discussed here is the mutual alignment of two strings whereas partial results are calculated just-in-time, i.e. they are available at the time they are required for further calculations. Generally, each step of the algorithm reuses results of preceding steps enabling a recursive definition of DP.

The global alignment problem is divided into sub-problems lowering the complexity. For this reason, DP algorithms consist of four parts:

1. A recursive definition of the optimal score,

2. A Dynamic Programming matrix for storing optimal scores of sub-problems,

3. A bottom-up approach for filling the matrix by solving the smallest sub-problem first, and

4. A technique for traceback of the matrix to obtain the global optimal solution.

By means of a practical example in the following the DP algorithm will briefly be outlined.

Therefore, two hypothetical protein sequences~s1 = ACDEF of length N = 5 and~s2 = GAHCDFE consisting ofM = 7residues are considered.

Recursive definition of the optimal score: The global alignment of both sequences

~s1 and~s2 can end in three different ways: (i) the residuess1N ands2M are aligned to each other; (ii)+(iii) either final residue is aligned to a gap character whereas the end-residue of the remaining sequence was already aligned before. The optimal alignment will be the highest scoring of these three cases. Since the global alignment problem breaks into inde-pendently optimizable pieces, the solution of this sub-problem can be generalized recur-sively. As an example the scores of the three cases mentioned can be defined in terms of the optimal alignment of the preceding subsequences (prefixes). The scoreS of case (i) above is the scoreS(s1N, s2M)for alignings1N ands2M plus the scoreS(~s11...N−1, ~s21...M−1)for an optimal alignment of everything else up to this point. For the remaining cases (ii) and (iii) above, gap-penaltiesγ, i.e. negative scores penalizing the insertion of gaps, are added to the scoresS(~s11...N−1, ~s21...M)andS(~s11...N, ~s21...M−1), respectively. Consequently, the recursive definition of the optimal alignment score can be formulated as follows:

S(s1i, s2j) = max





S(s1i−1, s2j−1) +σ(s1i, s2j), S(s1i−1, s2j) +γ,

S(s1i, s2j−1) +γ,

whereσ(s1i, s2j)designates the score for aligning two residuess1i ands2j. For simplicity of argumentation here identical scoresσfor all substitutions of any residues, and identical gap-penalties γ are assumed. In fact, the actual adjustment of these scores is rather cru-cial regarding the quality of the resulting alignment. Thus, substantial differences between various alignment algorithms are mostly based on sophisticated selections of these scores (see pages 28ff for details). The recursion is followed until trivial alignment problems with obvious solutions are reached (aligning nothing to nothing produces a scoreS(, )of zero).

DP matrix for storing optimal scores of sub-problems: A principle problem for recursive algorithms is the very probable wastage of computational time due to multiple calculations of sub-problems occurring more than once. The deeper the recursion is moved into, the worse the situation will become. One obvious solution for such problems is to keep track of intermediate results. The fundamental difference between simple recursion and DP techniques is the use of a matrix for memorizing results, so that each sub-problem is solved only once. Structurally, the DP matrix consists of a tabular arrangement of the sequences to be aligned (cf. e.g. figure 3.2). Step-by-step every cell is filled with local alignment scores as described in the following.

Bottom-up approach matrix filling: Based on the recursive definition of the optimal alignment score, the boundary conditions for the leftmost column and the topmost row of the DP matrix are already known:S(, ) = 0; S(s1i, ) =γ; S(, s2j) =γ. For all trivial alignment cases the scores mentioned are used for filling the relevant parts of the matrix.

Now, in a bottom-up way, from smallest problems to progressively bigger problems, the matrix is filled by evaluating the recursive definition of alignment scores. Since all results are stored inside the matrix, redundant calculations are avoided. For the global alignment of both strings at every step a traceback-pointer is kept designating the predecessor which leads to the locally optimal score of the appropriate step.

Traceback of the matrix to obtain the global optimal solution: The final global alignment of both strings considered can be obtained after the DP matrix has been filled completely. Starting in cell(N, M), i.e. at the lower-right corner of the matrix, determines the beginning of the traceback for uncovering the global alignment. The traceback follows the pointer yielding the maximal score for the next step. Note, that equal scores may occur.

Here, the actual successor in the global alignment will be determined application dependent.

The algorithm stops, when the traceback arrives at the upper-left corner of the matrix.

In figure 3.2 the optimum alignment of the exemplary sequences~s1 =ACDEF and~s2 = GAHCDFE is outlined. They are arranged in tabular form as described above and the DP matrix is filled using the recursive definition of the alignment score. Here an exemplary scoring system of +5 for a match, -2 for a mismatch and -6 for each insertion or deletion is applied. The red cells represent the final global optimal path for the string alignment which is shown below the matrix yielding the score of -1.

E C

A H

G D F

E C

A H

G D F

−6 +5 −2

F C

A D E

A C D E

F N=5

4 3 2 1 0

1

0 2 3 4 5 6 7=M

−12 −18 −24 −30 −36 −42

−6

−12

−18

−24

−30

−13 −19 −25 −31

−4 −8 −14 −20

−14

−8

−10 −6 −5 −9

−20 −16 −12 −8 −3

−26 −22 −18 −14 −9 +2

−2

−3

−4

−15

−1 +1 +3

−2

−7

−1

−6 0

−2 +5

+5 −6

Optimum alignment score: −1 i

j (~s2)

(~s1)

Figure 3.2: Global sequence alignment by applying the Dynamic Programming technique: By means of a recursive definition of the alignment score the DP matrix is filled and the global optimal path is extracted by traceback (red cells). The final alignment is shown below the matrix. Scoring system:

match +5, mismatch -2, insertion/deletion -6.

According to the authors first using this dynamic programming approach for global align-ments of protein sequences, this technique is mostly referred to as Needleman-Wunsch al-gorithm [Nee70, Got82]. Its alal-gorithmic complexity can be approximated byO(N M)for both memory and time requirements.

The previously described global alignment algorithm is widely used for protein classifi-cation tasks. Here, complete sequences are mutually aligned which is very useful for seg-mented data, i.e. sequences containing well defined start and end positions. Contrary to this, in molecular biology applications the alignment of parts of sequences is often demanded.

Methods tackling the problem of finding partial matches are usually called local-alignment

techniques. The local version of Dynamic Programming techniques was developed in the early 1980s and is usually known as Smith-Waterman algorithm according to its inventors [Smi81, Got82]. The method is closely related to the algorithm described above. Basically, there are two major differences. First, a boundary of the locally optimal scores is introduced allowingS(s1i, s2j)to become zero if all other options have values less than 0:

S(s1i, s2j) = max







 0,

S(s1i−1, s2j−1) +σ(s1i, s2j), S(s1i−1, s2j) +γ,

S(s1i, s2j−1) +γ.

Reaching the boundary, i.e. the 0-case, implies the termination of a local alignment. Due to the zero option the initial filling of the DP-matrix is slightly changed. Instead of inserting theγ-values, these cells are filled with 0.

The second difference of local alignment methods compared to the general DP approach is that now an alignment can start anywhere in the matrix. Instead of taking the value in the bottom right corner for the best score, now the global maximum is searched over the whole matrix. The traceback ends if a cell containing 0 is met. Basically, multiple local maxima can occur and in fact one refinement of standard local alignment techniques is to process all high-scoring fragments of a pairwise alignment. However, the basic technique remains the same. Thus, all appropriate argumentation given here is directed to the principle procedure.

In figure 3.3 the local alignment of the sequences given above is shown.

E C

A H

G D F

0 0

0 0

0 0 0

0 0 +5

+10

C D

+5 A

C D E F N=5

4 3 2 1 0

1

0 2 3 4 5 6 7=M

+1 +4

+4 +9 +3

+9

+5 +5

C D

0

+8 +6

0 0 0 0 0 0 0

0 0 0 0 0

0 0 0 0 0

0 0 0

0 0 0 0 0

0 0

Optimum alignment score: +10

j (~s2)

i

(~s1)

Figure 3.3: Local sequence alignment by applying the Smith-Waterman algorithm: The final alignment starts at the global maximum within the whole matrix and ends at the first cell containing a value of 0.

Again, the optimal path is extracted by traceback (red cells). The final alignment is shown below the matrix. Scoring system: match +5, mismatch -2, insertion/deletion -6.

The alignment algorithms outlined here are the base for numerous refined approaches.

Important enhancements are the possibility of finding multiple local matches or of finding overlapping matches. Since the focus of the thesis concentrates on probabilistic models of sequence families no exhaustive review of such refinements will be given here. Instead, the reader is referred to the tremendous amount of specialized publications addressing such refinements. A good starting point for this is the standard work of Richard Durbin and colleagues [Dur98]. However, due to their enormous importance for nowadays’ molecu-lar biology research, in section 3.1.2 two of the most prominent heuristic approaches for bounding the computational complexity of DP techniques will be presented. In the follow-ing, the parameterization of the general DP framework, i.e. the adjustment of the particular scores for edit operations, is explained.

Scoring Matrices

For simplicity of argumentation the basic sequence to sequence alignment techniques the previous sections dealt with were described using fixed scores for the substitution of residues.

When applying these sequence alignment techniques to practical applications of molec-ular biology, this simplification does not usually hold. Instead, the plain algorithms are used as some kind of general frameworks for sequence alignment. Depending on the actual application, i.e. the particular data inspected and the target domain of the investigations, these frameworks need to be configured wisely. Configuration here means, the adjustment of specified substitution scores for the edit operations involved in Dynamic Programming.

The alignment procedures outlined above used identical scores (+5 for match and -2 for mismatch) for all arbitrary residue alignments. According to biological realities this adjust-ment is insufficient since it assumes the same probabilities for substituting any amino acid with any other. Obviously, this does not hold true for real applications. It is quite unreal-istic to score the substitution of residues containing similar biochemical properties in the same way as the substitution of amino acids that completely differ. Thus, usually the sub-stitution scores are adjusted in a per residue manner. These specific scores are arranged in tabular form in so-called substitution matrices. Recently, Shuichi Kawashima and Minoru Kanehisa compiled 71 mutation matrices [Kaw00].

According to the two main categories for sequence alignment applications, two general types of scoring matrices were created:

Reconstruction of evolutionary processes: By means of scores contained in matri-ces addressing the reconstruction of evolutionary promatri-cesses mutation rates are rep-resented. Usually, they are constructed based on the analysis of sequences and their reconstructed ancestors.

Comparison of protein domains: Here, the scores are based on the composition of protein domains at hand or close relatives of them. Usually, these entries are calcu-lated by means of substitution frequencies which can either be measured in wet lab experiments or via information theoretical approaches.

Throughout the years, a large variety of specific matrices were created for both categories.

In figure 3.4 the most prominent matrices for both types are shown. These matrices are

sym-C 12 G -3 5 P -3 -1 6 S 0 1 1 1 A -2 1 1 1 2 T -1 0 0 1 1 3 D -5 1 -1 0 0 0 4 E -5 0 -1 0 0 0 3 4 N -4 0 -1 1 0 0 2 1 2 Q -5 -1 0 -1 0 -1 2 2 1 4 H -3 -2 0 -1 -1 -1 1 1 2 3 6 K -5 -2 -1 0 -1 0 0 0 1 1 0 5 R -4 -3 0 0 -2 -1 -1 -1 0 1 2 3 6 V -2 -1 -1 -1 0 0 -2 -2 -2 -2 -2 -2 -2 4 M -5 -3 -2 -2 -1 -1 -3 -2 0 -1 -2 0 0 2 6 I -2 -3 -2 -1 -1 0 -2 -2 -2 -2 -2 -2 -2 4 2 5 L -6 -4 -3 -3 -2 -2 -4 -3 -3 -2 -2 -3 -3 2 4 2 6 F -4 -5 -5 -3 -4 -3 -6 -5 -4 -5 -2 -5 -4 -1 0 1 2 9 Y 0 -5 -5 -3 -3 -3 -4 -4 -2 -4 0 -4 -5 -2 -2 -1 -1 7 10 W -8 -7 -6 -2 -6 -5 -7 -7 -4 -5 -3 -3 -2 -6 -4 -5 -2 0 0 17

C G P S A T D E N Q H K R V M I L F Y W

PAM250

A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

BLOSUM62

Figure 3.4: Two major scoring-matrices: PAM250 (left) and BLOSUM62 (right). The amino acids of PAM250 are ordered according to their biochemical properties – related amino acids are adja-cent.

metric, implying identical scores for both directions of residue substitutions. In addition to this, occasionally non-symmetric matrices are used (cf. e.g. [Lin01]). On the left-hand side the widely used PAM250 matrix [Day78] is shown for evolutionary motivated scoring ma-trices. It is one member of the family of PAM mama-trices. Generally, PAM is an acronym for

“point accepted mutations” or “percent accepted mutations” and designates a measure for evolutionary divergence (distance) between two amino acid sequences. The general defini-tion of PAM is as follows (cf. [Mer03, p.112]):

“Two sequences~s1 and~s2 differ by one PAM unit, if~s2 developed from~s1 due to a sequence of accepted point mutations whereas per 100 residues one point mutation occurred on average.”

PAMn matrices are created by comparing protein sequences diverging by n PAM units.

The substitution scores are adjusted according to the expected frequencies of the appro-priate residue change which is measured in various experiments. In figure 3.5 the global alignment of the two exemplary sequences given above using the PAM250 substitution ma-trix is illustrated. Obviously, the alignment has slightly changed at its end resulting in a global score of +3.

At the right-hand side of figure 3.4, the BLOSUM62 (block substitution) matrix as one prominent member of the second general category of scoring matrices is shown [Hen92].

Such matrices are created by counting the substitution frequencies of all amino acids when analyzing sequences of related protein domains. For the creation of the BLOSUM62 matrix Steven and Jorja Henikoff analyzed blocks (conserved parts of protein families extracted from multiple alignments without gaps) contained in the BLOCKS database [Hen99, Hen00]. By means of a heuristic procedure of block analysis, the scores are ex-tracted using the following definition [Mer03, p.114]:

“Letf(ai)be the frequency of a residueai occurring at all positions within blocks of the BLOCKS database. Letf(ai, aj)be the frequency of the column-wise occurrences of pairsai, aj. Then the scoreSai,aj can be defined as:

Sai,aj = f(ai, aj) f(ai)f(aj).”

A common refinement of the extraction process is the restriction of the analysis to sequences with similarities less than or equal to a certain percentage. For the BLOSUM62 matrix this implies that all sequences analyzed for estimating the substitution scores must not have similarities larger than 62 percent.

E C

A H

G D F

−6

E

D

G A H C F

+12 0

+1 −4 −10

+2 +6

0

−6

−8

−5

−4

0 0

−5

−28

+1

−6

A C D E F

−6 +4

−6 +2 −6 +9

+3 +4

−7

+9

−6

−10

−17

−23

−5

−11

−17

−4 A

C D E F N=5

4 3 2 1 0

1

0 2 3 4 5 6 7=M

−12 −18 −24 −30 −36 −42

−6

−12

−18

−24

−30

−16 −22 −34

−1 −4 −10 −16

−11

Optimum alignment score: +3 i

j (~s2)

(~s1)

Figure 3.5: Global sequence alignment by applying the Dynamic Programming technique using the PAM250 scoring matrix and -6 for deletion/insertion.

In correspondence with the biological relevance, substitution matrices were studied in detail in recent years. From a more theoretical point of view, the underlying statistics of scoring schemes is well defined and formulated in terms of information theory (for details cf. e.g. [Alt91, Hen96b]).

Gap penalties

In addition to the scoring of substituting and matching residues of two sequences to be aligned, the general Dynamic Programming approach contains penalties for deleting and inserting symbols – so-called gap penalties γ. In the explanations and examples given so far, fixed penalties were assumed. For the alignment of the two artificial protein sequences

~s1 =ACDEF and~s2 =GAHCDFE, the gap penalties were adjusted to a value of -6. Due to the trivial alignment task all gaps had the length of one. Principally, gaps span arbitrary

numbers of residues within a particular alignment. Thus, the mathematically correct def-inition of gap penalties is a cost function γ(g) of the gap length g. Note that in order to keep consistency, the gap penalties for single insertions/deletions will still be calledγ and the correct definition of the cost function depending on the gap lengthg will similarly be defined asγ(g).

Applying constant gap penalties as in the examples given above implies a linear cost function for gaps of lengthg:

γ(g) =gγ.

Usually, i.e. for alignments of sequences containing several hundreds of residues, an alter-native definition of the gap costs as an affine function is used:

γ(g) =γ−(g−1)e,

whereγ is called the gap-open penalty andeis called the gap-extension penalty. Depend-ing on the expectations for the characteristics of gaps in an alignment, both penalties are set individually taking into account probabilities for gaps and their lengths. Usually,eis set to something less than the gap-open penaltyγ, resulting in smaller costs for long insertions and deletions than when using linear penalty functions. The reason for this is that the “bi-ological” probability for single long gaps in protein sequence alignments is higher than for many small gaps.

Usually, gap penaltiesγare adjusted independent of the residues the gap contains. Addi-tionally, specialized residue specific penalties can be used if certain expectations about gap characteristics are available (e.g. due to prior knowledge).

Alignment Scores

Both sequences used so far for explaining pairwise alignment approaches do not originate from any real protein – they are artificial examples. Nevertheless, all algorithms delivered solutions to the general alignment problem by revealing the optimal sequences of edit opera-tions resulting in alignment scores. In the next step, by means of these scores a classification decision needs to be taken which is biologically reasonable giving evidence for homology.

Log-odd Scores: Formally, the biologically driven decisions regarding homology via alignment scores can be performed using a well defined statistical framework by comparing two hypotheses. By means of the null hypothesis two sequences are assumed to be non-homologue which implies a random alignmentR. The probability of a random alignment of two sequences~s1 and~s2 is just the product of the probabilitiesp(·)for the occurrences of each amino acids1i ands2i:

P(~s1, ~s2|R) =

N

Y

i=1

p(s1i)

N

Y

i=1

p(s2i). (3.2)

Contrary to this, in the match hypothesis, aligned pairs of residues occur with a joint prob-abilityp(s1i, s2i)which can be understood as the probability that both residuess1i ands2i

were derived independently from their common (unknown) ancestor. The probability of this match alignmentM can be formulated as follows:

P(~s1, ~s2|M) =

N

Y

i=1

p(s1i, s2i). (3.3)

The homology decision can now be based on the ratio of both likelihoods which is called the odds ratio:

P(~s1, ~s2|M) P(~s1, ~s2|R) =

N

Q

i=1

p(s1i, s2i)

N

Q

i=1

p(s1i)

N

Q

i=1

p(s2i)

=

N

Y

i=1

p(s1i, s2i) p(s1i)p(s2i).

In order to obtain an additive scoring scheme (which is easier to handle) the logarithm is taken resulting in the log-odds ratioS(~s1, ~s2), a sum of individual scoresS(s1i, s2i):

S(~s1, ~s2) =

N

X

i=i

S(s1i, s2i) (3.4)

S(s1i, s2i) = log p(s1i, s2i)

p(s1i)p(s2i). (3.5)

In fact, scoring schemes like BLOSUM or PAM are usually based on log-odds ratios where biological expertise is considered for the adjustments of particular probabilities of the amino acids.

Extended Bayesian Alignment Scores: The general idea of log-odds scoring as defined in equations 3.4 and 3.5 is the comparison of two hypotheses which generally means the comparison of two models: the random model R and the match model M. Basically, the posterior probability P(M|~s1, ~s2) that the match model M is correct, is required for the assessment of sequential relationships. By means of Bayes’ rule, the prior probabilities P(M)andP(R) = 1−P(M), and equations 3.4 and 3.5, this posterior probability can be derived as follows [Dur98, p.36 f]:

P(M|~s1, ~s2) = P(~s1, ~s2|M)P(M) P(~s1, ~s2)

= P(~s1, ~s2|M)P(M)

P(~s1, ~s2|M)P(M) +P(~s1, ~s2|R)P(R)

=

P(~s1,~s2|M)P(M) P(~s1,~s2|R)P(R)

1 + PP(~(~ss1,~s2|M)P(M)

1,~s2|R)P(R)

.

When using the following definitions:

S0 =S+ log

P(M) P(R)

where S = log

P(~s1, ~s2|M) P(~s1, ~s2|R)

, (3.6)

then the posterior probability that the match model is correct, results in P(M|~s1, ~s2) =σ(S0), σ(x) = ex

1 +ex. Equation 3.6 implies adding the prior log-odds ratio,log

P(M) P(R)

, to the general alignment score S, which corresponds to multiplying the likelihood ratio by the prior odds ratio. If all expressions used are converted into real probabilities (which is difficult for the scoring scheme itself), the resulting value can principally be compared to 0, indicating whether the sequences are related. Especially when searching a database, i.e. analyzing a large number of alignments for significant matches, the prior odds ratio becomes important. Here, care needs to be taken, since the larger the number of sequences compared, the larger the proba-bility of significant hits (falsely) obtained by chance. Thus, one solution is the comparison of P(M|~s1, ~s2) with log(N) (instead of 0), where N is the number of sequences in the database.

Statistics of Alignment Scores: Considering significance of alignment scores can also be performed using a classical statistical framework. The scores delivered by an align-ment approach depend on several parameters. Among others these are the length of the sequences and the characteristics of the scoring scheme itself. As stated above, for database searches the number of sequences analyzed is important for the distribution of the scores.

Thus, it is reasonable to determine an expectation valueEfor the number of alignments ac-tually scored withS. Depending on the scoring scheme used which implies specific values for the two constantsκandλ, this expectation value for two sequences of lengthN andM (where for database searches usuallyN M) can be described as follows (E-value):

E(S) =κM N e−λS. (3.7)

The number of alignments producing a score≥Sis determined by a Poisson distribution:

P(S ≥S0) = 1−e−E(S0).

Thus, the distribution of the alignment scores can be described by an extreme value distri-bution (EVD):

F(x) = 1−ee

x−µ β .

If the probability of the number of sequences randomly producing the same or a larger score than the actually observed score is small, then the observation is considered significant. As an example for database search applications, an E-value of 9 is interpreted as follows: It is expected that for a database of a given size M, nine alignments will be randomly scored with the same alignment score. Thus, the smaller the E-value is, the more significant the alignment score will be.

Generally the description of the scores’ distribution by means of an EVD is analytically proved only for local alignments without gaps [Kar90, Alt96], but in various simulations strong evidence for the validity of the theory for gapped local and global alignments could be given. This has been only a brief overview of alignment statistics relevant for this thesis.

A more detailed explanation is given by David Mount in [Mou04].