Sequence Alignment

A fundamental task in computational biology is the alignment of DNA or amino acid sequences. The primary objective of biological sequence alignment is to find positions in the sequences that are homologous; that is, the symbols at those positions are derived from the same position in some ancestral sequence. Due to evolution, two homologous positions may carry different states and may be located at different positions. An alignment is biologically correct if it matches up all positions that are truly homologous. Unfortunately, the biological truth is in most cases unknown. Therefore, a guess is made by treating sequence alignment as an optimization problem. The corresponding objective function assigns a score to each alignment according to a scoring scheme, and an alignment is sought that maximizes this score. But which scoring scheme should be used to predict biologically correct alignments? To answer this, the scoring scheme needs to be analyzed over all possible parameter values.

This chapter introduces an algebraic statistical model for pairwise sequence alignment. The interpretation of its marginal probabilities in the tropical algebra will lead to a formalization of the alignment problem as an optimization problem, and the interpretation in the polytope algebra will allow us to analyze scoring schemes over all possible parameter values.

5.1 Sequence Alignment

We take a finite alphabet Σ with l letters and an additional symbol “−”, called the blank, and call Σ ∪ {−} the extended alphabet. We consider two sequences σ¹ = σ¹_1 … σ¹_m and σ² = σ²_1 … σ²_n over the alphabet Σ.

An alignment of the sequences σ¹ and σ² is a pair of aligned sequences (µ¹, µ²) over the extended alphabet Σ ∪ {−} such that µ¹ and µ² have the same length and are copies of σ¹ and σ², respectively, with inserted blanks. An alignment (µ¹, µ²) must not have blanks in both sequences at the same position.

It follows that the aligned sequences have length at most m + n, since every position of an alignment holds at least one of the m + n letters of σ¹ and σ².
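The definition can be checked mechanically. The following Python sketch is an illustration rather than part of the text: the function name `is_alignment` and the use of the ASCII character "-" for the blank are assumptions made here.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def is_alignment(mu1: str, mu2: str, sigma1: str, sigma2: str) -> bool:
    """Check whether (mu1, mu2) is an alignment of (sigma1, sigma2).

    The three conditions mirror the definition: equal length, the aligned
    sequences are copies of the originals with inserted blanks, and no
    position carries a blank in both sequences.
    """
    if len(mu1) != len(mu2):
        return False
    if mu1.replace(BLANK, "") != sigma1 or mu2.replace(BLANK, "") != sigma2:
        return False
    return all((a, b) != (BLANK, BLANK) for a, b in zip(mu1, mu2))


# An alignment of maximal length m + n: no letter of sigma1 faces a letter of sigma2.
s1, s2 = "ACGTAGC", "ACCGAGACC"
longest = (s1 + BLANK * len(s2), BLANK * len(s1) + s2)
print(is_alignment(*longest, s1, s2), len(longest[0]))
```

Any pair that passes this check has at most len(sigma1) + len(sigma2) positions, in line with the bound stated above.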

Example 5.1. Consider the sequences σ¹ = ACGTAGC and σ² = ACCGAGACC. An alignment of these sequences is given by

µ¹ = A C − G − T A − G C
µ² = A C C G A G A C − C

An alignment of maximal length is

A C G T A G C − − − − − − − − −
− − − − − − − A C C G A G A C C

♦

An alignment of a pair of sequences (σ¹, σ²) can also be represented by a string h over the edit alphabet {H, I, D}. The string h is called an edit string, and the letters of the edit alphabet stand for homology (H), insertion (I), and deletion (D). A letter I stands for an insertion (indel) in the first sequence σ¹, that is, a blank in µ¹ opposite a symbol of σ²; a letter D stands for a deletion (indel) in the first sequence σ¹, that is, a symbol of σ¹ opposite a blank in µ²; and a letter H stands for a character change (mutation or mismatch), including the identity change (match). We write #H, #I, and #D for the respective numbers of instances of H, I, and D in an edit string for an alignment of the pair (σ¹, σ²). Then we have

#H + #D = m and #H + #I = n. (5.1)

Example 5.2. Reconsider the sequences σ¹ = ACGTAGC and σ² = ACCGAGACC. An alignment of these sequences together with its edit string is given by

h  = H H I H I H H I D H
µ¹ = A C − G − T A − G C
µ² = A C C G A G A C − C

We have #H = 6, #I = 3, and #D = 1. ♦
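Reading an edit string off an alignment is a position-by-position scan. The sketch below is an illustration only; the function name `edit_string` and the ASCII "-" for the blank are assumptions, not notation from the text.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def edit_string(mu1: str, mu2: str) -> str:
    """Return the edit string over {H, I, D} of the alignment (mu1, mu2)."""
    if len(mu1) != len(mu2):
        raise ValueError("aligned sequences must have the same length")
    letters = []
    for a, b in zip(mu1, mu2):
        if a == BLANK and b == BLANK:
            raise ValueError("blanks at the same position are not allowed")
        if a == BLANK:
            letters.append("I")  # symbol of sigma2 opposite a blank in mu1
        elif b == BLANK:
            letters.append("D")  # symbol of sigma1 opposite a blank in mu2
        else:
            letters.append("H")  # match or mismatch
    return "".join(letters)


h = edit_string("AC-G-TA-GC", "ACCGAGAC-C")
counts = {c: h.count(c) for c in "HID"}
print(h, counts)
```

For the alignment of Example 5.2, the counts satisfy (5.1) with m = 7 and n = 9.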

Proposition 5.3. A string over the edit alphabet {H, I, D} represents an alignment of an m-letter sequence σ¹ and an n-letter sequence σ² if and only if (5.1) holds.

Proof. Suppose an alignment of the pair (σ¹, σ²) is given. We form an edit string h from left to right. Each symbol in σ¹ either corresponds to a symbol in σ², in which case we record an H in the edit string, or it gets deleted, in which case we record a D. This shows that the first equation in (5.1) holds. Each symbol in σ² either corresponds to a symbol in σ¹, in which case we already recorded an H in the edit string, or it gets inserted, in which case we record an I. This shows that the second equation in (5.1) holds.

Conversely, each edit string h with the property (5.1), when read from left to right, produces an alignment of the pair (σ¹, σ²). ⊓⊔
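The converse direction of the proof can be made concrete: reading h from left to right, each letter consumes the next symbol of σ¹, of σ², or of both. This is a minimal sketch under the assumption that the blank is written as ASCII "-"; the function name is hypothetical.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def alignment_from_edit_string(h: str, sigma1: str, sigma2: str) -> tuple[str, str]:
    """Build the alignment (mu1, mu2) that the edit string h represents."""
    if set(h) - set("HID"):
        raise ValueError("h must be a string over the edit alphabet {H, I, D}")
    m, n = len(sigma1), len(sigma2)
    if h.count("H") + h.count("D") != m or h.count("H") + h.count("I") != n:
        raise ValueError("h does not satisfy (5.1)")
    i = j = 0  # next unread positions in sigma1 and sigma2
    mu1, mu2 = [], []
    for letter in h:
        if letter == "H":    # pair the next symbols of both sequences
            mu1.append(sigma1[i])
            mu2.append(sigma2[j])
            i, j = i + 1, j + 1
        elif letter == "D":  # symbol of sigma1 opposite a blank
            mu1.append(sigma1[i])
            mu2.append(BLANK)
            i += 1
        else:                # "I": blank opposite a symbol of sigma2
            mu1.append(BLANK)
            mu2.append(sigma2[j])
            j += 1
    return "".join(mu1), "".join(mu2)


print(alignment_from_edit_string("HHIHIHHIDH", "ACGTAGC", "ACCGAGACC"))
```

Because (5.1) is checked up front, the indices i and j never run past the ends of the sequences, and both are fully consumed when the loop ends.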

We write A^{m,n} for the set of all strings over the edit alphabet {H, I, D} that satisfy the equations (5.1). We call A^{m,n} the set of all alignments of the sequences σ¹ and σ², in spite of the fact that it depends only on m and n and not on the specific sequences. The cardinality of the set A^{m,n} is called a Delannoy number (Fig. 5.2).

Proposition 5.4. The cardinality of the set A^{m,n} can be computed as the coefficient of the monomial x^m y^n in the generating function 1/(1 − x − y − xy).

Proof. Consider the expansion of the generating function

1/(1 − x − y − xy) = Σ_{m=0}^{∞} Σ_{n=0}^{∞} a_{m,n} x^m y^n. (5.2)


The coefficients are characterized by the linear recurrence

a_{m,n} = a_{m−1,n} + a_{m,n−1} + a_{m−1,n−1}, m ≥ 0, n ≥ 0, m + n ≥ 1, (5.3)

with initial conditions a_{0,0} = 1, a_{m,−1} = 0, and a_{−1,n} = 0. The same recurrence holds for the cardinality of A^{m,n}. To see this, note that for nonnegative integers m and n with m + n ≥ 1, each string in A^{m,n} is either a string in A^{m−1,n−1} followed by an H, or a string in A^{m−1,n} followed by a D, or a string in A^{m,n−1} followed by an I (Fig. 5.1). Moreover, A^{0,0} has only one element, the empty string, and A^{m,n} is the empty set if m < 0 or n < 0. Thus the coefficient a_{m,n} and the cardinality of A^{m,n} satisfy the same initial conditions and the same recurrence. It follows that they must be equal. ⊓⊔
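The proof states the recurrence (5.3) without deriving it from (5.2). One way to see it, sketched here with the convention that a_{m,n} = 0 whenever an index is negative, is to multiply (5.2) by the denominator and collect terms:

```latex
\begin{align*}
1 &= (1 - x - y - xy) \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} a_{m,n}\, x^m y^n \\
  &= \sum_{m=0}^{\infty} \sum_{n=0}^{\infty}
     \bigl( a_{m,n} - a_{m-1,n} - a_{m,n-1} - a_{m-1,n-1} \bigr)\, x^m y^n .
\end{align*}
```

Comparing the constant terms gives a_{0,0} = 1, and comparing the coefficients of x^m y^n for m + n ≥ 1 gives 0 = a_{m,n} − a_{m−1,n} − a_{m,n−1} − a_{m−1,n−1}, which is the recurrence (5.3).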

h  = … H          h  = … D                h  = … I
σ¹ = … σ¹_m       σ¹ = … σ¹_m             σ¹ = … σ¹_m − … −
σ² = … σ²_n       σ² = … σ²_n − … −       σ² = … σ²_n

Fig. 5.1. Three possibilities for strings in A^{m,n}.

a_{m,n}:

m \ n  0   1    2      3      4       5        6        7          8          9         10
    0  1   1    1      1      1       1        1        1          1          1          1
    1  1   3    5      7      9      11       13       15         17         19         21
    2  1   5   13     25     41      61       85      113        145        181        221
    3  1   7   25     63    129     231      377      575        833      1,159      1,561
    4  1   9   41    129    321     681    1,289    2,241      3,649      5,641      8,361
    5  1  11   61    231    681   1,683    3,653    7,183     13,073     22,363     36,365
    6  1  13   85    377  1,289   3,653    8,989   19,825     40,081     75,517    134,245
    7  1  15  113    575  2,241   7,183   19,825   48,639    108,545    224,143    433,905
    8  1  17  145    833  3,649  13,073   40,081  108,545    265,729    598,417  1,256,465
    9  1  19  181  1,159  5,641  22,363   75,517  224,143    598,417  1,462,563  3,317,445
   10  1  21  221  1,561  8,361  36,365  134,245  433,905  1,256,465  3,317,445  8,097,453

Fig. 5.2. The Delannoy numbers a_{m,n} for 0 ≤ m, n ≤ 10.
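The entries of Fig. 5.2 can be recomputed from the recurrence (5.3). The following is a minimal sketch; the function name `delannoy_table` is chosen here for illustration, and values with a negative index are treated as 0, as in the initial conditions.

```python
def delannoy_table(m_max: int, n_max: int) -> list[list[int]]:
    """Return a with a[m][n] = a_{m,n} for 0 <= m <= m_max, 0 <= n <= n_max."""
    a = [[0] * (n_max + 1) for _ in range(m_max + 1)]
    a[0][0] = 1  # initial condition a_{0,0} = 1
    for m in range(m_max + 1):
        for n in range(n_max + 1):
            if m == 0 and n == 0:
                continue
            # Terms with a negative index contribute 0.
            from_d = a[m - 1][n] if m > 0 else 0
            from_i = a[m][n - 1] if n > 0 else 0
            from_h = a[m - 1][n - 1] if m > 0 and n > 0 else 0
            a[m][n] = from_d + from_i + from_h
    return a


table = delannoy_table(10, 10)
for row in table:
    print(" ".join(f"{value:>9,}" for value in row))
```

The table is filled row by row, so every term on the right-hand side of (5.3) is already available when a[m][n] is computed.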

The alignment graph of an m-letter sequence and an n-letter sequence is a directed graph G^{m,n} on the set of nodes {0, 1, …, m} × {0, 1, …, n} with three classes of edges: edges (i, j) → (i, j + 1) are labelled I, edges (i, j) → (i + 1, j) are labelled D, and edges (i, j) → (i + 1, j + 1) are labelled H.

Proposition 5.5. The set of all alignments A^{m,n} corresponds one-to-one with the set of all paths from the node (0, 0) to the node (m, n) in the alignment graph G^{m,n}.

Proof. Let an alignment in A^{m,n} be given by the edit string h. The string h provides a path in G^{m,n} starting from the node (0, 0). By (5.1), this path terminates at the node (m, n).

Conversely, let a path in G^{m,n} from (0, 0) to (m, n) be given. The labelling of the path provides a string h over the edit alphabet that satisfies (5.1). By Prop. 5.3, the string h is an edit string corresponding to an alignment in A^{m,n}. ⊓⊔
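For small m and n the correspondence can be checked by brute force: enumerate all labelled paths from (0, 0) to (m, n) and compare their number with the Delannoy number. The sketch below is illustrative only; the names `STEPS` and `paths` are assumptions, and the enumeration is exponential, so it is meant for tiny graphs.

```python
from typing import Iterator

# Edge labels of the alignment graph and the step (di, dj) each label takes.
STEPS = {"H": (1, 1), "D": (1, 0), "I": (0, 1)}


def paths(m: int, n: int, i: int = 0, j: int = 0, prefix: str = "") -> Iterator[str]:
    """Yield the label string of every path from node (i, j) to node (m, n)."""
    if (i, j) == (m, n):
        yield prefix
        return
    for label, (di, dj) in STEPS.items():
        if i + di <= m and j + dj <= n:  # stay inside the node set
            yield from paths(m, n, i + di, j + dj, prefix + label)


all_paths = list(paths(3, 3))
print(len(all_paths))
```

Every label string produced this way satisfies (5.1), because each H or D edge increases the first coordinate by one and each H or I edge increases the second.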

Example 5.6. Consider the sequences σ¹ = ACG and σ² = ACC. The edit string h = HHDI provides the alignment

h  = H H D I
µ¹ = A C G −
µ² = A C − C

This alignment can be traced by the solid path in the alignment graph G^{3,3} (Fig. 5.3). ♦
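The nodes visited by such a path can be listed by following the edge labels one at a time. This is a small sketch with hypothetical names (`STEPS`, `path_nodes`), applied to the edit string H H D I displayed with the alignment above.

```python
# Edge labels of the alignment graph and the step (di, dj) each label takes.
STEPS = {"H": (1, 1), "D": (1, 0), "I": (0, 1)}


def path_nodes(h: str) -> list[tuple[int, int]]:
    """Return the nodes visited when following the labels of h from (0, 0)."""
    i, j = 0, 0
    nodes = [(i, j)]
    for label in h:
        di, dj = STEPS[label]
        i, j = i + di, j + dj
        nodes.append((i, j))
    return nodes


nodes = path_nodes("HHDI")
print(nodes)
```

The path starts at (0, 0), takes two diagonal H edges, one D edge, and one I edge, and ends at (3, 3), as required by Prop. 5.5.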

Fig. 5.3. The alignment graph G^{3,3} and the path corresponding to the alignment in Ex. 5.6.
