
Sequence Alignment

From the document Algebraic Statistics (pages 111-114)


A fundamental task in computational biology is the alignment of DNA or amino acid sequences. The primary objective of biological sequence alignment is to find positions in the sequences that are homologous; that is, the symbols at those positions are derived from the same position in some ancestral sequence. Because of evolution, two homologous positions may carry different symbols and may be located at different positions in the two sequences. An alignment is biologically correct if it matches up all positions that are truly homologous. Unfortunately, the biological truth is unknown in most cases. Therefore, a guess is made by treating sequence alignment as an optimization problem: an objective function assigns a score to each alignment according to a scoring scheme, and an alignment is sought that maximizes this score. But which scoring scheme should be used to predict biologically correct alignments? To answer this question, the scoring scheme needs to be analyzed over all possible parameter values.

This chapter introduces an algebraic statistical model for pairwise sequence alignment. Interpreting its marginal probabilities in the tropical algebra leads to a formalization of the alignment problem as an optimization problem, and interpreting them in the polytope algebra allows scoring schemes to be analyzed over all possible parameter values.

5.1 Sequence Alignment

We take a finite alphabet Σ with l letters and an additional symbol “−”, called the blank, and call Σ ∪ {−} the extended alphabet. We consider two sequences σ1 = $\sigma^1_1 \sigma^1_2 \cdots \sigma^1_m$ and σ2 = $\sigma^2_1 \sigma^2_2 \cdots \sigma^2_n$ over the alphabet Σ.

An alignment of the sequences σ1 and σ2 is a pair of aligned sequences (µ1, µ2) over the extended alphabet Σ ∪ {−} such that µ1 and µ2 have the same length and are copies of σ1 and σ2, respectively, with blanks inserted. The two aligned sequences may not both have a blank at the same position.

It follows that the aligned sequences have length at most m + n, since each position holds at least one of the m + n symbols of σ1 and σ2.
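The definition above can be checked mechanically. The following is a minimal sketch, assuming sequences are Python strings and the blank is written as the ASCII character `-`; the function name `is_alignment` and the sample inputs are illustrative and not part of the text.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def is_alignment(mu1, mu2, sigma1, sigma2):
    """Check the definition: equal lengths, blanks only inserted, no all-blank column."""
    if len(mu1) != len(mu2):
        return False
    # Deleting the blanks must give back the original sequences.
    if mu1.replace(BLANK, "") != sigma1 or mu2.replace(BLANK, "") != sigma2:
        return False
    # No position may carry a blank in both aligned sequences.
    return all((a, b) != (BLANK, BLANK) for a, b in zip(mu1, mu2))


# Illustrative inputs (hypothetical, chosen only to exercise the checks).
print(is_alignment("GAT", "G-T", "GAT", "GT"))    # True
print(is_alignment("GAT-", "G-T-", "GAT", "GT"))  # False: last column is blank in both
```

Because every column must hold at least one non-blank symbol, any pair accepted by this check has length at most m + n.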

Example 5.1. Consider the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC. An alignment of these sequences is given by

µ1 = A C − G − T A − G C
µ2 = A C C G A G A C − C

An alignment of maximal length is

A C G T A G C − − − − − − − − −

− − − − − − − A C C G A G A C C

♦

An alignment of a pair of sequences (σ1, σ2) can also be represented by a string h over the edit alphabet {H, I, D}. The string h is called an edit string, and the letters of the edit alphabet stand for homology (H), insertion (I), and deletion (D). A letter I stands for an insertion (indel) in the first sequence σ1: a blank in µ1 is aligned with a symbol of σ2. A letter D stands for a deletion (indel) in the first sequence σ1: a symbol of σ1 is aligned with a blank in µ2. A letter H stands for a character change (mutation or mismatch), including the identity change (match): a symbol of σ1 is aligned with a symbol of σ2. We write #H, #I, and #D for the number of occurrences of H, I, and D, respectively, in an edit string for an alignment of the pair (σ1, σ2). Then we have

#H + #D = m and #H + #I = n. (5.1)

Example 5.2. Reconsider the sequences σ1 = ACGTAGC and σ2 = ACCGAGACC. An alignment of these sequences including the edit string is given by

h  = H H I H I H H I D H
µ1 = A C − G − T A − G C
µ2 = A C C G A G A C − C

We have #H = 6, #I = 3, and #D = 1. ♦
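The edit string of Example 5.2 can be read off column by column. The sketch below assumes the aligned sequences are given as Python strings with `-` for the blank; the function name `edit_string` is illustrative.

```python
from collections import Counter

BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def edit_string(mu1, mu2):
    """Read off the edit string of an alignment, one letter per column."""
    if len(mu1) != len(mu2):
        raise ValueError("aligned sequences must have the same length")
    letters = []
    for a, b in zip(mu1, mu2):
        if a == BLANK and b == BLANK:
            raise ValueError("two blanks in one column")
        if a == BLANK:
            letters.append("I")  # blank in mu1 opposite a symbol of sigma2
        elif b == BLANK:
            letters.append("D")  # symbol of sigma1 opposite a blank in mu2
        else:
            letters.append("H")  # match or mismatch
    return "".join(letters)


h = edit_string("AC-G-TA-GC", "ACCGAGAC-C")
counts = Counter(h)
print(h)                                      # HHIHIHHIDH
print(counts["H"], counts["I"], counts["D"])  # 6 3 1
```

With m = 7 and n = 9, the counts satisfy both equations in (5.1): 6 + 1 = 7 and 6 + 3 = 9.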

Proposition 5.3. A string over the edit alphabet {H, I, D} represents an alignment of an m-letter sequence σ1 and an n-letter sequence σ2 if and only if (5.1) holds.

Proof. Suppose we are given an alignment of the pair (σ1, σ2). We form an edit string h from left to right. Each symbol in σ1 either corresponds to a symbol in σ2, in which case we record an H in the edit string, or it gets deleted, in which case we record a D. This shows that the first equation in (5.1) holds. Each symbol in σ2 either corresponds to a symbol in σ1, in which case we have already recorded an H in the edit string, or it gets inserted, in which case we record an I. This shows that the second equation in (5.1) holds.

Conversely, each edit string h with the property (5.1), when read from left to right, produces an alignment of the pair (σ1, σ2). □
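The converse direction of the proof can be sketched as a left-to-right scan that consumes symbols of σ1 and σ2 as dictated by the edit string. The function name `align_from_edit_string` and the use of `-` for the blank are assumptions of this sketch.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)


def align_from_edit_string(h, sigma1, sigma2):
    """Build the aligned pair (mu1, mu2) described by the edit string h."""
    if set(h) - set("HID"):
        raise ValueError("edit string may only contain H, I and D")
    # Condition (5.1): #H + #D = m and #H + #I = n.
    if h.count("H") + h.count("D") != len(sigma1):
        raise ValueError("#H + #D must equal the length of sigma1")
    if h.count("H") + h.count("I") != len(sigma2):
        raise ValueError("#H + #I must equal the length of sigma2")
    i = j = 0  # next unread positions in sigma1 and sigma2
    mu1, mu2 = [], []
    for letter in h:
        if letter == "H":
            mu1.append(sigma1[i]); mu2.append(sigma2[j]); i += 1; j += 1
        elif letter == "D":
            mu1.append(sigma1[i]); mu2.append(BLANK); i += 1
        else:  # letter == "I"
            mu1.append(BLANK); mu2.append(sigma2[j]); j += 1
    return "".join(mu1), "".join(mu2)


mu1, mu2 = align_from_edit_string("HHIHIHHIDH", "ACGTAGC", "ACCGAGACC")
print(mu1)  # AC-G-TA-GC
print(mu2)  # ACCGAGAC-C
```

Applied to the edit string of Example 5.2, the scan reproduces the aligned sequences shown there.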

We write $A_{m,n}$ for the set of all strings over the edit alphabet {H, I, D} that satisfy the equations (5.1). We call $A_{m,n}$ the set of all alignments of the sequences σ1 and σ2, even though it depends only on m and n and not on the specific sequences. The cardinality of the set $A_{m,n}$ is called the Delannoy number (Fig. 5.2).
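For very small m and n, the set of edit strings satisfying (5.1) can be listed by brute force. The sketch below is only meant to make the definition concrete; the function name `alignments` is illustrative, and the enumeration is exponential in m + n.

```python
from itertools import product


def alignments(m, n):
    """All edit strings over {H, I, D} that satisfy (5.1) for the given m and n."""
    result = []
    # By (5.1), the length #H + #I + #D equals m + n - #H,
    # so it lies between max(m, n) and m + n.
    for length in range(max(m, n), m + n + 1):
        for letters in product("HID", repeat=length):
            h = "".join(letters)
            if (h.count("H") + h.count("D") == m
                    and h.count("H") + h.count("I") == n):
                result.append(h)
    return result


print(sorted(alignments(1, 1)))  # ['DI', 'H', 'ID']
print(len(alignments(2, 2)))     # 13
```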

Proposition 5.4. The cardinality of the set $A_{m,n}$ can be computed as the coefficient of the monomial $x^m y^n$ in the generating function $\frac{1}{1-x-y-xy}$.

Proof. Consider the expansion of the generating function

$$\frac{1}{1-x-y-xy} = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} a_{m,n}\, x^m y^n. \tag{5.2}$$


The coefficients are characterized by the linear recurrence

$$a_{m,n} = a_{m-1,n} + a_{m,n-1} + a_{m-1,n-1}, \qquad m \ge 0,\ n \ge 0,\ m + n \ge 1, \tag{5.3}$$

with initial conditions $a_{0,0} = 1$, $a_{m,-1} = 0$, and $a_{-1,n} = 0$; this follows by multiplying both sides of (5.2) by $1 - x - y - xy$ and comparing coefficients. The same recurrence holds for the cardinality of $A_{m,n}$. To see this, note that for nonnegative integers m and n with m + n ≥ 1, each string in $A_{m,n}$ is either a string in $A_{m-1,n-1}$ followed by an H, or a string in $A_{m-1,n}$ followed by a D, or a string in $A_{m,n-1}$ followed by an I (Fig. 5.1). Moreover, $A_{0,0}$ has only one element, the empty string, and $A_{m,n}$ is the empty set if m < 0 or n < 0. Thus the coefficient $a_{m,n}$ and the cardinality of $A_{m,n}$ satisfy the same initial conditions and the same recurrence. It follows that they must be equal. □
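The recurrence (5.3) translates directly into a table-filling computation. The following is a minimal sketch; the function name `delannoy_table` is illustrative, and out-of-range entries are treated as 0 in line with the stated initial conditions.

```python
def delannoy_table(max_m, max_n):
    """Fill a[m][n] for 0 <= m <= max_m, 0 <= n <= max_n using recurrence (5.3)."""
    a = [[0] * (max_n + 1) for _ in range(max_m + 1)]
    for m in range(max_m + 1):
        for n in range(max_n + 1):
            if m == 0 and n == 0:
                a[m][n] = 1  # initial condition a_{0,0} = 1
                continue
            # Entries with a negative index count as 0.
            last_d = a[m - 1][n] if m > 0 else 0                 # string ends in D
            last_i = a[m][n - 1] if n > 0 else 0                 # string ends in I
            last_h = a[m - 1][n - 1] if m > 0 and n > 0 else 0   # string ends in H
            a[m][n] = last_d + last_i + last_h
    return a


table = delannoy_table(10, 10)
print(table[2][2])    # 13
print(table[3][3])    # 63
print(table[10][10])  # 8097453
```

The table is filled in O(mn) time, in contrast to listing the edit strings themselves.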

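[Figure: three panels showing an alignment whose edit string ends in H (the last symbols of σ1 and σ2 are aligned), in D (the last symbol of σ1 is aligned with a blank), and in I (the last symbol of σ2 is aligned with a blank).]

Fig. 5.1. Three possibilities for strings in $A_{m,n}$.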

m \ n  0   1    2      3      4       5        6        7          8          9         10
    0  1   1    1      1      1       1        1        1          1          1          1
    1  1   3    5      7      9      11       13       15         17         19         21
    2  1   5   13     25     41      61       85      113        145        181        221
    3  1   7   25     63    129     231      377      575        833      1,159      1,561
    4  1   9   41    129    321     681    1,289    2,241      3,649      5,641      8,361
    5  1  11   61    231    681   1,683    3,653    7,183     13,073     22,363     36,365
    6  1  13   85    377  1,289   3,653    8,989   19,825     40,081     75,517    134,245
    7  1  15  113    575  2,241   7,183   19,825   48,639    108,545    224,143    433,905
    8  1  17  145    833  3,649  13,073   40,081  108,545    265,729    598,417  1,256,465
    9  1  19  181  1,159  5,641  22,363   75,517  224,143    598,417  1,462,563  3,317,445
   10  1  21  221  1,561  8,361  36,365  134,245  433,905  1,256,465  3,317,445  8,097,453

Fig. 5.2. The Delannoy numbers $a_{m,n}$ for 0 ≤ m, n ≤ 10.

The alignment graph of an m-letter sequence and an n-letter sequence is a directed graph $G_{m,n}$ on the set of nodes {0, 1, …, m} × {0, 1, …, n} with three classes of edges: edges (i, j) → (i, j + 1) are labelled I, edges (i, j) → (i + 1, j) are labelled D, and edges (i, j) → (i + 1, j + 1) are labelled H.

Proposition 5.5. The set of all alignments $A_{m,n}$ corresponds one-to-one with the set of all paths from the node (0, 0) to the node (m, n) in the alignment graph $G_{m,n}$.

Proof. Suppose an alignment in $A_{m,n}$ is given by the edit string h. Reading h from left to right and following the edges with the corresponding labels provides a path in $G_{m,n}$ starting from the node (0, 0). By (5.1), this path terminates at the node (m, n).

Conversely, suppose a path in $G_{m,n}$ from (0, 0) to (m, n) is given. The labelling of the path provides a string h over the edit alphabet that satisfies (5.1). By Prop. 5.3, the string h is an edit string corresponding to an alignment in $A_{m,n}$. □
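The alignment graph and the path count of Proposition 5.5 can be sketched as follows. The function names `alignment_graph` and `count_paths` are illustrative. The sketch relies on the fact that every edge leads to a node that comes later in row-major order, so path counts can be pushed forward in a single pass over the edges.

```python
from collections import defaultdict


def alignment_graph(m, n):
    """Edges of G_{m,n} as (source, target, label) triples, sources in row-major order."""
    edges = []
    for i in range(m + 1):
        for j in range(n + 1):
            if j < n:
                edges.append(((i, j), (i, j + 1), "I"))
            if i < m:
                edges.append(((i, j), (i + 1, j), "D"))
            if i < m and j < n:
                edges.append(((i, j), (i + 1, j + 1), "H"))
    return edges


def count_paths(m, n):
    """Number of directed paths from (0, 0) to (m, n) in G_{m,n}."""
    paths = defaultdict(int)
    paths[(0, 0)] = 1
    # All edges into a node start at nodes that appear earlier in row-major
    # order, so paths[source] is final by the time it is used.
    for source, target, _label in alignment_graph(m, n):
        paths[target] += paths[source]
    return paths[(m, n)]


print(len(alignment_graph(3, 3)))  # 33
print(count_paths(3, 3))           # 63
```

The path count for m = n = 3 agrees with the entry 63 in Fig. 5.2.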

Example 5.6. Consider the sequences σ1 = ACG and σ2 = ACC. The edit string h = HHDI provides the alignment

h  = H H D I
µ1 = A C G −
µ2 = A C − C

This alignment can be traced by the solid path in the alignment graph $G_{3,3}$ (Fig. 5.3). ♦
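The path of Example 5.6 can be recovered from the displayed columns of the alignment. The sketch below reads the edge label of each column and follows the corresponding edge from (0, 0); the names `STEP` and `path_of_alignment` and the use of `-` for the blank are assumptions.

```python
BLANK = "-"  # ASCII stand-in for the blank symbol (an assumption of this sketch)

# Node increments for the three edge classes of the alignment graph.
STEP = {"H": (1, 1), "D": (1, 0), "I": (0, 1)}


def path_of_alignment(mu1, mu2):
    """Return the edge labels and the visited nodes of the path for (mu1, mu2)."""
    if len(mu1) != len(mu2):
        raise ValueError("aligned sequences must have the same length")
    node = (0, 0)
    nodes, labels = [node], []
    for a, b in zip(mu1, mu2):
        if a == BLANK and b == BLANK:
            raise ValueError("two blanks in one column")
        label = "I" if a == BLANK else "D" if b == BLANK else "H"
        di, dj = STEP[label]
        node = (node[0] + di, node[1] + dj)
        labels.append(label)
        nodes.append(node)
    return "".join(labels), nodes


labels, nodes = path_of_alignment("ACG-", "AC-C")
print(labels)  # HHDI
print(nodes)   # [(0, 0), (1, 1), (2, 2), (3, 2), (3, 3)]
```

The path ends at (3, 3), as required for two sequences of length 3.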

Fig. 5.3. The alignment graph $G_{3,3}$ and the path corresponding to the alignment in Ex. 5.6.
