Evolutionary Models - Algebraic Statistics

Algorithm 7.2Felsenstein forward algorithm.

Require: Rooted binary tree T with leaf set [n] and root r, observed sequence σ ∈ Σⁿ, Σ common node alphabet

Ensure: Evaluation of tableM foreach leafℓinT do

M[ℓ, τ]←0 mark nodeℓ end for repeat

take unmarked nodevinT whose descendentsa, bare already marked foreach symbolτ inΣdo

M[v, τ]← L

τ_aw^vaτ τ_a⊙M[a, τa]

⊙ L

τ_bw^vbτ τ_b⊙M[b, τb] end for

mark nodev untilrootr is marked

One problem with the maximum likelihood approach ismodel selection. Suppose we haventaxa,d parameters,mstates, and a vector u∈N^mfor observed data. Then we may consider all rooted binary trees withnleaves. Each such tree leads to an algebraic statistical modelf :R^d→R^m. The task is then to select a “good” model for the data; that is, a model whose likelihood function attains the largest value. However, the number of trees grows exponentially with the number of taxa and henceforth this approach is only viable for a handful of taxa.

7.6 Evolutionary Models

The above general model for the observed nucleotides allows the substitution matrices to be arbi-trary. However, there are practical reasons for constraining the form of these matrices. This leads to evolutionary models that are described by time-continuous Markov chains.

A time-continuous Markov chain is based on a matrix of rates which describes the substitution of nucleotides at an infinitesimally small time interval. A rate matrix is an n×n real-valued matrix Q = (qij) with rows and columns indexed by a common alphabet Σ such as the DNA alphabet Σ={A,C,G,T} that satisfies the following conditions,

• all off-diagonal entries are non-negative,

qij ≥0, i6=j,

• all row sums are zero, X

j∈Σ

qij= 0, i∈Σ,

• all diagonal entries are negative,

qii <0, i∈Σ.

A rate matrix is aninfinitesimal generator matrix for a time-continuous Markov chain capturing the notion of instantaneous rate of mutation.

A rate matrixQgives rise to a probability distribution matrixΦ(t),t≥0, by exponentiation. The matrix exponential for the matrixQtis then×nreal-valued matrix

Φ(t) = exp(Qt) = X∞ k=0

k!Q^kt^k, t≥0. (7.12)

The entry ofΦ(t) in rowiand columnj equals the conditional probability that the substitutioni→j occurs at a time interval of lengtht≥0.

Theorem 7.14.Each rate matrixQhas the following properties:

• Chapman-Kolmogorov equation:

Φ(s+t) =Φ(s)·Φ(t), s, t≥0.

• The matrix Φ(t)is the unique solution of the differential equation Φ^′(t) =Φ(t)·Q=Q·Φ(t), Φ(0) =I, t≥0.

• Higher derivatives at the origin:

Φ^(k)(0) =Q^k, k≥0.

• The matrix Φ(t)is stochastic (non-negative entries with row sums equal to one) for eacht≥0.

Proof. The matrix exponential exp(A) is well-defined because the series converges componentwise for any square matrixA. If twon×nmatricesAandBcommute, we obtain (AB)^k=A^kB^k,k≥0. Then the basic exponentiation identity holds

exp(A+B) = exp(A) exp(B). (7.13)

ButtQandsQcommute for any valuess, tand thusΦ(s+t) =Φ(s) +Φ(t).

The matrix exponential exp(tQ) can be differentiated term-by-term, since the power series exp(tQ) has infinite radius of convergence. We have

Φ^′(t) = X∞ k=1

t^k−1Q^k

(k−1)! =Φ(t)·Q=Q·Φ(t), t≥0.

The uniqueness is a standard result for systems of ordinary linear differential equations. Iterated dif-ferentiation leads to the identity in item three.

Finally, when tapproches 0, the Taylor series expansion givesΦ(t) =I+tQ+O(t²). Thus, whent is sufficiently small, qij(t)≥0 implies Φij ≥0 for alli6=j. But by the first part,Φ(t) =Φ(t/m)^m for allmand thusqij(t)≥0 implies Φij≥0,i6=j, for allt≥0. MoreoverQ has row sums equal to zero and thusQ^mhas the same property for each integerm≥1. Hence, the matrixΦ(t) is stochastic. ⊓⊔ Example 7.15 (Maple). The matrix exponential for a matrix can be calculated as follows,

> with(LinearAlgebra):

> Q := Matrix( [[-2,1,1],[1,-2,1],[1,1,-2]] ):

> MatrixExponential(Q,t);

7.6 Evolutionary Models 155 The matrix exponential exp(A) for an×ncomplex-valued matrixAis invertible, since by the Chapman-Kolmogorov equation, exp(A)·exp(−A) = exp(A+(−A)) = exp(0) =Iand thus the inverse of exp(A) is exp(−A). The matrix exponential provides a mapping exp :Mn(C)→Gln(C) from the vector space of alln×ncomplex matrices (Mn(C),+,0) to the general linear group of degreen, i.e., the group of alln×ninvertible complex matrices (GLn(C),·,I). This mapping is surjective which shows that each invertible complex matrix can be written as the exponential of some other complex matrix.

Moreover, the matrix exponential exp(A) for an×nreal-valued symmetric matrixAcan be easily computed. For this, notice that any n×n real-valued symmetric matrix can be diagonalized by an orthogonal matrix; that is, there is a real-valued orthogonal matrix U such that D=U^TAU is a diagonal matrix. For each diagnonal matrixDwith main diagonal entriesd1, . . . , dn, the corresponding matrix exponential exp(D) is a diagonal matrix with main diagonal entriese^d¹, . . . , e^dⁿ. SinceU U^T = I, it follows from the definitions that exp(A) = Uexp(D)U^T. Since the formation of determinants commutes with matrix multiplication, we obtain

det(exp(A)) = det(exp(D)) =Y

e^dⁱ =e^tr(^A⁾, (7.14) where tr(A) denotes the trace of the matrixA. By taking the natural logarithm, we obtain

tr(A) = log(det(exp(A))).

An evolutionary model forntaxa over an alphabetΣ is specified by a rooted binary treeT withn leaves together with a rate matrixQover the alphabetΣand an initial distribution for the root of the tree (which is often assumed to be uniform on the alphabet). The transition probability matrix for the edgeklof the tree T is given by the matrix exponential exp(Qtkl), where the socalled branch lengths tkl are to be determined. The corresponding algebraic statistical modelR^d→R^mis a specialization of the hidden tree Markov model to these matrices.

The Felsenstein hierarchy is a nested family of evolutionary models. It can be seen as a cumulative result of experimentation and development of many special time-continuous Markov models with rate matrices that incorporate biologically meaningful parameters.

The simplest model is theJukes-Cantor model(JC) developed in the late 1960s. This model is highly structured with equal transition probabilites among the nucleotides and with uniform root distribution.

It is based on theJukes-Cantor rate matrix

where the parameterα >0 indicates that all substitutions occur at the same rate. The corresponding transition probability matrix is

exp(tQ) =1

Note that all transition probabilities converge to ¹₄ as t approaches infinity. Thus in the equilibrium distribution all nucleotides are substituted with the same rate. Put a = e^−4αt. Then the transition probability matrix has the form

where a is the probability of a mutation from nucleotide i to another nucleotide j and 1−3a is the probability of staying at the nucleotide. Since the rows must sum to 1, we have 0< a <1/3.

The expected number of mutations over timet is the quantity 3αt=−1

4 ·tr(Q)·t=−1

4 ·log(det(Φ(t))), (7.18)

where the last equation follows from (7.14). This number is called the branch lengthof the model and is used to label the edges of a phylogenetic tree. Although the JC model does not capture the biology very well, it is easy to work with and is often used for quick calculations.

We make a change of variables in the JC model. For this, assume that the tree T has m edges and the leaves are indexed by the set [n]. Let Φ⁽ⁱ⁾ =Φ(ti) denote the transition probability matrix associated to thei-th edge, 1 ≤i≤m. Instead of using the parameters αi and ti, we introduce new parameters

πi= 1

4(1−e^−4αⁱ^tⁱ) and µi= 1

4(1 + 3e^−4αⁱ^tⁱ). (7.19) The parameters are the entries of the transition probability matrix for thei-th edge,

Φ⁽ⁱ⁾= exp(tiQ) =

The entries satisfy the linear constraint

µi+ 3πi= 1. (7.21)

The branch length of thei-th edge can be recovered from the matrixΦ⁽ⁱ⁾as follows, 3αiti=−1

4·log(det(Φ⁽ⁱ⁾)). (7.22)

The JC model hasmparameters by (7.21) and 4ⁿstates and is thus given by the mappingf :R^m→R⁴ⁿ.

7.6 Evolutionary Models 157

Example 7.16.Reconsider the 1,3 claw treeT (Fig. 7.9). We take the JC model with uniform root distribution. This model is given by the mappingf :R³→R⁶⁴with five distinct marginal distributions (Fig. 7.13).

The probability of observing the same symbol at all three leaves is p123=1

4(µ1µ2µ3+ 3π1π2π3). (7.23) The probabilities of seeing the same nucleotide at two distinct leavesi, jand a different one at the third leaf are The probability of noticing three distinct letters at the leaves is

pdis=1

Fig. 7.13.Labellings of the 1,3 claw tree forp123,p12, andpdis, respectively.

The nucleotides fall biochemically into two families:purines(adenine and guanine) andpyrimidines (cytosine and thymine). Substitutions within a family are calledtransitionsand substitutions between families are calledtransversions.

The Kimura two-parameter model K2P (1980) is an extension of the Jukes-Cantor model that distinguishes between transitions and transversions by assigning a common rate to transversions and a different common rate to transitions but still assumes an equal root distribution. The model K2P is based on the rate matrix

whereα, β >0 and the diagonal entries are determined by the constraint that the row sums are equal to zero.

The Kimura three-parameter model K3P (1981) is a generalization of the K2P model that allows two classes of transversions to occur at different rates. Its rate matrix is of the form



whereα, β, γ >0. The corresponding matrix exponential is



The strand symmetric model CS05 (2004) extends the K2P model by assuming that the root dis-tribution fulfillsπA=πTandπC=πG. The CS05 model has the rate matrix

TheFelsenstein model F81 (1981) is an extension of the JC model that allows a non-uniform root distribution. The rate matrix is

The Hasegawa model HKY85 (1985) is a compromise between the models F81 and CS05. Its rate matrix is

7.6 Evolutionary Models 159

The Tamura-Nei model TN93 (1993) is an extension of the HKY85 model including an extra pa-rameter for the two types of transversions. It has rate matrix



Thesymmetric model SYM (1994) is given by a symmetric rate matrix and assumes uniform root distribution. Its rate matrix is

TheREV model (1984) is the most general DNA model. Its only restriction is symmetry. The rate matrix is

Example 7.17 (Singular). Reconsider the JC model of the 1,3 claw tree f : R³ → R⁶⁴ studied in Ex. 7.16. Since the model has only five different marginal probabilities, we may consider the JC model as the image of the simplified mappingf^′:R³→R⁵with marginal probabilities given by (7.23)-(7.27).

The following program computes invariants of the simplified model,

> ring r = 0, (x(1..3),y(1..3),p(1..5)), dp;

> ideal i = i1+i2+i3+i4+i5, x(1)+3*y(1)-1, x(2)+3*y(2)-1, x(3)+3*y(3)-1;

> ideal j = std(i);

> ideal k = eliminate(j,x(1)*x(2)*x(3)*y(1)*y(2)*y(3));

The output provides three (large) phylogenetic invariants in the ringR[p1, . . . , p5]. ♦ Given a set of observed DNA sequences. We may ask the question whether or not these sequences come from a particular evolutionary model Mgiven by a tree T and a root distribution π. For this, we compute the pattern frequencies (ˆpσ) from the observed taxa. These pattern frequencies can serve as estimates for the marginal probabilities (pσ) of the model. Then we evaluate each of the model invariants using the pattern frequencies. If the sequence data come from the model, we will expect the evaluated invariants not to differ significantly from zero. These evaluated invariants give us a score of the model. Then we may choose the evolutionary model with the minimal score among all models as the ”true” evolutionary model.

Im Dokument Algebraic Statistics (Seite 165-172)