Fully Observed Tree Markov Model - Algebraic Statistics

$$\prod_{i=2}^{n} (2i-3).$$

Proof. Let $h_n$ denote the number of inequivalent labelled rooted binary trees with $n \ge 2$ leaves. There is one labelled rooted binary tree with $n = 2$ leaves, i.e., $h_2 = 1$. This tree has $n - 1 = 1$ internal node and $2n - 2 = 2$ edges (Fig. 7.7).

Fig. 7.7. Labelled rooted binary tree with two leaves.

Each labelled unrooted binary tree $T$ with $n \ge 3$ leaves can be converted into a labelled rooted binary tree with $n$ leaves by bisecting one of its edges and taking the new node as the root (Fig. 7.8). By Prop. 7.2, each tree constructed in this way has $(n-2) + 1 = n-1$ internal nodes and $(2n-3) + 1 = 2n-2$ edges. Each labelled rooted binary tree with $n$ leaves can be constructed in this way, and the constructed trees are pairwise inequivalent. Since the number of labelled unrooted binary trees with $n$ leaves is $g_n$ and each such tree has $2n-3$ edges, the number of labelled rooted binary trees with $n$ leaves is

$$h_n = g_n \cdot (2n-3) = \prod_{i=2}^{n-1} (2i-3) \cdot (2n-3) = \prod_{i=2}^{n} (2i-3),$$

as required. ⊓⊔

Fig. 7.8. Labelled rooted binary tree with three leaves.

The proofs are constructive and provide a method for generating both all labelled unrooted binary trees and all rooted binary trees with a small number of leaves.
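The constructive idea can be illustrated with a short Python sketch. It does not follow the edge-bisection argument literally: as an assumption made for brevity, it grows rooted trees by attaching a new leaf either above the root or on any edge of a smaller tree, which gives $2n-3$ choices when passing from $n-1$ to $n$ leaves and hence the same product. The nested-tuple representation and all function names are illustrative.

```python
from math import prod


def insert_leaf(tree, leaf):
    """Yield every rooted binary tree obtained by attaching `leaf` to `tree`.

    A tree is either an int (a leaf label) or a pair (left, right).
    The new leaf is attached above the current node (on the edge entering
    it, or above the root) or somewhere inside one of the two subtrees.
    """
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for sub in insert_leaf(left, leaf):
            yield (sub, right)
        for sub in insert_leaf(right, leaf):
            yield (left, sub)


def rooted_trees(n):
    """Return all labelled rooted binary trees with leaves 1..n (n >= 2)."""
    trees = [(1, 2)]
    for leaf in range(3, n + 1):
        trees = [t for old in trees for t in insert_leaf(old, leaf)]
    return trees


def canonical(tree):
    """Forget the left/right order of children so equivalent trees compare equal."""
    if isinstance(tree, tuple):
        return frozenset(canonical(child) for child in tree)
    return tree


def h(n):
    """Closed form: the product of (2i - 3) for i = 2..n."""
    return prod(2 * i - 3 for i in range(2, n + 1))


for n in range(2, 7):
    trees = rooted_trees(n)
    distinct = {canonical(t) for t in trees}
    # Columns: n, trees generated, pairwise inequivalent trees, closed form.
    print(n, len(trees), len(distinct), h(n))
```

For each $n$ the three counts in a row should agree, which checks both that the enumeration produces no duplicates and that it matches the product formula. The enumeration is only practical for a small number of leaves, since the count grows super-exponentially.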

7.2 Fully Observed Tree Markov Model

We assume that not only the nucleotides of the taxa can be observed but also the nucleotides of the intermediate individuals represented by the interior nodes of a phylogenetic tree. Two individuals in a phylogenetic tree share the same lineage up to their most recent common ancestor. After the split in lineages, it is a reasonable first approximation to assume that the random mechanisms by which substitutions occur operate independently on the genomes that are no longer shared. Equivalently, the nucleotides exhibited by the two individuals are conditionally independent given the corresponding nucleotide exhibited by their most recent common ancestor.

Example 7.4. Consider the four-taxa tree in Fig. 7.1. Let $X_i$ denote the random variable over the DNA alphabet corresponding to the individual $i$, $1 \le i \le 7$. In view of the dependence structure given by the tree, a joint probability such as $P(X_1 = A, X_2 = G, X_3 = T, X_4 = A, X_5 = A, X_6 = T, X_7 = T)$ can be computed as

$$
\begin{aligned}
& P(X_1 = A) \\
& \cdot P(X_4 = A \mid X_1 = A)\, P(X_2 = G \mid X_1 = A) \\
& \cdot P(X_5 = A \mid X_2 = G)\, P(X_3 = T \mid X_2 = G) \\
& \cdot P(X_6 = T \mid X_3 = T)\, P(X_7 = T \mid X_3 = T).
\end{aligned}
\tag{7.1}
$$

Thus, for a given tree, the joint probabilities of the individuals exhibiting a particular set of nucleotides are determined by the probability distribution of the root and the transition probabilities corresponding to the edges. ♦
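The factorization in (7.1) can be sketched in Python as follows. The edge list is read off from the conditional probabilities in (7.1); the uniform root distribution and the transition matrix (keep the nucleotide with probability 0.7, otherwise change to each other nucleotide with equal probability) are hypothetical values chosen only for illustration, and the function names are not from the text.

```python
# Edges (parent, child) as they appear in Eq. (7.1); node 1 is the root.
EDGES = [(1, 4), (1, 2), (2, 5), (2, 3), (3, 6), (3, 7)]
ALPHABET = "ACGT"


def toy_transition(stay=0.7):
    """Hypothetical transition matrix: keep the nucleotide with probability
    `stay`, otherwise move to one of the other three with equal probability."""
    change = (1.0 - stay) / 3.0
    return {a: {b: (stay if a == b else change) for b in ALPHABET}
            for a in ALPHABET}


def joint_probability(state, root, root_dist, edges, theta):
    """Multiply the root probability by one transition probability per edge."""
    p = root_dist[state[root]]
    for k, l in edges:
        p *= theta[(k, l)][state[k]][state[l]]
    return p


root_dist = {a: 0.25 for a in ALPHABET}             # assumed uniform root
theta = {edge: toy_transition() for edge in EDGES}  # same toy matrix on every edge
state = {1: "A", 2: "G", 3: "T", 4: "A", 5: "A", 6: "T", 7: "T"}

p = joint_probability(state, 1, root_dist, EDGES, theta)
print(p)  # equals 0.25 * 0.7**3 * 0.1**3 up to rounding
```

With these toy values, three of the six edges keep their nucleotide and three change it, so the product reduces to $0.25 \cdot 0.7^3 \cdot 0.1^3$.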

We describe the tree model formally. Let $T$ be a rooted binary tree with root $r$. Each node $i \in N(T)$ corresponds to a random variable $X_i$ with values in an alphabet $\Sigma_i$. Each edge $kl \in E(T)$ is associated with a matrix $\theta^{kl}$ with positive entries, where the rows are indexed by $\Sigma_k$ and the columns are indexed by $\Sigma_l$. The parameter space $\Theta$ of the tree model is given by the collection of the matrices $\theta^{kl}$ with $kl \in E(T)$. Thus the dimension of the parameter space is $d = \sum_{kl \in E(T)} |\Sigma_k| \cdot |\Sigma_l|$. A state $\sigma$ of the tree model is a labelling of the nodes of the tree; i.e., $\sigma = (\sigma_i)_{i \in N(T)}$, where $\sigma_i \in \Sigma_i$. The state space of the model is the Cartesian product of the alphabets $\Sigma_i$ with $i \in N(T)$. Thus the cardinality of the state space is $m = \prod_{i \in N(T)} |\Sigma_i|$.

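The fully observed toric tree Markov model for the tree $T$ is the mapping $F_T : \mathbb{R}^d \to \mathbb{R}^m$ defined as

$$F_T : \theta = (\theta^{kl})_{kl \in E(T)} \mapsto p = (p_\sigma), \tag{7.2}$$

where

$$p_\sigma = \frac{1}{|\Sigma_r|} \cdot \prod_{kl \in E(T)} \theta^{kl}_{\sigma_k \sigma_l} \tag{7.3}$$

for each state $\sigma = (\sigma_i)_{i \in N(T)}$. Note that the factor $1/|\Sigma_r|$ corresponds to a uniform distribution of states at the root (Ex. 7.4).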

We consider a subset $\Theta_1$ of the parameter space $\Theta$ given by all positive matrices $\theta^{kl}$ whose row sums are 1. The matrices $\theta^{kl}$ can then be viewed as transition probability matrices along the branches. The dimension of the parameter space $\Theta_1$ is therefore $d = \sum_{kl \in E(T)} |\Sigma_k| \cdot (|\Sigma_l| - 1)$. The fully observed tree Markov model for the tree $T$ is given by the restriction of the mapping $F_T : \mathbb{R}^d \to \mathbb{R}^m$ to the parameter space $\Theta_1$.
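As a worked instance of these counts, suppose (an illustrative assumption) that every node of the seven-node tree in Example 7.4 carries the DNA alphabet, so that $|\Sigma_i| = 4$ for all $i \in N(T)$ and $|E(T)| = 6$. Then

$$
\begin{aligned}
m &= \prod_{i \in N(T)} |\Sigma_i| = 4^7 = 16384, \\
\dim \Theta &= \sum_{kl \in E(T)} |\Sigma_k| \cdot |\Sigma_l| = 6 \cdot 4 \cdot 4 = 96, \\
\dim \Theta_1 &= \sum_{kl \in E(T)} |\Sigma_k| \cdot (|\Sigma_l| - 1) = 6 \cdot 4 \cdot 3 = 72.
\end{aligned}
$$

The difference of 24 corresponds to one row-sum constraint for each of the four rows of each of the six matrices.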

Tree models in phylogenetics usually have the same alphabet $\Sigma$ at all nodes, but the transition probability matrices remain distinct and independent.


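Example 7.5. Consider the $1,n$ claw tree $T$ in Fig. 7.9. This is a tree with $n$ leaves and no internal nodes other than the root $r$; that is, $N(T) = \{r, 1, \ldots, n\}$ and $E(T) = \{r1, \ldots, rn\}$.

Assume that all nodes have the alphabet $\Sigma = \{0, 1\}$. The fully observed toric tree Markov model $F_T$ has $d = 4n$ parameters given by the $2 \times 2$ matrices

$$\theta^{ri} = \begin{pmatrix} \theta^{ri}_{00} & \theta^{ri}_{01} \\ \theta^{ri}_{10} & \theta^{ri}_{11} \end{pmatrix}, \quad 1 \le i \le n.$$

The model has $m = 2^{n+1}$ states, which are given by the binary strings $\sigma_r \sigma_1 \ldots \sigma_n \in \Sigma^{n+1}$.

The fully observed tree Markov model $F_T$ has $d = 2n$ parameters defined by the $2 \times 2$ matrices

$$\theta^{ri} = \begin{pmatrix} \theta^{ri}_{00} & 1 - \theta^{ri}_{00} \\ \theta^{ri}_{10} & 1 - \theta^{ri}_{10} \end{pmatrix}, \quad 1 \le i \le n.$$

The coordinates of the mapping $F_T : \mathbb{R}^{2n} \to \mathbb{R}^{2^{n+1}}$ are the joint probabilities

$$p_{\sigma_r \sigma_1 \ldots \sigma_n} = \frac{1}{2}\, \theta^{r1}_{\sigma_r \sigma_1} \cdots \theta^{rn}_{\sigma_r \sigma_n}.$$

♦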
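The claw tree coordinates can be enumerated directly. The following Python sketch does this for the 1,3 claw tree; the numerical parameter values and the helper names are arbitrary illustrative assumptions.

```python
from itertools import product


def claw_probabilities(thetas):
    """Coordinates p_sigma of the fully observed claw tree model over {0, 1}.

    `thetas[i]` is the 2x2 transition matrix on the edge from the root to
    leaf i + 1; the root state is uniform, hence the factor 1/2.
    """
    n = len(thetas)
    probs = {}
    for sigma in product((0, 1), repeat=n + 1):
        root, leaves = sigma[0], sigma[1:]
        p = 0.5
        for theta, leaf in zip(thetas, leaves):
            p *= theta[root][leaf]
        probs[sigma] = p
    return probs


def stochastic(a, b):
    """2x2 matrix with rows (a, 1 - a) and (b, 1 - b): two free parameters."""
    return [[a, 1.0 - a], [b, 1.0 - b]]


# Arbitrary illustrative parameter values for the 1,3 claw tree (2n = 6 parameters).
thetas = [stochastic(0.9, 0.2), stochastic(0.6, 0.5), stochastic(0.3, 0.8)]
probs = claw_probabilities(thetas)

print(len(probs))            # number of states, 2**(n + 1)
print(sum(probs.values()))   # close to 1 because every row sums to 1
print(probs[(0, 0, 0, 0)])   # 1/2 * 0.9 * 0.6 * 0.3
```

Because the matrices are restricted to row sums of 1, the $2^{n+1}$ coordinates sum to 1, so the image of $\Theta_1$ consists of probability distributions on the state space.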

Fig. 7.9. The 1,3 claw tree.

We provide maximum likelihood estimates for the fully observed tree Markov model. Let $T$ be a rooted binary tree with $n$ leaves. Note that the tree has $2n-1$ nodes. Assume that all individuals share a common alphabet $\Sigma$ of cardinality $q$. Given a sequence of observations $\sigma^1, \sigma^2, \ldots, \sigma^N$ in $\Sigma^{2n-1}$, the entry $u_\sigma$ of the corresponding data vector $u = (u_\sigma)$ provides the number of occurrences of the state $\sigma \in \Sigma^{2n-1}$. Thus we have $\sum_\sigma u_\sigma = N$. Let $v^{kl}_{ij}$ denote the number of occurrences of $ij \in \Sigma^2$ as a consecutive pair for the edge $kl$ in the tree (Fig. 7.10). The vector $v = (v^{kl}_{ij})$ provides the sufficient statistic of the model (Prop. 4.11).

Proposition 7.6. The maximum likelihood estimate for the data vector $u$ in the fully observed tree Markov model is the parameter vector $\hat\theta = (\hat\theta^{kl}_{ij})$ in $\Theta_1$ with coordinates

$$\hat\theta^{kl}_{ij} = \frac{v^{kl}_{ij}}{\sum_{s \in \Sigma} v^{kl}_{is}}, \quad kl \in E(T),\ ij \in \Sigma^2.$$
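The estimate amounts to counting state pairs along each edge and normalizing each row of counts. A minimal Python sketch, assuming observations are given as dictionaries from node labels to symbols; the three-node tree, the seven observations and the function names are made up for illustration.

```python
from collections import Counter


def sufficient_statistic(observations, edges):
    """Count v[(k, l)][(i, j)]: how often node k shows i while node l shows j."""
    v = {edge: Counter() for edge in edges}
    for sigma in observations:
        for k, l in edges:
            v[(k, l)][(sigma[k], sigma[l])] += 1
    return v


def mle(v, alphabet):
    """Row-normalise the pair counts of every edge.

    Rows with no observations are skipped because the ratio is undefined there.
    """
    theta_hat = {}
    for edge, counts in v.items():
        theta_hat[edge] = {}
        for i in alphabet:
            row_total = sum(counts[(i, s)] for s in alphabet)
            if row_total == 0:
                continue
            theta_hat[edge][i] = {j: counts[(i, j)] / row_total
                                  for j in alphabet}
    return theta_hat


# Hypothetical data: root 1 with leaves 2 and 3, binary alphabet {0, 1}.
edges = [(1, 2), (1, 3)]
observations = [
    {1: 0, 2: 0, 3: 0},
    {1: 0, 2: 0, 3: 1},
    {1: 0, 2: 1, 3: 0},
    {1: 0, 2: 0, 3: 0},
    {1: 1, 2: 1, 3: 1},
    {1: 1, 2: 0, 3: 1},
    {1: 1, 2: 1, 3: 0},
]

v = sufficient_statistic(observations, edges)
theta_hat = mle(v, alphabet=(0, 1))
print(theta_hat[(1, 2)])
```

In this toy data set the pair counts on edge $12$ are $v_{00} = 3$, $v_{01} = 1$, $v_{10} = 1$, $v_{11} = 2$, so the estimated rows are $(3/4, 1/4)$ and $(1/3, 2/3)$. A count of zero would give an estimate of zero, which lies on the boundary rather than in $\Theta_1$, where all entries are positive.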

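Fig. 7.10. Labelled edge $k \to l$, with node $k$ in state $i$ and node $l$ in state $j$.

Proof. The log-likelihood function of the toric model can be written as follows,

$$\ell(\theta) = \sum_{kl \in E(T)} \sum_{i \in \Sigma} \left( v^{kl}_{i1} \log \theta^{kl}_{i1} + \ldots + v^{kl}_{iq} \log \theta^{kl}_{iq} \right).$$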

The log-likelihood function of the fully observed tree Markov model is obtained by restriction to the set $\Theta_1$ of positive matrices whose row sums are all equal to one. Therefore, $\ell(\theta)$ is the sum of expressions

$$v^{kl}_{i1} \log \theta^{kl}_{i1} + \ldots + v^{kl}_{i,q-1} \log \theta^{kl}_{i,q-1} + v^{kl}_{iq} \log\left( 1 - \sum_{s=1}^{q-1} \theta^{kl}_{is} \right).$$

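These expressions have disjoint sets of unknowns for different choices of the indices $k$, $l$, and $i$. To maximize $\ell(\theta)$ over $\Theta_1$, it is therefore sufficient to maximize each of the above concave functions over the $(q-1)$-dimensional simplex consisting of all non-negative vectors $(\theta^{kl}_{it})_t$ with coordinates summing to 1. Setting the partial derivative with respect to $\theta^{kl}_{is}$ to zero gives $v^{kl}_{is}/\theta^{kl}_{is} = v^{kl}_{iq}/\theta^{kl}_{iq}$ for $1 \le s \le q-1$, where $\theta^{kl}_{iq} = 1 - \sum_{s=1}^{q-1} \theta^{kl}_{is}$. Hence each row $(\theta^{kl}_{is})_s$ is proportional to $(v^{kl}_{is})_s$, and normalizing the row sum to 1 shows that the unique critical point has the required coordinates. ⊓⊔

Example 7.7 (Maple). Consider the rooted binary tree $T$ in Fig. 7.11. Assume that each node is associated with the binary alphabet $\Sigma = \{1, 2\}$. We first initialize

> restart: with(combinat): with(linalg):
> # Vkl will hold the estimated transition matrix for the edge kl
> V12 := array(1..2,1..2);
> V13 := array(1..2,1..2);
> V24 := array(1..2,1..2);
> V25 := array(1..2,1..2);
> V36 := array(1..2,1..2);
> V37 := array(1..2,1..2);
> # Tkl holds the pair counts (sufficient statistic) for the edge kl
> T12 := array([[0,0],[0,0]]);
> T13 := array([[0,0],[0,0]]);
> T24 := array([[0,0],[0,0]]);
> T25 := array([[0,0],[0,0]]);
> T36 := array([[0,0],[0,0]]);
> T37 := array([[0,0],[0,0]]);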

generate 128 states $\sigma \in \Sigma^7$ uniformly at random,

> digs := rand(1..2): M := randmatrix(128, 7, entries = digs):

and calculate the sufficient statistic

> for i from 1 to 128 do
> T12[M[i,1],M[i,2]] := T12[M[i,1],M[i,2]] + 1;
> T13[M[i,1],M[i,3]] := T13[M[i,1],M[i,3]] + 1;
> T24[M[i,2],M[i,4]] := T24[M[i,2],M[i,4]] + 1;
> T25[M[i,2],M[i,5]] := T25[M[i,2],M[i,5]] + 1;
> T36[M[i,3],M[i,6]] := T36[M[i,3],M[i,6]] + 1;
> T37[M[i,3],M[i,7]] := T37[M[i,3],M[i,7]] + 1;
> od:

This allows us to estimate the parameters according to Prop. 7.6,

> V12 := array([[T12[1,1]/(T12[1,1]+T12[1,2]),

Fig. 7.11. Rooted tree with four leaves.
