
Fully Observed Tree Markov Model

From the document Algebraic Statistics (pages 155-160)

\[ \prod_{i=2}^{n} (2i-3). \]

Proof. Let $h_n$ denote the number of inequivalent labelled rooted binary trees with $n \geq 2$ leaves. There is one labelled rooted binary tree with $n = 2$ leaves, i.e., $h_2 = 1$. This tree has $n-1 = 1$ internal node and $2n-2 = 2$ edges (Fig. 7.7).

Fig. 7.7. Labelled rooted binary tree with two leaves.

Each labelled unrooted binary tree $T$ with $n \geq 3$ leaves can be converted into a labelled rooted binary tree with $n$ leaves by bisecting one of its edges and taking the new node as the root (Fig. 7.8). By Prop. 7.2, each tree constructed in this way has $(n-2) + 1 = n-1$ internal nodes and $(2n-3) + 1 = 2n-2$ edges. Each labelled rooted binary tree with $n$ leaves can be constructed in this way, and the constructed trees are pairwise inequivalent. Since the number of labelled unrooted binary trees with $n$ leaves is $g_n$ and each such tree has $2n-3$ edges, the number of labelled rooted binary trees with $n$ leaves is

\[ h_n = g_n \cdot (2n-3) = \prod_{i=2}^{n-1} (2i-3) \cdot (2n-3) = \prod_{i=2}^{n} (2i-3), \]

as required. ⊓⊔

Fig. 7.8. Labelled rooted binary tree with three leaves.

The proofs are constructive and provide a method for generating both all labelled unrooted binary trees and all rooted binary trees with a small number of leaves.
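The counting formula can be checked by brute force for small $n$. The following is a minimal sketch, not the construction used in the proof: it grows rooted trees by attaching one new leaf at a time to an edge or above the root (a standard leaf-insertion variant), and it represents an internal node as a `frozenset` of its two children. The function names `insert_leaf`, `rooted_trees`, `h` and `g` are chosen here for illustration only.

```python
from math import prod


def insert_leaf(tree, leaf):
    """Yield every rooted tree obtained by attaching `leaf` to `tree`.

    A leaf is an int; an internal node is a frozenset of its two children.
    """
    # Attach above the root of this (sub)tree, i.e. on the edge leading into it.
    yield frozenset({tree, leaf})
    if isinstance(tree, frozenset):
        left, right = tuple(tree)
        # Attach somewhere inside the left subtree.
        for new_left in insert_leaf(left, leaf):
            yield frozenset({new_left, right})
        # Attach somewhere inside the right subtree.
        for new_right in insert_leaf(right, leaf):
            yield frozenset({left, new_right})


def rooted_trees(n):
    """Return the set of all labelled rooted binary trees with leaves 1..n (n >= 2)."""
    trees = {frozenset({1, 2})}
    for leaf in range(3, n + 1):
        trees = {t for tree in trees for t in insert_leaf(tree, leaf)}
    return trees


def h(n):
    """Number of labelled rooted binary trees: product of (2i - 3) for i = 2..n."""
    return prod(2 * i - 3 for i in range(2, n + 1))


def g(n):
    """Number of labelled unrooted binary trees: product of (2i - 3) for i = 2..n-1."""
    return prod(2 * i - 3 for i in range(2, n))


for n in range(2, 7):
    print(n, len(rooted_trees(n)), h(n))
```

Storing the trees in a set removes any duplicates, so agreement between the enumerated count and `h(n)` also confirms that the generated trees are pairwise distinct.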

7.2 Fully Observed Tree Markov Model

We assume that not only the nucleotides for the taxa can be observed but also the nucleotides for the intermediates represented by the interior nodes of a phylogenetic tree. Two individuals in a phylogenetic tree share the same lineage up to their most recent common ancestor. After the split in lineages, it is a reasonable first approximation to assume that the random mechanisms by which substitutions occur operate independently on the genomes that are no longer shared. Equivalently, the nucleotides exhibited by the two individuals are conditionally independent given the corresponding nucleotide exhibited by their most recent common ancestor.

Example 7.4. Consider the four taxa tree in Fig. 7.1. Let $X_i$ denote the random variable over the DNA alphabet corresponding to the individual $i$, $1 \leq i \leq 7$. In view of the dependence structure given by the tree, a joint probability such as

\[ P(X_1=A, X_2=G, X_3=T, X_4=A, X_5=A, X_6=T, X_7=T) \]

can be computed as

\begin{align}
& P(X_1=A) \nonumber \\
& \cdot P(X_4=A \mid X_1=A)\, P(X_2=G \mid X_1=A) \tag{7.1} \\
& \cdot P(X_5=A \mid X_2=G)\, P(X_3=T \mid X_2=G) \nonumber \\
& \cdot P(X_6=T \mid X_3=T)\, P(X_7=T \mid X_3=T). \nonumber
\end{align}

Thus, for a given tree, the joint probabilities of the individuals exhibiting a particular set of nucleotides are determined by the probability distribution of the root and the transition probabilities corresponding to the edges. ♦

We now describe the tree model formally. Let $T$ be a rooted binary tree with root $r$. Each node $i \in N(T)$ corresponds to a random variable $X_i$ with values in an alphabet $\Sigma_i$. Each edge $kl \in E(T)$ is associated with a matrix $\theta^{kl}$ with positive entries, whose rows are indexed by $\Sigma_k$ and whose columns are indexed by $\Sigma_l$. The parameter space $\Theta$ of the tree model is given by the collection of the matrices $\theta^{kl}$ with $kl \in E(T)$. Thus the dimension of the parameter space is $d = \sum_{kl \in E(T)} |\Sigma_k| \cdot |\Sigma_l|$. A state $\sigma$ of the tree model is a labelling of the nodes of the tree; i.e., $\sigma = (\sigma_i)_{i \in N(T)}$, where $\sigma_i \in \Sigma_i$. The state space of the model is the Cartesian product of the alphabets $\Sigma_i$ with $i \in N(T)$. Thus the cardinality of the state space is $m = \prod_{i \in N(T)} |\Sigma_i|$.

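The fully observed toric tree Markov model for the tree $T$ is the mapping $F_T : \mathbb{R}^d \to \mathbb{R}^m$ defined as

\[ F_T : \theta = (\theta^{kl})_{kl \in E(T)} \mapsto p = (p_\sigma), \tag{7.2} \]

where

\[ p_\sigma = \frac{1}{|\Sigma_r|} \cdot \prod_{kl \in E(T)} \theta^{kl}_{\sigma_k \sigma_l} \tag{7.3} \]

for each state $\sigma = (\sigma_i)_{i \in N(T)}$. Note that the factor $1/|\Sigma_r|$ corresponds to a uniform distribution of states at the root (Ex. 7.4).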
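As an illustration of Eq. (7.3), the following sketch evaluates $p_\sigma$ for the state of Ex. 7.4. The edge list is read off the factorization (7.1). The transition matrix (0.7 on the diagonal, 0.1 elsewhere, the same on every edge) is a hypothetical choice made only for this example, as is the function name `state_probability`.

```python
from itertools import product

ALPHABET = "ACGT"

# Edges k -> l read off the factorization (7.1); node 1 is the root.
EDGES = [(1, 4), (1, 2), (2, 5), (2, 3), (3, 6), (3, 7)]

# Hypothetical row-stochastic matrix (not from the source), used on every edge.
MATRIX = {a: {b: (0.7 if a == b else 0.1) for b in ALPHABET} for a in ALPHABET}
THETA = {edge: MATRIX for edge in EDGES}


def state_probability(sigma, edges, theta, root, root_alphabet_size):
    """Evaluate Eq. (7.3): p_sigma = 1/|Sigma_r| * prod over edges kl of theta^{kl}[sigma_k][sigma_l]."""
    p = 1.0 / root_alphabet_size
    for k, l in edges:
        p *= theta[(k, l)][sigma[k]][sigma[l]]
    return p


# The state from Ex. 7.4, given as a map node -> nucleotide.
sigma = {1: "A", 2: "G", 3: "T", 4: "A", 5: "A", 6: "T", 7: "T"}
p_sigma = state_probability(sigma, EDGES, THETA, root=1, root_alphabet_size=4)

# With row-stochastic matrices the coordinates p_sigma sum to 1 over all 4^7 states.
total = sum(
    state_probability(dict(zip(range(1, 8), letters)), EDGES, THETA, 1, 4)
    for letters in product(ALPHABET, repeat=7)
)
print(p_sigma, total)
```

Note that (7.3) fixes the root distribution to be uniform, whereas (7.1) allows an arbitrary root probability $P(X_1 = A)$; the sketch follows (7.3).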

We consider the subset $\Theta_1$ of the parameter space $\Theta$ given by all positive matrices $\theta^{kl}$ whose row sums are 1. The matrices $\theta^{kl}$ can then be viewed as transition probability matrices along the branches. The dimension of the parameter space $\Theta_1$ is therefore $d = \sum_{kl \in E(T)} |\Sigma_k| \cdot (|\Sigma_l| - 1)$. The fully observed tree Markov model for the tree $T$ is given by the restriction of the mapping $F_T : \mathbb{R}^d \to \mathbb{R}^m$ to the parameter space $\Theta_1$.

Tree models in phylogenetics usually use the same alphabet $\Sigma$ at every node, but the transition probability matrices along the edges remain distinct and independent.


Example 7.5. Consider the $1,n$ claw tree $T$ in Fig. 7.9. This tree has $n$ leaves and no internal nodes other than the root $r$; that is, $N(T) = \{r, 1, \ldots, n\}$ and $E(T) = \{r1, \ldots, rn\}$.

Assume that all nodes have the alphabet $\Sigma = \{0, 1\}$. The fully observed toric tree Markov model $F_T$ has $d = 4 \cdot n$ parameters given by the $2 \times 2$ matrices

\[ \theta^{ri} = \begin{pmatrix} \theta^{ri}_{00} & \theta^{ri}_{01} \\ \theta^{ri}_{10} & \theta^{ri}_{11} \end{pmatrix}, \quad 1 \leq i \leq n. \]

The model has $m = 2^{n+1}$ states, which are given by the binary strings $\sigma_r \sigma_1 \ldots \sigma_n \in \Sigma^{n+1}$.

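The fully observed tree Markov model $F_T$ has $d = 2 \cdot n$ parameters defined by the $2 \times 2$ matrices

\[ \theta^{ri} = \begin{pmatrix} \theta^{ri}_{00} & 1 - \theta^{ri}_{00} \\ \theta^{ri}_{10} & 1 - \theta^{ri}_{10} \end{pmatrix}, \quad 1 \leq i \leq n. \]

The coordinates of the mapping $F_T : \mathbb{R}^{2n} \to \mathbb{R}^{2^{n+1}}$ are the state probabilities

\[ p_{\sigma_r \sigma_1 \ldots \sigma_n} = \frac{1}{2} \cdot \theta^{r1}_{\sigma_r \sigma_1} \cdots \theta^{rn}_{\sigma_r \sigma_n}. \]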

Fig. 7.9. The 1,3 claw tree.
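The claw tree model of Ex. 7.5 is small enough to tabulate completely. The sketch below rebuilds each row-stochastic matrix $\theta^{ri}$ from its two free parameters $\theta^{ri}_{00}$ and $\theta^{ri}_{10}$ and lists all $2^{n+1}$ coordinates of $F_T$. The numerical parameter values and the function name `claw_probabilities` are hypothetical and serve only as an illustration.

```python
from itertools import product


def claw_probabilities(params):
    """Tabulate p for the 1,n claw tree over the alphabet {0, 1}.

    params[i] = (theta_00, theta_10) holds the two free parameters of the
    edge from the root to leaf i + 1. Returns a dict mapping each state
    (sigma_r, sigma_1, ..., sigma_n) to its probability.
    """
    n = len(params)
    # Each 2x2 matrix is determined by its first column because rows sum to 1.
    matrices = [[[a, 1 - a], [b, 1 - b]] for a, b in params]
    probabilities = {}
    for state in product((0, 1), repeat=n + 1):
        root, leaves = state[0], state[1:]
        p = 0.5  # uniform distribution at the root: 1 / |Sigma_r|
        for matrix, leaf in zip(matrices, leaves):
            p *= matrix[root][leaf]
        probabilities[state] = p
    return probabilities


# Hypothetical parameters for the 1,3 claw tree (d = 2 * 3 = 6 values).
example_params = [(0.9, 0.2), (0.8, 0.3), (0.6, 0.5)]
table = claw_probabilities(example_params)
print(len(table), sum(table.values()))
```

The table has $2^{3+1} = 16$ entries, and because the parameters lie in $\Theta_1$ its entries sum to 1 up to rounding.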

We provide maximum likelihood estimates for the fully observed tree Markov model. Let $T$ be a rooted binary tree with $n$ leaves. Note that the tree has $2n-1$ nodes. Assume that all individuals share a common alphabet $\Sigma$ of cardinality $q$. Suppose we are given a sequence of observations $\sigma_1, \sigma_2, \ldots, \sigma_N$ in $\Sigma^{2n-1}$. In the corresponding data vector $u = (u_\sigma)$, the entry $u_\sigma$ gives the number of occurrences of the state $\sigma \in \Sigma^{2n-1}$. Thus we have $\sum_\sigma u_\sigma = N$. Let $v^{kl}_{ij}$ denote the number of occurrences of $ij \in \Sigma^2$ as a consecutive pair for the edge $kl$ in the tree (Fig. 7.10). The vector $v = (v^{kl}_{ij})$ provides the sufficient statistic of the model (Prop. 4.11).

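Proposition 7.6. The maximum likelihood estimate for the data vector $u$ in the fully observed tree Markov model is the parameter vector $\hat\theta = (\hat\theta^{kl}_{ij})$ in $\Theta_1$ with coordinates

\[ \hat\theta^{kl}_{ij} = \frac{v^{kl}_{ij}}{\sum_{s \in \Sigma} v^{kl}_{is}}, \quad kl \in E(T),\ ij \in \Sigma^2. \]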

Fig. 7.10. Labelled edge $k \to l$.

Proof. The log-likelihood function of the toric model can be written as follows,

\[ \ell(\theta) = \sum_{kl} \sum_{i} \left( v^{kl}_{i1} \log \theta^{kl}_{i1} + \ldots + v^{kl}_{iq} \log \theta^{kl}_{iq} \right). \]

The log-likelihood function of the fully observed tree Markov model is obtained by restriction to the set $\Theta_1$ of positive matrices whose row sums are all equal to one. Therefore, $\ell(\theta)$ is the sum of expressions

\[ v^{kl}_{i1} \log \theta^{kl}_{i1} + \ldots + v^{kl}_{i,q-1} \log \theta^{kl}_{i,q-1} + v^{kl}_{iq} \log \left( 1 - \sum_{s=1}^{q-1} \theta^{kl}_{is} \right). \]

These expressions have disjoint sets of unknowns for different choices of the indices $k$, $l$, and $i$. To maximize $\ell(\theta)$ over $\Theta_1$, it is sufficient to maximize each of the above concave functions over the $(q-1)$-dimensional simplex of all non-negative vectors $(\theta^{kl}_{it})_t$ whose coordinates sum to 1. Equating the partial derivatives of these expressions to zero shows that the unique critical point has the required coordinates. ⊓⊔

Example 7.7 (Maple). Consider the rooted binary tree $T$ in Fig. 7.11. Assume that each node is associated with the binary alphabet $\Sigma = \{1, 2\}$. To this end, we initialize

> restart: with(combinat): with(linalg):
> # V.. hold the estimated transition matrices, T.. the transition counts per edge
> V12 := array(1..2,1..2);
> V13 := array(1..2,1..2);
> V24 := array(1..2,1..2);
> V25 := array(1..2,1..2);
> V36 := array(1..2,1..2);
> V37 := array(1..2,1..2);
> T12 := array([[0,0],[0,0]]);
> T13 := array([[0,0],[0,0]]);
> T24 := array([[0,0],[0,0]]);
> T25 := array([[0,0],[0,0]]);
> T36 := array([[0,0],[0,0]]);
> T37 := array([[0,0],[0,0]]);

generate 128 states $\sigma \in \Sigma^7$ uniformly at random,

> digs := rand(1..2): M := randmatrix(128, 7, entries = digs):

and calculate the sufficient statistic

> # row i of M is one observed state; count the symbol pair on each edge
> for i from 1 to 128 do
> T12[M[i,1],M[i,2]] := T12[M[i,1],M[i,2]] + 1;
> T13[M[i,1],M[i,3]] := T13[M[i,1],M[i,3]] + 1;
> T24[M[i,2],M[i,4]] := T24[M[i,2],M[i,4]] + 1;
> T25[M[i,2],M[i,5]] := T25[M[i,2],M[i,5]] + 1;
> T36[M[i,3],M[i,6]] := T36[M[i,3],M[i,6]] + 1;
> T37[M[i,3],M[i,7]] := T37[M[i,3],M[i,7]] + 1;
> od:

This allows us to estimate the parameters according to Prop. 7.6,

> V12 := array([[T12[1,1]/(T12[1,1]+T12[1,2]),

Fig. 7.11. Rooted tree with four leaves.
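For readers without Maple, the estimation in Prop. 7.6 can be sketched in Python as follows. This is not a translation of the Maple session, only an analogous computation under stated assumptions: the edge list is taken from the count matrices T12, ..., T37 above, the alphabet $\{1, 2\}$ is encoded as the indices 0 and 1, and the function names and the random seed are arbitrary choices made here.

```python
import random

# Edges k -> l of the tree in Fig. 7.11, as used by the count matrices T12..T37.
EDGES = [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]


def sufficient_statistic(samples, edges, q):
    """Count, for every edge kl, how often the pair (i, j) occurs at nodes (k, l)."""
    counts = {edge: [[0] * q for _ in range(q)] for edge in edges}
    for sigma in samples:
        for k, l in edges:
            # Nodes are numbered from 1, tuple positions from 0.
            counts[(k, l)][sigma[k - 1]][sigma[l - 1]] += 1
    return counts


def maximum_likelihood_estimate(counts):
    """Normalize each row of each count matrix, as in Prop. 7.6."""
    estimates = {}
    for edge, table in counts.items():
        rows = []
        for row in table:
            total = sum(row)
            if total == 0:
                # The estimate is undefined if a symbol never occurs at node k.
                raise ValueError(f"no observations for a row of edge {edge}")
            rows.append([value / total for value in row])
        estimates[edge] = rows
    return estimates


# 128 states drawn uniformly at random; symbols 1, 2 are encoded as 0, 1.
rng = random.Random(0)
samples = [tuple(rng.randrange(2) for _ in range(7)) for _ in range(128)]
counts = sufficient_statistic(samples, EDGES, q=2)
theta_hat = maximum_likelihood_estimate(counts)
print(theta_hat[(1, 2)])
```

Since the states are drawn uniformly, every estimated entry should be close to 1/2 for a sample of this size; the exact values depend on the random draw.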
