Data and General Models - Algebraic Statistics

Tree Markov Models

Phylogenetics is a branch of biology that seeks to reconstruct evolutionary history. Inferring a phylogeny is an estimation procedure that aims to provide the best estimate of that history from the incomplete information contained in the observed data. Ultimately, we would like to reconstruct the entire tree of life, which describes the course of evolution leading to all present-day species.

Phylogenetic reconstruction has a long history. Classical reconstruction was based on the observation and measurement of morphological similarities between taxa, possibly supplemented by similar evidence from the fossil record. However, with recent advances in sequencing technology, reconstruction based on the huge amount of available DNA sequence data is now by far the most commonly used technique. Moreover, reconstruction from DNA sequence data can operate automatically on well-defined digital data sets that fit into the framework of classical statistics, rather than proceeding from a somewhat ill-defined mixture of qualitative and quantitative data that needs expert oversight to adjust for difficulties such as morphological similarity.

This chapter is mainly devoted to two approaches to phylogenetic reconstruction: the maximum likelihood method, which evaluates a hypothesis about evolutionary history in terms of the probability that it gives rise to the observed data, and the algebraic method of phylogenetic invariants.

7.1 Data and General Models

Phylogenetic reconstruction makes use of the structure of trees. A tree is a cycle-free connected graph T = (N, E) with node set N = N(T) and edge set E = E(T).

An unrooted tree is considered as an undirected graph; that is, the edges are 2-subsets of the node set. Each edge {k, l} is written as a word kl with kl = lk, since there is no ordering on the nodes. The edges are also called branches. An unrooted binary tree (or trivalent tree) contains only nodes of degree one (terminal nodes or leaves) and degree three (internal nodes).
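The definition can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name `is_unrooted_binary_tree` and the node names in the example are illustrative assumptions. It relies on the standard fact that a connected graph with |N| − 1 edges is cycle-free.

```python
from collections import defaultdict


def is_unrooted_binary_tree(nodes, edges):
    """Check that (nodes, edges) is a tree whose nodes have degree 1 or 3.

    `edges` is an iterable of node pairs; each pair is treated as a 2-subset,
    so (k, l) and (l, k) denote the same edge.
    """
    nodes = set(nodes)
    edge_set = {frozenset(edge) for edge in edges}
    # Reject self-loops and edges that leave the node set.
    if any(len(edge) != 2 or not edge <= nodes for edge in edge_set):
        return False
    # A connected graph is cycle-free exactly when |E| = |N| - 1.
    if not nodes or len(edge_set) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for edge in edge_set:
        k, l = tuple(edge)
        adjacency[k].add(l)
        adjacency[l].add(k)
    # Depth-first search to test connectivity.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    if seen != nodes:
        return False
    # Leaves have degree one, internal nodes have degree three.
    return all(len(adjacency[node]) in (1, 3) for node in nodes)


# Hypothetical example: leaves 1..4, internal nodes 5 and 6.
quartet_nodes = {1, 2, 3, 4, 5, 6}
quartet_edges = [(1, 5), (2, 5), (5, 6), (3, 6), (4, 6)]
print(is_unrooted_binary_tree(quartet_nodes, quartet_edges))
```

A path on three nodes fails the test because its middle node has degree two.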

A rooted tree is considered as a directed graph; that is, the edges are ordered pairs. Each edge (k, l) is denoted as a word kl with kl ≠ lk, since there is an ordering on the nodes. A rooted binary tree contains, besides nodes of degree one and three, also one node of degree two, the so-called root. The edges are always directed away from the root.

A tree is labelled if its leaves are labelled. In phylogenies, trees are built from data corresponding to the leaves. These data are called taxa, while the data at the inner nodes are called intermediates. Taxa and intermediates are both called individuals.

In a phylogenetic tree, the arrow of time points away from the root (if any), paths down through the tree represent lineages (lines of descent), any point on a lineage corresponds to a point of time in the life of some ancestor of a taxon, inner nodes represent times at which lineages diverge, and the root (if any) corresponds to the most recent common ancestor of all the taxa.

The basic data in phylogenetics are DNA sequences, corresponding one-to-one with the taxa, that have been preprocessed in some suitable way. These sequences are assumed to be aligned. For simplicity, we suppose that we are dealing with segments of DNA without indels, so that all taxa share the same positions and differences between nucleotides at these positions are due to substitutions.

Suppose we have the following four aligned DNA sequences:

taxon 1   A G A C G T T A C G T A ...
taxon 2   A G A G C A A C T T T G ...
taxon 3   A A T C G A T A C G C A ...
taxon 4   T C T A G T A A C C C C ...

A standard assumption is that the behavior at widely separated positions on the genome is statistically independent. With this assumption, the modelling problem reduces to the modelling of nucleotides observed at a given position. For this, define a pattern σ to be the sequence of symbols that we get when we look at a single site (column) in the aligned sequence data. For instance, the third column gives rise to the pattern AATT. A tree that might describe the course of evolution based on this pattern is shown in Fig. 7.1. Any tree with four leaves labelled by the pattern in some order is a potential candidate that describes the course of evolution. To this end, we need to capture the various tree topologies and labellings.

Fig. 7.1. Rooted binary tree with labelled leaves.
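The site patterns defined above can be read off the alignment column by column. The following Python sketch is an illustration only; the dictionary `alignment` holds the first twelve sites of the four sequences given above, and the function name `site_patterns` is an assumption.

```python
# First twelve sites of the four aligned sequences from the text.
alignment = {
    "taxon 1": "AGACGTTACGTA",
    "taxon 2": "AGAGCAACTTTG",
    "taxon 3": "AATCGATACGCA",
    "taxon 4": "TCTAGTAACCCC",
}


def site_patterns(alignment):
    """Return one pattern (string of symbols) per aligned site."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("aligned sequences must have equal length")
    # zip(*sequences) walks through the alignment column by column.
    return ["".join(column) for column in zip(*sequences)]


patterns = site_patterns(alignment)
print(patterns[2])  # third column: AATT
```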

Two trees T and T′ are isomorphic if there is a bijective mapping φ : N → N′ between the node sets which is compatible with the edges; that is, for each pair of nodes k, l in T, kl is an edge in T if and only if φ(k)φ(l) is an edge in T′. The mapping φ is also called an isomorphism.


Let Σ denote the DNA alphabet (or any other alphabet used for bioinformatics data, like RNA or amino acids). Let T be a tree with set of leaves L. A labelling of T by DNA data is a mapping ψ : L → Σ. A tree equipped with such a labelling is called labelled. Two labelled trees T and T′ are equivalent if there is an isomorphism φ : N → N′ which is compatible with the labelling of the leaves; that is, if ψ : L → Σ is a labelling of T and ψ′ : L′ → Σ is a labelling of T′, then ψ(l) = ψ′(φ(l)) for each leaf l ∈ L.
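For very small trees, equivalence can be tested by brute force over all bijections between the node sets. The sketch below is an illustration under stated assumptions, not a method from the text: it treats the trees as unrooted (edges are 2-subsets), the function name `are_equivalent` and the example trees are hypothetical, and the search is factorial in the number of nodes, so it is only meant for a handful of nodes.

```python
from itertools import permutations


def are_equivalent(edges_a, labels_a, edges_b, labels_b):
    """Brute-force equivalence test for two small labelled unrooted trees.

    edges_*  : iterable of node pairs (treated as 2-subsets)
    labels_* : dict mapping each leaf to its symbol (the labelling psi)
    """
    set_a = {frozenset(edge) for edge in edges_a}
    set_b = {frozenset(edge) for edge in edges_b}
    nodes_a = sorted({node for edge in set_a for node in edge})
    nodes_b = sorted({node for edge in set_b for node in edge})
    if len(nodes_a) != len(nodes_b) or len(set_a) != len(set_b):
        return False
    for image in permutations(nodes_b):
        phi = dict(zip(nodes_a, image))  # candidate bijection N -> N'
        # With equally many edges, mapping every edge onto an edge of the
        # second tree is the same as the "if and only if" condition.
        edges_ok = all(
            frozenset(phi[node] for node in edge) in set_b for edge in set_a
        )
        # Compatibility with the labelling: psi(l) = psi'(phi(l)).
        labels_ok = all(
            labels_b.get(phi[leaf]) == symbol
            for leaf, symbol in labels_a.items()
        )
        if edges_ok and labels_ok:
            return True
    return False


# Hypothetical example: leaves 1..4, internal nodes 5 and 6.
edges = [(1, 5), (2, 5), (5, 6), (3, 6), (4, 6)]
tree_x = {1: "A", 2: "C", 3: "G", 4: "T"}
tree_y = {1: "C", 2: "A", 3: "T", 4: "G"}  # leaves swapped within each pair
tree_z = {1: "A", 2: "G", 3: "C", 4: "T"}  # A now paired with G
print(are_equivalent(edges, tree_x, edges, tree_y))
print(are_equivalent(edges, tree_x, edges, tree_z))
```

In the example, `tree_x` and `tree_y` differ only by swapping the two leaves attached to the same internal node, so they are equivalent; `tree_z` groups the symbols differently and is not.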

Example 7.1. There are two non-isomorphic rooted binary trees with four leaves (Fig. 7.4). The first has twelve labelled trees (Fig. 7.2) and the second has three labelled trees (Fig. 7.3). ♦

leaves  4 5 6 7      leaves  4 5 6 7
        A C G T              G A C T
        A G C T              G C A T
        A T C G              G T A C
        C A G T              T A C G
        C G A T              T C A G
        C T A G              T G A C

Fig. 7.2. Labellings of the first rooted tree in Fig. 7.4.

leaves  4 5 6 7
        A C G T
        A G C T
        A T C G

Fig. 7.3. Labellings of the second rooted tree in Fig. 7.4.
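The counts in Example 7.1 can be reproduced by enumeration. The following Python sketch is an illustration, not the construction used in the text: a rooted binary tree is encoded as a nested `frozenset` of its two subtrees (a hypothetical encoding chosen so that equivalent labelled trees compare equal), and the function names are assumptions.

```python
from itertools import combinations


def rooted_trees(labels):
    """List all labelled rooted binary trees on the given distinct labels.

    A leaf is its label (a string); an inner node is a frozenset holding
    its two subtrees, so the order of children does not matter.
    """
    labels = sorted(labels)
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    trees = []
    # Keep `first` in the left part so each unordered split appears once.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = [first, *chosen]
            right = [label for label in rest if label not in chosen]
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    trees.append(frozenset([left_tree, right_tree]))
    return trees


def leaf_count(tree):
    """Number of leaves below (and including) the given subtree."""
    if isinstance(tree, str):
        return 1
    return sum(leaf_count(child) for child in tree)


def is_balanced(tree):
    """True if the root splits four leaves into two pairs."""
    return sorted(leaf_count(child) for child in tree) == [2, 2]


trees = rooted_trees("ACGT")
balanced = [tree for tree in trees if is_balanced(tree)]
print(len(trees) - len(balanced), len(balanced))
```

The unbalanced shape accounts for twelve labelled trees and the balanced shape for three, fifteen in total.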

Proposition 7.2. A labelled unrooted binary tree with n ≥ 3 leaves has n − 2 internal nodes and 2n − 3 edges. The number of inequivalent labelled unrooted binary trees with n ≥ 3 leaves is

\[ \prod_{i=3}^{n} (2i-5). \]
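To get a feeling for how quickly this number grows, the product can be evaluated directly. This is a small illustrative sketch; the function name `count_unrooted_binary_trees` is an assumption.

```python
def count_unrooted_binary_trees(n):
    """Evaluate the product of (2i - 5) for i = 3, ..., n."""
    if n < 3:
        raise ValueError("the formula is stated for n >= 3 leaves")
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5
    return count


for n in (3, 4, 5, 6, 10):
    print(n, count_unrooted_binary_trees(n))
```

For n = 3, 4, 5, 6 the values are 1, 3, 15, 105, and for n = 10 the count already exceeds two million.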

Proof. Let g_n denote the number of inequivalent labelled unrooted binary trees with n ≥ 3 leaves.

There is one labelled unrooted binary tree with n = 3 leaves, i.e., g_3 = 1. This tree has n − 2 = 1 internal node and 2n − 3 = 3 edges (Fig. 7.5).

Fig. 7.4. Two rooted binary trees with four leaves.

Fig. 7.5. Labelled unrooted binary tree with three leaves.

Consider a labelled unrooted binary tree T with n ≥ 3 leaves. Choose an edge of T, bisect it by introducing an internal node and connect this node to a new leaf (Fig. 7.6). By induction, the new tree has n + 1 leaves, (n − 2) + 1 = (n + 1) − 2 internal nodes, and (2n − 3) + 2 = 2(n + 1) − 3 edges. Each labelled unrooted binary tree with n + 1 leaves can be constructed in this way, and the constructed trees are all pairwise inequivalent. Since each labelled unrooted binary tree with n leaves has 2n − 3 edges, we have g_{n+1} = g_n · (2n − 3) = g_n · (2(n + 1) − 5), as required. □
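The edge-bisection step of the proof can be carried out explicitly for small n. The sketch below is an illustration under stated assumptions: leaves are the integers 1, ..., n, internal nodes are given hypothetical names "u1", "u2", ..., and inequivalence is tested by comparing the splits of the leaf set induced by the edges, using the standard fact that a leaf-labelled tree is determined by its splits.

```python
def grow(edges, new_leaf, new_node):
    """Yield each tree obtained by bisecting one edge and attaching new_leaf."""
    for i, (a, b) in enumerate(edges):
        rest = edges[:i] + edges[i + 1:]
        yield rest + [(a, new_node), (new_node, b), (new_node, new_leaf)]


def all_unrooted_trees(n):
    """Build all trees with leaves 1..n, starting from the three-leaf tree."""
    if n < 3:
        raise ValueError("n must be at least 3")
    trees = [[(1, "u1"), (2, "u1"), (3, "u1")]]
    for leaf in range(4, n + 1):
        new_node = f"u{leaf - 2}"
        trees = [grown for tree in trees for grown in grow(tree, leaf, new_node)]
    return trees


def split_system(edges):
    """Canonical form: per edge, the leaf set on the side not containing leaf 1."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    leaves = frozenset(v for v in adjacency if isinstance(v, int))
    result = set()
    for a, b in edges:
        # Collect everything reachable from b without crossing the edge ab.
        seen, stack = {b}, [b]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen and (node, neighbour) != (b, a):
                    seen.add(neighbour)
                    stack.append(neighbour)
        side = frozenset(v for v in seen if isinstance(v, int))
        result.add(leaves - side if 1 in side else side)
    return frozenset(result)


trees5 = all_unrooted_trees(5)
print(len(trees5), len({split_system(tree) for tree in trees5}))
```

For n = 5 the construction yields 3 · 5 = 15 trees, each with 2n − 3 = 7 edges, and their split systems are pairwise different.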

Fig. 7.6. Labelled unrooted binary tree with four leaves.

Proposition 7.3. A labelled rooted binary tree with n ≥ 2 leaves has n − 1 internal nodes and 2n − 2 edges. The number of inequivalent labelled rooted binary trees with n ≥ 2 leaves is
