

3.1.1 Phylogenetics

To motivate the consensus tree problem (CTP), we give a short and by no means complete survey of phylogenetic tree reconstruction. A phylogenetic tree is composed of nodes and branches (arcs) and models the evolutionary relationship between a set L of related objects called taxa.

These taxa are the labeled leaf nodes, or operational taxonomic units, of the tree, whereas unlabeled inner nodes represent probably extinct ancestors, also denoted as hypothetical taxonomic units. Exemplary evolutionary trees are depicted in Figure 3.1.

In the course of the project we decided to restrict our work to rooted, unweighted binary trees, i.e. there exists a single distinguished root node denoting the common ancestor of all taxa, the relations represented by the tree are not weighted in any way (and hence do not represent a timeline), and each inner node has exactly two direct descendants.

Next, we will introduce some definitions which will be of use in this chapter; some are taken from [30].

Definition 3 A rooted (phylogenetic) tree is a rooted tree in which every leaf is identified with a unique taxon and every node that is not a leaf (i.e. every inner node) has at least two children.

Definition 4 A rooted tree is binary if every inner node has exactly two children.


Definition 5 A group is a subset of the set of taxa.

Definition 6 A cluster of a tree T is a group which contains all the descendants of its most recent common ancestor.

Definition 7 Two clusters A and B are compatible iff A is contained in B, or B is contained in A, or A and B are disjoint.
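Definition 7 translates directly into a set test. The following is a minimal sketch, assuming that clusters are given as Python sets of taxon names; the function name `compatible` is chosen here for illustration only.

```python
def compatible(a, b):
    """Definition 7: two clusters are compatible iff they are nested or disjoint."""
    a, b = frozenset(a), frozenset(b)
    # a <= b tests "a is contained in b"; isdisjoint tests for an empty intersection.
    return a <= b or b <= a or a.isdisjoint(b)


if __name__ == "__main__":
    print(compatible({"a", "b"}, {"a", "b", "c"}))  # nested clusters
    print(compatible({"a", "b"}, {"b", "c"}))       # overlapping, not nested
    print(compatible({"a"}, {"c", "d"}))            # disjoint clusters
```

Two clusters that overlap without one containing the other, such as {a, b} and {b, c}, cannot both occur in the same rooted tree, which is why the test rejects them.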

Definition 8 A cluster is compatible with a tree T if it is compatible with every cluster of T.

Definition 9 A rooted tree T refines another rooted tree T' on the same set of taxa if every cluster of T' is a cluster of T.
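Definitions 6, 8 and 9 can be illustrated with a few lines of code. This is only a sketch under assumptions that are not part of the text: a tree is encoded as nested tuples whose leaves are taxon names (strings), and the helper names `clusters`, `compatible_with_tree` and `refines` are hypothetical.

```python
def clusters(tree):
    """Return all clusters of a nested-tuple tree as frozensets of taxa.

    Every node contributes the set of leaves below it (Definition 6),
    including the trivial clusters (single taxa and the full taxa set).
    """
    found = set()

    def visit(node):
        if isinstance(node, str):          # a leaf, i.e. a single taxon
            leaves = frozenset([node])
        else:                              # an inner node: union of its children
            leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def compatible(a, b):
    """Definition 7: nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


def compatible_with_tree(cluster, tree):
    """Definition 8: compatible with every cluster of the tree."""
    cluster = frozenset(cluster)
    return all(compatible(cluster, c) for c in clusters(tree))


def refines(t, t_prime):
    """Definition 9: every cluster of t_prime is also a cluster of t."""
    return clusters(t_prime) <= clusters(t)


if __name__ == "__main__":
    binary = (("a", "b"), ("c", "d"))      # fully resolved tree
    coarse = ("a", "b", ("c", "d"))        # root with three children
    print(refines(binary, coarse), refines(coarse, binary))
    print(compatible_with_tree({"a", "c"}, binary))
```

In this example the binary tree refines the coarser one because it only adds the cluster {a, b}; the group {a, c} is incompatible with the binary tree since it overlaps {a, b} without being nested.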

Definition 10 A rooted triple ab|c denotes a grouping of a and b relative to c. ab|c is said to be a rooted triple of tree T if the least common ancestor of a and b is a descendant of the least common ancestor of a, b and c. The set of all rooted triples of T is denoted as r(T).

Definition 11 A set of rooted triples R is compatible if there exists a rooted tree T such that R ⊆ r(T).
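To make Definition 10 concrete, the set r(T) can be enumerated for a small tree. The sketch below assumes a nested-tuple tree encoding with strings as leaves and reads "descendant" as "proper descendant"; a triple ab|c is represented as the pair (frozenset({a, b}), c). The function names are illustrative. Note that `displays` only checks R ⊆ r(T) for one given tree T; it does not decide compatibility in the sense of Definition 11, which asks whether any such tree exists.

```python
from itertools import combinations


def rooted_triples(tree):
    """Enumerate r(T) for a nested-tuple tree T.

    ab|c holds iff a and b lie below the same child of some inner node v
    while c lies below a different child of v: then v is the least common
    ancestor of a, b, c and the one of a and b is strictly below v.
    """
    triples = set()

    def visit(node):
        if isinstance(node, str):                      # leaf
            return frozenset([node])
        child_leaves = [visit(child) for child in node]
        all_leaves = frozenset().union(*child_leaves)
        for leaves in child_leaves:
            for a, b in combinations(sorted(leaves), 2):
                for c in all_leaves - leaves:
                    triples.add((frozenset((a, b)), c))
        return all_leaves

    visit(tree)
    return triples


def displays(tree, triple_set):
    """Check R ⊆ r(T) for a given tree T (not a general compatibility test)."""
    return set(triple_set) <= rooted_triples(tree)


if __name__ == "__main__":
    t = ((("a", "b"), "c"), "d")
    for pair, c in sorted(rooted_triples(t), key=lambda x: (sorted(x[0]), x[1])):
        print("".join(sorted(pair)) + "|" + c)
```

For the caterpillar tree in the usage example, every choice of three taxa yields exactly one rooted triple, as expected for a binary tree.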

The phylogeny problem is to infer the intermediate ancestors and branches, and thus to derive the evolutionary relationships, from given species data. The latter is commonly given as biomolecular sequences, i.e. DNA, RNA or amino acid sequences. Often, morphological features are used in addition. A multitude of conceptually different inference methods exist for inferring the tree, each having individual advantages and drawbacks [122].

On the one hand, there are approaches dealing directly with the sequences, e.g. maximum parsimony, where the “cheapest” tree is sought by minimizing the Hamming distance between connected sequences, and maximum likelihood, which uses a stochastic model of evolution. On the other hand, there are approaches that use the sequences to obtain certain “distances” among the taxa and subsequently work on the derived distance matrix, seeking a tree that best resembles these distances. The distance-based approaches can be regarded as an intermediate strategy between maximum parsimony and maximum likelihood.
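The cost notion behind maximum parsimony can be sketched as follows. This is a deliberately simplified illustration, not an inference method: it assumes that a sequence is already known for every node, including the inner ones, whereas finding those sequences and the tree itself is part of the actual optimization. The names `hamming` and `tree_cost` and the example data are hypothetical.

```python
def hamming(s, t):
    """Number of positions at which two equally long sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(s, t))


def tree_cost(edges, sequences):
    """Sum of Hamming distances over all branches (u, v) of a tree."""
    return sum(hamming(sequences[u], sequences[v]) for u, v in edges)


if __name__ == "__main__":
    # Made-up toy data: three taxa a, b, c and two inner nodes.
    sequences = {
        "root": "ACGT",
        "x": "ACGA",
        "a": "ACGA",
        "b": "TCGA",
        "c": "ACGT",
    }
    edges = [("root", "x"), ("root", "c"), ("x", "a"), ("x", "b")]
    print(tree_cost(edges, sequences))
```

A tree with a lower total cost explains the observed sequences with fewer changes along its branches, which is what "cheapest" refers to above.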

Stated as optimization problems, the maximum parsimony approach resembles a Steiner tree problem [81], and is thus NP-complete. Since maximum likelihood approaches were recently shown to be connected to the former ones [196], the corresponding problem is NP-complete, too. Unfortunately, distance-based approaches, e.g. the least-squares fit and the f statistic, also belong to the class of NP-complete problems. For this reason it is quite common to apply heuristics, often relying on rather simple hill-climbing algorithms. Nevertheless, tackling distance-based approaches heuristically is regarded as yielding reasonable phylogenies rather quickly.

The difficulty of finding a (near-)optimal tree becomes even more evident when looking at the number of possible trees for a given number of species n. The number of possible unrooted trees is given by

$$(2n-5)!! = \frac{(2n-5)!}{2^{n-3}\,(n-3)!} \quad \text{for } n \ge 3,$$

whereas the rooted case allows even more freedom, because the root node has to be placed as well, and therefore amounts to

$$(2n-3)!! = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \quad \text{for } n \ge 2.$$

The extreme growth of the number of rooted trees, in which we are subsequently interested, is demonstrated in Table 3.1.

Table 3.1: The rapid growth of the number of possible rooted trees.

|L|    #rooted trees
  2    1
  4    15
  6    945
  8    135135
 10    3.44 · 10^7
 12    1.37 · 10^10
 14    7.90 · 10^12
 16    6.19 · 10^15
 18    6.33 · 10^18
 20    8.20 · 10^21
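The counts in Table 3.1 can be reproduced with a few lines of code. The sketch below evaluates the double factorials directly and cross-checks them against the closed forms with ordinary factorials; the function names are chosen for illustration.

```python
from math import factorial


def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 for odd k >= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted binary trees on n >= 2 taxa: (2n - 3)!!."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return double_factorial(2 * n - 3)


def num_rooted_trees_closed_form(n):
    """Same count via (2n - 3)! / (2^(n-2) * (n - 2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in range(2, 21, 2):
        print(n, num_rooted_trees(n))
```

Since Python integers have arbitrary precision, the values are exact; the table lists the larger ones rounded to three significant digits.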

The mentioned inference methods are utilized to derive phylogenetic trees. Unfortunately, it is likely that biologists end up with several different trees for one and the same taxa set L due to

• having multiple data sets available,

• inferring with different methods, or

• repeated runs of the same non-deterministic method.

This is where the consensus tree comes into play, which will be detailed in the next section.

A long-term goal in phylogenetics is to successively build up a vast evolutionary tree (or collections thereof), and eventually come as close as possible to the so-called tree of life. Two examples where many people collaborate on this task are the Tree Of Life Web Project [221] and Wikispecies [236].

We conclude our journey into the domain of phylogenetics with a quote from Charles Darwin from 1859 and his first drawing of an evolutionary tree, shown in Figure 3.2.

"The affinities of all the beings of the same class have sometimes been represented by a great tree... As buds give rise by growth to fresh buds, and these if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the great Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications."


Figure 3.2: Darwin’s first evolutionary tree in his “B” notebook on Transmutation of Species, 1837.