Facility Location in the Phylogenetic Tree Space

Dissertation

for the award of the doctoral degree in mathematics and natural sciences

"Doctor rerum naturalium"

of the Georg-August-Universität Göttingen

within the doctoral program

"PhD School of Mathematical Sciences" (SMS) of the Georg-August University School of Science (GAUSS)

submitted by

Marco Botte

from Fritzlar

Göttingen, 2019

Thesis Committee

Prof. Dr. Anita Schöbel, since 1 January 2019 Fachbereich Mathematik, Technische Universität Kaiserslautern, previously Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Prof. Dr. Stephan Huckemann, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen

Members of the Examination Committee

Referee: Prof. Dr. Anita Schöbel, since 1 January 2019 Fachbereich Mathematik, Technische Universität Kaiserslautern, previously Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Co-referee: Prof. Dr. Stephan Huckemann, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen

Further members of the Examination Committee:

Jun.-Prof. Dr. Anja Fischer, Juniorprofessur Management Science, Technische Universität Dortmund

Prof. Dr. Preda Mihăilescu, Mathematisches Institut, Georg-August-Universität Göttingen

Prof. Dr. Gerlind Plonka-Hoch, Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Prof. Dr. Max Wardetzky, Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Date of the oral examination: 28 February 2019

Acknowledgements

First of all I want to thank my supervisor Prof. Anita Schöbel for very fruitful discussions, the constant support throughout the years and for giving me the possibility to work on many different interesting fields of optimization. I also want to thank Prof. Stephan Huckemann for the interesting joint collaboration with international researchers and for introducing interesting points of view on aspects of phylogenetic trees.

Moreover, I thank all members of the work group for the pleasant atmosphere we have in the institute. My special thanks go to Julius for the congenial time in the office and also for proofreading parts of this work.

Last but not least, I wholeheartedly thank my family and Maria for their encouragement and support, especially for always believing in me.

1 Introduction

Phylogenetic trees are diagrams with a tree structure that depict the relations of evolutionary history between a certain set of existing species to be investigated.

Ever since Ernst Haeckel coined the term phylogeny in 1869 as ‘genesis and evolution of a phylum’, where ‘genesis’ translates as ‘origin’ and ‘phylum’ as ‘race’, phylogenetic trees have been part of biologists’ attempts to classify existing species according to their evolutionary history; see Figure 1.1 for an example. This classification is based on shared characteristics and genes.

Figure 1.1: Tree of life according to Ernst Haeckel, cf. [Hae].

In contrast to Figure 1.1, a phylogenetic tree nowadays always comes with a topology, which is the tree structure as depicted, but also with positive edge weights, describing the amount of mutation of genes between two nodes of the tree. Phylogenetics is an active field of research because the evolutionary history of a given set of species is often still not agreed upon by biologists. Hence, the so-called species tree, which is the phylogenetic tree depicting the actual evolution of a given set of species, is usually unknown and is subject to study.

The Species Tree Problem

As there is no straightforward way to obtain the species tree, many methods have been proposed that hypothesize what the species tree could look like. All of these approaches use sets of phylogenetic trees as input in order to give a hypothesis for the species tree, thereby trying to incorporate shared features of the given trees into the species tree.

The phylogenetic trees that are used as data for such methods are called gene trees and are obtained as follows: Specific genes that the investigated species share are tagged with a biological marker. To this end one uses gene sequences that are empirically known to be good guesses of the species tree. Then the gene sequences are aligned to be as ‘parsimonious’ as possible. There exist many different approaches to infer trees from these aligned gene sequences, e.g., Bayesian methods or bootstrap procedures, which were introduced to the field of phylogenetics by [Fel85], marking a milestone in phylogenetic inference. The result of such a method is a phylogenetic tree depicting the relation between the investigated species based on the information of a single gene. These trees are called gene trees.

Applying this method, one often obtains different trees for different genes, which shows that it is not possible to directly infer the species tree from a single gene tree.

[Mad97] show a specific case where the topology, i.e., the tree structure, of the gene tree does not coincide with the topology of the species tree and [DR06] even show that this need not be the case for the “most likely gene tree” as well. Nevertheless, since it is impossible to directly determine the species tree, one tries to infer as much information from these gene trees as possible in order to develop reasonable hypotheses for the species tree.

Having performed this acquisition of gene trees for several genes, one receives a sample set of phylogenetic gene trees. Inferring the species tree from such a set of gene trees is what is often referred to as the species tree problem, or ‘gene tree/species tree problem’, as in [PC97]. One of the earliest works in this field is [AI72], where Adams develops the notion of a consensus tree that is supposed to convey the information of the gene trees as well as possible in order to get a reasonable candidate for the species tree. After consensus trees had been introduced, many different methods to find the species tree were proposed, but they are mostly educated guesses or heuristic rules. Popular methods include the majority rule consensus tree [MM81] and strict consensus methods [MMN83], which are probably widely used due to their simple computation rules. For an overview of consensus trees and related methods see [Bry03]. Naturally, these heuristic methods are limited in describing the actual species tree, which is also discussed in the literature, see for example [BDS91] and [Nel93] as a reply. Thus, there was an urgent need for more sophisticated techniques, which have been developed since then in the hope of better inferring the species tree from a given set of gene trees.

A lot of recent research involves mathematical, mostly statistical methods to find the species tree. A large part of this mathematical research is based on [BHV01], since it provides the framework for a new systematic approach to phylogenetic tree related problems. Billera, Holmes and Vogtmann were able to define the metric space T_n, which contains all possible phylogenetic trees for a fixed set of existing species {1, . . . , n}; these species are the leaves of the trees in T_n. In the following we will refer to the space T_n as tree space. T_n is a metric space of ‘global non-positive curvature’. This directly implies two important features for mathematical modeling: Firstly, it is a metric space containing all possible phylogenetic trees on n leaves, i.e., it is a suitable model space for the species tree problem, since the given data and the solution to the problem are contained in this space. Secondly, there does not only exist a distance measure between any given pair T₁, T₂ of trees in T_n, which is required by many algorithms, but there even exists a unique shortest path from T₁ to T₂ whose length equals the distance d(T₁, T₂), where d is the metric on T_n. These are necessary properties to formulate and tackle the species tree problem in T_n.

In this model setting, natural candidates for the species tree are some sort of ‘average’ or ‘centroid’ trees. The question of finding a centroid from a set of sample trees has already been posed in [BHV01] and several possible notions of a centroid of a sample set in a metric space with global non-positive curvature are mentioned.

Unfortunately, all given notions of center points actually require computing the geodesic between two trees, or at least the distance of two trees. So even though [BHV01] show that the above-mentioned shortest path between two trees, called the geodesic, exists and is unique, they were not able to propose a method to efficiently calculate the distance between two points in T_n. This shortcoming strongly encouraged follow-up research in this direction. The computation of the distance and the geodesics in T_n has been extensively studied in [Owe08, Owe11], leading to the milestone of a polynomial time algorithm calculating the distance and the parametrization of the geodesic for two given trees, see [OP11].

Now, having the metric space T_n and the possibility to efficiently calculate geodesics at hand, one is able to use several of the concepts of centroids that [BHV01] introduced to get new hypotheses on the species tree. For example, [MOP15] translate the algorithm of Sturm that was already mentioned by [BHV01] into the tree space setting and apply it to several data sets. As it turns out, though, the computation of the center points remains an open problem in practice. Sturm’s algorithm theoretically converges to the mean, but for high-dimensional and large data sets the method did not reach a pre-specified termination condition before its calculation time was exceeded. So there is still need for improved procedures, as the existing methods converge slowly (Sturm’s algorithm is sublinear, see [MOP15]), are not tractable for larger instance sizes, or even yield the wrong species tree [MOP15].

As an attempt to make the computation of the mean more efficient, [MOP15] further investigated the structure of T_n and developed special algorithms, e.g., gradient descent methods that are based on interior point or penalty methods. Nevertheless, there are still practical and theoretical problems with these approaches, since analytical properties, such as optimality criteria and differentiability, do not exist on specific subsets of T_n. An up-to-date review on Fréchet means in tree space is presented in [BO17], pointing out both the strengths and weaknesses of the concept and giving concrete numerical studies.

Another mathematical line of research that recently evolved is to find a different model space for the phylogenetic problems. To this end, the so-called space of ultrametric trees, equipped with the tropical metric, is investigated in [YZZ17, LMY18]. This research needs to be carried out in depth in order to evaluate which advantages and disadvantages this model space has.

Contribution and Structure of the Thesis

As we have seen, there exists a lot of research in the field of phylogenetics, including numerous approaches on finding the species tree. Recent mathematical studies mainly concentrate on finding new model spaces or different and faster ways to compute the Fréchet mean to do statistics on the model spaces, but it can neither be anticipated what the outcome of these approaches will be nor whether their results will solve the species tree problem, at least to biologists’ satisfaction.

In this thesis we introduce a different point of view on the species tree problem by incorporating “Location Theory”: We interpret the set of gene trees as facilities, as usual for location problems. Facility location problems are optimization problems that are motivated from real-world and economic problems and are usually investigated in Euclidean space. So on the one hand this work extends the field of research of Location Theory beyond its usual scope. On the other hand, we aim at exploiting the specific local structure of T_n to obtain location problems which may be solved within a Euclidean setting. By doing so we can then make use of known algorithms and results from Location Theory in Euclidean space in order to try solving the original location problems in tree space, thereby connecting these two fields of research.

The thesis is structured as follows (see also Figure 1.2). In Chapter 2 we introduce the phylogenetic tree space T_n as defined in [BHV01]. In Section 2.2 we give a detailed description of the geodesic distance and the parametrization of geodesics in tree space, which are the key tools to work with location problems in T_n.

In Section 2.3 we motivate our location theoretical point of view on the species tree problem in more detail before we build the bridge from the species tree problem to Location Theory in Chapter 3. We introduce three location problems in tree space that yield different notions of ‘centroids’ of a given set of trees as hypotheses for the species tree. We also formulate general results regarding optimal solutions for these three problems in Subsection 3.2.1.

Chapter 4 illustrates first approaches to solve these tree space location problems by providing reformulations and solution methods for special cases.

The central problem of the thesis is tackled in Chapter 5. The goal is to find a median of a given set of points in tree space and we present an algorithm to determine a median that is based on a local Euclidean improvement strategy. This algorithm, called the Balance Point Algorithm, is a heuristic for which we prove convergence under certain assumptions in Section 5.3. As the Balance Point Algorithm is a local improvement procedure, it is necessary to find a good neighborhood in which the algorithm is started in order to obtain good results.

In order to find a good neighborhood, we develop bounds for specific subsets of T_n in Section 5.1 to determine auspicious subsets of T_n, where the Balance Point Algorithm may be applied. In Section 5.2 we formulate a heuristic that determines such auspicious subsets in a preprocessing procedure and then applies the Balance Point Algorithm for the remaining subsets.

Finally, we illustrate how the Balance Point Algorithm works on a real data set in Chapter 6 and discuss its results in comparison to other methods before the thesis is summarized in Chapter 7.

Figure 1.2: Structure of the thesis. Chapter 1: Species Tree Problem; Chapter 2: Phylogenetic tree space T_n; Chapter 3: Location Problems in T_n; Chapter 4: Solutions via Transformations; Chapter 5: BPA for median problems in T_n; Chapter 6: Real data example; Chapter 7: Conclusion.

2 Construction and Properties of the Tree Space

As mentioned in the introduction, we refer to the space of phylogenetic trees T_n as tree space. The construction of this space is given in a meticulous way to ensure that all subtleties of the tree space and important tools required for proofs are introduced carefully. Note that, since this is an introduction of the tree space, almost all definitions and results are known from the literature and are always indicated with the proper citation.

2.1 Splits, Compatibility and Orthants

We start by giving a concise mathematical definition for the elements of T_n, the phylogenetic trees. In the following, let n species {1,2, . . . , n} be given whose evolution is to be investigated. These species are leaves of the phylogenetic tree describing their evolution, as they are present today and have no descendants. There is one additional leaf in a phylogenetic tree which is the root node and is indexed by 0. The root node models a known common ancestor of these species.

Due to the problem’s nature we are only interested in trees which do not contain any node with degree two. If there exists a node with degree two, it can be removed and its two incident edges may be merged into one, as this node does not convey any information on a bifurcation or multifurcation of species. The edges of a tree can be distinguished into edges which are incident with a leaf and edges which are not. The latter are called interior edges. As already mentioned, phylogenetic trees also carry information on how much mutation took place between two nodes, i.e., species in the tree, which is modeled by positive edge weights for all interior edges.

Now, with these properties of a phylogenetic tree in mind, we can formally define our desired notion of trees.

Definition 2.1. [BHV01] A weighted graph T = (V, E, w) is called metric n−tree if

• T = (V, E) is a tree,

• {0,1, . . . , n} ⊆ V are the only leaves of T and 0 ∈ V is the root node,

• T does not have any node of degree two,

• T satisfies w_e > 0 for all interior edges e of E, i.e., w ∈ R^|E|_(>0).

In the following, the elements of the tree space, the metric n−trees, are usually referred to as trees. If we want to reference a tree without edge weights, we emphasize this by speaking of the topology of the tree or explicitly write T = (V, E), omitting w.

The next result follows immediately from the definition of interior edges and the property of metric n−trees not having nodes of degree two.

Lemma 2.2. [BHV01] The maximal number of interior edges for metric n−trees is n−2.

This maximal number of edges is attained when every non-leaf node has degree three. Then, regarding the tree from the root 0 at the top to the leaves {1, . . . , n} at the bottom, each edge from the top splits up into two different edges; this motivates the following name.

Definition 2.3. [BHV01] A tree with n−2 interior edges is called binary tree.

Example 2.1.1. The tree topologies for n = 4:

For n = 4 species, all possible topologies of such trees are depicted in Figures 2.1 to 2.4. Figure 2.1 contains the star tree as well as four different topologies which all describe that three species have one common ancestor, and one species has developed separately.

Figure 2.1: Degenerate tree topologies for trees with 4 species. Panels: the star tree and the four topologies in which the species {2,3,4}, {1,3,4}, {1,2,4} or {1,2,3} share a common ancestor.

Figure 2.2 shows the cases of a degenerate tree topology where one pair of species shares a common ancestor that developed after the split of the other two species.

Figure 2.2: Degenerate tree topologies for trees with 4 species (six panels, one for each pair of species sharing a common ancestor).

Figure 2.3 shows the three binary tree topologies in which the species have developed in pairs:

Figure 2.3: 3 of the 15 = (2·4−3)!! = 5·3·1 binary tree topologies for trees with 4 species (pairings {1,2},{3,4}; {1,3},{2,4}; {1,4},{2,3}).

Figure 2.4 shows the remaining twelve binary tree topologies where one species has developed separately, while three of them have one common ancestor from which another species developed separately.

Figure 2.4: 12 of the 15 = (2·4−3)!! = 5·3·1 binary tree topologies for trees with 4 species.

This enumeration of trees is in fact exhaustive, i.e., all different tree topologies for n = 4 have been depicted up to different planar embeddings of the graphs. Handling phylogenetic trees via tree topologies (V, E) is very cumbersome, which is the reason for the introduction of splits of trees in the next section.

Splits

The figures above have shown that there are many different topologies for trees. Our goal now is to describe tree topologies in an easy way, as sets of splits. Note that the number of interior nodes (non-leaf nodes) and the number of edges may differ from tree to tree. Thus, a representation using nodes or edges does not seem appropriate. Instead, trees will be described using partitions of their sets of leaves that are induced by the interior edges, which are the edges of the tree that are not incident to any leaf.

Partitions are induced when removing edges, which is a well-known observation from graph theory:

A tree T = (V, E) with vertices V and edges E gets disconnected and divided into two trees if we remove any of its edges e ∈ E. More precisely, when removing any edge e ∈ E the tree T is split into two connected components (V₁(e), E₁(e)) and (V₂(e), E₂(e)) with V₁(e) ∪ V₂(e) = V and E₁(e) ∪ {e} ∪ E₂(e) = E.

In particular this yields a partition of the leaves {0,1, . . . , n} ⊆ V of the tree into two disjoint sets A(e) = V₁(e)∩ {0,1, . . . , n} and A(e)^c = V₂(e)∩ {0,1, . . . , n} = {0,1, . . . , n} \A(e). It is easily observed that removing an edge e incident to a leaf i∈ {0,1, . . . , n} of the tree results in the partition {i} and {0,1, . . . , n} \ {i}, and that these partitions can be obtained for any tree T ∈ T_n. We can hence neglect these trivial partitions when describing tree topologies. The other partitions are formalized as follows:

Definition 2.4. [Owe11] A split s = (A|A^c) is a partition of {0,1, . . . , n} into two sets A, A^c such that |A| ≥ 2 and |A^c| ≥ 2. The set of all splits of the set {0,1, . . . , n} is denoted as S.

Note that (A|A^c) and (A^c|A) are considered to be the same split. Due to the symmetry of the union A ∪ A^c = A^c ∪ A they generate the same partition of {0,1, . . . , n}. We use the convention that the set which does not contain 0 is always denoted as A, i.e., 0 ∈ A^c. Even though a split is already uniquely defined by choosing A, it will help to also write down A^c when working with splits later on.

The most important thing about splits is that they represent interior edges of a tree without the need of interior nodes to define them. That makes it more convenient to represent trees and also to compare how similar trees are by just checking which common splits they possess.
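To make the correspondence between interior edges and splits concrete, the following minimal Python sketch removes each interior edge of a small tree with leaves 0, . . . , 4 and records the induced partition of the leaves. The edge-list representation, the function name splits_of_tree and the interior node names "a", "b", "c" are assumptions made for this illustration only; they are not part of the formal construction.

```python
from collections import defaultdict


def splits_of_tree(edges, leaves):
    """Return the splits induced by the interior edges of a tree.

    Each split (A|A^c) is stored as the frozenset A that does not contain 0.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    def leaves_on_side(start, blocked):
        # Leaves reachable from `start` once the edge {start, blocked} is removed.
        seen, stack, found = {start, blocked}, [start], set()
        while stack:
            node = stack.pop()
            if node in leaves:
                found.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        return found

    splits = set()
    for u, v in edges:
        if u in leaves or v in leaves:
            continue  # edges incident to a leaf only yield trivial partitions
        side = leaves_on_side(u, v)
        a = side if 0 not in side else set(leaves) - side
        splits.add(frozenset(a))
    return splits


# Hypothetical example tree: root 0, leaves 1..4, interior nodes "a", "b", "c".
leaves = {0, 1, 2, 3, 4}
edges = [(0, "a"), ("a", 4), ("a", "b"), ("b", 2), ("b", "c"), ("c", 1), ("c", 3)]
splits = splits_of_tree(edges, leaves)
print(sorted(sorted(s) for s in splits))  # [[1, 2, 3], [1, 3]]
```

For this tree the sketch yields the sides A = {1,3} and A = {1,2,3}, i.e., the splits ({1,3}|{0,2,4}) and ({1,2,3}|{0,4}).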

Example 2.1.2. Figure 2.5 illustrates how splits arise when removing an interior edge. It is important to note that this justifies using the words interior edge and split interchangeably, since there is a one-to-one correspondence between these notions. This fact is stressed again in Lemma 2.7.

As shown in the example, one receives splits by removing interior edges e of a tree. Doing this for all interior edges of the tree one receives a set of splits that, as we will see later, already completely defines the topology (V, E) of a tree (V, E, w). The notion of topology is frequently used in the same sense as ‘the combinatorial structure of the tree’ in the literature (see, e.g., [OP11]). Our definition here is a little more specific.

Figure 2.5: An interior edge of a metric n−tree induces a split on {0,1, . . . , n}. Removing the interior edge e¹ from a tree with leaf set {0,1,2,3,4,5} yields the leaf sets {1,3} and {0,2,4,5}, i.e., the split s₁ = ({1,3}|{0,2,4,5}); removing e² yields the leaf sets {1,2,3} and {0,4,5}, i.e., the split s₂ = ({1,2,3}|{0,4,5}).

Definition 2.5. For a tree T its induced set of splits, the topology of T, is defined as

Split(T) := {(A(e)|A(e)^c) : e is an interior edge of T} ⊆ S.

As mentioned earlier, we use splits as they offer a convenient and consistent way to describe the topology of a tree. Nonetheless, not all sets of splits S ⊂ S yield a tree topology, since some pairs of partitions cannot be present in the same tree:

Example 2.1.3. Assume n = 4 and consider two splits on {0,1,2,3,4}, s₁ = ({1,2}|{0,3,4}) and s₂ = ({1,3}|{0,2,4}).

There cannot exist a tree whose interior edges yield both of these splits, as there cannot be two edges in the same tree stating that {1,2} are direct neighbors and that {1,3} are direct neighbors. More precisely, {1,2} induces a subtree that contains 1 but does not contain 3. {1,3}, however, induces a subtree that contains 1 but does not contain 2, which is a contradiction.

The concept of compatibility describes whether two splits may be contained in the same tree and, even more, whether a set of splits yields a tree topology:

Definition 2.6. [BHV01] Two splits (A|A^c) and (B|B^c) are compatible if A ⊆ B, A ⊆ B^c, B ⊆ A or B^c ⊆ A. A set of splits S ⊂ S is called compatible if every pair s_i, s_j ∈ S is compatible.
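The compatibility condition can be checked mechanically. The following Python sketch is an illustration under the assumption that a split (A|A^c) is stored as the set A and that the leaf set {0, . . . , n} is passed explicitly; the function names compatible and is_compatible_set are hypothetical.

```python
from itertools import combinations


def compatible(a, b, leaves):
    """Check A ⊆ B, A ⊆ B^c, B ⊆ A or B^c ⊆ A for splits given by their sides A, B."""
    a, b = set(a), set(b)
    b_complement = set(leaves) - b
    return a <= b or a <= b_complement or b <= a or b_complement <= a


def is_compatible_set(splits, leaves):
    """A set of splits is compatible if every pair of its splits is compatible."""
    return all(compatible(a, b, leaves) for a, b in combinations(splits, 2))


leaves = range(5)  # {0, 1, 2, 3, 4}, i.e., n = 4
print(compatible({1, 2}, {1, 3}, leaves))     # False: the splits of Example 2.1.3
print(compatible({1, 3}, {1, 2, 3}, leaves))  # True: {1,3} is a subset of {1,2,3}
```

Applied to the two splits of Example 2.1.3, the check confirms that they are incompatible.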

The definition of splits and compatibilities allows for the following nice representation of tree topologies, which implies the one-to-one correspondence between splits and interior edges that we mentioned in Example 2.1.2.

Lemma 2.7 ([BHV01],[Vog07]). There exists a metric n−tree T with topology S = Split(T) if and only if all pairs of splits {s₁, s₂} ⊂ S are compatible.

As an example for this, the unweighted tree T = (V, E) or tree topology depicted in Figure 2.5 has two interior edges which result in the two splits ({1,3}|{0,2,4}) and ({1,2,3}|{0,4}). These, in turn, uniquely define the tree topology of T.

Lemma 2.7 in particular implies that two splits are compatible if and only if they can exist in the same tree.

Moreover, it justifies why we called the set of splits the topology of a tree in Definition 2.5 and why this fits the terminology chosen in Definition 2.1: The topology of a tree (in the sense of Definition 2.1) T = (V, E, w) is uniquely characterized by its set of splits Split(T).

An immediate consequence of the one-to-one correspondence and Lemma 2.2 is the following corollary.

Corollary 2.8. The maximal number of compatible splits is n−2.

Embedding into R^N+

Now that we have defined splits and compatibilities, we can make use of the representation of trees by split sets to embed the trees into a Euclidean space. To this end, we construct vectors with one component per split; easy combinatorics show that there are N := |S| = 2ⁿ−n−2 splits on {0, . . . , n}, since each split is determined by its side A ⊆ {1, . . . , n} with 2 ≤ |A| ≤ n−1. As the non-negative orthant R^N+ of R^N is defined inconsistently throughout the literature, we specify

R^N+ = {x ∈ R^N : x_i ≥ 0 ∀ i = 1, . . . , N}.
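The number of splits can be verified by direct enumeration. The following Python sketch lists all splits of {0, . . . , n} by their side A, which does not contain 0, and compares the count with 2ⁿ−n−2; the function name all_splits and the chosen enumeration order are assumptions of this illustration.

```python
from itertools import combinations


def all_splits(n):
    """Enumerate the splits (A|A^c) of {0, ..., n}, each stored as its side A without 0."""
    species = range(1, n + 1)
    splits = []
    # |A| >= 2, and |A^c| >= 2 forces |A| <= n - 1 because 0 always lies in A^c.
    for size in range(2, n):
        splits.extend(frozenset(c) for c in combinations(species, size))
    return splits


for n in range(3, 8):
    assert len(all_splits(n)) == 2**n - n - 2

print(len(all_splits(4)))  # 10
```

With this enumeration order the splits for n = 4 are listed by increasing size of A and lexicographically within each size.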

Definition 2.9. Let the splits in S be given in a fixed order according to their indices, say, S = {s₁, . . . , s_N}. Then we can describe every metric n−tree T by a vector t ∈ R^N+ by defining

t_i := w_e > 0 if s_i = (A(e)|A(e)^c) ∈ Split(T),
t_i := 0 if s_i ∉ Split(T), (2.1)

for i = 1, . . . , N.

Let T_n denote the set of all metric n−trees. Then the mapping

χ : T_n → R^N+, χ(T) = t,

with t defined as above is called (canonical) embedding of T_n.

Note that the term ‘embedding’ implicitly requires injectiveness of the mapping. That χ is injective is easy to see: When χ(T₁) = t = χ(T₂), then it follows that T₁ and T₂ have the same set of splits, the ones that correspond to the non-zero components of t. Moreover, they have the same weights t_j for these splits, so T₁ = T₂. We illustrate the embedding of T_n into R^N+ using metric 4−trees.

Example 2.1.4. For n = 4 the N = 2ⁿ−n−2 = 2⁴ −6 = 10 possible splits on {0,1,2,3,4} are

s₁ = ({1,2}|{0,3,4})
s₂ = ({1,3}|{0,2,4})
s₃ = ({1,4}|{0,2,3})
s₄ = ({2,3}|{0,1,4})
s₅ = ({2,4}|{0,1,3})
s₆ = ({3,4}|{0,1,2})
s₇ = ({1,2,3}|{0,4})
s₈ = ({1,2,4}|{0,3})
s₉ = ({1,3,4}|{0,2})
s₁₀ = ({2,3,4}|{0,1})

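Figure 2.6: Four trees of T_4. From left to right: a tree with splits s₁ and s₆ of weights 3.2 and 1.8, a tree with splits s₁ and s₇ of weights 2.8 and 2, a tree with the single split s₆ of weight 2.5, and the star tree.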

The four metric 4−trees depicted in Figure 2.6 are hence given as

t¹ = (3.2 0 0 0 0 1.8 0 0 0 0)^t,
t² = (2.8 0 0 0 0 0 2 0 0 0)^t,
t³ = (0 0 0 0 0 2.5 0 0 0 0)^t,
t⁴ = (0 0 0 0 0 0 0 0 0 0)^t.

The tree T⁴ corresponding to t⁴ = 0 ∈ R¹⁰+ is called the star tree and is depicted rightmost in Figure 2.6. Note that not every vector in R¹⁰+ has a pre-image. Consider for example t⁵ = (2 3 0 0 0 0 0 0 0 0)^t. This would translate to the split s₁ having a weight of 2 and s₂ having a weight of 3. But s₁ and s₂ are incompatible, so there exists no tree that can contain both of these splits. This implies that there is no pre-image for t⁵.

With this embedding, each metric n−tree T yields a vector t ∈ R^N+, but, as explained in the example, not every x ∈ R^N+ represents a metric n−tree, since the embedding is not surjective. We now describe which x ∈ R^N+ are representatives of metric n−trees, which follows directly from Lemma 2.7 and the definition of the embedding. Recall that it is crucial to first choose a fixed order of the splits in S and then to embed all trees into R^N+ with respect to this order.

Theorem 2.10. Let x ∈ R^N+. Then x represents a tree if and only if all s_i, s_j ∈ S with x_i ≠ 0 and x_j ≠ 0 are compatible splits.
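The embedding and the criterion of Theorem 2.10 can be illustrated for n = 4 with a short Python sketch. It assumes that a tree is given as a dictionary mapping the side A of each of its splits to the edge weight and that the splits are ordered as in Example 2.1.4; the names embed and represents_tree are hypothetical.

```python
from itertools import combinations

N_SPECIES = 4
LEAVES = set(range(N_SPECIES + 1))
# Fixed order s_1, ..., s_10 as in Example 2.1.4; each split is stored as its side A.
SPLITS = [frozenset(c) for k in (2, 3) for c in combinations(range(1, N_SPECIES + 1), k)]


def compatible(a, b):
    b_complement = LEAVES - b
    return a <= b or a <= b_complement or b <= a or b_complement <= a


def embed(tree):
    """chi(T): map a tree given as {split: weight} to its vector in R^N_+."""
    return [tree.get(s, 0.0) for s in SPLITS]


def represents_tree(x):
    """Criterion of Theorem 2.10: splits with non-zero entries are pairwise compatible."""
    support = [s for s, x_i in zip(SPLITS, x) if x_i != 0]
    return all(compatible(a, b) for a, b in combinations(support, 2))


t1 = embed({frozenset({1, 2}): 3.2, frozenset({3, 4}): 1.8})
x = [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]  # weights on the incompatible splits s_1 and s_2
print(t1)
print(represents_tree(t1), represents_tree(x))  # True False
```

The first vector reproduces t¹ from Example 2.1.4, whereas the vector (2 3 0 0 0 0 0 0 0 0)^t is rejected because s₁ and s₂ are incompatible.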

After having introduced the concept of splits and defined the embedding, we now present two possibilities to uniquely identify trees T = (V, E, w) without using the (V, E, w) notation. The first is to specify a tuple T = ((s₁, . . . , s_k), (w₁, . . . , w_k)) consisting of a vector of splits (s₁, . . . , s_k) that yields a compatible set of splits {s₁, . . . , s_k} ⊂ S and the corresponding positive edge weights w_i for s_i, i = 1, . . . , k, that are induced by the translation of an interior edge to a split. The second possibility is to identify a tree T ∈ T_n via the embedding, i.e., by its corresponding vector t ∈ R^N+ such that χ⁻¹(t) = T.

Since we have these two possibilities, we always reference trees T ∈ T_n, given as a tuple of splits and lengths, with a capital letter, whereas the respective lower-case letter t denotes the embedding of T into R^N+.

2.2 The Geodesic Distance

In the previous section we have introduced convenient ways of referencing trees and have seen that sets of pairwise compatible splits yield tree topologies. Having all possible sets S of compatible splits we thus get all tree topologies. Considering all possible weight vectors for these compatible split sets we are able to construct all metric n−trees. The tree space T_n is the collection of all these trees. Next, we define a metric on T_n to make it a metric space.

This is achieved in two steps: First, we define a distance between two trees that are contained in a common maximal orthant, a specific region of the tree space that we define in Definition 2.11. After we know how to measure distance in single orthants we extend the distance to the whole space by investigating how one traverses from one orthant into another.

We start by introducing orthants. The notion of orthants has already been mentioned in [BHV01]; here we define it in a slightly different manner, as regions of the tree space instead of Euclidean orthants.

Definition 2.11. Let S ⊆ S be a set of compatible splits in T_n. Then

O(S) := {T ∈ T_n : Split(T) ⊆ S}

is called the orthant of S. Moreover, for T ∈ T_n let O(T) = O(Split(T)).

In case that S is a set of pairwise compatible splits with maximum cardinality of n−2, we call O(S) a maximal orthant.

By definition O(S₁) ⊆ O(S₂) holds for S₁ ⊆ S₂, so the orthant of any compatible set of splits S with |S| < n−2 is contained in some maximal orthant O(S′), i.e., S ⊂ S′ and |S′| = n−2. Also note that a tree T whose set of splits Split(T) is not maximal belongs to several maximal orthants, which is illustrated in Figure 2.7. The most extreme example for this is the star tree 0 ∈ T_n: Since Split(0) = ∅ ⊂ S for all S ⊂ S, the star tree is contained in every orthant of T_n.

Before defining the distance in single orthants we refer to a very important result. The number of orthants of T_n is exponential in n:

Theorem 2.12 ([BHV01]). There exist (2n−3)!! = (2n−3)·(2n−5)·. . .·3 maximal orthants in T_n.
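To get a feeling for this growth, the double factorial can be evaluated for small n. The following Python sketch is purely illustrative; the function name num_maximal_orthants is an assumption.

```python
def num_maximal_orthants(n):
    """(2n-3)!! = (2n-3)(2n-5)...3, the number of maximal orthants of T_n."""
    result = 1
    for factor in range(2 * n - 3, 1, -2):
        result *= factor
    return result


for n in range(3, 9):
    print(n, num_maximal_orthants(n))
```

For n = 4 this gives the 15 binary tree topologies counted in Figures 2.3 and 2.4, and for n = 8 there are already 135135 maximal orthants.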

With the definition of orthants we now define the distance between trees that are contained in a common orthant. To this end, let O = O(S) be an orthant that contains T and X, i.e., Split(T) ⊂ S and Split(X) ⊂ S. [BHV01] define the distance of two such trees to be the Euclidean distance of their weight vectors. Now using the embedding into R^N+, i.e., using t, x ∈ R^N+ instead of T, X respectively, their distance is given by

d(X, T) := ‖t−x‖₂.

Figure 2.7: A degenerate tree topology with splits S = {s₁, s₂} (top) and the three tree topologies with maximal split sets S₁, S₂, S₃ that contain S.

Note that all components of t and x that do not correspond to splits in Split(T) or Split(X), respectively, are 0 by definition of the embedding χ (Definition 2.9).

Before we continue to define the distance in T_n for two trees that are not contained in a common orthant, we want to emphasize the local Euclidean structure that is implied by the definition of distance for two trees in an orthant.

Since d(T, X) = ‖t − x‖₂ for two trees T, X that are in the same orthant, a single maximal orthant with n−2 splits is isometric to the non-negative Euclidean orthant Rⁿ⁻²+:

Definition 2.13. Let a maximal orthant O = O(S) ⊂ T_n with S = {s_{i_1}, …, s_{i_{n−2}}} be given. Then the embedding ψ_O of the orthant O into Rⁿ⁻²+ is given by the map onto the length vectors, i.e., for a tree T = ((s_{i_1}, …, s_{i_{n−2}}), (w₁, …, w_{n−2})) ∈ O, define

\[ \psi_O(T) = \begin{pmatrix} w_1 \\ \vdots \\ w_{n-2} \end{pmatrix}, \]

and for non-binary trees T = ((s_{j_1}, …, s_{j_l}), (w₁, …, w_l)) ∈ O(S), i.e., when l < n−2 and Split(T) = {s_{j_1}, …, s_{j_l}} ⊊ S, define ψ_O(T) coordinate-wise,

\[ \psi_O(T)_m = \begin{cases} w_r & \text{if } s_{j_r} = s_{i_m} \text{ for some } r \in \{1, \dots, l\}, \\ 0 & \text{else,} \end{cases} \qquad m = 1, \dots, n-2. \]


Recall that we have also defined an embedding χ of T_n into R^N+ in Definition 2.9.

The orthant embedding ψ_O(T) is simply the projection of χ(T) onto the components of R^N+ that correspond to the splits of O.
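The relation between the two embeddings can be illustrated with a small sketch. It assumes a tree is stored as a dictionary from split names to positive weights and uses the split names s1, …, s10 of T₄ purely as labels; the functions `chi` and `psi` are hypothetical helpers mirroring Definitions 2.9 and 2.13, not code from the thesis.

```python
ALL_SPLITS = [f"s{i}" for i in range(1, 11)]   # the N = 10 splits of T_4

def chi(tree):
    """Embedding into R^N_+: one coordinate per split, 0 for absent splits."""
    return [tree.get(s, 0.0) for s in ALL_SPLITS]

def psi(tree, orthant_splits):
    """Orthant embedding: one coordinate per split of the maximal orthant."""
    missing = set(tree) - set(orthant_splits)
    if missing:
        raise ValueError(f"tree is not contained in the orthant: {sorted(missing)}")
    return [tree.get(s, 0.0) for s in orthant_splits]

O1 = ["s1", "s6"]                       # a maximal orthant of T_4
binary_tree = {"s1": 1.0, "s6": 4.0}    # both splits present
non_binary_tree = {"s6": 2.0}           # only one split, boundary of O1

print(psi(binary_tree, O1))             # weights in the order of O1
print(psi(non_binary_tree, O1))         # missing split gets coordinate 0

# psi is the projection of chi onto the coordinates belonging to O1.
indices = [ALL_SPLITS.index(s) for s in O1]
projected = [chi(non_binary_tree)[i] for i in indices]
print(projected == psi(non_binary_tree, O1))
```

Raising an error for trees outside the orthant is a design choice of this sketch; the definition itself only covers trees in O.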

Now that we have defined the distance between trees that are contained in a common orthant and understood the local Euclidean structure of the orthants, the second step is to connect the orthants and measure distance for trees that are not contained in a common orthant via paths through several connected orthants.

Assume now that T and X are not both contained in some orthant O. This implies that there exists some split s₁ in Split(X) that is incompatible with a split s₂ in Split(T), as we could otherwise take the compatible split set S̄ := Split(T) ∪ Split(X), and T, X ∈ O(S̄) would show that T and X are contained in a common orthant.

Here, it does not make sense to define the distance by the Euclidean norm ‖t − x‖₂ of the embedded vectors: for each point y = (1−λ)x + λt, λ ∈ (0,1), on the Euclidean shortest path, i.e., the line segment connecting t and x in R^N+, we would have y_i > 0 whenever one of x_i and t_i is greater than 0, for i ∈ {1, …, N}.

Hence the coordinates corresponding to the incompatible splits s₁ and s₂ are both positive for y, and Theorem 2.10 implies that y ∈ R^N+ is not a representative of a tree, i.e., y ∉ χ(T_n). This is easier to understand when looking at the embedding of T₃ into R³+ in Example 2.2.1.

Example 2.2.1. Consider T₃. A maximal split set in T₃ has n−2 = 3−2 = 1 splits and there exist N = 2³−3−2 = 3 splits in T₃, compare p. 13. We take two trees with a different topology:

T₁ = (({2,3}|{0,1}), (2))
T₂ = (({1,3}|{0,2}), (1))

Figure 2.8 depicts the embedding of T₃ into R³, which consists exactly of the non-negative axes that are drawn as solid lines. The Euclidean shortest path between T₁ and T₂ is the dotted line. Clearly, the line is not contained in χ(T₃), so it yields a distance between the trees, but not a corresponding path through tree space. In general, the Euclidean distance of the embedded vectors is the extrinsic distance when used for the subset of trees, as it is the distance that is derived from the ambient space R^N+, i.e., the space into which T_n is embedded. This distance does not reflect the structure of the space of trees, which is why the distance needs to be defined differently.
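The gap between the extrinsic distance and a path that stays inside tree space can be made concrete for the two trees of Example 2.2.1. The sketch below assumes the coordinate order ({2,3}|{0,1}), ({1,3}|{0,2}), ({1,2}|{0,3}) for R³; the helper `is_tree_T3` is an illustrative name and relies on the fact that a maximal split set in T₃ contains a single split, so at most one coordinate of an embedded tree can be positive.

```python
import math

# Embedded weight vectors of T1 and T2 from Example 2.2.1.
t1 = (2.0, 0.0, 0.0)
t2 = (0.0, 1.0, 0.0)

def is_tree_T3(y):
    """A point of R^3_+ represents a tree of T_3 iff at most one
    coordinate is positive (no two splits of T_3 are compatible)."""
    return all(c >= 0 for c in y) and sum(c > 0 for c in y) <= 1

# Extrinsic (ambient Euclidean) distance between the embedded trees.
extrinsic = math.dist(t1, t2)

# The midpoint of the straight segment has two positive coordinates,
# so it does not represent a tree.
midpoint = tuple((a + b) / 2 for a, b in zip(t1, t2))

# A path inside the embedded tree space has to run along the axes and
# therefore through the origin (the star tree).
through_star = math.hypot(*t1) + math.hypot(*t2)

print(extrinsic, is_tree_T3(midpoint), through_star)
```

Here the extrinsic distance is √5, while the path along the axes has length 2 + 1 = 3.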

What we actually want is a shortest path P from X to T in the tree space, i.e., a path P ⊆ T_n that traverses several orthants to connect X and T. To find such a path we need to be able to transform one tree topology into a different tree topology. This is done as follows: by definition, two trees T₁, T₂ with split sets Split(T₁), Split(T₂) are in the orthant O(S) when Split(T₁) ∪ Split(T₂) = S is a compatible set of splits.

That means that we can search for a path from X to T by removing some splits of X to get some tree T′ such that Split(T′) ⊂ Split(X), then adding some splits of T to T′, and repeating this procedure until we are left only with splits of T, hence the topology of T. It is important to remove and add splits in such a way that two successive trees on the way from X to T are contained in a common orthant, so that we can use the Euclidean distance for these pairs of trees. In this way, we have connected X and T by a path through tree space.

Figure 2.8: Embedding of T₃ into R³ and a Euclidean shortest path in R³ connecting two trees of T₃. The three axes correspond to the splits ({2,3}|{0,1}), ({1,3}|{0,2}) and ({1,2}|{0,3}).

Now that we have sketched the idea, we formalize the length of paths between two trees, as it has already been described in [BHV01].

Definition 2.14. Let (T₁, T₂, …, T_k) be a sequence of trees in T_n such that for each i = 1, …, k−1 there exists an orthant O_i with T_i, T_{i+1} ∈ O_i. Then define the path P = (T₁, …, T_k) ⊂ T_n to be the path that contains all trees T_i and, for each pair T_i, T_{i+1}, i = 1, …, k−1, all trees χ⁻¹(λt_i + (1−λ)t_{i+1}) for λ ∈ (0,1).

P = (T₁, T₂, …, T_k) is called a path from T₁ to T_k and its length is defined as

\[ L(T_1, T_2, \dots, T_k) := \sum_{i=1}^{k-1} \| t_{i+1} - t_i \|_2 . \]

It is crucial to understand that for each i = 1, …, k−1, T_i and T_{i+1} are contained in a common orthant O_i and the distance there is defined as the Euclidean distance of their weight vectors. Note that then y_λ = λt_i + (1−λ)t_{i+1} ∈ R^N+ always represents a tree, i.e., y_λ ∈ χ(T_n). This holds because only splits that are in the common orthant O_i may have non-zero components in the embedding. More precisely, y_λ ∈ χ(O_i) ⊂ χ(T_n). As the measure in an orthant is the Euclidean distance, it follows that the shortest path between their weight vectors is just the straight line, which is mapped back to T_n by χ⁻¹.

We illustrate the concept of orthants and paths in T₄. First, we state all maximal orthants and then we show an example of a path and a geodesic.

Example 2.2.2. In T₄ it can be easily checked that no more than two splits can be pairwise compatible. The maximal orthants in T₄ hence all contain two compatible splits. There exist 15 maximal orthants, namely

O₁ = {({1,2}|{0,3,4}), ({3,4}|{0,1,2})} = {s₁, s₆}
O₂ = {({1,3}|{0,2,4}), ({2,4}|{0,1,3})} = {s₂, s₅}
O₃ = {({1,4}|{0,2,3}), ({2,3}|{0,1,4})} = {s₃, s₄}
O₄ = {({1,2}|{0,3,4}), ({1,2,3}|{0,4})} = {s₁, s₇}
O₅ = {({1,2}|{0,3,4}), ({1,2,4}|{0,3})} = {s₁, s₈}
O₆ = {({1,3}|{0,2,4}), ({1,2,3}|{0,4})} = {s₂, s₇}
O₇ = {({1,3}|{0,2,4}), ({1,3,4}|{0,2})} = {s₂, s₉}
O₈ = {({1,4}|{0,2,3}), ({1,2,4}|{0,3})} = {s₃, s₈}
O₉ = {({1,4}|{0,2,3}), ({1,3,4}|{0,2})} = {s₃, s₉}
O₁₀ = {({2,3}|{0,1,4}), ({1,2,3}|{0,4})} = {s₄, s₇}
O₁₁ = {({2,3}|{0,1,4}), ({2,3,4}|{0,1})} = {s₄, s₁₀}
O₁₂ = {({2,4}|{0,1,3}), ({1,2,4}|{0,3})} = {s₅, s₈}
O₁₃ = {({2,4}|{0,1,3}), ({2,3,4}|{0,1})} = {s₅, s₁₀}
O₁₄ = {({3,4}|{0,1,2}), ({1,3,4}|{0,2})} = {s₆, s₉}
O₁₅ = {({3,4}|{0,1,2}), ({2,3,4}|{0,1})} = {s₆, s₁₀}

The structure of T₄ can be represented by a graph G = (𝒮, 𝒪) whose node set 𝒮 contains the ten splits {s₁, …, s₁₀} of T₄ and whose edges connect two splits s_i, s_j if and only if s_i and s_j are compatible. Consequently, the edges correspond to the maximal sets of compatible splits in T₄, i.e., to the maximal orthants of T₄, see Figure 2.9. Note that this compatibility graph is the Petersen graph.
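The compatibility graph can be rebuilt programmatically as a plausibility check. This sketch assumes, as above, that a split is represented by its side not containing leaf 0 and that compatibility means these sides are disjoint or nested; it verifies a few necessary properties of the Petersen graph (10 nodes, 15 edges, 3-regular, no triangles) rather than a full isomorphism.

```python
from itertools import combinations

# Splits s1..s10 of T_4 in the numbering of Example 2.2.2, each given
# by its side not containing leaf 0.
sides = [c for size in (2, 3) for c in combinations((1, 2, 3, 4), size)]
splits = {f"s{i}": frozenset(side) for i, side in enumerate(sides, start=1)}

def compatible(a, b):
    """Assumed test: the 0-free sides are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

# Edges of the compatibility graph = maximal orthants of T_4.
edges = [
    (p, q)
    for p, q in combinations(splits, 2)
    if compatible(splits[p], splits[q])
]

degree = {name: sum(name in e for e in edges) for name in splits}

# No three splits are pairwise compatible, i.e. the graph has no triangle.
triangles = [
    trio
    for trio in combinations(splits, 3)
    if all(compatible(splits[p], splits[q]) for p, q in combinations(trio, 2))
]

print(len(splits), len(edges), set(degree.values()), len(triangles))
```

The three orthants used in the path example, {s₂, s₉}, {s₆, s₉} and {s₁, s₆}, appear as edges of this graph.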

Now consider the trees X = ({s₂, s₉}, (2,2)) and T = ({s₁, s₆}, (1,4)). Figure 2.10 depicts two different paths from X to T:

P₁ = (X, T₂, T₃, T)
P₂ = (X, T₁, T)

Calculating the Euclidean distance between all pairs of successive trees on each path, we obtain

\[ L(P_1) = \|x - t_2\|_2 + \|t_2 - t_3\|_2 + \|t_3 - t\|_2 = \sqrt{2^2 + (2-1)^2} + \sqrt{2^2 + 1^2} + \sqrt{(4-2)^2 + 1^2} = 3\sqrt{5} \approx 6.71, \]

\[ L(P_2) = \|x - t_1\|_2 + \|t_1 - t\|_2 = \|x\|_2 + \|t\|_2 = \sqrt{8} + \sqrt{17} \approx 6.95. \]
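The two path lengths can be recomputed numerically from Definition 2.14. The sketch below restricts the embedding to the four coordinates (s₁, s₂, s₆, s₉) that occur in the example. The intermediate trees are assumptions inferred from the terms under the square roots: T₂ is taken to have the single split s₉ with weight 1, T₃ the single split s₆ with weight 2, and T₁ is taken to be the star tree; `path_length` is an illustrative helper name.

```python
import math

# Coordinates ordered as (s1, s2, s6, s9); all other split weights are 0.
x = (0.0, 2.0, 0.0, 2.0)      # X = ({s2, s9}, (2, 2))
t = (1.0, 0.0, 4.0, 0.0)      # T = ({s1, s6}, (1, 4))
t2 = (0.0, 0.0, 0.0, 1.0)     # assumed: only s9, weight 1
t3 = (0.0, 0.0, 2.0, 0.0)     # assumed: only s6, weight 2
star = (0.0, 0.0, 0.0, 0.0)   # assumed: T1 is the star tree

def path_length(*points):
    """Sum of Euclidean distances between successive weight vectors."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

length_p1 = path_length(x, t2, t3, t)   # 3 * sqrt(5), about 6.71
length_p2 = path_length(x, star, t)     # sqrt(8) + sqrt(17), about 6.95

print(round(length_p1, 2), round(length_p2, 2))
```

Each pair of successive trees lies in a common orthant (O₇, O₁₄ and O₁ for P₁), so the Euclidean distance may be used leg by leg.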


Figure 2.9: Compatibility graph for T₄. The three colored orthants are used in the example for the paths below.

Figure 2.10: Example of two paths from X to T in a diagram of a part of T₄, showing the orthants O₇, O₁₄ and O₁ with the splits s₂, s₉, s₆ and s₁. The shortest path connecting X and T is the solid path.

In Definition 2.15, the distance between two trees X and T is now simply defined as the length of the geodesic, i.e., a path in tree space from X to T with minimal length. Before that, we provide a little bit of background knowledge about the term geodesic in mathematics and how it relates to tree space.

The notion of geodesics originates from differential geometry, where it generalizes the notion of straight lines to curved spaces such as Riemannian manifolds. Geodesics have later been generalized to metric spaces, thereby introducing the notion of a geodesic metric space. As a matter of fact, Theorem 2.16 will show that T_n is a geodesic metric space, which is why this shortest path is called a geodesic.

Moreover, as geodesics were introduced as a generalization of straight lines, it is interesting to note that the geodesics here are basically concatenations of straight lines in single orthants and that the tree space actually has global non-positive curvature (see [BHV01]).

Definition 2.15. [BHV01] Let T, X ∈ T_n. The geodesic distance between X and T is given by

d(X, T) = inf{L(T₁, …, T_k) : (T₁, …, T_k) is a path from X to T in T_n}.

A path P = (T₁, …, T_k) from X to T, X = T₁, T = T_k, that attains this minimal distance, i.e., L(P) = d(X, T), is called a geodesic.

Plugging in the length of a path from Definition 2.14 we get

\[ d(X, T) = \inf \left\{ \sum_{i=1}^{k-1} \| t_{i+1} - t_i \|_2 : (T_1, \dots, T_k) \text{ is a path from } X \text{ to } T \text{ in } \mathcal{T}_n \right\}. \]

Naturally, every path from X to T yields an upper bound on the geodesic distance.

A special path that connects any two trees in T_n is the path through the star tree 0 ∈ T_n, which is called the cone path. Formally, the cone path for X, T ∈ T_n is (X, 0, T) and its length is ‖t‖₂ + ‖x‖₂.

Due to the triangle inequality of ‖·‖₂, which we may apply in every orthant O, we can replace a sequence

(T₁, …, T_i, T_{i+1}, …, T_j, …, T_l)

of trees where T_i, T_{i+1}, …, T_j are contained in the same orthant O by

(T₁, …, T_i, T_j, …, T_l)

and will thereby not increase the length of the path.

By the definition of geodesics it is not obvious that the infimum is always attained.

The existence of a unique geodesic that attains this infimum connecting any pair of trees in T_n has been shown in [BHV01] by applying a result of [Gro87] concerning metric spaces of global non-positive curvature, also called CAT(0) spaces. As we do not need curvature in the following, but only use it to obtain existence and uniqueness of geodesics, we refer the interested reader to the appendix, Chapter 7.

Theorem 2.16 ([BHV01]). T_n is a CAT(0) space. In particular, the geodesic distance d is a metric on T_n.

For CAT(0) spaces, existence and uniqueness of geodesics are guaranteed:


Theorem 2.17 ([BHV01]). Let T, X ∈ T_n. There exists a unique geodesic Γ(X, T) from X to T.

As a final note concerning curvature, we remark that a Hadamard space is defined as a complete CAT(0) space. Since T_n is complete, this makes it a Hadamard space, as mentioned in Chapter 7; compare [Bac14b] for more details on the subject.

The fact that T_n is a Hadamard space will be of use later.

2.2.1 Support and Parametrization of a Geodesic

Now that existence and uniqueness have been established, we want to actually find the geodesic between X and T, which we denote by Γ(X, T).

The key to finding and parametrizing the geodesic from a tree X to a tree T is the concept of the support sequence, or the support of a geodesic, which describes in which order edges of the tree X are removed and edges of T are added. With this concept it is possible to apply combinatorial methods to find the geodesic.

[BHV01] prove that splits that are contained in both X and T are contained in every tree along the geodesic. Hence, we do not need to consider these splits when searching for the topologies of trees along the geodesic. The same result holds for splits that are contained in one tree and are compatible with all splits of the other tree. These two types of splits are merged in the following definition.

Definition 2.18. Let T, X ∈ T_n be two trees. A split s is called double compatible if s is compatible with all splits in Split(X) as well as with all splits in Split(T).

Let C =C(T, X) denote the set of double compatible splits for two trees.

Next we define a partition of the split sets that describes in which order we remove splits from X and add splits from T on the path from X to T.

Definition 2.19. [OP11] With the notation introduced above, a support sequence from X ∈ T_n to T ∈ T_n is a pair (A, B) = ((A₁, …, A_k), (B₁, …, B_k)) with A_i ⊂ Split(X), B_i ⊂ Split(T) such that

A₁ ∪ … ∪ A_k = Split(X) \ C
B₁ ∪ … ∪ B_k = Split(T) \ C

are partitions, i.e., disjoint unions.

In order to define the support of a geodesic we introduce the following notation. Let T = ((s₁, …, s_l), (w₁, …, w_l)) ∈ T_n. Then for a subset A ⊆ Split(T) define

\[ \|A\|_2 = \sqrt{\sum_{s_i \in A} w_{s_i}^2} . \]

Moreover, we also write w_s^T, w_s^X to indicate the weight of a split s ∈ Split(T) ∩ Split(X) in tree T or tree X, respectively.

Definition 2.20. [OP11] Let T, X ∈ T_n and let A = (A₁, …, A_k), B = (B₁, …, B_k) be a support sequence from X to T. Then (A, B) is called a support of the geodesic from X to T if it satisfies the following three properties:

(P1) For each i > j, A_i and B_j are compatible sets of splits.

(P2) ‖A₁‖/‖B₁‖ ≤ … ≤ ‖A_k‖/‖B_k‖.

(P3) For i = 1, …, k there is no non-trivial partition (C₁, C₂) of A_i and (D₁, D₂) of B_i such that C₂ is compatible with D₁ and ‖C₁‖/‖D₁‖ < ‖C₂‖/‖D₂‖.

In the following we describe what the three properties (P1),(P2) and (P3) actually mean.

From the definition of the geodesic distance it can be seen that the geodesic consists of “straight lines”, also called legs, one for each orthant that it traverses. This follows from its property of being the shortest path connecting two trees together with the definition of distance in the orthants, being the Euclidean distance.

Now, knowing that the geodesic is, roughly speaking, a piecewise Euclidean path that may only bend when traversing a boundary of an orthant, we can use this knowledge to interpret (P1),(P2) and (P3), where we stick to the notation of Definition 2.20.

A support sequence as in Definition 2.19 generally defines a sequence of orthants that a path traverses. In the case of the geodesic from X to T, the first orthant on the geodesic is the orthant of X, O(X), so the trees on the straight line, the leg, in this orthant have the splits Split(X). To get to the orthant of T we have to remove the splits in Split(X) \ C and add splits of T; that is the general idea of the partitions of Split(X) \ C into A₁, …, A_k and of Split(T) \ C into B₁, …, B_k. We want to successively delete splits of X, starting with A₁, then A₂ and so on, and add splits of T, starting with B₁, then B₂ and so forth.

(P1) ensures that this procedure of removing and adding splits actually yields an orthant of T_n, i.e., that B₁ ∪ … ∪ B_i ∪ A_{i+1} ∪ … ∪ A_k is a compatible set of splits.

Now, while (P1) cares about “feasibility”, (P2) and (P3) together ensure that the split weights along the path through these orthants are chosen optimally.
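To make the three properties tangible, the following sketch checks them by brute force for candidate support sequences between the trees X = ({s₂, s₉}, (2,2)) and T = ({s₁, s₆}, (1,4)) of Example 2.2.2. It is an illustration of Definition 2.20 only, not the algorithm of [OP11]. It assumes that splits are given by their side not containing leaf 0, that compatibility means these sides are disjoint or nested, and that "non-trivial partition" means all four parts are non-empty; all function names are hypothetical.

```python
import math
from itertools import combinations

def compatible(a, b):
    """Assumed test: the 0-free sides of two splits are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def cross_compatible(block_a, block_b):
    """Every split of block_a is compatible with every split of block_b."""
    return all(compatible(a, b) for a in block_a for b in block_b)

def norm(block, weights):
    """Norm of a set of splits: root of the sum of squared weights."""
    return math.sqrt(sum(weights[s] ** 2 for s in block))

def satisfies_p1(A, B):
    """(P1): A_i and B_j are compatible for all i > j."""
    return all(cross_compatible(A[i], B[j]) for i in range(len(A)) for j in range(i))

def satisfies_p2(A, B, w_x, w_t):
    """(P2): the ratios ||A_i|| / ||B_i|| are non-decreasing in i."""
    ratios = [norm(a, w_x) / norm(b, w_t) for a, b in zip(A, B)]
    return all(r <= s for r, s in zip(ratios, ratios[1:]))

def proper_subsets(block):
    """All non-empty proper subsets of a block."""
    items = list(block)
    for size in range(1, len(items)):
        for chosen in combinations(items, size):
            yield set(chosen)

def satisfies_p3(A, B, w_x, w_t):
    """(P3): no block pair can be split further with a smaller first ratio."""
    for a_i, b_i in zip(A, B):
        for c1 in proper_subsets(a_i):
            c2 = set(a_i) - c1
            for d1 in proper_subsets(b_i):
                d2 = set(b_i) - d1
                better = norm(c1, w_x) / norm(d1, w_t) < norm(c2, w_x) / norm(d2, w_t)
                if cross_compatible(c2, d1) and better:
                    return False
    return True

s1, s2, s6, s9 = map(frozenset, ({1, 2}, {1, 3}, {3, 4}, {1, 3, 4}))
w_x = {s2: 2.0, s9: 2.0}   # split weights of X
w_t = {s1: 1.0, s6: 4.0}   # split weights of T

candidates = {
    "two_step": ([{s2}, {s9}], [{s6}, {s1}]),
    "swapped": ([{s9}, {s2}], [{s1}, {s6}]),
    "cone": ([{s2, s9}], [{s1, s6}]),
}
results = {
    name: (satisfies_p1(A, B), satisfies_p2(A, B, w_x, w_t), satisfies_p3(A, B, w_x, w_t))
    for name, (A, B) in candidates.items()
}
for name, checks in results.items():
    print(name, checks)
```

In this example only the first candidate passes all three checks: the swapped order violates (P1) and (P2), and the single-block candidate, which corresponds to the cone path, is ruled out by (P3).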

Theorem 2.21. [OP11] Let T, X ∈ T_n, (A,B) be the support for the geodesic from X to T and C be the set of double compatible splits. Define the path P(A,B) = {γ(λ) : λ ∈ [0,1]} via a parametrization γ : [0,1] → T_n that consists of several
