Facility Location in the Phylogenetic Tree Space

Dissertation

for the award of the doctoral degree in mathematics and natural sciences

"Doctor rerum naturalium"

of the Georg-August-Universität Göttingen

within the doctoral program

"PhD School of Mathematical Sciences" (SMS) of the Georg-August University School of Science (GAUSS)

submitted by

Marco Botte

from Fritzlar

Göttingen, 2019

Thesis Committee

Prof. Dr. Anita Schöbel, since 1 January 2019 Fachbereich Mathematik, Technische Universität Kaiserslautern, previously Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Prof. Dr. Stephan Huckemann, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen

Members of the Examination Board

Referee: Prof. Dr. Anita Schöbel, since 1 January 2019 Fachbereich Mathematik, Technische Universität Kaiserslautern, previously Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Co-referee: Prof. Dr. Stephan Huckemann, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen

Further members of the Examination Board:

Jun.-Prof. Dr. Anja Fischer, Juniorprofessur Management Science, Technische Universität Dortmund

Prof. Dr. Preda Mihăilescu, Mathematisches Institut, Georg-August-Universität Göttingen

Prof. Dr. Gerlind Plonka-Hoch, Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Prof. Dr. Max Wardetzky, Institut für Numerische und Angewandte Mathematik, Georg-August-Universität Göttingen

Date of the oral examination: 28 February 2019


Acknowledgements

First of all I want to thank my supervisor Prof. Anita Schöbel for very fruitful discussions, the constant support throughout the years and for giving me the possibility to work on many different interesting fields of optimization. I also want to thank Prof. Stephan Huckemann for the interesting joint collaboration with international researchers and for introducing interesting points of view on aspects of phylogenetic trees.

Moreover, I thank all members of the work group for the pleasant atmosphere we have in the institute. My special thanks go to Julius for the congenial time in the office and also for proofreading parts of this work.

Last but not least, I wholeheartedly thank my family and Maria for their encouragement and support, especially for always believing in me.


Contents

1 Introduction

2 Construction and Properties of the Tree Space
2.1 Splits, Compatibility and Orthants
2.2 The Geodesic Distance
2.2.1 Support and Parametrization of a Geodesic
2.3 Modeling the Species Tree Problem in Tree Space

3 Location Theory in Tree Space
3.1 Introduction to Location Theory
3.2 Location Problem Models in Tree Space
3.2.1 Split Sets of Optimal Solutions

4 Solving Tree Space Location Problems by Transformations to Euclidean Location Problems
4.1 Location Problems in T3
4.2 Adjacent Orthants
4.3 Completely Incompatible Orthants
4.4 Location Problems with a Fixed Gate Point at the Origin
4.4.1 Median Problem
4.4.2 Center Problem
4.4.3 Fréchet Problem

5 Balance Point Algorithms for the Median Problem in Tree Space
5.1 Bounds for the Median Problem
5.2 The Global Balance Point Heuristic
5.2.1 The Balance Point Algorithm
5.2.2 Preprocessing
5.2.3 Pseudocode Formulation of the Heuristic
5.2.4 Experiments
5.3 Local Convergence of the BPA
5.3.1 The Block-wise Coordinate Descent Method
5.3.2 Convergence to a Stationary Point
5.3.3 Optimality of Stationary Points and Balance Points
5.4 Restrictions of the BPA

6 Real Data Example
6.1 Data Set
6.2 Experiments
6.2.1 PPA Experiments
6.2.2 BPA Experiments
6.3 Summary of Results and Comparison to other Algorithms

7 Conclusion


1 Introduction

Phylogenetic trees are diagrams with a tree structure that depict the relations of evolutionary history between a certain set of existing species to be investigated.

Ever since Ernst Haeckel coined the term phylogeny in 1869 as ‘genesis and evolution of a phylum’, where ‘genesis’ translates as ‘origin’ and ‘phylum’ as ‘race’, phylogenetic trees have been a part of biologists’ attempts to classify existing species according to their evolutionary history; see Figure 1.1 for an example. This classification is based on shared characteristics and genes.

Figure 1.1: Tree of life according to Ernst Haeckel, cf. [Hae].

In contrast to Figure 1.1, a phylogenetic tree nowadays always comes with a topology, which is the tree structure as depicted, but also with positive edge weights, describing the amount of mutations of genes between two nodes of the tree. Phylogenetics is an active field of research because the evolutionary history of a given set of species is often still not agreed upon by biologists. Hence, the so-called species tree, which is the phylogenetic tree depicting the actual evolution of a given set of species, is usually unknown and is subject to study.

The Species Tree Problem

As there is no straightforward way to obtain the species tree, many methods have been proposed that hypothesize what the species tree could look like. All of these approaches use sets of phylogenetic trees as input in order to give a hypothesis for the species tree, thereby trying to incorporate shared features of the given trees into the species tree.

The phylogenetic trees that are used as data for such methods are called gene trees and are obtained as follows: Specific genes that the investigated species share are tagged with a biological marker. To this end one uses gene sequences that are empirically known to be good guesses of the species tree. Then the gene sequences are aligned to be as ‘parsimonious’ as possible. There exist many different approaches to infer trees from these aligned gene sequences, e.g., Bayesian methods or bootstrap procedures that were introduced to the field of phylogenetics by [Fel85], marking a milestone in phylogenetic inference. The result of such a method is a phylogenetic tree depicting the relation between the investigated species based on the information of a single gene. These trees are called gene trees.

Applying this method, one often obtains different trees for different genes, which shows that it is not possible to directly infer the species tree from a single gene tree.

[Mad97] show a specific case where the topology, i.e., the tree structure, of the gene tree does not coincide with the topology of the species tree and [DR06] even show that this need not be the case for the “most likely gene tree” as well. Nevertheless, since it is impossible to directly determine the species tree, one tries to infer as much information from these gene trees as possible in order to develop reasonable hypotheses for the species tree.

Having performed this acquisition of gene trees for several genes, one receives a sample set of phylogenetic gene trees. Inferring the species tree from such a set of gene trees is what is often referred to as the species tree problem, or ‘gene tree/species tree problem’, as in [PC97]. One of the earliest works in this field is [AI72], where Adams develops the notion of a consensus tree that is supposed to convey the information of the gene trees as well as possible in order to get a reasonable candidate for the species tree. After consensus trees had been introduced, many different methods to find the species tree were proposed, but they are mostly educated guesses or heuristic rules. Popular methods include the majority rule consensus tree [MM81] and strict consensus methods [MMN83], which are probably widely used due to their simple computation rules. For an overview of consensus trees and related methods see [Bry03]. Naturally, these heuristic methods are limited in describing the actual species tree, which is also discussed in the literature, see for example [BDS91] and [Nel93] as a reply. Thus, there was an urgent need for more sophisticated techniques, which have been developed since then in the hope of better inferring the species tree from a given set of gene trees.

A lot of recent research involves mathematical, mostly statistical, methods to find the species tree. A large part of this mathematical research is based on [BHV01], since it provides the framework for a new systematic approach to phylogenetic tree related problems. Billera, Holmes and Vogtmann were able to define the metric space Tn, which contains all possible phylogenetic trees for a fixed set of existing species {1, . . . , n}. The elements of {1, . . . , n} are the leaves of the trees in Tn. In the following we will refer to the space Tn as tree space. Tn is a metric space of ‘global non-positive curvature’. This directly implies two important features for mathematical modeling: Firstly, it is a metric space containing all possible phylogenetic trees on n leaves, i.e., it is a suitable model space for the species tree problem, since the given data and the solution to the problem are contained in this space. Secondly, there does not only exist a distance measure between any given pair T1, T2 of trees in Tn, which is required by many algorithms, but there even exists a unique shortest path from T1 to T2 whose length equals the distance d(T1, T2), where d is the metric on Tn. These are necessary properties to formulate and tackle the species tree problem in Tn.

In this model setting, natural candidates for the species tree are some sort of ‘average’ or ‘centroid’ trees. The question of finding a centroid from a set of sample trees has already been posed in [BHV01] and several possible notions of a centroid of a sample set in a metric space with global non-positive curvature are mentioned.

Unfortunately, all given notions of center points actually require computing the geodesic between two trees, or at least the distance of two trees. So even though [BHV01] show that the above-mentioned shortest path between two trees, called geodesic, exists and is unique, they were not able to propose a method to efficiently calculate the distance between two points in Tn. This shortcoming strongly encouraged follow-up research in this direction. The computation of the distance and the geodesics in Tn has been extensively studied in [Owe08, Owe11], leading to the milestone of a polynomial time algorithm calculating the distance and the parametrization of the geodesic for two given trees, see [OP11].

Now, having the metric space Tn and the possibility to efficiently calculate geodesics at hand, one is able to use several of the concepts of centroids that [BHV01] introduced to get new hypotheses on the species tree. For example, [MOP15] translate the algorithm of Sturm that was already mentioned by [BHV01] into the tree space setting and apply it to several data sets. As it turns out, though, the computation of the center points remains an open problem in practice. Sturm’s algorithm theoretically converges to the mean, but for high-dimensional and large data sets the method did not reach a pre-specified termination condition before its calculation time was exceeded. So there is still need for improved procedures, as the existing methods converge slowly (Sturm’s algorithm is sublinear, see [MOP15]) and are not tractable for larger instance sizes or even yield the wrong species tree [MOP15].

As an attempt to make the computation of the mean more efficient, [MOP15] further investigated the structure of Tn and developed special algorithms, e.g., gradient descent methods that are based on interior point or penalty methods. Nevertheless, there are still practical and theoretical problems with these approaches, since analytical properties, such as optimality criteria and differentiability, do not exist on specific subsets of Tn. An up-to-date review on Fréchet means in tree space is presented in [BO17], both pointing out the strengths and weaknesses of the concept and giving concrete numeric studies.

Another mathematical line of research that recently evolved is to find a different model space for the phylogenetic problems. To this end, the so-called space of ultrametric trees, equipped with the tropical metric, is investigated in [YZZ17, LMY18].

This research needs to be carried out in depth in order to evaluate which advantages and disadvantages this model space has.

Contribution and Structure of the Thesis

As we have seen, there exists a lot of research in the field of phylogenetics, including numerous approaches to finding the species tree. The latest mathematical studies mainly concentrate on finding new model spaces or different and faster ways to compute the Fréchet mean to do statistics on the model spaces, but it can neither be anticipated what the outcome of these approaches will be nor whether their results will solve the species tree problem, at least to biologists’ satisfaction.

In this thesis we introduce a different point of view on the species tree problem by incorporating “Location Theory”: We interpret the set of gene trees as facilities, as usual for location problems. Facility Location problems are optimization problems that are motivated by real-world and economic problems and are usually investigated in Euclidean space. So on the one hand this work extends the field of research of Location Theory beyond its usual scope. On the other hand, we aim at exploiting the specific local structure of Tn to obtain some location problems which may be solved within a Euclidean setting. By doing so we can then make use of known algorithms and results from Location Theory in Euclidean space in order to try solving the original location problems in tree space, thereby connecting these two fields of research.

The thesis is structured as follows (see also Figure 1.2). In Chapter 2 we introduce the phylogenetic tree space Tn as defined in [BHV01]. In Section 2.2 we give a detailed description of the geodesic distance and the parametrization of geodesics in tree space, which are the key tools to work with location problems in Tn.

In Section 2.3 we motivate our location theoretical point of view on the species tree problem in more detail before we build the bridge from the species tree problem to Location Theory in Chapter 3. We introduce three location problems in tree space that yield different notions of ‘centroids’ of a given set of trees as hypotheses for the species tree. We also formulate general results regarding optimal solutions for these three problems in Subsection 3.2.1.

Chapter 4 illustrates first approaches to solve these tree space location problems by providing reformulations and solution methods for special cases.

The central problem of the thesis is tackled in Chapter 5. The goal is to find a median of a given set of points in tree space and we present an algorithm to determine a median that is based on a local Euclidean improvement strategy. This algorithm is called the Balance Point Algorithm; it is a heuristic, but we prove its convergence under certain assumptions in Section 5.3. As the Balance Point Algorithm is a local improvement procedure, it is necessary to find a good neighborhood in which the algorithm is started in order to obtain good results.

In order to find a good neighborhood, we develop bounds for specific subsets of Tn in Section 5.1 to determine auspicious subsets of Tn, where the Balance Point Algorithm may be applied. In Section 5.2 we formulate a heuristic that determines such auspicious subsets in a preprocessing procedure and then applies the Balance Point Algorithm for the remaining subsets.

Finally, we illustrate how the Balance Point Algorithm works on a real data set in Chapter 6 and discuss its results in comparison to other methods before the thesis is summarized in Chapter 7.

Figure 1.2: Structure of the thesis. The diagram connects Chapter 1 (Species Tree Problem), Chapter 2 (Phylogenetic tree space Tn), Chapter 3 (Location Problems in Tn), Chapter 4 (Solutions via Transformations), Chapter 5 (BPA for median problems in Tn), Chapter 6 (Real data example) and Chapter 7 (Conclusion).

2 Construction and Properties of the Tree Space

As mentioned in the introduction, we will refer to the space of phylogenetic trees Tn as tree space. The construction of this space is given in a meticulous way to ensure that all subtleties of the tree space and important tools required for proofs are introduced carefully. Note that, since this is an introduction to the tree space, almost all definitions and results are known from the literature and are always indicated with the proper citation.

2.1 Splits, Compatibility and Orthants

We start by giving a concise mathematical definition for the elements of $\mathcal{T}_n$, the phylogenetic trees. In the following, let $n$ species $\{1, 2, \dots, n\}$ be given whose evolution is to be investigated. These species are leaves of the phylogenetic tree describing their evolution, as they are present today and have no descendants. There is one additional leaf in a phylogenetic tree, which is the root node and is indexed by 0.

The root node models a known common ancestor of these species.

Due to the problem’s nature we are only interested in trees which do not contain any node with degree two. If there exists a node with degree two, it can be removed and its two incident edges may be merged into one, as this node does not convey any information on a bifurcation or multifurcation of species. The edges of a tree can be distinguished into edges which are incident with a leaf and edges which are not. The latter are called interior edges. As already mentioned, phylogenetic trees also carry information on how much mutation took place between two nodes, i.e., species in the tree, which is modeled by positive edge weights for all interior edges.

Now, with these properties of a phylogenetic tree in mind, we can formally define our desired notion of trees.

Definition 2.1. [BHV01] A weighted graph $T = (V, E, w)$ is called metric $n$-tree if

• $T = (V, E)$ is a tree,

• $\{0, 1, \dots, n\} \subseteq V$ are the only leaves of $V$ and $0 \in V$ is the root node,

• $T$ does not have any node of degree two,

• $T$ satisfies $w_e > 0$ for all interior edges $e$ of $E$, i.e., $w \in \mathbb{R}^{|E|}_{>0}$.

In the following, the elements of the tree space, the metric $n$-trees, are usually referred to as trees. If we want to reference a tree without edge weights, we emphasize this by speaking of the topology of the tree or explicitly write $T = (V, E)$, omitting $w$.

The next result follows immediately from the definition of interior edges and the property of metric n−trees not having nodes of degree two.

Lemma 2.2. [BHV01] The maximal number of interior edges for metric $n$-trees is $n - 2$.

This maximal number of edges is attained when every non-leaf node has degree three. Then, regarding the tree from the root 0 at the top to the leaves $\{1, \dots, n\}$ at the bottom, each edge from the top splits up into two different edges; this motivates the following name.

Definition 2.3. [BHV01] A tree with n−2 interior edges is called binary tree.

Example 2.1.1. The tree topologies for n = 4:

For n = 4 species, all possible topologies of such trees are depicted in Figures 2.1 to 2.4. Figure 2.1 contains the star tree as well as four different topologies which all describe that three species have one common ancestor, and one species has developed separately.

Figure 2.1: Degenerate tree topologies for trees with 4 species (the star tree, followed by the four topologies in which one of the species 1, 2, 3, 4 is separated from the other three).

Figure 2.2 shows the cases of a degenerate tree topology where one pair of species shares a common ancestor that developed after the split of the other two species.

Figure 2.2: Degenerate tree topologies for trees with 4 species (six topologies, one for each pair of species sharing a common ancestor).

Figure 2.3 shows the three binary tree topologies in which the species have developed in pairs:

Figure 2.3: 3 of the 15 = (2·4−3)!! = 5·3·1 binary tree topologies for trees with 4 species (pairings {1,2} and {3,4}; {1,3} and {2,4}; {1,4} and {2,3}).

Figure 2.4 shows the remaining twelve binary tree topologies where one species has developed separately, while three of them have one common ancestor from which another species developed separately.

Figure 2.4: 12 of the 15 = (2·4−3)!! = 5·3·1 binary tree topologies for trees with 4 species.

This enumeration of trees is in fact exhaustive, i.e., all different tree topologies for n = 4 have been depicted up to different planar embeddings of the graphs. Handling phylogenetic trees via tree topologies (V, E) is very cumbersome, which is the reason for the introduction of splits of trees in the next section.

Splits

The figures above have shown that there are many different topologies for trees. Our goal now is to describe tree topologies in an easy way, as sets of splits. Note that the number of interior nodes (non-leaf nodes) and the number of edges may differ from tree to tree. Thus, a representation using nodes or edges does not seem appropriate.

Instead, trees will be described using partitions of its set of leaves that are induced by the interior edges, which are edges of the tree that are not incident to any leaf.

Partitions are induced when removing edges, which is a well-known observation from graph theory:

A tree $T = (V, E)$ with vertices $V$ and edges $E$ gets disconnected and divided into two trees if we remove any of its edges $e \in E$. More precisely, when removing any edge $e \in E$ the tree $T$ is split into two connected components $(V_1(e), E_1(e))$ and $(V_2(e), E_2(e))$ with $V_1(e) \cup V_2(e) = V$ and $E_1(e) \cup \{e\} \cup E_2(e) = E$.

In particular, this yields a partition of the leaves $\{0, 1, \dots, n\} \subseteq V$ of the tree into two disjoint sets $A(e) = V_1(e) \cap \{0, 1, \dots, n\}$ and $A(e)^c = V_2(e) \cap \{0, 1, \dots, n\} = \{0, 1, \dots, n\} \setminus A(e)$. It is easily observed that removing an edge $e$ incident to a leaf $i \in \{0, 1, \dots, n\}$ of the tree results in the partition $\{i\}$ and $\{0, 1, \dots, n\} \setminus \{i\}$, and that these partitions can be obtained for any tree $T \in \mathcal{T}_n$. We can hence neglect these trivial partitions when describing tree topologies. The other partitions are formalized as follows:

Definition 2.4. [Owe11] A split $s = (A|A^c)$ is a partition of $\{0, 1, \dots, n\}$ into two sets $A, A^c$ such that $|A| \geq 2$ and $|A^c| \geq 2$. The set of all splits of the set $\{0, 1, \dots, n\}$ is denoted as $\mathcal{S}$.

Note that $(A|A^c)$ and $(A^c|A)$ are considered to be the same split. Due to the symmetry of the union $A \cup A^c = A^c \cup A$ they generate the same partition of $\{0, 1, \dots, n\}$.

We use the convention that the set which does not contain 0 is always denoted as $A$, i.e., $0 \in A^c$. Even though a split is already uniquely defined by choosing $A$, it will help to also write down $A^c$ when working with splits later on.

The most important thing about splits is that they represent interior edges of a tree without the need of interior nodes to define them. That makes it more convenient to represent trees and also to compare how similar trees are by just checking which common splits they possess.

Example 2.1.2. Figure 2.5 illustrates how splits arise when removing an interior edge. It is important to note that this justifies using the words interior edge and split interchangeably, since there is a one-to-one correspondence between these notions.

This fact is stressed again in Lemma 2.7.

Figure 2.5: An interior edge of a metric $n$-tree induces a split on $\{0, 1, \dots, n\}$. Starting from the leaf set {0,1,2,3,4,5}, removing the interior edge e1 yields the leaf sets {1,3} and {0,2,4,5}, i.e., the split s1 = ({1,3}|{0,2,4,5}); removing the interior edge e2 yields the leaf sets {1,2,3} and {0,4,5}, i.e., the split s2 = ({1,2,3}|{0,4,5}).

As shown in the example, one receives splits by removing interior edges $e$ of a tree. Doing this for all interior edges of the tree, one receives a set of splits that, as we will see later, already completely defines the topology $(V, E)$ of a tree $(V, E, w)$.

The notion of topology is frequently used in the same sense as ‘the combinatorial structure of the tree’ in the literature (see, e.g., [OP11]). Our definition here is a little more specific.

Definition 2.5. For a tree $T$ its induced set of splits, the topology of $T$, is defined as

$$\mathrm{Split}(T) := \{(A(e)|A(e)^c) : e \text{ is an interior edge of } T\} \subseteq \mathcal{S}.$$
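The construction of Split(T) by removing interior edges can be sketched in code. The following Python snippet is only an illustration under stated assumptions: a tree is given as a list of undirected edges, each split is stored as the side $A$ that does not contain the root 0, and the function name `induced_splits`, the interior node labels "a", "b", "c" and the sample tree are hypothetical.

```python
from collections import defaultdict


def induced_splits(edges, leaves):
    """Return Split(T) as a set of frozensets A (the side without leaf 0)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    def leaves_reachable(start, removed_edge):
        # Depth-first search that never crosses the removed edge.
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if {node, nxt} == set(removed_edge) or nxt in seen:
                    continue
                seen.add(nxt)
                stack.append(nxt)
        return frozenset(seen & leaves)

    splits = set()
    for u, v in edges:
        if u in leaves or v in leaves:
            continue  # edge incident to a leaf: trivial partition, neglected
        side = leaves_reachable(u, (u, v))
        if 0 in side:
            side = frozenset(leaves - side)  # convention: 0 lies in A^c
        splits.add(side)
    return splits


# Hypothetical tree on the leaves 0..4 with interior nodes "a", "b", "c".
leaves = frozenset(range(5))
edges = [(0, "c"), (4, "c"), ("c", "b"), (2, "b"), ("b", "a"), (1, "a"), (3, "a")]
topology = induced_splits(edges, leaves)
```

For this sample tree the two interior edges ("b", "a") and ("c", "b") give the sides {1,3} and {1,2,3}. Storing only $A$ is sufficient because $A^c$ is determined by the leaf set.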

As mentioned earlier, we use splits as they offer a convenient and consistent way to describe the topology of a tree by a set of splits. Nonetheless, not all sets of splits $S \subset \mathcal{S}$ yield a tree topology, since some pairs of partitions cannot be present in the same tree:

Example 2.1.3. Assume $n = 4$ and consider two splits on $\{0,1,2,3,4\}$, $s_1 = (\{1,2\}|\{0,3,4\})$ and $s_2 = (\{1,3\}|\{0,2,4\})$.

There cannot exist a tree whose interior edges yield both of these splits, as there cannot be two edges in the same tree stating that $\{1,2\}$ are direct neighbors and that $\{1,3\}$ are direct neighbors. More precisely, $\{1,2\}$ induces a subtree that contains 1 but does not contain 3. However, $\{1,3\}$ induces a subtree that contains 1 but does not contain 2, which is a contradiction.

The concept of compatibility describes whether two splits may be contained in the same tree and, even more, whether a set of splits yields a tree topology:

Definition 2.6. [BHV01] Two splits $(A|A^c)$ and $(B|B^c)$ are compatible if $A \subseteq B$, $A \subseteq B^c$, $B \subseteq A$ or $B^c \subseteq A$. A set of splits $S \subset \mathcal{S}$ is called compatible if every pair $s_i, s_j \in S$ is compatible.
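The compatibility test of Definition 2.6 amounts to four subset checks. A minimal Python sketch follows, assuming that a split is represented by the frozenset $A$ and that the leaf set is $\{0, \dots, n\}$; the function names are chosen for illustration only.

```python
from itertools import combinations


def complement(a, n):
    """Complement of A within the leaf set {0, 1, ..., n}."""
    return frozenset(range(n + 1)) - a


def compatible(a, b, n):
    """Pairwise compatibility: one of the four inclusions must hold."""
    a_c, b_c = complement(a, n), complement(b, n)
    return a <= b or a <= b_c or b <= a or b_c <= a


def is_compatible_set(splits, n):
    """A set of splits is compatible if every pair of its splits is."""
    return all(compatible(a, b, n) for a, b in combinations(splits, 2))


s1 = frozenset({1, 2})
s2 = frozenset({1, 3})
s6 = frozenset({3, 4})
```

With $n = 4$, `compatible(s1, s2, 4)` is false, matching Example 2.1.3, while `compatible(s1, s6, 4)` is true because $\{1,2\} \subseteq \{0,1,2\}$.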

The definition of splits and compatibilities allows for the following nice representation of tree topologies, which implies the one-to-one correspondence between splits and interior edges that we mentioned in Example 2.1.2.

Lemma 2.7 ([BHV01], [Vog07]). There exists a metric $n$-tree $T$ with topology $S = \mathrm{Split}(T)$ if and only if all pairs of splits $\{s_1, s_2\} \subset S$ are compatible.

As an example for this, the unweighted tree T = (V, E) or tree topology depicted in Figure 2.5 has two interior edges which result in the two splits ({1,3}|{0,2,4}) and ({1,2,3}|{0,4}). These, in turn, uniquely define the tree topology of T.

Lemma 2.7 in particular implies that two splits are compatible if and only if they can exist in the same tree.

Moreover, it justifies why we called the set of splits the topology of a tree in Definition 2.5 and why this fits the terminology chosen in Definition 2.1: The topology of a tree (in the sense of Definition 2.1) $T = (V, E, w)$ is uniquely characterized by its set of splits $\mathrm{Split}(T)$.

An immediate consequence of the one-to-one correspondence and Lemma 2.2 is the following corollary.

Corollary 2.8. The maximal number of compatible splits is n−2.

Embedding into $\mathbb{R}^N_+$

Now that we have defined splits and compatibilities, we can make use of the representation of trees by split sets to embed the trees into a Euclidean space. To this end, we construct vectors whose length is the number of all splits; easy combinatorics show that there are $N := |\mathcal{S}| = 2^n - n - 2$ splits on $\{0, \dots, n\}$. As the non-negative orthant $\mathbb{R}^N_+$ of $\mathbb{R}^N$ is defined inconsistently throughout the literature, we specify

$$\mathbb{R}^N_+ = \{x \in \mathbb{R}^N : x_i \geq 0 \ \forall i = 1, \dots, N\}.$$
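The count $N = 2^n - n - 2$ can be checked by brute force. With the convention $0 \in A^c$, a split is determined by a subset $A \subseteq \{1, \dots, n\}$ with $2 \leq |A| \leq n - 1$: of the $2^n$ subsets, the empty set, the $n$ singletons and the full set are excluded. The Python sketch below enumerates these subsets; the function name `all_splits` and the chosen ordering (by size, then lexicographically) are assumptions made for illustration.

```python
from itertools import combinations


def all_splits(n):
    """List all splits on {0, ..., n}, each given by its side A without 0."""
    labels = range(1, n + 1)
    return [
        frozenset(subset)
        for size in range(2, n)  # sizes 2, ..., n - 1
        for subset in combinations(labels, size)
    ]


counts = {n: len(all_splits(n)) for n in range(2, 9)}
```

Fixing one enumeration order, as done here, is what later allows every tree to be written as a vector with one coordinate per split.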


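Definition 2.9. Let the splits in $\mathcal{S}$ be given in a fixed order according to their indices, say, $\mathcal{S} = \{s_1, \dots, s_N\}$. Then we can describe every metric $n$-tree $T$ by a vector $t \in \mathbb{R}^N_+$ by defining

$$t_i := \begin{cases} w_e > 0 & \text{if } s_i = (A(e)|A(e)^c) \in \mathrm{Split}(T), \\ 0 & \text{if } s_i \notin \mathrm{Split}(T), \end{cases} \tag{2.1}$$

for $i = 1, \dots, N$.

Let $\mathcal{T}_n$ denote the set of all metric $n$-trees. Then the mapping

$$\chi : \mathcal{T}_n \to \mathbb{R}^N_+, \quad \chi(T) = t,$$

with $t$ defined as above is called (canonical) embedding of $\mathcal{T}_n$.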
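The embedding $\chi$ can be written down in a few lines once a split order is fixed. In the following Python sketch a tree is assumed to be given as a dictionary that maps each of its splits (the frozenset $A$) to its positive edge weight; the helper names `all_splits` and `embed` as well as the sample weights are hypothetical.

```python
from itertools import combinations


def all_splits(n):
    """Fixed order of all splits on {0, ..., n} (by size, then lexicographic)."""
    labels = range(1, n + 1)
    return [frozenset(c) for size in range(2, n) for c in combinations(labels, size)]


def embed(tree, split_order):
    """chi(T): weight w_e for splits of T, 0 for all other splits, cf. (2.1)."""
    return [tree.get(split, 0.0) for split in split_order]


order = all_splits(4)
# Hypothetical tree with the splits ({1,2}|{0,3,4}) and ({3,4}|{0,1,2}).
tree = {frozenset({1, 2}): 3.2, frozenset({3, 4}): 1.8}
t = embed(tree, order)
```

Since the dictionary keys are exactly the non-zero coordinates, the tree can be recovered from `t` and `order`, which mirrors the injectivity of $\chi$.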

Note that the term ‘embedding’ implicitly requires injectiveness of the mapping.

That $\chi$ is injective is easy to see: When $\chi(T_1) = t = \chi(T_2)$, then it follows that $T_1$ and $T_2$ have the same set of splits, the ones that correspond to the non-zero components of $t$. Moreover, they have the same weights $t_j$ for these splits, so $T_1 = T_2$. We illustrate the embedding of $\mathcal{T}_n$ into $\mathbb{R}^N_+$ using metric 4-trees.

Example 2.1.4. For $n = 4$ the $N = 2^n - n - 2 = 2^4 - 6 = 10$ possible splits on $\{0,1,2,3,4\}$ are

$$\begin{aligned}
s_1 &= (\{1,2\}|\{0,3,4\}), & s_2 &= (\{1,3\}|\{0,2,4\}), \\
s_3 &= (\{1,4\}|\{0,2,3\}), & s_4 &= (\{2,3\}|\{0,1,4\}), \\
s_5 &= (\{2,4\}|\{0,1,3\}), & s_6 &= (\{3,4\}|\{0,1,2\}), \\
s_7 &= (\{1,2,3\}|\{0,4\}), & s_8 &= (\{1,2,4\}|\{0,3\}), \\
s_9 &= (\{1,3,4\}|\{0,2\}), & s_{10} &= (\{2,3,4\}|\{0,1\}).
\end{aligned}$$

Figure 2.6: Four trees of $\mathcal{T}_4$ (interior edge weights from left to right: 3.2 and 1.8; 2.8 and 2; 2.5; the star tree without interior edges).

The four metric 4-trees depicted in Figure 2.6 are hence given as

$$\begin{aligned}
t_1 &= (3.2, 0, 0, 0, 0, 1.8, 0, 0, 0, 0)^\top, & t_2 &= (2.8, 0, 0, 0, 0, 0, 2, 0, 0, 0)^\top, \\
t_3 &= (0, 0, 0, 0, 0, 2.5, 0, 0, 0, 0)^\top, & t_4 &= (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)^\top.
\end{aligned}$$

The tree $T_4$ corresponding to $t_4 = 0 \in \mathbb{R}^{10}_+$ is called the star tree and is depicted rightmost in Figure 2.6. Note that not every vector in $\mathbb{R}^{10}_+$ has a pre-image.

Consider for example $t_5 = (2, 3, 0, 0, 0, 0, 0, 0, 0, 0)^\top$. This would translate to the split $s_1$ having a weight of 2 and $s_2$ having a weight of 3. But $s_1$ and $s_2$ are incompatible, so there exists no tree that can contain both of these splits. This implies that there is no pre-image for $t_5$.

With this embedding, each metric $n$-tree $T$ yields a vector $t \in \mathbb{R}^N_+$, but, as explained in the example, not every $x \in \mathbb{R}^N_+$ represents a metric $n$-tree, since the embedding is not surjective. We now describe which $x \in \mathbb{R}^N_+$ are representatives of metric $n$-trees, which follows directly from Lemma 2.7 and the definition of the embedding.

Recall that it is crucial to first choose a fixed order of the splits in $\mathcal{S}$ and then to embed all trees into $\mathbb{R}^N_+$ with respect to this order.

Theorem 2.10. Let $x \in \mathbb{R}^N_+$. Then $x$ represents a tree if and only if all $s_i, s_j \in \mathcal{S}$ with $x_i \neq 0$ and $x_j \neq 0$ are compatible splits.
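Theorem 2.10 translates directly into a membership test: collect the splits with non-zero coordinate and check them pairwise for compatibility. The Python sketch below assumes the split order by size and then lexicographically, which for $n = 4$ coincides with $s_1, \dots, s_{10}$ of Example 2.1.4; all function names are hypothetical.

```python
from itertools import combinations


def all_splits(n):
    """Fixed order of all splits on {0, ..., n}, each given by its side A."""
    labels = range(1, n + 1)
    return [frozenset(c) for size in range(2, n) for c in combinations(labels, size)]


def compatible(a, b, n):
    """Compatibility of the splits (A|A^c) and (B|B^c) as in Definition 2.6."""
    everything = frozenset(range(n + 1))
    a_c, b_c = everything - a, everything - b
    return a <= b or a <= b_c or b <= a or b_c <= a


def represents_tree(x, n):
    """True if x lies in the non-negative orthant and its support is compatible."""
    order = all_splits(n)
    if len(x) != len(order) or any(value < 0 for value in x):
        return False
    support = [split for split, value in zip(order, x) if value != 0]
    return all(compatible(a, b, n) for a, b in combinations(support, 2))


t1 = [3.2, 0, 0, 0, 0, 1.8, 0, 0, 0, 0]
t5 = [2, 3, 0, 0, 0, 0, 0, 0, 0, 0]
```

Here `represents_tree(t1, 4)` holds, whereas `represents_tree(t5, 4)` fails because $s_1$ and $s_2$ are incompatible.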

After having introduced the concept of splits and defined the embedding, we will now incorporate two possibilities to uniquely identify trees $T = (V, E, w)$ without using the $(V, E, w)$ notation. The first is to specify a tuple $T = ((s_1, \dots, s_k), (w_1, \dots, w_k))$ of a vector of splits $(s_1, \dots, s_k)$ that yields a compatible set of splits $\{s_1, \dots, s_k\} \subset \mathcal{S}$ and the corresponding positive edge weights $w_i$ for $s_i$, $i = 1, \dots, k$, that are induced by the translation of an interior edge to a split. The second possibility is to identify a tree $T \in \mathcal{T}_n$ via the embedding, i.e., by its corresponding vector $t \in \mathbb{R}^N_+$ such that $\chi^{-1}(t) = T$.

Since we have these two possibilities, we will always reference trees $T \in \mathcal{T}_n$ with a capital letter, consisting of a tuple of splits and lengths, whereas the respective lower-case letter $t$ denotes the embedding of $T$ into $\mathbb{R}^N_+$.


2.2 The Geodesic Distance

In the previous section we have introduced convenient ways of referencing trees and have seen that compatible sets of splits yield tree topologies. Having all possible sets $S$ of compatible splits, we thus get all tree topologies. Considering all possible weight vectors for these compatible split sets, we are able to construct all metric $n$-trees. The tree space $\mathcal{T}_n$ is the collection of all these trees. Next, we define a metric on $\mathcal{T}_n$ to make it a metric space.

This is achieved in two steps: First, we define a distance between two trees that are contained in a common maximal orthant, a specific region of the tree space that we define in Definition 2.11. After we know how to measure distance in single orthants, we extend the distance to the whole space by investigating how one traverses from one orthant into another.

We start by introducing orthants. The notion of orthants has already been mentioned in [BHV01]; here we define it in a slightly different manner, as regions of the tree space instead of Euclidean orthants.

Definition 2.11. Let S ⊆ S be a set of compatible splits in Tn. Then

O(S) := {T ∈ Tn : Split(T) ⊆ S}

is called the orthant of S. Moreover, for T ∈ Tn let O(T) = O(Split(T)).

In case that S is a set of pairwise compatible splits with maximum cardinality of n−2, we call O(S) a maximal orthant.

By definition, O(S1) ⊆ O(S2) holds for S1 ⊆ S2, so the orthant of any set of splits S with |S| < n−2 is contained in some maximal orthant O(S′), i.e., S ⊂ S′ and |S′| = n−2. Also note that a tree T whose set of splits Split(T) is not maximal belongs to several maximal orthants, which is illustrated in Figure 2.7. The most extreme example for this is the star tree 0 ∈ Tn: since Split(0) = ∅ ⊂ S for all S ⊂ S, the star tree is contained in every orthant of Tn.

Before defining the distance in single orthants, we recall an important result: the number of maximal orthants of Tn grows super-exponentially in n.

Theorem 2.12 ([BHV01]). There exist (2n−3)!! = (2n−3) · (2n−5) · . . . · 3 maximal orthants in Tn.
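To get a feeling for how quickly this count grows, the double factorial can be evaluated directly. The following is only a small illustrative sketch; the function name num_maximal_orthants is chosen here and does not appear in the thesis.

```python
def num_maximal_orthants(n: int) -> int:
    """Return (2n-3)!! = (2n-3) * (2n-5) * ... * 3, the count from Theorem 2.12."""
    if n < 2:
        raise ValueError("n must be at least 2")
    count = 1
    # Multiply the odd factors 2n-3, 2n-5, ..., 3.
    for factor in range(2 * n - 3, 1, -2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in range(3, 11):
        print(f"n = {n:2d}: {num_maximal_orthants(n)} maximal orthants")
```

For n = 4 this gives the 15 maximal orthants that are listed explicitly in Example 2.2.2.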

With the definition of orthants we now define the distance between trees that are contained in a common orthant. To this end, let O = O(S) be an orthant that contains T and X, i.e., Split(T) ⊆ S and Split(X) ⊆ S. [BHV01] define the distance for two such trees to be the Euclidean norm of the difference of their weight vectors. Using the embedding into RN+, i.e., using t, x ∈ RN+ instead of T, X respectively, their distance is given by

d(X, T) := ‖t − x‖2.

Figure 2.7: A degenerate tree topology with splits S = {s1, s2} (top) and the three tree topologies with maximal split sets S1, S2, S3 that contain S.

Note that all components of t and x that do not correspond to splits in Split(T) or Split(X), respectively, are 0 by definition of the embedding χ (Definition 2.9).

Before we continue to define the distance in Tn for two trees that are not contained in a common orthant, we want to emphasize the local Euclidean structure that is implied by the definition of distance for two trees in an orthant.

Since d(T, X) = ‖t − x‖2 for two trees T, X that are in the same orthant, a single maximal orthant with n−2 splits is isometric to the non-negative Euclidean orthant R^{n−2}_+:

Definition 2.13. Let a maximal orthant O = O(S) ⊂ Tn with S = {s_{i_1}, . . . , s_{i_{n−2}}} be given. Then the embedding ψO of the orthant O into R^{n−2}_+ is given by the map onto the length vectors, i.e., for a tree T = ((s_{i_1}, . . . , s_{i_{n−2}}), (w1, . . . , wn−2)) ∈ O, define

\[ \psi_O(T) = \begin{pmatrix} w_1 \\ \vdots \\ w_{n-2} \end{pmatrix}, \]

and for non-binary trees T = ((s_{j_1}, . . . , s_{j_l}), (w1, . . . , wl)) ∈ O(S), i.e., when l < n−2 and Split(T) = {s_{j_1}, . . . , s_{j_l}} ⊊ S, define ψO(T) coordinate-wise,

\[ \psi_O(T)_m = \begin{cases} w_p & \text{if } s_{j_p} = s_{i_m} \text{ for some } p \in \{1, \dots, l\}, \\ 0 & \text{else,} \end{cases} \qquad m = 1, \dots, n-2. \]

Recall that we have also defined an embedding χ of Tn into RN+ in Definition 2.9. The orthant embedding ψO(T) is simply the projection of χ(T) onto the components of RN+ that correspond to the splits of O.
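The orthant embedding and the resulting distance inside one orthant can be sketched in a few lines. The representation is an assumption made only for this illustration: a tree is a dictionary mapping split labels to positive weights, and an orthant is an ordered tuple of split labels. The names psi and orthant_distance are hypothetical.

```python
import math


def psi(orthant_splits, tree):
    """Length vector of `tree` in the coordinates of the given orthant.

    Splits of the orthant that the tree does not contain get coordinate 0.
    """
    missing = set(tree) - set(orthant_splits)
    if missing:
        raise ValueError(f"tree is not contained in the orthant: {missing}")
    return [float(tree.get(split, 0.0)) for split in orthant_splits]


def orthant_distance(orthant_splits, tree_a, tree_b):
    """Euclidean distance of two trees that lie in a common orthant."""
    return math.dist(psi(orthant_splits, tree_a), psi(orthant_splits, tree_b))


if __name__ == "__main__":
    orthant = ("s2", "s9")          # labels are placeholders for two compatible splits
    x = {"s2": 2.0, "s9": 2.0}      # binary tree in the interior of the orthant
    y = {"s9": 1.0}                 # non-binary tree on the boundary
    print(psi(orthant, x), psi(orthant, y))
    print(orthant_distance(orthant, x, y))
```

The non-binary tree simply receives a zero in the coordinate of the split it lacks, which is the coordinate-wise case of Definition 2.13.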

Now that we have defined the distance between trees that are contained in a common orthant and understood the local Euclidean structure of the orthants, the second step is to connect the orthants and measure distance for trees that are not contained in a common orthant via paths through several connected orthants.

Assume now that T and X are not both contained in some orthant O. This implies that there exists some split s1 in Split(X) that is incompatible with a split s2 in Split(T), as we could otherwise take the compatible split set S̄ := Split(T) ∪ Split(X), and T, X ∈ O(S̄) would show that T and X are contained in a common orthant.

Here, it does not make sense to define the distance by the Euclidean norm of the embedded vectors ‖t − x‖2: For each point y = (1−λ)x + λt, λ ∈ (0,1), on the Euclidean shortest path, i.e., the line segment connecting t and x in RN+, we would have that yi > 0 whenever one of xi and ti is greater than 0 for some i ∈ {1, . . . , N}. Hence the coordinates corresponding to the incompatible splits s1 and s2 are both positive for y, and Theorem 2.10 implies that y ∈ RN+ is not a representative of a tree, i.e., y ∉ χ(Tn). This is easier to understand when looking at the example for the embedding of T3 into R3+ in Example 2.2.1.

Example 2.2.1. Consider T3. A maximal split set in T3 has n−2 = 3−2 = 1 split, and there exist N = 2^3 − 3 − 2 = 3 splits in T3, compare p. 13. We take two trees with different topologies:

T1 = (({2,3}|{0,1}), (2))
T2 = (({1,3}|{0,2}), (1))

Figure 2.8 depicts the embedding of T3 into R3, which consists exactly of the non-negative axes that are drawn as solid lines. The Euclidean shortest path between T1 and T2 is the dotted line. Clearly, the line is not contained in χ(T3), so it yields a distance between the trees, but not a corresponding path through tree space. In general, the Euclidean distance of the embedded vectors is the extrinsic distance when used for the subset of trees, as it is the distance that is derived from the ambient space RN+, i.e., the space into which Tn is embedded. Since this distance is not realized by any path inside the tree space, the distance needs to be defined differently.
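The observation of this example can be checked mechanically via Theorem 2.10. The sketch below makes two assumptions that are not part of the thesis text: every split is encoded by its side that does not contain the leaf 0, and two such sides are treated as compatible exactly when they are disjoint or nested (the usual compatibility criterion expressed in this encoding). The function names are hypothetical.

```python
from itertools import combinations


def compatible(a, b):
    """Compatibility of two splits, each given by its side not containing leaf 0."""
    return a.isdisjoint(b) or a <= b or b <= a


def represents_tree(splits, vector):
    """Theorem 2.10: all splits with non-zero coordinate must be pairwise compatible."""
    support = [s for s, value in zip(splits, vector) if value > 0]
    return all(compatible(a, b) for a, b in combinations(support, 2))


# The three splits of T3 in the order ({2,3}|{0,1}), ({1,3}|{0,2}), ({1,2}|{0,3}).
SPLITS_T3 = [frozenset({2, 3}), frozenset({1, 3}), frozenset({1, 2})]

t1 = [2.0, 0.0, 0.0]   # embedding of T1
t2 = [0.0, 1.0, 0.0]   # embedding of T2
midpoint = [(a + b) / 2 for a, b in zip(t1, t2)]

if __name__ == "__main__":
    print(represents_tree(SPLITS_T3, t1))        # endpoint of the segment
    print(represents_tree(SPLITS_T3, t2))        # endpoint of the segment
    print(represents_tree(SPLITS_T3, midpoint))  # interior point of the segment
```

The midpoint has positive coordinates for two incompatible splits, so it lies outside χ(T3), which is what the dotted line in Figure 2.8 illustrates.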

What we actually want is a shortest path P from X to T in the tree space, i.e., a path P ⊆ Tn that traverses several orthants to connect X and T. To find such a path we need to be able to transform one tree topology into a different tree topology. This is done as follows: By definition, two trees T1, T2 with split sets Split(T1), Split(T2) are in the orthant O(S) when Split(T1) ∪ Split(T2) = S is a compatible set of splits. That means that we can search for a path from X to T by removing some splits of X to get some tree T′ such that Split(T′) ⊂ Split(X), then adding some splits of T to T′, and repeating this procedure until we are left only with splits of T, hence the topology of T. It is important to remove and add splits in such a way that two successive trees on the way from X to T are contained in a common orthant, so that we can use the Euclidean distance for these pairs of trees. In this way, we have connected X and T by a path through tree space.

Figure 2.8: Embedding of T3 into R3 and a Euclidean shortest path in R3 connecting two trees of T3. The three axes correspond to the splits ({2,3}|{0,1}), ({1,3}|{0,2}) and ({1,2}|{0,3}).

Now that we have sketched the idea, we formalize the length of paths between two trees, as it has already been described in [BHV01].

Definition 2.14. Let (T1, T2, . . . , Tk) be a sequence of trees in Tn such that for each i = 1, . . . , k−1 there exists an orthant Oi with Ti, Ti+1 ∈ Oi. Then define the path P = (T1, . . . , Tk) ⊂ Tn to be the path that contains all trees Ti and, for each pair Ti, Ti+1, i = 1, . . . , k−1, all trees χ−1(λti + (1−λ)ti+1) for λ ∈ (0,1).

P = (T1, T2, . . . , Tk) is called a path from T1 to Tk and its length is defined as

\[ L(T_1, T_2, \dots, T_k) := \sum_{i=1}^{k-1} \| t_{i+1} - t_i \|_2 . \]

It is crucial to understand that for each i = 1, . . . , k−1, Ti and Ti+1 are contained in a common orthant Oi and the distance there is defined as the Euclidean distance of their weight vectors. Note that then yλ = λti + (1−λ)ti+1 ∈ RN+ always represents a tree, i.e., yλ ∈ χ(Tn). This holds as only splits that are in the common orthant Oi may have non-zero components in the embedding. More precisely, yλ ∈ χ(Oi) ⊂ χ(Tn). As the distance within an orthant is the Euclidean distance, it follows that the shortest path between their weight vectors is just the straight line, which is mapped back to Tn by χ−1.

We illustrate the concept of orthants and paths in T4. First, we state all maximal orthants and then we show an example of a path and a geodesic.

Example 2.2.2. In T4 it can easily be checked that no more than two splits can be pairwise compatible. The maximal orthants in T4 hence all contain two compatible splits. There exist 15 maximal orthants, namely

O1 = {({1,2}|{0,3,4}), ({3,4}|{0,1,2})} = {s1, s6}
O2 = {({1,3}|{0,2,4}), ({2,4}|{0,1,3})} = {s2, s5}
O3 = {({1,4}|{0,2,3}), ({2,3}|{0,1,4})} = {s3, s4}
O4 = {({1,2}|{0,3,4}), ({1,2,3}|{0,4})} = {s1, s7}
O5 = {({1,2}|{0,3,4}), ({1,2,4}|{0,3})} = {s1, s8}
O6 = {({1,3}|{0,2,4}), ({1,2,3}|{0,4})} = {s2, s7}
O7 = {({1,3}|{0,2,4}), ({1,3,4}|{0,2})} = {s2, s9}
O8 = {({1,4}|{0,2,3}), ({1,2,4}|{0,3})} = {s3, s8}
O9 = {({1,4}|{0,2,3}), ({1,3,4}|{0,2})} = {s3, s9}
O10 = {({2,3}|{0,1,4}), ({1,2,3}|{0,4})} = {s4, s7}
O11 = {({2,3}|{0,1,4}), ({2,3,4}|{0,1})} = {s4, s10}
O12 = {({2,4}|{0,1,3}), ({1,2,4}|{0,3})} = {s5, s8}
O13 = {({2,4}|{0,1,3}), ({2,3,4}|{0,1})} = {s5, s10}
O14 = {({3,4}|{0,1,2}), ({1,3,4}|{0,2})} = {s6, s9}
O15 = {({3,4}|{0,1,2}), ({2,3,4}|{0,1})} = {s6, s10}

The structure of T4 can be represented by a graph G = (S, O) whose node set S contains the ten splits {s1, . . . , s10} of T4 and whose edges connect two splits (si, sj) if and only if si and sj are compatible. Consequently, the edges correspond to the maximal sets of compatible splits in T4, which are the maximal orthants of T4, see Figure 2.9. Note that this compatibility graph is in fact the Petersen graph.
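The counts stated in this example can be checked by brute force. The following sketch again assumes the encoding of a split by its side not containing leaf 0 and the disjoint-or-nested compatibility test; both are conventions chosen for this illustration, and the function names are hypothetical.

```python
from itertools import combinations


def interior_splits(n):
    """All splits of the leaf set {0, ..., n} with at least two leaves on each side,
    encoded by the side that does not contain leaf 0."""
    leaves = range(1, n + 1)
    return [frozenset(c) for size in range(2, n) for c in combinations(leaves, size)]


def compatible(a, b):
    """Two encoded splits are compatible if their sides are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def compatibility_graph(n):
    """Return the node list and edge list of the compatibility graph of T_n."""
    nodes = interior_splits(n)
    edges = [(a, b) for a, b in combinations(nodes, 2) if compatible(a, b)]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = compatibility_graph(4)
    degrees = {node: sum(node in edge for edge in edges) for node in nodes}
    print(len(nodes), "splits,", len(edges), "compatible pairs")
    print("degrees:", sorted(degrees.values()))
```

For n = 4 every node has degree three, which is consistent with the graph being the Petersen graph; the script does not verify the isomorphism itself.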

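Now consider the trees X = ({s2, s9}, (2, 2)) and T = ({s1, s6}, (1, 4)). Figure 2.10 depicts two different paths from X to T:

P1 = (X, T2, T3, T)
P2 = (X, T1, T)

Calculating the Euclidean distance between all pairs of successive trees on each path, we obtain

\[ L(P_1) = \|x - t_2\|_2 + \|t_2 - t_3\|_2 + \|t_3 - t\|_2 = \sqrt{2^2 + (2-1)^2} + \sqrt{2^2 + 1^2} + \sqrt{(4-2)^2 + 1^2} = 3\sqrt{5} \approx 6.71, \]

\[ L(P_2) = \|x - t_1\|_2 + \|t_1 - t\|_2 = \|x\|_2 + \|t\|_2 = \sqrt{8} + \sqrt{17} \approx 6.95, \]

where the second computation uses that T1 is the star tree, i.e., t1 = 0.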
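The two path lengths can be recomputed numerically from Definition 2.14. The sketch below rests on assumptions that are inferred rather than stated in the text: coordinates are ordered as (s1, s2, s6, s9), T1 is taken to be the star tree, and the weights T2 = ({s9}, (1)) and T3 = ({s6}, (2)) are read off from the terms of the computation above (they are otherwise only shown in Figure 2.10). The function name path_length is hypothetical.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between successive embedded trees (Definition 2.14).

    The caller is responsible for successive trees sharing a common orthant.
    """
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Coordinates in the order (s1, s2, s6, s9); all other splits of T4 have weight 0.
x = (0.0, 2.0, 0.0, 2.0)      # X  = ({s2, s9}, (2, 2))
t = (1.0, 0.0, 4.0, 0.0)      # T  = ({s1, s6}, (1, 4))
star = (0.0, 0.0, 0.0, 0.0)   # T1, assumed to be the star tree
t2 = (0.0, 0.0, 0.0, 1.0)     # T2 = ({s9}, (1)), inferred
t3 = (0.0, 0.0, 2.0, 0.0)     # T3 = ({s6}, (2)), inferred

length_p1 = path_length([x, t2, t3, t])
length_p2 = path_length([x, star, t])

if __name__ == "__main__":
    print(f"L(P1) = {length_p1:.4f}")
    print(f"L(P2) = {length_p2:.4f}")
```

Under these assumptions the path P1 through the orthants O7, O14 and O1 is shorter than the path P2 through the star tree.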

Figure 2.9: Compatibility graph for T4. The three colored orthants are used in the example of paths from X to T.

Figure 2.10: Example of two paths from X to T in a diagram of a part of T4, showing the orthants O7, O14 and O1 with the splits s2, s9, s6 and s1 and the trees X, T, T1, T2 and T3. The shortest path connecting X and T is the solid path.

In Definition 2.15, the distance between two trees X and T is now simply defined as the length of the geodesic, i.e., a path in tree space from X to T with minimal length. Before that, we provide a little bit of background knowledge about the term geodesic in mathematics and how it relates to tree space.

The notion of geodesics originates from differential geometry, where it generalizes the notion of straight lines to curved spaces such as Riemannian manifolds. Geodesics have later been generalized to metric spaces, thereby introducing the notion of a geodesic metric space. As a matter of fact, Theorem 2.16 will show that Tn is a geodesic metric space, which is why this shortest path is called geodesic. Moreover, as geodesics were introduced as a generalization of straight lines, it is interesting to note that the geodesics here are basically concatenations of straight lines in single orthants and that the tree space actually has global non-positive curvature (see [BHV01]).

Definition 2.15. [BHV01] Let T, X ∈ Tn. The geodesic distance between X and T is given by

d(X, T) = inf{L(T1, . . . , Tk) : (T1, . . . , Tk) is a path from X to T in Tn}.

A path P = (T1, . . . , Tk) from X to T, X = T1, T = Tk, that attains this minimal distance, i.e., L(P) = d(X, T), is called geodesic.

Plugging in the length of a path from Definition 2.14, we get

\[ d(X, T) = \inf \left\{ \sum_{i=1}^{k-1} \|t_{i+1} - t_i\|_2 \; : \; (T_1, \dots, T_k) \text{ is a path from } X \text{ to } T \text{ in } \mathcal{T}_n \right\}. \]

Naturally, every path from X to T yields an upper bound on the geodesic distance.

A special path that connects any two trees in Tn is the path through the star tree 0 ∈ Tn, which is called the cone path. Formally, the cone path for X, T ∈ Tn is (X, 0, T) and its length is ‖t‖2 + ‖x‖2.

Due to the triangle inequality of ‖·‖2, which we may apply in every orthant O, we can replace a sequence

(T1, . . . , Ti, Ti+1, . . . , Tj, . . . , Tl)

of trees, where Ti, Ti+1, . . . , Tj are contained in the same orthant O, by

(T1, . . . , Ti, Tj, . . . , Tl)

and will thereby not increase the length of the path.

By the definition of geodesics it is not obvious that the infimum is always attained. The existence of a unique geodesic that attains this infimum and connects any pair of trees in Tn has been shown in [BHV01] by applying a result of [Gro87] concerning metric spaces of global non-positive curvature, also called CAT(0) spaces. As we do not need curvature in the following, apart from obtaining existence and uniqueness of geodesics, we refer the interested reader to the appendix, Chapter 7.

Theorem 2.16 ([BHV01]). Tn is a CAT(0) space. In particular, the geodesic distance d is a metric on Tn.

For CAT(0) spaces, existence and uniqueness of geodesics are guaranteed:

Theorem 2.17 ([BHV01]). Let T, X ∈ Tn. There exists a unique geodesic Γ(X, T) from X to T.

As a final note concerning curvature, we remark that a Hadamard space is defined as a complete CAT(0) space. Since Tn is complete, this makes it a Hadamard space, as mentioned in Chapter 7; compare [Bac14b] for more details on the subject. The fact that Tn is a Hadamard space will be of use later.

2.2.1 Support and Parametrization of a Geodesic

Now that the existence and uniqueness has been established, we want to actually find the geodesic between X and T, which we denote with Γ(X, T).

The key to finding and parametrizing the geodesic from a tree X to a tree T is the concept of the support sequence or the support of a geodesic, which describes in which order edges of the tree X are removed and edges of T are added. With this concept it is possible to apply combinatorial methods to find the geodesic.

[BHV01] prove that splits that are contained in both X and T are contained in every tree along the geodesic. Hence, we do not need to consider these splits when searching for the topologies of trees along the geodesic. The same result holds for splits that are contained in one tree and are compatible with all splits of the other tree. These two types of splits are merged in the following definition.

Definition 2.18. Given two trees T, X ∈ Tn. A split s is called double compatible if s is compatible with all splits in Split(X) as well as with all splits in Split(T).

Let C =C(T, X) denote the set of double compatible splits for two trees.

Next we define a partition of the split sets that describes in which order we remove splits from X and add splits from T on the path from X to T.

Definition 2.19. [OP11] With the notation introduced above, a support sequence from X ∈ Tn to T ∈ Tn is a pair (A, B) = ((A1, . . . , Ak), (B1, . . . , Bk)) with Ai ⊂ Split(X), Bi ⊂ Split(T) such that each of

A1 ∪ . . . ∪ Ak = Split(X) \ C
B1 ∪ . . . ∪ Bk = Split(T) \ C

is a partition, i.e., a disjoint union.

In order to define the support of a geodesic we introduce the following notation. Let T = ((s1, . . . , sl), (w1, . . . , wl)) ∈ Tn. Then for a subset A ⊆ Split(T) define

\[ \|A\|_2 = \sqrt{\sum_{s_i \in A} w_{s_i}^2}. \]

Moreover, we also write w_s^T and w_s^X to indicate the weight of a split s in tree T or tree X, respectively, for a split s ∈ Split(T) ∩ Split(X).

Definition 2.20. [OP11] Let T, X ∈ Tn and let A = (A1, . . . , Ak), B = (B1, . . . , Bk) be a support sequence from X to T. Then (A, B) is called a support of the geodesic from X to T if it satisfies the following three properties:

(P1) For each i > j, Ai and Bj are compatible sets of splits.

(P2) \[ \frac{\|A_1\|}{\|B_1\|} \le \dots \le \frac{\|A_k\|}{\|B_k\|} \]

(P3) For i = 1, . . . , k there is no non-trivial partition (C1, C2) of Ai and (D1, D2) of Bi such that C2 is compatible with D1 and ‖C1‖/‖D1‖ < ‖C2‖/‖D2‖.

In the following we describe what the three properties (P1),(P2) and (P3) actually mean.

From the definition of the geodesic distance it can be seen that the geodesic consists of “straight lines”, also called legs, one for each orthant that it traverses. This follows from its property of being the shortest path connecting two trees together with the definition of distance in the orthants, being the Euclidean distance.

Now, knowing that the geodesic is, roughly speaking, a piecewise Euclidean path that may only bend when traversing a boundary of an orthant, we can use this knowledge to interpret (P1),(P2) and (P3), where we stick to the notation of Definition 2.20.

A support sequence as in Definition 2.19 generally defines a sequence of orthants that a path traverses. In case of the geodesic from X to T, the first orthant on the geodesic is the orthant of X, O(X), so the trees on the straight line, the leg, in this orthant have the splits Split(X). To get to the orthant of T we have to remove the splits in Split(X) \ C and add the splits in Split(T) \ C; that is the general idea of the partitions of Split(X) \ C into A1, . . . , Ak and of Split(T) \ C into B1, . . . , Bk. We want to successively delete splits of X, starting with A1, then A2 and so on, and add splits of T, starting with B1, then B2 and so forth.

(P1) ensures that this procedure of removing and adding splits actually yields an orthant of Tn, i.e., that B1 ∪ . . . ∪ Bi ∪ Ai+1 ∪ . . . ∪ Ak is a compatible set of splits.

Now, while (P1) cares about “feasibility”, (P2) and (P3) together ensure that the split weights along the path through these orthants are chosen optimally.
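Properties (P1) and (P2) are easy to test for a given candidate support sequence. The following sketch is only an illustration under stated assumptions: splits are encoded by their side not containing leaf 0, compatibility is tested as disjoint-or-nested, trees are dictionaries from splits to weights, and ‖A‖ is taken to be the Euclidean norm of the weights in A. All function names are hypothetical, and (P3) is deliberately not checked.

```python
import math


def compatible(a, b):
    """Two encoded splits are compatible if their sides are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def norm(split_set, tree):
    """Euclidean norm of the weights that `tree` assigns to the splits in `split_set`."""
    return math.sqrt(sum(tree[s] ** 2 for s in split_set))


def satisfies_p1(A, B):
    """(P1): for each i > j, every split of A_i is compatible with every split of B_j."""
    return all(
        compatible(a, b)
        for i in range(len(A))
        for j in range(i)
        for a in A[i]
        for b in B[j]
    )


def satisfies_p2(A, B, x_tree, t_tree):
    """(P2): the ratios ||A_i|| / ||B_i|| are non-decreasing in i."""
    ratios = [norm(a, x_tree) / norm(b, t_tree) for a, b in zip(A, B)]
    return all(r1 <= r2 for r1, r2 in zip(ratios, ratios[1:]))


# Trees X and T from Example 2.2.2 in the assumed encoding.
s1, s2 = frozenset({1, 2}), frozenset({1, 3})
s6, s9 = frozenset({3, 4}), frozenset({1, 3, 4})
X = {s2: 2.0, s9: 2.0}
T = {s1: 1.0, s6: 4.0}

candidate = ([{s2}, {s9}], [{s6}, {s1}])   # drop s2, add s6, then drop s9, add s1
swapped = ([{s2}, {s9}], [{s1}, {s6}])     # same blocks of X, blocks of T swapped

if __name__ == "__main__":
    for name, (A, B) in (("candidate", candidate), ("swapped", swapped)):
        print(name, satisfies_p1(A, B), satisfies_p2(A, B, X, T))
```

The first candidate passes both checks, and since all its blocks are singletons there is no non-trivial partition to test in (P3). The swapped candidate already fails (P1), because s9 and s1 are incompatible.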

Theorem 2.21. [OP11] Let T, X ∈ Tn, (A,B) be the support for the geodesic from X to T and C be the set of double compatible splits. Define the path P(A,B) = {γ(λ) : λ ∈ [0,1]} via a parametrization γ : [0,1] → Tn that consists of several
