Gene and genome duplication and the evolution of novel gene functions

Dissertation submitted for the academic degree of

Doctor of Natural Sciences (Dr. rer. nat.)

at the

Universität Konstanz, Faculty of Mathematics and Natural Sciences

Department of Biology

submitted by

Dipl. Biol. Dirk Steinke

Date of oral examination: 8 February 2006

Referee: Prof. Dr. Axel Meyer

Referee: Prof. Dr. Yves van de Peer

Acknowledgements

I would like to thank my advisor, Prof. Axel Meyer for giving me the opportunity to complete a doctoral degree in his lab.

I am very grateful to Walter Salzburger for advice, friendship and patience during three years of work for this thesis. And of course for an uncounted number of cigarettes he provided.

I am grateful to most of the past and current members of the Meyer Lab, and I especially thank, in no particular order, Simone Hoegg, Ylenia Chiari, Marta Barluenga, Mai-Britt Becker, Ingo Braasch, Arie van der Meijden, Matthias Sanetra, Kai Stölting, Henner Brinkmann, Miguel Vences, Nils Offen, Gerrit Begemann, Cornelius Eibner, Dave Gerrard, and Paul Bunje for their support and helpful discussions.

It also goes without saying that I thank our technicians for their help and support. Special thanks go to Elke Hespeler, Ilse Eistetter, Ursula Topel, Rita Hellmann, and Christina Chang-Rudolf.

Of course I would like to thank my mother and posthumously my father for facilitating my studies. Sorry, that it took a little longer than expected.

I am deeply grateful to my wife Claudia and to my children Tarana and Finn. Our little family is my source of strength, motivation and staying power. I regret all the time I could not spend with you because of the extensive work on this study. I hope it is worth the trouble.

With love to my little scientists Tarana and Finn.

For them every little discovery is still an adventure and a sensation.

“Observe constantly that all things take place by change, and accustom thyself to consider that the nature of the Universe loves nothing so much as to change the things which are, and to make new things like them.”

Marcus Aurelius Antoninus

Table of contents

I. General Introduction

II. EverEST – a phylogenomic EST database approach

2.1. Abstract

2.2. Introduction

2.3. The program

III. Novel relationships among ten model fish species supported by phylogenomics with ESTs

3.1. Abstract

3.2. Introduction

3.3. Materials and Methods

3.4. Results

3.5. Discussion

3.6. Conclusions

IV. Annotation of expressed sequence tags for the East African cichlid fish species Astatotilapia burtoni and evolutionary analyses of cichlid ORFs

4.1. Abstract

4.2. Introduction

4.3. Materials and Methods

4.4. Results

4.5. Discussion

V. Three rounds (1R/2R/3R) of genome duplications and the evolution of the glycolytic pathway in vertebrates

5.1. Abstract

5.2. Introduction

5.3. Materials and Methods

5.4. Results

5.5. Discussion

5.6. Conclusions

VI. A high proportion of species-specific genes with asymmetric rates of molecular evolution found in four fish models

6.1. Abstract

6.2. Introduction

6.3. Materials and Methods

6.4. Results

6.5. Discussion

6.6. Conclusions

VII. Genome desiccation in mammals - Gene deserts explain the uneven distribution of genes in mammalian genomes

7.1. Introduction & Discussion

7.2. Materials and Methods

7.3. Results

VIII. Summary

IX. Zusammenfassung

X. References

XI. Appendix

I. General Introduction

Biology is a discipline rooted in comparisons. Comparative studies have assembled a detailed catalogue of the biological similarities and differences between species, revealing insights into how life has adapted to fill a wide range of environmental niches. In this regard such studies have provided the fundamental paradigms from which each discipline has grown.

Genomics is the most recent branch of biology to employ comparison-based strategies. At the foundation of the evolutionary relationship of all vertebrates is conserved genetic information in the form of DNA sequence, which is assumed to underlie homologous functional and anatomical similarities between species.

Comparative genomics and ESTs

Comparative phylogenomic analyses using Expressed Sequence Tags (ESTs) from taxa across the spectrum of animal diversity promise to yield reliable and robust results. ESTs also provide an economical approach to identify large numbers of genes that can be used in gene expression and phylogenomic studies (Gerhold and Caskey, 1996; Renn et al., 2004).

Because of their usefulness, the rapid automated way of data collection, and the relatively low costs associated with this technology, many individual scientists as well as large genome sequencing centers have generated large numbers of ESTs that are publicly available, and their number continues to increase rapidly. The use of local similarity search algorithms (e.g. BLAST) against public databases is often the method of choice for screening short ESTs. The need for automation of processing, similarity analyses, and annotation of ESTs is highlighted in cases when several BLAST results should be analyzed in a combined approach.

EverEST is a WINDOWS program that automates database management and phylogenetic analyses of ESTs (Chapter II). Its features form the basis for several studies within this thesis. The program runs simultaneous database searches using the BLAST algorithm against three databases to identify the best hits for any given EST sequence. In a further step, EST sequences are associated with BLAST results and phylogenetic analyses in a relational database.

Phylogenomics of fishes

There are more than 25,000 species of teleost fishes, amounting to nearly half of the extant vertebrate species, and about 96% of all extant fishes are classified as teleosts (Nelson, 1994). They show vast differences in their morphology and adaptations. Their sister group, the lobe-finned fishes, includes the rest of the bony vertebrates, such as coelacanths, lungfishes and tetrapods. The ray-finned fishes and the lobe-finned fishes diverged 400-450 million years ago (Kumar and Hedges, 1998). Although this large evolutionary distance implies that only a fraction of the functional sequences in the genomes are shared, comparative studies have revealed, for example, that most human genes are also found in fish.

Although some progress has been made in the elucidation of the phylogeny of the higher ray-finned fishes, the teleosts, relationships among the main lineages of the derived fishes (Euteleostei) remain poorly defined and are still under discussion. Chapter III presents a phylogenomic analysis of 42 sets of orthologous genes (ESTs and other genomic data) from ten fish model systems representing seven orders (Salmoniformes, Siluriformes, Cypriniformes, Tetraodontiformes, Cyprinodontiformes, Beloniformes and Perciformes) of euteleosts to estimate divergence times and to examine the evolutionary relationships among those lineages. All ten fish species serve as models for developmental, aquaculture, genomic and comparative genetic studies; however, the molecular phylogenetic relationships among them had not been tested rigorously. The study revealed a molecular phylogeny of higher-level relationships of derived teleosts, which indicates that the use of multiple genes produces robust phylogenies. The phylogenomic analyses confirmed that the euteleostean superorders Ostariophysi and Acanthopterygii are monophyletic and that the Protacanthopterygii and Ostariophysi are sister clades. In addition, and contrary to the traditional phylogenetic hypothesis, the analyses determined that killifish (Cyprinodontiformes), medaka (Beloniformes) and cichlids (Perciformes) appear to be more closely related to each other than any of them is to pufferfish (Tetraodontiformes). All ten lineages split before or during the fragmentation of the supercontinent Pangea in the Jurassic.

The cichlid case

East African cichlids exemplify the diversity of the perciform fishes, a vast group that dominates aquatic habitats and accounts for 25% of all vertebrate species. The more than 3,000 species of cichlids are distributed from Central and South America, Africa and Madagascar to southern India. The distribution of the family suggests that it had its origins in Gondwana about 100 million years ago (Stiassny, 1991; Zardoya et al., 1996). Molecular phylogenies (Figure 1.1) show that the Indian and Malagasy species diverged first, followed by a deep split between separate monophyletic groups of African and New World species (Farias et al., 2000; Farias et al., 2001). The hotspot of their biodiversity is East Africa, where they form adaptive radiations composed of hundreds of endemic species in several lakes of various sizes and ages. They have formed so-called “species flocks”, sometimes with hundreds of endemic species in each of these lakes. The species flocks of lakes Malawi and Victoria are estimated to contain more than 700 species each (Turner et al., 2001). These flocks arose from within an older radiation of more than 250 species in Lake Tanganyika (Salzburger et al., 2002). Each of these radiations was phylogenetically independent and resulted in multiple instances of convergent evolution of morphology and behavior (Kocher et al., 1993). Diverse jaw morphologies have evolved to scrape algae, suck zooplankton, or crush snails. Cichlids have also evolved a range of mating and parental care strategies; maternal mouth brooding is the most common strategy and is often accompanied by the development of elaborate male coloration.

Figure 1.1A: The distributional pattern of the cichlids, with the representatives from India, Sri Lanka, and Madagascar forming the most basal lineages and the reciprocally monophyletic African and American lineages as sister groups, is consistent with an initial Gondwanaland distribution (Farias et al., 2000; Farias et al., 2001; Sparks, 2003; Streelman et al., 1998; Zardoya et al., 1996).

Figure 1.1B: The supercontinent of Gondwanaland some 200 million years ago (MYA). (Figure from Salzburger and Meyer 2004 with permission).

Chapter IV describes the collection and annotation of more than 12,000 ESTs generated from two different cDNA libraries obtained from the East African cichlid species Astatotilapia burtoni (Günther 1894). This species has long been used as a model system to study cichlid spawning behavior (Fryer and Iles, 1972; Wickler, 1962a; Wickler, 1962b), social interactions (Crapon de Caprona, 1980; Wickler, 1962a; Wickler, 1969), behavioral plasticity (Hofmann, 2003; Hofmann and Fernald, 2001), endocrinology (Robison et al., 2001), the cichlids’ visual system (Kroger et al., 2001), as well as cichlid development and embryogenesis (Hagedorn et al., 1998). In addition, the phylogenetic position of A. burtoni makes this species an ideal model system for comparative genomic research (Figure 1.2).

Astatotilapia burtoni, which belongs to the most species-rich lineage of cichlids, the haplochromines, was shown to be the sister group to both the Lake Victoria region superflock (~600 species) and the species flock of Lake Malawi (~1,000 species) (Meyer et al., 1990; Salzburger et al., 2005; Verheyen et al., 2003). With H. chilotes and H. sp. “redtailsheller” (Lake Victoria) and M. zebra (Lake Malawi), three highly specialized species from two species flocks have already been established as genomic models. The comparison of their genomes to that of A. burtoni, which has a more generalist lifestyle and is likely to resemble the ancestral lineage that seeded the cichlid adaptive radiations in these two lakes, promises important insights into cichlid (genome) evolution.

Figure 1.2: Phylogeny of East African cichlids based on 2000 bp of mitochondrial DNA. It shows clearly that the Haplochromini of Lake Victoria are not genetically distinguishable and that Astatotilapia burtoni has a basal position with respect to this clade (Figure from Salzburger et al. 2005 with permission).

Gene/genome duplication

Based on basic data such as genome sizes and allozymes, Ohno (1970) proposed that the increase in complexity during the evolution of the vertebrate lineage was accompanied by an increase in gene number due to duplication of genes and/or genomes.

Recent data from genome sequencing projects showed that genome size is not tightly correlated with the numbers of genes an organism possesses, and for many genes, multiple copies can be found in vertebrates, while basal deuterostomes and invertebrates typically have only one orthologous copy. The “one-two-four” rule is the current model to explain the evolution of gene families and of vertebrate genomes more generally.

Based on this model, two rounds of genome duplication occurred early in deuterostome evolution (Figure 5.1). An ancestral genome was duplicated to two (1R), and then to four genomes after the second (2R) genome duplication (Sharman and Holland, 1996; Sidow, 1996). Recent data suggest that an additional whole genome duplication (3R) occurred in the fish lineage (Fish Specific Genome Duplication - FSGD), extending the “one-two-four” to a “one-two-eight” rule (Christoffels et al., 2004; Jaillon et al., 2004; Meyer and Schartl, 1999; Taylor et al., 2003; Taylor et al., 2001a; Van de Peer et al., 2003; Vandepoele et al., 2004).

Most glycolytic enzymes, for example, follow the pattern of repeated duplications during the early evolution of vertebrates. The glycolytic pathway is particularly suitable for testing theories of enzyme evolution and the involvement of gene/genome duplications.

Chapter V gives insights into the vertebrate-specific evolution of the glycolytic enzymes. Many of the obtained gene trees reflect the history of two rounds of duplication within the vertebrate lineage plus additional duplication events within the lineage of ray-finned fish. By and large, the glycolytic pathway appears to have arisen primarily by random association of enzymes converging on similar structures and has subsequently evolved by gene duplication and divergence of each constituent enzyme.

Fish Specific Genome Duplication - FSGD

It has been suggested that the large number of fish species and their tremendous morphological diversity might be due to a genome duplication event specific to the teleost lineage (Amores et al., 1998; Chen et al., 2004; Meyer and Van de Peer, 2005; Taylor et al., 2003; Taylor et al., 2001a; Taylor et al., 2001b; Wittbrodt et al., 1998). The first indications of an actinopterygian-specific genome duplication came from studies based on Hox genes and Hox clusters, in particular those of the zebrafish and fugu (Figure 1.3). Extra Hox gene clusters were discovered in the zebrafish (Danio rerio), medaka (Oryzias latipes), the African cichlid (Oreochromis niloticus), and the pufferfish (Takifugu rubripes), which suggested that an additional genome duplication occurred in ray-finned fishes (Actinopterygii) before the divergence of most teleost species (Hoegg and Meyer, 2005; Prohaska and Stadler, 2004; Van de Peer, 2004; Van de Peer et al., 2003; Volff, 2005).

Figure 1.3: Hox clusters in currently sequenced genomes and hypothetical evolutionary ancestral conditions. While shark, coelacanth, tetrapods and also bichir have four clusters, teleost fish have at least seven clusters resulting from a fish-specific genome duplication event. (Figure from Hoegg and Meyer 2005 with permission).

Since gene and genome duplication events increase the amount of genetic material, it has been speculated that there is a relationship between gene copy number and morphological complexity and, by extension, also species diversity (Ohno, 1970). This implies that duplicated genes have diverged from the roles of their pre-duplication homologs. In Chapter VI this was demonstrated by an increase in evolutionary rate and/or by evidence for positive Darwinian selection. Duplicated genes may be redundant, which means that inactivation of one of the two duplicates might have little or no effect on the phenotype (Gibson and Spring, 1998; Lynch and Conery, 2000; Nowak et al., 1997). Therefore, since one of the copies is free from functional constraint, mutations in this gene might be selectively neutral and have the potential to turn the gene into a non-functional pseudogene. Alternatively, one of the duplicates might adopt a new function (Ohno, 1973). Although post-duplication secondary gene loss is a frequent event, ~20%-50% of paralogous genes are retained for longer evolutionary time spans after a genome duplication event (Lynch and Force, 2000; Postlethwait et al., 2000). Selection can prevent the loss of redundant genes (Gibson and Spring, 1999) if those genes encode components of multidomain proteins, because mutant alleles disrupt such proteins. A selective advantage due to a new unique function might be sufficient to retain this gene copy, to prevent degenerative substitutions, and thus to keep this functional gene copy from becoming a pseudogene. Also, positive Darwinian selection can be responsible for functional divergence between the duplicates (e.g., Duda and Palumbi, 1999; Hughes et al., 2000; Zhang et al., 1998). When a gene with multiple functions is duplicated, the duplicates are redundant only for as long as each copy retains the ability to perform all ancestral roles (Force et al., 1999; Hughes, 1994). When one duplicate experiences a mutation that prevents it from carrying out one of its ancestral roles, the other duplicate is no longer redundant. According to the duplication-degeneration-complementation (DDC) model (Force et al., 1999), degenerative mutations preserve rather than destroy duplicated genes, but they also change the genes' functions or at least restrict their original functions so that the copies become more specialized.

Gene deserts

Understanding how genes are distributed and organized in genomes of different sizes remains a major challenge of genome biology (Gregory, 2005). Already 20 years ago, Susumu Ohno (Ohno, 1985) postulated the desertification of the euchromatic region of the higher vertebrates’ genome owing to continuous gene duplication events followed by degeneration of newly emerged gene copies in their evolutionary history. Only with the first release of the complete sequence of the human genome in 2001 was Ohno’s prediction of the existence of such deserts confirmed (Lander et al., 2001; Venter et al., 2001); yet their functional or evolutionary significance remains to be explained. Due to the lack of protein-coding DNA, gene deserts seem to be devoid of any biological function. In spite of this, some gene deserts have been shown to contain regulatory regions for neighboring genes that function over large distances (Nobrega et al., 2003). Also, the observation of stable gene deserts with homologous flanking genes (often transcription factors), which are maintained over long evolutionary times (Ovcharenko et al., 2005), and the existence of numerous conserved non-genic sequences in mammalian genomes (Dermitzakis et al., 2005) suggest that gene deserts are not just genomic junkyards but, instead, might be of functional significance. By contrast, there are gene deserts that can be deleted without noticeable phenotypic effects (Nobrega et al., 2004).

When the number of genes per chromosome is plotted against chromosome length for 15 eukaryotic genomes, as shown in Chapter VII, a strong linear correlation can be documented in non-mammalian genomes. Mammalian genomes, on the other hand, do not show constant ratios of gene number per chromosome over chromosome length. It appears that the uneven distribution of gene deserts on mammalian chromosomes accounts for the deviations from otherwise constant ratios.

EverEST – A phylogenomic EST database approach

Published in Phyloinformatics (2004) 6: 1-4

“The next major explosion is going to be when genetics and computers come together.”

Alvin Toffler

II. EverEST – A phylogenomic EST database approach

2.1. Abstract

The use of local similarity search algorithms (e.g. BLAST) against public databases is often the method of choice for screening short expressed sequence tags (ESTs). The need for automation of processing, similarity analyses, and annotation of ESTs is highlighted in cases when several BLAST results should be analyzed in a combined approach.

EverEST is a WINDOWS program that automates database management and phylogenetic analyses of ESTs. Together with its interactive visualization tools and a variety of built-in data, EverEST effectively integrates data mining, annotation, and phylogenetic evaluation into one application. EverEST is constructed to maximize the evolutionarily relevant information that is contained in large amounts of DNA data while attempting to minimize computation time. The database package will be freely available at: http://www.evolutionsbiologie.uni-konstanz.de.

2.2. Introduction

Large-scale sequencing of partial cDNA clones as expressed sequence tags (ESTs) and similarity searches of these against public DNA and protein sequence databases is becoming a more widely-used method for gaining information on gene content and genomic complexity for many phyla (Hedges and Kumar, 2002). The use of local similarity search algorithms against the public databases is often the method of choice for screening short EST sequences, because they determine scores for only those regions conserved between sequences (Altschul et al., 1990). Furthermore, local similarities of translated EST sequences to the public protein databases can often be detected even when similarities to the public DNA databases appear coincidental (Gish and States, 1993). Several EST projects (Fizames et al., 2004; Renn et al., 2004; Whitfield et al., 2002) have used similarity searches for positive identification of known genes and the determination of putative functions of others. The most commonly used method for this is to run the BLAST similarity search programs for each EST. Hits from EST similarity searches are the basis for the inference of probable biological functions and also the evolutionary history, i.e., homology, whereas a lack of hits at least suggests the possibility of the discovery of a novel gene or convergence of genes.

The need for automation of processing, similarity analysis, and annotation of ESTs is highlighted in cases when several BLAST results are analyzed in a combined approach.

Here, we describe database software, which we term EverEST, for running simultaneous database searches using the BLAST algorithm against three databases to identify the best hits for any given EST sequence. In a further step, EST sequences are associated with BLAST results and phylogenetic analyses in a relational database using its own specific database management system.

2.3. The program

A flow-chart for our phylogenomic EST database is shown in Figure 2.1A. Most records in this flow-chart involve a ‘one-to-one relationship’, with the exception of the distance_matrix table, where a ‘one-to-many relationship’ exists. The distance_matrix table consists of all genetic distances between sequences in one entry of the alignment table, together with the distance method used. Figure 2.1A shows that the blast_results tables are joined to the query_sequence tables through the alignment table. The alignment table in turn links to the distance_matrix table and, through it, to the tree_file and triangle_coordinates tables.

Figure 2.1A: Entity relationship diagram describing the association between the query sequence, the BLAST results and the phylogenetic analyses results. Primary keys are indicated with an asterisk.

Figure 2.1B: Typical process for pre-processing and analyzing an EST sequence and for integrating the query and BLAST resources in the EverEST database.

The tree_file table consists of the tree representation in the Newick tree format and the result of a branch length test (Takezaki et al., 1995). The branch length test examines, for each sequence, how far its rate measured from the tree root deviates from the average rate of all sequences. This enables a linearized graphic representation of a group of ORF alignments in a ternary plot, for which all related data are stored in the triangle_coordinates table. The primary key is the query sequence ID, which associates every single EST in the database with BLAST results, related alignments, distance matrices, NJ trees, ternary coordinates and Ka/Ks ratios.
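The table layout described above can be illustrated with a minimal sketch. It is not the EverEST implementation (which uses its own database management system); it assumes SQLite via Python's standard sqlite3 module, and all column names other than the table names taken from the text are hypothetical. For simplicity, every table is keyed directly on the query sequence ID, and only distance_matrix holds several rows per query.

```python
import sqlite3

# Table names follow the text; all column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE query_sequence (
    query_id TEXT PRIMARY KEY,          -- one row per EST
    sequence TEXT NOT NULL
);
CREATE TABLE blast_results (
    query_id TEXT PRIMARY KEY REFERENCES query_sequence(query_id),
    best_hit_accession TEXT,
    best_hit_evalue REAL
);
CREATE TABLE alignment (
    query_id TEXT PRIMARY KEY REFERENCES query_sequence(query_id),
    fasta TEXT NOT NULL                 -- alignment stored as FASTA text
);
CREATE TABLE distance_matrix (          -- one-to-many: many pairs per alignment
    query_id TEXT NOT NULL REFERENCES alignment(query_id),
    method TEXT NOT NULL,               -- distance method used
    seq_a TEXT NOT NULL,
    seq_b TEXT NOT NULL,
    distance REAL NOT NULL,
    PRIMARY KEY (query_id, method, seq_a, seq_b)
);
CREATE TABLE tree_file (
    query_id TEXT PRIMARY KEY REFERENCES alignment(query_id),
    newick TEXT NOT NULL
);
CREATE TABLE triangle_coordinates (
    query_id TEXT PRIMARY KEY REFERENCES alignment(query_id),
    a REAL, b REAL, c REAL              -- ternary composition
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn
```

With this layout, all results for one EST can be collected by joining the tables on query_id.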

Figure 2.1B depicts the processes involved in populating the tables in Figure 2.1A. In the case of newly generated EST data, pre-processing includes base calling, filtering of low-quality sequences, identification of sequence features, and vector trimming. Automatic base calling and quality and vector trimming may be performed with PHRED (Ewing et al., 1998).

The input for EverEST consists of pre-processed, high-quality ESTs longer than 200 bp, which are screened by local BLAST searches with an expected value threshold of, e.g., < 1 x 10^-15 against several databases. These databases should be chosen based on completeness and taxonomic relatedness to the source of the cDNA. The BLAST interface of EverEST is command-line driven, follows the NCBI syntax for standalone BLAST, and needs FASTA input files (Pearson and Lipman, 1988). The results are parsed into the blast_results tables, which contain the GenBank accession numbers of the three best hits, the sequence of the best hit and its annotation, as well as the related E-values and information concerning translation frames.
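The filtering step can be sketched as follows. This is not the C++ parse routine of EverEST; it is a minimal Python illustration that assumes BLAST was run with tabular output (twelve tab-separated columns, with the E-value in column 11 and the bit score in column 12). The function name and its defaults are hypothetical; the threshold and the number of hits retained follow the text.

```python
from collections import defaultdict


def best_hits(lines, max_evalue=1e-15, top_n=3):
    """Return, per query, the top_n hits with an E-value below max_evalue.

    `lines` is an iterable of tab-separated BLAST tabular records:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, E-value, bit score.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        fields = line.split("\t")
        if len(fields) < 12:                  # ignore malformed records
            continue
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue < max_evalue:
            hits[query].append((subject, evalue, bitscore))
    # Best hit first: lowest E-value, ties broken by highest bit score.
    return {
        query: sorted(found, key=lambda h: (h[1], -h[2]))[:top_n]
        for query, found in hits.items()
    }
```

Queries with no hit below the threshold simply do not appear in the result, which makes it easy to flag ESTs without sufficient hits.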

The query sequence and up to three best hits of every single search are translated into amino acid code, combined by a Visual Basic routine, and aligned using the T-Coffee algorithm (Notredame et al., 2000), which is a component of the EverEST package. T-Coffee is a progressive multiple alignment program that considers information from all of the sequences during each step, not just those being aligned at that particular stage. The alignment is stored in the alignment table in the FASTA file format. Following the alignment, the sequence divergence for every pair in the alignment is estimated as the observed proportion of amino acid sites at which the two sequences to be compared are different. All alignment positions with gaps are excluded beforehand (complete deletion). This option is generally desirable because different regions of DNA or amino acid sequences often evolve under different evolutionary forces. The user can choose between two methods for estimating evolutionary distances: Poisson-correction distance and Gamma distance (Dayhoff, 1978).
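The distance estimation described above can be illustrated with a short sketch. It assumes the standard textbook forms of the two corrections, d = -ln(1 - p) for the Poisson correction and d = a[(1 - p)^(-1/a) - 1] for the Gamma distance with shape parameter a; the thesis does not state which shape parameter EverEST uses, so the default of 2.0 below is an assumption, as are the function names.

```python
import math


def complete_deletion(alignment, gap="-"):
    """Drop every alignment column that contains a gap in any sequence."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    kept = [col for col in zip(*alignment) if gap not in col]
    return ["".join(chars) for chars in zip(*kept)]


def p_distance(seq_a, seq_b):
    """Observed proportion of sites at which two aligned sequences differ."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def poisson_distance(p):
    """Poisson-correction distance for a proportion of differences p < 1."""
    return -math.log(1.0 - p)


def gamma_distance(p, shape=2.0):
    """Gamma distance; `shape` is the gamma parameter a (assumed value)."""
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)
```

For example, the three aligned peptides MKT-LV, MKSALV and MRTALV lose their fourth column under complete deletion, and the first two then differ at one of five sites (p = 0.2).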

The distances are used to construct a neighbor-joining tree and a ternary graphic representation as depicted in Figure 2.2. The ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks) is calculated by a subroutine to evaluate the selective forces acting on the protein (McDonald and Kreitman, 1991).
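As an illustration of the tree-building step, the following is a minimal sketch of the standard neighbor-joining algorithm that returns a Newick string. It is not the Visual Basic routine used in EverEST; the function name, the data layout (a square list-of-lists distance matrix) and the four-decimal branch-length format are assumptions made for this example.

```python
from itertools import combinations


def neighbor_joining(labels, matrix):
    """Build an unrooted neighbor-joining tree and return it in Newick format."""
    if len(labels) < 2:
        raise ValueError("at least two sequences are required")
    clusters = {i: label for i, label in enumerate(labels)}  # id -> Newick text
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)

    def d(a, b):
        return dist[frozenset((a, b))]

    while len(clusters) > 2:
        n = len(clusters)
        ids = list(clusters)
        # Net divergence of each cluster from all others.
        r = {a: sum(d(a, b) for b in ids if b != a) for a in ids}
        # Join the pair that minimizes Q = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(ids, 2),
                   key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]])
        dab = d(a, b)
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = next_id
        next_id += 1
        # Distances from the new internal node to every remaining cluster.
        for c in ids:
            if c not in (a, b):
                dist[frozenset((new, c))] = 0.5 * (d(a, c) + d(b, c) - dab)
        clusters[new] = f"({clusters[a]}:{len_a:.4f},{clusters[b]}:{len_b:.4f})"
        del clusters[a], clusters[b]

    # Two clusters remain; connect them with the remaining distance.
    a, b = clusters
    return f"({clusters[a]}:{d(a, b):.4f},{clusters[b]}:0.0000);"
```

Because the tree is unrooted, placing the whole final distance on one of the last two branches is only a drawing convention and does not change the topology.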

Figure 2.2A: Example of a ternary diagram of p-distances between the query and the BLAST results of three database searches, respectively.

Figure 2.2B: Examples of how the ternary diagram should be read. Note the numbers 1 - 4 on the diagram. The composition for each of these points is: 1 (60%/20%/20%), 2 (25%/40%/35%), 3 (10%/70%/20%), 4 (0%/25%/75%). Please note that the ternary diagram is read counterclockwise.
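How three distances map onto a point in such a diagram can be sketched as below. The example assumes that the three p-distances are first normalized to fractions summing to one and then converted from barycentric to Cartesian coordinates in an equilateral triangle with unit side length; the placement of the three corners is an arbitrary choice for this sketch and need not match the orientation used by EverEST.

```python
import math


def to_composition(d1, d2, d3):
    """Normalize three non-negative distances to fractions that sum to 1."""
    total = d1 + d2 + d3
    if total <= 0:
        raise ValueError("at least one distance must be positive")
    return d1 / total, d2 / total, d3 / total


def ternary_to_xy(a, b, c):
    """Convert a ternary composition (a + b + c = 1) to plot coordinates.

    Assumed corners: a = 100% at (0, 0), b = 100% at (1, 0),
    c = 100% at (0.5, sqrt(3) / 2).
    """
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2.0) * c
    return x, y
```

Under these assumptions, point 4 of Figure 2.2B (0%/25%/75%) lies on the edge between the second and third corners, at x = 0.625.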

The EverEST database was recently implemented on WINDOWS XP and 2000. It requires a preinstalled standalone BLAST version with at least the two components blastall and formatdb. The parse routine is written in C++ and has been successfully tested on several variants of the BLAST output files, e.g., BLASTN, BLASTP or TBLASTX results. The other utilities supporting the database (e.g., divergence calculation, NJ tree construction and barycentric construction) were developed in Visual Basic. To visualize the results, several support features are available, e.g., output of trees in Newick format or output of distance matrices in tab-delimited text format. The ternary graphic window allows the user to select single ORF groups and obtain all related information in the database with a simple mouse click.

Configurability was one of the main design issues in the development of EverEST. Its use is not limited to particular query sequences or databases.

The flow-chart presented here (Figure 2.1B) is constructed to maximize the evolutionarily relevant information that is contained in large amounts of DNA data while attempting to minimize computation time. The system is geared towards as much automation as possible, while maintaining, organizing and flagging all results for individual inspection. Moreover, increased automation reduces the number of error-prone steps. By reducing errors in data processing and manipulation, the quality of data submitted to the public databases should be increased. The system we have described here is functional and in use on a regular basis; however, it is still evolving. The areas that we continue to address include the implementation of the data pre-processing, enhancements in putative function assignment, and automatic GenBank submission after successful annotation. For those ESTs that have insufficient hits, methods for automatically re-executing similarity searches periodically as the public databases are updated will be developed.

Novel relationships among ten model fish species supported by phylogenomics with ESTs

Accepted for publication in Journal of Molecular Evolution (2006)

“The biggest fish he ever caught were those that got away.”

Eugene Field

III. Novel relationships among ten model fish species supported by phylogenomics with ESTs

3.1. Abstract

The power of comparative phylogenomic analyses depends, among other things, on the amount of data. We used Expressed Sequence Tags (ESTs) from model fish species as a proof-of-principle approach in order to test the reliability of using ESTs for phylogenetic inference. As expected, the accuracy increases with the number of sequences. Although some progress has been made in the elucidation of the phylogeny of teleosts, relationships among the main lineages of the derived fishes (Euteleostei) remain poorly defined and are still debated. We performed a phylogenomic analysis of 42 sets of orthologous genes from ten available fish model systems representing seven orders (Salmoniformes, Siluriformes, Cypriniformes, Tetraodontiformes, Cyprinodontiformes, Beloniformes and Perciformes) of euteleosts to estimate divergence times and to examine the evolutionary relationships among those lineages. All ten fish species serve as models for developmental, aquaculture, genomic and comparative genetic studies. The phylogenetic signal and the strength of the contribution of each of the 42 genes were estimated with randomly chosen data subsets. Our study revealed a molecular phylogeny of higher-level relationships of derived teleosts, which indicates that the use of multiple genes produces robust phylogenies. Our phylogenomic analyses confirm that the euteleostean superorders Ostariophysi and Acanthopterygii are monophyletic and that the Protacanthopterygii and Ostariophysi are sister clades. In addition, and contrary to the traditional phylogenetic hypothesis, our analyses determine that killifish (Cyprinodontiformes), medaka (Beloniformes) and cichlids (Perciformes) appear to be more closely related to each other than any of them is to pufferfish (Tetraodontiformes). All ten lineages split before or during the fragmentation of the supercontinent Pangea in the Jurassic.

3.2 Introduction

The relative importance of increasing the number of analyzed taxa and the number of characters for accuracy of phylogenetic inferences remains an issue of debate (Cummings and Meyer, 2005; Gadagkar et al., 2005; Hillis, 1998; Hillis et al., 2003; Rosenberg and Kumar, 2003). Large-scale phylogenetic analyses inevitably involve a trade-off between taxon sampling and gene sampling. However, recent simulation and empirical studies suggest that increased gene sampling, in general, might have a greater beneficial effect on the rigor of the estimation of phylogenetic topologies than more extensive taxon sampling (Mitchell et al., 2000; Rokas and Carroll, 2005; Rosenberg and Kumar, 2001). The benefits of sampling several independent gene genealogies to infer an organismal phylogeny with confidence are widely recognized (Chen et al., 2004; Cummings et al., 1995; Takezaki et al., 2003) because a more complete representation of the whole genome is highly desirable and stochastic errors occurring in data with small sample size will decrease with increasing sample size.

Comparative phylogenomic analyses using Expressed Sequence Tags (ESTs) from taxa across the spectrum of animal diversity promise to yield reliable and robust results. ESTs also provide an economical approach to identify large numbers of genes that can be used in gene expression and phylogenomic studies (Gerhold and Caskey, 1996; Renn et al., 2004).

Because of their usefulness, the rapid automated way of data collection, and the relatively low costs associated with this technology, many individual scientists as well as large genome sequencing centers have generated large numbers of ESTs that are publicly available and their numbers continue to increase rapidly.

However, the use of ESTs for phylogenetic analyses is limited to the rather small number of species for which EST and genome projects have been conducted. To test the power of multi-locus approaches to reveal phylogenies, it is therefore necessary to choose a group of species for which many EST data are available, e.g., the teleost fishes, and to conduct analyses based on different approaches, such as Bayesian inference and maximum likelihood, to overcome possible pitfalls of any one particular method. A recent theoretical study (Mossel and Vigoda, 2005) revealed that Bayesian MCMC methods for phylogeny reconstruction can be misleading when the data are generated from a mixture of datasets. Thus, for data sets that contain potentially conflicting phylogenetic signals, phylogenetic reconstruction should be performed separately on each subset, according to Mossel and Vigoda (2005).

There are more than 25,000 species of teleost fishes amounting to nearly half of the extant vertebrate species and about 96% of all extant fishes are classified as teleosts (Nelson, 1994). Since the pioneering work on the systematics of fishes by Greenwood et al. (1966), many studies have proposed novel hypotheses about the relationships among basal teleosts, but the relationships among the derived teleosts are still debated. One particular species-rich monophyletic group of derived teleosts is the Euteleostei currently ranked as one of the four subdivisions of the Teleostei, along with the more basal groups, Osteoglossomorpha, Elopomorpha, and Clupeomorpha (Arratia, 1999; De Pinna, 1996; Nelson, 1994).

The Euteleostei are the most derived and species-rich group of teleost fishes, comprising approximately 16,000 species. These are placed in 32 orders and nine superorders.

Currently used fish model species for developmental, genomic and comparative genetic studies are assigned to three superorders of the Euteleostei, the Ostariophysi, the Protacanthopterygii and the Acanthopterygii (Table 3.1). Ostariophysi are basal euteleosts characterized by the presence of the Weberian apparatus. Protacanthopterygii is a superorder that was established by Greenwood et al. (1966) and originally included a wide array of basal euteleosts. Since then, Rosen and Patterson (1969), Rosen and Greenwood (1970) and Rosen (1973, 1974) have repeatedly removed several of the orders that were originally included in the Protacanthopterygii. The resulting Protacanthopterygii (sensu Rosen, 1974) is the basis of subsequent discussions on monophyly, inter-, and intrarelationships (e.g., Fink and Fink, 1996; Ishiguro et al., 2003). By far the most diverse lineage among the euteleosts is the Acanthopterygii (spiny-rayed fishes), comprising approximately 14,800 species, in which both the dorsal and pelvic fins have true fin spines as well as rays. The majority of acanthopterygians have ctenoid scales, thoracic pelvic fins and protrusible jaws. Within the Acanthopterygii, Johnson and Patterson (1993) assigned five orders (Perciformes, Dactylopteriformes, Scorpaeniformes, Pleuronectiformes and Tetraodontiformes) to a single clade as the putative sister group of the Smegmamorpha (which contains the lineages Synbranchiformes, Mugiloidei, Elassomatidae, Gasterosteiformes and the Atherinomorpha).

Table 3.1: List of species used in this study. The classification follows Nelson (1994).

Class Actinopterygii (23,681 species, 42 orders)
  Division Teleostei (23,637 species, 38 orders)
    Subdivision Euteleostei (22,262 species, 32 orders)
      Superorder Acanthopterygii (13,414 species, 13 orders)
        Order Beloniformes (191 species, 5 families)
          Oryzias latipes
        Order Cyprinodontiformes (807 species, 8 families)
          Fundulus heteroclitus
        Order Perciformes (9,293 species, 148 families)
          Haplochromis sp.
        Order Tetraodontiformes (339 species, 9 families)
          Takifugu rubripes
          Tetraodon nigroviridis
      Superorder Ostariophysi (6,507 species, 5 orders)
        Order Cypriniformes (2,662 species, 5 families)
          Cyprinus carpio
          Danio rerio
        Order Siluriformes (2,405 species, 34 families)
          Ictalurus punctatus
      Superorder Protacanthopterygii (312 species, 3 orders)
        Order Salmoniformes (66 species, 1 family)
          Oncorhynchus mykiss
          Salmo salar

Much controversy persists over the interrelationships among teleosts. The euteleost origin dates back to about 290 million years ago (Inoue et al., 2005; Kumazawa et al., 1999), and due to the extensive variation not only in morphology but also in behavior, ecology, and physiology (see Helfman et al., 1997), it is not surprising that comparative anatomical approaches have faced a number of difficulties (e.g., lack of applicable characters for phylogenetic analyses, and difficulties in the homology assessment among characters). The same is true for molecular studies (Miya and Nishida, 2000; Stepien and Kocher, 1997) that used shorter (mostly mitochondrial) DNA sequences (mostly < 1000 positions) based on limited taxonomic representation. However, it is highly desirable to establish the relationships among the fish model systems in order to be able to interpret comparative genomic and developmental processes within the correct phylogenetic framework. It appears that adequate resolution of higher-level relationships among distantly related lineages will require longer stretches of DNA (e.g., Miya et al., 2003), amino acid sequences (e.g., Hoegg et al., 2004) or DNA data sets based on multiple loci (e.g., Chen et al., 2004; Simmons and Miya, 2004; Takezaki et al., 2004).

In order to increase the size of the gene sample available for phylogenetic analysis we took advantage of two complete actinopterygian fish genomes and collections of ESTs available from public databases. In the present study, 42 concatenated amino acid sequences retrieved by similarity searches against public DNA and protein sequence databases were used to address the question of the relationships of derived teleosts and to test the power of multi-locus approaches to reveal phylogenies. The phylogenetic signal and the strength of the contribution of each of the 42 genes were estimated with randomly chosen data subsets. Our study revealed a molecular phylogeny of higher-level relationships of derived teleosts, which indicates that the use of multiple genes produces robust phylogenies. The phylogeny was used to estimate divergence times and to examine the evolutionary history of the component lineages within the teleostean fishes.

3.3 Materials and Methods

Data collection

Cichlid EST sequences generated by us (Chapter II) and in a previous study (Watanabe et al., 2004) were screened against GenBank EST data of Cyprinus carpio (Cypriniformes), Fundulus heteroclitus (Cyprinodontiformes), Ictalurus punctatus (Siluriformes), Oncorhynchus mykiss (Salmoniformes), Oryzias latipes (Beloniformes), Salmo salar (Salmoniformes), and Tetraodon nigroviridis (Tetraodontiformes). All of these species are important fish model species. We also used protein data of Danio rerio (Zebrafish Sequencing Group at the Sanger Institute) and Takifugu rubripes (JGI Fugu v3.0), and genome data of Homo sapiens (GenBank). Homo sapiens was used as the most closely related outgroup with available data. We used EverEST (Steinke et al., 2004), a program for running simultaneous BLAST-based searches against all of the abovementioned databases, to identify the best hits for any given cichlid EST sequence. EverEST was also used to assign query sequences to matched BLAST results. Sequences were assigned to a cichlid query gene only if they were recovered as “best hits” in a translated BLAST routine using the standard vertebrate code with an e-value < 10^-50. The sequences were aligned using the T-Coffee algorithm (Notredame et al., 2000). Forty-two genes were found to be present in all eleven databases for all taxa and conserved enough to allow an unambiguous alignment. The accession numbers of the analyzed sequences and the number of amino acids used for the phylogenetic analyses are listed in Appendix A1. Gene sequences were concatenated to form a super-gene alignment with a total length of 7726 amino acid positions.
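The two bookkeeping steps described above, keeping only best hits below the e-value cutoff and joining per-gene alignments into one super-gene alignment, can be sketched as follows. This is a minimal illustration and not the EverEST implementation: the tuple layout of a hit, the function names and the toy sequences are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical hit layout: (query EST, database, subject sequence, BLAST e-value)
Hit = Tuple[str, str, str, float]


def best_hits(hits: List[Hit], max_evalue: float = 1e-50) -> Dict[Tuple[str, str], Hit]:
    """Keep, per query and database, the lowest e-value hit below the cutoff."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        query, database, _subject, evalue = hit
        if evalue >= max_evalue:
            continue  # fails the e-value < 10^-50 criterion
        key = (query, database)
        if key not in best or evalue < best[key][3]:
            best[key] = hit
    return best


def concatenate_alignments(gene_alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-gene alignments (taxon -> aligned sequence) into one supermatrix."""
    taxa = sorted(gene_alignments[0])
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for index, alignment in enumerate(gene_alignments):
        if set(alignment) != set(taxa):
            raise ValueError(f"gene {index}: taxon set differs from the first gene")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError(f"gene {index}: sequences differ in aligned length")
        for taxon in taxa:
            parts[taxon].append(alignment[taxon])
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy input, invented for illustration only
toy_hits = [
    ("est1", "danio", "d_a", 1e-80),
    ("est1", "danio", "d_b", 1e-60),
    ("est1", "fugu", "f_a", 1e-20),  # too weak, is discarded
]
kept = best_hits(toy_hits)

toy_genes = [
    {"Danio": "MKV-L", "Takifugu": "MKVAL", "Homo": "MRVAL"},
    {"Danio": "GGST", "Takifugu": "GAST", "Homo": "GA-T"},
]
supermatrix = concatenate_alignments(toy_genes)
```

Requiring the same taxon set in every gene mirrors the criterion that only genes present in all eleven databases were retained.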

Phylogenetic Analyses

Neighbor-joining (NJ) and maximum parsimony (MP) analyses of the combined amino acid alignment were performed with PAUP* v. 4.0b10 (Swofford, 2002). Maximum likelihood (ML) analyses were performed using PHYML (Guindon and Gascuel, 2003). The best-fitting models of sequence evolution for ML were obtained with ProtTest 1.2 (Abascal et al., 2005).

Confidence in the estimated relationships of the NJ, MP and ML tree topologies was evaluated by bootstrap analysis with 2,000 replicates (Felsenstein, 1985) and by Bayesian methods of phylogenetic inference (Larget et al., 2005). Bayesian analyses were initiated with random seed trees and were run for 200,000 generations. The Markov chains were sampled at intervals of 100 generations with a burn-in of 1000. Bayesian phylogenetic analyses were conducted with MrBayes 3.0b4 (Huelsenbeck and Ronquist, 2001) using the Whelan and Goldman model (2001). Alternative topologies were compared by applying the approximately unbiased test (Shimodaira, 2002) as implemented in the CONSEL package (Shimodaira and Hasegawa, 2001), using the site-wise likelihood values estimated by PAML (Yang, 1997).
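The nonparametric bootstrap used for the NJ, MP and ML analyses resamples alignment columns with replacement. The sketch below shows only that resampling step on an invented toy alignment; it is not the PAUP* or PHYML code, and the function name, taxon labels and seed are assumptions made for illustration.

```python
import random
from typing import Dict


def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Draw alignment columns with replacement until the original length is reached."""
    n_sites = len(next(iter(alignment.values())))
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every taxon, so columns stay intact.
    return {taxon: "".join(seq[i] for i in sampled) for taxon, seq in alignment.items()}


# Toy alignment, invented for illustration only
toy_alignment = {"Danio": "MKVAL", "Takifugu": "MRVAL", "Homo": "MKIAL"}
rng = random.Random(42)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(5)]
```

In a full analysis a tree would be inferred from each replicate, and the support for a node would be the percentage of replicate trees that contain it.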

In order to test the phylogenetic signal and the contribution of each of the 42 genes to the general topology, we randomly selected 100 subsets, each containing six EST groups, and constructed maximum likelihood trees for every subset and every single gene using PHYML with the model settings estimated as described above. The subset size of six represents a trade-off between computational effort and the likelihood of sampling every possible pair of loci. The number of locus pairs represented in correct subset gene trees was counted and used for a graphical matrix representation. The number of subset gene trees supporting the basal dichotomy was then used to evaluate the contributing loci. We also calculated the number of single gene trees supporting a given partition of the general topology (see Gadagkar et al., 2005). In addition, we generated six subsets of seven loci each, with the loci grouped according to their amino acid substitution (divergence) rates. This analysis was performed to test the relationship between divergence rate and topology by constructing maximum likelihood trees as described above.
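The subset design can be made concrete with a short sketch. With 42 loci there are 861 possible locus pairs, and each subset of six contains 15 pairs, so 100 subsets provide 1,500 pair occurrences. The code below draws such subsets, counts how often each pair co-occurs, and splits loci into rate-ordered groups of seven. The seed and the substitution rates are placeholders invented for the example, not values from this study, and the tree-building step is omitted.

```python
import random
from itertools import combinations
from math import comb
from typing import Dict, List, Tuple

N_LOCI, SUBSET_SIZE, N_SUBSETS = 42, 6, 100


def draw_subsets(rng: random.Random) -> List[List[int]]:
    """Draw N_SUBSETS random subsets of SUBSET_SIZE distinct loci (numbered 1..N_LOCI)."""
    return [sorted(rng.sample(range(1, N_LOCI + 1), SUBSET_SIZE)) for _ in range(N_SUBSETS)]


def pair_counts(subsets: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Count how often each pair of loci occurs together in a subset."""
    counts: Dict[Tuple[int, int], int] = {}
    for subset in subsets:
        for pair in combinations(subset, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def rate_sorted_groups(rates: Dict[int, float], group_size: int = 7) -> List[List[int]]:
    """Order loci by substitution rate and cut the list into consecutive groups."""
    ordered = sorted(rates, key=rates.get)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


rng = random.Random(1)  # arbitrary seed
subsets = draw_subsets(rng)
counts = pair_counts(subsets)
possible_pairs = comb(N_LOCI, 2)         # 861
pairs_per_subset = comb(SUBSET_SIZE, 2)  # 15

# Placeholder rates drawn at random within the reported range (0.01 to 0.31)
placeholder_rates = {locus: rng.uniform(0.01, 0.31) for locus in range(1, N_LOCI + 1)}
groups = rate_sorted_groups(placeholder_rates)
```

To build the matrix representation, one would restrict `pair_counts` to those subsets whose maximum likelihood tree matches the reference topology.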

Molecular Clock

To estimate a local molecular clock, we employed a method of age estimation with optimization via the truncated Newton (TN) method, as implemented in r8s (Sanderson, 2003). The TN algorithm tolerates age constraints. Divergence time algorithms require at least one internal node to be fixed or constrained. We used three dates: 55 MYA, marking the earliest known fossil evidence for the Tetraodontidae (Berg, 1958); 75 MYA, for the earliest known fossil evidence of the Salmonidae (Resetnikov, 1988); and 50 MYA, as the age of the last common ancestor of Danio and Cyprinus (Cavender, 1991; Kruiswijk et al., 2002). The first two calibration nodes were only constrained by the max_age function in r8s; the latter was fixed because the fossil represents the last common ancestor of both lineages. Based on these fossil calibrations, trees were constrained at the basal node at 290 MYA, the date at which pufferfish and zebrafish, a representative of the most basal lineage in this study, shared a last common ancestor. This estimate is based on a previous calibration from molecular data (Inoue et al., 2005; Kumazawa et al., 1999), and the resulting dates should therefore be treated with caution. Confidence intervals were assessed by means of a bootstrap approach.

We simulated 25 bootstrap matrices with Seqboot (PHYLIP 3.63 package; Felsenstein, 1989) and, for each matrix, constructed a maximum likelihood tree. The resulting trees were then analyzed with r8s as described above. The minimum and maximum values reported are the minimum and maximum age estimates obtained from these bootstrap matrices. Consistency between fossil and molecular age estimates for the three fossil calibration points was examined by using the fossil cross-validation method (Near et al., 2005; Near and Sanderson, 2004). The calibration points are approximately equally accurate because the magnitude of the squared deviation decreases by only a small fraction as fossils are removed.
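Summarizing the bootstrap age estimates amounts to collecting, for each node, the ages obtained from the replicate trees and reporting their spread. The sketch below assumes the replicate ages have already been exported from the dating program into a dictionary; the node label and the five ages are invented for illustration, and the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_ages(replicate_ages: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per node: mean, sample standard deviation, minimum and maximum age (MYA)."""
    return {
        node: {
            "mean": mean(ages),
            "sd": stdev(ages),  # sample standard deviation (assumed estimator)
            "min": min(ages),
            "max": max(ages),
        }
        for node, ages in replicate_ages.items()
    }


# Hypothetical ages for one node from five replicates (the study used 25)
toy_ages = {"node_A": [100.0, 110.0, 90.0, 105.0, 95.0]}
summary = summarize_ages(toy_ages)
```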

3.4 Results

The alignment of the dataset consisted of 42 orthologous groups of eukaryotic protein fragments of ten teleost species and one outgroup species. The total length of the combined dataset was 7726 amino acid positions.

Maximum parsimony, maximum likelihood, neighbor joining and Bayesian inference analyses produced congruent tree topologies. The phylogenetic analyses of the complete dataset (Figure 3.1) strongly supported the monophyly of the teleost fishes used in this study.

Our analyses recovered two major clades in the teleosts. The first clade includes members of the Salmoniformes, Siluriformes and Cypriniformes and is supported by high bootstrap and posterior probability values. Within this clade, the representatives of the Salmoniformes (Oncorhynchus mykiss and Salmo salar) appeared as the sister group to a clade comprising the Siluriformes (Ictalurus punctatus) and the Cypriniformes (Cyprinus carpio and Danio rerio).

In the second clade, the representatives of the Tetraodontiformes (Tetraodon nigroviridis and Takifugu rubripes) were placed as sister group to a clade formed by the Cyprinodontiformes, Beloniformes and Perciformes. All nodes in this clade were strongly supported as well. The members of the Perciformes (Haplochromis sp.) and Beloniformes (Oryzias latipes) formed a monophyletic group, and the representative of the Cyprinodontiformes (Fundulus heteroclitus) branched basal to this clade.

Figure 3.1: Phylogeny based on a combined data set of 42 loci with a total of 7726 amino acid positions. Values above branches indicate posterior probabilities (MrBayes; upper value of quartet) and bootstrap values from maximum likelihood (PHYML; second value of quartet). Numbers below branches represent bootstrap values from neighbor joining (third value of quartet) and maximum parsimony (both PAUP*; lowest value of quartet). Numbers to the right of a node represent estimated ages in MYA, calculated using the local molecular clock method of age estimation with optimization via the truncated Newton method in r8s (Sanderson, 2003). Confidence intervals were assessed by means of a bootstrap approach with 25 replicates. Calibration points are indicated with an asterisk. Numbers in a circle to the left of a node represent the percentage of single gene trees supporting that node (see Gadagkar et al., 2005).

Comparing different topologies within the euteleost fishes with the approximately unbiased test significantly ruled out the possible alternative superorder relationships (Protacanthopterygii (Ostariophysi + Acanthopterygii)) and (Ostariophysi (Protacanthopterygii + Acanthopterygii)). Thus, a sister group relationship between the Ostariophysi and Acanthopterygii, or between the Protacanthopterygii and Acanthopterygii, was rejected (Table 3.2).

Table 3.2: Comparison of the likelihood values of different topologies among the different superorders within the euteleosts, applying the approximately unbiased test. Note that the first topology is the maximum likelihood tree. Abbreviations: Acanthopterygii (Acan), Ostariophysi (Osta), Protacanthopterygii (Prot), likelihood (loglk), difference of likelihood (∆loglk), P value (P).

Topology                loglk         ∆loglk      P (approx. unbiased test)
(Acan (Prot + Osta))    -58999.377       0.000
(Prot (Osta + Acan))    -59234.660    -235.283    0.002
(Osta (Prot + Acan))    -59456.775    -457.398    0.001
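The ∆loglk column is the log-likelihood of each topology minus that of the maximum likelihood tree. A short check using the values from Table 3.2 reproduces the reported differences; the dictionary layout and variable names are assumptions made for this example, and the P values are not recomputed because they require the site-wise likelihoods.

```python
# Log-likelihoods as listed in Table 3.2
loglk = {
    "(Acan (Prot + Osta))": -58999.377,
    "(Prot (Osta + Acan))": -59234.660,
    "(Osta (Prot + Acan))": -59456.775,
}

best = max(loglk.values())  # the maximum likelihood topology has the highest value
delta = {topology: round(value - best, 3) for topology, value in loglk.items()}
```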

The relative ages of the main clades within the teleostean fishes were also estimated from our molecular clock analyses (Figure 3.1). The split between the Ostariophysi/Protacanthopterygii clade and the Acanthopterygii was dated to the early Triassic (approximately 217 ± 4 MYA), whereas all other splits were estimated to have occurred in the Jurassic. Based on our calibrations, the split between the Cypriniformes and the Siluriformes was estimated to have occurred at 141 ± 4 MYA. The time estimate for the split between the Tetraodontiformes and all other acanthopterygian species was 195 MYA, whereas the Cyprinodontiformes diverged from the latter group 153 ± 16 MYA. The divergence between the cichlids and the Beloniformes was dated to 113 ± 11 MYA. Based on our calibration points, these age estimates are relatively robust: the ages estimated from the 25 bootstrap trees, for which we repeated the age-estimation procedure outlined above, show a standard deviation of at most 16 million years, and a fossil cross-validation (Near et al., 2005; Near and Sanderson, 2004) showed no inconsistent molecular age estimates. Although our results correspond well with recent studies, all molecular clock estimates should be treated with caution, because we used a molecular calibration (290 MYA for the last common ancestor of zebrafish and pufferfish) to constrain the basal node.

The strength of the phylogenetic signal and the contribution of each subset of genes to the general topology in the 100 random subsets of six EST loci each are rather weak, as depicted in Figure 3.2 by a matrix representation of how often loci occur in ‘correct’ topologies. Only a few loci (e.g., EST 11 or EST 42) showed enough resolution to reproduce the basal dichotomy or at least one of the three subgroups (superorders); however, the complete topology depicted in Figure 3.1 was not recovered with any of the 100 subsets. The proportion of single gene trees supporting a given partition of the general topology ranges from 10 to 71% (Figure 3.1), with terminal nodes being correctly inferred more often than basal nodes. Substitution rates of loci with high (e.g., EST 11 or EST 42) and low (EST 7 or EST 12) phylogenetic signal were similar and ranged from 0.13 to 0.17. The amino acid substitution rate among all loci ranged from 0.01 to 0.31 (Figure 3.3).

Phylogenetic analyses of six subsets of seven loci each, grouped according to amino acid substitution rate, supported the two major clades even at low substitution rates. However, relationships within the two clades varied with the substitution rate (Figure 3.3), and the overall topology depicted in Figure 3.1 was not recovered. The topologies were nevertheless similar to that recovered from the analysis of the combined data set whenever loci with lower amino acid substitution rates were used. Figure 3.4 shows that the length of the EST groups used does not correlate (R² = 0.0081) with the amino acid substitution rates, and we therefore conclude that the analyses of subsets are not biased by length differences.
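The R² value quoted for Figure 3.4 is the squared Pearson correlation between alignment length and substitution rate per locus. A minimal sketch of that calculation is given below; the function name and the two small data sets are invented to show the extreme cases and are not the values plotted in Figure 3.4.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Invented examples: a perfectly linear relationship and an uncorrelated one
perfectly_linear = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
uncorrelated = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
```

A value close to zero, as reported here, means that length explains almost none of the variation in substitution rate among loci.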

Figure 3.2: Matrix plot of EST pairs from 100 simulated subsets containing six loci each. The gray scale corresponds to the percentage of maximum likelihood topologies congruent with the topology depicted in Figure 3.1. The histogram below depicts the absolute number of appearances of single genes in congruent topologies. The trees on the right exemplify two maximum likelihood trees of pairs with high phylogenetic signal (EST 11 + 42) and low phylogenetic signal (EST 7 + 12). Numbers correspond to first column in Table 3.1.

Figure 3.3: Rates of amino acid substitution for the 42 ORF groups in ascending order. The trees above show estimated maximum likelihood topologies for six subsets each containing seven loci.

Figure 3.4: Plot of amino acid substitution rates against the number of amino acids used for each gene.

3.5 Discussion

Implications for multi locus phylogenies

The approach used in this study led to a well-supported hypothesis of novel evolutionary relationships among the euteleostean fishes. The sampling of multiple genes with a comparatively large number of sequence positions is likely to improve phylogenetic robustness (Lake and Moore, 1998). The large numbers of ESTs being produced through automated sequencing technologies are therefore likely to provide scientists with sufficient data to calculate reliable multi-locus phylogenies. The power of comparative phylogenomic analyses using ESTs from different taxa is a function of the amount of data available. Here we were able to show that robustness increases with the number of available sequences, independent of their length and rate of amino acid substitution (Figure 3.4). Remarkably, all analyses produced congruent tree topologies with confidence values not lower than 82 (Figure 3.1). Given that the simulated subsets of six EST groups were not able to recover the phylogeny of the concatenated dataset, we conclude that comparably high numbers of loci are needed to infer robust phylogenies from EST-based studies among distantly related taxa. The number of single gene trees supporting partitions of the general topology corroborates this observation (Figure 3.1). Only terminal nodes could be resolved with single gene trees. The discrepancy between the high confidence values in the combined analysis and the low number of single gene trees supporting the general topology shows that multi-locus analyses perform better than single-locus analyses in resolving higher-level relationships among distantly related lineages, as has been suggested repeatedly.

However, the single gene trees are based on relatively short amino acid sequences because we used EST data, which are usually not longer than ~600 bp. Single-locus datasets might therefore not contain sufficient information to produce robust phylogenies. Although each of the loci included in the 42-gene set was carefully screened for orthology among the sequences derived from the different species, the possibility of unrecognized paralogy at a few of the loci cannot be fully excluded. The contribution of such paralogous sequences could not have been large enough to influence the phylogenetic reconstruction of the concatenated dataset; however, it might have been large enough to influence the reconstruction of some single gene trees.

Part of the increase in accuracy afforded by concatenating multiple genes is due to the fact that many branches in individual gene trees may have experienced only a few substitutions. Adding genes to a dataset by concatenation increases the absolute number of evolutionary changes on such branches and makes it possible to infer them with greater accuracy. Furthermore, an overall increase in sequence length leads to an overall smaller variance in evolutionary rates and other parameters in model-based methods. Therefore, it may be better not to discard genes producing incongruent phylogenies, as they may provide additional information for resolving some short branches (Rokas et al., 2003; Shevchuk and Allard, 2001). On the other hand, if individual gene trees contain systematic errors that result in similar (but erroneous) phylogenies, then the use of congruent phylogenies may actually reinforce this error. Although we did not make an effort to account for large variations in evolutionary rates, sequence length, transition–transversion ratio, and base composition (G+C content) among the single sequences, the concatenated dataset performed well. This indicates that the increase in phylogenetic signal, or signal-to-noise ratio, due to concatenation is much higher than any bias introduced by applying a single substitution pattern to the entire concatenated sequence. It is possible that the use of gene-specific evolutionary models in a partitioned approach may improve the accuracy of concatenated sequence analysis, but this is to date not possible with the available methods and software given the number of loci used in this study.

Implications for the teleost phylogeny

The results of phylogenomic analyses based on 42 orthologous groups of nuclear protein-coding genes confirmed the basal placement of Ostariophysi and Protacanthopterygii but revealed some unexpected relationships among acanthopterygian species. Recent molecular studies have demonstrated that Ostariophysi and Protacanthopterygii are sister groups (Ishiguro et al., 2003; Saitoh et al., 2003), a finding that was confirmed in this study.

According to the molecular clock analyses, the basal divergence of the Ostariophysi and Protacanthopterygii took place no later than the middle Triassic (213-221 MYA). Pangean separation in the middle Jurassic may have been responsible for the present geographic patterns, in which cypriniform fishes show a largely Laurasian distribution whereas siluriform fishes are likely to have originated in Gondwanaland, leading to their present South American distribution on the one hand and to African lineages that subsequently dispersed into the Eurasian continent following land connections or accretion on the other (Saitoh et al., 2003). All members of the Ostariophysi share four or five modified vertebrae, aiding in hearing, which connect the swim bladder to the inner ear and convey pressure changes and sound (Weberian apparatus). Basal lineages maintained an adipose fin posterior to the dorsal fin, which is considered to be the ancestral character state for euteleosts. This enigmatic fin is not found in all basal euteleosts, however: esociforms and alepocephaloids, for example, lack it (Johnson et al., 1996), and it has most likely been lost secondarily in these groups. In all other lineages of the Ostariophysi, especially in basal orders such as the Characiformes, an adipose fin is usually present.

The molecular data support a close relationship between the Atherinomorpha (Beloniformes and Cyprinodontiformes) and a representative (Haplochromis) of the Percomorpha, a sister group of the Smegmamorpha (Johnson and Patterson, 1993). The monophyly of the Smegmamorpha is not supported by the present study or any previous molecular phylogeny (Chen et al., 2003; Miya et al., 2003; Wiley et al., 2000). Ancestral features among the atherinomorphs, like a protrusible upper jaw or flexible spines on dorsal and pelvic fins in abdominal or subabdominal position, are shared with basal teleosts (Nelson, 1994). These features could be the result of a secondary loss that occurred during the evolution of ray-finned fish, because Oryzias and Fundulus are nested among perch-like fish such as Takifugu, Tetraodon and Haplochromis, just as recently hypothesized by Chen et al. (2004).

The splits in the Acanthopterygii group correspond well with the beginning breakup of Laurasia and the enlarging Turgai Sea in the Jurassic, except for the split of the tetraodontiform lineage (max. 216 MYA). Most of the tetraodontiform families are found in warm and temperate marine waters worldwide, with a few families absent from the Atlantic and eastern Pacific. The earliest known fossil evidence for the Tetraodontidae (Berg, 1958; Santini and Tyler, 2003) dates back to the early Tertiary. The relatively long branches among the Tetraodontidae as depicted in Figure 3.1 might be the result of independent and unique evolution along this lineage leading to rather compact genomes (Aparicio et al., 2002; Jaillon et al., 2004). Extant species of killifish and cichlids show a Gondwanan distribution (Streelman et al., 1998; Zardoya et al., 1996) that is in concordance with our paleo-phylogenetic reconstructions. The majority of the beloniform species are found in marine waters worldwide; the family Adrianichthyidae and members of the Belonidae are known to be secondary freshwater fishes with Gondwanan distribution (Collette, 2003).

3.6 Conclusion

We showed that multi-gene EST phylogenies represent a powerful method to increase the robustness of topologies. Our evaluations have demonstrated that the accuracy of phylogenetic inference increases with the number of loci and that these loci should be chosen according to their rate of amino acid substitution. This study revealed several more slowly evolving genes that are suitable for phylogenetic analyses in a concatenated framework in fishes. The results of the genome-wide phylogenetic analysis described here indicate that the available data support previous findings of mtDNA-based molecular studies for the Ostariophysi/Protacanthopterygii relationship (e.g., Ishiguro et al., 2003) and of concatenated nuclear loci among the Acanthopterygii (e.g., Chen et al., 2004). To reach a new level of confidence for phylogenetic purposes, representative samples of genome sequences or EST sequences from additional relevant taxa are required. The rapid progress of genomic resources for an increasing number of species also emphasizes the importance of a reliable phylogenetic framework in which to interpret comparative results correctly.

Annotation of expressed sequence tags for the East African cichlid fish species Astatotilapia burtoni and evolutionary analyses of cichlid ORFs

“Fish say, they have their Stream and Pond, but is there anything beyond?”

Rupert Brooke

IV. Annotation of expressed sequence tags for the East African cichlid fish species Astatotilapia burtoni and evolutionary analyses of cichlid ORFs

4.1 Abstract

To gain insights into the genomic basis of the astonishing diversity of the adaptive radiations of cichlid fishes, we conducted an expressed sequence tag (EST) study. Here we report the collection and annotation of more than 12,000 ESTs generated from two different cDNA libraries obtained from the East African cichlid species Astatotilapia burtoni. Putative transcripts were functionally annotated (using the Gene Ontology classification system) based on matching gene sequences in Homo sapiens. For evolutionary analyses, we combined our newly generated ESTs with all available sequence data for haplochromine cichlids, which resulted in a total of more than 45,000 ESTs. The ESTs represent a broad range of molecular functions and biological processes. We compared the haplochromine ESTs to sequence data available for other fish model systems such as pufferfish (Takifugu rubripes and Tetraodon nigroviridis), trout, and zebrafish. We characterized genes that show a faster or slower rate of base substitutions in haplochromine cichlids than in other fish species, as this is indicative of a relaxed or reinforced selection regime.

About 18% of the surveyed ESTs were found to have haplochromine-specific rate differences, suggesting that these genes might play a role in lineage-specific features of cichlids. When characterizing these genes further by calculating KA/KS ratios, we found that four genes, or 3.45% of all more slowly evolving genes, showed a signature of positive selection in the haplochromine lineage. These genes are candidates for further work on the genetic causes of cichlid fish diversity.
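As a reading aid, the KA/KS criterion mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this study: the function names, the tolerance band around 1, and the example rate values are all hypothetical. It relies only on the conventional interpretation that a ratio of nonsynonymous (KA) to synonymous (KS) substitution rates above 1 suggests positive selection, a ratio near 1 suggests neutral evolution, and a ratio below 1 suggests purifying selection.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Return KA/KS for one gene; KS must be positive to form a ratio."""
    if ks <= 0:
        raise ValueError("KS must be positive to compute KA/KS")
    return ka / ks


def selection_regime(ka: float, ks: float, tolerance: float = 0.05) -> str:
    """Label a gene by its KA/KS ratio.

    The tolerance band around 1 is an assumption made for this sketch;
    a real analysis would use a statistical test instead of a fixed cutoff.
    """
    ratio = ka_ks_ratio(ka, ks)
    if ratio > 1 + tolerance:
        return "positive"
    if ratio < 1 - tolerance:
        return "purifying"
    return "neutral"


# Hypothetical rate estimates, for illustration only.
examples = {"gene_a": (0.30, 0.10), "gene_b": (0.02, 0.20), "gene_c": (0.10, 0.10)}
for name, (ka, ks) in examples.items():
    print(name, selection_regime(ka, ks))
```

A fixed cutoff keeps the sketch short. In practice, rate estimates carry sampling error, so a ratio slightly above 1 is not by itself evidence of positive selection.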

4.2 Introduction

The exceptionally diverse species flocks of cichlid fishes in the East African Great Lakes Tanganyika, Malawi and Victoria are prime examples of adaptive radiation and explosive speciation (Kocher et al., 2005; Kornfield and Smith, 2000; Salzburger and Meyer, 2004). More than 2,000 cichlid species have evolved within the last few thousand to a couple of million years in the rivers and lakes of East Africa (Kocher et al., 2005; Salzburger et al., 2005; Verheyen et al., 2003). Together with an additional ~1,000 species that are found in other parts of Africa, in South and Central America, in Madagascar, and in India, the family Cichlidae represents one of the most species-rich families of vertebrates. Besides their unparalleled species richness, cichlids are famous for their ecological, morphological and behavioral diversity (Fryer and Iles, 1972), for their propensity for rapid speciation (Verheyen et al., 2003), for their capacity for sympatric speciation (Schliewen et al., 1994; Wilson et al., 2000), and for the formation of parallel or convergent characters in independently evolved species flocks (Kocher et al., 1993; Stiassny and Meyer, 1999). For these reasons, the cichlid fishes, which have been referred to as ‘nature’s grand experiment in evolution’ (Barlow, 2000), appear to be an excellent model system for studying the basic dynamics of evolution, adaptation and speciation. However, while the phylogenetic relationships between the main cichlid lineages are largely established and some of the cichlids’ evolutionary novelties and particularities have been identified (see e.g., Kocher, 2004; Salzburger et al., 2005; Salzburger and Meyer, 2004), little is known about the genomic bases of the evolutionary success of the cichlids.

This is in contrast to the many advantages that the cichlid model system, and in particular the cichlid species flocks in East Africa, provide for genomic research. For example, the myriad closely related species in East Africa’s cichlid species flocks are comparable to a ‘mutagenic screen’ (Kocher et al., 2005; Meyer, 1993). Because viable crosses between different cichlid species can be produced in the lab, the genetic bases of particular traits can be studied by means of classical genetic tools (see e.g., Albertson et al., 2003; Streelman et al., 2003). Also, many of the few differences that are expected in the genomes of such extremely closely related species are likely to be directly linked to different phenotypes. Because of the close relatedness of the different species, primer sets for the amplification of particular genomic DNA regions such as candidate gene loci, microsatellites or SNPs are applicable to a wide range of species (see e.g., Albertson et al., 2003; Carleton and Kocher, 2001; Sugie et al., 2004; Terai et al., 2002). The same holds for expression profiling with cDNA microarrays, which, even if developed for one species, can be used for a variety of East African cichlid species (Renn et al., 2004). Genomic resources such as genomic maps, BAC and cDNA libraries, cDNA microarrays or expressed sequence tags (ESTs) that have been developed for one cichlid species are thus of general use to the cichlid community and are likely to facilitate genomic research in this exceptional model system.

A variety of genomic resources have already been established for East African cichlid species. Genetic maps are available for the Nile tilapia Oreochromis niloticus (Kocher et al., 1998; Lee et al., 2005) and the Lake Malawi species Metriaclima zebra (Albertson et al., 2003). BAC libraries are also available for O. niloticus (Katagiri et al., 2001) and M. zebra
