
The lamprey genome: illuminating vertebrate origins

(in review in Nature Genetics)


Jeramiah J Smith 1*†,2, Shigehiro Kuraku 3†,4, Carson Holt 5†,6, Tatjana Sauka-Spengler 7†,8, Ning Jiang 9, Michael S. Campbell 6, Mark D. Yandell 6, Tereza Manousaki 4, Axel Meyer 4, Ona E. Bloom 10,11, Jennifer R. Morgan 12†,13, Joseph D. Buxbaum 14-17, Ravi Sachidanandam 14, Carrie Sims 18, Alexander S. Garrett 18, Malcolm Cook 18, Robb Krumlauf 18,19, Leanne M. Wiedemann 18,20, Stacia A. Sower 21, Wayne A. Decatur 21, Jeffrey A. Hall 21, Chris T. Amemiya 2,22, Nil R. Saha 2, Katherine M. Buckley 23, Jonathan P. Rast 23, Sabyasachi Das 24, Masayuki Hirano 24, Nathanael McCurley 24, Peng Guo 24, Nicolas Rohner 25, Clifford J. Tabin 25, Paul Piccinelli 26, Greg Elgar 26, Magali Ruffier 27, Bronwen L. Aken 27, Stephen M. J. Searle 27, Matthieu Muffato 28, Miguel Pignatelli 28, Javier Herrero 28, Matthew Jones 8, C. Titus Brown 29,30, Yu-Wen Chung-Davidson 31, Kaben G. Nanlohy 31, Scot V. Libants 31, Chu-Yin Yeh 31, David W. McCauley 32, James A Langeland 33, Zeev Pancer 34, Bernd Fritzsch 35, Pieter J. de Jong 36, Lucinda L Fulton 37, Brenda Theising 37, Paul Flicek 28, Marianne Bronner 8, Wesley C. Warren 37, Sandra W. Clifton 37,38†, Richard K. Wilson 37, Weiming Li 31

Affiliations:

1 Department of Biology, University of Kentucky, Lexington, KY, USA.

2 Benaroya Research Institute at Virginia Mason, Seattle, WA, USA.

3 Genome Resource and Analysis Unit, Center for Developmental Biology, RIKEN, Japan.

4 Department of Zoology and Evolutionary Biology, University of Konstanz, Konstanz, Germany.

5 Ontario Institute for Cancer Research, Informatics and Bio-Computing, Toronto, ON, Canada.

6 Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT, USA.

7 The Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK.

8 Division of Biology, California Institute of Technology, Pasadena, CA, USA.

9 Department of Horticulture, Michigan State University, East Lansing, MI, USA.

10 The Feinstein Institute for Medical Research, Manhasset, NY, USA.

11 The Hofstra North Shore-LIJ School of Medicine, Hempstead, NY, USA.

12 Marine Biological Lab, Woods Hole, MA, USA.

13 Department of Cell and Molecular Biology, University of Texas at Austin, Austin, TX, USA.

14 Department of Genetics and Genomics Sciences, Mount Sinai School of Medicine, New York, NY, USA.

15 Department of Psychiatry, Mount Sinai School of Medicine, New York, NY, USA.

16 Department of Neuroscience, Mount Sinai School of Medicine, New York, NY, USA.

17 Friedman Brain Institute, Mount Sinai School of Medicine, New York, NY, USA.

18 Stowers Institute for Medical Research, Kansas City, MO, USA.

19 Department of Anatomy & Cell Biology, The University of Kansas School of Medicine, Kansas City, KS, USA.

20 Department of Pathology and Laboratory Medicine, University of Kansas School of Medicine, Kansas City, KS, USA.

21 Center for Molecular and Comparative Endocrinology, University of New Hampshire, Durham, New Hampshire, USA.

22 Department of Biology, University of Washington, Seattle, WA, USA.

23 Department of Immunology and Department of Medical Biophysics, University of Toronto, Sunnybrook Research Institute, Toronto, ON, Canada.

24 Emory Vaccine Center and Department of Pathology and Laboratory Medicine, Emory University, Atlanta, Georgia, USA.

25 Department of Genetics, Harvard Medical School, Boston, MA, USA.

26 MRC National Institute for Medical Research, London, England.

27 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

28 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

29 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA.

30 Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA.

31 Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA.

32 Department of Zoology, University of Oklahoma, Norman, OK, USA.

33 Department of Biology, Kalamazoo College, Kalamazoo, MI, USA.

34 Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD, USA.

35 Department of Biology, University of Iowa, Iowa City, IA, USA.

36 Children's Hospital Oakland, Oakland, CA, USA.

37 The Genome Institute, Washington University School of Medicine, St. Louis, MO, USA.

38 The Advanced Center for Genome Technology, Norman, OK, USA.

* Correspondence to: Jeramiah J Smith, jjsmit3@uky.edu

† Current Address


Abstract

Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ~500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (Petromyzon marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and fundamentals of vertebrate biology. Here, we present the first version of the lamprey genome assembly, generated by overcoming challenges presented by its high content of repetitive elements and GC bases as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole genome duplications likely occurred before divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and evolutionary events that have shaped extant organisms and their genomes.


Main text

The fossil record reveals that, during the Cambrian period, there was a great elaboration in the diversity of animal body plans. This includes emergence of a species with several characteristics shared with modern vertebrates such as a cartilaginous skeleton that encases the central nervous system (cranium and vertebral column) and provides a support structure for the branchial arches and median fins. The cartilaginous cranium of this species housed a tripartite brain with a forebrain regulating neuroendocrine signaling via the pituitary gland, a midbrain (including an optic tectum) for processing sensory information from paired sensory organs and a segmented hindbrain for controlling unconscious functions such as respiration and heart rate. These adult features suggest that their embryos must already have possessed uniquely vertebrate cell types like the skeletogenic neural crest and ectodermal placodes, both defining characters of modern day vertebrates. Subsequent diversification of this lineage gave rise to the jawed vertebrates (gnathostomes), hagfish (for which genome-scale sequence data are currently limited), lamprey and several extinct lineages (Figure 1, Supplement 1).

Given the critical phylogenetic position of lamprey, as an outgroup to the gnathostomes (Figure 1), comparing the lamprey genome to gnathostome genomes holds the promise of providing insights into the structure and gene content of the ancestral vertebrate genome. Unsolved important questions include the timing and subsequent elaboration of ancient genome duplication events and the elucidation of genetic innovations that may have contributed to the evolution and development of modern vertebrate features. These include the jaws, myelinated nerve-sheaths, adaptive immune system and paired appendages/limbs.


Figure 1. An abridged phylogeny of the vertebrates. This figure shows the timing of major radiation events within the vertebrate lineage. Extinct lineages and some extant lineages (e.g. coelacanths, lungfish and hagfish) have been omitted for simplicity. CZ = Cenozoic. Here, “reptiles” is synonymous with “sauropsids”, “ray-finned fish” is synonymous with “actinopterygian” and “Euteleostome” is synonymous with “Osteichthyan”.

Sequencing, assembly and annotation

Approximately 19 million sequence reads were generated from genomic DNA derived from the liver of a single wild-captured adult female sea lamprey (Petromyzon marinus) (Supplement 2). The lamprey genome project was initiated well before the discovery that lamprey undergoes programmed genome rearrangements during early embryogenesis, which results in the deletion of ~20% of germline DNA from somatic tissues (Smith et al. 2009), though the effects of rearrangement on the genic component of the genome are not fully understood. We used raw sequence reads to examine large-scale sequence content and repetitive structure of the lamprey genome. These analyses indicated that the lamprey genome is highly repetitive, rich in GC bases, and highly heterozygous (Supplement 3). Although these features tend to encumber assembly of long contiguous sequences, analyses of broad-scale structure enabled optimization of parameters used in assembly algorithms (Supplement 3).

The current assembly was generated using Arachne (Jaffe et al. 2003) and consists of 0.816 Gb of sequence, distributed across 25,073 contigs. Half of the assembly is in 1,219 contigs of 174 kb or longer and the longest contig is 2.4 megabases. This assembly resolves multi-kilobase to megabase-scale structure over a majority of single-copy genomic regions (Supplement 3), permitting the annotation of repetitive elements (Supplement 4), genes (Supplements 5 and 6) and conserved intergenic features (Supplement 7). Detection of extensive conserved synteny with gnathostome genomes (below) indicates that lamprey scaffolds accurately reflect the chromosomal organization of the lamprey genome. This assembly therefore provides unparalleled resolution of gene content and structure of this evolutionarily important genome.
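The statement that half of the assembly lies in 1,219 contigs of 174 kb or longer is the familiar N50 summary. A minimal sketch of how such a statistic can be computed is given here; the function name and the toy lengths are illustrative assumptions and are neither the lamprey data nor the authors' code.

```python
def n50(lengths):
    """Return (N50 length, number of contigs needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total to at least half
        # of the assembly defines the N50 length.
        if running >= half:
            return length, count
    raise ValueError("empty list of contig lengths")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))
```

Applied to the real contig lengths, the two returned values would correspond to the "174 kb" and "1,219 contigs" figures quoted above.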

Ab initio searches for repetitive DNA sequences revealed that the lamprey genome contains abundant repetitive elements with high sequence identity. We identified 7752 distinct families of repetitive elements, accounting for 34.7% of the assembly (Supplement 4). Notably, this proportion is expected to be a significant underestimate, due to collapsing of repetitive elements during genome assembly. The large diversity of lamprey repetitive elements and abundance of high-identity (presumably young) repeats represents a potentially rich resource for studies of the evolution and transposition of repetitive sequences.

The location of genes was determined by combining RNAseq mapping and exon linkage data (Supplement 5) with gene homologies, prediction of coding sequences, splicing signals and repetitive elements using the MAKER pipeline (Cantarel et al. 2008) (Supplement 6). The final set of annotated protein-coding genes contained a total of 26,046 genes encoding 26,204 transcripts. This number is similar to the numbers of predicted protein-coding genes in other vertebrate genomes reported to date. Conserved non-coding elements (CNEs) were identified by homology to published sequences (Woolfe et al. 2007; Venkatesh et al. 2006).

Searches identified a limited number of homologous CNEs in lamprey: 337 (5.0% of 6670) and 287 (6.0% of 4782) from the two published sets, respectively, in close agreement with previous analyses (McEwen et al. 2009). For those lamprey CNEs that were linked to conserved homologous regions in lamprey and gnathostome genomes, sequence identity typically extends over approximately half the length (53%) of the gnathostome CNE (Supplement 7). Thus, either the lamprey lineage diverged from jawed vertebrates before most gnathostome CNE sequences became highly constrained or these CNEs have evolved much more rapidly in the lamprey genome than in jawed vertebrate genomes. Future work on additional lamprey and hagfish genomes should ultimately resolve this question.

Variation in nucleotide content and substitution can strongly influence intragenomic functionality and intergenomic comparative analyses. Analysis revealed that the GC content of the lamprey genome assembly is higher than that of most other vertebrate genomes reported to date. Overall, 46% of the assembly is composed of GC bases, similar to the GC content of raw WGS reads (Supplement 8). Genome-wide analyses also reveal patterns of intragenomic heterogeneity in GC content similar to those of amniote species that possess isochore structures, but lower in magnitude. Moreover, the GC content of protein-coding regions (61%) is markedly higher than that of non-coding and repetitive regions. As expected, GC content is highest in the third position of codons (GC3; 75%) (Supplement 8). Patterns of GC bias strongly affect codon usage and amino acid composition of lamprey genes, imparting an underlying structure to lamprey coding sequences that differs substantially from all other sequenced vertebrate and invertebrate genomes (Figure 2, Supplement 8). Notably, we did not detect a significant correlation between GC3 and the GC content of adjacent non-coding regions. Thus, it appears that processes that lead to patterns of intragenomic heterogeneity in lamprey GC content differ fundamentally from those in species that possess isochore structures. This raises a question regarding the adaptive value or other biological importance of the observed variation of GC content within and among genomes.
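For readers unfamiliar with the quantities discussed here, the following sketch shows one way overall GC content and third-codon-position GC content (GC3) can be computed from a coding sequence. The function names and the short example sequence are assumptions made for illustration; the sketch does not reproduce the pipeline described in Supplement 8.

```python
def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def gc3_fraction(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    # Slicing from index 2 in steps of 3 picks the third base of each
    # complete codon; a trailing partial codon contributes nothing.
    return gc_fraction(cds[2::3])


# Hypothetical in-frame sequence (ATG GCC AAG TAA), for illustration only.
example_cds = "ATGGCCAAGTAA"
print(gc_fraction(example_cds), gc3_fraction(example_cds))
```

Comparing the two values gene by gene, and against flanking non-coding sequence, is the kind of contrast that underlies the coding versus non-coding differences reported above.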

To further explore the biological basis of high GC content and its intragenomic heterogeneity, we examined the relationship between protein-coding GC-content and codon usage bias, amino acid composition and gene expression level. The results show that genomic GC content strongly correlates with codon usage bias and amino acid composition, but not with gene expression level (Supplement 8). These observations are consistent with a scenario wherein high GC content results from broad-scale substitution bias, rather than selection for specific GC-rich codons. As lamprey is clearly an outlier amongst vertebrates, further dissection of coding GC content in the sea lamprey and other lamprey and hagfish species will help reveal the causes and consequences of intragenomic heterogeneity of GC content in vertebrate genomes.

Figure 2. Genome-wide deviation of lamprey coding sequence properties from patterns observed in other vertebrate and invertebrate genomes. (A) Codon usage bias. Correspondence analysis on relative synonymous codon usage (RSCU) values was performed using nucleotide sequences of all predicted genes concatenated for individual species. (B) Amino acid composition. Correspondence analysis was performed using deduced peptide sequences of all predicted genes for individual species. Red: lamprey. Grey: invertebrates. Green: jawed vertebrates.

Duplication structure of the genome

It is generally accepted that two rounds (2R) of whole genome duplication (WGD) occurred early in the history of vertebrate evolution (Ohno 1979). However, the timing of these defining duplication events has not been well supported by genome-wide sequence data thus far (Kuraku et al. 2009). As the proximate outgroup to jawed vertebrates, the lamprey genome is uniquely suited for addressing several questions regarding the occurrence, timing, and outcome of WGD events. To identify gene and genome duplication events in the ancestral vertebrate lineage, we analyzed patterns of duplication within conserved syntenic regions of lamprey and gnathostome genomes and compared these patterns to the entire genome.

We estimated duplication frequencies based on aligning all predicted lamprey proteins from the MAKER (Cantarel et al. 2008) dataset to whole genome assemblies for human (GRCh37, GCA_000001405.1) and chicken (Gallus_gallus-2.1, GCA_000002315.1).

To account for the possibility that paralogs have been retained on one or both genomes, in a way that bypasses many confounding aspects of phylogenetic reconstruction (Supplement 9, 10), regions were considered putative orthologs if they yielded the highest-scoring alignment between the two genomes or an alignment score (bitscore) within 90% of the top-scoring alignment (see Supplement 10 for details). Strong patterns of conserved synteny are observed between lamprey and both human and chicken genomes (Supplement 10). For simplicity, we present comparisons to the chicken genome below as the genome is known to have undergone substantially fewer interchromosomal rearrangements than have mammalian genomes (Hillier et al. 2004; Smith and Voss 2006).
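The selection rule described above (keep the top-scoring alignment plus any alignment whose bitscore is within 90% of the top score) can be sketched as follows. This is an illustration under stated assumptions: "within 90%" is read here as a bitscore of at least 0.9 times the best bitscore, and the function name, tuple layout and example hits are hypothetical rather than taken from Supplement 10.

```python
from collections import defaultdict


def putative_orthologs(hits, fraction=0.9):
    """Group alignment hits by query and keep near-top-scoring target regions.

    hits: iterable of (query_id, target_region, bitscore) tuples.
    Returns a dict mapping each query to the regions it retains.
    """
    by_query = defaultdict(list)
    for query, region, bitscore in hits:
        by_query[query].append((region, bitscore))

    kept = {}
    for query, regions in by_query.items():
        top = max(score for _, score in regions)
        # Assumed reading of "within 90% of the top-scoring alignment".
        kept[query] = [region for region, score in regions
                       if score >= fraction * top]
    return kept


# Hypothetical hits of two lamprey proteins against a gnathostome genome.
example_hits = [
    ("pm_gene1", "chr1:a", 200.0),
    ("pm_gene1", "chr4:b", 185.0),   # retained: 185 >= 0.9 * 200
    ("pm_gene1", "chr7:c", 120.0),   # discarded
    ("pm_gene2", "chr2:d", 90.0),
]
print(putative_orthologs(example_hits))
```

Retaining more than one region per query is what allows duplicated (paralogous) targets to be counted without first building a gene tree.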


Figure 3. Conserved synteny and duplication in the lamprey and gnathostome (chicken) genomes. In panels A – D, the locations of presumptive lamprey/chicken orthologs (including duplicates) are plotted relative to their physical position on chromosomes and scaffolds, and connected by colored lines. Panels A and B show pairs of chicken chromosomes that correspond to a series of lamprey scaffolds. In panel A, 10 lamprey loci are present as duplicate copies in the chicken genome and 59 are present as single copies. In panel B, 12 lamprey loci are present as duplicate copies in the chicken genome and 54 are present as single copies. In Panels C and D, asterisks indicate duplicates.

Our analyses indicate that most lamprey and gnathostome genes currently do not possess 2R-duplicates in their respective genomes (Supplement 9, 10), presumably due to frequent loss of one paralog following duplication. Accordingly, we used the lamprey genome to search for a signature of large-scale duplication that does not rely on the retention of duplicated genes, but can be informed by their presence. Specifically, we searched for cases wherein a single lamprey scaffold contains interdigitated homologies from two distinct regions of a gnathostome genome (Figure 3). Such patterns are consistent with large-scale duplication followed by random loss of either paralogous copy. Nearly all lamprey scaffolds exhibited patterns of interdigitated conserved synteny of gnathostome orthologs (Tables S10.1 and S10.2). Moreover, homologs from individual pairs of gnathostome chromosomes were recurrently observed in interdigitated syntenic blocks on several lamprey scaffolds.

Importantly, some of the individual homologous markers that contributed to these conserved syntenic blocks were mapped to duplicate positions within gnathostome genomes, being present on the two homologous gnathostome chromosomes. Although these duplicates constitute a relatively modest fraction of conserved syntenic homologs (14.5% in Figure 3A, and 18.2% in Figure 3B, not counting redundant copies), we interpret these as strong evidence that large-scale (i.e. whole genome) duplication has played a major role in shaping gnathostome genome architecture.
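The two ideas in this passage, interdigitation of homologs from two chromosomes along one scaffold and the fraction of loci retained as duplicates, can be made concrete with a small sketch. The switch-counting criterion and the chromosome labels are hypothetical simplifications introduced for illustration. The duplicate percentages, in contrast, are recomputed from the counts given in the Figure 3 legend (10 duplicate and 59 single-copy loci in panel A; 12 and 54 in panel B).

```python
def count_switches(chromosomes):
    """Number of adjacent marker pairs that map to different chromosomes."""
    return sum(1 for a, b in zip(chromosomes, chromosomes[1:]) if a != b)


def is_interdigitated(chromosomes, min_switches=2):
    """Toy criterion (an assumption, not the authors' test): homologs come
    from exactly two chromosomes and alternate more often than a single
    block boundary would produce."""
    return (len(set(chromosomes)) == 2
            and count_switches(chromosomes) >= min_switches)


def duplicate_percent(duplicated, single_copy):
    """Percentage of loci in a syntenic block that are retained as duplicates."""
    return round(100 * duplicated / (duplicated + single_copy), 1)


# Hypothetical scaffold: chicken chromosome of each ortholog, in scaffold order.
scaffold = ["chrA", "chrA", "chrB", "chrA", "chrB", "chrB"]
print(is_interdigitated(scaffold))
print(duplicate_percent(10, 59))  # Figure 3A counts
print(duplicate_percent(12, 54))  # Figure 3B counts
```

With the Figure 3 counts, the last two calls give 14.5 and 18.2, matching the percentages quoted in the text.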

Similar duplication patterns on lamprey scaffolds also appear to support the notion that large-scale (i.e. whole genome) duplication has played a major role in shaping lamprey genome architecture. Although lamprey scaffolds do not yet provide chromosome-scale resolution, several cases were identified wherein two large lamprey scaffolds contain predicted paralogs and patterns of interdigitated conserved synteny (two defining signatures of large-scale duplication; Figure 3C and D, Supplement 10). To further assay for patterns indicative of ancient whole genome duplication events (i.e. 2R) within the lamprey genome, we manually examined all lamprey scaffolds that possessed 10 or more gnathostome homologs. These 83 scaffolds accounted for 10% of the comparative map and possessed a duplication frequency (0.463, including redundant copies of duplicates) that was similar to the genome at large (0.448). Among these scaffolds, we identified 29 gene pairs that were present as duplicates on two large scaffolds and one trio that was present on three large scaffolds. For a majority of duplicates, scaffolds contained at least one additional ortholog on the chicken chromosome that harbored an ortholog of the duplicate [specifically, both (59.3%), one (29.6%) and no (11.1%) scaffolds contained an additional syntenic ortholog]. On average, these scaffolds contained 2.98 additional conserved syntenic genes for each individual lamprey duplicate (including the 11.1% with no syntenic markers). These patterns are consistent with the existence of patterns of interdigitated synteny in the lamprey genome that are highly similar to those in gnathostome genomes, indicating that the most recent (2R) WGD event likely occurred in the common ancestral lineage of lampreys and gnathostomes.

Additional genome-wide analyses reveal that: 1) the number of ancestral loci with retained duplicates in gnathostome genomes is not significantly different from the number with retained duplicates in lamprey (lamprey = 0.271, chicken = 0.262, χ² = 2.94, P = 0.08, Supplement 10); 2) the frequency of shared duplications is higher than would be expected by chance (observed = 0.150, expected = 0.022, χ² = 6179, P(χ²) < 1e-100, P(Fisher’s Exact) < 1e-100, Supplement 10); 3) a model invoking recurrent selection against small-scale duplicates across a majority of the genome is not sufficient to explain genome-wide patterns of shared duplication (Supplement 10), and 4) inclusion of lamprey in phylogenetic analyses resolves gene families consistent with two rounds of whole genome duplication (Supplement 9). Moreover, targeted analyses of Hox clusters and gonadotropin releasing hormone (GnRH)-syntenic regions reveal that post-duplication loss of paralogs has occurred largely independently in lamprey and gnathostome genomes, consistent with divergence of the two lineages shortly after the last WGD event (Figure 4, Supplement 11 and 12). Although the less parsimonious scenario of one or two independent and ancient WGD events in gnathostome and lamprey lineages cannot be completely ruled out, neither a gnathostome-specific genome duplication nor persistent selection to retain a subset of independent duplicates is likely to explain the subtle differences in duplication structure of lamprey versus gnathostome genomes. It seems exceedingly unlikely that such genomic arrangements and distributions of synteny blocks would arise by chance or mechanisms other than by an ancient shared WGD. We therefore propose that genome-wide patterns of duplication are indicative of a shared history of two rounds of genome-wide duplication prior to the lamprey/gnathostome divergence.
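As a rough check on the significance values quoted in this paragraph, the sketch below converts a chi-square statistic into a P value. It assumes one degree of freedom, which the text does not state explicitly; the function name is hypothetical, and the underlying contingency tables are in Supplement 10 and are not reconstructed here.

```python
import math


def chi2_pvalue_1df(statistic):
    """Upper-tail P value of a chi-square statistic with one degree of freedom.

    For one degree of freedom the survival function reduces to
    erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    if statistic < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))


# Test statistics reported in the text.
print(chi2_pvalue_1df(2.94))   # retained duplicates, lamprey versus chicken
print(chi2_pvalue_1df(6179))   # shared duplications, observed versus expected
```

Under the one-degree-of-freedom assumption, the first statistic gives a P value between 0.08 and 0.09, in line with the reported P = 0.08, and the second falls far below 1e-100.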


Figure 4. The effect of genome duplication and independent paralog loss on the evolution of lamprey/gnathostome conserved syntenic regions. (A) Conserved synteny among the GnRH group 2, 3, and 4 genes in lamprey, chicken, and humans, including the medaka region for GnRH3, which is absent in tetrapods. The orientation of each chromosome (Chr) and scaffold (Sf) is indicated with line arrows. A pointed box represents the orientation of each gene. Open rectangles with red X’s indicate lost GnRH loci. The ancestral state of the gene region is shown at the bottom. (B) Assembled lamprey Hox scaffolds and patterns of conserved synteny, relative to human Hox clusters (human Hox clusters, rather than chicken, are used because all four human Hox-syntenic regions are integrated into the human genome assembly). A pointed box represents the orientation of each gene. Three additional conserved syntenic genes, located adjacent to the PM2Hox cluster, are omitted due to space limitations (retinoic acid receptor, heterogeneous nuclear ribonucleoprotein and thyroid hormone receptor).

Ancestral vertebrate biology

It has been suggested that many of the morphological and physiological features that characterize vertebrates evolved through the modification of preexisting regulatory regions
