

The phylogenetic utility of the nuclear Rag-1 (recombination activating gene-1) in vertebrate

8.3. MATERIALS AND METHODS

A total of 1839 nucleotide sequences of the Rag-1 gene (259 sequences from Amphibia, 579 Aves, 32 Chondrichthyes, 31 Crocodylia, 186 Lepidosauria, 363 Mammalia, 364 Actinopterygii, 25 Testudines) were downloaded from GenBank in May 2005. Sequences with fewer than 300 bp of coding sequence were excluded from the analyses because of the difficulty of aligning them.

A first alignment was made for each group (Amphibia, Aves, Chondrichthyes, Crocodylia, Lepidosauria, Mammalia, Actinopterygii and Testudines) using the ALIGNMENT option of the program CLUSTAL X 0.1 (Ramu et al. 2003). The resulting alignment was checked manually in the MEGA 3 (Kumar et al. 2004) alignment editor. The complete Rag-1 sequences of Xenopus (accession number L19324), Homo sapiens (accession number M29474) and Oncorhynchus mykiss (accession number U15663) were used to check and correct the alignment of all vertebrates and the consensus sequences created for each group with the program BioEdit (North Carolina State University). The consensus sequences were aligned using the CLUSTAL option as implemented in MEGA 3. Subsequently, the complete set of 1839 sequences was aligned according to each group-specific consensus sequence. A final manual check was performed to refine the alignment (the complete alignment is available upon request from the authors).

Because studies based on Rag-1 have often used non-homologous fragments of the gene, not all the downloaded sequences started and ended at the same point (see Results section). Thus, in order to carry out a comparative analysis of the same homologous section of the gene across vertebrates, we selected a common fragment of 510 bp (from bp 1653 to bp 2163, corresponding to amino acids 551 to 721). This fragment was the longest one common to all the vertebrate groups considered (Figure 8.1), although it was still not complete in all taxa included. A subset of 1460 sequences was acceptably complete for this fragment (a minimum of 400 out of the 510 bp) and was used for the comparative analysis across all vertebrates (205 sequences from Amphibia, 579 Aves, 31 Chondrichthyes, 13 Crocodylia, 122 Lepidosauria, 165 Mammalia, 322 Actinopterygii, 23 Testudines).
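The fragment selection and the completeness threshold described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sequences are assumed to be already aligned to the 3324 bp vertebrate alignment, and the exact boundary convention (which of bp 1653 or bp 2163 is included so that the fragment is 510 bp long) is an assumption, since the text does not specify it.

```python
# Illustrative sketch; names and boundary convention are assumptions.
FRAG_START, FRAG_END = 1653, 2163  # alignment coordinates given in the text
MIN_BASES = 400                    # completeness threshold given in the text


def extract_fragment(aligned_seq, start=FRAG_START, end=FRAG_END):
    """Return the 510-column slice of an aligned sequence.

    Assumption: the slice covers 1-based columns start+1..end.
    """
    return aligned_seq[start:end]


def count_resolved(fragment):
    """Count unambiguous nucleotides; gaps, N and '?' are not counted."""
    return sum(base in "ACGT" for base in fragment.upper())


def select_complete(alignment, min_bases=MIN_BASES):
    """Keep the fragments that have at least min_bases resolved sites.

    alignment maps a sequence name to its aligned sequence string.
    """
    kept = {}
    for name, seq in alignment.items():
        fragment = extract_fragment(seq)
        if count_resolved(fragment) >= min_bases:
            kept[name] = fragment
    return kept
```

With such a filter, a sequence covering only part of the fragment (for example 363 of the 510 columns) would be dropped, whereas one covering the whole fragment would be retained.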

For each group, base composition was calculated in PAUP* (Swofford 2002) for the total length of the gene fragment and for each codon position. We tested for possible biases in base composition using the Chi-square test as implemented in PAUP*. This program was also used to perform a neighbor-joining (NJ) tree analysis on the 1460 vertebrate sequences for the 510 bp of the selected fragment. MEGA 3 was used to determine the most and the least represented amino acids, codon usage, the number of conserved, variable and parsimony-informative amino acids, and the uncorrected p-distances within genera. The number of conserved, variable and parsimony-informative nucleotide sites was calculated in PAUP*, as were the uncorrected p-distances for each codon position and for the total selected fragment of the gene. Because of the large number of bird sequences, a subset of 351 sequences was used to calculate the uncorrected p-distances for the 510 bp fragment and for its 1st, 2nd and 3rd codon positions. This subset was created by removing the sequences corresponding to random numbers (from 1 to 579) generated in Microsoft EXCEL. The subset of 351 sequences was considered representative of the 579 original sequences because its mean uncorrected p-distances corresponded to those based on all 579 sequences. Uncorrected p-distances were also calculated for the total length of the gene (or for the available part of it) for Aves, Actinopterygii, Lepidosauria, Amphibia and Testudines. These were the only vertebrate groups for which data spanning the total or nearly total length of the gene were available (see Results section). The dN/dS ratio was calculated in PAML (Yang 1997), following the method of Nei and Gojobori as implemented in the program.
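As a rough illustration of the distance calculations and of the random subsampling described above, the following sketch computes uncorrected p-distances (overall or for one codon position) and draws a random subset of sequences. It is not the PAUP*, MEGA or EXCEL procedure itself: the function names are hypothetical, sites with gaps or ambiguous bases are ignored pairwise (an assumption), and the alignment is assumed to start at a first codon position.

```python
import random
from itertools import combinations

VALID = set("ACGT")


def p_distance(seq_a, seq_b, codon_pos=None):
    """Proportion of differing sites among sites resolved in both sequences.

    codon_pos (1, 2 or 3) restricts the comparison to that codon position,
    assuming the alignment starts in frame.
    """
    diffs = compared = 0
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if codon_pos is not None and i % 3 != codon_pos - 1:
            continue
        if a in VALID and b in VALID:
            compared += 1
            diffs += a != b
    return diffs / compared if compared else float("nan")


def mean_p_distance(seqs, codon_pos=None):
    """Mean of all pairwise p-distances, skipping pairs with no shared sites."""
    dists = [p_distance(a, b, codon_pos) for a, b in combinations(seqs, 2)]
    dists = [d for d in dists if d == d]  # NaN is the only value != itself
    return sum(dists) / len(dists) if dists else float("nan")


def random_subset(seqs, k, seed=0):
    """Draw k sequences at random without replacement (seed is arbitrary)."""
    return random.Random(seed).sample(list(seqs), k)
```

Comparing `mean_p_distance` on the full set with the value obtained on `random_subset(seqs, 351)` would mirror the representativeness check described for the bird data.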

Because of the large amount of data analyzed (pairwise comparisons of the studied sequences), the program MATLAB (The MathWorks, Natick, Massachusetts) was used to plot the uncorrected p-distances for each codon position against the uncorrected p-distances for the selected fragment in order to study possible saturation of the gene. Maximum and mean uncorrected intragroup p-distances of Amphibia, Aves, Lepidosauria, Mammalia and Actinopterygii were plotted against estimates of the ages of divergence of the respective groups. Species divergence times based on the fossil record and molecular clocks were taken from Hedges and Kumar (2003):

- ray-finned fish (Actinopterygii) as 450 mya;

- amphibians (Xenopus) as 360 mya;

- mammals as 310 mya;

- birds (Gallus) as 228 mya.

Sliding window analyses of sequence divergence along the Rag-1 gene were plotted in Microsoft EXCEL from data obtained with the program DnaSP Ver. 4.10.3 (Rozas et al. 2003). With this program, the number of polymorphic (segregating) sites (S) was calculated along the total length of the gene in windows of 50 bp with steps of 10 bp. Because the program only analyses positions at which all sequences are present (that is, it skips positions with unknown nucleotides or gaps), a subset of the initial sequences was used for each group to maximize the resolution (88 for Amphibia, 305 Aves, 26 Chondrichthyes, 13 Crocodylia, 91 Lepidosauria, 75 Actinopterygii, 21 Testudines). For Mammalia, because two main segments have been used in published studies (one at the beginning and one in the second half of the Rag-1 gene), the dataset was divided into three subsets to increase the resolution. The first subset contained 53 sequences, the second 35 and the third 55. Sequences present in one subset were not included in the others.
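The sliding window count of segregating sites can be sketched as below, using the window width (50 bp) and step (10 bp) given in the text. This is a minimal illustration and not DnaSP itself: the function names are hypothetical, and columns containing a gap or an unknown base in any sequence are simply not counted, which is an assumption about how such positions are treated.

```python
def segregating_sites(seqs, start, width):
    """Count columns in [start, start + width) with more than one base.

    Columns with a gap or ambiguous base in any sequence are skipped.
    """
    count = 0
    for col in range(start, start + width):
        states = {s[col].upper() for s in seqs}
        if states <= set("ACGT") and len(states) > 1:
            count += 1
    return count


def sliding_window(seqs, width=50, step=10):
    """Return (first_bp, last_bp, S) per window, with 1-based coordinates."""
    length = min(len(s) for s in seqs)
    return [
        (start + 1, start + width, segregating_sites(seqs, start, width))
        for start in range(0, length - width + 1, step)
    ]
```

With these settings the windows are labelled 1-50, 11-60, 21-70 and so on, which matches the window labels used in the Results (for example 801-850).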

8.4. RESULTS

General Alignment of Rag-1 for all vertebrates (from bp 1 to bp 3324)

Because a large number of different primers have been used, different sections of the Rag-1 gene have been sequenced, and consequently not all the sequences analyzed start at the same point. After the alignment of all vertebrate sequences, the most represented fragment in amphibians is that between bp 1794 and bp 3009, present in about 204 sequences (Figure 8.1). Other starting points are at bp 1656, bp 1965 and between bp 1554 and bp 1597. Other end points are between bp 2559 and bp 2571 and at bp 3196.

In ray-finned fish, the most represented fragment is between bp 1711 and bp 2569 (present in 357 sequences, Figure 8.1). Different sequence starting points are at bp 1465, bp 1496, bp 1561, between bp 1618 and bp 1641, bp 2593 and bp 2734. End points are at bp 582 (but only for two sequences), bp 2589, bp 3114, bp 3153, bp 3165, bp 3201.

In birds, the fragment common to all sequences is found between bp 1177 and bp 2166 (present in all 579 sequences; Figure 8.1). However, some sequences start at bp 102, while the chicken sequence (accession number M58530) starts at bp 19. Other end points are at bp 2256, bp 3165 and bp 3132. The complete chicken sequence (accession number M58530) ends at bp 3324.

In Chondrichthyes, the most represented fragment is between bp 1638 and bp 3056 (present in 29 sequences, Figure 8.1). Other starting points are between bp 1564 and bp 1570. One sequence begins only at bp 2593 and ends at bp 2985. Other end points are at bp 2634, between bp 3056 and bp 3058, between bp 3123 and bp 3132 and at bp 3163.

In Crocodylia, the most common fragment is found between bp 651 and bp 1424 (present in 30 out of 31 sequences, Figure 8.1). Another starting point is at bp 642, while another end point is at bp 2739.

In Lepidosauria, the fragment present in the majority of the sequences is found between bp 2581 and bp 3044 (present in 162 sequences, Figure 8.1). However, starting points vary: between bp 102 and bp 123, at bp 148, between bp 156 and bp 163, at bp 1635, at bp 1652 and at bp 2743. Other end points are at bp 2335, between bp 2536 and bp 2566, and at bp 3054, bp 3077, bp 3100, bp 3114 and bp 3198.

In mammals, the fragment present in most of the sequences is found between bp 1913 and bp 2403 (present in 210 sequences, Figure 8.1). Starting and end points of the sequences are highly variable in this group, with sequences starting between bp 88 and bp 112, or at bp 321, bp 569, bp 630, bp 1134, bp 1924, bp 2189, bp 2368 and bp 2566. End points are between bp 1444 and bp 1487, between bp 2383 and bp 2403, and at bp 2924, bp 2980, bp 2298 and bp 3300. The complete Rag-1 sequence of Lama glama (accession number AF305953) has been included in the alignment.

Finally, in Testudines the most represented fragment is between bp 157 and bp 3114 (present in 25 sequences, Figure 8.1), although one sequence ends at bp 3088. Another starting point is at bp 148.

Figure 8.1. Alignment of the most represented fragment within each group. Most represented fragments are in grey. Numbers refer to the complete vertebrate alignment and indicate the bp at which the fragment begins and ends. Bp 1 and bp 3324 are the start and the end of the total Rag-1 gene in the alignment of all vertebrates.

The total length of the Rag-1 gene is different in the various taxa and major groups of vertebrates (see also the Results section concerning the subset of 510 bp).

- In birds, only Corvus orru (AY443277), Corvus coronoides (AY443276) and Corvus corone (AY056989) have 20 additional amino acids between positions 57 and 78 compared with the remaining birds analyzed. Of these 20 additional amino acids, 16 are also present in all Lepidosauria, Testudines and Crocodylia. Of these 16 common amino acids, those in positions 60 and 61 are always serine and leucine in each of the above-mentioned groups.

- Also in birds, only 20 taxa (Tinamus guttatus (AF143726), Struthio camelus (AF143727), Chauna torquata (AF143728), Megapodius freycinet (AF143731), Gallus gallus (AF143730, M58530), Anhima cornuta (AY140765), Chauna torquata (AY140766), Oreophasis derbianus (AY140773), Chamaepetes goudotii (AY140769), Penelopina nigra (AY140771), Penelope obscura (AY140771), Aburria aburri (AY140768), Pipile jacutinga (AY140772), Crax blumenbachii (AY140775), Ortalis canicollis (AY140774), Pauxi pauxi (AY140778), Mitu tuberosa (AY140776), Nothocrax urumutum (AY140777), Megapodius reinwardt (AY140767)) have five additional amino acids between positions 99 and 104. Anas strepera (AF143729) has only three of them. Four of these amino acids are also present in Lepidosauria and mammals.

- In birds, only Pericrocotus ethologus (AY443316) has three additional amino acids between positions 122 and 126 compared with the remaining birds analyzed. These three amino acid positions are also present in all Lepidosauria, Testudines and Crocodylia, although there they are not alanine, histidine, alanine as in Pericrocotus ethologus.

- All Chondrichthyes are uniquely characterized among the available vertebrate sequences by an additional amino acid in position 643 and by missing amino acids 630 and 631.

- All Actinopterygii are missing a proline in position 634, which is instead present in all the other groups.

- The basal actinopterygian Amia calva (AF369059) has an additional amino acid in position 647 compared to all the rest of vertebrates analyzed.

- A single amino acid is present in position 683 in almost all Actinopterygii sequences spanning this region, except in Sphoeroides dorsalis (AY308795), Badis sp. (AY330976), Badis siamensis (AY330975), Badis ruber (AY330974), Badis pyema (AY330973), Badis kyar (AY330972), Badis khwae (AY330971), Badis kanabos (AY330970), Badis corycaeus (AY330969), Badis blosyrus (AY330968), Badis badis (AY330967), Badis assamensis (AY330966), Dario dario (AY330977), Dario hysginon (AY330978), Nandus nandus (AY330979), Colisa china (AF519735), Trichogaster leerii (AF519734), Betta macrostoma (AF519733), Betta ocellata (AF519732), Betta unimaculata (AF519731), Betta patoti (AF519730), Arothron hispidus (AY700367), Parosphromenus deissneri (AF519740). None of the other vertebrates analyzed have this amino acid, which seems to have been gained in most Actinopterygii.

- The actinopterygian Merluccius albidus (accession number AY308787) has an insertion of 14 amino acids between amino acid 695 and amino acid 710. A BLAST search of these 14 amino acids against the sequences in GenBank did not reveal similarity to any other available sequence (e.g., transposable elements or part of the Rag-1 intron of ray-finned fishes).

- The hyloid amphibians (Hyla meridionalis (AY523737, AY583339, AY571662), Hyla

Telmatobius bolivianus (AY583344), Ceratophrys ornata (AY364218), Phrynohyas venulosa (AY364215), Rhinoderma darwinii (AY364222), Litoria cerulea (AY323767), Bufo bufo (AY583336, AY323762), Bufo regularis (AY323763), Bufo melanostictus (AY364197), Dendrobates auratus (AY364214), Centrolene prosoblepon (AY364223), Agalychnis callidryas (AY323765), Leptodactylus mystacinus (AY323771), Leptodactylus fuscus (AY323770), Leptodactylus melanonotus (AY364224)), the ray-finned fish (Actinopterygii) and sharks (Chondrichthyes) have an additional amino acid in position 679. This amino acid is absent in the remaining amphibians and in the other vertebrates analyzed.

The most conserved part of the gene, as expected, is the second half, as can be seen from the sliding window analyses and from the mean intragroup uncorrected p-distances calculated on the total length of the gene (Figure 8.2; Table 8.1).

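| Group | N | Mean | Mean 1st cod. | Mean 2nd cod. | Mean 3rd cod. | Range |
|---|---|---|---|---|---|---|
| Amphibia no Xenopus | 88 | 18.0 | 9.3 | 3.9 | 41.3 | 0.0-27.0 |
| Amphibia 510 bp no Xenopus | 205 | 21.6 | 11.7 | 5.7 | 47.5 | 0.0-31.0 |
| Lepidosauria | 91 | 15.0 | 9.7 | 6.9 | 28.5 | 0.1-23.4 |
| Lepidosauria 510 bp | 122 | 12.8 | 6.9 | 4.3 | 27.0 | 0.0-21.0 |
| Actinopterygii | 75 | 12.6 | 7.2 | 3.7 | 27.0 | 0.1-36.8 |
| Actinopterygii 510 bp | 322 | 21.6 | 12.7 | 8.1 | 44.0 | 0.0-45.7 |
| Testudines | 21 | 5.6 | 3.5 | 1.9 | 11.5 | 0.9-10.0 |
| Testudines 510 bp | 23 | 5.5 | 2.7 | 1.5 | 12.1 | 0.2-10.7 |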

Table 8.1. Uncorrected p-distances calculated on the total length of the Rag-1 gene (3324 bp) and on the selected shorter fragment of 510 bp. Numbers are percentages, rounded up. N = number of sequences used. * indicates that a subset of the 579 sequences initially available was generated by randomly eliminating sequences with the method described in Materials and Methods.

The sharp drop observed in the sliding window analysis of Aves between windows 801-850 and 881-930 is due to the gaps inserted in the alignment at the position of the 20 additional amino acids (= 60 nucleotides) of the sequences of the genus Corvus. The other sharp drop observed in almost all the sliding window analyses is due to the gaps inserted in the alignment at the position of the additional 14 amino acids (= 42 bp) of Merluccius albidus. The low variability of Chondrichthyes, Testudines and Crocodylia is probably due to the low number of sequences available for this analysis. However, more recently diverged groups (like Aves) have less intragroup variability than older groups (like Actinopterygii) (Table 8.1). Given a ratio of divergence ages between Actinopterygii and Aves of about two (450 mya/228 mya), Actinopterygii have on average twice the variability of Aves (12.6/6.3 = 2), although the mean for Actinopterygii could be underestimated in this case because of the low number of sequences used. Moreover, based on the selected fragment (510 bp), there is a correspondence between the ages of divergence of the studied groups and their uncorrected intragroup p-distances (Table 8.1 and Results section on the subset of 510 bp).
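The comparison between Actinopterygii and Aves amounts to the following arithmetic, using the divergence ages listed in Materials and Methods and the mean uncorrected p-distances over the total gene length quoted in the text:

```latex
\[
\frac{450\ \text{mya}}{228\ \text{mya}} \approx 1.97,
\qquad
\frac{12.6\,\%}{6.3\,\%} = 2.0
\]
```

The two ratios are therefore of the same order, which is the sense in which variability is said to scale with divergence age here.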

Figure 8.2. Comparative sliding window analyses calculated on the total length of the Rag-1 gene (from bp 1 to bp 3324) for each vertebrate group analyzed. The values on the x-axes indicate the number of the sliding window analyzed. S on the y-axes indicates the number of polymorphic sites. Start and end points differ among groups because fragments of different lengths were available. See Materials and Methods for the number of sequences used in this analysis.

Comparison across taxa (sequence subset of 510 bp)

Base composition

The selected subset of 510 bp (= 170 amino acids, corresponding to the fragment between amino acid 551 and amino acid 721 of the total alignment of all vertebrates) spans, by comparison with humans and Pleurodeles waltl, part of the conserved homeodomain that binds the RSS (Frippiat et al. 2001) and part of the protein necessary for the interaction with Rag-2 (McMahan et al. 1997). For this reason, it is probably a highly conserved part of the Rag-1 gene. Amino acid length is similar in all studied groups if the sequence of Merluccius albidus is excluded from the alignment (Table 8.4). Among all the vertebrates analyzed for the above-mentioned subset of the gene, 4% of amino acids are conserved, 88% are variable and 74% are parsimony-informative.
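The categories used above (conserved, variable, parsimony-informative) can be illustrated with a short sketch that classifies each alignment column. It uses the standard definition of a parsimony-informative site (at least two states, each present in at least two sequences); the function names are hypothetical, and this is not the MEGA or PAUP* implementation. Gaps and `?` are treated as missing data, so a column consisting only of missing data is counted as conserved here.

```python
from collections import Counter


def classify_site(column, missing="-?"):
    """Classify one alignment column (nucleotides or amino acids)."""
    states = Counter(c for c in column if c not in missing)
    if len(states) <= 1:
        return "conserved"
    # States observed in at least two sequences
    shared = sum(1 for n in states.values() if n >= 2)
    return "informative" if shared >= 2 else "singleton"


def site_summary(seqs, missing="-?"):
    """Count conserved, variable and parsimony-informative columns."""
    kinds = Counter(classify_site(col, missing) for col in zip(*seqs))
    return {
        "sites": sum(kinds.values()),
        "conserved": kinds["conserved"],
        "variable": kinds["informative"] + kinds["singleton"],
        "parsimony_informative": kinds["informative"],
    }
```

Parsimony-informative sites are a subset of the variable sites, which is why the corresponding percentages are not expected to add up to 100.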

A heterogeneous base composition could give misleading results (Lockhart et al. 1994; Tarrio et al. 2000). Chi-square tests rejected the hypothesis of homogeneous base frequencies at the third codon position in various vertebrate groups (Figure 8.3). Base composition (Figure 8.3) reveals a significant dominance of guanine at the first codon position in all groups. In contrast, no general trend can be observed at the third codon position. Mammalia and Actinopterygii have a slightly higher GC content, while Aves and Amphibia have a slightly higher AT content. The most frequently used amino acids also differ among groups (Table 8.2). The least used amino acid is tryptophan in four out of eight groups. Consistently across all vertebrate groups, G was overrepresented and C underrepresented at first codon positions. Second positions had a positive bias towards T and A in all taxa, and third positions had a positive bias towards T in Amphibia, Chondrichthyes, Lepidosauria, Crocodylia and Testudines.
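A test of base-frequency homogeneity of the kind referred to above can be sketched as a Chi-square statistic on a taxa-by-base contingency table, optionally restricted to one codon position. This is an illustrative sketch and not the PAUP* implementation: the function names are hypothetical, the alignment is assumed to start in frame, and the degrees of freedom assume that all four bases occur in the data. The p-value would then be read from a Chi-square distribution with the returned degrees of freedom.

```python
def base_counts(seq, codon_pos=None):
    """Count A, C, G and T, optionally at one codon position (1, 2 or 3)."""
    counts = dict.fromkeys("ACGT", 0)
    for i, base in enumerate(seq.upper()):
        if codon_pos is not None and i % 3 != codon_pos - 1:
            continue
        if base in counts:  # gaps and ambiguous bases are ignored
            counts[base] += 1
    return counts


def chi_square_homogeneity(seqs, codon_pos=None):
    """Return (statistic, degrees of freedom) for base homogeneity."""
    table = [base_counts(s, codon_pos) for s in seqs]
    row_tot = [sum(row.values()) for row in table]
    col_tot = {b: sum(row[b] for row in table) for b in "ACGT"}
    grand = sum(row_tot)
    if grand == 0:
        return float("nan"), 0
    stat = 0.0
    for row, rt in zip(table, row_tot):
        for b in "ACGT":
            expected = rt * col_tot[b] / grand
            if expected > 0:
                stat += (row[b] - expected) ** 2 / expected
    df = (len(seqs) - 1) * 3  # (rows - 1) * (4 bases - 1)
    return stat, df
```

Because sequences in a real dataset are phylogenetically related, the observations are not independent, so a statistic of this kind is best read as a descriptive indicator of compositional heterogeneity.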

Figure 8.2. Base composition for the total length of the gene and separately for first, second and third codon positions. Numbers are percentages. * indicates a Chi-square probability of p<0.05, at which the null hypothesis of homogeneous base frequencies is rejected.

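| Group | Most used aa | Frequency | Least used aa | Frequency |
|---|---|---|---|---|
| Aves | Leu | 9.9 | Trp | 0.7 |
| Amphibia | Leu | 9.5 | Trp | 0.5 |
| Actinopterygii | Ser | 10.6 | Tyr | 0.9 |
| Chondrichthyes | Glu | 9.6 | His | 0.7 |
| Mammalia | Val | 10.7 | Trp | 0.3 |
| Lepidosauria | Leu | 10.3 | Trp | 0.7 |
| Crocodylia | Leu | 9.3 | Gln | 0.0 |
| Testudines | Leu | 10.1 | Gln | 0.5 |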

Table 8.2. Most and least represented amino acids in each vertebrate group studied, with their frequencies relative to the total number of amino acids.

In line with the lack of a general trend in base composition at the third codon position of the Rag-1 gene, the use of synonymous codons at the third position does not reveal a general trend either, neither within nor among the analyzed groups (Appendix 8.1).

Nucleotide substitution pattern

Based on the 510 bp fragment analyzed, lineages of older origin, such as ray-finned fish and amphibians, have, as expected, higher nucleotide variability and more parsimony-informative sites (Table 8.3) than younger groups such as mammals or birds. However, despite the large difference in uncorrected p-distances between, for example, Amphibia and Aves (x 2.5), the number of parsimony-informative sites at the third codon position is at a similar level among groups (except for groups represented by very few or very many sequences). The fact that a lineage as old as Chondrichthyes (in our analysis represented only by Elasmobranchii, with only one sequence of Batoidea) shows low variability in our data is probably due to the low number of sequences analyzed. However, it cannot be excluded that this low variability is related to a slower rate of molecular evolution in sharks compared with, for example, mammals (Martin 1999). The nucleotide variability is also reflected in the amino acid variability within each studied group (Table 8.4). The ratio of non-synonymous to synonymous substitutions (dN/dS) is below 1 in almost all cases, except in Amphibia when we include in our

could explain the high number of non-synonymous mutations observed in Amphibia.

In Amphibia, the number of parsimony-informative sites at the third codon position remains the same when the sequences of Xenopus are excluded from the analysis, while it decreases at the first and second codon positions (Table 8.4). Since changes at the first and second codon positions alter the amino acid more often than changes at the third codon position, this could explain the higher number of non-synonymous mutations in the complete Amphibia data set. Accordingly, after removing all of the Xenopus sequences from our analysis, the dN/dS ratio decreases to below 1. In Aves the high dN/dS ratio, despite the low uncorrected sequence divergence and amino acid variability compared with the other studied groups, could be an artifact of the large number of sequences analyzed (579). However, this deserves further investigation, with more attention to the possible sites of the gene under

Actinopterygii 322 75 25 12 63 98 71 152

Chondrichthyes 31 40 60 13 27 29 16 89

Mammalia 165 43 57 9 34 30 21 120

Lepidosauria 122 52 48 6 46 56 35 142

Crocodylia 13 6 94 3 3 5 2 10

Testudines 23 21 79 8 13 13 5 49

Table 8.3. Percentages of nucleotide variability, rounded up, for the 510 bp fragment of the Rag-1 gene analyzed. N = number of sequences used in the analysis.

N Variable

Amphibia no Xenopus 161 59 32 41 0-0.9 153

Actinopterygii 322 79 12 62 0-0.9 169

Table 8.4. dN/dS range, amino acid variability in percentages (rounded up) and amino acid length for the 510 bp fragment of the Rag-1 gene analyzed. N = number of sequences used in the analysis.

At the nucleotide level, as expected, the third codon position is more variable than the first and second. For mitochondrial genes, the third codon position becomes saturated at an uncorrected sequence divergence of about 10% in vertebrates (Moritz et al. 1992). The uncorrected sequence differences within each group could therefore suggest saturation at the third codon position at least in Actinopterygii and Amphibia (uncorrected distance above 30%, Table 8.5); however, this is not confirmed by the saturation plots (Figure 8.3).

Moreover, the selected 510 bp of the Rag-1 gene are able to recover most major vertebrate clades, such as Gymnophiona, Anura, Caudata, Neobatrachia, Lepidosauria, Crocodylia, Testudines, Aves, Actinopterygii, Amniota and Diapsida, in an NJ analysis (data not shown). Important clades not recovered as monophyletic are Amphibia and Aves+Crocodylia. Dipnoi and the coelacanth are not basal to tetrapods (thus monophyly of tetrapods is not recovered). For the fragment studied, our data also indicate a positive correlation between the mean or maximum uncorrected sequence divergence and the age of divergence (Figure 8.4).

Mean Mean

Amphibia 21.7 11.8 5.6 47.7 0.0-31.0 1.8 0.0-7.2 14.0 1.4

Amphibia no Xenopus 21.6 11.7 5.7 47.5 0.0-31.0 - - 14.0 1.3

Actinopterygii 21.6 12.7 8.1 44.0 0.0-45.7 2.8 0.0-18.2 25.0 1.5

Chondrichthyes 10.4 6.0 3.5 21.7 0.0-19.9 - - 14.0 2.6

Mammalia 13.5 4.9 3.4 32.3 0.0-26.2 1.4 0.0-5.3 21.0 2.3

Lepidosauria 12.8 6.9 4.3 27.0 0.0-21.0 1.5 0.0-4.1 17.0 1.8

Crocodylia 1.9 2.0 0.5 3.3 0.0-3.5 - - 14.0 5.3

Testudines 5.5 2.7 1.5 12.1 0.2-10.7 - - 10.0 2.8

Table 8.5. Uncorrected p-distances in percentages, rounded up, based on the 510 bp fragment of Rag-1 studied. * indicates that values were calculated on a subset of 351 sequences, as described in Materials and Methods. For groups with fewer than 40 sequences, variation within genera was not calculated.