3. Exact Matching of RNA secondary structures of viral genomes
3.4. Known families in clusters
[Figure 3.7: bar chart. Axes: Family name, Frequency (0 to 250). Series: Search using CMs from clusters; Clusters in RefSeq reference. Families shown: GP_knot1, s2m, HIV_PBS, rimP, Telomerase-vert, C4, Corona_FSE, TPP, Flavivirus_DB, tmRNA, RF_site3, IRES_Aptho, SBWMV1_UPD-PKb, cyano_tmRNA, Pox_AX_element, Tombus_IRE, Tymo_tRNA-like, cHP, sok, mir, STnc100, Other, tRNA.]
Figure 3.7: Results of the two search strategies that match known ncRNAs in Rfam with the occurrences of the clusters.
its own RNA family. This enables us to search for homologs in the full Rfam sequence database using our own CMs that are defined by the clusters. These newly generated CMs are further subjected to the previously mentioned training procedure so that the model can provide us with expectation values (E-values) for the bit scores. Subsequently, we have performed a local search with the tool cmsearch of the Infernal suite using an E-value cutoff of $10^{-2}$. A hit using this search suggests that the sequences in the respective cluster are related to the sequence of the Rfam database and can be classified as related to that RNA family.
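The search described above can be sketched as three Infernal calls per cluster followed by an E-value filter. This is a minimal illustration, not the pipeline actually used: the file names are hypothetical, the "training procedure" is assumed to correspond to cmcalibrate, and the column positions in the parser assume the default tabular output that cmsearch writes with --tblout in Infernal 1.1.

```python
import shlex

def infernal_commands(alignment, cm_path, database, tblout, evalue=1e-2):
    """Return argv lists for building, calibrating and searching with one cluster CM.

    All paths are hypothetical placeholders supplied by the caller.
    """
    return [
        ["cmbuild", cm_path, alignment],   # CM from the structure-annotated alignment
        ["cmcalibrate", cm_path],          # assumed to be the training step that enables E-values
        # cmsearch runs in local mode by default; -E sets the reporting E-value cutoff
        ["cmsearch", "-E", str(evalue), "--tblout", tblout, cm_path, database],
    ]

def parse_tblout(lines, evalue=1e-2):
    """Collect (target, query, bit score, E-value) for hits at or below the cutoff.

    Assumes the default --tblout layout: score in column 15, E-value in column 16.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split(None, 17)
        score, e = float(fields[14]), float(fields[15])
        if e <= evalue:
            hits.append((fields[0], fields[2], score, e))
    return hits

if __name__ == "__main__":
    for cmd in infernal_commands("cluster.sto", "cluster.cm", "rfam.fa", "hits.tbl"):
        print(shlex.join(cmd))
```

Passing each argv list to `subprocess.run(cmd, check=True)` would execute the steps in order, provided Infernal is installed and the input files exist.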
The results of the searches are shown in Figure 3.7. First, when we compare the number of clusters that have been found using the two search strategies (394 and 581) with the total number of clusters (61,281), it is apparent that with both strategies fewer than 1% of the clusters could be assigned to a known RNA family. Most of the assigned clusters belong to the tRNA family, the STnc100 family, whose ncRNAs either bind to proteins or bind to mRNAs to regulate gene expression, and the mir family of miRNA precursors. The tRNAs are found by both methods with a similar frequency, while the counts for the other, less frequent families diverge.
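The "fewer than 1%" figure follows directly from the counts quoted above. A short check, using only those three numbers (the labels are illustrative, since the text does not say which count belongs to which strategy):

```python
total_clusters = 61_281
assigned = {"394 clusters": 394, "581 clusters": 581}

# Share of all clusters that each strategy could assign to a known family.
shares = {label: count / total_clusters for label, count in assigned.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
# 394 clusters: 0.64%
# 581 clusters: 0.95%
```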
The frequent occurrence of tRNA clusters can be explained by their strong secondary structure conservation even with low primary sequence identity (see Chapter 5 for a detailed analysis).
The other diverging frequencies can be explained by the relatively small number of identified occurrences and by the different search strategies. The analysis of the search strategies might also explain why fewer than 1% of the clusters could be assigned to an RNA family. First, the CM that is built for each cluster is based on a structure-annotated multiple sequence alignment.
This alignment does not contain any gaps, because all sequences in a cluster fold into the same secondary structure. Often, the primary sequences of all cluster members are also very similar, which in the end results in a very restrictive CM that only matches sequences in the Rfam data set that also have similar primary sequence content. This might explain the low assignment rate of the second search strategy, but the first strategy, which looks for matches in the Rfam cross references, also could not assign many more clusters to a family. In general, new members of existing Rfam families are found using Infernal. In the first step of this process, all sequences of the manually curated seed alignment are aligned against a database of all targets using BLAST. All BLAST hits above a certain threshold are retained and are then searched using Infernal with all available family models. The first filtering step using BLAST exploits pure primary sequence similarity to prefilter potential hits, and this might result in the loss of potential family members with different primary sequences and similar secondary structures.
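To make the gapless, structure-annotated alignment concrete, the sketch below writes one cluster as a Stockholm file of the kind cmbuild accepts as input. It is an illustration under assumptions: the function name, the member names, and the toy sequences are hypothetical, and the shared structure is assumed to be given in dot-bracket notation, which is passed through unchanged as the SS_cons line.

```python
def is_balanced(structure):
    """Check that every '(' in a dot-bracket string has a matching ')'."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def cluster_to_stockholm(members, structure):
    """Format a cluster as a gapless Stockholm alignment with an SS_cons line.

    members:   dict mapping a sequence name (no whitespace) to its RNA sequence
    structure: dot-bracket string shared by all members
    """
    if not is_balanced(structure):
        raise ValueError("unbalanced dot-bracket structure")
    if {len(seq) for seq in members.values()} != {len(structure)}:
        # No gaps are allowed, so every sequence must match the structure length.
        raise ValueError("all sequences must have the same length as the structure")
    tag = "#=GC SS_cons"
    width = max(len(name) for name in list(members) + [tag])
    lines = ["# STOCKHOLM 1.0", ""]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in members.items()]
    lines.append(f"{tag.ljust(width)} {structure}")
    lines.append("//")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    toy_cluster = {"member_1": "GGGAAACCC", "member_2": "GCGAAAGCU"}
    print(cluster_to_stockholm(toy_cluster, "(((...)))"), end="")
```

Because every column is occupied in every member, a CM built from such a file has no insert or delete evidence to learn from, which is one way to see why the resulting models are restrictive.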
Figure 3.8 shows one of the approximately 250 tRNA clusters. The members of this cluster are mostly bacteriophages from different viral families that fold into the exact same secondary structure.
The strongly conserved tRNA secondary structure for 29 viral genomes can be explained by the universally conserved translational machinery, where the interaction of host ribosome, tRNA, mRNA, amino acid, and polypeptide chain requires specific structural patterns to function properly. Interestingly, Figure 3.8b shows that the primary sequences of all cluster members are
[Figure 3.8a: taxonomy tree of the cluster members; the tree layout is not reproduced here.]
Cluster members: Fowl adenovirus 5, Sulfitobacter phage pCB2047-B, Cyanophage Syn30, Vibrio phage pVp-1, Caulobacter phage phiCbK, Mycobacterium phage Angelica, Caulobacter phage CcrMagneto, Mycobacterium phage Artemis2UCLA, Erwinia phage phiEaH2, Mycobacterium phage Chy5, Mycobacterium phage Severus, Mycobacterium phage Rizal, Salmonella phage ViI, Cronobacter phage CR3, Prochlorococcus phage P-RSM4, Synechococcus phage S-SM2, Pseudomonas phage phiKZ, Pseudomonas phage EL, Bacillus phage SPO1, Staphylococcus phage GH15, Enterobacteria phage P1, Vibrio phage nt-1, Synechococcus phage metaG-MbCM1, Shigella phage Shfl2, Synechococcus phage S-MbCM6, Aeromonas phage 65, Acinetobacter phage 133, Ostreococcus tauri virus 2, Micromonas sp. RCC1109 virus MpV1.
Taxonomic groups shown: Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Aviadenovirus; Fowl adenovirus B; unclassified dsDNA phages; Caudovirales; Siphoviridae; T5likevirus; unclassified T5likevirus; unclassified Siphoviridae; Myoviridae; I3likevirus; unclassified I3-like viruses; Viunalikevirus; unclassified Myoviridae; Phikzlikevirus; Spounavirinae; Spounalikevirus; Twortlikevirus; unclassified Twortlikevirus; Punalikevirus; Tevenvirinae; Schizot4likevirus; T4likevirus; unclassified T4-like viruses; Phycodnaviridae; unclassified Phycodnaviridae; Prasinovirus; unclassified Prasinovirus.
(a) Taxonomy
[Figure 3.8b: drawing of the common secondary structure of all cluster members, positions 1 to 68, with each position labeled by an IUPAC nucleotide code.]
(b) Secondary structure
Anticodon  Frequency  Amino acid
GUU        6          Gln
UCU        5          Arg
CCA        5          Gly
UGU        4          Thr
GAU        4          Leu
CAU        4          Val
GAA        2          Leu
UUU        1          Lys
UUG        1          Asn
UGC        1          Thr
UAA        1          Ile
GUC        1          Gln
CCU        1          Gly
CAC        1          Val
(c) Anticodons
Figure 3.8: Example of a cluster that was assigned to the tRNA family. The sum of the frequencies in (c) is larger than the number of cluster members shown in (a), because some viruses have multiple genomic regions that are part of this cluster. (b) shows a graphic of the common secondary structure of all cluster members, including IUPAC nucleotide codes generated from the union of all nucleotides at each position.
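The two derived panels of Figure 3.8 can each be reproduced with a small helper. The sketch below is illustrative and not the code used for the figure. The first function builds the IUPAC label for each position as the union of the nucleotides observed there. The second maps an anticodon to an amino acid under the standard genetic code, assuming the anticodons in the table are written 3' to 5' so that the codon is the plain complement; this reading is an assumption, chosen because it reproduces the amino acids listed in the table. All function and variable names are hypothetical.

```python
from itertools import product

# IUPAC RNA codes and the nucleotide sets they stand for.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}
SET_TO_CODE = {frozenset(nts): code for code, nts in IUPAC_SETS.items()}

def iupac_union(sequences):
    """Return one IUPAC code per column of a gapless alignment."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    return "".join(SET_TO_CODE[frozenset(column)] for column in zip(*sequences))

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Stop",
}
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def amino_acid(anticodon_3_to_5):
    """Amino acid for an anticodon that is assumed to be written 3' to 5'."""
    codon = anticodon_3_to_5.translate(COMPLEMENT)  # codon read 5' to 3'
    return THREE_LETTER[CODON_TABLE[codon]]

if __name__ == "__main__":
    print(iupac_union(["GCAU", "GCGU", "GUCU"]))  # GYVU
    print(amino_acid("GUU"))                      # Gln
```

If the anticodons were instead written 5' to 3', the codon would be the reverse complement and several of the listed amino acids would no longer match, so the orientation assumption matters when reusing the table.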
highly diverse, which results in multiple anticodons that are present in different cluster members (see Figure 3.8c). Each anticodon is specific for one amino acid, but the list of represented amino acids does not follow any observable pattern. We have observed similar characteristics for the other clusters assigned to the tRNA family. The high frequency of assigned tRNAs is in accordance with studies that have shown that tRNA genes are the only translation-associated genes frequently found in bacteriophages. Furthermore, these studies found that frequently occurring tRNAs in phage genomes tend to correspond to codons that are highly used by the phage genes but are rare in host genes [8, 21]. Using the data gathered in Chapter 3, we can neither confirm nor refute this theory, because the host associations provided by the NCBI Viral Genomes