3. Exact Matching of RNA secondary structures of viral genomes

3.4. Known families in clusters

[Figure 3.7: plot of frequency per Rfam family. Panel titles: "Search using CMs from clusters" and "Clusters in RefSeq reference". Axes: Family name, Frequency (0 to 250). Families shown: GP_knot1, s2m, HIV_PBS, rimP, Telomerase-vert, C4, Corona_FSE, TPP, Flavivirus_DB, tmRNA, RF_site3, IRES_Aptho, SBWMV1_UPD-PKb, cyano_tmRNA, Pox_AX_element, Tombus_IRE, Tymo_tRNA-like, cHP, sok, mir, STnc100, Other, tRNA.]

Figure 3.7.: Results of the two search strategies that match known ncRNAs in Rfam with the occurrences of the clusters.


its own RNA family. This enables us to search for homologs in the full Rfam sequence database using our own CMs, which are defined by the clusters. These newly generated CMs are then subjected to the previously mentioned training procedure so that each model can provide expectation values (E-values) for the bit scores. Subsequently, we performed a local search with the tool cmsearch of the Infernal suite using an E-value cutoff of 102. A hit in this search suggests that the sequences in the respective cluster are related to the matching sequence of the Rfam database and can be classified as related to that RNA family.
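The build, calibrate, and search steps described above can be sketched as a small helper that assembles the three Infernal command lines for one cluster. This is only an illustration: the file names, the function name `infernal_commands`, and the `--tblout` output file are assumptions, and the E-value cutoff is left as a parameter rather than fixed to the value used in the text.

```python
from pathlib import Path


def infernal_commands(cluster_sto, rfam_fasta, e_value):
    """Return cmbuild, cmcalibrate and cmsearch command lines for one cluster.

    cluster_sto : structure-annotated Stockholm alignment of the cluster
                  (hypothetical file name supplied by the caller)
    rfam_fasta  : FASTA file with the target sequences (hypothetical)
    e_value     : E-value cutoff passed to cmsearch (illustrative parameter)
    """
    cm_file = Path(cluster_sto).with_suffix(".cm")
    hits = Path(cluster_sto).with_suffix(".tbl")
    return [
        # Build a covariance model from the structure-annotated alignment.
        ["cmbuild", str(cm_file), str(cluster_sto)],
        # Calibrate the model so that bit scores can be turned into E-values.
        ["cmcalibrate", str(cm_file)],
        # Search the target database; cmsearch aligns locally by default.
        ["cmsearch", "-E", str(e_value), "--tblout", str(hits),
         str(cm_file), str(rfam_fasta)],
    ]


if __name__ == "__main__":
    for cmd in infernal_commands("cluster_0001.sto", "Rfam.fa", 0.01):
        print(" ".join(cmd))
```

The commands are returned as argument lists so that they could be passed to `subprocess.run` one after another. Nothing is executed here, so the sketch runs without Infernal being installed.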

The results of the searches are shown in Figure 3.7. First, comparing the number of clusters found by the two search strategies (394 and 581) with the total number of clusters (61,281) shows that each strategy could assign less than 1% of the clusters to a known RNA family. Most of the assigned clusters belong to the tRNA family, the STnc100 family, whose ncRNAs can bind to either proteins or mRNAs and thereby regulate gene expression, and the mir family of miRNA precursors. The tRNAs are found by both methods with a similar frequency, while the counts for the other, less frequent families diverge.
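The "less than 1%" statement follows directly from the three counts given above. A short check, in which the labels `strategy_a` and `strategy_b` are placeholders because the text does not say which count belongs to which strategy:

```python
TOTAL_CLUSTERS = 61_281

# Assigned cluster counts as given in the text; the keys are placeholder labels.
assigned = {"strategy_a": 394, "strategy_b": 581}

# Percentage of all clusters that each strategy could assign to a known family.
percent_assigned = {
    name: 100.0 * count / TOTAL_CLUSTERS for name, count in assigned.items()
}

if __name__ == "__main__":
    for name, pct in percent_assigned.items():
        print(f"{name}: {pct:.2f}% of clusters assigned")
```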

The frequent occurrence of tRNA clusters can be explained by their strong secondary structure conservation even with low primary sequence identity (see Chapter 5 for a detailed analysis).

The other diverging frequencies can be explained by the relatively small number of identified occurrences and by the differences between the search strategies. An analysis of the search strategies might also explain why less than 1% of the clusters could be assigned to an RNA family. First, the CM that is built for each cluster is based on a structure-annotated multiple sequence alignment. This alignment does not contain any gaps, because all sequences in a cluster fold into the same secondary structure. Often, the primary sequences of all cluster members are also very similar, which results in a very restrictive CM that only matches sequences in the Rfam data set with similar primary sequence content. This might explain the low assignment rate of the second search strategy, but the first strategy, which looks for matches in the Rfam cross references, also could not assign many more clusters to a family. In general, new members of existing Rfam families are found using Infernal. In the first step of this process, all sequences of the manually curated seed alignment are aligned against a database of all targets using BLAST. All BLAST hits above a certain threshold are retained and then searched using Infernal with all available family models. This first filtering step exploits pure primary sequence similarity to prefilter potential hits, which might result in the loss of potential family members that have different primary sequences but similar secondary structures.
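The effect of a sequence-only prefilter can be illustrated with a toy example. It is not BLAST and not the Rfam pipeline: the ungapped identity measure, the 0.5 threshold, and both sequences are assumptions chosen to show how a candidate that is fully compatible with the seed's secondary structure can still be discarded before any structure-aware search sees it.

```python
# Base pairs treated as valid: Watson-Crick pairs plus the G-U wobble pair.
VALID_PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def is_compatible(seq, structure):
    """True if every bracket pair in the dot-bracket string can pair in seq."""
    if len(seq) != len(structure):
        return False
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                return False  # unbalanced closing bracket
            j = stack.pop()
            if (seq[j], seq[i]) not in VALID_PAIRS:
                return False
    return not stack  # leftover opening brackets mean an unbalanced structure


def passes_prefilter(seed, candidate, threshold=0.5):
    """Toy stand-in for a sequence-similarity prefilter (threshold is arbitrary)."""
    return identity(seed, candidate) >= threshold


STRUCTURE = "((((....))))"
SEED = "GGGGAAAACCCC"
CANDIDATE = "AUAUGCGCAUAU"  # same hairpin shape, no position identical to SEED

if __name__ == "__main__":
    print("compatible with structure:", is_compatible(CANDIDATE, STRUCTURE))
    print("identity to seed:", identity(SEED, CANDIDATE))
    print("survives prefilter:", passes_prefilter(SEED, CANDIDATE))
```

In this constructed case the candidate can form the same hairpin as the seed but shares no identical position with it, so a filter based on primary sequence alone removes it.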

Figure 3.8 shows one of the ≈250 tRNA clusters. The members of this cluster are mostly bacteriophages from different viral families, and their sequences fold into the exact same secondary structure.

The strongly conserved tRNA secondary structure for 29 viral genomes can be explained by the universally conserved translational machinery, where the interaction of host ribosome, tRNA, mRNA, amino acid, and polypeptide chain requires specific structural patterns to function properly. Interestingly, Figure 3.8b shows that the primary sequences of all cluster members are

[Figure 3.8a: taxonomy tree of the cluster members. The tree layout is not reproduced here; the leaf and group labels are listed below.]

Cluster members: Fowl adenovirus 5; Sulfitobacter phage pCB2047-B; Cyanophage Syn30; Vibrio phage pVp-1; Caulobacter phage phiCbK; Mycobacterium phage Angelica; Caulobacter phage CcrMagneto; Mycobacterium phage Artemis2UCLA; Erwinia phage phiEaH2; Mycobacterium phage Chy5; Mycobacterium phage Severus; Mycobacterium phage Rizal; Salmonella phage ViI; Cronobacter phage CR3; Prochlorococcus phage P-RSM4; Synechococcus phage S-SM2; Pseudomonas phage phiKZ; Pseudomonas phage EL; Bacillus phage SPO1; Staphylococcus phage GH15; Enterobacteria phage P1; Vibrio phage nt-1; Synechococcus phage metaG-MbCM1; Shigella phage Shfl2; Synechococcus phage S-MbCM6; Aeromonas phage 65; Acinetobacter phage 133; Ostreococcus tauri virus 2; Micromonas sp. RCC1109 virus MpV1.

Taxonomic groups: Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Aviadenovirus; Fowl adenovirus B; unclassified dsDNA phages; Caudovirales; Siphoviridae; T5likevirus; unclassified T5likevirus; unclassified Siphoviridae; Myoviridae; I3likevirus; unclassified I3-like viruses; Viunalikevirus; unclassified Myoviridae; Phikzlikevirus; Spounavirinae; Spounalikevirus; Twortlikevirus; unclassified Twortlikevirus; Punalikevirus; Tevenvirinae; Schizot4likevirus; T4likevirus; unclassified T4-like viruses; Phycodnaviridae; unclassified Phycodnaviridae; Prasinovirus; unclassified Prasinovirus.

(a) Taxonomy

[Figure 3.8b: drawing of the common secondary structure of the cluster members, positions 1 to 68, with an IUPAC nucleotide code at each position.]

(b) Secondary structure

Anticodon  Frequency  Amino acid
GUU        6          Gln
UCU        5          Arg
CCA        5          Gly
UGU        4          Thr
GAU        4          Leu
CAU        4          Val
GAA        2          Leu
UUU        1          Lys
UUG        1          Asn
UGC        1          Thr
UAA        1          Ile
GUC        1          Gln
CCU        1          Gly
CAC        1          Val

(c) Anticodons

Figure 3.8.: Example of a cluster that was assigned to the tRNA family. The sum of the frequencies in (c), 37, is larger than the number of cluster members shown in (a), 29, because some viruses have multiple genomic regions that are part of this cluster. (b) shows the common secondary structure of all cluster members; each position is labeled with the IUPAC nucleotide code for the union of all nucleotides observed at that position.
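The IUPAC labeling mentioned in the caption can be sketched as follows, assuming the cluster members are available as equal-length RNA strings, which holds for a gapless alignment. The function name `iupac_union` and the example sequences are illustrative and not taken from the cluster in Figure 3.8.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("U"): "U",
    frozenset("AG"): "R", frozenset("CU"): "Y",
    frozenset("CG"): "S", frozenset("AU"): "W",
    frozenset("GU"): "K", frozenset("AC"): "M",
    frozenset("CGU"): "B", frozenset("AGU"): "D",
    frozenset("ACU"): "H", frozenset("ACG"): "V",
    frozenset("ACGU"): "N",
}


def iupac_union(seqs):
    """Per-position IUPAC code for the union of bases in a gapless alignment."""
    # Normalise case and treat DNA-style T as U.
    rna = [s.upper().replace("T", "U") for s in seqs]
    if len({len(s) for s in rna}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    # zip(*rna) yields one alignment column per position.
    return "".join(IUPAC[frozenset(column)] for column in zip(*rna))


if __name__ == "__main__":
    # Made-up toy alignment of three sequences.
    print(iupac_union(["GAUC", "GGUA", "GCUU"]))
```

A position labeled N therefore means that all four nucleotides occur among the cluster members at that position, which is how the high primary sequence diversity shows up in panel (b).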

highly diverse, which results in multiple anticodons being present among the cluster members (see Figure 3.8c). Each anticodon is specific for one amino acid, but the list of represented amino acids does not follow any observable pattern. We have observed similar characteristics for the other clusters assigned to the tRNA family. The high frequency of clusters assigned to tRNAs is consistent with studies showing that tRNAs are the only translation-associated genes frequently found in bacteriophages. Furthermore, these studies found that frequently occurring tRNAs in phage genomes tend to correspond to codons that are highly used by the phage genes but are rare in host genes [8, 21]. Using the data gathered in Chapter 3, we can neither confirm nor refute this theory, because the host associations provided by the NCBI Viral Genomes