Genome Organization - Computational Exploration of Virus Diversity on Transcriptomic Datasets

2.3 TRAVIS

2.3.5 Genome Organization

The genome organization contributes substantially to the classification of Reoviridae, whereas pure sequence similarity plays a minor role (see chapter 2.3.1; Upadhyaya et al., 1998; Graham et al., 2006; Deng et al., 2012). Since the transcriptome assembly was targeted towards the host rather than towards extracting viruses, the settings were most likely not ideal for viral sequences. Assembling viral sequences is a general problem simply because of their high intra-species variation (Eriksson et al., 2008; Yang et al., 2012). These problems reduce the probability of finding a fully assembled virus within the transcriptomes.

However, a large proportion of fragmentarily assembled viral genomes was expected, so a method to estimate the size and the overall organization of the genome had to be developed. It is important to note that the approach described in this chapter is highly experimental and not yet part of the pipeline. It is a first simple attempt to make the sequence evaluation more meaningful and comparable between samples.

Two properties of the potential viral sequences that are useful for genome estimation can be derived from the output of TRAVIS. The first is the closest known relative. If the non-redundant protein database used for the reciprocal BLAST is up to date, the most recent publicly available closest relative can be found. The second is the position of the match between the suspicious sequence and its closest known relative. Based on these two properties, the completeness of the genome, or at least of a segment, of the potential new virus can be estimated by following the concept of reference mapping. For example, if a transcript of 1000 bp matches well starting from position 1000 of a virus with a length of 3000 bp, the new virus is probably missing about 1000 bp at the beginning as well as at the end of the sequence. The same principle can be applied to, e.g., three different fragments that match different regions of the same reference virus. If one fragment matches the start, the second the middle and the third the end of the reference, the fragments likely represent a nearly full segment in which the connecting regions of the new virus have not been properly sequenced or assembled.
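The reference-mapping idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` structure and the function names are hypothetical and not part of TRAVIS, and match positions are assumed to be 1-based and inclusive on the reference.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One fragment matched to the reference (hypothetical structure)."""
    ref_start: int  # first matched reference position, 1-based
    ref_end: int    # last matched reference position, inclusive


def missing_regions(hits, ref_length):
    """Return the reference intervals (1-based, inclusive) covered by no hit."""
    gaps = []
    next_uncovered = 1
    for hit in sorted(hits, key=lambda h: h.ref_start):
        if hit.ref_start > next_uncovered:
            gaps.append((next_uncovered, hit.ref_start - 1))
        next_uncovered = max(next_uncovered, hit.ref_end + 1)
    if next_uncovered <= ref_length:
        gaps.append((next_uncovered, ref_length))
    return gaps


def completeness(hits, ref_length):
    """Fraction of the reference that is covered by at least one hit."""
    missing = sum(end - start + 1 for start, end in missing_regions(hits, ref_length))
    return 1 - missing / ref_length


# A 1000 bp transcript in the middle of a 3000 bp reference
# (positions 1001-2000 chosen so that exactly 1000 bp are missing on each side).
single = [Hit(1001, 2000)]
# Three fragments matching the start, the middle and the end of the reference.
three = [Hit(1, 800), Hit(1200, 1900), Hit(2400, 3000)]
```

With these inputs, `missing_regions(single, 3000)` reports the two flanking regions, while `missing_regions(three, 3000)` reports only the two connecting regions between the fragments.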

However, mapping or aligning the suspicious sequences to the reference viruses at the nucleotide level is very difficult and error-prone if the sequences are very distant from each other. Since the identification and verification of the suspicious sequences is already based on the well-alignable regions of the particular ORFs, similar methods should make reference mapping possible based on the respective amino acid sequences. To achieve that, the suspicious ORFs were aligned with the corresponding ORF of the reference by MAFFT at the amino acid level. Pal2Nal was then used to map the original nucleotide sequences onto the amino acid alignment, which yields a complete nucleotide alignment. The suspicious sequences were then used to calculate a consensus sequence with FASconCAT-G in order to obtain the complete estimated sequence, including gaps that indicate the missing regions. Additionally, a consensus sequence was calculated for the amino acid alignment to obtain an estimate of the protein as well (see Fig. 16 and Fig. 17).
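To make the two processing steps after the amino acid alignment more concrete, the following Python sketch threads codons onto a gapped protein sequence and then builds a majority-rule consensus. It is a simplified stand-in, not the behaviour of Pal2Nal or FASconCAT-G: the function names are hypothetical, the coding sequence is assumed to be in frame without a stop codon, no check is made that codons actually translate to the aligned residues, and columns without any residue are marked with `?`.

```python
from collections import Counter


def backtranslate(aligned_protein, cds):
    """Thread the codons of `cds` onto a gapped protein sequence.

    Every residue takes the next codon; every gap '-' becomes '---'.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    codons = iter(cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


def consensus(aligned_seqs, missing="?"):
    """Most frequent non-gap character per column; `missing` if only gaps."""
    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(c for c in column if c != "-")
        result.append(counts.most_common(1)[0][0] if counts else missing)
    return "".join(result)


# Two fragments covering different ends of a hypothetical reference ORF.
protein_alignment = ["MK----", "----VG"]
coding_sequences = ["ATGAAA", "GTTGGA"]

nucleotide_alignment = [
    backtranslate(p, c) for p, c in zip(protein_alignment, coding_sequences)
]
protein_estimate = consensus(protein_alignment)
nucleotide_estimate = consensus(nucleotide_alignment)
```

In this toy case the uncovered middle of the ORF shows up as question marks in both estimates, which mirrors the representation of missing parts in Fig. 17.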

To make the generated sequences comparable and to provide an additional objective measure for the obtained consensus sequences, the Gapless Forced Alignment Score (GFAS) was introduced. In essence, it is the identity of two given sequences expressed as a percentage.

In contrast to the more sophisticated BLAST, GFAS scores the complete sequences based on a pairwise alignment that strongly penalizes gaps. Compared to BLAST, GFAS thus yields lower scores and does not account for indels or ambiguities. The alignment is created with MAFFT using high gap penalties. The number of alignment positions at which both sequences have an identical character state is counted and divided by the number of positions at which both sequences have a character state other than a gap.
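The counting rule described above can be written down directly. The following Python sketch assumes that the pairwise alignment has already been produced (for example by MAFFT with high gap penalties) and is passed in as two equally long gapped strings. The function name `gfas` and the choice to return 0.0 when no position is comparable are assumptions made for this illustration.

```python
def gfas(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns containing a gap are not scored
        compared += 1
        if a == b:
            identical += 1
    # Assumption: a pair without any comparable column scores 0.
    return 100.0 * identical / compared if compared else 0.0


# Five gap-free columns, four of them identical.
example_score = gfas("MKT-LV", "MRTALV")
```

Characters are compared by exact match, so an ambiguity code in one sequence counts as a mismatch against any other character.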

To test the explanatory power of GFAS, simulations were performed. For that, one million pairs of random sequences with lengths between 1 and 10000 amino acids were created. The GFAS of each pair was calculated; the median was 4% GFAS with an upper quartile of 5% GFAS. These statistics, in combination with the density estimate (see Fig. 18), suggest that GFAS-identities above 5% probably indicate non-randomness.
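A much reduced version of such a simulation can be sketched in Python. It differs from the simulation described above in several assumed ways: residues are drawn uniformly from the 20 standard amino acids, the two sequences are compared position by position without any MAFFT alignment, the length range is narrower and far fewer pairs are drawn. Under these assumptions the expected chance identity per compared position is 1/20, i.e. 5%. All names are hypothetical.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng):
    """Random sequence with uniform residue frequencies (an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def gapless_identity(seq_a, seq_b):
    """Percent identity over the common prefix length, without any alignment."""
    n = min(len(seq_a), len(seq_b))
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / n


def simulate(n_pairs=500, min_len=50, max_len=500, seed=1):
    """Identity scores for `n_pairs` random sequence pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = random_protein(rng.randint(min_len, max_len), rng)
        b = random_protein(rng.randint(min_len, max_len), rng)
        scores.append(gapless_identity(a, b))
    return scores


scores = simulate()
median_score = statistics.median(scores)
quartiles = statistics.quantiles(scores, n=4)  # [lower, median, upper]
```

The median and the upper quartile of `scores` can then be compared with a candidate threshold; the exact values depend on the seed, the length range and the assumed residue frequencies.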

procedure GenomeEstimation(SuspiciousSequences, Reference)
    Align AminoAcidSequences into AminoAcidAlignment
    GenerateConsensus AminoAcidAlignment    ◃ FASconCAT-G
    CalculateGaplessForcedAlignmentScore AminoAcidConsensus vs Reference
    ReverseTranslate AminoAcidAlignment into NucleotideAlignment    ◃ Pal2Nal
    GenerateConsensus NucleotideAlignment    ◃ FASconCAT-G
end procedure

Figure 16: Genome Estimation Algorithm.


[Figure 17, schematic. Labels: Virus Database (NCBI); Full NCBI Database; Reference Gene; Reference Virus; Sample Sequences A, B and C carrying ORF_001 to ORF_006; a row of question marks for the regions of the estimated genome that are not covered by any sample sequence.]

Figure 17: Core Concept of the Genome Estimation.

Depicted is an example with three suspicious sequences ('Sample Sequence A', 'B' and 'C') that are presumed to be most closely related to the 'Reference Virus'. It is a summarized plot of the different sequences that are part of the output of TRAVIS Scavenger. Each sample sequence matches a different region of the reference virus. Since the matching regions are unambiguous in this case, the missing parts of the potential new virus can be estimated based on the reference virus. The missing parts are represented as question marks.


Figure 18: Simulation of GFAS-identities for Randomized Sequences.

Density estimates of GFAS-identities for one million randomly drawn amino acid sequences of up to 10000 amino acids in length. The lengths of the simulated sequences were distributed such that nearly all potential lengths were covered (above). GFAS-identities peaked at 4% to 5%, suggesting that sequence similarity deviates from random chance above 5% GFAS-identity (below).
