4 Discussion

4.1 Development of computational tools for the transcriptome-wide analysis

In addition to the long-known “epigenetic” methylations in DNA and various post-translational modifications of proteins (e.g. phosphorylation, ubiquitination and acetylation), chemical modifications are also found in most cellular RNAs. A wide variety of such RNA modifications exists in nature and, in general, they expand the chemical properties of the four basic nucleotides and can thereby regulate the functions of the RNAs that carry them. Many enzymes that mediate RNA modifications contain conserved protein domains that harbour their catalytic activity. Although the enzymes responsible for introducing some RNA modifications are known, the specific substrate RNAs and target nucleotides of many other putative RNA-modifying enzymes remain to be determined.

The human genome encodes seven 5-methylcytosine (m5C) RNA methyltransferases that belong to the Nol1/Nop2/SUN (NSUN) family (reviewed in Motorin et al., 2010).

So far, proteins of this family have been linked to modifications at specific sites in cytoplasmic tRNAs (NSUN2; Brzezicha et al., 2006), mitochondrial and cytoplasmic rRNAs (NSUN4, NSUN1 and NSUN5; Camara et al., 2011; Schosserer et al., 2015; Sharma et al., 2013b) and enhancer RNAs (NSUN7; Aguilo et al., 2016), but the targets of the NSUN6 and NSUN3 m5C RNA methyltransferases remained elusive. A strategy that can be used for the identification of the RNAs associated with RNA-binding proteins is in vivo cross-linking followed by isolation of RNA-protein complexes, isolation of RNA and deep sequencing of a corresponding cDNA library (Bohnsack et al., 2012).

This approach (CRAC) was employed in the Bohnsack lab to identify RNA interaction partners of NSUN6 and NSUN3 (Haag et al., 2016; Haag et al., 2015b). However, identifying the cellular RNAs bound by the proteins with this method requires that the obtained sequence reads are quality-controlled and mapped to a well-annotated version of the human genome. Therefore, bioinformatic algorithms and mapping tools were employed and further developed to generate a systematic pipeline specifically adapted for the mapping and analysis of CRAC data derived from human cells.

Many newly developed techniques for the transcriptome-wide mapping of RNA modifications, for determining the RNA interactome of RNA-binding proteins and for the analysis of gene expression are based on the analysis of next-generation sequencing data (Bohnsack et al., 2012; Darnell, 2012; Hafner et al., 2010; Ingolia et al., 2009; Krogh et al., 2016; Nagalakshmi et al., 2008). For example, RNA-Seq, in which total cellular RNA (depleted of ribosomal RNA) is isolated, fragmented and sequenced, is used to investigate the transcriptome of a cell population at a given time under certain conditions (Nagalakshmi et al., 2008), as the relative number of unique sequence reads mapping to individual genes provides a quantitative measure of the expression level of the corresponding mRNAs. Similarly, ribosome profiling (Ingolia et al., 2009) can provide a snapshot of the mRNAs that are being translated in a given cell population by enabling sequencing and identification of ribosome-associated mRNAs. The analysis of CRAC data relies on similar principles, as the sub-population of cellular RNA that is attached to the protein of interest is isolated, sequenced and mapped to the genome. The accumulation of multiple sequence reads mapping to a specific region of the genome then indicates binding of the protein to the corresponding RNA transcript. However, because RNA modifications can occur in the majority of transcripts and are highly abundant in non-coding RNAs, a bioinformatic pipeline for the mapping of CRAC data generated for RNA modification enzymes requires, in contrast to approaches for the analysis of RNA-Seq and ribosome profiling data, a well-annotated and complete reference genome or transcriptome. Another difference between the analysis of CRAC data and the analysis of gene expression by RNA-Seq is that, in CRAC, the cross-linked RNA-protein complexes are purified on matrices, and non-specific binding of RNAs to such beads can lead to background. Alternatives to the standard cross-linking with UV light at 254 nm, such as cross-linking with light at 365 nm after treatment of the cells with 4-thiouridine (PAR-CRAC), can be used to increase the specificity of cross-linking. Furthermore, modules were developed within the bioinformatic pipeline to enable sorting and mapping of only those reads that contain specific mutations introduced by the direct cross-linking of the RNA and protein, thereby significantly reducing the non-specific background in the final output of the analysis pipeline.
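The mutation-based filtering described above can be illustrated with a minimal Python sketch. It is not the module used in the pipeline. It assumes that the diagnostic mutation is a T-to-C conversion, as is commonly reported for 4-thiouridine-based cross-linking, and that each read has already been aligned without gaps, so that read and reference sequences have equal length. The record layout and the function names `has_conversion` and `filter_crosslinked` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (read id, aligned read sequence, reference sequence).
# Both sequences are assumed to be gapless and of equal length.
AlignedRead = Tuple[str, str, str]


def has_conversion(read_seq: str, ref_seq: str,
                   ref_base: str = "T", read_base: str = "C") -> bool:
    """Return True if any position has ref_base in the reference and read_base in the read."""
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference must have the same aligned length")
    return any(
        r == ref_base and q == read_base
        for q, r in zip(read_seq.upper(), ref_seq.upper())
    )


def filter_crosslinked(reads: Iterable[AlignedRead]) -> List[str]:
    """Keep the ids of reads carrying at least one T-to-C conversion (assumed signature)."""
    return [read_id for read_id, q, r in reads if has_conversion(q, r)]


example_reads = [
    ("r1", "ACGCA", "ACGTA"),  # T-to-C at the fourth position: kept
    ("r2", "ACGTA", "ACGTA"),  # perfect match: discarded
    ("r3", "AGGTA", "ACGTA"),  # C-to-G mismatch only: discarded
]
kept = filter_crosslinked(example_reads)
```

In this toy input only `r1` passes the filter. Sequencing errors can also produce T-to-C mismatches, so a real filter would probably need additional criteria such as base quality.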

Also in contrast to RNA-Seq, in which only the number of reads mapping to the genes coding for individual transcripts is considered, one of the aims of CRAC is to identify the specific binding site of the protein on the RNA transcript. In the case of the RNA methyltransferases NSUN6 and NSUN3, analysis of the read distribution between different classes of RNA transcript and between different tRNA genes suggested that the cytoplasmic tRNAs tRNACys and tRNAThr are bound by NSUN6 and the mitochondrial tRNAMet is associated with NSUN3. These putative target RNAs were confirmed by additional in vivo experiments, but close analysis of the distribution of mapped sequence reads on the tRNA sequences also provided the basis for the identification of the modification target nucleotides of these enzymes. These could subsequently be determined by mutational analysis combined with in vitro methylation assays (Haag et al., 2015; Haag et al., 2016). The identification of the specific binding sites of proteins on their target RNAs is especially relevant for characterisation of proteins that contact the (pre-)ribosomal RNAs and for such proteins, additional scripts were added to the basic CRAC pipeline to enable mapping of the obtained sequence reads onto the available 2D structures of the mature rRNAs and the 3D structure of the human 80S ribosome.
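The analysis of the read distribution along a single transcript can be sketched as a per-nucleotide coverage profile. The Python example below is an illustrative sketch rather than the script used in this work. It assumes that mapped reads are available as 0-based, half-open (start, end) intervals on one transcript. The function names `coverage_profile` and `peak_positions` are hypothetical.

```python
from typing import List, Sequence, Tuple


def coverage_profile(length: int, reads: Sequence[Tuple[int, int]]) -> List[int]:
    """Count, for each nucleotide of a transcript, how many reads cover it.

    reads are 0-based, half-open (start, end) intervals (an assumed convention).
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        start, end = max(0, start), min(length, end)
        if start < end:
            diff[start] += 1   # coverage rises where the read begins
            diff[end] -= 1     # and falls just after it ends
    profile, running = [], 0
    for delta in diff[:length]:
        running += delta
        profile.append(running)
    return profile


def peak_positions(profile: Sequence[int]) -> List[int]:
    """Return all positions that share the maximal, non-zero coverage."""
    top = max(profile, default=0)
    return [i for i, depth in enumerate(profile) if top > 0 and depth == top]


profile = coverage_profile(10, [(0, 5), (3, 8), (4, 6)])
peaks = peak_positions(profile)
```

The peak of such a profile narrows down a candidate binding region. As described above, the modified nucleotide itself still has to be determined experimentally.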

Such modelling significantly helps the interpretation of the obtained CRAC data, as it reveals whether multiple cross-linking sites that are distant in the linear sequence of a particular rRNA come into close proximity to each other in the folded RNA.

It also enables the identification of other features in close proximity to the protein cross-linking sites, such as RNA modifications and the binding sites of other proteins, that need to be considered in the context of the assembled RNP.
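The proximity check on a folded RNA can be sketched as a pairwise distance calculation. The Python example below is a minimal illustration, not the script added to the pipeline. It assumes that one representative 3D coordinate per nucleotide (for example the phosphate atom) has already been extracted from a structure file. The 20 Å cutoff, the toy coordinates and the function name `close_pairs` are arbitrary, hypothetical choices.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def close_pairs(coords: Dict[int, Coord], sites: List[int],
                cutoff: float = 20.0) -> List[Tuple[int, int, float]]:
    """Return pairs of cross-linking sites whose 3D distance is at most cutoff.

    coords maps a nucleotide number to one representative coordinate
    (an assumed input; units are whatever the structure file uses, typically Å).
    """
    pairs = []
    for a, b in combinations(sorted(sites), 2):
        distance = math.dist(coords[a], coords[b])
        if distance <= cutoff:
            pairs.append((a, b, round(distance, 1)))
    return pairs


# Toy coordinates: nucleotides 10 and 500 are far apart in sequence but close in space.
toy_coords = {10: (0.0, 0.0, 0.0), 500: (3.0, 4.0, 0.0), 900: (100.0, 0.0, 0.0)}
neighbours = close_pairs(toy_coords, [10, 500, 900])
```

The same distance test could be applied between cross-linking sites and other annotated features, such as modified nucleotides, if their coordinates are available.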

In the case of proteins that cross-link to mRNAs, one limitation of the current pipeline is the simplified annotation of the protein-coding genes in the genome version to which sequences are mapped. This means that reads mapping to 5’ and 3’ UTRs cannot be distinguished from reads that map to the coding sequences. Similarly, the present pipeline is not able to map exon-exon spanning reads. Such information can be highly valuable for understanding the functions of proteins involved in mRNA processing/mRNP biogenesis. It can also be relevant for the analysis of proteins involved in RNA modification, as an asymmetric distribution of RNA modifications is often observed, e.g. m6A modifications are enriched around stop codons, in long internal exons and in 3’ UTRs (Chen et al., 2015; Dominissini et al., 2012; Linder et al., 2015) and m1A modifications are typically clustered in 5’ and 3’ UTRs (Dominissini et al., 2016; Li et al., 2016). Such features could be analysed by using splice-aware alignment tools, such as HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts; Kim et al., 2015) or STAR (Spliced Transcripts Alignment to a Reference; Dobin et al., 2013), instead of the currently used Bowtie sequence alignment tool, as these algorithms are specifically designed for the alignment of spliced reads spanning exon-exon junctions in mRNA.
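Once a more detailed annotation is available, separating UTR reads from coding-sequence reads reduces to a simple lookup. The Python sketch below illustrates this under the assumption that read positions are given in spliced transcript coordinates and that the start and end of the coding sequence are known for each transcript. The `Transcript` class and the function names are hypothetical and not part of the current pipeline.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal annotation in spliced transcript coordinates."""
    length: int
    cds_start: int  # 0-based, inclusive
    cds_end: int    # 0-based, exclusive


def region_of(tx: Transcript, pos: int) -> str:
    """Assign a transcript position to the 5'UTR, the CDS or the 3'UTR."""
    if not 0 <= pos < tx.length:
        raise ValueError(f"position {pos} lies outside the transcript")
    if pos < tx.cds_start:
        return "5'UTR"
    if pos < tx.cds_end:
        return "CDS"
    return "3'UTR"


def count_regions(tx: Transcript, positions: Iterable[int]) -> Counter:
    """Count how many positions (e.g. read midpoints) fall into each region."""
    return Counter(region_of(tx, p) for p in positions)


tx = Transcript(length=1000, cds_start=100, cds_end=700)
counts = count_regions(tx, [5, 99, 100, 650, 700, 999])
```

Here each region receives two of the six toy positions. Whether read midpoints, 5’ ends or cross-linking sites are the most informative positions to classify is a design choice that this sketch leaves open.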

An alternative strategy for the mRNA analysis could be to use a dedicated analysis suite, such as HOMER (Hypergeometric Optimization of Motif EnRichment; Heinz et al., 2010), which was originally designed to identify binding motifs within deep sequencing data but can also be used for genome-wide analysis of next-generation sequencing data. Lastly, the recent availability of transcriptome-wide maps of sites of specific RNA modifications, such as m6A, m1A, pseudouridine and m5C, means that it would also be interesting to extend the CRAC pipeline so that the overlap between the cross-linking sites of a particular protein and the known sites of RNA modification can be determined automatically.
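Such an overlap analysis could, for instance, report every cross-linking site that lies within a fixed window of a known modification site. The Python sketch below shows one possible approach and is not an existing part of the pipeline. It assumes that both data sets are lists of (reference name, 0-based position) pairs and ignores strand for simplicity. The window size, the reference names and the function names are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # (reference name, 0-based position); strand is ignored here


def index_sites(sites: Iterable[Site]) -> Dict[str, List[int]]:
    """Group positions by reference and sort them for binary search."""
    by_ref: Dict[str, List[int]] = {}
    for ref, pos in sites:
        by_ref.setdefault(ref, []).append(pos)
    for positions in by_ref.values():
        positions.sort()
    return by_ref


def overlapping(crosslinks: Iterable[Site], modifications: Iterable[Site],
                window: int = 10) -> List[Site]:
    """Return cross-linking sites with a modification site within +/- window nt."""
    index = index_sites(modifications)
    hits = []
    for ref, pos in crosslinks:
        positions = index.get(ref, [])
        # First modification site at or after the left edge of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            hits.append((ref, pos))
    return hits


toy_crosslinks = [("rnaA", 50), ("rnaA", 200), ("rnaB", 30)]
toy_modifications = [("rnaA", 45), ("rnaA", 300), ("rnaC", 30)]
hits = overlapping(toy_crosslinks, toy_modifications)
```

In the toy data only the site at position 50 of `rnaA` has a modification site within 10 nt. An appropriate window would depend on the resolution of the cross-linking data.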

4.2 The YTH domain-containing proteins associate with different