Thermodynamic modeling of multivalent binding by RBPs

In the second part of this thesis, I introduced BMF for de novo discovery of bipartite motifs in RNA-protein interaction data (Sohrabi-Jahromi and Söding, 2021). To the best of our knowledge, BMF is the first approach that adopts a full thermodynamic viewpoint for RNA motif identification. Instead of considering only the best binding configuration, BMF aggregates contributions from all possible binding configurations on the RNA sequence. This facilitates the discovery of motifs in repetitive and degenerate RNA sequences such as mRNA UTRs. We applied BMF to an HTR-SELEX dataset of 86 human RBPs (Jolma et al., 2020) and showed widespread bipartite binding with a factor-dependent preferred gap length between the motif cores. These results demonstrate the importance of multivalent binding for RBPs and contribute to our understanding of the mRNP code underlying mRNA regulation.
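The difference between scoring only the best configuration and aggregating over all configurations can be illustrated with a small sketch. This is not BMF's implementation: the function names, the per-base energy tables (in units of kT), and the flat gap term are assumptions chosen for illustration only.

```python
import math

def core_energy(seq, start, core):
    """Binding energy (in units of kT) of one motif core placed at `start`."""
    return sum(core[k][seq[start + k]] for k in range(len(core)))

def bipartite_weights(seq, core_a, core_b, gap_energy):
    """Return (sum over all configurations, best configuration) of Boltzmann weights.

    A configuration places core A at position i and core B downstream at
    position j, without overlap; the order of the cores is fixed.
    """
    la, lb = len(core_a), len(core_b)
    total, best = 0.0, 0.0
    for i in range(len(seq) - la - lb + 1):
        e_a = core_energy(seq, i, core_a)
        for j in range(i + la, len(seq) - lb + 1):
            e = e_a + core_energy(seq, j, core_b) + gap_energy(j - i - la)
            w = math.exp(-e)
            total += w
            best = max(best, w)
    return total, best

# Hypothetical toy energy model: each U contributes -1 kT, other bases 0.
u_rich = [{"A": 0.0, "C": 0.0, "G": 0.0, "U": -1.0}] * 3
no_gap_preference = lambda gap: 0.0

total, best = bipartite_weights("UUUUUUUUUUUU", u_rich, u_rich, no_gap_preference)
print(total / best)  # how many equally good configurations the repeat offers
```

On a repetitive sequence many configurations carry the same weight as the best one, so the aggregated sum is far larger than the single best term; a model that keeps only the maximum would miss this contribution.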

Differences between TF and RBP targeting and their implications. In silico RNA motif discovery resembles TF motif discovery, as both problems aim to identify over-represented sequence features that explain the specific targeting by a certain protein of interest. This has resulted in the initial repurposing of many genomic motif discovery tools for de novo RNA motif discovery (such as Frith et al., 2008; Bailey et al., 2015; Alipanahi et al., 2015). However, RBP binding differs fundamentally from TF targeting in a number of ways that merit careful consideration when designing computational motif detection tools. Transcription regulation relies on the binding of TFs to specific regulatory elements around gene promoters. To this end, TFs typically read out DNA stretches spanning 6-12 base pairs (Lambert et al., 2018). RBPs, on the other hand, mostly regulate a large number of target RNA molecules, and their binding is dynamically regulated through other binding partners that facilitate or inhibit their binding to RNA (Sternburg and Karginov, 2020). For many RBPs, the precise binding locations on their target RNA molecules are not important for performing their function. For instance, proteins that bind sequence elements on mRNAs to transport them across the cytoskeleton could recognize any part of the RNA molecule.

Another feature of RBPs is their high modularity, with multiple domains binding adjacent or spaced RNA fragments in a semi-flexible RNA chain (Lunde et al., 2007). RBPs therefore achieve their dynamically regulated targeting by identifying short and often degenerate sequences with each of their domains (Dominguez et al., 2018). This allows the binding of many RBPs to repetitive and degenerate RNA UTRs through cooperative effects. Their specificity can be boosted through binding preferences for certain RNA structures (Li et al., 2014), getting help from an associated small RNA (Djuranovic et al., 2011; Bartel, 2018), or interactions with other RBPs (Sternburg and Karginov, 2020; Müller-McNicoll and Neugebauer, 2013). The binding selectivity is further boosted through compartmentalization. Higher concentrations of RBPs in P-bodies can, for example, enhance new RBP-RNA interactions. Similarly, sequestering one of the interaction partners in the nucleus or a membraneless compartment can prevent their interaction (Hubstenberger et al., 2017; Mittag and Parker, 2018). Out of these features, RNA secondary structure has received the most attention, as many recent RNA motif discovery tools utilize the RNA structure as input data (Pan et al., 2018; Maticzka et al., 2014; Budach and Marsico, 2018; Zhang et al., 2016; Ben-Bassat et al., 2018; Su et al., 2019; Deng et al., 2020). BMF complements such approaches by incorporating multivalency, searching for pairs of short sequence patterns enriched in bound RNA fragments.

Previous reports on bivalent binding. Spaced k-mer approaches have previously suggested bipartite binding modes for about one third of RBPs in HTR-SELEX and RBNS datasets (Dominguez et al., 2018; Schneider et al., 2019; Jolma et al., 2020). BMF finds bipartite binding for half of the 78 studied RBPs. In these cases, the motif cores match in their sequence preference and the motifs have low complexity and high repetitiveness (Figure 2). The low-complexity and repetitive nature of RNA motifs has been described before (Dominguez et al., 2018). Recognizing low-complexity sequences yields multiple potential binding sites on repetitive RNA sequences, which allows RBPs to interact with higher affinity. BMF’s motif model takes this combinatorial complexity into account. Overall, the models learned by BMF match previously reported motifs while providing additional information on the distance preference of the motif cores, capturing the optimal geometry of the protein binding sites.
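To make the notion of a distance preference between two motif cores concrete, the following sketch counts how often two given cores co-occur at each gap length in a set of sequences. It is a plain spaced k-mer count, not BMF's model or any of the cited methods; `gap_profile` and its arguments are hypothetical names.

```python
from collections import Counter

def gap_profile(seqs, core_a, core_b, max_gap):
    """Count co-occurrences of core_a followed by core_b for each gap length.

    Returns a list whose index is the gap (number of nucleotides between
    the end of core_a and the start of core_b), from 0 to max_gap.
    """
    counts = Counter()
    la, lb = len(core_a), len(core_b)
    for seq in seqs:
        for i in range(len(seq) - la + 1):
            if seq[i:i + la] != core_a:
                continue
            for gap in range(max_gap + 1):
                j = i + la + gap
                # A slice past the end is shorter than core_b and never matches.
                if seq[j:j + lb] == core_b:
                    counts[gap] += 1
    return [counts[g] for g in range(max_gap + 1)]

# Toy input: two UUU cores separated by two nucleotides.
profile = gap_profile(["UUUAAUUU"], "UUU", "UUU", max_gap=4)
```

Comparing such a profile between bound and background sequences would show which spacing is enriched; unlike a thermodynamic model, this count treats every exact match equally and ignores degenerate matches.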

The choice of the experimental dataset for bipartite motif discovery. In vitro datasets are advantageous for motif discovery purposes as they are free of cellular complications such as effects of interactions with protein cofactors, non-specific background binding, and most protocol-induced sequence biases (Friedersdorf and Keene, 2014; Kishore et al., 2011; Dominguez et al., 2018). Moreover, the availability of a large pool of random RNA oligomers ensures a sufficiently large selection pool to discover RBP binding preferences in comparison to in vivo data, which are limited by the non-random transcriptome composition. In order to capture bipartite binding modes in such datasets, the RNA oligomers need to be sufficiently long to accommodate the two motif cores as well as their linker sequence. The HTR-SELEX dataset by Jolma et al. is the only large-scale in vitro dataset available that uses longer oligomers (40 nucleotides) in its selection process, and hence was used here to discover bipartite motifs.

Cross-platform validation. We introduce a cross-platform benchmark to evaluate the quality of BMF predictions. As noted in section 1.3.2, highly parametric motif models can learn biases in experimental datasets to distinguish bound from unbound RNA fragments (Ghanbari and Ohler, 2020; Kishore et al., 2011; Orenstein and Shamir, 2014). We therefore evaluate the performance of BMF in predicting in vivo binding sites based on models trained on in vitro data. To establish a baseline, we similarly gauged the prediction accuracy of the frequently used and highly parametric motif models GraphProt and iDeepE. GraphProt uses sequence and structural information and relies on support-vector machines, while iDeepE is based on deep learning (Maticzka et al., 2014; Pan and Shen, 2018). Overall, BMF can predict in vivo binding sites with competitive accuracy (Figure 3). This could be due to the over-fitting of more complex models to experimental artifacts, which prevents them from generalizing across platforms. However, GraphProt and iDeepE excelled at predicting new binding sites for a few proteins with long sequence preferences, such as CSTF2T. In such cases, longer BMF models are needed to best describe the binding motifs (Figure S7).
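A cross-platform comparison of this kind scores in vivo sequences with a model trained on in vitro data and asks how well the scores separate bound from unbound sites. The text above does not name the metric, so the sketch below assumes the area under the ROC curve; `auroc` is an illustrative helper, and the scores are made-up numbers.

```python
def auroc(bound_scores, unbound_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    Equals the probability that a randomly chosen bound site scores higher
    than a randomly chosen unbound site; ties count as one half.
    """
    wins = 0.0
    for b in bound_scores:
        for u in unbound_scores:
            if b > u:
                wins += 1.0
            elif b == u:
                wins += 0.5
    return wins / (len(bound_scores) * len(unbound_scores))

# Made-up model scores for in vivo bound and unbound sites.
score = auroc([0.9, 0.8], [0.1, 0.85])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is adequate for a sketch; a rank-based implementation would be preferable for large benchmark sets.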

Possible limitations. While BMF models show promising accuracy in predicting new binding sites, they do not take the RNA structure into account. Numerous studies have shown preferential binding to certain RNA structure elements or a mere binding preference towards single-stranded, accessible parts of the RNA molecule (Li et al., 2014; Dominguez et al., 2018; Jolma et al., 2020). Building hybrid models of sequence and structure has therefore improved the performance of RNA motif models (Pan et al., 2018; Maticzka et al., 2014; Budach and Marsico, 2018; Zhang et al., 2016; Ben-Bassat et al., 2018; Su et al., 2019; Deng et al., 2020). We expect that incorporating the RNA structure can further improve the accuracy of BMF models. Another assumption made by BMF is that the two domains always bind in the same order to the RNA sequence and that their binding cannot be swapped. This allows us to more easily enumerate all possible binding configurations in the learning phase. We expect this simplifying assumption to be justified for the vast majority of factors, as one of the two binding orders will usually lead to a much less favorable configuration due to spatial constraints and the need for a long and flexible peptide linker between the two domains.

Thermodynamic modelling to estimate RBP dissociation constants. We further show the impact of multi-domain binding by modeling the protein-RNA binding kinetics for RBPs with any number of binding sites (Stitzinger et al., 2021). We model protein and RNA linkers connecting the binding sites as flexible chains and estimate total dissociation constants (Kds) based on Kds derived for individual domains. This model proves promising based on its accuracy in predicting RBP Kds, and additionally provides interesting insights into how cooperative binding can result in affinities and specificities that are orders of magnitude higher than those achieved by single domains. The kinetic simulations show that small changes in motif density can significantly boost the binding probability of multi-domain RBPs. These results, together with the motif models found by BMF, indicate that binding affinities may be encoded in the low-complexity RNA sequences through small variations in the number of potential binding sites.
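The gain in affinity from two linked domains can be illustrated with the textbook effective-concentration approximation for bivalent binding. This is a simplified sketch and not the model of Stitzinger et al.: it treats the linker as an ideal Gaussian chain, handles only two sites, and assumes that the effective concentration is much larger than the individual Kds. The function names and example values are assumptions.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole

def effective_concentration(rms_end_to_end_nm):
    """Molar concentration of one linker end around the other (Gaussian chain).

    Uses the end-to-end probability density at zero separation,
    (3 / (2 * pi * <r^2>)) ** 1.5, converted from molecules/nm^3 to mol/L.
    """
    density_per_nm3 = (3.0 / (2.0 * math.pi * rms_end_to_end_nm ** 2)) ** 1.5
    return density_per_nm3 * 1e24 / AVOGADRO  # 1 L = 1e24 nm^3

def bivalent_kd(kd1, kd2, c_eff):
    """Approximate total Kd (mol/L) of two linked domains, valid for c_eff >> kd1, kd2."""
    return kd1 * kd2 / c_eff

# Hypothetical example: two domains with 10 uM affinity each, linker with
# a root-mean-square end-to-end distance of 5 nm.
c_eff = effective_concentration(5.0)
kd_total = bivalent_kd(1e-5, 1e-5, c_eff)
```

With these illustrative numbers the effective concentration is in the millimolar range, so the combined Kd is more than two orders of magnitude lower than that of either domain alone, in line with the qualitative statement above.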

Overall, we contribute to a better understanding of RNA-protein interactions in the following regards.

(1) We introduce the first tool to incorporate cooperativity in motif discovery, enabling detection of bipartite motifs as well as the distance preferences between the motif cores. (2) We perform an in-depth analysis of binding motifs learned from 78 RBPs. This indicates that bipartite binding is widespread, the two motif cores are often identical, and the bound sequences are low in complexity. (3) We made BMF available both as a command-line tool with detailed documentation (see A1.4) and as a web server. The server minimizes the technical skills required for analysing new interaction datasets for further discovery