Consensus Structure Prediction - Multiple Alignment

7.2 Multiple Alignment

7.2.2 Consensus Structure Prediction

In this section, I exemplify the structure prediction strategy proposed in Section 5.5, which is based on a multiple structure alignment of thermodynamically predicted structures. Throughout this section, this strategy is referred to as the structure alignment strategy. The converse of the structure alignment strategy is a strategy that first calculates a multiple sequence alignment and then derives a consensus structure by analyzing covariance and thermodynamic considerations. This strategy is referred to as the sequence alignment strategy.

Structure prediction strategies that build upon an initial multiple sequence alignment are limited in their success if the sequence identity is too high or too low. In the first case, the covariance of conserved base-pairs is low and the prediction is guided mainly by thermodynamics. In the second case, the quality of the sequence alignment is often too low in a biological sense and, hence, covariance cannot be inferred from the multiple alignment. In particular, the objective function for a multiple sequence alignment aims for maximization of identity and penalizes covariance. According to McCutcheon & Eddy, for multiple sequence alignment based strategies, “the ‘sweet spot’ is at ≅ 75−85% sequence identity” [134]. Washietl & Hofacker gave a slightly lower bound, stating “we can conclude that there is obviously no need for structure alignments above 65% pairwise identity” [223].

Thus, a good candidate family for exemplifying my strategy should have a sequence identity below 70% to demonstrate that the structure alignment strategy is suitable to predict a common fold. The structure alignment strategy depends on predicted structures from single sequences, and the prediction accuracy gets worse the longer the sequences are. From personal experience, the sequence length should be less than 300 nucleotides.
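The covariance argument above can be made concrete with a small sketch. It is a deliberately simplified illustration, not the covariance measure used by RNAalifold: the function names and the toy alignments are assumptions, and the "signal" is just the number of distinct base-pair types observed in two alignment columns, minus one.

```python
# Canonical and wobble base pairs (assumed pairing rules for this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_types(alignment, i, j):
    """Return the set of distinct base-pair types formed by columns i and j."""
    found = set()
    for seq in alignment:
        pair = (seq[i].upper(), seq[j].upper())
        if pair in PAIRS:
            found.add(pair)
    return found


def covariation_signal(alignment, i, j):
    """Simplified covariation signal: distinct pair types minus one.

    A fully conserved base pair yields 0, because identical sequences
    carry no evidence from compensatory mutations.
    """
    return max(len(pair_types(alignment, i, j)) - 1, 0)


# Toy alignments (hypothetical data): columns 0 and 8 form a base pair.
identical = ["GGGAAACCC", "GGGAAACCC", "GGGAAACCC"]
diverse = ["GGGAAACCC", "AGGAAACCU", "CGGAAACCG"]

if __name__ == "__main__":
    print(covariation_signal(identical, 0, 8))
    print(covariation_signal(diverse, 0, 8))
```

With highly identical sequences the signal vanishes, which is the first failure case described above. The second case (misaligned columns at low identity) cannot be seen in this sketch, since it assumes the paired columns are already aligned correctly.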

The RNA families for my experiments are taken from the Rfam database (Version 6.1, August 2004) [67, 68]. Rfam is a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families. The covariance models in Rfam result from hand-crafted multiple sequence alignments that were collected from the published literature. These alignments are the seed alignments in the Rfam database. From several interesting candidates, I chose two families of riboswitches, the Lysine Riboswitch and the TPP Riboswitch, and a family of spliceosomal RNA, the U1 spliceosomal RNA.

For the following experiments, I used RNAforester for the structure alignment strategy. Note that the prediction of structures is done automatically by RNAforester as proposed in Section 5.5, where the base-pairs of the prediction are weighted according to the base-pair probabilities. I use the pure structure scoring scheme proposed in Section 4.5. The clustering threshold c is zero. According to the observations of Gardner & Giegerich, pruning high entropy base-pairs can improve the results of structural comparison for predicted structures [55]. Therefore, I set the minimum probability p that is required for base-pairs to occur in the predicted structures to 0.8. The cluster join threshold t is set to 0.7. Except for the minimum probability p, these are the standard settings for RNAforester. The command line for RNAforester for these settings is:

    RNAforester -p -2d -pmin=0.8 -f=sequences.fas

where sequences.fas is the file containing the RNA sequences in Fasta format.

For the sequence alignment strategy, I calculate multiple sequence alignments using the online version of ClustalW from the European Bioinformatics Institute [23]. I use the default parameters. The structure prediction from the multiple alignment is done by RNAalifold, again using default parameters [82]. The score of an RNAalifold prediction consists of an energy term (first term) and a covariance term (second term). Recently, Washietl & Hofacker provided a method to test a multiple sequence alignment for the existence of an unusually stable prediction. Their method relates RNAalifold predictions of a given multiple sequence alignment to the predictions of shuffled alignments. The significance is assessed in terms of Z-scores². In their experiments, a Z-score below −3 has a false positive rate below 1%. For the calculation of Z-scores, I used the Perl program alifoldz.pl as provided in the supplemental material of [223]. alifoldz.pl computes two scores, one for the forward and one for the backward strand of the sequences. I did no further fine tuning of parameters for any of the tools used for the following experiments.
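The Z-score used by this kind of test can be sketched as follows. This is a minimal illustration of the definition only, not the alifoldz.pl implementation: the function name z_score, the use of the sample standard deviation, and the background values in the example are assumptions made for illustration.

```python
from statistics import mean, stdev


def z_score(observed, background):
    """Z-score of an observed score relative to a background sample.

    observed:   score of the real alignment (e.g. a consensus folding energy)
    background: scores obtained from shuffled alignments
    """
    if len(background) < 2:
        raise ValueError("need at least two background scores")
    mu = mean(background)
    sigma = stdev(background)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (observed - mu) / sigma


if __name__ == "__main__":
    # Hypothetical background scores from shuffled alignments.
    shuffled_scores = [-20.0, -22.0, -18.0, -21.0, -19.0]
    z = z_score(-37.7, shuffled_scores)
    print(f"Z-score: {z:.2f}")
```

A strongly negative value means the real alignment folds into a structure that is much more stable than what shuffled alignments of the same composition achieve.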

Lysine Riboswitch

Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression. This family includes riboswitches that sense lysine in a number of genes involved in lysine metabolism [126].

² A Z-score is a measure of the distance from the mean of a distribution, normalized by the standard deviation of the distribution. Mathematically: Z-score = (value − mean) / standard deviation. Z-scores are useful for quantifying how different from normal a recorded value is, and they are particularly useful when combining or comparing different features or measures. A Z-score of 0 represents the mean of the distribution. Assuming a normal distribution, about 68%, 95% and 99.7% of all values are expected by chance to fall within ±1, ±2 and ±3 standard deviations of the mean, respectively. In short, higher (in absolute value) Z-scores are likely to be more statistically significant in their deviation from the mean.

The 48 sequences from the Rfam seed alignment for the Lysine Riboswitch (Accession number: RF00168) have an average length of 181.3 and an average identity of 48%. The published consensus structure is shown in Figure 7.11.
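An average identity such as the one quoted above can be computed from an alignment along the following lines. This is a sketch under stated assumptions: the exact identity definition used by Rfam is not given here, so the treatment of gaps (columns where both sequences have a gap are ignored, a residue against a gap counts as a mismatch) and the function names are illustrative choices.

```python
from itertools import combinations

GAPS = set("-.")  # assumed gap symbols


def pairwise_identity(a, b):
    """Fraction of identical columns between two aligned sequences.

    Columns in which both sequences have a gap are skipped; a residue
    aligned to a gap counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAPS and y in GAPS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def average_identity(alignment):
    """Mean pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    toy = ["ACGU", "ACGA", "AC-U"]  # hypothetical toy alignment
    print(f"average identity: {average_identity(toy):.2%}")
```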

RNAforester outputs six clusters that contain more than one structure. The consensus structure drawings for these clusters are shown in Figures 7.12-7.17.

The structure in Figure 7.12 is in good correspondence with the published one. The clusters in Figures 7.13-7.17 share at most smaller regions with the published structure. Apparently, the relative sum-of-pairs score σ_{SP REL} for the clusters does not correlate with the reliability of the predictions.

However, looking at the sequence level, the consensus structure in Figure 7.12 has a considerable amount of sequence variation, while the others are highly sequence conserved. I identify correct predictions based on the following hypothesis: the more structurally conserved and the less sequence conserved a multiple alignment is, the more reliable are the predicted structures. In contrast to the sequence alignment strategy, which uses covariation to predict structures, in the structure alignment strategy thermodynamic predictions are validated by covariance. So far, I only consider sequence identity to identify the best cluster.
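The selection rule stated above (prefer the cluster with the lowest sequence identity) can be sketched in a few lines. The Cluster record, the function name, and all numbers in the example are hypothetical; RNAforester does not report identities in this form, so the values would have to be computed separately from each cluster's sequence alignment.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    label: str
    size: int            # number of structures in the cluster
    avg_identity: float  # average pairwise sequence identity in [0, 1]


def most_diverse_cluster(clusters, min_size=2):
    """Pick the cluster with the lowest average sequence identity.

    Clusters with fewer than min_size members are ignored, since a
    single structure carries no comparative evidence.
    """
    candidates = [c for c in clusters if c.size >= min_size]
    if not candidates:
        raise ValueError("no cluster with enough members")
    return min(candidates, key=lambda c: c.avg_identity)


if __name__ == "__main__":
    # Hypothetical clusters; identities are made up for illustration.
    clusters = [
        Cluster("A", 18, 0.55),
        Cluster("B", 7, 0.93),
        Cluster("C", 1, 0.00),
    ]
    print(most_diverse_cluster(clusters).label)
```

The rule deliberately ignores the sum-of-pairs score, in line with the observation that this score does not correlate with the reliability of the predictions.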

Figure 7.11: Consensus structure of the Lysine riboswitch as published in [126].

Figure 7.12: Lysine Riboswitch. Consensus structure of 18 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 436.177.

Figure 7.13: Lysine Riboswitch. Consensus structure of 7 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 485.696.

Figure 7.14: Lysine Riboswitch. Consensus structure of 7 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 312.722.

Figure 7.15: Lysine Riboswitch. Consensus structure of 5 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 349.197.

Figure 7.16: Lysine Riboswitch. Consensus structure of 3 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 562.263.

Figure 7.17: Lysine Riboswitch. Consensus structure of 2 sequences as predicted by RNAforester. The sum-of-pairs score σ_SP for this cluster is 414.111.

An RNAforester structure alignment produces a sequence alignment as a coproduct³. In the following, I compare the results of the structure alignment strategy to results of the sequence alignment strategy. Figure 7.18 shows the RNAalifold prediction for the hand-crafted seed alignment from the Rfam database. This prediction is in good correspondence with the published one.

Figure 7.19 shows the prediction for the ClustalW alignment of the seed sequences. Clearly, the sequence alignment cannot arrange the bases such that RNAalifold can derive a common structure. Since RNAforester does a clustering of the structures, I also compare the RNAalifold predictions for the sequence alignment derived from RNAforester’s best alignment (Figure 7.12) and for the ClustalW alignment of the sequences that belong to this cluster. The results are shown in Figure 7.20 and Figure 7.21. In Figure 7.22, I show the RNAalifold prediction of the Rfam seed alignment restricted to the sequences belonging to RNAforester’s best cluster.
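Restricting a seed alignment to the members of one cluster can be sketched as follows. This is an illustration under assumptions: the alignment is held as a mapping from sequence name to aligned row, the function name is hypothetical, and the removal of columns that contain only gaps after the restriction is a common convention that the text does not confirm was applied here.

```python
def restrict_alignment(alignment, keep, gap_chars="-."):
    """Restrict an alignment to the named sequences.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    keep:      names of the sequences to retain, in output order
    Columns that contain only gap characters after restriction are removed.
    """
    rows = {name: alignment[name] for name in keep}
    if not rows:
        raise ValueError("no sequences selected")
    length = len(next(iter(rows.values())))
    if any(len(seq) != length for seq in rows.values()):
        raise ValueError("aligned sequences must have equal length")
    kept_columns = [
        i for i in range(length)
        if any(seq[i] not in gap_chars for seq in rows.values())
    ]
    return {
        name: "".join(seq[i] for i in kept_columns)
        for name, seq in rows.items()
    }


if __name__ == "__main__":
    # Hypothetical toy seed alignment and cluster membership.
    seed = {"s1": "GG-AC", "s2": "G--AC", "s3": "GGUAC"}
    print(restrict_alignment(seed, ["s1", "s2"]))
```

The relative positions of the retained residues are unchanged, so the restricted alignment still carries the hand-curated column assignments of the seed alignment.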

I do the same experiments for the TPP Riboswitch and the U1 spliceosomal RNA and then discuss the results.

³ In the extended forest representation, the sequence alignment is the alignment of leaf nodes.

Figure 7.18: Lysine Riboswitch. RNAalifold prediction for the seed alignment taken from the Rfam database. RNAalifold score: −37.70 = −22.44 + −15.26, alifoldz score: −3.1 (−2.3).

Figure 7.19: Lysine Riboswitch. RNAalifold prediction for the ClustalW alignment of seed sequences taken from the Rfam database. RNAalifold score: −2.68 = −0.50 + −2.19, alifoldz score: n.a.

Figure 7.20: Lysine Riboswitch. RNAalifold prediction for the RNAforester sequence alignment from the consensus structure in Figure 7.12. RNAalifold score: −25.65 = −12.04 + −13.62, alifoldz score: −1.9 (−2.7).

Figure 7.21: Lysine Riboswitch. RNAalifold prediction for the ClustalW alignment for the sequences belonging to the consensus structure in Figure 7.12. RNAalifold score: −27.72 = −19.08 + −8.65, alifoldz score: −2.3 (−1.1).

Figure 7.22: Lysine Riboswitch. RNAalifold prediction for the Rfam seed alignment restricted to sequences belonging to the consensus structure in Figure 7.12. RNAalifold score: −44.05 = −25.75 + −18.29, alifoldz score: −6.8 (−6.6).

Figure 7.23: Consensus structure of TPP riboswitch as published in [164].

TPP Riboswitch (THI Element)

Vitamin B1 in its active form, thiamin pyrophosphate (TPP), is an essential coenzyme that is synthesized by coupling of pyrimidine and thiazole moieties in bacteria. The previously detected thiamin-regulatory element, the thi box, was extended, resulting in a new, highly conserved RNA secondary structure, the THI element, which is widely distributed in eubacteria and also occurs in some archaea [164].

The 141 sequences from the Rfam seed alignment for the TPP riboswitch (Accession number: RF00059) have an average length of 104.9 and an average identity of 52%. Figure 7.23 shows the consensus structure as published in Rfam. Figures 7.24-7.29 show the structure predictions, analogous to the experiments for the Lysine Riboswitch.

Figure 7.24: TPP Riboswitch. Consensus structure of 31 sequences as predicted by RNAforester.

Figure 7.25: TPP Riboswitch. RNAalifold prediction for the seed alignment taken from the Rfam database. RNAalifold score: −12.11 = −9.26 + −2.85, alifoldz score: 0 (1.4).

Figure 7.26: TPP Riboswitch. RNAalifold prediction for the ClustalW alignment of seed sequences taken from the Rfam database. RNAalifold score: −5.64 = −4.33 + −1.31, alifoldz score: 0.9 (0.9).

Figure 7.27: TPP Riboswitch. RNAalifold prediction for the RNAforester sequence alignment for the consensus structure in Figure 7.24. RNAalifold score: −4.98 = −3.35 + −1.64, alifoldz score: −1.4 (−1.1).

Figure 7.28: TPP Riboswitch. RNAalifold prediction for the ClustalW alignment for the sequences belonging to the consensus structure in Figure 7.24. RNAalifold score: −6.81 = −5.29 + −1.53, alifoldz score: −1.7 (−1.2).

Figure 7.29: TPP Riboswitch. RNAalifold prediction for the Rfam seed alignment restricted to sequences belonging to the consensus structure in Figure 7.24. RNAalifold score: −14.56 = −12.50 + −2.06, alifoldz score: −18.7 (−12.3).

Figure 7.30: Consensus structure of U1 spliceosomal RNA as published in [107].

U1 spliceosomal RNA

U1 is a small nuclear RNA (snRNA) component of the spliceosome (involved in pre-mRNA splicing). Its 5’ end forms complementary base pairs with the 5’ splice junction, thus defining the 5’ donor site of an intron. There are significant differences in sequence and secondary structure between metazoan and yeast U1 snRNAs, the latter being much longer (568 nucleotides as compared to 164 nucleotides in human). Nevertheless, secondary structure predictions suggest that all U1 snRNAs share a ’common core’ [107].

The 54 sequences from the Rfam seed alignment for the U1 spliceosomal RNA (Accession number: RF00003) have an average length of 154.9 and an average identity of 59%. This family does not contain the larger yeast sequences. Figure 7.30 shows the consensus structure as published in Rfam. Figures 7.31-7.36 show the structure predictions, analogous to the experiments for the Lysine Riboswitch.

Figure 7.31: U1 RNA. Consensus structure of 14 sequences as predicted by RNAforester.

Figure 7.32: U1 RNA. RNAalifold prediction for the seed alignment taken from the Rfam database. RNAalifold score: −23.27 = −14.69 + −8.58, alifoldz score: 1.0 (0.4).

Figure 7.33: U1 RNA. RNAalifold prediction for the ClustalW alignment of seed sequences taken from the Rfam database. RNAalifold score: −5.02 = −1.98 + −3.05, alifoldz score: 0.9 (n.v.).

Figure 7.34: U1 RNA. RNAalifold prediction for the RNAforester sequence alignment for the consensus structure in Figure 7.31. RNAalifold score: −36.50 = −31.85 + −4.65, alifoldz score: −5.3 (−1.3).

Figure 7.35: U1 RNA. RNAalifold prediction for the ClustalW alignment for the sequences belonging to the consensus structure in Figure 7.31. RNAalifold score: −51.73 = −45.54 + −6.19, alifoldz score: −5.3 (−2.6).

Figure 7.36: U1 RNA. RNAalifold prediction for the Rfam seed alignment restricted to sequences belonging to the consensus structure in Figure 7.31. RNAalifold score: −34.18 = −27.36 + −6.82, alifoldz score: −7.8 (−4.2).

Discussion

Evidently, the sequence alignment strategy is not a successful strategy to predict a consensus structure for RNA families that are distantly related (applied to the complete Rfam seed sequences). For the structure alignment strategy, the RNAforester cluster with the highest sequence diversity was always in good correspondence with the published consensus structure. The clusters that are not shown for the TPP riboswitch and the U1 spliceosomal RNA were either diverse in their sequence and similar to the published structure⁴, or similar in their sequence with a structural topology that is different from the published one. It seems unlikely that different sequences fold into a similar structure just by chance. Interestingly, RNAalifold was not able to repredict all stems of the consensus structure for the TPP riboswitch and the U1 spliceosomal RNA for the hand-crafted seed alignments taken from the Rfam database.

To assess the quality of the sequence alignment that can be derived from RNAforester’s best cluster, I ran RNAalifold on the sequence alignment that was derived from the structural alignment. Additionally, I considered the RNAalifold predictions for the ClustalW alignment and the restricted seed alignment for the sequences belonging to this cluster. The predictions from the ClustalW alignments achieved a similar quality as the predictions from the RNAforester derived sequence alignments. However, the RNAalifold predictions detected different parts of the consensus structure. In particular, for the Lysine riboswitch, a stem that was detected with the RNAforester sequence alignment was not detected with the ClustalW alignment, and vice versa (see Figures 7.20 and 7.21). What remains is to observe whether the improved quality of the ClustalW alignments is simply due to a reduced sequence identity or a good pre-selection by RNAforester. In contrast to the unrestricted seed alignments, the restricted seed alignments let RNAalifold predict consensus structures that are in almost perfect correspondence to the published ones.

⁴ A tuning of the RNAforester parameters could join them in a larger cluster.

My initial strategy was to use the Z-scores as a measure for the quality of the alignment. Contrary to my expectation, the Z-scores did not strongly identify the (unrestricted) seed alignments as alignments of functional non-coding RNA sequences (a Z-score below −4 would be a good indicator). The restricted seed alignments always achieved negative scores that gave strong evidence for a functional RNA.
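The comparison described above can be checked against the alifoldz scores reported in the figure captions (Figures 7.18 and 7.22 for the Lysine riboswitch, 7.25 and 7.29 for the TPP riboswitch, 7.32 and 7.36 for the U1 RNA). The sketch below uses the first of the two values given in each caption and the −4 threshold mentioned in the text; the dictionary layout and function names are assumptions for illustration.

```python
# alifoldz scores copied from the figure captions (first value of each pair).
ALIFOLDZ = {
    "Lysine riboswitch": {"seed": -3.1, "restricted seed": -6.8},
    "TPP riboswitch": {"seed": 0.0, "restricted seed": -18.7},
    "U1 spliceosomal RNA": {"seed": 1.0, "restricted seed": -7.8},
}

THRESHOLD = -4.0  # "a Z-score below -4 would be a good indicator"


def passes(z, threshold=THRESHOLD):
    """True if the Z-score is below the threshold, i.e. indicates a stable fold."""
    return z < threshold


def summarize(scores=ALIFOLDZ, threshold=THRESHOLD):
    """Map each family and alignment type to whether it passes the threshold."""
    return {
        family: {kind: passes(z, threshold) for kind, z in by_kind.items()}
        for family, by_kind in scores.items()
    }


if __name__ == "__main__":
    for family, flags in summarize().items():
        print(family, flags)
```

For all three families, only the restricted seed alignment falls below the threshold, which matches the observation in the text.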

Alignments of predicted minimal free energy structures can rightfully be criticized, because structure prediction may produce “optimal” structures quite different from the (suboptimal) native structure. The use of sequence similarity, if sufficient, is advocated as a means to avoid this dilemma. However, my experiments contribute two new considerations to this issue:

• They demonstrate an effect that, at first sight, is paradoxical: strong sequence similarity can mislead the determination of the consensus structure. This happens because very similar sequences tend to fold into a similar structure, be it wrong or right.

• They demonstrate that a multiple structure alignment, when applying the cutoff value in the clustering step, may produce meaningful alignments even in the presence of incorrect predictions.

As a consequence, a new approach to consensus construction becomes feasible, where first a good candidate consensus (or several) is constructed and subsequently, sequences that do not fall into a consensus cluster are refolded, given the candidate consensus as a target structure.

Conclusions

In this thesis, I have analyzed the tree alignment model for the comparison of RNA secondary structures. I gave a systematic generalization of the alignment model from strings to trees and forests. I provided carefully engineered dynamic programming implementations using dense, two-dimensional tables, which considerably reduce the space requirement. I introduced local similarity problems on forests and provided efficient algorithms that solve them. Since the problem of aligning trees occurs in many different disciplines, I decoupled my algorithmic contributions from the problem of aligning RNA secondary structures. For instance, using my algorithms, I could contribute to addressing problems in the field of robotics [48, 165].

However, the main focus of this thesis is to provide algorithms to analyze RNA secondary structures. To improve the biological semantics of aligning RNA secondary structures as forests, I introduced an extended forest representation and a refined forest alignment model. The local similarity variants that were introduced on an abstract level of forests turned into local similarity notions for RNA secondary structures. The joint work with Thomas Töller showed that local structural motifs in RNA molecules can be successfully detected using my algorithms [203]. To make the results of structure comparison visually available, I invented a 2d-plot for RNA secondary structure alignments that highlights the differences and similarities of structures. This visualization is more intuitive than comparing abstract representations of RNA secondary structures, e.g. dot-plots or mountain plots, and makes it efficient to present results from structure comparison.

I generalized the forest alignment model to the case of multiple forests and, thus, made it applicable to compare multiple RNA secondary structures. My approach is a faithful generalization of established techniques used in sequence comparison. All the experience that has accumulated for multiple sequence alignments therefore now carries over to RNA secondary structures.

I generalized the idea of sequence profiles to forest profiles, resulting in a profile of RNA secondary structures which groups different RNA secondary structures into a single data structure. To visualize a common consensus structure, I proposed a 2d-plot visualization that, in addition to structural similarity, can display the sequence diversity of the aligned structures. Based on these techniques, I proposed a consensus structure prediction strategy for families of RNA molecules that have low sequence homology. I demonstrated that this is a promising approach by successfully predicting the consensus structures for RNA families with low sequence conservation taken from the Rfam database.

I implemented all algorithms presented in this thesis in the RNA structure comparison tool RNAforester. RNAforester is designed in the spirit of the programs in the Vienna RNA Package and will be distributed in the forthcoming Vienna RNA Package Version 1.6. The online version and the stand-alone application are publicly available at http://bibiserv.techfak.uni-bielefeld.de/rnaforester.

Future Work. Several research activities open directly from the contributions in this thesis:

• The success of the structure prediction strategy that was presented in this thesis depends largely on the quality of thermodynamic predictions. It is well known that the biologically meaningful structure often hides in the space of suboptimal solutions. I argue that results of my structure prediction strategy can be improved significantly by considering suboptimal solutions. However, the exponential number of suboptimal solutions prohibits a straightforward strategy. Recently, Giegerich et al. provided the structure prediction program RNAshapes, based on thermodynamics, that compartmentalizes the suboptimal solution space into different shapes [59]. A combination of RNAshapes and RNAforester is the logical next step.

• That locally similar structures can be detected with RNAforester without prior knowledge was demonstrated in this thesis. The application of my algorithms on a genome-wide scale is a challenging task. Locally stable structures could be predicted in genome-wide surveys using RNALfold [85], and the resulting data could be analyzed for locally conserved structures using RNAforester, after it has been preprocessed for length and energy constraints. Thorough statistics have to be done to rank the locally conserved structures and distinguish biologically relevant conservations from those that are found just by chance.

• A well known problem of the progressive strategy is that errors made early in an alignment cannot be rectified when further sequences are added. Notredame et al. present a strategy that can minimize this effect in the multiple sequence alignment tool T-Coffee [149]. Instead of using substitution scores for the calculation of pairwise alignments, they propose a position dependent scoring. A primary library gathers information from heterogeneous sources for pairwise alignments, such as sequence alignments (global and local), structural alignments and manual alignments. These sources are combined in an extended library such that each pair of characters in the sequences has a position specific weight. The pairwise alignments are then optimized according to this extended library. Misplacing gaps in the earlier steps of the progressive calculation becomes less likely, which significantly improves the quality of the alignment in comparison to ClustalW and other tools. An analogous strategy for trees could further improve the quality of multiple tree alignments.

• Various tree distances have been discussed in the introductory chapter of this thesis. However, a thorough analysis of their quality for RNA secondary structures is missing. It would be interesting to observe whether, and under which circumstances, the distances can be replaced by each other and provide similar results. The complexities of the tree distances depend on different parameters of the tree structure, e.g. the number of nodes, the depth, the number of leaves, and the degree. All these parameters are known and, thus, the computational effort can be determined in advance. In the end, a flexible strategy could always choose the “cheapest” model.

• Today, the detection of unknown non-coding RNA from genomic data is one of the biggest challenges in molecular biology. First successes were achieved with tools that infer a structure from a (multiple) sequence alignment by thermodynamic and phylogenetic information, comparing the result of the predictions with randomized data [163, 223]. However, there is an inherent problem: if the sequences are highly conserved, the alignment is good but the covariance of base-paired regions is low. Thus, the thermodynamic considerations dominate the structure prediction. Contrary to what Maizel and coworkers stated, energy does not seem to be a good discriminator to separate structural from non-structural RNAs [19, 162]. If the sequence conservation is too low, regions of covariance are not aligned accurately and the alignment can mislead the predictions. As in my structure prediction strategy, I am thinking about a strategy that goes the other way around: I could start with thermodynamic considerations and then use phylogenetic information to estimate the reliability of predictions.

A multiple and local structure alignment program will become a basic tool, just like the sequence counterparts. With RNAforester, I provide a program that can be embedded in a larger framework of structure analysis, contributing to solve problems beyond the ones I proposed.

From the document: The tree alignment model: algorithms, implementations and applications for the analysis of RNA secondary structures (pages 161-200)