3.2 The intrastrand triplex motif “TM” in E. coli

3.2.1 Intrastrand triplex motifs in bacteria

Several studies have identified triplex motifs in eukaryotes and prokaryotes computationally. Most algorithms search for TFO binding sites (162-164) or potential triplex target sites (165), or focus on inverted repeats (166,167) and H-DNA (168,169). As described above (see Chapter 1.1.2), intramolecular triplexes do not have to form H-DNA within a DNA double strand; they can also occur within a single-stranded DNA oligonucleotide. Databases with selective search functions for such intrastrand triplex motifs are rare. In 2000, Maher and co-workers used the Palingol program (318) to search for intrastrand triplexes, describing them as two hairpins sharing a common homopurine strand.

They defined four triplex classes: class I and II contain purine motif triplexes with reverse Hoogsteen bonding, whereas class III and IV contain pyrimidine motif triplexes stabilized by Hoogsteen bonds (described in Chapter 1.1.2.1). Using their search strategy, they found representative intrastrand triplex motifs in the genomes of E. coli K-12, Synechocystis sp. and Haemophilus influenzae, with the class II motif being the most abundant (10). However, they did not search other prokaryotic genomes. In our studies we aimed at a more general and simplified identification of intrastrand triplex motifs. We therefore developed a search algorithm for finding potential intrastrand triplexes of the different triplex classes in prokaryotes.

Our Intrastrand Triplex Finder (ITxF) database contains 5,246 different genomes of bacterial and archaeal species, based on fully sequenced genomes and plasmids from the NCBI webserver (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The basic layout searches for homopurine-homopyrimidine regions that are either A-, T-, C- or G-rich and that define the three stems of the triplex structure. The regions in between are defined as loops and can contain any nucleotide. The search identifies potential triplex structures with a stem size of 5-12 nt and a loop size of 1-6 nt. As the occurrence of imperfect triplexes has been described in different studies (58-60), our algorithm allows one mismatched base pair in the triplex if the stem is at least 7 nt long.

Using the ITxF program, we identified large numbers of A/T- and G/C-rich triplexes in different prokaryotes: in total, 2,485,777 triplex sequences were found within the 5,246 analyzed genomes and plasmids (examples for E. coli subspecies are shown in Table 13.2 in the appendices). To look for a specific type of triplex only, the user can choose the nucleotide composition of the stem region and define stem and loop lengths in the database. The program also shows the triplex class (class I to IV) for specific sequences. Analyzing all genomes (including plasmids), we found that class II triplexes are the most abundant: 40.8% of the triplexes found belonged to class I, 46.6% to class II, 7% to class III and 5.6% to class IV.

When Hoyne et al. searched for intrastrand triplexes in E. coli K-12, Synechocystis sp. and H. influenzae, they found only small numbers of triplex motifs in total (25, 18 and 21, respectively). When we performed the search in the ITxF database, we found much higher numbers of triplex sequences: 431 triplexes in E. coli K-12 MG1655, 652 triplexes in Synechocystis PCC 6803 and 824 triplexes in Haemophilus influenzae Rd KW20. However, our search strategy differs fundamentally from that of Hoyne et al.: whereas they defined their triplexes via a pattern recognition program searching for hairpin structures, we used a new algorithm specifically aimed at intrastrand triplexes. Hoyne et al. searched for triplexes with stems of more than 7 perfectly matched triplets and loops of 0 to 10 nt, whereas we searched for triplex structures with stems of 5 to 12 nt, allowing one mismatch in stems of at least 7 nt, and loops of 1-6 nt. In contrast to Hoyne et al., we found sequences with the potential to form class I triplex structures in all three species, but, similar to their findings, class II triplexes were the most abundant. Furthermore, the program shows the orientation of the identified triplex structures within the circular genome.
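The search constraints described above (three stems of 5-12 nt, two loops of 1-6 nt, one mismatch tolerated only in stems of at least 7 nt) can be illustrated with a minimal sketch. It is not the ITxF implementation, which is not reproduced here. The function and type names (`find_candidates`, `Hit`) are hypothetical, and the pairing rules are a deliberate simplification: only a pyrimidine-purine-purine layout is scanned, the second stem must be purely purine, the first stem must be its Watson-Crick reverse complement, and the third stem is compared base by base with the reversed second stem.

```python
from typing import List, NamedTuple

PURINES = set("AG")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Hit(NamedTuple):
    """Hypothetical result record: 0-based start, stem length, loop lengths."""
    start: int
    stem: int
    loop1: int
    loop2: int


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]


def find_candidates(seq: str, stem_range=(5, 12), loop_range=(1, 6),
                    mismatch_min_stem=7) -> List[Hit]:
    """Brute-force scan for stem1-loop1-stem2-loop2-stem3 candidates.

    Assumed (simplified) rules, not taken from the ITxF source:
    stem2 is homopurine, stem1 pairs with stem2 as its reverse complement,
    and stem3 is compared with the reversed stem2. One mismatch in total is
    tolerated only if the stem has at least `mismatch_min_stem` nucleotides.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        for n in range(stem_range[0], stem_range[1] + 1):
            s1 = seq[start:start + n]
            if len(s1) < n:
                break  # longer stems cannot fit either
            allowed = 1 if n >= mismatch_min_stem else 0
            for l1 in range(loop_range[0], loop_range[1] + 1):
                p2 = start + n + l1
                s2 = seq[p2:p2 + n]
                if len(s2) < n or not set(s2) <= PURINES:
                    continue
                wc = mismatches(s1, reverse_complement(s2))
                if wc > allowed:
                    continue
                for l2 in range(loop_range[0], loop_range[1] + 1):
                    p3 = p2 + n + l2
                    s3 = seq[p3:p3 + n]
                    if len(s3) < n:
                        break
                    if wc + mismatches(s3, s2[::-1]) <= allowed:
                        hits.append(Hit(start, n, l1, l2))
    return hits


# Toy sequence (made up for illustration): 5-nt stems, 1-nt loops.
print(find_candidates("CCTCCTGGAGGTGGAGG"))
```

A scan of this kind is quadratic in the parameter ranges per start position, which is acceptable for a sketch but would need indexing or pre-filtering of purine-rich runs before being applied to thousands of genomes.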

The ITxF database is available online at http://bioinformatics.uni-konstanz.de/utils/showtriplex/. To our knowledge, it is the first database that allows searches for intrastrand triplexes (not necessarily H-DNA) in prokaryotes. The abundance of triplex motifs in bacteria suggests potential regulatory, organizational or adaptive functions, as proposed by others (56). During our searches, our interest was drawn to one particular motif:

5’-CCCTCTCCCCTTTCGGGGAGAGGGTTAGGGTGAGGGG-3’, which is the consensus sequence of a purine motif triplex (class II) bearing a C-rich stem (9 nt), a first loop (3-6 nt), a first G-rich stem (9 nt), a second loop (3 nt), a second G-rich stem (9 nt) and one mismatched base pair in the stem (see Figure 3.15). This potential triplex motif, named TM in the following, has already been described in an earlier study by Maher and co-workers, but no function has been assigned to it so far. In contrast to earlier publications (10,85), and owing to the large number of prokaryotic genome sequences now available, we found the TM in 174 proteobacterial genomes (see Table 13.3 in the appendices). With a total of 192 TM sequences, Herpetosiphon aurantiacus ATCC 23779 contains the most potential triplex motifs of this type. Closely related organisms such as E. coli and Shigella species carry comparable copy numbers, whereas other genera such as Enterobacter include far more TM sequences (up to 175). Intriguingly, in some closely related species, such as Salmonella, the TM is absent. However, a search in our database yields other triplex-forming sequences in these genomes. Comparing all analyzed strains, we found that most bacterial genomes contain fewer than 30 TMs (see Figure 3.15 D). We were interested in the function of this particular TM and decided to characterize it in more detail in E. coli K-12 substrain MG1655.
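The stem and loop layout stated for the TM consensus can be checked with a few lines of code. The sketch below cuts the consensus sequence into 9-nt stems, a 6-nt first loop and a 3-nt second loop. These cut positions are one reading of the lengths given in the text, not coordinates taken from Figure 3.15, and the helper name `split_motif` is hypothetical. The sketch then tests whether the C-rich stem is the Watson-Crick reverse complement of the first G-rich stem, and counts the positions at which the second G-rich stem differs from the reversed first G-rich stem. Whether that differing position is the mismatch mentioned above is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Consensus sequence as given in the text.
TM = "CCCTCTCCCCTTTCGGGGAGAGGGTTAGGGTGAGGGG"


def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]


def split_motif(seq: str, stem: int = 9, loop1: int = 6, loop2: int = 3) -> dict:
    """Cut a sequence into stem1, loop1, stem2, loop2, stem3 and a remainder.

    The default lengths are assumptions based on the description of the TM.
    """
    layout = (("stem1", stem), ("loop1", loop1), ("stem2", stem),
              ("loop2", loop2), ("stem3", stem))
    parts, pos = {}, 0
    for name, length in layout:
        parts[name] = seq[pos:pos + length]
        pos += length
    parts["tail"] = seq[pos:]  # nucleotides left over after the third stem
    return parts


parts = split_motif(TM)

# Hairpin check: C-rich stem versus reverse complement of the first G-rich stem.
hairpin_ok = reverse_complement(parts["stem2"]) == parts["stem1"]

# Third strand: compare with the reversed purine stem, position by position.
third_strand_differences = sum(
    a != b for a, b in zip(parts["stem3"], parts["stem2"][::-1])
)

print(parts)
print(hairpin_ok, third_strand_differences)
```

With these cut positions, one nucleotide of the consensus remains after the third stem. Choosing 10-nt stems and a 4-nt first loop instead would use up the full sequence, so the exact stem boundaries should be taken from the figure rather than from this sketch.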

Figure 3.15: The TM sequence in prokaryotes.

A Consensus motif of TM sequences found in the E. coli K-12 MG1655 genome. B TM sequence folding into a DNA class II triplex motif. Hoogsteen hydrogen bonds are indicated by dashed lines (modified from Maher et al. (10)). C DNA triplets found in the TM motif. D Frequency of TM sequences found in different proteobacterial genomes (listed in Table 13.3 in the appendices).
