
Characterization of G-rich Repeat Sequences in Xanthomonas sp

3 Results and Discussion

3.2 G-rich Bacterial Repeat Sequences with the Potential to Fold Quadruplexes

3.2.1 Characterization of G-rich Repeat Sequences in Xanthomonas sp

The aberrant distribution of putative G-quadruplex forming sequences and heptameric repeats reported by Mrázek and Huang motivated us to investigate the potential G-quadruplex forming sequences in the plant pathogen Xanthomonas in more detail. We first used the ProQuad Pattern Search (261) to determine potential G-quadruplex folding sequences in the Xcc genome. We noticed an intriguing over-representation of GGGAATC repeat patterns among quadruplexes of the type “G3L1-5”, i.e. with a G-tract length of 3 nt and a loop length of 1-5 nt (270 potential G-quadruplexes of type G3L1-5 in total, of which 97 contain GGGAATC at least once; data not shown).
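As an illustration of what a G3L1-5 pattern looks like in practice, the following minimal sketch searches one strand for four G-tracts of three guanines separated by loops of 1-5 nt. The regular expression and the function name `find_g3_l1_5` are assumptions made for this example; the exact matching rules of the ProQuad Pattern Search may differ.

```python
import re

# Assumed reading of "G3L1-5": four tracts of three G's separated by three
# loops of 1-5 nucleotides each. This is not ProQuad's implementation.
G3_L1_5 = re.compile(r"GGG(?:[ACGT]{1,5}GGG){3}")

def find_g3_l1_5(seq):
    """Return (start, end) spans of non-overlapping G3L1-5 candidates."""
    return [m.span() for m in G3_L1_5.finditer(seq.upper())]

# Four GGGAATC units provide four G-tracts with AATC loops; three do not.
four_units = find_g3_l1_5("GGGAATC" * 4)
three_units = find_g3_l1_5("GGGAATC" * 3)
```

Only the given strand is searched; candidates on the opposite strand would require scanning the reverse complement as well.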

This led us to manually screen the genomes of Xcc and the closely related species Xac for GGGAATC-containing tandem repeats.

The following parameters were used to define these G-rich SSRs: the total length must be ≥14 bp (at least 2 units), and the sequence must contain the GGGAATC heptamer at least once. Repeats can be either perfect, (GGGAATC)n, or heterogeneous, (GGGANTN)n. In total, 186 G-rich repeat patterns were identified in Xcc and 183 in Xac. A summary can be found in Table 2. The complete record of repeat sequences from Xcc is shown in Table 49; repeat sequences from Xac are compiled in Table 55.
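The screening itself was done manually, but the stated criteria can be approximated programmatically. The sketch below is one possible encoding, assuming strictly 7-nt units of the form GGGANTN; the names `UNIT`, `TANDEM` and `find_g_rich_repeats` are hypothetical, and repeats with other unit variants would not be picked up by this pattern.

```python
import re

# Assumed encoding of the stated criteria: tandem units matching GGGANTN,
# at least two units (>= 14 bp), and at least one exact GGGAATC unit.
UNIT = "GGGA[ACGT]T[ACGT]"
TANDEM = re.compile(f"(?:{UNIT}){{2,}}")

def find_g_rich_repeats(seq):
    """Return (start, end, n_units) for each qualifying run on the given strand."""
    hits = []
    for m in TANDEM.finditer(seq.upper()):
        run = m.group()
        # Split the run into its consecutive heptamer units.
        units = [run[i:i + 7] for i in range(0, len(run), 7)]
        if "GGGAATC" in units:
            hits.append((m.start(), m.end(), len(units)))
    return hits

# Toy sequence: three perfect units followed by one heterogeneous unit.
example = find_g_rich_repeats("TTT" + "GGGAATC" * 3 + "GGGAGTC" + "ACGT")
```

To cover repeats on the minus strand, the same function could be applied to the reverse complement of the sequence.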

Table 2: Overview of GGGAATC Repeat Sequences in Xanthomonads

Xcc ATCC 33913 Xac strain 306 number of inverted repeats 70 in total:

51 between G3AATC repeats only (102

The frequency plot in Figure 27A shows the consensus motif of a heptamer unit. In both organisms, positions 1-4 and 6 show high sequence conservation, while positions 5 and 7 are more variable.

Repeat length varied extensively, from 2 to 26 units in Xcc (14-183 nt) and from 2 to 18 units in Xac (14-126 nt). In total, the repeats cover 0.11% and 0.10% of the whole chromosome, respectively.

In Xcc, 83% of all repeats have ≥4 iterations of a G-rich unit; in Xac the percentage is slightly lower, at 75%. In fact, the majority of sequence motifs were found to comprise 4 repeat units, as shown in the histogram in Figure 27B.

Figure 27: Overview of GGGAATC Repeat Sequences in Xcc and Xac

A: Frequency plot (262) shows the consensus nucleotide sequence of a heptameric repeat unit in Xcc (top) and Xac (bottom). B: Histogram shows the count of repeat iterations per repeat sequence in Xcc (dark blue) and Xac (light blue).

C: Examples of GGGAATC patterns in Xcc. Repeat #08, located upstream of the hypothetical gene xcc0178, is the longest perfect repeat present (top). Repeats #03 and #04 represent an inverted repeat, with two short repeat sequences located in convergent orientation on the plus and on the minus strand of the genome (bottom). D: Distribution of GGGAATC repeats on the Xcc (AE008922, top) and Xac (NC_003919, bottom) genomes. Repeats located on the plus strand are marked in blue (84 Xcc, 85 Xac), repeats on the minus strand in red (102 Xcc, 98 Xac). Locations of the repeat-associated genes groES, dnaE, flgF, pilU, ruvA, pyrE and xpsF have been marked for orientation. E: Orientation of neighboring genes relative to repeat sequences in Xcc (left) and Xac (right). Intergenic repeats can be located on the same strand that will serve as the coding strand of the aligned ORFs (dark blue) or on the non-coding strand (light blue), between convergent (dark green) or divergent (light green) ORFs. Intragenic repeats can be located on the coding strand (dark gray) or non-coding strand (light gray). For an explanation of the pictograms see Figure 28.

Interestingly, 4 repeat units correspond exactly to the number of consecutive G-tracts needed for the formation of a G-quadruplex. In total, 56% of all repeats in Xcc and 42% in Xac are made up of ≥4 iterations and carry no point mutations in the G-tract, which would prevent G-quadruplex formation. The longest perfect repeat from Xcc, with 14 GGGAATC iterations, is shown in Figure 27C (top). In a few cases we noticed single units with a point mutation at the second position of the G-tract (Figure 27A), which were surrounded by heptamers with intact G-tracts. GGGAATC repeat sequences were found dispersed all over the chromosome in both species and show no preference for a defined region on the chromosome, such as the origin or terminus of replication (the origins of replication for both chromosomes are predicted to be located between the genes rpmH and gyrB (263)). Repeats are about equally distributed on the plus and minus strand of the chromosome and show no preference for the leading or lagging strand during replication (Figure 27D). In contrast to Xcc, Xac carries two plasmids, pXAC33 and pXAC64, in addition to the chromosome, but no repeat sequences were identified on the plasmids. It should be noted that so far we have only identified these types of G-rich patterns within the genus Xanthomonas, but not in other γ-proteobacteria.
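The criterion of ≥4 iterations with intact G-tracts can be written down as a simple check. The following sketch assumes 7-nt units that each begin with the G-tract; the helper names `units_of` and `may_fold_quadruplex` are hypothetical, and the two example sequences are expanded from entries in Table 4 and Table 3.

```python
def units_of(repeat_seq, unit_len=7):
    """Split an expanded repeat sequence into consecutive units (assumed 7 nt)."""
    return [repeat_seq[i:i + unit_len] for i in range(0, len(repeat_seq), unit_len)]

def may_fold_quadruplex(repeat_seq):
    """True if the repeat has >= 4 units and every unit starts with an intact GGG."""
    units = units_of(repeat_seq.upper())
    return len(units) >= 4 and all(u.startswith("GGG") for u in units)

# Xcc #008, a perfect repeat of 14 units.
xcc_008 = "GGGAATC" * 14
# Xac #037, in which most units carry a mutated G-tract (GGA instead of GGG).
xac_037 = "GGGAATC" * 3 + "GGAAATC" * 8 + "GGAAAAG"

xcc_008_ok = may_fold_quadruplex(xcc_008)
xac_037_ok = may_fold_quadruplex(xac_037)
```

This is a sequence-level filter only; whether a given repeat actually folds into a quadruplex is an experimental question.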

Figure 28: Scheme Showing Possible Arrangements of Repeats and Neighboring Genes in the Genome

Intergenic repeats (outside of ORFs) can be located on the same strand that will serve as the coding strand of the aligned ORFs (A) or on the non-coding strand (B), between convergent (C) or divergent (D) ORFs. Intragenic repeats (within ORFs) can be located on the coding strand (E) or non-coding strand (F). ORFs are depicted by light green arrows, dark green arrows show the direction of transcription. ATG marks the start codon. GGGAATC repeats are depicted by light blue boxes. In the pictograms used in Figure 27 and Figure 30 ORFs are represented by black arrows and the repeat with R.

When analyzing the location of a repeat sequence relative to the neighboring ORFs, we distinguished between six possible arrangements of a repeat and its nearest neighboring genes (Figure 28). A repeat can be located intergenically (outside of ORFs) between aligned ORFs, either on the strand that will serve as the coding strand of the genes or on the complementary non-coding strand. It can also be located intergenically between convergent or divergent ORFs. In addition, repeats can occur intragenically (within ORFs), either on the coding or on the non-coding strand of the gene. In the latter case the C-rich complementary repeat sequence is located on the coding strand. The repeats were most often found in intergenic regions (89% Xcc, 93% Xac) (Figure 27E) and are almost exclusively located closer to the next 5' neighboring ORF (average distance 28 bp) than to the next downstream ORF (average distance 160 bp).
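The six arrangements can be expressed as a small decision function. This is a sketch under stated assumptions: strands are given as "+" or "-" in plus-strand coordinates, the left and right ORFs are the nearest flanking genes, and the function name `classify_arrangement` and its category labels are hypothetical.

```python
def classify_arrangement(repeat_strand, left_orf_strand, right_orf_strand,
                         host_orf_strand=None):
    """Assign a repeat to one of the six arrangements described in the text.

    host_orf_strand is the strand of the ORF containing the repeat,
    or None if the repeat is intergenic.
    """
    if host_orf_strand is not None:
        if repeat_strand == host_orf_strand:
            return "intragenic, coding strand"
        return "intragenic, non-coding strand"
    if left_orf_strand == right_orf_strand:
        # Both flanking ORFs point the same way ("aligned").
        if repeat_strand == left_orf_strand:
            return "intergenic, coding strand of aligned ORFs"
        return "intergenic, non-coding strand of aligned ORFs"
    if left_orf_strand == "+":
        # Left ORF on plus, right ORF on minus: the genes point toward each other.
        return "intergenic, between convergent ORFs"
    # Left ORF on minus, right ORF on plus: the genes point away from each other.
    return "intergenic, between divergent ORFs"

example = classify_arrangement("-", "+", "+")
```

Tallying the returned labels over all repeats of a genome would give category shares of the kind shown in Figure 27E.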

Regarding the orientation of the neighboring ORFs relative to the intergenic repeats, we found that in the majority of cases both ORFs were oriented in the same direction, with the repeat located in the intergenic region between them. In Xcc, 30% of the G-rich patterns are present on the same strand as the aligned ORFs and 35% on the opposite strand. In Xac, 35% of all repeats fall into each of these two categories. In both xanthomonads 16% of repeats were located between convergent ORFs, while only 8% in Xcc and 7% in Xac were located between divergent ORFs. Intragenic repeats are similarly rare, accounting for 11% in Xcc and 7% in Xac (Figure 27E).

Table 3: Longest Repeats in Xac

Sequences of the longest repeats (≥7 units) in Xac, with G-tracts underscored. Total length, number of repeat units, and locus tags and descriptions of the upstream and downstream neighboring genes are given. Occurrence of a repeat as part of an inverted repeat is indicated (inv rep), and the number of the partnering repeat is listed.

# | sequence | upstream gene / downstream gene | length (nt) | units | inv rep
050 | 5'-CGGAATC(GGGAATC)4(GGGATTC)12GGGCAAT-3' | radA, DNA repair protein (xac1263) / hypothetical protein (xac1262) | 126 | 18 | no
092 | 5'-(GGGAATC)7GGGAAGCGGGAATCGGGAAGCGGGAATC(GGGAAGC)4-3' | flgA, flagellar basal body P-ring biosynthesis protein FlgA (xac1988) / cheV, chemotaxis protein (xac1987) | 105 | 15 | 091
037 | 5'-(GGGAATC)3(GGAAATC)8GGAAAAG-3' | oxyR, oxidative stress transcriptional regulator (xac0905) / hypothetical protein (xac0904) | 84 | 12 | no
087 | 5'-(GGGATTC)5GGGATTGGGATAATC(GGGAATC)3-3' | cheA, chemotaxis protein (xac1930) / Isxac1, Isxac1 transposase (xac1929) | 71 | 10 | no
152 | 5'-(GGGAATC)8-3' | cebR, transcriptional regulator (xac3487) / suc1, sugar transporter (xac3488) | 56 | 8 | no
008 | 5'-GAGAATC(GGGAATC)6-3' | lipid kinase (xac0475) / trpE, anthranilate synthase component I (xac0476) | 49 | 7 | no
051 | 5'-(GGGATTG)2GCGAGTC(GGGAATC)4-3' | hypothetical protein (xac1288) / ffh, signal recognition particle protein (xac1289) | 49 | 7 | no

Table 4: Longest Repeats in Xcc

Sequences of the longest repeats (≥7 units) in Xcc, with G-tracts underscored. Total length, number of repeat units, and locus tags and descriptions of the upstream and downstream neighboring genes are given. Occurrence of a repeat as part of an inverted repeat is indicated (inv rep), and the number of the partnering repeat is listed.

# | sequence | upstream gene / downstream gene | length (nt) | units | inv rep
014 | 5'-(GGGATTC)5(GGGAATT)4GGGAATCGGGAGTC(GGGAATC)2(GGGAGTC)2GGGAATC(GGGAGTC)3(GGGAATC)3GGGAGTCGGCGATTGGGGATTTGGGATTC-3' | conserved, hypothetical protein (xcc0513) / prmA, 50S ribosomal protein L11 methyltransferase (xcc0512) | 183 | 26 | no
120 | 5'-TGGAATT(GGGAATT)6GGGAATC(GGGAATT)7(GGGAATC)8-3' | osmC, osmotically inducible protein (xcc2745) / pyrB, aspartate carbamoyltransferase (xcc2746) | 161 | 23 | no
155 | 5'-(GGGATTC)6GGGATC(GGGAATC)11-3' | cebR, transcriptional regulator (xcc3356) / suc1, sugar transporter (xcc3357) | 132 | 19 | no
008 | 5'-(GGGAATC)14-3' | conserved, hypothetical protein (xcc0176) / cls, cardiolipin synthase (xcc0177) | 98 | 14 | no
124 | 5'-(GGGAATC)4(GGGAGTC)2(GGGAATC)3GGGAGTC(GGGAATC)3GGGAAAA-3' | acetyltransferase (xcc2770) / nifS, cysteine desulfurase (xcc2769) | 98 | 14 | 123
006 | 5'-(GGGGATT)3(GGGAATC)8-3' | lldD, L-lactate dehydrogenase (xcc0106) / ATP-dependent DNA ligase (xcc0105) | 77 | 11 | no
034 | 5'-(GGGAATC)10-3' | kdpD, two-component system sensor protein (xcc0705) / kdpC, potassium-transporting ATPase subunit C (xcc0704) | 70 | 10 | no
044 | 5'-(GGGATTG)4(GGGAATC)5GGGTGCA-3' | voltage-gated potassium channel beta subunit (xcc0766) / yeiM, nucleoside transporter (xcc0765) | | |
 | | acnA, aconitate hydratase (xcc1033) / prpC, methylcitrate synthase (xcc1032) | 56 | 8 | no
112 | 5'-GAGATTCGGGAATC(GGGAAGC)5GGGAATC-3' | maltose transporter gene repressor (xcc2464) / cgt, cyclomaltodextrin glucanotransferase (xcc2465) | 56 | 8 | no
166 | 5'-GGGAATGGGGAGTCGGGAATGGGAAATGGGGAATCGGGAATGGGGAATCGGGATTC-3' | conserved, hypothetical protein (xcc3710) / yagR, oxidoreductase (xcc3709) | 56 | 8 | no

The 12 longest repeat sequences of Xcc and the 7 longest of Xac (≥7 units), together with their associated genes, are listed in Table 4 and Table 3, respectively. The genomes of Xcc and Xac show substantial colinearity; despite their distinct phenotypes and host specificity, more than 80% of the genes are shared (263,264). Interestingly, despite this high degree of sequence homology between the two xanthomonads, repeats of similarly prominent length are not found in association with the same genes in these species. One exception is #155 in Xcc and #152 in Xac, which are both located between cebR and suc1, coding for a transcriptional regulator and a sugar transporter, respectively. Long G-rich patterns are often made up of combinations of different heptameric tandem repeats, e.g. #155 in Xcc, 5'-(GGGATTC)6GGGATC(GGGAATC)11-3', while long perfect repeats are rarer. The longest perfect GGGAATC repeat in Xcc is repeat #008 with 14 iterations (Figure 27C, top), which is located between a conserved hypothetical gene and a copy of the cls gene. In Xac the longest perfect repeat is #152, comprising 8 units (see above).

Remarkably, in 70 cases in Xcc and 75 cases in Xac, two repeats were found in convergent orientation in close proximity to each other, one on the plus and one on the minus strand of the genome. An example of an inverted repeat is shown in Figure 27C (bottom). This arrangement is of particular interest, as inverted repeats have the potential to give rise to stem-loop structures or cruciforms, which have been implicated in influencing replication or recombination and in regulating gene expression (42).
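Given a list of annotated repeats, candidate inverted repeats of this kind could be paired as sketched below. The text does not define "close proximity" numerically, so the `max_gap` threshold of 200 nt is purely an assumption, as are the `Repeat` record, the function name `pair_inverted_repeats` and the toy coordinates.

```python
from typing import NamedTuple

class Repeat(NamedTuple):
    name: str
    start: int   # 0-based start in plus-strand coordinates
    end: int     # exclusive end
    strand: str  # "+" or "-"

def pair_inverted_repeats(repeats, max_gap=200):
    """Pair neighboring repeats lying in convergent orientation (+ then -).

    max_gap is a hypothetical cutoff for "close proximity".
    """
    ordered = sorted(repeats, key=lambda r: r.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end
        if left.strand == "+" and right.strand == "-" and 0 <= gap <= max_gap:
            pairs.append((left.name, right.name, gap))
    return pairs

# Toy annotation with made-up coordinates, for illustration only.
toy = [
    Repeat("r1", 100, 128, "+"),
    Repeat("r2", 198, 226, "-"),
    Repeat("r3", 5000, 5028, "+"),
]
toy_pairs = pair_inverted_repeats(toy)
```

Each reported pair carries the length of the sequence between the two repeats, which is the quantity one would use to describe spacing within inverted repeats.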

Figure 29: Inverted Repeats

A: Number of units of repeats taking part in inverted repeat formation for Xcc (dark blue) and Xac (light blue). The majority of inverted repeats are formed between shorter repeats (unit size 3-5) and not in combination with long repeats. B: Size distribution of interrepeat sequences found between inverted repeats, shown for Xcc (dark blue) and Xac (light blue). Sequences were grouped into 8 categories: 0-20 nt, 20-40 nt, 40-60 nt, 60-80 nt, 80-100 nt, 100-120 nt, 120-140 nt and ≥140 nt.
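The grouping into eight size categories can be sketched as follows. Because the listed categories share their boundaries (e.g. 20 nt appears in both 0-20 and 20-40), this example assumes half-open bins in which the lower bound is included; the function names `spacer_bin` and `spacer_histogram` are hypothetical.

```python
from collections import Counter

def spacer_bin(length_nt, width=20, top=140):
    """Map a spacer length to its size category (lower bound inclusive, assumed)."""
    if length_nt >= top:
        return f">={top} nt"
    lower = (length_nt // width) * width
    return f"{lower}-{lower + width} nt"

def spacer_histogram(lengths):
    """Count how many spacers fall into each size category."""
    return Counter(spacer_bin(n) for n in lengths)

# Made-up spacer lengths, for illustration only.
toy_counts = spacer_histogram([5, 70, 75, 300])
```

With the default arguments this yields the seven 20-nt bins from 0 to 140 nt plus one open-ended bin, i.e. eight categories in total.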

Inverted repeats were often found within the same intergenic region, but also overlapping with open reading frames (ORFs). Most inverted repeats are formed between shorter repeats (unit size 3-5) and not in combination with long repeats (Figure 29A). The average distance between inverted repeats was 70 nt; few interrepeat sequences were shorter than 20 nt or longer than 100 nt (Figure 29B). We did not find the sequence between the repeats to be complementary to any other region in the respective chromosome. Examples of possible stem-loop structures formed by intergenic and intragenic inverted repeats of different lengths (100-200 bp) from Xcc were modeled on the DNA and RNA level using the mfold web server (http://mfold.rna.albany.edu/) (265), and the most stable structures are shown in Figure 51 and Figure 52. The magnitude of the Gibbs free energy ΔG for folding of the predicted DNA structures increases with the length of the inverted repeat, with values ranging between -26 and -45 kcal/mol; the predicted RNA structures were more stable, with ΔG ranging from -53 to -93 kcal/mol (Table 5). This is comparable to the Chunjie MITEs (178-235 bp), for which ΔG values of -98 to -130 kcal/mol were reported (163).

Table 5: DNA Sequences and Calculated Gibbs Free Energies for the Most Stable DNA and RNA Structures Formed for Selected Inverted Repeats from Xcc.

Modeling and calculations were performed with the mfold webserver. Respective repeat sequences are shown in bold.

Locus tags of upstream and downstream neighboring genes are given. For the intragenic sequence 071_072 the start codon is shown in green and the stop codon is underscored.

# DNA sequence

intragenic: xcc1359 (rnhB, start codon)/ xcc1360 (lpxB, stop codon)

-37.53 -79.10