Supporting information

S. coelicolor.aln. An alignment of signature words of Streptomyces coelicolor A3(2)

Fig. S1. Distribution of the coding sequences percentage in 684 completely sequenced bacterial chromosomes.

Part 4: Genome diversity of Pseudomonas aeruginosa sublines

4. Genome Diversity of Pseudomonas aeruginosa PAO1 laboratory strains

4.1 Background

This laboratory has maintained a large collection of bacterial strains for many years. These strains were isolated in the past, some as long ago as the 1940s, and have been passed on to different laboratories across the world. Events such as subcloning, contamination or maintenance at inappropriate temperatures have putatively driven microevolution in these sublines. Microevolution is a known problem in microbiology laboratories (Spira et al. 2008). The extent of genomic modification has probably been underestimated to date, given the lack of tools with which to examine it. One example is hypermutator strains. These strains, which carry mutations in genomic mutation repair genes such as mutT or mutS (Woodford and Ellington 2007, Winstanley et al. 2009), accumulate mutations in the rest of their genome very quickly and typically display a highly abnormal phenotype. Hypermutators are in most cases unsuitable for further research. Should sequencing costs continue to decrease, isolates in microbiological strain databanks may in future be routinely sequenced for better quality control.

4.1.1 Genome assembly with short reads

Genome assembly, the art of clustering and positioning reads into large contiguous stretches of sequence termed contigs, has always been a core element of bioinformatics. Two methodologies are recognised: de novo assembly and reference-assisted assembly. Both were required in this P. aeruginosa resequencing project. De novo assembly uses no prior information other than that contained in the reads to build contigs. In most cases this is done via an overlap-layout-consensus algorithm (Pop et al. 2004). First, reads with some similarity are aligned to one another to identify overlaps. The layout step then positions the reads with respect to one another, resulting in a multiple alignment of all reads. Finally, a consensus sequence is produced from the most common base at each position across the aligned reads. Popular programs which use this system are Phrap (Ewing and Green 1998), Minimus (Sommer et al. 2007) and Edena (Hernandez et al. 2008).
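The overlap and layout steps can be illustrated with a deliberately simplified sketch. It assumes error-free reads from a single strand and merges the pair with the longest exact suffix-prefix overlap until no overlap remains; because the reads are error-free, the consensus step is trivial and is omitted. The function names and the min_overlap parameter are hypothetical and do not correspond to Phrap, Minimus or Edena.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers tolerate sequencing errors, consider both strands and avoid the all-against-all comparison used here, which scales quadratically with the number of reads.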

Reference assembly is dependent upon the availability of a closely related reference sequence. Reads are mapped onto the reference and a consensus is generated. This approach is relatively rapid and computationally undemanding. However, regions not present in the reference sequence are simply ignored. Repeats, which are a serious, long-standing and unavoidable problem in sequence assembly, complicate matters in both scenarios. Reads mapping to multiple positions in a reference genome are typically assigned randomly to one of these positions (Li and Durbin 2009). In contrast, contigs created by de novo assemblers cannot be extended beyond a repeat (Zerbino and Birney 2008, Pop and Salzberg 2008). Repeats are an even more serious problem in short read assembly (Zerbino and Birney 2008), because assembly programs experience the aforementioned problems whenever repeat length exceeds read length. With short 36 bp reads, as used in this manuscript, assemblies are limited by many more common short repeat regions, which would be spanned by a much longer read from a traditional Sanger sequencing project.
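The effect of repeats on read mapping can be sketched as follows. This is a toy example that assumes exact matching against a short invented reference containing a 6 bp repeat; real mappers tolerate mismatches and use an index rather than a linear scan. The function names and sequences are hypothetical. A read shorter than the repeat maps to two positions and is placed at random, whereas a read long enough to span the repeat maps uniquely, and a read absent from the reference is left unmapped.

```python
import random


def find_all(reference, read):
    """Return every start position at which `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def map_reads(reference, reads, rng):
    """Place each read on the reference.

    Reads with several exact hits are assigned to one hit chosen at random;
    reads without any hit are returned separately as unmapped.
    """
    placements, unmapped = [], []
    for read in reads:
        hits = find_all(reference, read)
        if hits:
            placements.append((rng.choice(hits), read))
        else:
            unmapped.append(read)
    return placements, unmapped


# Invented reference: the repeat ACGTAC occurs at positions 2 and 12.
reference = "GGACGTACTTCCACGTACAA"
short_read = "CGTA"        # shorter than the repeat: two possible positions
spanning_read = "GACGTACT"  # extends beyond the repeat: one position
novel_read = "TTTT"         # not present in the reference

placements, unmapped = map_reads(
    reference, [short_read, spanning_read, novel_read], random.Random(0)
)
print(find_all(reference, short_read), find_all(reference, spanning_read))
print(placements, unmapped)
```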

Modern short read resequencing projects commonly have an initial reference assembly phase followed by a de novo assembly phase. In this second phase, regions of DNA which are putatively unique to the organism being sequenced are identified by assembling the reads that did not map to the reference sequence. The closer the genome under study is to the reference genome, the fewer reads will derive from novel DNA. Care must be taken to distinguish between misassemblies created by the de novo assembly algorithm and high confidence contigs which represent truly novel genomic regions. Typically a conservative coverage cutoff of 10 reads is also used here to filter out low confidence data.
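A coverage filter of this kind might look like the following sketch. It assumes that the cutoff refers to the mean number of reads covering each contig position, which is only one possible reading of the threshold; the Contig class, its fields and the example values are hypothetical and are not taken from the pipeline used in this project.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int            # contig length in bp
    read_count: int        # number of reads assembled into the contig
    read_length: int = 36  # read length in bp

    @property
    def mean_coverage(self):
        """Average number of reads covering each position of the contig."""
        return self.read_count * self.read_length / self.length


def high_confidence(contigs, min_coverage=10.0):
    """Keep only contigs whose mean coverage reaches the cutoff."""
    return [c for c in contigs if c.mean_coverage >= min_coverage]


candidates = [
    Contig("contig_a", length=360, read_count=100),  # mean coverage 10.0
    Contig("contig_b", length=720, read_count=50),   # mean coverage 2.5
]
kept = high_confidence(candidates)
print([c.name for c in kept])
```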

4.2 About the manuscript

This manuscript follows the investigation of two Pseudomonas aeruginosa strain PAO1 sublines, termed MPAO1 and PAO-DSM. These sublines have been present in various laboratories for a number of years after isolation in Australia in the 1940s and differences have long been apparent.

Unpublished wet laboratory data had already been collected which illustrate the phenotypic anomalies of these sublines. A thorough analysis of these data was only feasible with a resequencing approach able to uncover their genetic basis. As these strains are likely to differ only subtly from the already sequenced PAO1 (Stover et al. 2000), a sequencing technology able to reliably detect SNPs was required. Over ten million short 36 bp reads from an Illumina device were thus used. The main differences between the two sublines and PAO1 include a missing large inversion, deletions of between 3 and 1006 bp in length, and 39 single nucleotide polymorphisms, 17 of which affect protein sequences. Examples include substitutions in the FtsZ cell division protein and in the NapA nitrate reductase, which is involved in anaerobiosis. A prophage was also implicated by de novo assembly of reads into five contigs, some of which span the phage cargo region containing phosphotransferases and kinases. On the phenotypic level, the PAO1 sublines responded differently to nutrient limitation and heat stress, and also showed modified virulence in a murine airway infection model. In PAO-DSM, 9 of 79 reads at the LasR locus carried an arginine to leucine substitution, indicating a subpopulation with defective quorum sensing and pleiotropic phenotypic effects. This observation exemplifies the sensitivity of the Illumina Genome Analyzer system. Lastly, results from competition experiments showed a clear advantage for the original PAO1 and subline MPAO1, which immediately killed PAO-DSM cells when combined with them in a mixed population. In summary, the finding that various P. aeruginosa strains can evolve in the laboratory environment despite precautions has ramifications for the reproducibility of research on apparently identical model organisms.
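The arithmetic behind the 9/79 observation can be made explicit with a short sketch. The variant fraction follows directly from the read counts. The binomial calculation additionally assumes a per-base error rate of 1 % and independent errors; both are illustrative assumptions, not measured properties of this data set, and the function names are hypothetical.

```python
from math import comb


def variant_fraction(variant_reads, total_reads):
    """Fraction of reads at a position that carry the variant base."""
    return variant_reads / total_reads


def prob_at_least(k, n, error_rate):
    """Binomial probability of k or more erroneous reads among n reads."""
    return sum(
        comb(n, i) * error_rate**i * (1 - error_rate) ** (n - i)
        for i in range(k, n + 1)
    )


fraction = variant_fraction(9, 79)
# ASSUMPTION: 1 % per-base error rate, chosen only for illustration.
p_error_only = prob_at_least(9, 79, error_rate=0.01)
print(f"variant fraction: {fraction:.3f}")
print(f"probability under the assumed error model: {p_error_only:.2e}")
```

Under these assumptions, sequencing error alone is an unlikely explanation for nine concordant variant reads, which is consistent with a genuine subpopulation of roughly one cell in nine.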

This manuscript is in revision at the Journal of Bacteriology at the time of writing. My contribution was restricted to the establishment of a short read assembly pipeline, bioinformatic assistance and annotation of the novel phage DNA.

4.3 References

Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3) 186-194.

Hernandez, D., François, P., Farinelli, L., Osterås, M. & Schrenzel, J. (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 18(5) 802-809.

Li, H. & Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14) 1754-1760.

Pop, M., Phillippy, A., Delcher, A. L. & Salzberg, S. L. (2004) Comparative genome assembly. Brief Bioinform 5(3) 237-248.

Pop, M. & Salzberg, S. L. (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24(3) 142-149.

Sommer, D. D., Delcher, A. L., Salzberg, S. L. & Pop, M. (2007) Minimus: a fast, lightweight genome assembler. BMC Bioinformatics 8, 64.

Spira, B., Hu, X. & Ferenci, T. (2008) Strain variation in ppGpp concentration and RpoS levels in laboratory strains of Escherichia coli K-12. Microbiology 154(Pt 9) 2887-2895.

Stover, C. K., Pham, X. Q., Erwin, A. L., Mizoguchi, S. D., Warrener, P., Hickey, M. J., Brinkman, F. S., et al. (2000) Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406(6799) 959-964.

Winstanley, C., Langille, M. G. I., Fothergill, J. L., Kukavica-Ibrulj, I., Paradis-Bleau, C., Sanschagrin, F., et al. (2009) Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool Epidemic Strain of Pseudomonas aeruginosa. Genome Res 19(1) 12-23.

Woodford, N. & Ellington, M. J. (2007) The emergence of antibiotic resistance by mutation. Clin Microbiol Infect 13(1) 5-18.

Zerbino, D. R. & Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5) 821-829.