Bioinformatics - Materials and Methods - Genome-based characterization of Yersinia enterocoliti

3. Materials and Methods

3.7. Bioinformatics

3.7.1. Bioinformatics tools

All the tools used in this study for bioinformatics analysis and their respective references/sources are listed in Table 19.

Table 19 List of used bioinformatics tools.

Software/

Platform Description Reference/Link

BioEdit Biological sequence alignment editor (Hall, 1999)

Blast Basic Local Alignment Search Tool (Altschul et al., 1990) BRIG Interactive generation of comparative

genomic images (Alikhan et al., 2011)

BWA Mapping of short read using

Burrows-Wheeler transform (Li and Durbin, 2009)

CDD Conserved Domain Database of the NCBI (Marchler-Bauer et al., 2013) CLC DNA

Workbench

Comprehensive workbench for DNA, RNA,

and protein analyses CLC bio, Aarhus, Denmark CLC Genomic

Workbench High-throughput sequencing data analysis CLC bio, Aarhus, Denmark ClustalW Multiple sequence alignment (Larkin et al., 2007) Edgar Comparative analysis of annotated genomes (Blom et al., 2009) FastQC Quality control of reads from all sequencing

platforms

http://www.bioinformatics.babraham.

ac.uk/projects/fastqc/

FreeBayes Bayesian genetic variant detector http://arxiv.org/abs/1207.3907 Galaxy Web-based platform for data intensive

biomedical research (Goecks et al., 2010) GATK Post-processing and SNP calling (DePristo et al., 2011) Gene

Ontology

Ontology of defined terms representing gene

product properties (Ashburner et al., 2000) KAAS Server for automatic genome annotation and

pathway reconstruction (Moriya et al., 2007) Maple Metabolic And Physiological potential

Evaluator (Takami et al., 2012)

Mauve Generation of multiple genome alignments

for comparative genomics (Darling et al., 2010) MLST Web-based method for MLST of bacterial

species based on WGS data (Larsen et al., 2012) MUSCLE Multiple sequence alignment (Edgar, 2004)

NCBI Biomedical and genomic databases www.ncbi.nlm.nih.gov Pfam Large collection of protein families (Finn et al., 2006)

PGAAP Prokaryotic Genome Automatic Annotation Pipeline

http://www.ncbi.nlm.nih.gov/genome/

annotation_prok/

PHAST Rapid and accurate identification, annotation

and graphically displaying of prophages (Zhou et al., 2011) Phylip Programs for inferring phylogenies (Felsenstein, 1989)

Picard Manipulation of SAM files http://picard.sourceforge.net/

Prophage Finder

Prediction of prophage loci in prokaryotic

genomes (Bose and Barber, 2006)

PsortB Bacterial localization prediction tool (Yu et al., 2010) PubMLST Public databases for molecular typing and

microbial genome diversity (Jolley and Maiden, 2010) RAST Automated service for high-quality annotation

of bacterial genomes (Aziz et al., 2008) SAMtools Manipulation of SAM files, SNP calling (Li et al., 2009)

SEED Comparative genomic environment (Overbeek et al., 2005) SignalP Predicts of signal peptide cleavage sites (Petersen et al., 2011) tRNAscan-SE Web-server for tRNA gene search in genomic

sequences (Schattner et al., 2005)

3.7.2. General sequence analysis

Whole genome, gene and protein sequences were acquired from the online NCBI databases.

Y. enterocolitica strains and accession numbers of their sequenced genomes are given in Table 20. Basic sequence manipulation and visualization were performed using BioEdit Sequence Alignment Editor, version 7.0.5.3, and CLC DNA Workbench, version 6.0. Homology searches in public databases were done using Blast, while identification of conserved protein domains was obtained by CDD and Pfam database search. MLST profiles of Y. enterocolitica strains were determined in silico using the MLST (MultiLocus Sequence Typing) service (version 1.7); the Yersinia MLST allele sequences, in particular, are based on the scheme hosted at the PubMLST database.

3.7.3. Sequencing of bacterial DNA and read quality control

The genomic sequence of Y. enterocolitica strain WA-314 was acquired by BGI-Hongkong Co.

(Hong Kong), who performed high-throughput Illumina paired-end sequencing, quality control of raw data and assembly of the short reads with SOAPdenovo. The methods described in sections 3.7.4 and 3.7.5 were therefore not applied to this sample. All the YE serotype O:3 genomes generated in this study were obtained by re-sequencing with the Illumina MiSeq technology, with a 150-bp paired-end library (IMGM Laboratory, Martinsried) or a 250-bp paired-end library

(collaboration with Prof. Kalinowski, University of Bielefeld). Reads were subject to quality control using the program CLC Genomics Workbench versions 6 and 7 and the FastQC tool embedded in the Galaxy platform (Table 19). In particular, sequencing accuracy was assessed by measuring the base calling accuracy using the Phred quality score (Q score), which indicates the probability (P) that a given base is incorrect in a logarithmic scale (Q = -10 log10 P). For example, a Phred score of 30 means that the chances that a base is called incorrectly are 1 in 1,000 with an accuracy of 99.9%, while a Phred score of 40 corresponds to a base call accuracy of 99.99%.

Table 20 Y. enterocolitica genome sequences used in this study.

Strain Bioserotype Genome Accession N° Reference

8081 1B/O:8 NC_008800 and

NC_008791 (pYV) (Thomson et al., 2006)

WA-314 1B/O:8 AKKR00000000 This study (Garzetti et al., 2012) a127/90 1B/O:8 NC_004564 (pYV) (Foultier and Cornelis, 2003)

Y11 4/O:3 NC_017564 and

NC_017565 (pYV) (Batzilla et al., 2011c) Y8265 4/O:3 CACU01000000 (Batzilla et al., 2011a) Y5307 4/O:3 CACV01000000 (Batzilla et al., 2011a) W22703 2/O:9 FR718488-FR718797 (Fuchs et al., 2011)

105.5R(r) 3/O:9 CP002246.1 (Wang et al., 2011)

Y5.27P 3/O:5,27 CACW00000000 (Batzilla et al., 2011a) NF-O 1A/O:5 CACY00000000 (Batzilla et al., 2011b) IP2222 1A/O:36 CACZ00000000 (Batzilla et al., 2011b) IP10393 4/O:3 CAOV01000000 (Savin et al., 2013) PhRBD_Ye1 n.a./O:3 AGQO00000000 (Klinzing et al., 2012)

YE12/03 4/O:3 HF933425 (Reuter et al., 2012)

O3-gb 4/O:3 n.a. Provided by Dr. A. McNally, Nottingham

3.7.4. Read mapping and alignment post-processing

Reads from each isolate were mapped against the chromosome of Y. enterocolitica strain Y11 serotype O:3 reference genome using the BWA (Burrows-Wheeler Aligner) Galaxy wrapper with default parameters. Mapping quality reports were generated by CLC Genomics Workbench, providing coverage statistics and read distributions. Alignments were filtered using SAMtools (Galaxy wrapper for sam_bitwise_flag_filter), in order to remove unmapped and unpaired reads,

and Picard (Galaxy wrapper for MarkDuplicates), to reduce the data set by including only one copy of the duplicate sequences, which may arise from PCR amplification artifacts during library construction. The GATK tools embedded in galaxy were then used to perform a local re-alignment, to transform regions with mismatches around indels into clean reads. This step is recommended, since these misalignments are easily mistaken as SNPs. A base quality score recalibration is also performed with GATK tools, in order to make error rates more accurate and informative than the ones provided by the sequencers.

In case a de novo assembly of genomes was required, paired reads were assembled by the CLC Genomics Workbench, with the following settings: Minimum contig length = 200; Mismatch cost = 2; Insertion cost = 3; Deletion cost = 3; Length fraction = 0.5; Similarity fraction = 0.8.

3.7.5. Variant calling and filtering

Variants were identified with different approaches, depending on the type of genome sequence.

For assembled genomes downloaded from public databases, variants were called using the web server snpTree version 1.1, with 10 bp as minimum distance between SNPs and 20 bp as minimum distance to end of sequence. For the genomes obtained in this study and mapped against a reference, variants were detected combining the results from two tools (FreeBayes and SAMtools). The FreeBayes algorithm was run in Galaxy with the following parameters:

Minimum mapping quality = 30; Minimum base quality = 20; Minimum coverage = 10; Minimum variant probability = 0.8. Variants were identified by SAMtools with Minimum mapping quality = 30 and Minimum base quality = 20. GATK tools allowed merging of the variants called by both methods, to obtain a more robust variant detection procedure. Variants called in assembled genomes were then combined with the ones obtained from mapped sequences, and filtered to exclude the ones within phage regions and repetitive elements. After manual inspection, low-confidence alleles with a quality score < 130 or heterozygous base call were removed.

Functional consequences (e.g. non-synonymous mutations and amino acid changes) of the detected variants were computed by CLC Genomics Workbench.

3.7.6. Genome annotation

For annotating the assembled bacterial genomes, the RAST (Rapid Annotations using Subsystems Technology) online server was employed. This automatic annotation service is based on a library of manually curated subsystems and on protein families, mainly derived from the subsystems, called FIGfams. When genomes were to be submitted to the GenBank database, the PGAAP pipeline of the NCBI was used, combining ab initio gene prediction

algorithms with homology based methods. Identification of tRNA genes was done by tRNAscan-SE and prophage regions were determined by PHAST and Prophage Finder.

3.7.7. Whole genome comparison

Comparison of multiple whole genome sequences was conducted by means of various programs. To align regions conserved in subsets of the genomes, the Mauve genome alignment package (version 2.3.1) was used. In particular, alignments were done by the progressiveMauve algorithm, which is able to align a large number of genomes with high accuracy, also taking care of regions differentially conserved among the analyzed organisms. For the generation of comparative genomic circular images, the BRIG (BLAST Ring Image Generator) software was chosen. Genome comparison based on the identification of orthologous genes, using bidirectional best hits (BBHs) as orthology criterion, was achieved by Edgar (Efficient Database framework for comparative Genome Analyses using BLAST score Ratios), allowing, among other features, classification of genes as core or pan genes. In particular, core genome development plots were calculated in Edgar according to the approach described in (Tettelin et al., 2005), while pan-genome statistics, based on a Heaps' law approach (Tettelin et al., 2008), were kindly provided by Dr. Jochen Blom (Justus-Liebig-University Giessen). Comparison analyses were also performed within the SEED environment, connected to the RAST server.

3.7.8. Functional annotation and comparative functionome analysis

Genes were functionally classified using KAAS (KEGG Automatic Annotation Server) by BLAST best-hit comparisons against KEGG Genes, a database created from public resources. The KAAS results contain KO (KEGG Orthology) assignments, which consist of manually defined ortholog groups for all proteins and functional RNAs, allowing automatic generation of KEGG Pathways and KEGG Modules (Moriya et al., 2007). Comparison among comprehensive functional categories (functionomes) could be obtained by MAPLE (Metabolic And Physiological potentiaL Evaluator), which calculates the completion ratio of all KEGG functional modules in each organisms based on KEGG Module (Takami et al., 2012). KEGG Module is a collection of pathway modules, which are manually defined functional units (such as consecutive reaction steps or operons); structural complex modules, which comprise multiple molecules (such as the subunits of transporters); functional essential sets; and signature modules. The query sequences are assigned to KO identifiers, whose combinations define each module, allowing functional annotation.

3.7.9. Phylogenetic analyses

Construction of the phylogenetic tree for analysis of the Y. enterocolitica species was performed using the protein sequences of all the core genes (calculated by the Edgar software, section 3.7.8) of a selection of genomes. Orthologous CDSs found in all genomes were concatenated and aligned using the multiple alignment tool MUSCLE. Uninformative characters not matching the alignment were removed by Gblocks. A distance matrix was then calculated from this alignment and finally a phylogenetic tree was constructed under the Neighbor-Joining method.

The two latter methods were used in the Phylip implementations. A majority rule-consensus tree of 100 bootstrap replicates was also computed to evaluate node support.

A SNP-based phylogenetic tree was reconstructed using the filtered variants vcf file from section 3.7.5 as input and the sequence from YE strain Y11 as reference, removing insertions and deletions. For each strain, a nucleotide sequence containing the concatenated base calls at each SNP location was obtained. After performing a multiple alignment by MUSCLE, a DNA-based phylogeny was calculated using the Neighbor-Joining method with Phylip.

Im Dokument Genome-based characterization of Yersinia enterocolitica (Seite 66-71)