Contamination analyses for the selected Platanus short read assembly It has been shown that filtering out the contaminating reads can improve an assembly (Kumar et

Supplemental Table S21 A-F: Binomial tests for enrichment or depletion of DNA modifications in various genomic features

Section 4: Detailed Bioinformatics Methods

4.1 Illumina Assemblies

4.1.4 Contamination analyses for the selected Platanus short read assembly It has been shown that filtering out the contaminating reads can improve an assembly (Kumar et

al. 2013). Therefore, to remove reads from contaminating species, we adopted a procedure similar to that used for the Tardigrade genome (Koutsovoulos et al. 2016) with the help of BlobTools (Laetsch and Blaxter 2017). We focused on the Platanus assembly produced from quality filtered, error-corrected reads. Instead of using the final gap-closed scaffolds, we used the Platanus contigs (largest contig ~77 kb) to allow each contig to be characterized separately and avoid discarding data due to misjoins.

Ideally one could use coverage and/or GC content information associated with annotated bacterial contigs to also eliminate the likely bacterial contigs that were not annotated. Unfortunately, the annotated bacterial contigs were of equivalent coverage compared with those annotated as arthropod (see Supplemental Figure S3). Moreover, the GC content of the bacterial and eukaryotic contig clusters overlapped enough to limit its usefulness. Therefore, we used an alternative set of reads from pre-amplification stage salivary glands (Urban et al. 2016), reasoning that since this dataset was from a different tissue, from a different stage, and prepared by a different person, the contaminating contigs would have much lower coverage. Over 95% of these reads mapped to the assembly, suggesting most or all of the somatic Sciara genome is represented in the Platanus contigs. As expected, the BlobPlot demonstrates that the coverage of the bacterial cluster is much lower in this dataset (see Supplemental Figure S3). We therefore marked for removal non-Arthropod-labeled contigs (and associated reads) from the assembly that were either labeled as super-kingdom "Bacteria" or had coverage < 0.1. To filter out contaminating reads for re-assembly, we removed all paired-end reads where both mates mapped to contigs marked for removal, and retained pairs when at least one mate mapped inside a retained contig or when both mates did not map to the assembly. The retained paired-end reads were used for assembly with Platanus in four ways: (i) no quality filtering nor error-correction, (ii) no quality filtering but error-correction with BayesHammer, (iii) quality filtering but no error-correction, and (iv) both quality filtering and error correction.

Overall, the contiguity statistics of the re-assembled Platanus contigs did not improve over simply eliminating the contaminating contigs from the original assembly, with the longest contig being

~28 kb in both scenarios. The scaffold contiguity statistics were slightly smaller than the original Platanus assemblies due to the absence of the very large bacterial scaffolds. Evaluation scores were similar among the four Platanus re-assemblies from different sets of filtered reads. REAPR's percent error-free bases metric improved drastically from ~56% error-free bases in the original assemblies to ~80% error-free bases in the assemblies produced from contamination-filtered reads. The assembly from quality-filtered, error-corrected reads had the highest percentage of error-free bases. We attempted a second round of contamination-filtering on this assembly, but found few additional benefits. There were only 15 contigs labeled as Bacteria, all Rickettsia at the genus level, summing in length to only 2,277 bp.

4.1.4.1 Obtaining taxonomy IDs from BLAST hits for each short read Platanus contig:

The Platanus fasta file containing contigs (not scaffolds) from the assembly of quality filtered, error-corrected reads (Short Read Set #4) was subdivided into 1000 different fasta files and submitted to SLURM as a batch script array that ran the following on each:

NT=/Path/to/entire/nucleotide/database blastn -task megablast -query $Q -db $NT \

After ensuring all jobs completed successfully (and re-running if not), final blast results were all combined into a single file:

platanus.q5.bh.blast.results

The NCBI Taxonomy database containing nodes.dmp and names.dmp was downloaded into a subdirectory referred to as “./tax/” in commands below. This was used with BlobTools and custom taxonomy scripts as described below to classify contigs given the taxonomy IDs associated with each.

4.1.4.2 Using BlobTools and custom scripts to label each contig with a taxonomy:

Blob plots using platanus kmer coverage from the assemblies:

B=platanus.q5.bh.blast.results BASE=platanus.q5.bh.contig

REF=${BASE}.fasta DB=${BASE}.BlobDB.json

# Create blob database using kmer coverage from the assemblies

blobtools create -i ${REF} -y platanus --nodes tax/nodes.dmp --names tax/names.dmp -o ${BASE} -t $B

# Plot

blobtools blobplot -i ${DB} [--format pdf]

blobtools blobplot -i ${DB} -r superkingdom [--format pdf]

blobtools blobplot -i ${DB} -r order -p 10 [--format pdf]

blobtools blobplot -i ${DB} -r family -p 12 [--format pdf]

# Make Table with Taxon Summaries for each contig blobtools view -o ${BASE}.table -i $DB -r all

Blob plots using read coverage:

The GC content for bacterial contigs was similar to the eukaryotic contigs. Moreover, the Bacterial and Eukaryotic contigs had similar coverage levels given the male embryo dataset used in the assemblies. Therefore, we tried a separate dataset generated from dissected out pre-amplification stage larval salivary glands (Urban et al. 2016). Reads were aligned with Bowtie 2 (Langmead and Salzberg 2012).

PFX=pre-amp-sal-gland-reads READS=${PFX}.fastq.gz

BASE=platanus.q5.bh.contig REF=${BASE}.fasta

B=platanus.q5.bh.blast.results DB=${PFX}.${BASE}.BlobDB.json

# Map reads

bowtie2 -p $P -q --very-sensitive -N 1 -x ${BASE} -U ${READS} 2>

${PFX}.${BASE}.bt2.err | samtools sort --threads $P -T ${PFX}.tmp -o

${PFX}.bam

blobtools blobplot -i ${DB} -r superkingdom [--format pdf]

blobtools blobplot -i ${DB} -r order -p 10 [--format pdf]

blobtools blobplot -i ${DB} -r family -p 12 [--format pdf]

We also used custom scripts (https://github.com/JohnUrban/sciara-project-tools/tree/master/taxon) that segregate sequence names based on the closest taxonomy level (species, genus, family, order, class, phylum, kingdom, superkingdom, othersuperkingdoms) to our target species (Bradysia coprophila, Bradysia, Sciaridae, Diptera, Insecta, Arthropoda, Metazoa, Eukaryota, Bacteria/Archaea) that is found in the BLAST hits for each sequence. We required only 1 BLAST hit for a given level for it to be considered the closest level (i.e. if 1 BLAST hit says "Bradysia coprophila", then it will be assigned to the species level as the closest level even if other levels have more hits).

NODES=tax/nodes.dmp NAMES=tax/names.dmp

BFILE=platanus.blobfilt1.q5.bh.contig.blast.results TAX=$BFILE.taxonomy.out

PRE=platanus.blobfilt1.q5.bh.contig.taxsum ALL=q5-bh/platanus.blobfilt1.q5.bh_contig.names

taxonomyFromTaxID.py -i $BFILE -nc 1 -tc 2 -no $NODES -na $NAMES > $TAX taxonomy-summarizer.py -i $TAX -o $PRE -a $ALL

This reproduced the BlobTools approach, returning the same contigs labeled as Bacterial.

4.1.4.3 Obtaining contamination-filtered reads for re-assembly:

The set of male embryo paired-end reads (Short Read Set #1) was filtered as follows, making use of BEDtools (Quinlan and Hall 2010), SAMtools (Li et al. 2009), and linux commands.

# Make Table with Taxon Summaries for each contig blobtools view -o ${PFX}.${BASE}.table -i $DB -r all

## Grab contigs that are not labeled Bacterial at Super Kingdom level that are above a coverage level of 0.1

grep -v ^# ${PFX}.${BASE}.table | \

awk '$6 != "Bacteria" && $5 >= 0.1' | \ awk 'OFS="\t" {print $1,0,$2}' | \ sortBed -i - > contigs.to.keep.bed

## Add to those, contigs below the coverage cutoff of 0.1 that are explicitly labeled as Arthropoda sequences

grep -v ^# ${PFX}.${BASE}.table | \ awk '$5 < 0.1' | \

grep Arthropoda | \

awk 'OFS="\t" {print $1,0,$2}' | \ sortBed -i - >> contigs.to.keep.bed

# Get read pairs from male embryo dataset (SRS#1 above) where at least one read in the pair maps to one of the contigs-to-keep

BAM=${PFX}.bam

BED=contigs.to.keep.bed

samtools sort -n | \

samtools fastq -1 ${PFX}.u_u.R1.fastq -2 ${PFX}.u_u.R2.fastq -n –

# Combine ${PFX}.m.R1.fq and ${PFX}.u_u.R1.fastq

# Combine ${PFX}.m.R2.fq and ${PFX}.u_u.R2.fastq

4.1.4.4 Platanus contamination-filtered re-assemblies:

Four new Platanus assemblies were made using the set of contamination-filtered reads as well as quality-filtered and/or error-corrected derivatives of it (Short Read Sets #5 - #8).

(i) Contamination-filtered reads (Short Read Set #5) were assembled as described in Platanus sub-section of Section 2.1.2.

(ii) Contamination-filtered reads that were quality-filtered using the Trimmomatic command below (Short Read Set #6) were subsequently assembled as described in Platanus sub-section of Section 2.1.2.

java jar $TRIMMOMATIC_BASE/Trimmomatic0.32/trimmomatic0.32.jar PE -trimlog trim.log $IN $OUT ILLUMINACLIP:${PRIMERS}:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:5 MINLEN:100

(ii) Contamination-filtered reads that were then error-corrected with BayesHammer using the SPAdes command below (Short Read Set #7) were subsequently assembled as described in Platanus sub-section of Section 2.1.2.

spades.py --pe1-1 $PE1 --pe1-2 $PE2 -o assembly -t 16 --only-error-correction

(iv) Contamination-filtered reads that were already quality-filtered (SRS#6) were then subsequently error-corrected with the following SPAdes command to produce Short Read Set #8, which was subsequently assembled as described in Platanus sub-section of Section 2.1.2.

spades.py pe1-1 $PE1 pe1-2 $PE2 pe1-s $SE -o assembly -t 16 --only-error-correction

4.1.4.5 Evaluations of contamination-filtered re-assemblies:

The four Platanus assemblies (gap-closed scaffolds) from contamination-filtered datasets were evaluated in the same way as the evaluations of the first 40 assemblies were as described in Section 4.1.3 (Size statistics, Percent reads mapped, ALE, FRC^bam, ALE, REAPR, BUSCOv1) except that the set of contamination filtered reads (Short Read Set #5) was used for these evaluations. This means some of the scores based on reads are not directly comparable to the

number of features detected by FRC^bam. Assembly “iv” had the highest percentage of error-free bases as determined by REAPR. Since “iv” had the highest REAPR score and since the combination of quality-filtering and error-correction also won out in the previous set of Platanus evaluations before contamination filtering, we chose this assembly as the final/only Platanus assembly to take into a second round of contamination filtering. The contigs and scaffolds were separately BLAST for taxonomy information as above. BlobTools databases and plots were constructed as above for kmer coverage as well as coverage from pre-amplification salivary gland reads. However, there was no additional contamination filtering needed.

Im Dokument Supplemental Materials Single-molecule sequencing of long DNA molecules allows high contiguity (Seite 92-97)