Supplemental Table S21 A-F: Binomial tests for enrichment or depletion of DNA modifications in various genomic features
Section 4: Detailed Bioinformatics Methods
4.1 Illumina Assemblies
4.1.2 Assembling the short reads
A total of 44 short read assemblies were reported on in this study: 40 prior to initial evaluations and contamination analysis and filtering, and 4 thereafter using the top performing assembler given the evaluations.
4.1.2.1 ABYSS
A total of 8 Abyss assemblies were made from Short Read Sets #1-4 (commands for each represented below), each with two different kmer sizes.
For all: K=k, where k was either 55 or 77
abyss-pe name=Sciara j=16 k=${k} in='R1.fastq R2.fastq'
abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.q5.fastq R2.q5.fastq' se='SE.q5.fastq'
abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.bh.fastq.gz R2.bh.fastq.gz' se='SE.bh.fastq.gz'
abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.q5.bh.fastq.gz R2.q5.bh.fastq.gz' se='SE.q5.bh.fastq.gz'
4.1.2.2 Megahit
Megahit was run on Short Read Sets #1-4, generating four assemblies.
Megahit iterates over kmer sizes such that no specific kmer size need be provided. It was run on Short Read Sets #1-4 as follows.
megahit -1 R1.fastq -2 R2.fastq
megahit -1 R1.q5.fastq -2 R2.q5.fastq -r SE.q5.fastq
megahit -1 R1.bh.fastq -2 R2.bh.fastq -r SE.bh.fastq
megahit -1 R1.q5.bh.fastq -2 R2.q5.bh.fastq -r SE.q5.bh.fastq
4.1.2.3 Platanus
Platanus was first run on Short Read Sets #1-4, generating 4
assemblies. Platanus iterates over kmer sizes such that no specific kmer size need be provided. Evaluations showed that Platanus was a top performing short read assembler given our data, so we subsequently used Platanus with the contamination filtered read sets as well (Short Read Sets #5-8). For all Short Read Sets (#1-8) described in section
4.1.1.3, the commands were as represented below.
PREFIX=out
contig=${PREFIX}_contig.fa
bubble=${PREFIX}_contigBubble.fa kmerfreq=${PREFIX}_kmerFrq.tsv
platanus assemble -f $PE1 $PE2 [$SE] -o $PREFIX -t $T -m $M
platanus scaffold c $contig b $bubble IP1 $PE1 $PE2 n1 83 a1 432 -d1 21 -t $T -o $PREFIX
platanus gap_close -o $PREFIX -c $scaffold -IP1 $PE1 $PE2 -t $T
4.1.2.4 SGA
SGA has its own error-correction so only the raw reads and quality-filtered reads were used with it (i.e. Short Read Sets #1-2).
SGA -- unfiltered:
#SAMtools 0.1.19 and ABYSS 1.9.0 in PATH.
PE1=R1.fastq PE2=R2.fastq
See SGA recipe below.
SGA -- filtered:
#SAMtools 0.1.19 and ABYSS 1.9.0 in PATH.
PE1=R1.q5.fastq PE2=R2.q5.fastq
See SGA recipe below.
Note: SE.q5.fastq not used.
SGA recipe -- same for unfiltered and filtered reads:
# Largely drawn from example here:
https://github.com/jts/sga/blob/master/src/examples/sga-celegans.sh
# SGA has its own read correction, so did not use with BayesHammer corrected reads
NAME=Sciara-male OL=75
T=16 D=4000000 CK=51
COV_FILTER=2 MOL=55
R=10
MIN_PAIRS=5 MIN_LENGTH=200
SCAFFOLD_TOLERANCE=1 MAX_GAP_DIFF=0
CTGS=assemble.m$OL-contigs.fa GRAPH=assemble.m$OL-graph.asqg.gz
SGA_ALIGN=~/software/sga/sga/src/bin/sga-align sga preprocess --pe-mode 1 $PE1 $PE2 > $NAME.fastq sga index -a ropebwt --no-reverse -t $T $NAME.fastq
sga fm-merge -m $MOL -t $T -o ${NAME}.merged.k$CK.fa
${NAME}.ec.k$CK.filter.pass.fa
sga index -d 1000000 -t $T ${NAME}.merged.k$CK.fa sga rmdup -t $T ${NAME}.merged.k$CK.fa
sga overlap -m $MOL -t $T ${NAME}.merged.k$CK.rmdup.fa sga assemble -m $OL -g $MAX_GAP_DIFF -r $R -o assemble.m$OL
${NAME.merged.k$CK.rmdup.asqg.gz
sga assemble -m $OL -g $MAX_GAP_DIFF -r $R -o assemble.m$OL
${NAME}.merged.k$CK.rmdup.asqg.gz
$SGA_ALIGN --name ${NAME}.pe $CTGS $PE1 $PE2
sga-bam2de.pl -n $MIN_PAIRS --prefix libPE ${NAME}.pe.bam
sga-astat.py -m $MIN_LENGTH ${NAME}.pe.refsort.bam > libPE.astat sga scaffold -m $MIN_LENGTH --pe libPE.de -a libPE.astat -o scaffolds.n$MIN_PAIRS.scaf $CTGS
sga scaffold2fasta -m $MIN_LENGTH -a $GRAPH -o scaffolds.n$MIN_PAIRS.fa -d $SCAFFOLD_TOLERANCE --use-overlap --write-unplaced
scaffolds.n$MIN_PAIRS.scaf
4.1.2.5 SOAPdenovo2
A total of 8 SOAPdenovo2 assemblies were made from Short Read Sets #1-4 (commands for each represented below), each with two different kmer sizes (K=55 or K=77).
PE1=R1.fastq PE2=R2.fastq Or
PE1=R1.q5.fastq PE2=R2.q5.fastq SE=SE.q5.fastq Or
PE1=R1.bh.fastq PE2=R2.bh.fastq SE=SE.bh.fastq Or
[LIB]
#average insert size avg_ins=432
#if sequence needs to be reversed reverse_seq=0
#in which part(s) the reads are used asm_flags=3
#use only first 100 bps of each read rd_len_cutoff=100
#in which order the reads are used while scaffolding rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=${PE1} ## dependent on which set of reads being used q2=${PE2} ## dependent on which set of reads being used
#fastq file for single reads
q=${SE} ## dependent on which set of reads being used (some do not have)
Commands:
SOAPdenovo-127mer pregraph -s $config_file -K $K -R -o $graph_prefix -p
$P
SOAPdenovo-127mer contig -g $graph_prefix -R -p $P
SOAPdenovo-127mer map -s $config_file -g $graph_prefix -p $P SOAPdenovo-127mer scaff -g $graph_prefix -F -p $P
4.1.2.6 SPAdes
The inputs to SPAdes were Short Read Sets #1-2. Since SPAdes used BayesHammer error-correction by default prior to assembly, this actually means it was run only on Short Read Sets #3-4. For a given read set, we error-corrected the input reads (SRS#1 or #2) in only one of the SPAdes runs and subsequently used the error-corrected reads (SRS
#3 or #4) produced in that run with the “--only-assembler” option in subsequent runs. We tried SPAdes in three different ways for each set (see commands below), generating 6 SPAdes assemblies.
[square brackets only applicable when single-end read file used]
spades.py pe1-1 $PE1 pe1-2 $PE2 [pe1-s $SE] -o default -t 64 --only-assembler --cov-cutoff auto
spades.py --careful --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s $SE] -o k213355 -t 64 [--only-assembler]
## Specifying kmer size values to go up to k=77 (K=21,33,55,77)
spades.py --careful -k 21,33,55,77 --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s
$SE] -o k21335577 -t 64 [--only-assembler]
## Specifying kmer size values to go up to k=77 (K=21,33,55,77) and enabling the automatic coverage cutoff.
spades.py --careful -k 21,33,55,77 --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s
$SE] -o k21335577auto -t 64 --cov-cutoff auto [--only-assembler]
4.1.2.7 Velvet
Velvet was run on Read Sets #1-4 using two different kmer sizes each (K=55 or K-77), generating 8 assemblies.
[square brackets only applicable when single-end read file used]
K=55 Or K=77
DIR=hash${K}
velveth ${DIR} ${K} -fmtAuto -shortPaired -separate $R1 $R2 [-short
$SE]
velvetg ${DIR} -clean yes -exp_cov auto -cov_cutoff auto