Assembling the short reads - Illumina Assemblies

Supplemental Table S21 A-F: Binomial tests for enrichment or depletion of DNA modifications in various genomic features

Section 4: Detailed Bioinformatics Methods

4.1 Illumina Assemblies

4.1.2 Assembling the short reads

A total of 44 short read assemblies were reported on in this study: 40 prior to initial evaluations and contamination analysis and filtering, and 4 thereafter using the top performing assembler given the evaluations.

4.1.2.1 ABYSS

A total of 8 Abyss assemblies were made from Short Read Sets #1-4 (commands for each represented below), each with two different kmer sizes.

For all: K=k, where k was either 55 or 77

abyss-pe name=Sciara j=16 k=${k} in='R1.fastq R2.fastq'

abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.q5.fastq R2.q5.fastq' se='SE.q5.fastq'

abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.bh.fastq.gz R2.bh.fastq.gz' se='SE.bh.fastq.gz'

abyss-pe name=Sciara j=16 k==${k} lib='pe432' pe432='R1.q5.bh.fastq.gz R2.q5.bh.fastq.gz' se='SE.q5.bh.fastq.gz'

4.1.2.2 Megahit

Megahit was run on Short Read Sets #1-4, generating four assemblies.

Megahit iterates over kmer sizes such that no specific kmer size need be provided. It was run on Short Read Sets #1-4 as follows.

megahit -1 R1.fastq -2 R2.fastq

megahit -1 R1.q5.fastq -2 R2.q5.fastq -r SE.q5.fastq

megahit -1 R1.bh.fastq -2 R2.bh.fastq -r SE.bh.fastq

megahit -1 R1.q5.bh.fastq -2 R2.q5.bh.fastq -r SE.q5.bh.fastq

4.1.2.3 Platanus

Platanus was first run on Short Read Sets #1-4, generating 4

assemblies. Platanus iterates over kmer sizes such that no specific kmer size need be provided. Evaluations showed that Platanus was a top performing short read assembler given our data, so we subsequently used Platanus with the contamination filtered read sets as well (Short Read Sets #5-8). For all Short Read Sets (#1-8) described in section

4.1.1.3, the commands were as represented below.

PREFIX=out

contig=${PREFIX}_contig.fa

bubble=${PREFIX}_contigBubble.fa kmerfreq=${PREFIX}_kmerFrq.tsv

platanus assemble -f $PE1 $PE2 [$SE] -o $PREFIX -t $T -m $M

platanus scaffold c $contig b $bubble IP1 $PE1 $PE2 n1 83 a1 432 -d1 21 -t $T -o $PREFIX

platanus gap_close -o $PREFIX -c $scaffold -IP1 $PE1 $PE2 -t $T

4.1.2.4 SGA

SGA has its own error-correction so only the raw reads and quality-filtered reads were used with it (i.e. Short Read Sets #1-2).

SGA -- unfiltered:

#SAMtools 0.1.19 and ABYSS 1.9.0 in PATH.

PE1=R1.fastq PE2=R2.fastq

See SGA recipe below.

SGA -- filtered:

#SAMtools 0.1.19 and ABYSS 1.9.0 in PATH.

PE1=R1.q5.fastq PE2=R2.q5.fastq

See SGA recipe below.

Note: SE.q5.fastq not used.

SGA recipe -- same for unfiltered and filtered reads:

# Largely drawn from example here:

https://github.com/jts/sga/blob/master/src/examples/sga-celegans.sh

# SGA has its own read correction, so did not use with BayesHammer corrected reads

NAME=Sciara-male OL=75

T=16 D=4000000 CK=51

COV_FILTER=2 MOL=55

R=10

MIN_PAIRS=5 MIN_LENGTH=200

SCAFFOLD_TOLERANCE=1 MAX_GAP_DIFF=0

CTGS=assemble.m$OL-contigs.fa GRAPH=assemble.m$OL-graph.asqg.gz

SGA_ALIGN=~/software/sga/sga/src/bin/sga-align sga preprocess --pe-mode 1 $PE1 $PE2 > $NAME.fastq sga index -a ropebwt --no-reverse -t $T $NAME.fastq

sga fm-merge -m $MOL -t $T -o ${NAME}.merged.k$CK.fa

${NAME}.ec.k$CK.filter.pass.fa

sga index -d 1000000 -t $T ${NAME}.merged.k$CK.fa sga rmdup -t $T ${NAME}.merged.k$CK.fa

sga overlap -m $MOL -t $T ${NAME}.merged.k$CK.rmdup.fa sga assemble -m $OL -g $MAX_GAP_DIFF -r $R -o assemble.m$OL

${NAME.merged.k$CK.rmdup.asqg.gz

sga assemble -m $OL -g $MAX_GAP_DIFF -r $R -o assemble.m$OL

${NAME}.merged.k$CK.rmdup.asqg.gz

$SGA_ALIGN --name ${NAME}.pe $CTGS $PE1 $PE2

sga-bam2de.pl -n $MIN_PAIRS --prefix libPE ${NAME}.pe.bam

sga-astat.py -m $MIN_LENGTH ${NAME}.pe.refsort.bam > libPE.astat sga scaffold -m $MIN_LENGTH --pe libPE.de -a libPE.astat -o scaffolds.n$MIN_PAIRS.scaf $CTGS

sga scaffold2fasta -m $MIN_LENGTH -a $GRAPH -o scaffolds.n$MIN_PAIRS.fa -d $SCAFFOLD_TOLERANCE --use-overlap --write-unplaced

scaffolds.n$MIN_PAIRS.scaf

4.1.2.5 SOAPdenovo2

A total of 8 SOAPdenovo2 assemblies were made from Short Read Sets #1-4 (commands for each represented below), each with two different kmer sizes (K=55 or K=77).

PE1=R1.fastq PE2=R2.fastq Or

PE1=R1.q5.fastq PE2=R2.q5.fastq SE=SE.q5.fastq Or

PE1=R1.bh.fastq PE2=R2.bh.fastq SE=SE.bh.fastq Or

[LIB]

#average insert size avg_ins=432

#if sequence needs to be reversed reverse_seq=0

#in which part(s) the reads are used asm_flags=3

#use only first 100 bps of each read rd_len_cutoff=100

#in which order the reads are used while scaffolding rank=1

# cutoff of pair number for a reliable connection (at least 3 for short insert size)

pair_num_cutoff=3

#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)

map_len=32

#a pair of fastq file, read 1 file should always be followed by read 2 file

q1=${PE1} ## dependent on which set of reads being used q2=${PE2} ## dependent on which set of reads being used

#fastq file for single reads

q=${SE} ## dependent on which set of reads being used (some do not have)

Commands:

SOAPdenovo-127mer pregraph -s $config_file -K $K -R -o $graph_prefix -p

SOAPdenovo-127mer contig -g $graph_prefix -R -p $P

SOAPdenovo-127mer map -s $config_file -g $graph_prefix -p $P SOAPdenovo-127mer scaff -g $graph_prefix -F -p $P

4.1.2.6 SPAdes

The inputs to SPAdes were Short Read Sets #1-2. Since SPAdes used BayesHammer error-correction by default prior to assembly, this actually means it was run only on Short Read Sets #3-4. For a given read set, we error-corrected the input reads (SRS#1 or #2) in only one of the SPAdes runs and subsequently used the error-corrected reads (SRS

#3 or #4) produced in that run with the “--only-assembler” option in subsequent runs. We tried SPAdes in three different ways for each set (see commands below), generating 6 SPAdes assemblies.

[square brackets only applicable when single-end read file used]

spades.py pe1-1 $PE1 pe1-2 $PE2 [pe1-s $SE] -o default -t 64 --only-assembler --cov-cutoff auto

spades.py --careful --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s $SE] -o k213355 -t 64 [--only-assembler]

## Specifying kmer size values to go up to k=77 (K=21,33,55,77)

spades.py --careful -k 21,33,55,77 --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s

$SE] -o k21335577 -t 64 [--only-assembler]

## Specifying kmer size values to go up to k=77 (K=21,33,55,77) and enabling the automatic coverage cutoff.

spades.py --careful -k 21,33,55,77 --pe1-1 $PE1 --pe1-2 $PE2 [--pe1-s

$SE] -o k21335577auto -t 64 --cov-cutoff auto [--only-assembler]

4.1.2.7 Velvet

Velvet was run on Read Sets #1-4 using two different kmer sizes each (K=55 or K-77), generating 8 assemblies.

[square brackets only applicable when single-end read file used]

K=55 Or K=77

DIR=hash${K}

velveth ${DIR} ${K} -fmtAuto -shortPaired -separate $R1 $R2 [-short

$SE]

velvetg ${DIR} -clean yes -exp_cov auto -cov_cutoff auto

Im Dokument Supplemental Materials Single-molecule sequencing of long DNA molecules allows high contiguity (Seite 84-89)