• Keine Ergebnisse gefunden

Canu Falcon

STANDARD DEVIATION

4.3.4 Maker2 gene annotation

4.3.4.1 Overview:

Maker2 (Holt and Yandell 2011) was used for protein-coding gene annotation of the two final genome assembly candidates. The gene set for a given assembly was produced using gene predictors trained only on the given assembly. In all cases, training was performed using only contigs labeled as Arthropoda in the BlobTools analysis. GeneMark-ES (Ter-Hovhannisyan et al.

2008) was trained on soft-masked contigs. Augustus (Hoff and Stanke 2019) was trained using BUSCO v3 using the 2799 genes in the Dipteran lineage from OrthoDB v9. SNAP (Korf 2004;

Campbell et al. 2014) was first trained using gene models output from the first round of Maker2 that used only transcript and protein evidence (est2genome=1, protein2genome=1), and then re-trained on the output of the second Maker round that included gene predictors as described below.

There were essentially three Maker2 rounds. Transcript evidence was provided to Maker2 as the Trinity de novo transcriptome and StringTie transcriptome constructed on the given genome assembly being annotated. Alternative transcript evidence was provided from fly species

spanning the Nematocera and Brachycera suborders of Diptera, including the transcriptomes from Drosophila melanogaster and Anopheles gambiae. Protein homology evidence was provided using all Swiss-Prot proteins from Arthropoda (The UniProt Consortium 2019), as well as from proteomes spanning the holometaboulous insects, including Drosophila melanogaster (fruit fly), Anopheles gambiae (mosquito), Apis mellifera (honey bee), Bombyx mori (silkmoth), and Tribolium castaneum (red flour beetle). Maker2 was first run to create gene models from the transcript evidence alone to use as an initial training set for gene prediction with SNAP. For the second round of Maker2, gene models were created from the three gene predictors (SNAP, Augustus, GeneMark-ES) using the RNA-seq and homology evidence. SNAP was re-trained using the output of the second round, and a third round of Maker was run. In the third round, Maker2 was instructed to output all models and predictions (keep_preds=1). InterProScan (Quevillon et al. 2005) was subsequently run on all of the protein models to annotate them with protein domain information. Models were kept if they had an Annotation Edit Distance (AED) between 0 and 1 (inclusive) and/or had a recognizable protein domain as recommended (Campbell et al. 2014).

To rescue gene models buried inside masked repetitive regions, we also re-ran the third round of Maker2 exactly as above, but using a repeat library that was filtered to remove repeat models that contained genes, such as Odorant Receptor or Histone genes. We compared the Maker2 transcriptome outputs from using the original repeat library and the filtered repeat library by looking at their AED distributions, and using BUSCO, RSEM-Eval, and TransRate (see Supplemental Table S15A-B). For both Canu and Falcon, the gene sets produced using the original repeat library were higher quality as they had more genes, better AED distributions, more complete (e.g. more BUSCOs), and better evaluations. Therefore, we chose to use the transcript models generated using the original repeat library (stringent masking) as the primary set of annotations, and included transcript models generated when using the filtered repeat library (less masking) only if those gene structures had no overlaps with the primary set (using BEDtools intersect). This resulted in adding 1132 and 607 gene models to the primary Canu and Falcon gene model sets, respectively. We found that of these additional genes added back to the Canu set, for example, 30 were labeled as “Core histone H2A/H2B/H3/H4” and 11 were labeled as “7tm Odorant receptor”. Thus, we were able to rescue many genes originally modeled as repeats.

annotated, where possible, by using InterProScan (Quevillon et al. 2005) for protein domain information and BLASTp against the entire UniProt/SwissProt protein database (Consortium 2019). The steps described here are detailed further below.

4.3.4.2 Training GeneMark-ES for the first gene prediction round of Maker

GeneMark-ES (Ter-Hovhannisyan et al. 2008) was trained separately on the final Canu and Falcon scaffolds. For both, the scaffolds labeled as Arthropod in the BlobTools analysis were extracted from the soft-masked assemblies (soft-masked using the arthropoda and species-specific library as described above). These sequences made up over 96.6% of each assembly and the majority was also labeled as Dipteran (>98%). This was done to avoid training on any bacterial or any other contaminating sequences present. GeneMark-ES was trained on the soft-masked arthropod-labeled scaffolds with the following command:

gmes_petap.pl --ES --max_contig 50000000 --cores 9 --max_intron 500000 --sequence ${ASM} --soft_mask 5000

The output file produced located under the working directory at “./output/gmhmm.mod” was used as the GeneMark HMM in all Maker runs involving gene prediction (see below).

4.3.4.3 Using BUSCO to train Augustus for the first gene prediction round of Maker

Augustus (Hoff and Stanke 2019) was trained separately on the final Canu and Falcon scaffolds using BUSCO (Simão et al. 2015). For both, we used BUSCOv3 with the Dipteran lineage from ODB9, specifying the “--long” option, and starting with “fly” as the species (for Augustus). “Canu”

is used in the examples below – the same was done for Falcon.

run_BUSCO.py --in ${ASM} -o Sciara_canu -l ${DIPTERA} -m genome --cpu 9 --limit 10 --long -sp fly -z --augustus_parameters="--progress=true"

The optimized re-training parameters were then copied into the species directory for Augustus in subdirectory named “Sciara_long_unmasked_rl10_canu”:

cp run_Sciara_canu/augustus_output/retraining_parameters/*

/Path/to/augustus-3.2.2/config/species/Sciara_long_unmasked_rl10_canu/

The files therein were then re-named such that their prefixes matched the directory name (Sciara_long_unmasked_rl10_canu) instead of the prefixes given by BUSCO. Finally, the innards of the “*_parameters.cfg*” files that pointed to those files needed to be changed from the BUSCO prefix to the new prefix as well.

4.3.4.4 Using Maker est2genome/protein2genome to train SNAP for the first gene prediction round of Maker

SNAP (Korf 2004) was trained separately on the final Canu and Falcon scaffolds starting with evidence-only gene models from Maker2 (Holt and Yandell 2011; Campbell et al. 2014). Maker2 is an annotation tool that can be run iteratively (with training and re-training of gene predictors) to improve annotations. In the first round of Maker (described in more detail in the Maker section below), we did not use any gene predictors, instead generating gene models from EST and protein evidence alone. We used our Trinity and StringTie assemblies for EST evidence and a comprehensive set of arthropod/insect proteins (detailed in the Maker section) as protein homology evidence. Maker gene models were then extracted from only scaffolds labeled as Arthropod/Diptera and used to train SNAP in the following way:

In a directory outside of the directory Maker was launched in:

ln -s /Path/to/*maker.output/*datastore . grep -w -f arthropod_scaffs.txt

/Path/to/*.maker.output/*_master_datastore_index.log > arthropod.log gff3_merge -o Sciara_rnd1.arthropod.gff -d arthropod.log

maker2zff Sciara_rnd1.arthropod.gff

fathom genome.ann genome.dna -gene-stats > gene-stats.log 2>&1 fathom genome.ann genome.dna -validate > validate.log 2>&1

fathom -categorize 1000 genome.ann genome.dna > categorize.log 2>&1 fathom -export 1000 -plus uni.ann uni.dna > uni-plus.log 2>&1

mkdir pars cd pars

forge ../export.ann ../export.dna > ../forge.log 2>&1 cd ..

hmm-assembler.pl Sciara_rnd1 pars > Sciara_rnd1.hmm

The file “Sciara_rnd1.hmm” was provided to Maker for the first round of gene prediction (second Maker round).

4.3.4.5 Repeat Libraries

Repeat masking inside Maker was performed with the combined set of repeats from (i) all Arthropoda in Dfam/RepBase (Bao et al. 2015; Hubley et al. 2016), (ii) the species-specific repeat library generated from RepeatModeler, and (iii) known repeat sequences. The “te_proteins.fasta”

that comes with Maker2 was used as the set of repeat proteins for repeat masking.

4.3.4.6 EST evidence

Both the Trinity and StringTie assemblies constructed from both sexes across multiple life stages were passed as EST evidence, Trinity as a fasta and StringTie as a GFF.

4.3.4.7 Alternative EST evidence

The following 68,299 Dipteran transcript sequences were passed to Maker as alternative EST evidence.

• Cecidomyiidae family (13,582 sequences; 5 species) o 449 Mayetiola avenae (Oat midge)

o 458 Mayetiola hordei (Barley midge)

o 1546 Sitodiplosis mosellana (Wheat midge) o 9866 Mayetiola destructor (Hessian fly) o 1263 Orseolia oryzae (Asian rice gall midge)

• Sciaridae family

o 8570 Rhynchosciara Americana o Culicomorpha infraorder

§ Anopheles gambiae transcriptome (downloaded May 4, 2017):

• https://www.vectorbase.org/download/anopheles-gambiae-pesttranscriptsagamp46fagz

• 15,648 sequences - Brachycera suborder of Diptera

o Drosophila melanogaster transcriptome (Downloaded Jan 20, 2015):

§ ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/fasta/dmel-all-transcript-r6.03.fasta.gz

§ 30,485 sequences

4.3.4.8 Protein homology evidence

The following protein sequences from the Arthropoda phylum were passed to Maker as protein homology evidence:

UniProt Arthropoda:

All protein sequences available for Arthropoda (the phylum in which the class Insecta resides) from UniProt (12,408 fasta entries) were passed to maker for protein homology evidence.

- https://www.uniprot.org/uniprot/?query=reviewed:yes%20taxonomy:6656

- This spanned 843 species from 463 genera with a median of 2 sequences per species and 3 per genus. There were 3541 sequences from Drosophila melanogaster, an order of magnitude higher than that from the next highest (Drosophila pseudoobscura). The UniProt Arthropoda dataset contained at most 7812 differently-named protein sequences, about half of which must have had an entry from Drosophila.

All other protein evidence came from representatives from four major orders of the Insecta class.

- Diptera order

o Nematocera suborder

§ Bibionomorpha infraorder

• Known Bradysia coprophila mRNA sequences from NCBI nucleotide database (txid38358; downloaded on Sep 01, 2016)

o There were 16 protein sequences

• Rhynchosciara proteins o 20

§ Culicomorpha infraorder

Anopheles gambiae proteome (downloaded May 4, 2017):

• Drosophila represents a selection from the Culicomorpha infraorder under the Nematocera suborder of the order Diptera

- Hymenoptera order

o Apis mellifera protein sequences

§ https://www.ncbi.nlm.nih.gov/protein; txid7460[Organism:exp]; RefSeq entries only; downloaded Dec 15, 2018

§ 23,491 protein sequences - Lepidoptera order

o Bombyx mori protein sequences

§ https://www.ncbi.nlm.nih.gov/protein; txid7091[Organism:noexp]; RefSeq entries only; downloaded Dec 15, 2018

§ 22,590 protein sequences - Coleoptera order

o Tribolium castaneum proteins

§ https://www.ncbi.nlm.nih.gov/protein?LinkName=bioproject_protein&from_uid

=15718; clicking RefSeq; RefSeq entries only; downloaded Dec 15, 2018

§ 22,597 protein sequences

4.3.4.9 Other parameter choices

- The maximum DNA length was set to 2 Mb so contigs and scaffolds would mostly not be split up.

- The minimum contig size was kept as default (1 bp).

- Maker2 was instructed to look for alternative splicing transcripts.

- Our choices for the minimum exon size (200 bp) and maximum intron size (500 kb) that Maker2 was instructed to expect were guided by the minimum exon and maximum intron sizes found in the StringTie transcriptome assembly.

- Similarly, Maker2 was instructed to use single exon ESTs as evidence given the that a fair number of transcripts appeared to be single exon (~3500; ~10%) from the StringTie assemblies.

- est2genome and protein2genome were both turned on in the first round of Maker2 to generate gene models strictly from EST and protein homology evidence for training SNAP prior to the second round of Maker2.

4.3.4.10 Maker2 Round 1