• Keine Ergebnisse gefunden

Maker2 Round 3 combined standard gene sets and finalization

Running subsequent rounds of Maker2 with gene predictors:

4.3.4.14 Maker2 Round 3 combined standard gene sets and finalization

We evaluated the various Maker Round 3 outputs with AED distributions, BUSCO, RSEM-Eval, and TransRate as described elsewhere. The gene sets produced using the original Repeat Library had more genes, better AED distributions, more BUSCOs, and better evaluations otherwise.

Therefore, we used only gene annotations generated when using the filtered repeat library only if they had no overlaps with the set of annotations generated when using the original repeat library (using BEDtools). This resulted in adding 1132 and 607 gene models to the original Canu and Falcon gene model sets, respectively. We found that of these additional genes added back to the Canu set, for example, 30 were labeled as “Core histone H2A/H2B/H3/H4” and 11 were labeled as “7tm Odorant receptor”. Thus, this increases the representation of those gene families in our final set as we set out to do in the first place as described above.

The combined gene sets were then finalized by removing gene models corresponding to contaminating (bacterial) contigs to give the final set specific to Sciara coprophila (Bradysia coprophila). The Maker2 gene names were changed to have a consistent naming scheme (e.g.

for Canu all were given the prefix: Bcop_v1_g for Bradysia coprophila, version 1, gene number).

In addition to InterProScan and Gene Ontology information obtained previously, proteins were functionally annotated, where possible, by using BLASTp against the entire UniProt/SwissProt protein database. For every Sciara protein, we kept only the BLASTp hit with the highest bitscore.

Best hits were only retained if they had a bitscore >= 50 or an expected value <= 5e-4. The GFF and Fasta files were updated with the BLASTp hit and all other functional information obtained.

For further evaluation of the Maker gene sets for the Canu and Falcon assemblies, we also found the number of genes with BLASTp hits (performed as above) in the Drosophila melanogaster and Anopheles gambiae proteomes. Gene statistics (on isoforms, exons, introns, etc) were obtained from both the final Maker2 gene model sets as well as the input StringTie transcriptome assemblies using a custom script.

# VARIABLES PRE=sciara_rnd3 ORIGDIR=../rnd3

ALTDIR=../../with_filtered_repeats/outputs_keeppred1/rnd3 ORIGGFF_PRECLEAN=${ORIGDIR}/${PRE}.standard.functional_ipr.gff ORIGGFF=cleaned-original-${PRE}.standard.functional_ipr.gff ORIGTRANS=${ORIGDIR}/standard-transcripts.fasta

ORIGPROT=${ORIGDIR}/standard-proteins.fasta

ORIGIPS=${ORIGDIR}/sciara_rnd3.all.maker.proteins.fasta.tsv ALTGFF_PRECLEAN=${ALTDIR}/${PRE}.standard.functional_ipr.gff ALTGFF=cleaned-alt-${PRE}.standard.functional_ipr.gff

ALTTRANS=${ALTDIR}/standard-transcripts.fasta ALTPROT=${ALTDIR}/standard-proteins.fasta NONGFF=${PRE}.nonoverlapping-genes.gff OUTGFF=${PRE}.nonoverlapping-families.gff

OUTTRANS=${PRE}.nonoverlapping-families.transcripts.fasta OUTPROT=${PRE}.nonoverlapping-families.proteins.fasta OUTGFFMOD=${PRE}.nonoverlapping-families-modifiedNames.gff OUTTRANSMOD=${PRE}.nonoverlapping-families.transcripts-modifiedNames.fasta

OUTPROTMOD=${PRE}.nonoverlapping-families.proteins-modifiedNames.fasta COMBGFF=${PRE}.combined.standard.gff

COMBTRANS=${PRE}.combined.standard.transcripts.fasta COMBPROT=${PRE}.combined.standard.proteins.fasta COMBIPS=${PRE}.interproscan.results.tsv

#0. Clean up Parent attribute in Maker Standard GFF (it has references to mRNAs that did not make the cut, which can cause issues for some programs) - cleanParentSlotInGFF.py found at

https://github.com/JohnUrban/sciara-project-tools

cleanParentSlotInGFF.py -g ${ORIGGFF_PRECLEAN} --notfound cleaned-original-notfound.txt > ${ORIGGFF}

cleanParentSlotInGFF.py -g ${ALTGFF_PRECLEAN} --notfound cleaned-alt-notfound.txt > ${ALTGFF}

#1. Obtain gene locations from alternative gene set that have no overlaps with the original set

intersectBed -v -a <(awk '$3=="gene"' ${ALTGFF} ) -b <( awk '$3=="gene"' ${ORIGGFF} ) > ${NONGFF}

#2. Create a text file with just the “parents” (gene names) of the non-overlapping add-ons from alternative gene set.

awk '$3=="gene" {print $9}' ${NONGFF} | awk '{gsub(/=|;/,"\t"); print

$2}' | sort | uniq > parents.txt

awk '$3=="mRNA" {print $9}' ${OUTGFF} | awk '{gsub(/=|;/,"\t"); print

$2}' | sort | uniq > transcripts.txt

extractFastxEntries.py -n transcripts.txt -f ${ALTTRANS} > ${OUTTRANS}

extractFastxEntries.py -n transcripts.txt -f ${ALTPROT} > ${OUTPROT}

#5. Modify the names of the genes/transcripts/proteins so they do not collide with names in the original gene set

parseFamiliesfromGFF.py -i ${OUTGFF} -p parents.txt --namechanger addons_ > ${OUTGFFMOD}

fasta_name_changer.py -f ${OUTTRANS} -F addons_ > ${OUTTRANSMOD}

fasta_name_changer.py -f ${OUTPROT} -F addons_ > ${OUTPROTMOD}

#6. COMBINE the name changed “Add-ons” with the original gene set and sort

cat ${ORIGGFF} ${OUTGFFMOD} > tmp.gff

gff3sort.pl --precise --chr_order natural tmp.gff > ${COMBGFF}

#7. COMBINE the name changed “Add-on” transcripts and proteins with original sets

cat ${ORIGTRANS} ${OUTTRANSMOD} > ${COMBTRANS}

cat ${ORIGPROT} ${OUTPROTMOD} > ${COMBPROT}

#8. COMBINE the InterProScan domain/functional information grep –w -f parents.txt

${ALTDIR}/sciara_rnd3.all.maker.proteins.fasta.tsv | awk '{print

"addons_"$0}' > parent-functions-modifiedNames.txt

cat ${ORIGIPS} parent-functions-modifiedNames.txt > ${COMBIPS}

Commands used to finalize gene sets:

## Note: find the following at: https://github.com/JohnUrban/sciara-project-tools

## grep.py : This utility was created for simple tasks one might normally perform with grep. However, we noticed that some grep

#1. Make copies of final files into new directory such that in-place name-changing will be fine.

mkdir –p newdir && cd newdir cp ../*combined* .

cp ../sciara_rnd3.interproscan.results.tsv .

#2. Partition Maker GFF into bacterial and nonbacterial (Sciara) given the set of contigs classified as bacterial

grep.py -p bacterial.txt -f sciara_rnd3.combined.standard.gff -c 1 -C 1

> bacterial.sciara_rnd3.combined.standard.gff

grep.py -p bacterial.txt -f sciara_rnd3.combined.standard.gff -c 1 -C 1 -v > nonbacterial.sciara_rnd3.combined.standard.gff

#3. Identify mRNA names in Maker GFF associated with bacterial contigs awk '$3=="mRNA" {print $9}' bacterial.sciara_rnd3.combined.standard.gff

| awk '{gsub(/;|=/,"\t"); print $2}' > bacterial-mRNA-names.txt

#4. Partition InterProScan Results grep.py -p bacterial-mRNA-names.txt -f

sciara_rnd3.interproscan.results.tsv -c 1 -C 1 >

bacterial.sciara_rnd3.interproscan.results.tsv grep.py -p bacterial-mRNA-names.txt -f

sciara_rnd3.interproscan.results.tsv -c 1 -C 1 -v >

nonbacterial.sciara_rnd3.interproscan.results.tsv

#5. Partition Protein Fasta

extractFastxEntries.py -n bacterial-mRNA-names.txt -f sciara_rnd3.combined.standard.proteins.fasta >

bacterial.sciara_rnd3.combined.standard.proteins.fasta

extractFastxEntries.py --exclude -n bacterial-mRNA-names.txt -f sciara_rnd3.combined.standard.proteins.fasta >

nonbacterial.sciara_rnd3.combined.standard.proteins.fasta

#6. Partition Trascript Fasta

extractFastxEntries.py -n bacterial-mRNA-names.txt -f sciara_rnd3.combined.standard.transcripts.fasta >

bacterial.sciara_rnd3.combined.standard.transcripts.fasta

extractFastxEntries.py --exclude -n bacterial-mRNA-names.txt -f sciara_rnd3.combined.standard.transcripts.fasta >

#7a. Make kev:value maps between old names and new names for Sciara models (nonbacterial)

maker_map_ids --prefix Bcop_v1_g --justify 6 nonbacterial.sciara_rnd3.combined.standard.gff >

nonbacterial.sciara_rnd3.combined.standard.map

#7b. Make kev:value maps between old names and new names for contaminating/removed models (bacterial)

maker_map_ids --prefix Bcop-bacterial_v1_bg --justify 6 bacterial.sciara_rnd3.combined.standard.gff >

bacterial.sciara_rnd3.combined.standard.map

#7c. Rename in place using Maker-provided utilities for PRE in bacterial nonbacterial; do

map_gff_ids ${PRE}.sciara_rnd3.combined.standard.map

${PRE}.sciara_rnd3.combined.standard.gff

map_fasta_ids ${PRE}.sciara_rnd3.combined.standard.map

${PRE}.sciara_rnd3.combined.standard.transcripts.fasta

map_fasta_ids ${PRE}.sciara_rnd3.combined.standard.map

${PRE}.sciara_rnd3.combined.standard.proteins.fasta

map_data_ids ${PRE}.sciara_rnd3.combined.standard.map

${PRE}.sciara_rnd3.interproscan.results.tsv done

#8. Perform BLASTP of proteins against entire UniProt/SwissProt Database(Optionally also do on contaminating/bacterial gene models);

These commands were also used to look at the Maker proteins compared to Drosophila and Anopheles proteomes

#8a. Download uniprot_sprot FASTA and make BLAST database wget

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgeba se/complete/uniprot_sprot.fasta.gz

gunzip uniprot_sprot.fasta.gz

makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_sprot

#8b. BLASTp – the BLAST command used is given here although we broke the gene sets up to compute the results in parallel. (Below PRE is either nonbacterial or bacterial corresponding to files made above) blastp -query $Q -db $DB -num_threads $P -outfmt 6 -evalue 1e-2 -out

${PRE}.final-filtered-sciara_rnd3.combined.standard.proteins.all.output.blastp

#9. Update GFF and FASTA with BLAST info -- finalize for PRE in nonbacterial bacterial; do

## variables

UNIPROT=uniprot_sprot.fasta

BLASTP=${PRE}.final-filtered-sciara_rnd3.combined.standard.proteins.all.output.blastp GFF=${PRE}.sciara_rnd3.combined.standard.gff

BASE=$( basename ${GFF} .gff ) PROT=${BASE}.proteins.fasta TRANS=${BASE}.transcripts.fasta NEWGFF=${BASE}.putative_function.gff

NEWPROT=${BASE}.proteins.putative_function.fasta NEWTRANS=${BASE}.transcripts.putative_function.fasta

## Annotate Maker GFF with BLASTp information

maker_functional_gff ${UNIPROT} ${BLASTP} ${GFF} > ${NEWGFF}

## Annotate Protein Fasta with BLASTp information

maker_functional_fasta ${UNIPROT} ${BLASTP} ${PROT} > ${NEWPROT}

## Annotate Transcript Fasta with BLASTp information maker_functional_fasta ${UNIPROT} ${BLASTP} ${TRANS} >

${NEWTRANS}

## Create GFF annotating genome with InterProScan Domain/GO Information

iprscan2gff3 ${PRE}.sciara_rnd3.interproscan.results.tsv

${PRE}.sciara_rnd3.combined.standard.putative_function.gff >

${PRE}.sciara_rnd3.combined.standard.putative_function.visible_ip rscan_domains.gff

## Annotate Protein Fasta with InterProScan Domain/GO/Pfam info pfammer.py -p ${PRE}.sciara_rnd3.interproscan.results.tsv -f

${NEWPROT} > final.${NEWPROT}

## Annotate Transcript Fasta with InterProScan Domain/GO/Pfam info

pfammer.py -p ${PRE}.sciara_rnd3.interproscan.results.tsv -f

${NEWTRANS} > final.${NEWTRANS}

## Annotate Maker GFF with InterProScan Domain/GO/Pfam info pfammer.py -p ${PRE}.sciara_rnd3.interproscan.results.tsv -g

${NEWGFF} > final.${NEWGFF}

done

Commands used to get gene statisitcs from Maker and StringTie GFFs: