4.3.5 Final Assembly Selection

There were a total of 66 metrics (reported/described below) weighing in on the final evaluation between Canu and Falcon. Canu outperformed Falcon in the majority of individual metrics, winning 54 of the 66 (81.8%). The scores for individual metrics can be found in Supplemental Figures S13 and S14, and Supplemental Tables S14, S15, and S16. That Canu won the majority of metrics could have been an artifact of simply outperforming in "categories" that contained more metrics than the categories Falcon won. To discount this possibility, we partitioned the 66 metrics into 12 categories as organized below, and as represented in Figure 5F. Briefly, categories were organized by the dataset used (e.g. Illumina, PacBio, Nanopore, BioNano, RNA-seq, de novo transcriptome), by the parameter being tested (e.g. contig length, gene content), by the object being tested (e.g. reference-guided transcriptome), and/or by target objective (e.g. evaluation of Maker2 genes by external tools, evaluation of the annotation by Maker2-internal metrics, evaluation of the annotation according to homology and predicted functions). After partitioning the metrics into categories, we determined which assembler won the majority of metrics in each category, then which assembler won the majority of categories. Canu won the majority of metrics in 10 of the 12 categories (83.3%). There is no logical way to collapse, reorganize, or further split the metrics into sensible categories that would overturn this result. For example, since Falcon won just 11 of the 66 metrics, all 11 would need to be split into their own single-metric categories to beat the 10 existing categories won by Canu. However, that categorization would lack an obvious logical organizing principle and would be equivalent to arbitrarily multiplying the weight of a subset of the previously logical categories, yielding, for example, 4 single-metric categories for size statistics and 4 single-metric categories for BioNano optical map metrics. Overall, the Canu assembly was chosen over Falcon based on these analyses. Nonetheless, despite Canu winning the majority of metrics, the differences in scores were often small, suggesting either assembly would serve well as a first draft.
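The two-level majority vote described above (winner per metric, then winner per category, then winner across categories) can be sketched as follows; the metric and category names below are illustrative placeholders, not the actual 66 metrics:

```python
# Sketch of the two-level majority vote: tally metric wins overall,
# then determine each category's winner and tally category wins.
# Metric/category names here are hypothetical examples.
from collections import Counter, defaultdict

def two_level_vote(metric_winners):
    """metric_winners: dict mapping (category, metric) -> winning assembler."""
    # Level 1: which assembler wins the most individual metrics?
    overall = Counter(metric_winners.values())

    # Level 2: winner within each category, then majority across categories.
    per_category = defaultdict(Counter)
    for (category, _metric), winner in metric_winners.items():
        per_category[category][winner] += 1
    category_winners = Counter(
        counts.most_common(1)[0][0] for counts in per_category.values()
    )
    return overall, category_winners

# Toy example with two categories (not the real data):
votes = {
    ("size", "NG50"): "Canu",
    ("size", "LG50"): "Canu",
    ("size", "max_len"): "Falcon",
    ("busco", "complete"): "Canu",
    ("busco", "missing"): "Canu",
}
overall, by_category = two_level_vote(votes)
```

In this toy input, Canu wins 4 of 5 metrics and both categories, mirroring the logic (though not the numbers) of the Canu-versus-Falcon comparison.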

Each metric below is described in previous sections (see subsections within each):

- Section 4.1.3 Evaluations of the Initial 40 Short Read Assemblies

- Section 4.2.5 Evaluations of the Initial 50 long read assemblies

- Section 4.3.2 Evaluations of transcriptome assemblies

- Section 4.3.4.15 Maker2 gene set evaluations

- Also see the papers describing these tools (and other documentation associated with them):

o LAP (Ghodsi et al. 2013)

o ALE (Clark et al. 2013)

o REAPR (Hunt et al. 2013)

o FRCbam (Vezzi et al. 2012)

o BUSCO (Simão et al. 2015)

o Pilon (Walker et al. 2014)

o Sniffles (Sedlazeck et al. 2018)

o TransRate (Smith-Unna et al. 2016)

o RSEM-Eval (Li et al. 2014)

o Maker2 (Holt and Yandell 2011)

Category 1: Size statistics (4) (see Supplemental Figure S13 A-D)

(01) NG50

(02) LG50

(03) Max contig length

(04) Expected contig length (normalized to expected genome size)

(07) Percent Illumina DNA-seq reads aligned with Bowtie2

(08) Percent Illumina paired-end DNA-seq reads aligned concordantly

(09) LAP

(10) ALE

(11) REAPR Mean Base Score

(12) REAPR Links

(13) FRCbam – number of features per Mb

(14) Pilon – percent of expected genome size confirmed

Category 4: PacBio metrics (8) (see Supplemental Figure S13 Q-X)

(15) PacBio percent reads aligned with BWA

(16) PacBio average alignment score

(17) PacBio average MAPQ

(18) PacBio percent reads split across multiple contigs

(19) PacBio average number of split alignments per read

(20) PacBio total number SVs reported by Sniffles

(21) PacBio number translocations reported by Sniffles

(22) PacBio number other SVs (del, dup, ins, inv) reported by Sniffles

Category 5: Nanopore metrics (8) (see Supplemental Figure S13 Y-f)

(23) Nanopore percent reads aligned with BWA

(24) Nanopore average alignment score

(25) Nanopore average MAPQ

(26) Nanopore percent reads split across multiple contigs

(27) Nanopore average number of split alignments per read

(28) Nanopore total number SVs reported by Sniffles

(29) Nanopore number translocations reported by Sniffles

(30) Nanopore number other SVs (del, dup, ins, inv) reported by Sniffles

Category 6: BioNano optical map metrics (4) (see Supplemental Figure S13 g-j)

(31) BioNano Span

(32) BioNano average M-score

(33) BioNano average alignment length

(34) BioNano average coverage

Category 7: RNA-seq evaluation of genome assembly (1) (see Supplemental Figure S13 G)

(35) Percent RNA-seq reads aligned with HiSat2 (completeness)

Category 8: De novo transcriptome support for genome assembly (1) (see Supplemental Figure S13 H)

(36) Bitscore sum of de novo transcript BLAST alignments

Category 9: Evaluations of genome-guided transcriptome assemblies (9) (see Supplemental Table S14)

(37) Number of Complete Dipteran BUSCOs (v3) (more is better)

(43) TransRate Score

(44) TransRate Optimal Score

(45) RSEM-Eval Score

Category 10: External evaluations of Maker2 gene annotations (11) (see Supplemental Table S15A-B, and Supplemental Figure S14 A, C-E)

(46) Number of Complete Dipteran BUSCOs (v3) found in transcripts

(47) Number of Missing Dipteran BUSCOs (v3) found in transcripts

(48) Number of Complete Dipteran BUSCOs (v3) found in proteins

(49) Number of Missing Dipteran BUSCOs (v3) found in proteins

(50) TransRate - proportion of reference proteins with Conditional Reciprocal Best BLAST (CRBB) hits given the Maker2 annotation (Reference = UniProt Arthropod Proteins)

(51) TransRate - proportion of reference proteome covered by transcripts from the Maker2 annotation (Reference = UniProt Arthropod Proteins)

(52) TransRate - proportion of reference proteins with Conditional Reciprocal Best BLAST (CRBB) hits given the Maker2 annotation (Reference = Maker2 Input)

(53) TransRate - proportion of reference proteome covered by transcripts from the Maker2 annotation (Reference = Maker2 Input)

(54) TransRate Score

(55) TransRate Optimal Score

(56) RSEM-Eval Score

Category 11: Maker2-internal evaluation of Maker2 gene annotations (1) (see Supplemental Table S15A-B, and Supplemental Figure S14 A, C-E)

(57) Annotation Edit Distances (AED) output by Maker2

Category 12: Functional analysis of Maker2 gene annotations (9) (see Supplemental Table S16):

(58) Number of Genes with Ontology Term

(59) Number of Genes with UniProt hit(s)

(60) Number of Genes with Pfam domain

(61) Number of Genes with All 3 above (intersect)

(62) Number of Genes with > 1 of 3 above (union)

(63) Number of Genes with Drosophila hit(s)

(64) Number of Genes with Anopheles hit(s)

(65) Percent Drosophila Proteome with Sciara hit(s)

(66) Percent Anopheles Proteome with Sciara hit(s)

Notes on the newer metrics above (as compared to previous genome evaluation sections):

- Categories 1-6 feature datasets and many metrics used in previous evaluations. Other metrics therein have similar objectives.

o For example, Pilon was used to compute the number of bases in the genome that are confirmed (supported by evidence provided), which was represented as a percent of the expected genome size (to have the same denominator across assemblies).

o As another example, Sniffles was used to report not only on the number of SVs, but also the sub-categories (i) translocations (proportional to mis-assemblies) and (ii) other (deletions, insertions, duplications, inversions).

o As with the various DNA-seq datasets (Illumina, PacBio, Nanopore), the percent of RNA-seq reads that map is a measure of completeness, but restricted to transcribed regions.

o Mapping the de novo assembled transcripts (see Section 4.3.1.1) with BLAST (-culling_limit 1) and taking the sum of bitscores is a measure of completeness and of agreement or quality (better alignments). Note that the bitscore sum is used here, but the following metrics gave the same results: the percent of de novo transcripts that aligned, the percent of the total length of de novo transcripts that aligned, and the percent of the total length of de novo transcripts that matched (i.e. not only aligned, but matched).

- Categories 9-12 evaluate products derived from each genome assembly (transcriptomes and gene annotations) and transfer those evaluation results back to the corresponding assembly.

o Evaluating the reference-guided transcriptomes that were guided by each genome assembly is, overall, an evaluation of how well each genome assembly guided the transcriptome assembly given the same dataset and analysis pipeline.

§ The BUSCO and TransRate scores associated with a reference proteome were measures of completeness of the transcriptome assembly (and thereby transcribed regions of the genome assembly).

§ The other TransRate scores and the RSEM-eval score are measures of how well the RNA-seq data support the transcriptome assembly (and thereby the genome reference that guided its assembly).

o Similar to evaluating the reference-guided transcriptome assemblies, evaluating the Maker2 gene annotations that were built iteratively and independently on each genome (e.g. gene predictors were trained separately on each genome, RNA-seq data and homology evidence were aligned to each genome, etc.) also reflects on the underlying genome assembly, and is important in any case since the genome and its gene annotation are a package deal.

o Annotation Edit Distance (AED) measures agreement between annotated gene structures and the overlapping evidence for those structures (de novo transcripts, reference-guided transcripts, homologous transcripts, homologous proteins). See the Maker2 papers and documentation for more information. Smaller AEDs are better than larger ones, with 0 being best. We compared the number of genes (using the transcript with the best AED) or the cumulative number of transcripts in the annotation (y-axis) as a function of AED (x-axis). Higher accumulations of genes (or transcripts) at lower AEDs indicated higher overall agreement with the evidence. See Supplemental Figure S14.

o For TransRate, the reference proteomes used are described in the TransRate analysis section (see Section 4.3.2.3). Either the entire set of proteins used as protein homology evidence input into Maker2 was used or a subset of those proteins limited to UniProt Arthropoda proteins (see Section 4.3.4.8 “Protein homology evidence”).
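As a concrete illustration of the bitscore-sum metric (Category 8 above), a minimal sketch is shown below. It assumes BLAST tabular output (-outfmt 6), where the twelfth column of the default format is the bitscore; the file path is hypothetical:

```python
# Sketch of the bitscore-sum completeness metric: sum the bitscore
# column over all alignments in a BLAST tabular (-outfmt 6) file.
# The input path is a hypothetical example.
def bitscore_sum(blast_tab_path):
    total = 0.0
    with open(blast_tab_path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and empty lines
            fields = line.rstrip("\n").split("\t")
            total += float(fields[11])  # bitscore: last default column
    return total
```

With -culling_limit 1 applied at the BLAST step (as described above), each transcript region contributes at most one alignment, so a larger sum reflects more (and better) transcript sequence accounted for by the genome assembly.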