
4.5 Taxonomic Assignment

4.5.2 Assignment Performance Evaluation

The aim of the next investigation was to evaluate the taxonomic assignment performance of the MPA software. To this end, MPA was compared against Unipept (see Section 3.4.5) with respect to the number of taxon-specific peptide identifications. The two tools follow different assignment strategies: in Unipept, a central database provides information on each peptide that can be linked uniquely to a certain taxon. Conversely, the MPA relies on the peptide-to-protein relations, since the provided protein accessions are used to resolve the taxonomic origin.

In this case, the NCBI taxonomy served as the reference database to retrieve the taxonomic lineage for each of the identifications reported by the MPA software. These analyses used the same data that resulted from searching GENT01, GENT07 and GENT16 against SwissProt and TrEMBL as described in the previous section. In addition, the distinct peptide sequences from these results were submitted to Unipept. Consequently, unique taxonomic assignments could be retrieved for both software tools.

Figure 4.27: Phylogenetic classification of BGP data sets based on the number of peptides per phylum.

The major phyla per data set and search protein database are presented in the pie charts that display the distribution of the peptide hits: (a) GENT01 against SwissProt, (b) GENT01 against TrEMBL, (c) GENT07 against SwissProt, (d) GENT07 against TrEMBL, (e) GENT16 against SwissProt and (f) GENT16 against TrEMBL. All eukaryotic phyla were filtered out. The total number of assigned peptides (n) is provided above each chart panel.

Figure 4.28 provides an overview of peptide assignments to different taxonomic ranks for MPA and Unipept. The relative proportions of successfully assigned peptides are displayed in the heat maps. In comparison to Unipept, MPA achieved higher fractions of taxon-specific peptides across all data sets and taxonomic ranks for the searches against SwissProt. For the TrEMBL searches, Unipept in turn assigned slightly more peptides at the superkingdom rank. At each of the other taxonomic ranks, MPA assigned more peptides than Unipept, with a few exceptions for data set GENT16. On average, Unipept could not assign 22% of the peptides to any taxonomic rank for the data sets searched against SwissProt.
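The relative proportions shown in such a heat map can be computed per rank as the share of peptides that carry an assignment at that rank. The following is only a minimal sketch: the function name, the input layout (one dictionary of rank to taxon per peptide) and the list of ranks are assumptions made for illustration and are not taken from MPA or Unipept.

```python
# Hypothetical list of ranks; the ranks actually shown in the figure may differ.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def assigned_fraction_per_rank(peptide_taxa, ranks=RANKS):
    """Return the fraction of peptides with an assignment at each rank.

    peptide_taxa maps a peptide sequence to a dict of rank -> taxon name;
    a missing or empty entry means "not assigned at this rank".
    """
    total = len(peptide_taxa)
    if total == 0:
        return {rank: 0.0 for rank in ranks}
    return {
        rank: sum(1 for taxa in peptide_taxa.values() if taxa.get(rank)) / total
        for rank in ranks
    }


# Toy input (invented peptide sequences, for illustration only).
toy = {
    "PEPTIDEA": {"superkingdom": "Bacteria", "phylum": "Firmicutes"},
    "PEPTIDEB": {"superkingdom": "Bacteria"},
    "PEPTIDEC": {"superkingdom": "Archaea", "phylum": "Euryarchaeota",
                 "species": "Methanosarcina barkeri"},
    "PEPTIDED": {},
}
fractions = assigned_fraction_per_rank(toy)
print(fractions["superkingdom"], fractions["phylum"], fractions["species"])
```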

Figure 4.28: Taxonomic assignment performance of MPA and Unipept for BGP data sets. The heat maps show the relative percentage of peptides that could be assigned to a specific taxonomic rank for (a) MPA and (b) Unipept. The respective peptides were obtained from searching the data sets GENT01, GENT07 and GENT16 against SwissProt and TrEMBL (FDR < 5%). The white-blue color scale depicts the relative percentage of assigned peptides (white: low, blue: high).

Ground Truth Data of a Microbial Mixture. The previous investigation of the BGP data suffered from the shortcoming that the accuracy of the taxonomic assignment process could not be evaluated, because the microbial composition of the samples was unknown. To overcome this limitation, the following analysis used metaproteomic data originating from a lab-assembled microbial mixture containing nine bacterial and eukaryotic species (see Section 3.2.4), as published by Tanca et al. [126]. Because the exact microbial composition is known, this ground truth data made it possible to benchmark the performance of the taxonomic assignment process in MPA and Unipept. To this end, database searches were performed against SwissProt using the microbial mixture data sets 9MM_FASP and 9MM_PPID, which originate from two different sample preparation procedures. These samples were regarded as technical replicates, since the differences in the experimental setup were beyond the scope of this work. The MS/MS data sets were searched independently, and the identifications from each result were merged afterwards. From these results, the taxon-specific peptides for the taxonomic ranks family, genus and species were exported at 1% and 5% FDR. As proposed in the original study, the peptides were classified as correct or incorrect taxonomic assignments according to the information on the nine organisms contained in the sample [126]. In this step, the filtering strategy of the authors was evaluated, which determines whether a set of peptides contributes significantly to a certain taxon: the proportion of these peptides relative to the total number of taxon-specific identifications needs to be higher than a specified threshold. The authors recommended a value of 0.5% for this taxon significance threshold when testing the reliability of taxonomic assignment using Unipept [276] and MEGAN [207]. Finally, different threshold values were applied to determine a robust default parameter for the MPA software.
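The taxon significance threshold described above can be illustrated with a short sketch. It assumes that each taxon-specific peptide carries exactly one taxon name at the rank under investigation. The function names, the input layout and the toy counts are hypothetical and serve only to show the filtering rule and the subsequent counting of correct and incorrect assignments.

```python
from collections import Counter


def filter_by_taxon_significance(assignments, threshold=0.005):
    """Keep assignments to taxa whose peptide share exceeds the threshold.

    assignments maps a peptide identifier to a taxon name at one rank.
    threshold is a fraction, so 0.005 corresponds to 0.5%.
    """
    counts = Counter(assignments.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    # A taxon is kept only if its share is strictly higher than the threshold.
    kept = {taxon for taxon, n in counts.items() if n / total > threshold}
    return {pep: taxon for pep, taxon in assignments.items() if taxon in kept}


def count_correct_incorrect(assignments, expected_taxa):
    """Count assignments to taxa inside and outside the known mixture."""
    correct = sum(1 for taxon in assignments.values() if taxon in expected_taxa)
    return correct, len(assignments) - correct


# Toy data with invented counts; only a subset of the mixture is listed here.
expected = {"Escherichia coli", "Lactobacillus casei"}
toy = {f"pep{i}": "Escherichia coli" for i in range(300)}
toy.update({f"pep{i}": "Lactobacillus casei" for i in range(300, 399)})
toy["pep399"] = "Gallus gallus"  # 1 of 400 peptides, i.e. 0.25%

unfiltered = count_correct_incorrect(toy, expected)
filtered = count_correct_incorrect(
    filter_by_taxon_significance(toy, threshold=0.005), expected
)
print(unfiltered, filtered)  # (399, 1) (399, 0)
```

Repeating the call for a range of threshold values yields the kind of curve that relates the threshold to the number of correct and incorrect assignments.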

The number of correct peptide assignments in the MPA results increased when moving up the taxonomic ranks from species through genus to family (Figure 4.29). In addition, at a taxon significance threshold above 3%, the number of incorrect taxon-specific assignments was reduced to a minimum for the taxonomic ranks under investigation. Furthermore, increasing the FDR from 1% to 5% resulted in a higher number of both correct and incorrect assignments.

Figure 4.29: Taxonomic assignment performance of MPA for 9MM data set. The line plots show the number of correct (solid) and incorrect (dashed) taxon-specific peptide assignments of MPA for species (black), genus (blue) and family (green) as a function of the taxon significance threshold at (a) 1% and (b) 5% FDR. The peptides were assigned according to the LCA approach.

Figure 4.30 shows that considerably fewer peptides could be assigned when using Unipept in comparison to the previous results from MPA. In particular, the number of correct assignments at the species level was lower. Remarkably, applying a taxon significance threshold of up to 5% did not remove the incorrect assignments at the species level. A closer examination of this result showed that all incorrectly assigned peptides were attributed to the eukaryotic species Gallus gallus by Unipept (data not shown). For the other taxa, a taxon significance threshold of 3% excluded most of the incorrect taxonomic assignments. In addition, a threshold value of 0.5% removed the majority of incorrectly assigned peptides at 1% FDR.

Figure 4.30: Taxonomic assignment performance of Unipept for 9MM data set. The line plots show the number of correct (solid) and incorrect (dashed) taxon-specific peptide assignments of Unipept for species (black), genus (blue) and family (green) as a function of the taxon significance threshold at (a) 1% and (b) 5% FDR.

Finally, it should be noted that peptides of the species Escherichia coli and Lactobacillus casei were considered at a taxon significance threshold of 0.5% in the MPA (Table A.11 in the appendix). In contrast, Unipept did not report any assignments to these two species.

Performance Comparison of LCA and MST. Besides the conventional LCA approach, the alternative MST method was developed to preserve the specificity of the peptides at the phylogenetic level. Since both methods were implemented in the Taxonomy Definition process of the MPA software (see Section 3.1.2), the results of LCA and MST could be compared directly with respect to the correctness of the taxonomic peptide assignment for the microbial mixture data sets.
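The difference between the two rules can be sketched for a single peptide that maps to several proteins. The sketch rests on two assumptions that are not stated in this section: that MST denotes a "most specific taxonomy" rule selecting the deepest annotated taxon among the proteins, and that lineages are given as root-to-leaf lists with aligned ranks. The function names are hypothetical and do not reflect the MPA implementation.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from root to leaf.
    Lineages are compared position by position, so aligned ranks are assumed.
    """
    if not lineages:
        return None
    lca = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def most_specific_taxon(lineages):
    """Assumed MST reading: return the leaf of the longest lineage.

    If several lineages are equally long, the first one is returned.
    """
    if not lineages:
        return None
    return max(lineages, key=len)[-1]


# One peptide matching two proteins: one annotated at species level,
# the other only at genus level.
species_level = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                 "Enterobacterales", "Enterobacteriaceae",
                 "Escherichia", "Escherichia coli"]
genus_level = species_level[:-1]

print(lowest_common_ancestor([species_level, genus_level]))  # Escherichia
print(most_specific_taxon([species_level, genus_level]))     # Escherichia coli
```

In this toy case the LCA rule falls back to the genus, whereas the assumed MST rule keeps the species-level assignment, which illustrates how the peptide's specificity can be preserved.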

Figure 4.31 shows that the MST method yielded slightly higher proportions of correct taxon-specific peptide assignments than the LCA method. Moreover, no significant differences were found between the result sets of the two replicates with respect to the relative number of correct taxonomic assignments.

Figure 4.31: Performance comparison of LCA and MST taxonomic assignment methods. The line plots display the relative fraction of correct taxon-specific peptide assignments of MPA for species (black), genus (blue) and family (green) when using the LCA (solid) and MST (dashed) method as a function of the taxon filter threshold for (a) 9MM_FASP and (b) 9MM_PPID data set results. The Taxonomy Definition feature of the MPA software was used to specify the LCA and MST method.