

4.4 Discussion


Figure 4.9: The distribution of sequence read numbers for all 33 species used in the experiment shows that T1 data are more scattered than T2 data. However, the mean sequence read number per species does not differ much between T1 and T2.

Figure 4.10: The number of identified species sequences in the treatments T1 and T2 shows a higher variance in T1 than in T2. However, sequence numbers tend to correlate with the species used for amplification.


output reads; however, clear estimates of the original input quantity cannot be made.

The variations can be massive and are not only PCR-based but also species-specific (Figure 4.8; Figure 4.9).

4.4.1 Chimeras and sequencing errors

When the sequence reads of the first and second sequencing run of the identical samples (R1 and R2) were analyzed, it initially seemed that the R2 data contained considerably fewer chimeras. This soon turned out to be an artifact of extensive sequencing errors within R2. A check for unique sequences in the R2 data revealed that over 90% of the sequences were unique in the first treatment and about 100% in the second treatment (Figure 4.3). This seems unlikely considering that only 33 different species were used in the 95 analyzed samples. At the same time, the software detected 0% chimeras in some samples where chimeras should have been found according to the analysis of the R1 data. The non-normal distribution of chimeras in R2 further indicated that the results were strongly biased (Table 4.4). When asked, LGC Genomics in Berlin confirmed that an increased error rate can be expected in the second reading process (R2), as the chemicals and COI strands can experience some sort of exhaustion during processing.

This effect has also been described as “phasing” and is one of the main sources of sequencing error on the Illumina platform (Kircher et al. 2009; Kircher et al. 2012).

After this confirmation R2 data was excluded from most of the following analyses.
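The plausibility check described above can be sketched in a few lines of Python. This is only an illustration and not the analysis used in this study: it assumes that "unique" means the number of distinct sequences relative to the total number of reads, and the function name `distinct_fraction` as well as the toy reads are hypothetical.

```python
def distinct_fraction(reads):
    """Share of distinct sequences among all reads (0.0 for an empty list).

    Assumption: "unique" is read as distinct sequences / total reads.
    """
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)


# Hypothetical toy data: ten reads from a sample with two real variants.
clean_reads = ["ACGT"] * 8 + ["ACGA"] * 2
print(distinct_fraction(clean_reads))
```

With only 33 species in 95 samples, a value close to 1 for real data would mean that nearly every read differs from every other read, which points to sequencing errors rather than biological diversity.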

However, it should not be underestimated how strongly sequencing errors can impair the ability of current software to detect chimeras. Most algorithms sort sequences by their frequency and then compare the less common sequences, as potential chimeras, with the more frequent sequences as their potential sources. If the software does not recognize a chimeric sequence as similar enough to its sources (e.g. due to several sequencing errors), it will not be identified as a chimera. As most of these errors accumulated towards the end of the sequence, all R2 sequences were manually shortened to a length of 190 bp. A manual inspection revealed that after that position multiple “N’s” (IUPAC code N: any nucleotide) started to accumulate in the strand (IUPAC 1997). At the same time this length still guarantees reasonable chances for a sequence to lead to species identification (Meusnier et al. 2008).

When the number of chimeras was compared between the original and the shortened sequences, contrary results were found for R1 and R2 (Figure 4.7). While the number of identified chimeric sequences declined in R1, it increased in R2. The explanation for this effect is that the detection of chimeric sequences worked well in R1. It can be assumed that almost all chimeric sequences were identified as such. In R2 the identification of chimeras did not work so well. Although the results should be identical, the number of chimeric sequences detected in R2 is far lower than the number detected in R1 (Figure 4.7). What happens when part of a chimeric sequence is cut off therefore differs between R1 and R2. In R1 chances are high that the sequence had already been identified as chimeric and that the chimeric part is cut off. As a result, the number of chimeric sequences decreases. In R2 the chimeric sequence was not identified as such because of the high error rate within the sequence.

When the erroneous part that disguises the chimera is cut off, the software is able to identify the chimera as such again. This leads to increasing numbers of chimeric sequences in R2, as most had not been identified before (Figure 4.7, right). This means that under certain conditions shortening a sequence can have a positive effect on data quality. However, identifying chimeras and detecting sequencing errors go hand in hand. Multiple tools have been and are still being developed to improve quality filtering and detect chimeric sequences (Huber et al. 2004; Ashelford et al. 2005; Haas et al. 2011; Edgar et al. 2011; Wright et al. 2012; Callahan et al. 2016; Edgar 2016).
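The opposite effects of shortening in R1 and R2 can be reproduced with a deliberately simplified, abundance-sorted chimera check. The sketch below is not one of the cited tools and not the software used in this study. It assumes equal-length reads that need no alignment, considers only two-parent chimeras with a single breakpoint, and all names (`flag_chimeras`, `trim_to`, `max_mismatch`) and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import permutations


def mismatches(a, b):
    """Count differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_two_parent_chimera(query, parents, max_mismatch=0):
    """True if query is explained by a prefix of one parent plus a suffix
    of another parent better than by any single parent."""
    if len(parents) < 2:
        return False
    best_single = min(mismatches(query, p) for p in parents)
    for left, right in permutations(parents, 2):
        for cut in range(1, len(query)):
            model = (mismatches(query[:cut], left[:cut])
                     + mismatches(query[cut:], right[cut:]))
            if model <= max_mismatch and model < best_single:
                return True
    return False


def flag_chimeras(reads, trim_to=None, max_mismatch=0):
    """Dereplicate reads, rank them by abundance and test each sequence
    against the more abundant, already accepted sequences."""
    if trim_to is not None:
        reads = [r[:trim_to] for r in reads]
    ranked = [seq for seq, _ in Counter(reads).most_common()]
    accepted, chimeras = [], []
    for seq in ranked:
        if is_two_parent_chimera(seq, accepted, max_mismatch):
            chimeras.append(seq)
        else:
            accepted.append(seq)
    return chimeras


# Hypothetical toy reads: two abundant parents and one rare chimera each.
parent_a, parent_b = "A" * 12, "C" * 12
r2_like = [parent_a] * 5 + [parent_b] * 4 + ["AAAACCCCCCNN"]  # noisy tail
r1_like = [parent_a] * 5 + [parent_b] * 4 + ["AAAAAAAAAACC"]  # late breakpoint

print(flag_chimeras(r2_like), flag_chimeras(r2_like, trim_to=8))
print(flag_chimeras(r1_like), flag_chimeras(r1_like, trim_to=8))
```

In the R2-like case the ambiguous bases at the end hide the chimera until the reads are trimmed. In the R1-like case the chimera is found in the full-length reads, but trimming removes the chimeric part, so the shortened read collapses into its parent.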

4.4.2 Factors inducing the formation of chimeras

It is known that DNA of multiple species within a single sample makes the sample prone to the formation of chimeras when PCR is used to amplify the regions of interest (Kanagawa 2003). It can therefore be supposed that the formation of chimeras during PCR increases with the number of species within a sample. From the process of chimera formation it could also be supposed that similar template sequences further promote chimeras, as incomplete sequence copies should have a higher affinity to bind to more similar DNA strands during PCR than to less similar templates (Figure 1.2). This would mean that samples containing closely related species could be especially susceptible to the formation of chimeras.

The experiments confirmed that increasing the number of species in a sample significantly increases the formation of chimeras (Figure 4.6, left). Apparently this effect is closely associated with the use of PCR, and it must be expected that larger numbers of different species increase it further (Figure 4.10). However, the effect can be reduced when the number of PCR cycles with mixed sequences is minimized, as was done in the second treatment T2, where the DNA of each species was amplified separately (Table 4.1; Figure 4.1 and Figure 4.6, left). PCR could not be eliminated entirely, however, because it was needed to mark sequences with individual tags so that they could be assigned to their original sample after sequencing (Figure 4.3). Still, no significant relationship was found between the number of identified chimeras and the number of species within a sample for samples of treatment T2.

Surprisingly, sequence similarity had no effect on the number of identified chimeras (Figure 4.6, right). Although it could be assumed that incomplete complementary sequences bind more easily to a more similar sequence template during the formation of chimeras, no such effect was found. However, the experiments included only Diptera sequences with a maximum distance of 19.3% (127 differing bases between the sequences of Muscina prolapsa and Liriomyza intonsa), and even more distant taxa could possibly lead to reduced numbers of chimeras.
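For orientation, the relation between the number of differing bases and the percentage distance can be written as an uncorrected p-distance. The sketch below assumes that the reported distance is of this kind and that the sequences are already aligned; neither is stated explicitly here. Under that assumption, 127 differing bases at 19.3% would correspond to roughly 658 compared sites, a length that is inferred from the two numbers only. The function name `p_distance` and the placeholder sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a character other than A, C, G or T
    (e.g. N or a gap) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b)
                if a in "ACGT" and b in "ACGT"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


# Placeholder sequences (not real COI data): 127 differences in 658 sites.
seq_1 = "A" * 658
seq_2 = "C" * 127 + "A" * 531
print(round(p_distance(seq_1, seq_2), 3))
```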


Figure 4.10: A prediction based on a generalized linear model indicates fewer chimeras for T2 than for T1 when the number of species in a sample increases. In grey: 95% confidence interval.

4.4.3 DNA ratios and species abundances

In certain aspects, DNA-based assessments have proved superior to morphological assessments in multiple taxa (Hebert et al. 2004; Smith et al. 2006, 2007; Stein et al. 2014; Janzen et al. 2017). This digitally available, comprehensive knowledge allows extensive evaluations of samples containing these taxa. However, in one respect no sufficient progress has been made in recent years: the measurement of species abundances based on genetic approaches, which is requested especially by ecologists. The modest results of the experiments analyzing the effect of DNA ratios in mock samples accord with results in the literature (Elbrecht et al. 2015; Piñol et al. 2015). Although a significant influence can be confirmed, absolute numbers of reads are highly variable. The inconsistency found in the results does not allow a generalized assessment of the amount of species-specific DNA used at the beginning. As the experiment was based on quantified DNA, assessments under more realistic conditions, such as using whole specimens, can be expected to vary even more because of large differences in biomass between species. This means that, at least for PCR-based methods, species abundances cannot be reliably assessed.

There is evidence that variation decreases when the number of PCR cycles is reduced (Figure 4.8). But the evaluation of the non-ratio samples also showed that the number of species sequences was very inconsistent and did not depend on the sample treatment but was rather species-specific (Figure 4.9). However, both treatments T1 and T2 made use of PCR, although T2 only for applying sample indices to the sequences. Further experiments on the quantification of species abundances should therefore strictly avoid PCR to make sure that no bias is introduced.
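Why fewer PCR cycles can be expected to reduce such variation may be illustrated with the idealized textbook model of exponential amplification, in which the copy number after c cycles is N_0 * (1 + E)^c for a per-cycle efficiency E. This model is an assumption added for illustration and was not fitted to the data of this study. It ignores the plateau phase of real PCR, and the efficiencies used below are hypothetical.

```python
def amplified_copies(start_copies, efficiency, cycles):
    """Copies after `cycles` rounds of idealized exponential PCR.

    efficiency is the per-cycle efficiency E (1.0 = perfect doubling).
    """
    return start_copies * (1.0 + efficiency) ** cycles


def read_ratio_bias(eff_a, eff_b, cycles):
    """Factor by which the A:B template ratio is distorted after
    `cycles` rounds if the two species amplify with different efficiencies."""
    return ((1.0 + eff_a) / (1.0 + eff_b)) ** cycles


# Hypothetical efficiencies for two species with equal starting amounts.
for cycles in (10, 20, 30):
    print(cycles, round(read_ratio_bias(0.95, 0.85, cycles), 2))
```

Under this model even a small, species-specific difference in efficiency is compounded with every cycle, so the distortion of the original DNA ratio grows with the number of cycles and disappears only if no amplification takes place.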

4.4.4 Recommendations

As the goal of this study is the development of workflows for high-throughput sample metabarcoding, the suggestions concentrate only on aspects relevant for such an approach.

Of course it can be argued that, if the number of species influences the number of chimeric sequences, samples could be sorted beforehand to reduce the number of species in each sample. But this would also decrease time and cost efficiency. It is also unclear what the appropriate number of species in a sample would be, whether the species within a sample should be as homogeneous or as diverse as possible, and whether they should also be sorted by size. For all these approaches reasonable arguments can be found in the literature, describing either enhanced taxon recovery or the loss of data during Illumina sequencing (Krueger et al. 2011; Morinière et al. 2016; Elbrecht et al. 2017). Therefore the most convenient and certainly fastest way would be to evaluate samples as a whole, that is, without any previous sorting or splitting of the sample. As long as the DNA acquisition is done in a non-destructive way, samples can always be sorted afterwards or when DNA results imply the necessity for it. At least when species abundances are required, samples should still be sorted manually.

Hybrid enrichment is one of the most promising approaches to enrich target sequences such as COI, as it does not rely on amplification through PCR. Original species abundances might be affected to a lesser extent, as their sequence ratios are not biased by the selective binding of primers during amplification. It can also be expected that the number of chimeras remains at a constantly low level because fewer PCR cycles are needed than in common PCR-based methods.


5 Empirical biodiversity assessment