Figure S1: Histograms of the prevalence of species in the different datasets. The coloured bars represent the species used in modelling and the white bars represent species that were observed too infrequently (< 10 occurrences) to be included in the modelling.
Figure S2: Histograms of site species richness (SR) observed for the different datasets. The top row is based on all species observed in the field surveys and the bottom row is based on the subset of species used in the modelling.
Figure S3: Evaluation metrics for the individual SDMs for all four taxa. The solid line inside the box indicates the median, boxes range from the 25th to the 75th percentile, and the whiskers indicate ± 2 standard deviations.
Figure S4: Boxplot of the community AUC (cAUC) for the four different taxa. The solid line inside the box indicates the median, boxes range from the 25th to the 75th percentile, and the whiskers indicate ± 2 standard deviations.
Figure S5: Standardised species richness deviation between observations and predictions based on either “species-specific” thresholding (Fixed, Max.TSS, Max.Kappa), “site-specific” thresholding or the simple sum of the predicted probabilities of all species in a site. Groups with significant differences in their means (as measured by the pairwise Wilcoxon test) are indicated by different letters. The solid line inside the box indicates the median, the boxes range from the 25th to the 75th percentile, and the whiskers indicate ± 2 standard deviations.
Figure S6: Correlation of the Jaccard similarity metrics based on the binary data with threshold-independent Jaccard similarity metrics. The top row shows the correlation of the binary metrics with maxJaccard, and the bottom row shows the correlation of the binary metrics with probJaccard. The binary Jaccard similarity of panels a and d is based on a “species-specific” maxTSS threshold, of panels b and e on a “species-specific” maxKappa threshold, and of panels c and f on a “site-specific” probability ranking rule. The numbers in the legends indicate the Spearman correlation coefficient between the binary and threshold-independent metrics.
Appendix 1: Script to test the H0 that the observed and expected SR are not different
The script provided here creates a virtual community of species (sp.pool). Each species is then randomly assigned a probability of occurrence/prevalence (sp.preval). Based on these two parameters, a realized community/SR is created by drawing from the Poisson-binomial distribution of the species’ probabilities at a given site (using independent Bernoulli trials). The “realized communities” (based on 100,000 simulations, rep) can then be compared with the expectations derived directly from the probabilities (via the probability mass function of equation 2, under the assumption of a Poisson-binomial distribution).
The simulated frequencies of the species richness (Figure A1.1) are almost identical to the expectations based on the probability mass function (Figure A1.2). This results in a strong correlation between the simulated species richness and the theoretical expectations (Figure A1.3), given that the probabilities are “correct” and follow a Poisson-binomial distribution.
For all calculations based on the Poisson-binomial distribution we used the R package poibin (Hong 2013), which provides efficient functions for the cumulative distribution function (cdf), probability mass function (pmf), quantile function, and random number generation.
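For readers outside R, the pmf underlying these calculations is easy to reproduce. Below is a minimal sketch in Python (ours, independent of the poibin package, with toy prevalences): the Poisson-binomial pmf is built by dynamic programming, adding one species at a time, and the expected SR is simply the sum of the occurrence probabilities.

```python
# Sketch of the Poisson-binomial pmf (illustrative, not the poibin package).
# Adding species i with occurrence probability p_i updates the distribution
# of species richness (SR) by one convolution step.
def poisson_binomial_pmf(p):
    """Return [P(SR = 0), P(SR = 1), ..., P(SR = n)] for probabilities p."""
    pmf = [1.0]  # with zero species, SR is 0 with probability 1
    for pi in p:
        new = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            new[k] += prob * (1 - pi)  # species absent: SR unchanged
            new[k + 1] += prob * pi    # species present: SR increases by 1
        pmf = new
    return pmf

sp_preval = [0.2, 0.7, 0.5]            # toy prevalences for three species
pmf = poisson_binomial_pmf(sp_preval)
expected_sr = sum(sp_preval)           # E(SR) = sum of the probabilities
```

The mean of the resulting pmf equals sum(sp_preval), i.e. the expected SR used throughout this appendix.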
References
Hong, Y. (2013). On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59, 41-51.
Script 1. This script creates random communities to show the basic principles of the p-value calculations mentioned in the main manuscript.
################################################################
### p-value Null models ########################################
################################################################

library(poibin) #Provides dpoibin() and qpoibin() for the Poisson-binomial distribution

#Basic parameters
sp.pool <- round(runif(1, min=50, max=100)) #Random number of potential species to occur at a site (regional species pool)
sp.preval <- round(runif(sp.pool, min=0, max=1), 3) #Random probability for each species to occur at the site(s)/prevalence
rep <- 100000 #Number of times the binomial distribution is drawn to create the "observed" species richness

#Site parameters calculated from the input data
expected.SR <- sum(sp.preval) #The expected species richness
p.dist <- dpoibin(1:sp.pool, sp.preval) #The expected probability for all possible SR from 1 to sp.pool based on sp.preval

#Simple model to create n=rep realisations of the probability distribution
obs.SR <- NULL
for(i in 1:rep){
  obs.SR <- c(obs.SR, sum(rbinom(sp.pool, 1, sp.preval)))
}

#Histogram of the observed SR based on independent Bernoulli trials
SR.hist <- hist(obs.SR, breaks=0:sp.pool)
SR.hist$counts <- SR.hist$counts/rep #Standardisation by the number of repetitions
CI <- qpoibin(c(0.025, 0.975), sp.preval) #Calculation of the 95% confidence interval based on the cumulative distribution function

#Figure A1.1
plot(SR.hist, xlab="Species richness", col="grey", main="")
abline(v=CI, lty=3, col="red")
text(x=CI[1], y=max(SR.hist$counts), labels=paste(round(sum(obs.SR <= CI[1])/rep*100, 1), "%", sep=""), adj=c(1,1), col="red")
text(x=CI[2], y=max(SR.hist$counts), labels=paste(round(sum(obs.SR >= CI[2])/rep*100, 1), "%", sep=""), adj=c(0,1), col="red")

#Figure A1.2
plot(p.dist, type="b", pch=16, xlab="Species richness", ylab="Expected proportion based on mass function")

#Figure A1.3
plot(p.dist, SR.hist$counts, ylab="SR simulations based on Bernoulli trials", xlab="Expectation based on probability mass function", pch=16)
Figure A1.1: Histogram of the simulated species richness with independent Bernoulli trials. The red lines represent the 95% confidence interval based on the cumulative distribution function and the red numbers the percentage of simulations outside of the confidence interval. The presented example is based on a species pool of 90 species and 100,000 simulations.
Figure A1.2: Expected proportions for all possible species richness values based on the probability mass function of a Poisson-binomial distribution. The presented example is based on a species pool of 90 species.
Figure A1.3: Correlation of the expected proportion based on the mass function and the simulated proportion based on independent Bernoulli trials.
Appendix 2: Simulations with virtual species

Reasoning behind these simulations
In this Appendix we describe the sensitivity of the presented evaluation approaches to errors in the species data used for model calibration. These errors usually arise either from detection issues (i.e. creating false absences) or from misidentification of species (i.e. creating false absences and false presences). While these biases might to some degree be present in almost all “real world” data sources, the bias usually remains poorly known and one simply assumes the data to be “correct”.
Here, using virtual species (known truth) and adding errors in a controlled environment, we explored how the different community evaluation approaches behave both in the case of detection issues (i.e. false absences) and misidentification (i.e. false absences and presences). We focus our analysis solely on these (newly) suggested evaluation approaches and refer to another publication studying the effects of detection issues and misidentification on SDMs in much more depth (Fernandes, Scherrer & Guisan 2019).
In most published studies “the same data” are used for model calibration and evaluation (i.e. cross-validation, split-sample) and the unbiased truth is unknown. In these cases (i.e., identical bias in calibration and evaluation data), all the presented approaches perform equally well and give accurate information about the model performance under the given bias (i.e., how well the data are predicted by the model rather than how well the (unknown) truth is predicted). However, if one has an idea about the bias affecting the data, the simulation study on virtual species presented here might help to select an evaluation approach that is less affected by bias in the initial data.
Creation of virtual species/communities
We created a set of 100 virtual species that were loosely based on existing plant species to maintain ecological realism. However, in contrast to the distribution of “real world” species, which is determined by a multitude of abiotic and biotic factors, the distribution of our 100 virtual species is determined solely by six environmental factors (annual mean temperature, annual temperature range, annual sum of precipitation, potential annual solar radiation, slope and topographic position).
We then projected the niches of the virtual species onto a set of 720 sites with a large range of environmental conditions. Based on this dataset of 720 sites, our virtual plant species had a prevalence (i.e. percentage of sites with presence) ranging from 0.2 to 0.8 and a site SR of 40.7 ± 9.8 (mean ± sd).
This set of virtual communities based on 100 species and 720 sites was then considered our “known truth” used to evaluate the performance of our models and evaluation approaches.
Simulation of errors
We tested five different levels of added error (0%, 5%, 10%, 30%, 50%) and two different types of error, emulating detection issues (i.e. adding false absences) and misidentification (i.e., adding false absences and false presences). To simulate detection issues, we simply changed X% of the presences, chosen at random, into absences. To simulate misidentifications, we changed X% of presences or absences into the opposite (0 -> 1 or 1 -> 0). The selection of which absences/presences to change was random.
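As an illustration (a Python sketch with invented function names and toy data, not the original simulation code), the two error types can be written as:

```python
import random

def add_detection_error(community, error_rate, rng):
    """Detection issues: turn a random error_rate share of presences into absences."""
    out = community.copy()
    presences = [i for i, v in enumerate(out) if v == 1]
    for i in rng.sample(presences, round(error_rate * len(presences))):
        out[i] = 0  # false absence
    return out

def add_misidentification(community, error_rate, rng):
    """Misidentification: flip a random error_rate share of all records (0 <-> 1)."""
    out = community.copy()
    for i in rng.sample(range(len(out)), round(error_rate * len(out))):
        out[i] = 1 - out[i]  # false presence or false absence
    return out

rng = random.Random(42)
truth = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]           # toy presence/absence vector
degraded = add_detection_error(truth, 0.5, rng)  # 50% omission error
```

Note the asymmetry: detection error only removes presences, whereas misidentification can move prevalence in either direction, which matters for the rank-based metrics discussed below.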
For each simulation we calibrated our models on the virtual communities with added errors (i.e., containing false absences/presences) and then evaluated the models against the “known truth” (i.e., data with no error). All models were run in R 3.6.1 using biomod2 (Thuiller et al. 2009; Thuiller et al. 2016) with GLM (regression based) and RF (decision tree based) as modelling techniques and 5-fold cross-validation. For each level of error, we ran 100 simulations (i.e., 100 different randomized errors).
The quality of our (single species) SDMs was evaluated by AUC, and the community predictions based on the S-SDMs were evaluated with all the approaches presented in the main manuscript.
Performance of individual SDMs
As expected, the performance of individual SDMs (individual species) was negatively affected by the addition of errors in the calibration data (Fig. A2.1). However, the effect of omissions (i.e., detection issues; false absences) was very small (Fig. A2.1) compared to the effect of misidentifications (i.e., also having false presences). This pattern is well known (see e.g., Fernandes, Scherrer & Guisan 2019) and is mostly explained by the fact that, for most modelling techniques, the presences provide the information signal while the absences mostly provide background information (which is also why presence-only models work well with most techniques). There was no difference in model performance between the two modelling techniques (GLM, RF) chosen (Fig. A2.1).
Fig. A2.1: Average model performance of individual SDMs measured by AUC for different levels of error (detection issues or misidentification) added to the calibration data and evaluated on the “known truth”.
Effect of detection issues (i.e., false absences)
The effect of detection issues (i.e., false absences) varied strongly among the different evaluation approaches. cAUC, the maximization approaches (maxSørensen, maxJaccard) and the probability sum ratios (probSørensen, probJaccard) are only slightly affected (Fig. A2.2), while the deviation of SR (Fig. A2.3) and the improvement over null-models show strong signals (Fig. A2.4).
To understand why cAUC, the maximization approaches and the probability sum ratios are largely unaffected, we have to analyze the effects of adding false absences. If we keep in mind that the average probability of a species to occur at a site is equal to its prevalence (i.e., proportion of sites occupied), we can directly see that omitting the species at sites (i.e., creating a false absence) reduces the number of occupied sites and therefore the prevalence and the average probability to occur at a site. Detection issues therefore lead to a reduction/underestimation of the average probability of species to occur at a site. However, identical to the “normal” AUC, the cAUC is not directly affected by the probability per se but only by the ranking of probabilities (i.e., species ordered from lowest to highest probability of occurrence). As a result, the cAUC is not affected by detection issues as long as the detectability is similar across the species of a community. In the case of highly variable detectability within the community, the cAUC will be more affected (as demonstrated/explained below in the misidentification section).
The same explanation is valid for the maximization approaches (maxSørensen, maxJaccard) and the probability sum ratios (probSørensen, probJaccard), as all of those depend on the ranking of probabilities rather than the probabilities per se.
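This rank-invariance argument can be made concrete with a small sketch (Python, toy numbers of our own): a uniform reduction of all occurrence probabilities, as caused by uniform detection failure, leaves the ranking of species untouched, whereas non-uniform changes, as caused by misidentification, reorder the species.

```python
def ranks(values):
    """Rank position of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

probs = [0.9, 0.4, 0.7, 0.2, 0.6]  # toy occurrence probabilities

# Uniform detection failure: every probability is halved; the ranking,
# and with it any rank-based metric, is unchanged.
uniform = [p * 0.5 for p in probs]

# Non-uniform error: species are hit differently, so the ranking changes
# and rank-based metrics degrade.
nonuniform = [0.9 * 0.1, 0.4, 0.7, 0.2 + 0.55, 0.6]
```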
Fig. A2.2: Community evaluation based on cAUC, maxSørensen and probSørensen for different levels of detection issues (false absences) added to the calibration data and evaluated on the “known truth”. A detectability of 1 reflects perfect detection while a detectability of 0.5 reflects 50% omission error (i.e., 50% of presences changed into absences).
The strong effect of detection issues on the deviation in SR is not surprising. As mentioned before, the omission of species leads to a reduction in the average probabilities. As the expected SR (E(SR)) is defined as the sum of the probabilities at a site, the omission of species leads directly to an underestimation of the SR. Therefore, the more species are omitted, the larger the difference between observed and expected SR. Again, it is important to note that if no “known truth” is available and the same dataset is used for calibration and evaluation (e.g., using cross-validation), then there will be no difference between average expected and average observed SR, as the observed SR is identically affected by the omission of species.
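A toy numerical sketch (Python; values invented for illustration, not taken from the study) shows how omission propagates into E(SR):

```python
# Five toy species at 10 sites; omitting presences lowers each species'
# prevalence, hence the average predicted probability, hence
# E(SR) = sum of the probabilities at a site.
n_sites = 10
occupied = [8, 6, 5, 4, 2]                    # occupied sites per species
true_preval = [n / n_sites for n in occupied]
expected_sr_true = sum(true_preval)           # E(SR) under the true data

omission = 0.5                                # 50% detection failure
obs_preval = [p * (1 - omission) for p in true_preval]
expected_sr_obs = sum(obs_preval)             # systematically underestimated
```

With the same bias present in the evaluation data, the observed SR shrinks by the same factor, which is why the deviation vanishes under cross-validation on biased data.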
Fig. A2.3: The probability of obtaining the calculated deviation in SR or higher based on the predicted probabilities of occurrence for different levels of detection issues (false absences) added to the calibration data and evaluated on the “known truth”. A detectability of 1 reflects perfect detection while a detectability of 0.5 reflects 50% omission error (i.e., 50% of presences changed into absences).
The improvement over null-models was also strongly affected by detection issues. This is not surprising, as this approach is based directly on the probabilities of occurrence. Changing (i.e., in this case reducing) these average probabilities leads to a lower likelihood of getting the true composition and SR correct, and consequently to a lower improvement compared to the null-model.
Fig. A2.4: Log-fold improvement of species richness and species composition compared to a null-model based on the average prevalence of the observed species for different levels of detection issues (false absences) added to the calibration data and evaluated on the “known truth”. A detectability of 1 reflects perfect detection while a detectability of 0.5 reflects 50% omission error (i.e., 50% of presences changed into absences).
Effect of misidentification (i.e., false absences and presences)
The effects of misidentification were much stronger than the effects of detection issues, especially for cAUC, the maximization approaches (maxSørensen, maxJaccard) and the probability sum ratios (probSørensen, probJaccard; Fig. A2.5). As mentioned before, these approaches are based on the ranking of probabilities and are therefore very sensitive to changes in those rankings. In contrast to only adding false absences (detection issues), misidentification also adds false presences. Due to the random process of removing and adding presences, the prevalence (and therefore the average predicted probability) of each species changes. However, in contrast to the detection issues simulation, the misidentification simulations did not reduce the prevalence uniformly but randomly increased or decreased it. This automatically leads to changes in the ranking of species, and therefore to a strong effect on cAUC, the maximization approaches and the probability sum ratios. This extreme case of misidentification is similar to the above-mentioned scenario of detection issues affecting species differently. If some species can be detected perfectly (0 omissions) and other species can only be detected poorly (high omission error), the prevalence of species changes non-uniformly, affecting the ranking of probabilities in a similar (but usually weaker) way than misidentification.
Fig. A2.5: Community evaluation based on cAUC, maxSørensen and probSørensen for different levels of misidentification (false presences and false absences) added to the calibration data and evaluated on the “known truth”.
The average deviation in SR between observation and expectation is less affected by misidentifications than by detection issues (Fig. A2.6). This is expected: theoretically, the deviation in SR should not be affected by the addition of presences and absences as long as the same number of presences is added and removed (i.e. the overall SR stays the same, only the composition changes). However, as most of our species have a prevalence below 0.5, it is slightly more likely to change an absence into a presence than vice versa, leading to a slight change in average SR.
Fig. A2.6: The probability of obtaining the calculated deviation in SR or higher based on the predicted probabilities of occurrence for different levels of misidentification (false presences and false absences) added to the calibration data and evaluated on the “known truth”.
The improvement over null-models is similarly affected by misidentification and detection issues. As these approaches are based directly on the probabilities and take into account both the probability to be present and to be absent, a random increase or decrease of probabilities leads to similar patterns (Fig. A2.7).
Fig. A2.7: Log-fold improvement of species richness and species composition compared to a null-model based on the average prevalence of the observed species for different levels of misidentification (false presences and false absences) added to the calibration data and evaluated on the “known truth”.
References
Fernandes, R.F., Scherrer, D. & Guisan, A. (2019) Effects of simulated observation errors on the performance of species distribution models. Diversity and Distributions, 25, 400-413.
Thuiller, W., Georges, D., Engler, R. & Breiner, F. (2016) biomod2: Ensemble Platform for Species Distribution Modeling.
Thuiller, W., Lafourcade, B., Engler, R. & Araujo, M.B. (2009) BIOMOD - a platform for ensemble forecasting of species distributions. Ecography, 32, 369-373.