• Keine Ergebnisse gefunden

3 Results

3.2 Spatio-temporal activity of Acidobacteria in a grassland soil

3.2.6 Acknowledgements

3.2.8.1 Glossary

Term Explanation

Boosting Statistical algorithm to fit linear or generalized additive models ( GAMs) and to select important variables ( variable selection) at the same time.

Cross-validation Statistical algorithm to validate model performance on "new data” by repeatedly splitting the data into two parts, one to fit the model and one to evaluate the model; by this approach one tries to assess how good the results can be generalized or if they are specific to the data at hand (overfitting).

dbMEM Distance based Moran's Eigenvector Maps: Orthogonal

covariates which emulate spatial (or temporal) patterns

extrapolated from spatial (or temporal) coordinates and can be used in multiple regression modelling.

Deterministic processes Processes that shape communities under conditions in which competing species are not equally adapted to the habitat Dispersal Movement of species in space and time

Drift Stochastic changes of species abundances

Functional redundancy The ability of a community to have the same ecosystem service provided by multiple species

GAMs (Generalized additive models)

A modelling framework in which the linear covariates of a basic linear model Y = b0 + b1x1 + b2x2+... +bnxn can be replaced with functions: Y = b0 + f1(x1) + f2(x2)+... +fn(xn). Functions fi may be specified by parametric, non- and semi-parametric forms ( smoother). Thus, Y is no longer a linear combination of covariates (but additive in the functions).

Neutral processes Processes that shape communities under conditions in which competing species are equally adapted to the habitat

Resilience The tendency of communities to return to a pre-disturbance or alternative stable state after being subjected to disturbances.

Selection Processes leading to the prevalence of populations which are better adapted to environmental changes

Smoother Algorithm to fit functions f1(x1), …, fp(xp) to the data such that one obtains a smooth curve through the data without big jumps ( GAMs); sometimes also refers to the resulting function estimate.

Spatial autocorrelation The degree of dependency among observations in space.

Variable selection Mathematical/statistical algorithms to identify the meaningful predictors in a regression model to obtain parsimonious models Variance partitioning A method to identity those fractions of the variation in a

dataset, which are explained by specified predictor categories (e.g. space and environment), either as pure or joint fractions

99 3.2.8.2 Supplementary Material and Methods 3.2.8.2.1 RNA Extraction

RNA was extracted from 358 samples using a direct lysis protocol (modified after Lueders, et al. (2003)).

For each sample, 600 mg of soil were transferred to a reaction tube, and cells were lysed by bead beating (45 s, 6.5 m/s). Nucleic acids were gradually extracted using phenol/chloroform/isoamyl alcohol (c(v/v/v) = 25/24/1) and chloroform/isoamyl alcohol (c(v/v) = 24/1) and were precipitated from aqueous supernatants with two volumes of polyethylene glycol (30% w/v in 1.6 M NaCl) during centrifugation (90 min, 4° C). The nucleic acid pellet was washed with ethanol, and resuspended in 30 µl elution buffer (Tris-HCl, 10 mM, pH 8.5). The nucleic acid concentration was determined using a NanoDrop 1000 spectrophotometer (Peqlab Biotechnologie, Erlangen, Germany). DNA was removed from extracts by DNase treatment (1 U DNaseI 1U/µl per µg of nucleic acids, Fermentas/Thermo Fisher Scientific, Waltham, USA)) RNA was precipitated with 1/10 volume of sodium acetate (3 M, pH 5.2), washed with ethanol, and stored in RNAse-free water at -80° C. After RNA quantification using the Quant-iT RiboGreen assay (Life Technologies/Thermo Fisher Scientific, Waltham, USA), RiboLock RNase inhibitor (Fermentas/ Thermo Fisher Scientific, Waltham, USA) was added to each sample.

3.2.8.2.2 cDNA and amplicon library construction

For each sample, RNA was reversely transcribed into cDNA in two separate reactions using GoScript (Promega, Mannheim, Germany) according to manufacturer's instructions. The amplicon library for Illumina sequencing was constructed as previously described (Bartram, et al., 2011). Briefly, primers contained the Illumina adapter sequence, the binding site for sequencing primers, and sequences for priming the target 16S rRNA region V3 (341f, 2 wobbles; modified from (Muyzer, et al., 1993)) and 515R (Lane, 1991), respectively. Additionally, the reverse primer included a barcode region of six

nucleotides, yielding the following sequences: V3_F

(5'-ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACGGGWGGCWGCA G-3') and V3_nR (5'-CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTT CCGATCTCCGCGGCTGCTGGCAC-3'). The PCR was performed with the Phusion High-Fidelity DNA Polymerase kit (New England Biolabs, Ipswich, UK) following the manufacturer's instructions, using 15 cycles of 94° C (15 s), 59° C (15 s), 72° C (15 s), after an initial denaturation step (94° C, 5 min), ended by a final elongation step (72° C, 7 min).

3.2.8.2.3 Sequencing and bioinformatic data procession

Sequencing was performed on the Illumina GA IIx system with paired 100-base reads in three separate runs on multiple lanes each. Forward and reverse reads were first trimmed to 100 bp and dimers were filtered out based on detection methods implemented in FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). FASTQ-files were synchronized for

100

those rare cases where only one of both reads has been filtered out. Reads were joined using fastq-join (Aronesty, 2013) at a minimum overlap of 6 nucleotides and allowing 20 percent mismatch and converted to FASTA file format. Converted FASTA files were checked for chimeras by Uchime integrated in Usearch 5.2.3 (Edgar, et al., 2011) applying the GOLD database from ChimeraSlayer (http://drive5.com/otupipe/gold.tz) as reference. Taxonomic-dependent analysis was performed using RDP Multiclassifier 1.1, which is essentially based on RDP classifier (Wang, et al., 2007, Cole, et al., 2009). A confidence value of 0.5 was applied for short amplicon data. This dataset represents our first level of analysis

3.2.8.2.4 OTU classification

In silico simulations demonstrated that a traditional OTU clustering at 97% sequence similarity of the variable region 3 of the acidobacterial 16S rRNA does not reflect the clustering of corresponding full length sequences. Thus, we used a closed reference OTU assignment based on known sequences. For this, all acidobacterial sequences were extracted from the SILVA reference database 115 (Quast, et al., 2013) and recently acquired full length sequences from in house cultivation experiments (unpublished) were added to the set. Sequences were dereplicated with an online tool (http://tornado.igb.illinois.edu/dereplicate.html), and then aligned with Infernal (Nawrocki & Eddy, 2013) versus a bacterial 16S rRNA RFAM model (as provided by http://infernal.janelia.org/). The alignment was clustered with HPC-clust (Matias Rodrigues & von Mering, 2014) with average linkage at 0.97 identity. These clusters served as reference for our OTU assignment. From this RDP-annotated dataset, all sequences classified as Acidobacteria were extracted via a whitelisting process and subjected to an additional cleaning step provided by the CD-HIT package (cd-hit-otu-filter.pl for removal of sequences containing ambiguous base calls, low quality bases, and erroneous primer sequences (Fu, et al., 2012)). Then, the reduced set of reads was mapped to the reference database with using the BLAST-N algorithm with BLAST+ 2.2 (Camacho, et al., 2009). The best matching hit was used for the OTU classification of a read, if the similarity between read query and reference cluster exceeded 99%, else the read was discarded from the analysis. From the BLAST hit tables, a sample by OTU matrix was built.

3.2.8.2.5 Variable selection with boosted generalized additive models

Generalized additive models (GAMs) were fitted to each OTU separately to model the species abundance. We used negative binomial models for generalist species and zero-inflated negative binomial models for rare species (with > 40 plots with zero counts). The models accounted for environmental factors, temporal and spatial effects. Categorical effects were used to model the temporal structure, P-splines (Eilers & Marx, 1996, Schmid & Hothorn, 2008) were used to model smooth, environmental effects, and bivariate P-splines were applied for spatial effects (Kneib, et al.,

101

2009). In a second step, spatio-temporal effects were added to the model to account for possible changes over time. These effects were modeled as time-specific bivariate P-splines.

In order to fit the models, we used component-wise boosting methods (see Mayr, et al. (2014) for an overview): Each effect is specified as a so called base-learner. Each of the base-learners is then fitted separately to the negative gradient of the loss function evaluated at the current fit (here the positive gradient of the log-likelihood) and only the best-fitting effect is updated. This is done by adding a fraction of the fit (usually 10%) to the model and by updating the corresponding effect estimate accordingly. In the next step, the negative gradient is recomputed and the procedure is repeated until a pre-specified number of iterations is reached. For details on the algorithm refer to Hofner, et al.

(2014).

This number of boosting iterations is the major tuning parameter of the model. By using cross-validation methods (such as 10-fold cross-cross-validation, bootstrap or subsampling) the appropriate stopping iteration is obtained such that the prediction accuracy of the model is optimized for new data (Mayr, et al., 2012). Here, we used 25-fold bootstrap to obtain the optimal stopping iteration for each model. As only one base-learner is updated per iteration we obtain variable selection by stopping the model after a sufficiently small number of steps ("early stopping"). Hence, boosting with cross-validation allows to fit the model and to select variables at the same time. The resulting model is usually sparse and has good prediction performance.

Yet, in many situations boosting is known to select too many noise variables (i.e., variables that in truth do not have an influence on the outcome). To obtain even sparser models, we additionally applied the generic stability selection approach (Meinshausen & Bühlmann, 2010, Shah & Samworth, 2013).

Stability selection uses 100 random subsamples of size ⌊n/2⌋. The model is fitted on each of the subsamples until q variables have been selected by the boosting algorithm. Variables that were selected on each (or at least most) of the subsamples are considered as stable, i.e., important effects.

To determine how often a variable has to be selected to be stable one uses a threshold for the selection frequency. This threshold can in turn be computed from a pre-specified bound on the per-family error rate (PFER). The PFER is defined as the expected number of falsely selected variables, i.e., as the expected number of noise variables that were selected. We specified q = 12 and PFER ≤ 2, i.e., we accepted at most two noise variables to be selected in each model which resulted in a threshold for the selection frequency of 80%. For details see Hofner, et al. (2015).

102 3.2.8.3 Supplementary Tables

Table 14. Variable selection with boosted GAMs per acidobacterial subgroup. Values (except for 'Total OTUs'") are given in % and should be read as the percentage of OTU per subgroup for which a variable was selected. Results are shown after cross-validation (top) and after additional stability selection of the models (bottom). Variables are sorted by their overall selection frequency.

103

Table 15. Best fit linear models describing the relation between environmental and spatial filters. Only environmental and spatial filters were selected for regression, which were significantly structuring the community composition (Table 13).

Models were evaluated with stepwise variable selection according the Akaike Information Criterion. The adjusted R2 value gives the strength of explanatory power of the model. Significance levels: * = p < 0.05, ** = p < 0.01, *** = p < 0.001. These results show which filters were responsible for the amount of variance which was explained jointly by space and soil properties. This may result either from stacked effects (many variables operating on few spatial filters) or from single variables working on scales represented by many spatial filters (as in August).

(BD = Bulk density, SOC = Soil organic carbon, SM = Soil Moisture, SE= Shannon Equitability (above ground vegetation), ArEla = Arrhenatherum elatius, CeHol = Cerastium holosteoides, TrRep = Trifolium repens)

Variable Spatial Filters

R2

(adjusted) Significance level Month

Nmic Ap5, Ap14 0.13 ** April

pH Ap5, Ap9 0.28 ***

Phosphate Ap5, Ap14 0.4 ***

SM Ap9 0.13 **

ArEla Ma1, Ma5 0.19 ** May

CeHol Ma1, Ma2 0.15 **

Nmic Ma1, Ma5, Ma7 0.25 **

pH Ma1, Ma4 0.29 ***

Moss Ju3, Ju10 0.27 *** June

pH Ju2, Ju10 0.2 ***

Phosphate Ju2 0.3 ***

SE Ju10 0.06 *

pH Au2, Au3, Au10, Au12, Au16 0.3 *** August

Cmic Oc7 0.14 ** October

Nmic Oc7 0.1 **

TrRep Oc5 0.06 *

Clay No1 0.1 ** November

Nmic No1 0.1 **

Ammonium No1 0.15 **

104 3.2.8.4 Supplementary Figures

An additional supplementary figure within the manuscript replicates Figure 6

Figure 15. Box plots depicting the seasonal progression of environmental variables in the experimental site. Values are standardized (mean = 0, SD = 1). For experimental procedures, refer to Regan, et al. (2014). Two PLFA signals were used to represent saprophobic fungi (18:2ω6) and protozoa (20:4ω6), the remaining PLFA signals were combined into a single variable (as a proxy for microbial productivity). SOC = Soil organic carbon; EOC/N = Extractable organic carbon/nitrogen;

C/N = Carbon to Nitrogen-Ratio; Cmic, Nmic = Microbially bound carbon/nitrogen. Cell counts refer to prokaryotic cells.

105

Figure 16. Stacked bar charts depicting averaged relative read abundances on the phylum level. Taxonomic information was inferred from the RDP-classifier (Wang, et al., 2007). Each month is represented by an averaged community from 60 samples (59 in April and June, respectively).

Figure 17. Temporal activity patterns of lesser abundant Acidobacteria subgroups in the ScaleMic experimental plot, not included in Figure 10 of the main manuscript. Shown are subgroup abundances (relative to all Bacteria, log-scaled) observed during each sampling date. Subgroups are sorted by average relative abundance

106

Figure 18. Relative abundances of important OTUs compared over all 358 sites (x-coordinate). OTU 11541_4 is the most dominant cosmopolitan OTU; 9656_3 is a comparably rare cosmopolitan OTU which became dominating during bloom events; 2017_1 is the most dominant OTU in the bloom events, appearing only in traces elsewhere; 10918_2 is a typical OTU in the bloom events, but frequently found elsewhere. Left panels feature logarithmic relative abundances, right panels absolute relative abundance values on the y-coordinate.

107

Figure 19. Time-Decay models for acidobacterial communities (averaged over all 59-60 sites per sampling date). Bray-Curtis dissimilarities between averaged monthly communities were transformed to similarities by subtraction of 1, log-transformed, and regressed over elapsed time between sampling dates. A: Slopes of linear models describing the relation between β-diversity and elapsed time between sampling days. 4 models are shown (for communities of which 0%, 10%, 50%, 90% of the most abundant OTUs were removed). Grey areas represent a 95% interval of the linear model. B:

Influences of the rare taxa to time-decay effects. The plot depicts communities which were iteratively depleted by the single most abundant OTU (x-coordinate) and the slope of the time-decay models obtained from the reduced community.

No model was statistically significant after adjustment of p-values (pFDR > 0.05).

108

Figure 20. Distance-decay regression models of iteratively reduced acidobacterial communities. Bray-Curtis dissimilarities between samples were transformed to similarities by subtraction of 1, log-transformed, and regressed over spatial distance between sites. The most abundant OTU of the remaining dataset was iteratively removed and a new model was fitted. X-coordinates represent the level of community completeness, y-coordinates the slope of the distance-decay models associated with each subcommunity.

109

Figure 21. Distance-Decay scatter plots. All 358 samples are assessed together, regardless of their sampling date and without averaging. Top left: Bray-Curtis similarities plotted vs. spatial distance. Top right: Bray-Curtis similarities plotted vs. temporal distance. Whenever a blooming was part of a pairwise comparison, the spot was colored accordingly. Bottom:

Bray-Curtis similarities plotted vs. temporal distance and spatial distance, colored by temporal distance.

110

Figure 22. Individual time-decay plots for each of the 30 defined subplots (see Regan, et al. (2014)). For each sampling date, the two locations sampled per subplot were averaged, and then subjected to time-decay-regression. No model was statistically significant (p>0.05).

111

Figure 23. Plot overlay of partial models for variables not shown in the manuscript. Each curve represents one partial function of one OTU-specific model, as derived from model cross-validation. Response values on the y-coordinate are partial effects of the additive model.

112

Figure 24. Distance based Moran eigenvector Maps, which were significantly filtering acidobacterial communities at a given sampling date (see Table 13). Each bubble represents a plot. Color and size of bubbles indicate the absolute value at a given plot centered on zero.

113

Figure 25. Spatial abundance plots of three environmental variables (A-C: pH, Nmic, Cmic) and two cosmopolitan but rare OTUs (D, E). All sampling dates are plotted. Each bubble represents a sampling site. Color and size of bubbles indicate values of the measure (standardized), or abundance (relative to all bacterial reads). The plot demonstrates the spatial dependencies of OTU abundance on non-randomly distributed environmental variables.

114 3.2.8.5 Supplementary References

Aronesty E (2013) Comparison of Sequencing Utility Programs. TOBIOIJ 7: 1:8.

Bartram AK, Lynch MD, Stearns JC, Moreno-Hagelsieb G, Neufeld JD (2011). Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end illumina reads.

Appl Environ Microbiol 77: 3846-3852.

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K & Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421.

Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM & Tiedje JM (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141-145.

Edgar RC, Haas BJ, Clemente JC, Quince C & Knight R (2011) UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27: 2194-2200.

Eilers PHC, Marx BD (1996). Flexible smoothing with B-splines and penalties. Stat Sci 11: 89-102.

Fu L, Niu B, Zhu Z, Wu S & Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data.

Bioinformatics 28: 3150-3152.

Hofner B, Mayr A, Robinzonov N & Schmid M (2014) Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 29: 3-35.

Hofner B, Boccuto L & Goker M (2015) Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinformatics 16: 144.

Kneib T, Hothorn T, Tutz G (2009). Variable Selection and Model Choice in Geoadditive Regression Models.

Biometrics 65: 626-634.

Lane DJ (1991). 16S/23S rRNA sequencing. In:

Stackebrandt E. GM (ed). Nucleic Acid Techniques in Bacterial Systematics. Wiley & Sons: Chichester, United Kingdom.

Lueders T, Manefield M & Friedrich MW (2003) Enhanced sensitivity of DNA- and rRNA-based stable isotope probing by fractionation and quantitative analysis of

isopycnic centrifugation gradients. Environ Microbiol 6:

73-78.

Matias Rodrigues JF & von Mering C (2014) HPC-CLUST:

distributed hierarchical clustering for large sets of nucleotide sequences. Bioinformatics 30: 287-288.

Mayr A, Hofner B, Schmid M (2012). The Importance of Knowing When to Stop. A Sequential Stopping Rule for Component-wise Gradient Boosting. Methods Inf. Med.

51: 178-186.

Mayr A, Binder H, Gefeller O, Schmid M (2014). The Evolution of Boosting Algorithms. From Machine Learning to Statistical Modelling. Methods Inf. Med. 53: 419-427.

Meinshausen N, Bühlmann P (2010). Stability selection. J R Stat Soc Series B Stat Methodol 72: 417-473.

Muyzer G, de Waal EC, Uitterlinden AG (1993). Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Appl Environ Microbiol 59: 695-700.

Nawrocki EP & Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933-2935.

Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J & Glockner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41: D590-596 Regan KM, Nunan N, Boeddinghaus RS, Baumgartner V, Berner D, Boch S, Oelmann Y, Overmann J, Prati D, Schloter M, Schmitt B, Sorkau E, Steffens M, Kandeler E &

Marhan S (2014) Seasonal controls on grassland microbial biogeography: Are they governed by plants, abiotic properties or both? Soil Biol Biochem 71: 21-30.

Schmid M & Hothorn T (2008) Boosting additive models using component-wise P-Splines. Comput Stat Data Anal 53: 298-311.

Shah RD, Samworth RJ (2013). Variable selection with error control: another look at stability selection. J R Stat Soc Series B Stat Methodol 75: 55-80.

Wang Q, Garrity GM, Tiedje JM, Cole JR (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73: 5261-5267.

115

Manuscript 2:

3.3 Spatial interaction of archaeal ammonia-oxidizers and nitrite-oxidizing