
The simulated examples presented here are run on compute servers provided by the Humboldt Lab of Empirical and Quantitative Research (LEQR). The general setup of the simulation study in R (R Core Team 2020) is largely inspired by the work of Peikert (2019), who shows examples of working with Makefiles for running simulations in parallel, as well as by the parallelization approaches presented in Lee, Sriutaisuk, and Kim (2020). The R package purrr (Henry and Wickham 2020) as well as tools from the tidyverse (Wickham 2019) are used to enable parallel computation.

The procedure of the simulation is as follows. To obtain results for different combinations of hyperprior settings while controlling for different training and test data sample sizes, a function is built to iterate over a table of all combinations of settings created using tidyr::crossing() from the R package tidyr (Wickham 2020). In particular, for the first exploratory step, we create a grid of all combinations of settings via:

if (!requireNamespace("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)

# true values
truealpha <- pi
truesigma2 <- 0.5
truerho <- 0.021

# sample sizes for training data
sample_sizes <- c(50, 100, 250, 500)

# sample sizes for validation set for all settings
val_sample_sizes <- c(500, 1000)

# replicates
nr_iterations <- 1:100

# placeholder for hyperprior settings
hypersettings <- c("Setting 1: weakly informative",
                   "Setting 2: weakly informative",
                   "Setting 3: weakly informative",
                   "Setting 4: informative",
                   "Setting 5: informative",
                   "Setting 6: informative")

# creating a grid of simulation conditions
conditions <- crossing(
  truealpha = truealpha, truesigma2 = truesigma2, truerho = truerho,
  hyperprior = hypersettings, N = sample_sizes,
  val_N = val_sample_sizes, iteration = nr_iterations)
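Such a grid is then consumed row by row by the simulation function. As a minimal sketch of this pattern, purrr::pmap() can pass each row of the grid as named arguments to a worker function (run_one_setting() below is a hypothetical stand-in for the actual simulation function, not the thesis code):

```r
library(tibble)
library(purrr)
library(tidyr)

# small grid analogous to the one above
conditions <- crossing(
  N = c(50, 100),
  val_N = c(500, 1000),
  iteration = 1:2
)

# hypothetical worker: one simulation run per row of the grid
run_one_setting <- function(N, val_N, iteration) {
  tibble(N = N, val_N = val_N, iteration = iteration, result = NA_real_)
}

# pmap() iterates over all combinations, one call per grid row
results <- pmap(conditions, run_one_setting)
length(results)  # one result per grid row: 8
```

The same pattern extends to parallel execution, e.g. by swapping pmap() for a parallelized map variant.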

3.3.1 Training and Test Data

In Jona Lasinio, Gelfand, and Jona Lasinio (2012), the authors generated samples of size n = 100 that were then split into samples of size n = 30 and n = 70, using the leftover data as test data, in order to compare two different sizes of training and test data. In contrast, this simulation compares a larger variety of training and test data sample sizes while also creating 100 replicates for each combination of hyperprior settings and training and test data sample sizes. Training data sample sizes for all simulated examples are N = {50, 100, 250, 500}. Test data for spatial interpolation is created with sample sizes of val_N = {500, 1000} (and val_N = {50, 500, 1000} in the extended setup) from observations that are not used in training.

This results in a total number of 100 × all combinations of N, val_N and hyperprior models for a simulation run, e.g. there are 100 × 4 × 2 × 6 = 4800 models in the exploratory setup in section 3.3.
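This count can also be read straight off the grid of conditions: assuming the grid is built with crossing() as above, the total number of models is simply its number of rows:

```r
library(tidyr)

# grid of conditions for the exploratory setup
conditions <- crossing(
  hyperprior = paste("Setting", 1:6),
  N = c(50, 100, 250, 500),
  val_N = c(500, 1000),
  iteration = 1:100
)

nrow(conditions)  # 6 * 4 * 2 * 100 = 4800
```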

To ensure a more realistic setup for spatial interpolation, possible locations are restricted to a grid of points created in Universal Transverse Mercator (UTM) coordinates covering a part of the UTM zone 32 of the Northern hemisphere. Then, locations for training and testing are sampled from this common area for each replicate in each setting while ensuring that locations that are used for training are not used for testing and spatial interpolation.

This is facilitated with a custom function that converts given coordinates from longitude and latitude to UTM format. The complete steps of data generation can be found in sim_gendata_grid.R.
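The conversion step can be sketched with the sf package (an assumption for illustration; the thesis uses its own custom function), projecting longitude/latitude coordinates (EPSG:4326) to UTM zone 32N (EPSG:32632):

```r
library(sf)

# hypothetical example longitude/latitude pairs inside UTM zone 32N
lonlat <- data.frame(lon = c(9.0, 9.5), lat = c(52.0, 52.5))

# build sf points in WGS84 and project them to UTM zone 32N
pts_utm <- st_transform(
  st_as_sf(lonlat, coords = c("lon", "lat"), crs = 4326),
  crs = 32632
)

st_coordinates(pts_utm)  # easting/northing in metres
```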

Then, samples from a linear (unwrapped) Gaussian process are simulated by sampling from a multivariate Gaussian distribution following Coles (1998), Coles and Casson (1998) and Jona Lasinio, Gelfand, and Jona Lasinio (2012), using the custom function rmnorm() introduced in the vignette of G. Jona Lasinio, Mastrantonio, and Santoro (2019), which allows for Cholesky decomposition.
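The core idea of such a sampler can be sketched as a standard Cholesky-based multivariate normal draw (a simplified stand-in, not the vignette's exact implementation):

```r
# draw n samples from N(mean, sigma) via the Cholesky factor of sigma
rmnorm_sketch <- function(n, mean, sigma) {
  d <- length(mean)
  R <- chol(sigma)               # upper-triangular factor, t(R) %*% R == sigma
  Z <- matrix(rnorm(n * d), nrow = n)
  sweep(Z %*% R, 2, mean, "+")   # each row is one draw with the target covariance
}

set.seed(1)
Y <- rmnorm_sketch(1, mean = rep(pi, 3), sigma = diag(3) * 0.5)
dim(Y)  # 1 x 3
```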

Following the steps in examples in the vignette of G. Jona Lasinio, Mastrantonio, and Santoro (2019), training and test coordinates are used in the creation of a distance matrix that contributes to the covariance matrix of the unwrapped spatial Gaussian process.

A circular Gaussian process is created by wrapping these samples componentwise using Θ_i = Y_i mod 2π for the i-th entry. This is repeated until the selected sample size of N + val_N is reached, where N and val_N denote the selected training and test sample sizes, respectively.

The sampled circular data is then split into a training and a test dataset according to the selected N and val_N.

A reduced version of these steps is presented below; see sim_gendata_grid.R for the complete procedure used in the simulation runs.

# requires rmnorm(), coordinates, true values to run

#----dist-matrix----
joint_coords <- rbind(coords, valcoords)

# create dist matrix for all coords
Dist <- as.matrix(dist(joint_coords))

#----cov-function----
# exponential covariance function with parameters sigma2 & rho
SIGMA <- sigma2 * exp(-rho * Dist)

#----simulate-data-unwrapped-GP----
Y <- rmnorm(1, rep(alpha, times = NROW(joint_coords)), SIGMA)

#----wrap-data----
theta <- c()
for (j in 1:NROW(joint_coords)) {
  theta[j] <- Y[j] %% (2 * pi)
}

# create training and validation set for the directional data input
rows_theta_train <- sample(seq_len(NROW(joint_coords)), N)
thetadata <- theta[rows_theta_train]
valthetadata <- theta[-rows_theta_train]
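As a quick sanity check on these steps (an illustrative addition, not part of the original script), the wrapped values must fall in [0, 2π) and the training and test sets must partition the sampled data:

```r
set.seed(42)
Y <- rnorm(100, mean = pi, sd = 2)   # stand-in for the unwrapped GP draws
theta <- Y %% (2 * pi)               # componentwise wrapping

rows_train <- sample(seq_along(theta), 30)
theta_train <- theta[rows_train]
theta_val <- theta[-rows_train]

all(theta >= 0 & theta < 2 * pi)                          # TRUE
length(theta_train) + length(theta_val) == length(theta)  # TRUE
```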

Sampling different data for training and testing for each replicate in each combination of settings would make comparisons between different hyperprior settings unclear.

To ensure comparability between the different combinations and to control for the potential influence of different training and test locations in spatial interpolation, seeds for data generation are created which are the same for each replicate in each setting. The example below from sim_settings_grid.R presents the general idea:

# requires conditions-df to run

if (!requireNamespace("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)

# set seed for creating seeds
set.seed(987654321)

# nest & unnest conditions
# to create same train- and validation set
# over iterations and sample sizes
conditions <- conditions %>%
  nest(hyperprior, N, val_N) %>%
  dplyr::mutate(
    datagenseed = as.integer(runif(length(iteration)) * 1e6)
  ) %>%
  unnest(data)

Hence, training and test data vary among the 100 replicates for each hyperprior setting. At the same time, the above procedure ensures a level of comparability between the replicates of each combination of hyperprior settings and training and test data sample sizes, because within each of the 100 replicates the training and test data differ only in sample size across the different hyperprior settings.
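Inside the data-generation function, the per-replicate seed is then applied before any sampling, so that two settings sharing a datagenseed see identical locations. A sketch of the idea (hypothetical helper, not the exact code from sim_settings_grid.R):

```r
# hypothetical sketch: same seed -> identical sampled training locations
gen_locations <- function(datagenseed, N, n_grid = 1000) {
  set.seed(datagenseed)        # seed taken from the conditions table
  sample(seq_len(n_grid), N)   # indices into the common grid of locations
}

a <- gen_locations(datagenseed = 123, N = 50)
b <- gen_locations(datagenseed = 123, N = 50)
identical(a, b)  # TRUE: same seed yields the same locations across settings
```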