A Scalable Deep-Learning Model

3.6 Model Evaluation Based on Empirical Data

3.6.1 Data

We validate the prediction performance of our model using transaction data provided by a leading German grocery retailer. The data set comprises three sources: loyalty card data, market basket data, and coupon data. The loyalty card data follows a panel structure and contains the transactions of loyalty card holders. The market basket data includes information about purchases without customer identifiers. The coupon data contains information about the coupons provided to customers (including unredeemed coupons). Overall, the data set spans 83 weeks (2015-2016) and includes 22,740,377 purchases by 489,438 loyalty card customers and 73,048,605 shopping baskets by customers without a loyalty card.

We provide summary statistics in Table 3.5.

We focus our analysis on 50 product categories for which the retailer distributed coupons during the period of the analysis. The categories include food products such as milk, bread, chocolate bars, and coffee, and nonfood products such as shampoo, fabric softener, and toothbrushes. The categories vary in interpurchase times and competitive intensity. We provide a complete list of the product categories in Appendix 3.8.

Most customers visit the store no more than once a week, so we aggregate the data to a weekly level. The median time between two shopping trips across all customers is two weeks (SD = 4.02). This value is typical for a supermarket in a German metropolitan area. The retailer used coupons to promote category-brand combinations that group stock keeping units of the same package size and price range. We follow the retailer's product grouping and use this level of aggregation for our analysis. For a small subset of customers, the retailer provided coupons randomly. Our empirical analysis only considers customers with randomly assigned coupons, which allows us to avoid endogeneity concerns in model training and validation.
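A minimal sketch of the weekly aggregation using only the Python standard library. The toy transaction log and its field names are illustrative, not the retailer's actual schema; purchases are collapsed to a binary indicator per customer, brand, and ISO calendar week:

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, brand_id, purchase date).
transactions = [
    (1, 10, date(2015, 5, 4)),
    (1, 10, date(2015, 5, 6)),
    (2, 11, date(2015, 5, 5)),
    (2, 11, date(2015, 5, 20)),
]

# Aggregate to the customer x brand x week level: a binary purchase
# indicator per ISO calendar week, mirroring the weekly aggregation
# described in the text.
weekly = defaultdict(int)
for customer, brand, day in transactions:
    week = tuple(day.isocalendar()[:2])   # (ISO year, ISO week)
    weekly[(customer, brand, week)] = 1
```

The two purchases of brand 10 by customer 1 fall into the same week and collapse into a single observation, so the toy log yields three weekly records.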

Table 3.5. Summary statistics: data sets for empirical application.

Data Set             Variable                                   Value
Loyalty Card Data    # of users                                 489,438
                     # of weeks (date range)                    83 (2015/05 - 2016/12)
                     # of brands (# of retailer categories)     758 (50)
                     # of stores                                155
Coupon Data          # of coupons                               650,973
                     Avg. # of coupons per customer (SD)        4.76 (6.33)
                     Discount range                             [5%, 50%]
                     Avg. redemption rate (SE)                  1.529% (.014%)
                     Avg. discount (SD)                         23.7% (10.0%)
Market Basket Data   # of baskets                               73,048,605
                     # of months                                12 a)
                     Avg. # of products per basket              4.91

Note: a) First year of loyalty card data.

3.6.2 Evaluation Results

For the model evaluation, we follow the approach used in the simulation study and create a hold-out test set by splitting the data in the time dimension. The first 73 weeks are used for model training, whereas the last ten weeks comprise the test data. We predict the purchase probabilities for all products and 1,000 customers.
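The temporal split can be illustrated as follows; the array contents are toy stand-ins for the actual observations, and only the 73/10-week cutoff comes from the text:

```python
import numpy as np

def temporal_split(X, weeks, n_train_weeks=73):
    """Split observations in the time dimension: the first n_train_weeks
    form the training set, the remaining weeks the hold-out test set."""
    train_mask = weeks < n_train_weeks
    return X[train_mask], X[~train_mask]

# Toy data: 83 weeks with 5 observations each.
weeks = np.arange(83).repeat(5)
X = np.arange(weeks.size)
X_train, X_test = temporal_split(X, weeks)
```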

We train the deep neural network in two stages. In the first stage, we apply P2V-MAP (Gabel et al., 2019) to the market basket data to derive product embeddings.

We then use the pretrained embeddings to initialize W_H, W_d, and W. Pretraining embeddings is a common approach to training deep neural network architectures in computer vision and natural language processing. Pretraining helps to avoid poor local minima in supervised learning and improves the generalizability of results by having a regularizing effect on the neural network (Erhan et al., 2010).

Initializing the product embeddings in our neural network with the output of P2V-MAP reduces the number of training iterations required to achieve model convergence by approximately 25%. Because the training time for mini-batch gradient descent scales linearly with the number of training iterations, pretraining the product embeddings reduces training time significantly.

In the second stage of model training, we initialize the parameters of the bottleneck layers with the product embeddings and train the full neural network by minimizing the binary cross-entropy loss (see Section 3.3.3). This step fine-tunes the pretrained bottleneck layer weights.
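The two-stage scheme can be sketched in NumPy on toy data: weights are initialized with (placeholder) pretrained embeddings and then fine-tuned by gradient descent on the binary cross-entropy loss. The network is reduced to a single logistic layer here for illustration; the actual architecture, inputs, and labels are those described in Section 3.3, not the random stand-ins below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, L = 8, 4

# Stage 1 output: pretrained product embeddings (random placeholders
# standing in for the P2V-MAP embeddings).
pretrained = rng.normal(scale=0.1, size=(n_products, L))

# Stage 2: initialize the weights with the pretrained embeddings
# instead of random values, then fine-tune them.
W = pretrained.copy()

x = rng.normal(size=(100, L))                # toy customer representations
y = (x @ pretrained.T > 0).astype(float)     # toy binary purchase indicators

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fine-tuning: plain gradient descent on the binary cross-entropy loss.
lr = 0.5
for _ in range(200):
    p = sigmoid(x @ W.T)                     # predicted purchase probabilities
    W -= lr * (p - y).T @ x / len(x)         # gradient of the BCE w.r.t. W
```

Starting from the pretrained weights rather than a random initialization is exactly what shortens the path to convergence on the real data.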

We calibrate the hyperparameters of the model on a validation set by comparing the binary cross-entropy loss over a small number of randomly sampled hyperparameter sets. Random search is a common approach to configuring neural networks and typically finds solutions as good as those from grid search in a small fraction of the computation time (Bergstra and Bengio, 2012).
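Random search over hyperparameters can be sketched as follows. The search space and the stand-in validation loss are hypothetical; in the actual procedure, evaluating a configuration means training the network and scoring it on the validation set.

```python
import random

random.seed(0)

# Hypothetical search space; the names mirror hyperparameters mentioned
# in the text (embedding size L, number of epochs), the values are
# illustrative.
search_space = {
    "L": [15, 30, 50],
    "n_epoch": [10, 20, 40],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def validation_loss(config):
    """Stand-in for training the model under `config` and returning the
    validation binary cross-entropy."""
    return abs(config["L"] - 50) * 0.01 + config["learning_rate"]

# Random search: sample a small number of configurations, keep the best.
candidates = [
    {k: random.choice(v) for k, v in search_space.items()} for _ in range(8)
]
best = min(candidates, key=validation_loss)
```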

We initialize the hyperparameter search with the values used in the simulation.

We find that a larger embedding size (L = 50) improves the test loss and that the model converges in fewer epochs (n_epoch = 10). For the other hyperparameters, the random search did not yield a significant loss improvement, so we use the same values as in the simulation. Because we calibrate the simulation to mimic the behavior of the empirical data, the similarities between the simulation study and the empirical application are not surprising. In line with the results of the simulation study, we find that the parsimonious architecture makes our neural network robust to overfitting (see Appendix 3.8). We present a two-dimensional t-SNE projection of the product embedding W_H trained on empirical data in Appendix 3.8.

An important difference from the simulation study is that the true purchase probabilities are unknown in the empirical application. We therefore focus on comparing the binary cross-entropy loss, which evaluates the predictions based on the observed (binary) purchase indicators and the predicted purchase probabilities. We evaluate the prediction performance of our proposed model and compare it to the baselines used in Section 3.5.
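The evaluation metric is a simple function of the observed indicators and the predicted probabilities; a minimal implementation (the sample values are toy data):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between observed purchase indicators y
    (0/1) and predicted purchase probabilities p; clipping avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 0, 1])          # observed (binary) purchase indicators
p = np.array([0.8, 0.1, 0.3, 0.6])  # predicted purchase probabilities
loss = binary_cross_entropy(y, p)
```

Lower values indicate that the predicted probabilities track the observed purchases more closely, which is why the loss can rank models even without access to true purchase probabilities.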

Table 3.6 reports the evaluation results. We find that the ranking of the models based on the predictive performance is in line with the results obtained from the simulated data, and our proposed model achieves a lower cross-entropy loss than the reference methods.

We conduct an additional regression analysis to understand the performance differences between our deep neural network (DNN) and each of the two baseline models. First, we compute the binary cross-entropy loss for each observation (customer, product, time) in the test set for the DNN and the reference model. We then compare the loss values using a linear regression:

M1: L_ijtm = α_0 + α_c + δ_DNN,

M2: L_ijtm = α_0 + α_c + δ_DNN + δ_C + δ_P + IPT_ic
           + δ_DNN × δ_C + δ_DNN × δ_P + δ_DNN × IPT_ic,

where m indexes the model (either DNN or the reference model), α_0 is the regression intercept, α_c are category-level fixed effects, and IPT_ic is the average customer-level category interpurchase time computed on the training data. The regression includes three indicator variables:
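Specification M1 can be sketched on simulated per-observation losses. All values below are toy stand-ins (the true effect sizes are invented for illustration); the category fixed effects are coded as plain dummy variables and the regression is solved by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Toy per-observation losses stacked for both models; delta_dnn marks
# whether a loss comes from the DNN (1) or the reference model (0).
category = rng.integers(0, 5, size=n)          # category of the observation
delta_dnn = rng.integers(0, 2, size=n)
loss = 0.3 + 0.05 * category - 0.04 * delta_dnn + rng.normal(0, 0.01, n)

# Design matrix for M1: intercept, category dummies (first category
# omitted as the baseline), and the DNN indicator.
cat_dummies = (category[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([np.ones(n), cat_dummies, delta_dnn])
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)

# beta[-1] estimates the DNN loss difference; a negative coefficient
# means the DNN achieves a lower loss than the reference model.
```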
