
3.3 Exact and approximate predictive assessment

3.3.1 One-step-ahead predictive assessment

Given the time series from time 1 to time n, how well can our model predict the observation at the next time n+1? That is, how good are the one-step-ahead predictions in our change point model? This classic task of time series models has been called "prequential forecasting" by Dawid (1984, p. 278), a term merging probability forecasting and sequential prediction.

Exact sampling

One way to assess the one-step-ahead prediction performance in our model is to predict the count at time t+1 = 1, 2, ..., n when the algorithm is given only the counts at times 1, 2, ..., t. For t = 0, we predict the first observation y_1 from the prior predictive distribution f(y_1). The forward and backward steps must thus be computed n−1 times, one time less than the number of observations, because the prior predictive samples are directly obtained from the data generating distribution parametrized by prior parameter samples. The following sampling details are tailored to the more complicated case t ≥ 1. Note that a particle filter algorithm could avoid the repeated forward and backward computations implied by our posterior sampling approach from section 3.2.4 (see Doucet, De Freitas, and Gordon (2001) for sequential Monte Carlo methods).
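To make the validation loop concrete, the following minimal sketch (not from the original text) assumes a user-supplied function predictive_draw(y_head, rng) that wraps the prior predictive draw for an empty learning set and the forward-backward posterior sampler of section 3.2.4 otherwise, returning one draw of the next observation.

```python
import numpy as np

def exact_one_step_ahead_draws(y, predictive_draw, n_draws=1000, seed=1):
    """Collect predictive samples of y_{t+1} given y_{[1,t]} for t = 0, ..., n-1.

    predictive_draw(y_head, rng) is an assumed wrapper: for an empty y_head it
    draws from the prior predictive f(y_1); otherwise it reruns the
    forward-backward sampler on y_head and returns one draw of the next count.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    draws = []
    for t in range(len(y)):          # learning sets: empty, y_1, y_[1,2], ..., y_[1,n-1]
        y_head = y[:t]
        draws.append(np.array([predictive_draw(y_head, rng) for _ in range(n_draws)]))
    return draws                     # draws[t] approximates f(y_{t+1} | y_{[1,t]})
```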

For the sampling from the predictive density f(y_{t+1} | y_{[1,t]}), the idea is to use the conditional independence of the observations y_t given the parameters ξ_t:

f(y_{t+1} \mid y_{[1,t]}) = \int f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid y_{[1,t]}) \, d\xi_{t+1}.

So if we can sample from f(ξ_{t+1} | y_{[1,t]}), we just plug the realization ξ_{t+1} into the likelihood and keep y_{t+1} ∼ f(· | ξ_{t+1}, x_{t+1}) as a sample from the predictive distribution.

One solution is to give the algorithm the response and observation indicator vectors (y_1, ..., y_t, 0) and (o_1, ..., o_t, 0) to mark y_{t+1} as missing. This naturally produces samples from f(ξ_{t+1} | y_{[1,t]}).
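As a minimal sketch of this first solution (the sampler interface run_changepoint_sampler is an assumption, not part of the text), only the padded response and indicator vectors need to be built:

```python
import numpy as np

def inputs_with_missing_next(y, o, t):
    """Build (y_1, ..., y_t, 0) and (o_1, ..., o_t, 0) so that y_{t+1} is
    treated as missing; feeding these vectors to the change point sampler
    then yields draws from f(xi_{t+1} | y_{[1,t]})."""
    y_in = np.append(y[:t], 0.0)   # dummy value for the unobserved y_{t+1}
    o_in = np.append(o[:t], 0)     # indicator 0 marks y_{t+1} as missing
    return y_in, o_in

# hypothetical usage:
# y_in, o_in = inputs_with_missing_next(y, o, t)
# xi_next = run_changepoint_sampler(y_in, o_in).xi[:, t]   # draws of xi_{t+1}
```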

Another solution exploits the sequential structure of the model. Consider

f(\xi_{t+1} \mid y_{[1,t]}) = \iint f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t, \theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_t \, d\theta_{[1,t]},

where θ_{[1,t]} is the change point configuration in the time series y_{[1,t]}: the next parameter ξ_{t+1} only depends on the last parameter ξ_t and θ_{[1,t]}, because if there is no change point at time t, then ξ_{t+1} ≡ ξ_t; otherwise we draw the next parameter from the prior f(ξ | φ). This mixture of a point mass at ξ_t and the prior distribution is weighted by the probability of a change point at time t given θ_{[1,t]}. If θ_{[1,t]} contains k change points with the last change point occurring at time s, this probability is

P(\theta_{k+1} = t \mid \theta_k = s, y_{[1,t]}) = P(\theta_{k+1} = t \mid \theta_k = s),

equal to the respective prior transition probability. The reason is that the event of a change point occurrence at time t is independent of the observations up to time t – they occur before this change point could have an effect. For the flat number prior, this probability equals (k+1)/(t+1), which is the "success probability" already visible in (3.2.5). See Hofmann (2007, p. 34) or Held, Hofmann, Höhle, and Schmid (2006, section 2.6) for more details on the derivation. For the binomial number prior, this probability equals the hyperparameter π. So if we have sampled ξ_t and θ_{[1,t]} with k change points, afterwards we sample

\xi_{t+1} \sim \{1 - P(\theta_{k+1} = t \mid \theta_k = s)\} \, \delta_{\xi_t} + P(\theta_{k+1} = t \mid \theta_k = s) \, f(\xi \mid \phi).
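In code, this draw is a simple two-component mixture; the sketch below assumes a user-supplied draw_prior_xi wrapping f(ξ | φ) (the function names are illustrative, not from the text).

```python
import numpy as np

def draw_xi_next(xi_t, k, t, draw_prior_xi, rng, number_prior="flat", pi=None):
    """Sample xi_{t+1} from the point-mass/prior mixture given one posterior
    draw of (xi_t, theta_{[1,t]}) containing k change points."""
    if number_prior == "flat":
        p_cp = (k + 1) / (t + 1)   # flat number prior: "success probability"
    else:
        p_cp = pi                  # binomial number prior: hyperparameter pi
    if rng.random() < p_cp:
        return draw_prior_xi(rng)  # change point at time t: new level from f(xi | phi)
    return xi_t                    # no change point: xi_{t+1} == xi_t
```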

Both solutions require n−1 forward steps: for the first solution, the algorithm must be run with the observation indicator vectors

(o_1, 0, \ldots, 0), \; (o_1, o_2, 0, \ldots, 0), \; \ldots, \; (o_1, o_2, \ldots, o_{n-1}, 0),

and for the second solution we must start the algorithm for each of the data vectors y_1, y_{[1,2]}, \ldots, y_{[1,n-1]}.

Note that this second solution is the basis for the approximate sampling described below.

Approximate sampling

The exact sampling from f(ξ_{t+1} | y_{[1,t]}) is computationally expensive, because for every time t a new forward step is necessary. An approximate version of the sampling is inspired by Marshall and Spiegelhalter (2003).

The idea is to use the whole time series y for only one forward-backward run. That is, the same sampling probabilities (3.2.12) and (3.2.11) are used for all learning sets {y_{[1,t]}}, t = 1, 2, ..., n−1. So the change point locations θ_{[1,t]} up to time t which are used above are not drawn from f(θ_{[1,t]} | y_{[1,t]}) but from f(θ_{[1,t]} | y). In practice, the change point sampler is run on the whole data y and produces samples of θ. Then, for the processing of the learning set y_{[1,t]}, all change points at times t, t+1, ..., n−1 are deleted from the sample vectors to produce approximate samples of θ_{[1,t]}, which conform with the maximum change point time t−1 for the reduced data y_{[1,t]}. (The conventional "change point" at the last time t is deterministic and is of course included.)
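In code, this pruning step is just a truncation of each sampled change point vector; a small sketch (the storage format of the samples is an assumption):

```python
import numpy as np

def prune_changepoints(theta_sample, t):
    """Turn one full-data draw of theta into an approximate draw of theta_{[1,t]}.

    All change points at times t, t+1, ..., n-1 are removed, so the remaining
    change point times are at most t-1, as required for the reduced data
    y_{[1,t]}.  The deterministic 'change point' at the last time t is implied
    and not stored here."""
    theta_sample = np.asarray(theta_sample)
    return theta_sample[theta_sample <= t - 1]
```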

However, sampling the parameter level ξ^{(k+1)} ≡ ξ_t of the block containing time t from the correct density f(ξ^{(k+1)} | θ_{[1,t]}, y_{[1,t]}) is easy: it is just the block posterior distribution f_block(ξ^{(k+1)} | y_{(θ_k, t]}, φ). Therefore, the unknown part y_{[t+1,n]} influences the parameter sample ξ_t only indirectly, through the "pruned" samples θ_{[1,t]} obtained from θ. Hence, the conservatism concerning the sampling of ξ_{t+1} and y_{t+1}, which is the price for the reduction of computing time in this approximate sampling scheme, should be moderate.

The proposed approximation of the one-step-ahead predictive density is thus

f(y_{t+1} \mid y_{[1,t]}) = \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t, \theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
= \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t \mid \theta_{[1,t]}, y_{[1,t]}) \, f(\theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
\approx \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t \mid \theta_{[1,t]}, y_{[1,t]}) \, f(\theta_{[1,t]} \mid y) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
=: \tilde{f}(y_{t+1} \mid y_{[1,t]}).

Analytical logarithmic scores

The proposed conjugate change point model allows the analytical computation of the one-step-ahead logarithmic scores

-\log f(y_1), \; -\log f(y_2 \mid y_1), \; \ldots, \; -\log f(y_n \mid y_{[1,n-1]}),

because all marginal likelihoods f(y_{[1,t]}), t = 1, ..., n, can be calculated in an exact one-step-ahead sampling loop: while for t = 1 we just use f(y_1) = f_block(y_1) and already have the first score, for t ≥ 2 we obtain f(y_{[1,t]}) from (3.2.10) as a by-product of the (reduced) posterior sampling given the data y_{[1,t]}. Remember that the marginal likelihood is computed at the end of the forward step, which is necessary for the change point sampling in the backward step. Having finished the exact one-step-ahead validation loop, we can compute the remaining logarithmic scores

-\log f(y_{t+1} \mid y_{[1,t]}) = \log f(y_{[1,t]}) - \log f(y_{[1,t+1]}), \qquad t = 1, \ldots, n-1.
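A short sketch of this computation, assuming the log marginal likelihoods log f(y_{[1,1]}), ..., log f(y_{[1,n]}) have been stored during the exact one-step-ahead loop:

```python
import numpy as np

def analytical_log_scores(log_marglik):
    """One-step-ahead logarithmic scores from stored log marginal likelihoods.

    log_marglik[t-1] holds log f(y_{[1,t]}) for t = 1, ..., n.  Returns the
    scores -log f(y_1), -log f(y_2 | y_1), ..., -log f(y_n | y_{[1,n-1]})."""
    log_marglik = np.asarray(log_marglik, dtype=float)
    scores = np.empty_like(log_marglik)
    scores[0] = -log_marglik[0]                        # -log f(y_1)
    scores[1:] = log_marglik[:-1] - log_marglik[1:]    # log f(y_[1,t]) - log f(y_[1,t+1])
    return scores                                      # sum(scores) == -log f(y)
```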

Note that the sum of the one-step-ahead logarithmic scores equals the negative marginal log-likelihood. The well-known decomposition of the marginal likelihood of the vector y into the conditional scalar densities,

f(y) = f(y_n \mid y_{[1,n-1]}) \, f(y_{n-1} \mid y_{[1,n-2]}) \cdots f(y_2 \mid y_1) \, f(y_1),

is the equivalent on the multiplicative scale. Therefore, the mean one-step-ahead log-score can be computed directly from the marginal likelihood in the conjugate change point model as LogS = −(1/n) log f(y). Since this is a strictly monotone transformation, the model comparison based on the one-step-ahead log-score is equivalent to that based on the marginal likelihood. Another consequence is that the mean one-step-ahead log-score of the reversed time series is identical to the log-score of the original time series, because the assumed prior and likelihood are invariant to the time direction, and so the marginal likelihood is identical.
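Explicitly, taking logarithms in this decomposition (with the convention f(y_1 | y_{[1,0]}) := f(y_1)) gives

-\log f(y) = \sum_{t=0}^{n-1} \{-\log f(y_{t+1} \mid y_{[1,t]})\}, \qquad \text{and hence} \qquad \mathrm{LogS} = \frac{1}{n} \sum_{t=0}^{n-1} \{-\log f(y_{t+1} \mid y_{[1,t]})\} = -\frac{1}{n} \log f(y).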

The estimation of the logarithmic scores using the Monte Carlo approach on page 14 is nevertheless sensible, because it allows us to assess the Monte Carlo error, which also contributes to the difference between the exact sampling and the approximate sampling results.

We have compared the analytical one-step-ahead log-scores with the corresponding exact sampling log-scores for all examined models in the case studies from sections 3.4.2, 3.5.2 and 3.6.2. The maximum absolute differences for the three sections were 0.051, 0.065 and 0.1, while the mean deviations were 0.004, 0.003 and 0.005, respectively. These Monte Carlo errors are very small compared to the approximate sampling errors, with maximum deviations of 1.924, 1.705 and 1.253, and mean deviations of 0.107, 0.098 and 0.079, respectively, in the three sections when all models are pooled. For illustration, we show comparison plots of exact sampling versus analytical log-scores for three selected models in Figure 3.2. Only in panel (b) do some points in the upper right corner lie slightly off the identity line.

Figure 3.2 – Comparison of analytical one-step-ahead log-scores (x-axis) and corresponding exact sampling log-scores (y-axis), for three models from sections 3.4.2, 3.5.2 and 3.6.2.

[Figure 3.2 panels: (a) coal-mining data, model 1; (b) Tokyo rainfall data, model 2; (c) Nile discharge data, model 3.]