
3.3 Exact and approximate predictive assessment

3.3.1 One-step-ahead predictive assessment

Given the time series from time 1 to time n, how well can our model predict the observation at the next time n+1? That is, how good are the one-step-ahead predictions in our change point model? This classic task of time series models has been called "prequential forecasting" by Dawid (1984, p. 278), a term merging probability forecasting and sequential prediction.

Exact sampling

One way to assess the one-step-ahead prediction performance in our model is to predict the count at time t+1 = 1, 2, ..., n when the algorithm is given only the counts at times 1, 2, ..., t. For t = 0, we predict the first observation y_1 from the prior predictive distribution f(y_1). The forward and backward steps must thus be computed n−1 times, one time less than the number of observations, because the prior predictive samples are directly obtained from the data generating distribution parametrized by prior parameter samples. The following sampling details are tailored to the more complicated case t ≥ 1. Note that a particle filter algorithm could avoid the repeated forward and backward computations implied by our posterior sampling approach from section 3.2.4 (see Doucet, De Freitas, and Gordon (2001) for sequential Monte Carlo methods).
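To make the validation loop concrete, the following minimal sketch (not from the original text) assumes a user-supplied function predictive_draw(y_head, rng) that wraps the prior predictive draw for an empty learning set and the forward-backward posterior sampler of section 3.2.4 otherwise, returning one draw of the next observation.

```python
import numpy as np

def exact_one_step_ahead_draws(y, predictive_draw, n_draws=1000, seed=1):
    """Collect predictive samples of y_{t+1} given y_{[1,t]} for t = 0, ..., n-1.

    predictive_draw(y_head, rng) is an assumed wrapper: for an empty y_head it
    draws from the prior predictive f(y_1); otherwise it reruns the
    forward-backward sampler on y_head and returns one draw of the next count.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    draws = []
    for t in range(len(y)):          # learning sets: empty, y_1, y_[1,2], ..., y_[1,n-1]
        y_head = y[:t]
        draws.append(np.array([predictive_draw(y_head, rng) for _ in range(n_draws)]))
    return draws                     # draws[t] approximates f(y_{t+1} | y_{[1,t]})
```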

For the sampling from the predictive density f(y_{t+1} | y_{[1,t]}), the idea is to use the conditional independence of the observations y_t given the parameters ξ_t:

f(y_{t+1} \mid y_{[1,t]}) = \int f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid y_{[1,t]}) \, d\xi_{t+1}.

So if we can sample from f(ξ_{t+1} | y_{[1,t]}), we just plug the realization ξ_{t+1} into the likelihood and keep y_{t+1} ∼ f(· | ξ_{t+1}, x_{t+1}) as a sample from the predictive distribution.

One solution is to give the algorithm the response and observation indicator vectors (y_1, ..., y_t, 0) and (o_1, ..., o_t, 0) to mark y_{t+1} as missing. This naturally produces samples from f(ξ_{t+1} | y_{[1,t]}).
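As a minimal sketch of this first solution (the sampler interface run_changepoint_sampler is an assumption, not part of the text), only the padded response and indicator vectors need to be built:

```python
import numpy as np

def inputs_with_missing_next(y, o, t):
    """Build (y_1, ..., y_t, 0) and (o_1, ..., o_t, 0) so that y_{t+1} is
    treated as missing; feeding these vectors to the change point sampler
    then yields draws from f(xi_{t+1} | y_{[1,t]})."""
    y_in = np.append(y[:t], 0.0)   # dummy value for the unobserved y_{t+1}
    o_in = np.append(o[:t], 0)     # indicator 0 marks y_{t+1} as missing
    return y_in, o_in

# hypothetical usage:
# y_in, o_in = inputs_with_missing_next(y, o, t)
# xi_next = run_changepoint_sampler(y_in, o_in).xi[:, t]   # draws of xi_{t+1}
```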

Another solution exploits the sequential structure of the model. Consider

f(\xi_{t+1} \mid y_{[1,t]}) = \iint f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t, \theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_t \, d\theta_{[1,t]},

where θ_{[1,t]} is the change point configuration in the time series y_{[1,t]}: the next parameter ξ_{t+1} only depends on the last parameter ξ_t and θ_{[1,t]}, because if there is no change point at time t, then ξ_{t+1} ≡ ξ_t; otherwise we draw the next parameter from the prior f(ξ | φ). This mixture of a point mass at ξ_t and the prior distribution is weighted by the probability of a change point at time t given θ_{[1,t]}. If θ_{[1,t]} contains k change points with the last change point occurring at time s, this probability is

P(\theta_{k+1} = t \mid \theta_k = s, y_{[1,t]}) = P(\theta_{k+1} = t \mid \theta_k = s),

equal to the respective prior transition probability. The reason is that the event of a change point occurrence at time t is independent of the observations up to time t – they occur before this change point could have an effect. For the flat number prior, this probability equals (k+1)/(t+1), which is the "success probability" already visible in (3.2.5). See Hofmann (2007, p. 34) or Held, Hofmann, Höhle, and Schmid (2006, section 2.6) for more details on the derivation. For the binomial number prior, this probability equals the hyperparameter π. So if we have sampled ξ_t and θ_{[1,t]} with k change points, afterwards we sample

\xi_{t+1} \sim \{1 - P(\theta_{k+1} = t \mid \theta_k = s)\} \, \delta_{\xi_t} + P(\theta_{k+1} = t \mid \theta_k = s) \, f(\xi \mid \phi).
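In code, this draw is a simple two-component mixture; the sketch below assumes a user-supplied draw_prior_xi wrapping f(ξ | φ) (the function names are illustrative, not from the text).

```python
import numpy as np

def draw_xi_next(xi_t, k, t, draw_prior_xi, rng, number_prior="flat", pi=None):
    """Sample xi_{t+1} from the point-mass/prior mixture given one posterior
    draw of (xi_t, theta_{[1,t]}) containing k change points."""
    if number_prior == "flat":
        p_cp = (k + 1) / (t + 1)   # flat number prior: "success probability"
    else:
        p_cp = pi                  # binomial number prior: hyperparameter pi
    if rng.random() < p_cp:
        return draw_prior_xi(rng)  # change point at time t: new level from f(xi | phi)
    return xi_t                    # no change point: xi_{t+1} == xi_t
```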

Both solutions require n−1 forward steps: for the first solution, the algorithm must be run with the observation indicator vectors

(o_1, 0, \ldots, 0), \; (o_1, o_2, 0, \ldots, 0), \; \ldots, \; (o_1, o_2, \ldots, o_{n-1}, 0),

and for the second solution we must start the algorithm for each of the data vectors y_1, y_{[1,2]}, \ldots, y_{[1,n-1]}.

Note that this second solution is the basis for the approximate sampling described below.

Approximate sampling

The exact sampling from f(ξ_{t+1} | y_{[1,t]}) is computationally expensive, because for every time t a new forward step is necessary. An approximate version of the sampling is inspired by Marshall and Spiegelhalter (2003).

The idea is to use the whole time series y for only one forward-backward run. That is, the same sampling probabilities (3.2.12) and (3.2.11) are used for all learning sets {y_{[1,t]}}, t = 1, 2, ..., n−1. So the change point locations θ_{[1,t]} up to time t which are used above are not drawn from f(θ_{[1,t]} | y_{[1,t]}) but from f(θ_{[1,t]} | y). In practice, the change point sampler is run on the whole data y and produces samples of θ. Then, for the processing of the learning set y_{[1,t]}, all change points at times t, t+1, ..., n−1 are deleted from the sample vectors to produce approximate samples of θ_{[1,t]}, which conform with the maximum change point time t−1 for the reduced data y_{[1,t]}. (The conventional "change point" at the last time t is deterministic and is of course included.)
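In code, this pruning step is just a truncation of each sampled change point vector; a small sketch (the storage format of the samples is an assumption):

```python
import numpy as np

def prune_changepoints(theta_sample, t):
    """Turn one full-data draw of theta into an approximate draw of theta_{[1,t]}.

    All change points at times t, t+1, ..., n-1 are removed, so the remaining
    change point times are at most t-1, as required for the reduced data
    y_{[1,t]}.  The deterministic 'change point' at the last time t is implied
    and not stored here."""
    theta_sample = np.asarray(theta_sample)
    return theta_sample[theta_sample <= t - 1]
```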

However, sampling the parameter level ξ^{(k+1)} ≡ ξ_t of the block containing time t from the correct density f(ξ^{(k+1)} | θ_{[1,t]}, y_{[1,t]}) is easy: it is just the block posterior distribution f_block(ξ^{(k+1)} | y_{(θ_k, t]}, φ). Therefore, the unknown part y_{[t+1,n]} influences the parameter sample ξ_t only indirectly, through the "pruned" samples θ_{[1,t]} obtained from θ. Hence, the conservatism concerning the sampling of ξ_{t+1} and y_{t+1}, which is the price for the reduction of computing time in this approximate sampling scheme, should be moderate.

The proposed approximation of the one-step-ahead predictive density is thus

f(y_{t+1} \mid y_{[1,t]}) = \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t, \theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
= \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t \mid \theta_{[1,t]}, y_{[1,t]}) \, f(\theta_{[1,t]} \mid y_{[1,t]}) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
\approx \iiint f(y_{t+1} \mid \xi_{t+1}) \, f(\xi_{t+1} \mid \xi_t, \theta_{[1,t]}) \, f(\xi_t \mid \theta_{[1,t]}, y_{[1,t]}) \, f(\theta_{[1,t]} \mid y) \, d\xi_{t+1} \, d\xi_t \, d\theta_{[1,t]}
=: \tilde{f}(y_{t+1} \mid y_{[1,t]}).

Analytical logarithmic scores

The proposed conjugate change point model allows the analytical computation of the one-step-ahead logarithmic scores

-\log f(y_1), \; -\log f(y_2 \mid y_1), \; \ldots, \; -\log f(y_n \mid y_{[1,n-1]}),

because all marginal likelihoods f(y_{[1,t]}), t = 1, ..., n, can be calculated in an exact one-step-ahead sampling loop: while for t = 1 we just use f(y_1) = f_block(y_1) and already have the first score, for t ≥ 2 we obtain f(y_{[1,t]}) from (3.2.10) as a by-product of the (reduced) posterior sampling given the data y_{[1,t]}. Remember that the marginal likelihood is computed at the end of the forward step, which is necessary for the change point sampling in the backward step. Having finished the exact one-step-ahead validation loop, we can compute the remaining logarithmic scores

-\log f(y_{t+1} \mid y_{[1,t]}) = \log f(y_{[1,t]}) - \log f(y_{[1,t+1]}), \qquad t = 1, \ldots, n-1.
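A short sketch of this computation, assuming the log marginal likelihoods log f(y_{[1,1]}), ..., log f(y_{[1,n]}) have been stored during the exact one-step-ahead loop:

```python
import numpy as np

def analytical_log_scores(log_marglik):
    """One-step-ahead logarithmic scores from stored log marginal likelihoods.

    log_marglik[t-1] holds log f(y_{[1,t]}) for t = 1, ..., n.  Returns the
    scores -log f(y_1), -log f(y_2 | y_1), ..., -log f(y_n | y_{[1,n-1]})."""
    log_marglik = np.asarray(log_marglik, dtype=float)
    scores = np.empty_like(log_marglik)
    scores[0] = -log_marglik[0]                        # -log f(y_1)
    scores[1:] = log_marglik[:-1] - log_marglik[1:]    # log f(y_[1,t]) - log f(y_[1,t+1])
    return scores                                      # sum(scores) == -log f(y)
```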

Note that the sum of the one-step-ahead logarithmic scores equals the negative marginal log-likelihood. The well-known decomposition of the marginal likelihood of the vector y into the conditional scalar densities,

f(y) = f(y_n \mid y_{[1,n-1]}) \, f(y_{n-1} \mid y_{[1,n-2]}) \cdots f(y_2 \mid y_1) \, f(y_1),

is the equivalent on the multiplicative scale. Therefore, the mean one-step-ahead log-score can be computed directly from the marginal likelihood in the conjugate change point model as LogS = −(1/n) log f(y). Since this is a strictly monotone transformation, the model comparison based on the one-step-ahead log-score is equivalent to that based on the marginal likelihood. Another consequence is that the mean one-step-ahead log-score of the reversed time series is identical to the log-score of the original time series, because the assumed prior and likelihood are invariant to the time direction, and so the marginal likelihood is identical.
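Explicitly, taking logarithms in this decomposition (with the convention f(y_1 | y_{[1,0]}) := f(y_1)) gives

-\log f(y) = \sum_{t=0}^{n-1} \{-\log f(y_{t+1} \mid y_{[1,t]})\}, \qquad \text{and hence} \qquad \mathrm{LogS} = \frac{1}{n} \sum_{t=0}^{n-1} \{-\log f(y_{t+1} \mid y_{[1,t]})\} = -\frac{1}{n} \log f(y).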

The estimation of the logarithmic scores using the Monte Carlo approach on page 14 is nevertheless sensible, because it allows us to assess the Monte Carlo error, which also contributes to the difference between the exact sampling and the approximate sampling results.

We have compared the analytical one-step-ahead log-scores with the corresponding exact sampling log-scores for all examined models in the case studies from sections 3.4.2, 3.5.2 and 3.6.2. The maximum absolute differences for the three sections were 0.051, 0.065 and 0.1, while the mean deviations were 0.004, 0.003 and 0.005, respectively. These Monte Carlo errors are very small compared to the approximate sampling errors, with maximum deviations of 1.924, 1.705 and 1.253, and mean deviations of 0.107, 0.098 and 0.079, respectively, in the three sections when all models are pooled. For illustration, we show comparison plots of exact sampling versus analytical log-scores for three selected models in Figure 3.2. Only in panel (b) do some points in the upper right corner lie slightly off the identity line.

Figure 3.2 – Comparison of analytical one-step-ahead log-scores (x-axis) and corresponding exact sampling log-scores (y-axis), for three models from sections 3.4.2, 3.5.2 and 3.6.2.

[Figure 3.2 panels: (a) coal-mining data, model 1; (b) Tokyo rainfall data, model 2; (c) Nile discharge data, model 3.]