

2. ERROR CORRECTION NEURAL NETWORKS (ECNN)

If we are in the position of having a complete description of all external forces influencing a deterministic system, the equations in Eq. 4.6 would allow us to identify the temporal relationships by setting up a memory in the form of a state transition equation [MC01, Hay94, p. 44-5 and p. 641-6, 666]. Unfortunately, our knowledge about the external forces is typically incomplete, or our observations might be noisy. Under such conditions, learning with finite datasets leads to the construction of incorrect causalities due to learning by heart (overfitting). The generalization properties of such a model are questionable [NZ98, p. 373-4].

$$
\begin{aligned}
s_t &= f(s_{t-1}, u_t), \\
y_t &= g(s_t).
\end{aligned}
\tag{4.6}
$$

If we are unable to identify the underlying system dynamics due to insufficient input information or unknown influences, we can refer to the observed model error at time period $t-1$, which can be interpreted as an indicator that our model is misleading. Handling the latter error information as an additional input, we extend Eq. 4.6 and obtain Eq. 4.7:

$$
\begin{aligned}
s_t &= f(s_{t-1}, u_t, y_{t-1} - y_{t-1}^d), \\
y_t &= g(s_t).
\end{aligned}
\tag{4.7}
$$

Keep in mind that if we have a perfect description of the underlying dynamics, the extension of Eq. 4.7 is no longer required, because the observed model error at $t-1$ would be zero. Hence, modeling the dynamical system, one could directly refer to Eq. 4.6. In all other situations the model uses its own error flow as a measurement of unexpected shocks.
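To make the distinction concrete, the following minimal sketch (in Python; the function names and calling conventions are purely hypothetical and not taken from the text) contrasts one transition of the open system of Eq. 4.6 with one transition of the error corrected system of Eq. 4.7:

```python
def step_open(s_prev, u_t, f, g):
    """One transition of the plain state space model (Eq. 4.6)."""
    s_t = f(s_prev, u_t)
    y_t = g(s_t)
    return s_t, y_t

def step_error_corrected(s_prev, u_t, y_prev, y_prev_obs, f_ec, g):
    """One transition of the error corrected model (Eq. 4.7).

    The deviation between the previous model output y_prev and the
    observation y_prev_obs acts as an additional input; it vanishes
    if the description of the dynamics were perfect.
    """
    error = y_prev - y_prev_obs
    s_t = f_ec(s_prev, u_t, error)
    y_t = g(s_t)
    return s_t, y_t
```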

This is similar to the MA part of a linear ARIMA model (ARIMA stands for Autoregressive Integrated Moving Average, a model class which utilizes both linear autoregressive components and stochastic moving average components derived from the observed model error to fit a time series, see [Wei90, p. 71]). As a major difference, the error correction system is a state space model, which accumulates a memory over the state transitions. Hence, there is no need to preset a number of delayed error corrections. Another difference is that ARIMA models include only linear components, whereas our approach is nonlinear.

Our error correction system also bears some resemblance to the so-called Nonlinear AutoRegressive with eXogenous inputs (NARX) recurrent networks [NP90]. These networks "describe a system by using a nonlinear functional dependence between lagged inputs, outputs and / or prediction errors" [MC01, p. 71]. Hence, NARX models can be seen as a generalization of linear ARMA processes [Wei90, p. 56-7].

NARX models are usually approximated by recurrent neural networks, i.e. multi-layer perceptrons incorporating feedback loops from the output to the hidden layer [DO00, p. 336-7]. Dealing with NARX networks, one faces the problem of exploring long-term dependencies in the data [MJ99, p. 136]. As we will show, an error correction neural network is able to handle this issue appropriately (see sec. 2.3). For further details on NARX models, the reader is referred to Medsker and Jain (1999) [MJ99, p. 136-9] or Haykin (1994) [Hay94, p. 746-50].
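For contrast, a NARX-style one-step predictor can be sketched as follows (illustrative only; the lag windows p, q, r and the nonlinear map h are hypothetical and merely mirror the functional form quoted from [MC01, p. 71]):

```python
import numpy as np

def narx_predict(h, y_hist, u_hist, e_hist, p, q, r):
    """NARX-style prediction from fixed lag windows of past outputs,
    exogenous inputs and prediction errors.

    In contrast to a state space model, the memory is limited to the
    p, q and r lags that are passed in explicitly.
    """
    features = np.concatenate([
        np.ravel(y_hist[-p:]),   # lagged outputs
        np.ravel(u_hist[-q:]),   # lagged exogenous inputs
        np.ravel(e_hist[-r:]),   # lagged prediction errors
    ])
    return h(features)
```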

Another resemblance is to Kalman Filters, where the model error is used to improve the system identification [MJ99, Hay96, p. 255 and p. 302-22]. In contrast to the online adaptation in the Kalman approach, we try to identify a fixed nonlinear system which is able to handle external shocks. By using such a system we cannot avoid an error when an external shock appears. Thus, our task is to find a fast adaptation to the new situation. Such a strategy should decrease the learning of false causalities by heart. This will also improve the generalization ability of our model.

2.1 APPROACHING ERROR CORRECTION NEURAL NETWORKS

A first neural network implementation of the error correction equations in Eq. 4.7 can be formulated as

$$
\begin{aligned}
s_t &= \tanh\bigl(A s_{t-1} + B u_t + D (C s_{t-1} - y_{t-1}^d)\bigr), \\
y_t &= C s_t.
\end{aligned}
\tag{4.8}
$$

The term $C s_{t-1}$ recomputes the last output $y_{t-1}$ and compares it to the observed data $y_{t-1}^d$. The matrix transformation $D$ is necessary in order to adjust the different dimensionalities in the state transition equation.
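As a numerical illustration of Eq. 4.8, a single state update may be sketched as follows (the dimensions and the random initialization of $A$, $B$, $C$, $D$ are arbitrary assumptions for illustration, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, input_dim, output_dim = 4, 3, 2   # assumed dimensions

A = rng.normal(scale=0.1, size=(state_dim, state_dim))
B = rng.normal(scale=0.1, size=(state_dim, input_dim))
C = rng.normal(scale=0.1, size=(output_dim, state_dim))
D = rng.normal(scale=0.1, size=(state_dim, output_dim))

def ecnn_step_eq48(s_prev, u_t, y_prev_obs):
    """One transition of Eq. 4.8: the expectation C s_{t-1} is compared
    with the observation y_{t-1}^d and fed back through D."""
    error = C @ s_prev - y_prev_obs              # model error at t-1
    s_t = np.tanh(A @ s_prev + B @ u_t + D @ error)
    y_t = C @ s_t                                # output equation
    return s_t, y_t
```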

It is important to note that the identification of the above model framework is ambiguous, creating numerical problems, because the autoregressive structure, i.e. the dependency of $s_t$ on $s_{t-1}$, could either be coded in the matrix $A$ or in $DC$. To circumvent this problem, one may argue to transform Eq. 4.8 into a well defined form (Eq. 4.9) utilizing $\dot{A} = A + DC$:

$$
\begin{aligned}
s_t &= \tanh\bigl(\dot{A} s_{t-1} + B u_t - D y_{t-1}^d\bigr), \\
y_t &= C s_t.
\end{aligned}
\tag{4.9}
$$

Eq. 4.9 is algebraically equivalent to Eq. 4.8 without inheriting its numerical ambiguity. Unfortunately, by using Eq. 4.9 we lose the explicit information provided by external shocks, since the error correction mechanism is no longer measured by a deviation around zero. Using neural networks with $\tanh(\cdot)$ as a squashing function, our experiments indicate that the numerics works best if the included variables fluctuate around zero, which fits best to the finite state space $(-1, 1)^n$ created by the $\tanh(\cdot)$ nonlinearity [MC01, p. 50-4].

To overcome the drawbacks of our concretizations in Eq. 4.8 and Eq. 4.9, we propose the neural network of Eq. 4.10, formulated in an error correction form measuring the deviation between the expected value $C s_{t-1}$ and the observation $y_{t-1}^d$. The non-ambiguity is a consequence of the additional nonlinearity.

$$
\begin{aligned}
s_t &= \tanh\bigl(A s_{t-1} + B u_t + D \tanh(C s_{t-1} - y_{t-1}^d)\bigr), \\
y_t &= C s_t.
\end{aligned}
\tag{4.10}
$$

The system identification (Eq. 4.11) is a parameter optimization task adjusting the weights of the four matrices $A$, $B$, $C$, $D$ (see chp. 5):

$$
\frac{1}{T} \cdot \sum_{t=1}^{T} \left( y_t - y_t^d \right)^2 \;\rightarrow\; \min_{A,B,C,D}
\tag{4.11}
$$
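The following sketch makes the optimization task explicit by unrolling Eq. 4.10 over an observed sequence and accumulating the squared output errors of Eq. 4.11 (illustrative only; the gradient-based training itself is the subject of chp. 5, and setting the correction to zero at the first unfolding step is an assumption reflecting the missing previous observation there):

```python
import numpy as np

def ecnn_loss(A, B, C, D, u_seq, y_obs_seq, s0):
    """Unroll Eq. 4.10 over T steps and return the error of Eq. 4.11."""
    s = s0
    total = 0.0
    T = len(y_obs_seq)
    for t in range(T):
        # error correction term: deviation of the expectation C s_{t-1}
        # from the previous observation, squashed by tanh (Eq. 4.10)
        if t > 0:
            z_prev = np.tanh(C @ s - y_obs_seq[t - 1])
        else:
            z_prev = np.zeros(C.shape[0])        # no observation before t=1
        s = np.tanh(A @ s + B @ u_seq[t] + D @ z_prev)
        y = C @ s
        total += np.sum((y - y_obs_seq[t]) ** 2)
    return total / T
```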

2.2 UNFOLDING IN TIME OF ERROR CORRECTION NEURAL NETWORKS

Next, we translate the formal description of Eq. 4.10 into a spatial network architecture using unfolding in time and shared weights (Fig. 4.5) [RHW86, Hay94, p. 354-57 and p. 751-6]. The resulting network architecture is called Error Correction Neural Network (ECNN).

The ECNN architecture (Fig. 4.5) is best understood if one analyses the dependency between $s_{t-1}$, $u_t$, $z_{t-1} = C s_{t-1} - y_{t-1}^d$ and $s_t$.

Interestingly, we have two types of inputs to the model: (i.) the external inputs $u_t$, which directly influence the state transition $s_t$, and (ii.) the targets $y_{t-1}^d$, whereby only the difference between the internal expectation $y_{t-1}$ and the observation $y_{t-1}^d$ has an impact on $s_t$. Note that $-Id$ is the fixed negative of an identity matrix.

This design allows an elegant handling of missing values in the series of target vectors: if there is no compensation $y_{t-1}^d$ of the internal expected value $y_{t-1} = C s_{t-1}$, the system automatically generates a replacement.
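A sketch of this convention (our reading of the text; the function name and the encoding of a missing target as None are assumptions):

```python
def correction_input(C, s_prev, y_prev_obs):
    """Error correction input z_{t-1} = C s_{t-1} - y_{t-1}^d.

    If the target y_{t-1}^d is missing, the internal expectation
    C s_{t-1} is used as its replacement, so the correction term
    vanishes and the state transition runs purely autoregressively.
    """
    expectation = C @ s_prev
    if y_prev_obs is None:                 # missing observation
        y_prev_obs = expectation           # replacement generated by the model
    return expectation - y_prev_obs
```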

Figure 4.5 Error Correction Neural Network (ECNN)

A special case occurs at time period $t+1$: at this future point in time we have no compensation $y_{t+1}^d$ of the internal expected value, and thus the system offers a forecast $y_{t+1} = C s_{t+1}$. A forecast of the ECNN is based on a modeling of the recursive structure of the dynamical system (coded in $A$), the external influences (coded in $B$) and the error correction mechanism, which also acts as an external input (coded in $C$, $D$).

The output clusters of the ECNN which generate error signals during the learning phase are $z_{t-\tau}$ and $y_{t+\tau}$. Keep in mind that the target values of the sequence of output clusters $z_{t-\tau}$ are zero, because we want to optimize the compensation mechanism $y_{t-\tau} - y_{t-\tau}^d$ between the expected value $y_{t-\tau}$ and its observation $y_{t-\tau}^d$.

Compared to the finite unfolding in time neural networks (see Fig. 4.3) [CK01, Hay94, p. 19-21 and p. 641-5, 751-5], the ECNN has another advantage: using finite unfolding in time neural networks, we have by definition an incomplete formulation of the accumulated memory in the leftmost part of the network. Thus, the autoregressive modeling is handicapped. In contrast, due to the error correction, the ECNN has an explicit mechanism to handle the shock of the initialization phase.

Recall that the ECNN shares some similarities with ARIMA processes [Wei90, p. 71], Kalman Filters [Hay96, p. 302-22] and NARX networks [MJ99, p. 136-9]. In contrast to these models, the ECNN is designed as a nonlinear state space model, which accumulates a memory over the state transitions. Furthermore, the ECNN is able to explore long-term dependencies in the data. We address this issue in the next subsection.

2.3 COMBINING OVERSHOOTING & ECNN

A combination of the basic ECNN presented in the preceding section 2.1 and the overshooting technique of section 1.3 is shown in Fig. 4.6 [ZN01, p. 326-7].

Figure 4.6 Combining Overshooting and Error Correction Neural Networks

Besides all advantages described in section 1.3, overshooting influences the learning of the ECNN in an extended way. A forecast provided by the ECNN is in general based on a modeling of the recursive structure of a dynamical system (coded in the matrixA) and the error correction mechanism which is acting as an external input (coded in C and D).

Now, the overshooting enforces the autoregressive substructure, allowing long-term forecasts. Of course, in the overshooting environment we have to supply the additional output clusters $y_{t+1}, y_{t+2}, y_{t+3}, \ldots$ with target values. Note that this extension has the same number of parameters as the standard ECNN in Fig. 4.5.
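A sketch of the overshooting part of Fig. 4.6 (names are assumptions; it only iterates the shared matrices $A$ and $C$ beyond time $t+1$, where neither external inputs nor error corrections are available):

```python
import numpy as np

def overshoot_forecasts(A, C, s_future, horizon=3):
    """Autoregressive forecasts y_{t+1}, ..., y_{t+horizon}.

    s_future is the first future state s_{t+1} produced by the unfolded
    ECNN (it already absorbed the last external input u_t and the last
    error correction z_t). Beyond t+1 only the shared matrix A propagates
    the state, so no additional parameters are introduced.
    """
    forecasts = [C @ s_future]          # y_{t+1} = C s_{t+1}
    s = s_future
    for _ in range(horizon - 1):
        s = np.tanh(A @ s)              # s_{t+k+1} = tanh(A s_{t+k})
        forecasts.append(C @ s)         # y_{t+k+1} = C s_{t+k+1}
    return forecasts
```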

2.4 ALTERNATING ERRORS & ECNN

In this section we propose another extension of the basic ECNN of Fig. 4.5, called alternating errors. The inclusion of alternating errors allows us to overcome trend following models. Trend following behavior is a well-known difficulty in time series forecasting. Typically, trend following models underestimate upward trends and overestimate downward trends. Furthermore, trend reversals are usually predicted too late.

More formally, a trend following model can be identified by a sequence of non-alternating model errors $z_\tau = (y_\tau - y_\tau^d)$ [PDP00, p. 295-7].

Thus, by enforcing alternating errors $z_\tau$ we reduce trend following tendencies. This can be achieved by adding a penalty term to the overall error function of the ECNN (Eq. 4.11):

$$
\frac{1}{T} \cdot \sum_{t=1}^{T} \left[ \left( y_t - y_t^d \right)^2 + \lambda \cdot \left( z_t + z_{t-1} \right)^2 \right] \;\rightarrow\; \min_{A,B,C,D}
\tag{4.12}
$$

The additional penalty term in Eq. 4.12 is used to minimize the autocovariance of the residual errors in order to avoid trend following behavior.

Note that the solutions are not sensitive to the choice of $\lambda$.
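A sketch of the extended error function of Eq. 4.12 (illustrative; `residuals` stands for the sequence of correction branches $z_t$ delivered by the unrolled network and `lam` for the weighting factor $\lambda$):

```python
import numpy as np

def alternating_error_loss(residuals, lam):
    """Eq. 4.12: mean squared error plus a penalty on the sums of
    consecutive residuals z_t + z_{t-1}.

    Non-alternating (same-signed) consecutive errors indicate trend
    following behavior and are penalized.
    """
    residuals = np.asarray(residuals)
    T = len(residuals)
    mse = np.sum(residuals ** 2) / T
    pairs = residuals[1:] + residuals[:-1]   # z_t + z_{t-1}
    penalty = np.sum(pairs ** 2) / T
    return mse + lam * penalty
```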

Our experiments with the penalty term of Eq. 4.12 indicate that the additional error flow is able to prevent the learning of trend following models. In Fig. 4.7 we combined the penalty term of Eq. 4.12 with the basic ECNN of Fig. 4.5.

Figure 4.7 Combining Alternating Errors and ECNN. The additional output layers $(y_\tau - y_\tau^d)^2$ are used to compute the penalty term of Eq. 4.12 during the training of the network. We provide the output layers $(y_\tau - y_\tau^d)^2$ with task invariant target values of 0 and apply the mean square error function (Eq. 5.1, [PDP00, p. 413-4]), because we want to minimize the autocovariance of the residuals. Note that we can omit the calculation of the alternating errors in the recall (testing) phase of the network.

The integration of alternating errors into the ECNN is natural: the error correction mechanism of the ECNN provides the model error $z_\tau = (y_\tau - y_\tau^d)$ at each time step of the unfolding, since the model error is required by the ECNN as an additional input. Thus, we connect the output clusters $z_\tau$ in pairs to another output cluster, which uses a squared error function. This is done by using fixed identity matrices $Id$. Note that the additional output layers $(y_\tau - y_\tau^d)^2$ are only required during the training of the ECNN.

Due to the initialization shock, we do not calculate a penalty term for the first pair of model errors. Note that the proposed ECNN of Fig. 4.7 has no additional weights, because we only use the already existing information about the model errors at the different time steps of the unfolding.