
As a consequence, the error signals of Eq. 5.8 become meaningless as well. The outlined problem of long tanh(.) chains can be solved by a re-normalization of the gradients at each step of the backpropagation [NZ98, ZN01, p. 396 and p. 325]. Later on, we will introduce a proper learning rule called vario-eta, which includes such a re-normalization of the gradients [NZ98, p. 396].
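As an illustration only (the vario-eta rule itself is introduced later), the following sketch re-scales the backpropagated error signal to unit length after every layer of a tanh(.) chain. The layer layout, variable names and the plain unit-length re-normalization are assumptions of this example, not a restatement of the text.

```python
import numpy as np

def backpropagate_with_renorm(delta_out, weights, netins, renorm=True):
    """Propagate an error signal backwards through a chain of tanh layers.

    `weights[i]` connects layer i to layer i+1; `netins` holds the net
    inputs of the hidden layers (bottom to top).  With `renorm` enabled,
    the error signal is re-scaled to unit length after every step, so it
    cannot die out along a long tanh(.) chain.  Illustrative sketch only.
    """
    delta = delta_out
    for W, netin in zip(reversed(weights), reversed(netins)):
        delta = (W.T @ delta) * (1.0 - np.tanh(netin) ** 2)  # standard step
        if renorm:
            norm = np.linalg.norm(delta)
            if norm > 0.0:
                delta = delta / norm   # re-normalize the gradient signal
    return delta
```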

3. SHARED WEIGHTS EXTENSION OF ERROR BACKPROPAGATION

A recurrent neural network can be transformed into an equivalent feedforward architecture by unfolding it over time [RHW86, p. 354-55]. The unfolding in time of a recurrent neural network into a corresponding feedforward architecture is illustrated in Fig. 5.5 [RHW86, p. 355].

Figure 5.5 Time-delay recurrent neural network (see Eq. 4.5), left, and the corresponding unfolding in time neural network architecture, right [RHW86, p. 355].

If we ignore that the internal state s_t of the recurrent neural network of Fig. 5.5, left, depends on the previous state s_{t-1}, i.e. assuming that s_t = f(u_t), we are back in the framework of feedforward modeling [ZN01, p. 322]. The spatial unfolding in time architecture of Fig. 5.5, right, adapts this characteristic of the recurrent neural network:

At each time step τ of the unfolding, we have the same non-recursive structure, i.e. a hidden state s_τ depends only on the external influences u_τ. In order to model the recursive structure, we establish connections between the hidden states s_{t-4}, ..., s_t of the spatial unfolding network with respect to their chronological order. Now, the hidden state s_τ of the unfolding network depends on the previous hidden state s_{τ-1} and the external influences u_τ. By this, we approximate the behavior of the recurrent neural network (Fig. 5.5, left) up to a sufficient level [RHW86, p. 354-56].
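The forward pass of such a truncated unfolding can be sketched as follows. The concrete state transition s_τ = tanh(A·s_{τ-1} + B·u_τ) with output y_τ = C·s_τ, as well as the dimensions used here, are assumptions chosen for illustration (cf. Eq. 4.5), not a literal restatement of it.

```python
import numpy as np

def unfold_forward(A, B, C, inputs, s_start):
    """Forward pass of the finite unfolding in time network.

    The same matrices A, B and C (the shared weights) are applied at
    every unfolding step; `inputs` holds u_{t-4}, ..., u_t.
    """
    s = s_start
    states, outputs = [], []
    for u in inputs:                    # steps t-4, ..., t
        s = np.tanh(A @ s + B @ u)      # s_tau depends on s_{tau-1} and u_tau
        states.append(s)
        outputs.append(C @ s)           # y_tau
    return states, outputs

# hypothetical dimensions: state 3, external inputs 2, output 1
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
us = [rng.normal(size=2) for _ in range(5)]       # u_{t-4}, ..., u_t
states, ys = unfold_forward(A, B, C, us, np.zeros(3))
```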

It is important to note that the weight matrices A, B and C of the unfolding neural network of Fig. 5.5, right, are identical at each step of the unfolding in time. In other words, the matrices A, B and C share the same memory for storing their weights. Hence, A, B and C are called shared weight matrices. The usage of shared weights ensures that the unfolding network behaves as in the recurrent case [Hay94, RHW86, ZN01, p. 751-52, p. 355 and p. 323-4].
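In an implementation, "sharing the same memory" simply means that every unfolding step references one and the same parameter array; a toy illustration (names hypothetical):

```python
import numpy as np

A = np.zeros((3, 3))                 # one block of memory for matrix A
unfolded_A = [A, A, A, A, A]         # every unfolding step references it

A[0, 0] = 0.5                        # one weight update ...
assert all(step[0, 0] == 0.5 for step in unfolded_A)   # ... affects all steps
```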

The unfolding in time neural network (Fig. 5.5, right) can be handled by a so-called shared weights extension of the standard error backpropagation algorithm.

Let us exemplify the integration of shared weights into the backpropagation algorithm by examining a 3-layer feedforward neural network (see Fig. 5.1). To adapt the idea of shared weights, we apply an additional constraint to the 3-layer network: we assume that the network weights w_kj and w_ji are identical. This means that the weights w_kj connecting the hidden to the output layer are equal to the connections w_ji between the input and hidden layer. By this constraint, we also presume appropriate dimensions of the neural network layers. More formally, the shared weights constraint can be written as

$$ w = w_{kj} = w_{ji} . \qquad (5.11) $$

Given the constraint of Eq. 5.11, the partial derivative of the network's overall error function (Eq. 5.1) with respect to a particular weight w can be obtained by applying the chain rule and the product rule:^5

$$
\begin{aligned}
\frac{\partial MSE}{\partial w}
&= \frac{1}{T} \sum_{t=1}^{T} \left[ \left(out^2_k - tar_k\right) \frac{\partial out^2_k}{\partial netin^2_k}\, \frac{\partial netin^2_k}{\partial w_{kj}}
 + \left(out^2_k - tar_k\right) \frac{\partial out^2_k}{\partial netin^2_k}\, \frac{\partial netin^2_k}{\partial out^1_j}\, \frac{\partial out^1_j}{\partial netin^1_j}\, \frac{\partial netin^1_j}{\partial w_{ji}} \right] \\
&= \frac{1}{T} \sum_{t=1}^{T} \left[ \left(out^2_k - tar_k\right) f'(netin^2_k)\, out^1_j
 + f'(netin^1_j) \left( \sum_{k=1}^{n} w_{kj}\, f'(netin^2_k) \left(out^2_k - tar_k\right) \right) out^0_i \right] . \qquad (5.12)
\end{aligned}
$$

Since we assume that the network is fully connected, we have a set of k·j = j·i partial derivatives. As in the standard backpropagation algorithm, we simplify the notation of the partial derivatives (Eq. 5.12) by referring to the auxiliary terms δ^2_k and δ^1_j of Eq. 5.13 and Eq. 5.14:

$$ \delta^2_k = f'(netin^2_k)\left(out^2_k - tar_k\right) , \qquad (5.13) $$
$$ \delta^1_j = f'(netin^1_j) \sum_{k=1}^{n} w_{kj}\, \delta^2_k . \qquad (5.14) $$

Substituting the auxiliary terms of Eq. 5.13 and Eq. 5.14 into Eq. 5.12 and re-arranging terms yields

$$ \frac{\partial MSE}{\partial w} = \frac{1}{T} \sum_{t=1}^{T} \left[ \delta^2_k \cdot out^1_j + \delta^1_j \cdot out^0_i \right] . \qquad (5.15) $$

^5 According to the product rule, the derivative of y = f(x)·g(x) is ∂y/∂x = f(x)·∂g/∂x + g(x)·∂f/∂x.

As stated in Eq. 5.15, the overall partial derivative of the network error function with respect to a particular weight w consists of two blocks: The first block describes the local partial derivative δ^2_k · out^1_j of the error function with respect to the weights w_kj located in the upper network level. The second block refers to the weights w_ji of the lower network level. Here, the local partial derivative is computed by δ^1_j · out^0_i. Thus, the overall partial derivative of the error function with respect to a weight w is computed by summing the local partial derivatives [RHW86, p. 356].
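For a single training pattern (the 1/T averaging of Eq. 5.15 is dropped), Eqs. 5.13-5.15 can be written down directly. The sketch below assumes a square tied weight matrix W = w_kj = w_ji and tanh transfer functions in both layers; these are illustrative choices, not details fixed by Fig. 5.1.

```python
import numpy as np

def shared_weight_gradient(W, x, target):
    """Gradient of the error w.r.t. the tied weight matrix W (Eq. 5.11).

    W is used both as input-to-hidden and hidden-to-output weights, so the
    gradient is the sum of the two local derivatives of Eq. 5.15.
    """
    # forward pass (superscripts 0, 1, 2 denote the layers)
    out0 = x
    out1 = np.tanh(W @ out0)                      # hidden layer
    out2 = np.tanh(W @ out1)                      # output layer

    # backward pass, Eq. 5.13 and Eq. 5.14 (tanh' = 1 - tanh^2)
    delta2 = (1.0 - out2 ** 2) * (out2 - target)  # f'(netin2) * (out2 - tar)
    delta1 = (1.0 - out1 ** 2) * (W.T @ delta2)   # f'(netin1) * sum_k w_kj delta2_k

    # Eq. 5.15: sum of the two local partial derivatives
    grad_upper = np.outer(delta2, out1)           # delta2_k * out1_j
    grad_lower = np.outer(delta1, out0)           # delta1_j * out0_i
    return grad_upper + grad_lower
```

The two outer products are exactly the local derivatives of the upper and the lower network level; summing them implements the shared weights constraint of Eq. 5.11.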

According to the calculations of the partial derivatives (Eq. 5.15), the standard backpropagation algorithm (Fig. 5.2) has to be modified. Fig. 5.6 depicts the shared weights extension of the backpropagation algorithm in case of the 3-layer neural network constrained by Eq. 5.11.^6

Figure 5.6 Shared weights extension of the standard backpropagation algorithm in case of a single training sample.

As shown in Fig. 5.6, the local partial derivatives of the weights w_kj and w_ji are computed by using information of the forward and the backward pass of the network. This corresponds to the standard backpropagation algorithm (Fig. 5.2). In contrast to the standard approach, the local partial derivatives are stored along the backward pass of the network for each weight separately. After the backward pass is finished, the resulting sum of all local partial derivatives determines the overall partial derivative of the error function with respect to the concerned weight w [RHW86, p. 356].

^6 For simplicity, we omit bias connections to the hidden and the output layer. For the inclusion of bias weights see Bishop (1995) [Bis95, p. 142].
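That the summed local derivatives equal the overall derivative can be verified numerically with a central finite difference. The following self-contained check assumes a squared error with a factor 1/2 (so that the δ-terms match Eq. 5.13) and small illustrative dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, 4))                 # tied weight matrix, Eq. 5.11
x, tar = rng.normal(size=4), rng.normal(size=4)

def error(W):
    """Squared error of the tied 3-layer network (factor 1/2 assumed)."""
    out1 = np.tanh(W @ x)
    out2 = np.tanh(W @ out1)
    return 0.5 * np.sum((out2 - tar) ** 2)

# analytic gradient: sum of the two local derivatives (Eq. 5.15)
out1 = np.tanh(W @ x)
out2 = np.tanh(W @ out1)
delta2 = (1 - out2 ** 2) * (out2 - tar)           # Eq. 5.13
delta1 = (1 - out1 ** 2) * (W.T @ delta2)         # Eq. 5.14
analytic = np.outer(delta2, out1) + np.outer(delta1, x)

# central finite difference for one entry of the shared weight matrix
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (error(Wp) - error(Wm)) / (2 * eps)
assert abs(numeric - analytic[0, 0]) < 1e-6       # both derivatives agree
```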

Having explained the shared weights extension of standard backpropagation on the basis of a simple feedforward neural network, we now deal with its application to an unfolding in time neural network. The unfolding in time architecture is similar to the one depicted in Fig. 5.5, right. For simplicity, the unfolding is truncated after three time steps t, t-1 and t-2. The shared weights of this network are (i.) w^A = w^A_ji = w^A_iu, (ii.) w^B = w^B_jh = w^B_ig = w^B_uw and (iii.) w^C = w^C_lj = w^C_ki = w^C_su. Fig. 5.7 illustrates the resulting backpropagation algorithm in case of a single training pattern.

As shown in Fig. 5.7, the computation of the partial derivatives is analogous to the case of the shared weights 3-layer neural network (Fig. 5.6). As in the description of Fig. 5.6, we simplify the notation by the auxiliary terms δ. The upper indices correspond to the time steps t, t-1 and t-2 of the finite unfolding.

The local partial derivatives are derived by a combination of the forward and the backward flow of the unfolding in time network (Fig. 5.7). The overall partial derivatives of the error function with respect to the shared weights w^A, w^B or w^C are obtained by summing the local partial derivatives for each shared weight separately [RHW86, p. 355-56].

The partial derivative of the network error function with respect to the shared weights w^B is composed of three local derivatives. These derivatives are collected at the different steps t, t-1 and t-2 of the unfolding. The same is valid for the shared connector w^C. Since we truncated the unfolding after three time steps, the shared connector w^A is only applied twice (i.e. w^A_ji and w^A_iu). Thus, the overall partial derivative of the error function with respect to the shared weights w^A is only composed of the local derivatives δ^t_j · out^{t-1}_i and δ^{t-1}_i · out^{t-2}_u.
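A sketch of this accumulation for a three-step unfolding is given below. The transition s_τ = tanh(A·s_{τ-1} + B·u_τ), the output y_τ = C·s_τ compared against a target at every step, and a squared error with a factor 1/2 are assumptions of the example; at the earliest step the state is formed from the external input alone, so A contributes one local derivative less than B and C, as discussed above.

```python
import numpy as np

def shared_weight_gradients(A, B, C, inputs, targets):
    """Backward pass through a finite unfolding with shared weights A, B, C.

    One local derivative is computed per unfolding step and summed into a
    single gradient per shared weight matrix (cf. Fig. 5.7).
    """
    # forward pass: the earliest state has no predecessor inside the
    # truncation, so A is applied one time less than B and C
    states, prev_states = [], []
    s = None
    for u in inputs:
        prev_states.append(s)
        net = B @ u if s is None else A @ s + B @ u
        s = np.tanh(net)
        states.append(s)

    grad_A, grad_B, grad_C = np.zeros_like(A), np.zeros_like(B), np.zeros_like(C)
    delta_next = np.zeros(A.shape[0])              # error arriving from step tau+1
    for u, s, s_prev, tar in zip(reversed(inputs), reversed(states),
                                 reversed(prev_states), reversed(targets)):
        err = C @ s - tar                          # output error at step tau
        grad_C += np.outer(err, s)                 # local derivative for C
        delta = (1 - s ** 2) * (C.T @ err + A.T @ delta_next)
        grad_B += np.outer(delta, u)               # local derivative for B
        if s_prev is not None:
            grad_A += np.outer(delta, s_prev)      # local derivative for A
        delta_next = delta
    return grad_A, grad_B, grad_C

# three-step unfolding (t-2, t-1, t) with hypothetical dimensions
rng = np.random.default_rng(2)
A, B, C = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
us = [rng.normal(size=2) for _ in range(3)]
tars = [rng.normal(size=1) for _ in range(3)]
gA, gB, gC = shared_weight_gradients(A, B, C, us, tars)
```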

The application of the extended backpropagation algorithm to more sophisticated unfolding in time neural networks, e.g. error correction neural networks [ZNG01a, p. 247-49], is straightforward.

Figure 5.7 The shared weights extension of standard backpropagation in case of a finite unfolding in time neural network with shared weights w^A, w^B, w^C. (The resulting partial derivatives of the error function are depicted at the bottom of the network architecture. For simplicity, (shared) bias connections are omitted.)