
2.4 CHOSEN ANN MODELS

2.4.1 The Backpropagation

The Backpropagation neural network (Werbos, 1974; Parker, 1985; LeCun, 1985; Rumelhart, Hinton and Williams, 1986) is a multilayer, feedforward network using a supervised learning mode based on error correction. Minsky and Papert (1969) provided a very careful analysis of the capabilities of the neural models proposed until then. At that time, the Perceptron offered a very simple guaranteed learning rule (the delta rule; Widrow and Hoff, 1960) for all problems that can be solved by a two-layer feedforward network. What Minsky and Papert (1969) showed was that this learning rule was capable of solving only linearly separable problems. Later, Rumelhart, Hinton and Williams (1986) showed that the addition of hidden neurons to the neural network architecture permits a mapping of the problem representation such that non-linearly separable problems can be solved.

The problem, at that moment, was that there was no specified learning rule to cope with the hidden neurons. The solutions that came up later to solve this problem were the following three:

• The setting of the weight values of the hidden neurons by hand, assuming some reasonable performance;

• Competitive learning, where unsupervised learning rules are applied in order to develop the hidden neurons automatically;

• The creation of a learning procedure capable of learning an internal representation (using the hidden neurons) that is adequate for performing the task at hand.

The Generalized Delta Rule implements the last of these approaches: a supervised learning algorithm based on error correction in a multilayer neural network, known as the Backpropagation neural network. The proposed learning procedure involves the presentation of a set of input and output patterns to the neural network. The input patterns typically correspond to a sample of the real patterns. For each input pattern one output pattern is determined; the output patterns are the known classifications of the corresponding input patterns. The patterns are represented in the form of vectors. When learning, the system first uses the input vector to produce its own output vector and then compares this with the desired output (the output pattern). If there is no difference, no learning takes place; if there is a difference, the weights of the network are changed in order to reduce this difference.

The generalized delta rule minimizes the squares of the differences between the actual and the desired output values, summed over the output neurons and over all pairs of input/output vectors. There are many ways of deriving the delta rule; detailed derivations can be found in Rumelhart et al. (1986).

Figure 2.9 shows a generic Backpropagation neural network architecture with one layer of hidden neurons. The network is fully connected: all neurons of one layer are connected to all neurons of the next layer. The number of neurons in the input layer corresponds to the size of the input vector, and the number of neurons in the output layer corresponds to the size of the output vector. The Backpropagation network may have more than one layer of hidden neurons. The number of neurons in the hidden layers may define the capability of the network to map the problem properly. Determining the number of hidden neurons and the number of hidden layers is usually very difficult; typically, the network developer or user determines them empirically.

(Figure: a fully connected network with an input layer, one hidden layer and an output layer; input patterns are applied to the input layer and output patterns are produced at the output layer.)

Figure 2.9 – Generic Backpropagation network
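As an illustrative sketch (the layer sizes and weight ranges below are assumptions, not values from the text), the fully connected architecture of Figure 2.9 can be represented by two weight matrices and two bias vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes; in practice they depend on the application
n_input, n_hidden, n_output = 4, 3, 2

# w_ji: weight from input unit i to hidden unit j; theta_j: hidden biases
w_hidden = rng.uniform(-0.5, 0.5, size=(n_hidden, n_input))
theta_hidden = np.zeros(n_hidden)

# w_kj: weight from hidden unit j to output unit k; theta_k: output biases
w_output = rng.uniform(-0.5, 0.5, size=(n_output, n_hidden))
theta_output = np.zeros(n_output)

print(w_hidden.shape, w_output.shape)  # (3, 4) (2, 3)
```

Row j of `w_hidden` holds all incoming weights of hidden neuron j, which makes the net-input computation of the following equations a single matrix-vector product.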

Next, a sequence of equations will be introduced showing the mathematical description of the generalized delta rule or the Backpropagation algorithm.

First, the input vector x_p = (x_p1, x_p2, …, x_pN)^t is applied to the input layer of the network. The input neurons simply pass the input values on to the hidden neurons via the connection synapses. Each hidden neuron then calculates its net input using Equation 2.8:

net_pj = Σ_i w_ji x_pi + θ_j

Equation 2.8 – Hidden neurons net

In Equation 2.8, p is the pattern being learned, j denotes the jth hidden unit, w_ji is the weight on the connection from the ith input unit to the jth hidden unit, and θ_j is the bias value. The bias is useful to accelerate the network convergence. Equation 2.9 gives the activation of the hidden neuron.

i_pj = f_j(net_pj)

Equation 2.9 – Hidden neurons activation
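As a small sketch of Equations 2.8 and 2.9 (all numeric values below are illustrative assumptions):

```python
import numpy as np

def sigmoid(net):
    # A common choice for the activation function f (see Equation 2.13)
    return 1.0 / (1.0 + np.exp(-net))

x_p = np.array([0.5, 0.8])              # input vector x_p (illustrative)
w_ji = np.array([[0.1, -0.2],
                 [0.4,  0.3]])          # hidden-layer weights (illustrative)
theta_j = np.array([0.05, -0.05])       # bias values theta_j

net_pj = w_ji @ x_p + theta_j           # Equation 2.8
i_pj = sigmoid(net_pj)                  # Equation 2.9
```

The matrix-vector product computes the weighted sum of Equation 2.8 for every hidden neuron at once.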

The calculation of the output neurons' net input (net_pk) and the corresponding output value (o_pk) is analogous, as shown in Equations 2.10 and 2.11. In these equations, the only difference from Equations 2.8 and 2.9 is that the index k denotes the kth output unit.

net_pk = Σ_j w_kj i_pj + θ_k

Equation 2.10 – Output neurons net

o_pk = f_k(net_pk)

Equation 2.11 – Output neurons activation

Function f can assume several forms, as seen previously in the activation functions exemplified in Figure 2.3 (Activation functions). Typically, two forms are of interest: the linear and the sigmoid output functions, given by Equations 2.12 and 2.13 respectively. The same function forms are valid for both hidden and output neurons.

f_k(net_pk) = net_pk

Equation 2.12 – Linear output function

f_k(net_pk) = 1 / (1 + e^(−net_pk))

Equation 2.13 – Sigmoid output function
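A minimal sketch of the two output functions of Equations 2.12 and 2.13:

```python
import numpy as np

def f_linear(net):
    # Equation 2.12: the output equals the net input
    return net

def f_sigmoid(net):
    # Equation 2.13: squashes any net input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-net))

net = np.array([-2.0, 0.0, 2.0])
print(f_linear(net))   # unchanged: [-2.  0.  2.]
print(f_sigmoid(net))  # values in (0, 1); f_sigmoid(0) == 0.5
```

The sigmoid is the more common choice for hidden neurons because, unlike the linear function, it introduces the non-linearity needed to solve non-linearly separable problems.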

After propagating the input signals through the network as described by the equations above, the local gradients for the output and hidden layers are calculated. This calculation is based on the difference between the actual and the desired output, which is the principle of supervised learning, as already explained. The local gradients for the hidden neurons are calculated before the connection weights to the output layer neurons are updated. Equation 2.14 shows the calculation of the local gradients for the output neurons.

δ_pk = (y_pk − o_pk) f′_k(net_pk)

Equation 2.14 – Output neurons local gradients function

In Equation 2.14, y_pk denotes the desired output of the kth output unit for pattern p.

Equation 2.15 shows the calculation of the local gradients for the hidden neurons.

δ_pj = f′_j(net_pj) Σ_k δ_pk w_kj

Equation 2.15 – Hidden neurons local gradients function

Equation 2.16 shows the function f′, the derivative of the sigmoid activation function with respect to its total input, net_pk. The function shown is the one used for the output neurons; the same is valid for the hidden neurons, replacing index k with index j.

f′_k(net_pk) = o_pk (1 − o_pk)

Equation 2.16 – Sigmoid function derivative
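Equations 2.14 to 2.16 can be sketched as follows (the numeric values are illustrative assumptions):

```python
import numpy as np

y_pk = np.array([1.0, 0.0])        # desired outputs (illustrative)
o_pk = np.array([0.7, 0.3])        # actual outputs
i_pj = np.array([0.6, 0.4])        # hidden activations
w_kj = np.array([[0.2, -0.1],
                 [0.5,  0.3]])     # output-layer weights

# Equation 2.16: sigmoid derivative f'(net) = o (1 - o)
fprime_out = o_pk * (1.0 - o_pk)
fprime_hid = i_pj * (1.0 - i_pj)

delta_pk = (y_pk - o_pk) * fprime_out         # Equation 2.14
delta_pj = fprime_hid * (w_kj.T @ delta_pk)   # Equation 2.15
```

Note how the transposed weight matrix carries the output-layer gradients back to the hidden neurons, which is exactly the "backward" step that gives the algorithm its name.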

Having calculated the local gradients for all neurons in the output and hidden layers, the algorithm goes backwards, recalculating the weights of the neural network based on those local gradients. It is necessary to calculate the negative of the gradient of E_p (the error for the example pattern p) with respect to the weights w_kj in order to determine the direction in which to change the weights. The weights are then adjusted so that the total error is reduced. Equation 2.17 calculates the error term E_p, which is useful to determine how well the network is learning.

When the error is acceptably small for each of the training examples, the training can be stopped. Figure 2.10 exemplifies a hypothetical error surface in weight space, its local and global minima and the gradient descent.

E_p = ½ Σ_k (y_pk − o_pk)²

Equation 2.17 – Error term
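The error term of Equation 2.17 might be computed as in this sketch (values are illustrative):

```python
import numpy as np

y_pk = np.array([1.0, 0.0])   # desired outputs for pattern p (illustrative)
o_pk = np.array([0.7, 0.3])   # actual outputs

# Equation 2.17: half the sum of squared differences over the output units
E_p = 0.5 * np.sum((y_pk - o_pk) ** 2)
print(E_p)  # ≈ 0.09
```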

(Figure: the error E_p plotted over the weight space, showing a local minimum and the global minimum Z_min.)

Figure 2.10 – Hypothetical error surface

The constant that determines the size of the error-correction steps is the learning rate η. The larger this constant, the larger the changes in the weights. The chosen learning rate should be as large as possible without leading to oscillation. One way to avoid oscillation is to add to the generalized delta rule a momentum term with a parameter α. Equation 2.18 then gives the weight change for the connections between the output and hidden neurons at iteration t.

Δ_p w_kj(t) = η δ_pk i_pj + α Δ_p w_kj(t−1)

Equation 2.18 – Weight change magnitude for connections between output and hidden neurons

Equation 2.19 gives the update of the weight value for the connections between the output neurons and the hidden neurons.

w_kj(t+1) = w_kj(t) + Δ_p w_kj(t)

Equation 2.19 – Weight update for connections between output and hidden neurons
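Equations 2.18 and 2.19 together give one weight update for the output layer; a sketch, with assumed values for η, α and the gradients:

```python
import numpy as np

eta, alpha = 0.5, 0.9                  # learning rate and momentum (assumed)

delta_pk = np.array([0.1, -0.2])       # output local gradients (illustrative)
i_pj = np.array([0.6, 0.4])            # hidden activations
w_kj = np.zeros((2, 2))                # output-layer weights
dw_prev = np.zeros((2, 2))             # previous weight change, Δp w_kj(t-1)

# Equation 2.18: momentum-smoothed weight change
dw = eta * np.outer(delta_pk, i_pj) + alpha * dw_prev

# Equation 2.19: apply the change
w_kj = w_kj + dw
```

Because `dw_prev` is zero here, the momentum term contributes nothing on this first step; on later steps it reuses a fraction α of the previous change, smoothing the trajectory through weight space.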

Similarly, Equation 2.20 gives the weight change for the connections between the hidden and the input neurons at iteration t.

Δ_p w_ji(t) = η δ_pj x_pi + α Δ_p w_ji(t−1)

Equation 2.20 – Weight change magnitude for connections between hidden and input neurons

Then Equation 2.21 gives the update of the weight value for the connections between the hidden neurons and the input neurons.

w_ji(t+1) = w_ji(t) + Δ_p w_ji(t)

Equation 2.21 – Weight update for connections between hidden and input neurons

Once the network learning reaches a minimum, either local or global, the learning stops. If it reaches a local minimum, the error produced at the network outputs may still be unacceptably high. The solution then is to restart the network learning from scratch with new connection weights. If this still does not solve the problem, the number of neurons in the hidden layer can be changed, or the learning parameters (learning rate and momentum) can be adjusted.
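Putting all the equations together, a minimal training loop might look like the sketch below; the XOR task, layer sizes and learning parameters are assumptions chosen only for illustration:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(1)

# XOR: a classic non-linearly separable problem (illustrative choice)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

n_in, n_hid, n_out = 2, 4, 1
W1 = rng.uniform(-1, 1, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.uniform(-1, 1, (n_out, n_hid)); b2 = np.zeros(n_out)
eta, alpha = 0.5, 0.8                      # learning rate and momentum
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)

errors = []                                # error summed over the pattern set
for epoch in range(2000):
    total = 0.0
    for x, y in zip(X, Y):
        i_h = sigmoid(W1 @ x + b1)                     # Eqs 2.8-2.9
        o = sigmoid(W2 @ i_h + b2)                     # Eqs 2.10-2.11
        d_o = (y - o) * o * (1.0 - o)                  # Eq 2.14
        d_h = i_h * (1.0 - i_h) * (W2.T @ d_o)         # Eq 2.15
        dW2 = eta * np.outer(d_o, i_h) + alpha * dW2   # Eq 2.18
        dW1 = eta * np.outer(d_h, x) + alpha * dW1     # Eq 2.20
        W2 += dW2; b2 += eta * d_o                     # Eqs 2.19, 2.21
        W1 += dW1; b1 += eta * d_h
        total += 0.5 * np.sum((y - o) ** 2)            # Eq 2.17
    errors.append(total)
```

After training, the outputs can be thresholded at 0.5 to recover the XOR classifications; if the error stays high, the restart strategy described above applies.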

Important aspects to consider in the object model for the Backpropagation are:

• The input nodes just pass the information through. Usually this information is a normalized set of values between 0 and 1, resulting from a pre-processing of the input data.

• The hidden and the output neurons implement the Perceptron functionality where the dot product is applied and a nonlinear function gives the neuron activation.

• In the original model the connections among the neurons are feedforward. There are no feedback connections and the network is fully connected.

• The number of input, hidden and output neurons is determined based on the expert knowledge on the application domain.

• Even without feedback connections, the backward propagation of the error terms is a computation in the reverse direction. In fact, the backward computation is as important as the forward computation in this model, making it an important and interesting characteristic to be modeled.

The Backpropagation equations presented here are from Freeman (1992).