
4.3.5 Speeding Up First-order Gradient Descent

Enhancing first-order gradient descent with estimates of step sizes is particularly feasible in RNNs processing long sequences, where the training time is usually high and the weight changes would, on average, tend to point in the same direction for several epochs. Computing an optimal step size²⁷ would not be desirable, because we would certainly end up in a local but not the global optimum, depending on the initial guess of our weights. While approaching an optimum with an estimated step size, the gradient descent approach could explore the weight space slightly differently in every epoch. A heuristic estimate must thus allow for both large jumps and small optimisations. In case the number of learning steps is still high, we can make use of more general techniques for a clever speed-up: steering the error by teacher forcing in gradient descent, imitating stochastic deviation of training steps by adding stochastic noise to the training samples, or implementing gradient descent on computing machines that exploit parallel processing architectures.

²⁷ Just for the sake of argument, though keeping in mind that this is not possible and the computational costs would be tremendously high.

Linearly and Logarithmically Decreasing Learning Rates

The simplest method is to estimate the learning rates based on the network's dimensions and the experience gained with the task or data. For this, we could decrease the learning rate from a good initial guess, e.g. η_max = 1/|I_All|², to a small rate η_min that is appropriate for the desired maximal error [148, 157]. This can be done linearly or logarithmically²⁸ based on good guesses for the maximal number of training epochs θ:

$$ w_{u,i,j} = w_{u-1,i,j} - \left( \eta_{\min} + \frac{\eta_{\max} - \eta_{\min}}{\theta}\,(\theta - u) \right) \Delta w_{i,j} \,, \qquad (4.38) $$

$$ w_{u,i,j} = w_{u-1,i,j} - \left( \eta_{\min} + \frac{\eta_{\max} - \eta_{\min}}{u} \right) \Delta w_{i,j} \,. \qquad (4.39) $$

²⁸ Often called "gain scheduled".
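As a rough sketch of how such a schedule could be computed per epoch (the function name and the concrete values for η_max, η_min and θ are assumptions for illustration, not taken from the text):

```python
def decayed_learning_rate(u, theta, eta_max, eta_min, mode="linear"):
    """Learning rate for epoch u (1 <= u <= theta), cf. Eqs. 4.38 and 4.39."""
    if mode == "linear":
        # Eq. 4.38: decrease linearly from eta_max (at u = 0) towards eta_min (at u = theta)
        return eta_min + (eta_max - eta_min) * (theta - u) / theta
    # Eq. 4.39: decrease proportionally to 1/u, i.e. quickly at first, then slowly
    return eta_min + (eta_max - eta_min) / u

# Example values (assumed): eta_max from the rule of thumb 1/|I_All|^2 with |I_All| = 10
eta_max, eta_min, theta = 1.0 / 10 ** 2, 1e-5, 200
rates = [decayed_learning_rate(u, theta, eta_max, eta_min) for u in range(1, theta + 1)]
```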

Velocity in Learning Rates Using Momentum

For a more informed estimate, we can use the history of learning steps. A particularly successful strategy is to sum up the directions of previous steps and thus increase the velocity of the gradient descent in certain directions [221, 272]. Based on the analogy to physics, we can include the momentum of previous weight changes:

$$ w_{u,i,j} = w_{u-1,i,j} - \left( \rho\,\Delta w_{u-1,i,j} + \eta\,\Delta w_{u,i,j} \right) \,, \qquad (4.40) $$

where the momentum term ρ ∈ [0,1] regulates the magnitude of the previous weight update that is added to the current weight update. For convex optimisation, we can also consider the Nesterov momentum, which includes a correction of poor gradients [272].

Compared to FFNs, we would choose the momentum rather small (around ρ = 0.1) and individually for every weight to avoid divergence, and we would not assume convex functions.
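A minimal NumPy-style sketch of the update in Eq. 4.40, reading Δw_{u−1,i,j} as the previous weight change; the function and variable names are assumptions:

```python
import numpy as np

def momentum_step(w, grad, prev_update, eta=0.01, rho=0.1):
    """One weight update with momentum, cf. Eq. 4.40.

    w, grad, prev_update are arrays of identical shape; rho = 0.1 is the
    conservative momentum suggested above for RNNs (example values assumed).
    """
    update = rho * prev_update + eta * grad
    return w - update, update  # the returned update is prev_update in the next epoch

# Usage: start with prev_update = np.zeros_like(w) and carry `update` across epochs.
```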

Adaptive Resilient Learning Rates for RNNs

Another very successful heuristic optimisation method for FFNs is the Resilient Propagation (RPROP) algorithm suggested by Riedmiller and Braun [232]. For every individual weight, the learning step is adapted based on the change of direction of the first-order derivative with respect to the previous epoch. In particular, individual learning rates η and β are adapted based on the local gradient information.

For this thesis, this approach was adapted for RNNs to also conservatively speed up the training over epochs in which the gradient is steadily descending towards the same minimum. In contrast to the original RPROP, the learning rates are adapted and multiplied directly with the partial derivatives, instead of only using the sign of the partial derivatives to determine the change of the learning step:

$$ \eta_{u,i,j} = \begin{cases} \min\left( \eta_{u-1,i,j} \cdot \xi^{+},\ \eta_{\max} \right) & \text{iff } \dfrac{\partial h_{\mathrm{error},u}}{\partial w_{i,j}} \cdot \dfrac{\partial h_{\mathrm{error},u-1}}{\partial w_{i,j}} > 0 \\ \max\left( \eta_{u-1,i,j} \cdot \xi^{-},\ \eta_{\min} \right) & \text{iff } \dfrac{\partial h_{\mathrm{error},u}}{\partial w_{i,j}} \cdot \dfrac{\partial h_{\mathrm{error},u-1}}{\partial w_{i,j}} < 0 \\ \eta_{u-1,i,j} & \text{otherwise} \end{cases} \,, \qquad (4.41) $$

$$ \beta_{u,i} = \begin{cases} \min\left( \beta_{u-1,i} \cdot \xi^{+},\ \eta_{\max} \right) & \text{iff } \dfrac{\partial h_{\mathrm{error},u}}{\partial b_{i}} \cdot \dfrac{\partial h_{\mathrm{error},u-1}}{\partial b_{i}} > 0 \\ \max\left( \beta_{u-1,i} \cdot \xi^{-},\ \eta_{\min} \right) & \text{iff } \dfrac{\partial h_{\mathrm{error},u}}{\partial b_{i}} \cdot \dfrac{\partial h_{\mathrm{error},u-1}}{\partial b_{i}} < 0 \\ \beta_{u-1,i} & \text{otherwise} \end{cases} \,, \qquad (4.42) $$

where ξ⁺ ∈ ]1,∞[ and ξ⁻ ∈ ]0,1[ are the increasing and decreasing factors, respectively, and η_max > η_min are the upper and lower bounds for both learning rates η and β. If the partial derivative of the current epoch u points in the same direction as in the former epoch u−1, then the learning rate is increased. If the partial derivative points in the opposite direction, then the minimum has been missed and the learning rate is decreased.

Similar to RPROP in FFNs, the adapted approach for RNNs cannot guarantee global convergence and might be slow for complex problems [13]. For this reason, it is important to choose the parameters more conservatively. Rather than adopting the original parameter values (ξ⁺ = 1.2 and ξ⁻ = 0.5), more careful speed-ups (e.g. ξ⁺ = 1.01 and ξ⁻ = 0.96) should be considered [242]. In particular, since in an RNN a single weight might be important not only for a number of patterns but also for a number of time steps, such a careful setting is necessary when training many complex sequences.
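A per-weight sketch of the adaptation in Eqs. 4.41 and 4.42 (NumPy-style, with the conservative ξ⁺/ξ⁻ values from above; the function name, array shapes, and bounds are assumptions):

```python
import numpy as np

def adapt_learning_rates(eta, grad, prev_grad,
                         xi_plus=1.01, xi_minus=0.96,
                         eta_min=1e-6, eta_max=0.1):
    """Adapt per-weight (or per-bias) learning rates, cf. Eqs. 4.41/4.42.

    eta, grad, prev_grad are arrays of identical shape holding the current
    rates and the partial derivatives of the error in epochs u and u-1.
    """
    same_direction = grad * prev_grad          # > 0: same sign, < 0: sign change
    eta = np.where(same_direction > 0, np.minimum(eta * xi_plus, eta_max), eta)
    eta = np.where(same_direction < 0, np.maximum(eta * xi_minus, eta_min), eta)
    return eta

# In contrast to the original RPROP, the adapted rates are then multiplied
# directly with the partial derivatives, e.g. w -= adapt_learning_rates(...) * grad
```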

Teacher Forcing

A generally very effective method to control the vanishing of gradients in RNNs is to artificially provide an error with respect to the desired activity in every training step [65, 299]. This is achieved by forcing the desired activity of the output neuron into the actual activity within the forward pass of the BP approach, and thus determining an error for the respective time step as if the processing up to this time step had been correct:

$$ x \leftarrow \alpha\,\hat{x} + (1-\alpha)\,x \,, \qquad (4.43) $$

where the Teacher Forcing (TF) term α ∈ ]0,1[ adjusts the feedback rate of the desired activity x̂, which is forced into the output, in proportion to the actual output activity x. An experience made during the work for this thesis is that a small forced desired activity suffices to drive a successful training (around α = 0.1).
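As an illustrative sketch of Eq. 4.43 (the function and argument names are assumptions; in practice this would be applied to the output activity at each time step of the forward pass):

```python
def teacher_force(y_actual, y_desired, alpha=0.1):
    """Mix the desired output activity into the actual one, cf. Eq. 4.43.

    alpha = 0.1 reflects the small feedback rate found sufficient above;
    works element-wise on floats or NumPy arrays alike.
    """
    return alpha * y_desired + (1.0 - alpha) * y_actual
```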

Noise and Jitter

In information processing in the brain, noise is inherently present. In particular, single neurons respond to only a small fraction of incoming spikes, but patterns of spikes usually correlate within populations [170, 251]. Reasons for individual variability, or noise, are manifold, ranging from simple sensor noise due to changing physical properties (bending hair cells or saccades), over synaptic fluctuations, up to dynamics in columns of cells. Small sources of noise easily add up, and the number of potential activity patterns is usually exponential. A model that has proven accurate for tuning functions of neurons across the brain assumes Gaußian noise with a certain width or variance σ [16, 57].

In machine learning it is a well-established method to add Gaußian white noise to the data while training [26, 247, 306]. In this field it is often called jitter, since the input moves within the feature space, e.g. based on a Gaußian PDF G:

$$ x_{u,t,i} = x_{t,i} + x_{\mathrm{noise}} \ \big|\ x_{\mathrm{noise}} \in G_{\mu=0,\sigma} \,, \qquad (4.44) $$

where the mean µ is set to zero and the variance σ is chosen appropriately for the architecture and task. It is difficult to determine a good variance analytically²⁹; thus the standard procedure is to tune the variance carefully and progressively, starting from small values. The result is that more data is generated from the existing data set and the generalisation increases overall, if the perturbation of the input by noise occurs on the important features. When adding noise to the data, it is important that the noise added to a data point is independent of the noise added to other data points in the same epoch, as well as of the noise added to the same data point in other epochs.
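A sketch of Eq. 4.44 that draws fresh, independent Gaußian noise for every data point in every epoch; the array shape and σ are example assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def jitter(x, sigma=0.01):
    """Additive Gaussian jitter, cf. Eq. 4.44.

    Call this anew in every epoch so that the noise is independent across
    data points and across epochs.
    """
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)

# Example: a noisy copy of a batch of sequences for epoch u
x = np.zeros((32, 50, 10))   # (sequences, time steps, features), assumed shape
x_u = jitter(x, sigma=0.01)
```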

In cognitive modelling, noise is particularly interesting for mimicking the uncertainty within columns or layers of neurons. Perturbation, in both activity and synaptic efficacy, can change the self-organisation of the internal representation, which could be of key importance [176]. On the one hand, for an appropriate information processing mechanism, a certain latent pattern should emerge despite variable noise. On the other hand, activity trajectories deviate from the patterns of noiseless reference activity. In this way, the inherent chaos and fluctuation can also be captured in firing-rate models.

Parallel Implementation on GPUs

Programming frameworks that enable researchers to use the massive number of cores in Graphics Processing Units (GPUs) have become available³⁰ in recent years. Vectorising expensive matrix manipulations, such as those in gradient descent, to spread the computations over more than 1,000 cores reduces the computation time drastically. In addition, these frameworks encourage parallel thinking when developing plausible neural architectures, which are, in fact, supposed to be massively parallel.
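A minimal sketch of the idea, here using PyTorch as one example of such a framework (not necessarily the toolchain of the original work); dimensions and names are assumptions:

```python
import torch

# Assumed dimensions of a small recurrent layer and a batch of inputs
n_in, n_hid, batch = 64, 256, 128
device = "cuda" if torch.cuda.is_available() else "cpu"

W_in = 0.01 * torch.randn(n_hid, n_in, device=device)
W_rec = 0.01 * torch.randn(n_hid, n_hid, device=device)
x = torch.randn(batch, n_in, device=device)
h = torch.zeros(batch, n_hid, device=device)

# One vectorised recurrent step: the matrix products cover all units and all
# batch elements at once and are spread over the GPU's cores if one is present.
h = torch.tanh(x @ W_in.T + h @ W_rec.T)
```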

²⁹ Note that it is possible to estimate an overall good noise pattern, but this involves developing a model of the distribution of the data set with respect to the employed features and classes.

³⁰ E.g. OpenCL or CUDA: https://khronos.org/opencl, https://developer.nvidia.com.