
3.4 Learning

3.4.3 Reinforcement Learning

• Unpaired actual output spikes that need to be deleted are put into the set D.

• Unpaired desired output spike times are put into the set J, i.e. the set of spikes that have to be inserted.

To clarify, S contains the pairs of “paired” actual and desired spike times, D contains the times of all unpaired actual spikes, and J the times of the unpaired desired spikes. With the PSP sum as above, the E-Learning rule is then

\Delta w_i = \gamma \left[ \sum_{t_{\text{ins}} \in J} \lambda_i(t_{\text{ins}}) - \sum_{t_{\text{del}} \in D} \lambda_i(t_{\text{del}}) + \frac{\gamma_r}{\tau_q^2} \sum_{(t_{\text{act}}, t_{\text{des}}) \in S} (t_{\text{act}} - t_{\text{des}})\, \lambda_i(t_{\text{act}}) \right] \quad (3.37)

γ is the learning rate, and γ_r is a factor that scales spike shifting relative to deletion and insertion.

The first two terms of the rule correspond to ReSuMe, except that the kernel is not a simple exponential decay. The advantage of E-Learning is that the weight changes for spikes close to their desired location are scaled with their distance, which improves convergence and consequently memory capacity.
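For concreteness, the bookkeeping of the rule can be sketched in a few lines of Python once the spikes have been sorted into S, D and J. The function name, the default parameter values and the callable lam_i (standing in for the PSP sum λ_i(t) defined above) are illustrative assumptions rather than part of the original formulation.

def e_learning_delta_w(lam_i, J, D, S, gamma=0.01, gamma_r=1.0, tau_q=10.0):
    """Weight change Delta w_i of afferent i under the E-Learning rule (3.37).

    lam_i : callable returning the PSP sum lambda_i(t) of afferent i
    J     : times of unpaired desired spikes (to be inserted)
    D     : times of unpaired actual spikes (to be deleted)
    S     : iterable of (t_act, t_des) pairs of matched spikes
    gamma, gamma_r, tau_q : learning rate, shift scaling and time constant
                            (placeholder values)
    """
    insert = sum(lam_i(t_ins) for t_ins in J)
    delete = sum(lam_i(t_del) for t_del in D)
    shift = sum((t_act - t_des) * lam_i(t_act) for t_act, t_des in S)
    return gamma * (insert - delete + (gamma_r / tau_q**2) * shift)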

3.4.2.3.3 FP-Learning.

FP-Learning [MRÖS14] was devised to remedy a central problem of learning rules like ReSuMe and others: any erroneous or missing spike “distorts” the time course of the membrane potential after it compared to the desired final state. This creates a wrong environment for the learning rule, and subsequent weight changes can potentially be wrong. Therefore, the FP-Learning algorithm stops the learning trial as soon as it encounters any output spike error. Additionally, FP-Learning introduces a margin of tolerable error for the desired output spikes: an actual output spike should be generated within the window of tolerance [t_d − ε, t_d + ε] with the adjustable margin ε. Weights are changed on two occasions:

1. If a spike occurs outside the window of tolerance of any t_d at a time t_err, then weights are depressed by Δw_i ∝ −λ_i(t_err). This also applies if the spike in question is the second one within a given tolerance window.

2. If t = t_d + ε and no spike has occurred in the window of tolerance, then t_err = t_d + ε and Δw_i ∝ λ_i(t_err).

In both cases, the learning trial ends immediately to prevent the “distorted” membrane potential from leading to spurious weight changes. Because of this property, the rule is also referred to as “First Error Learning”.
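A single FP-Learning trial can be sketched as follows. The event handling, the function signature and the learning rate eta are assumptions for illustration; lam(i, t) stands for the PSP-sum term λ_i(t) of afferent i used in the proportionalities above.

def fp_learning_trial(actual_spikes, desired_spikes, lam, n_afferents, eps, eta=0.01):
    """One FP-Learning ("First Error Learning") trial (sketch).

    actual_spikes  : sorted actual output spike times
    desired_spikes : sorted desired output spike times t_d
    lam            : callable lam(i, t) returning lambda_i(t) for afferent i
    eps            : half-width of the tolerance window [t_d - eps, t_d + eps]
    Returns {i: delta_w_i} for the first error, or {} if the trial is error-free.
    """
    used = [False] * len(desired_spikes)            # tolerance windows already filled
    events = sorted(
        [(t, "spike", -1) for t in actual_spikes] +
        [(t_d + eps, "window_end", k) for k, t_d in enumerate(desired_spikes)]
    )
    for t, kind, k in events:
        if kind == "spike":
            # look for an unused tolerance window containing this spike
            hit = next((m for m, t_d in enumerate(desired_spikes)
                        if abs(t - t_d) <= eps and not used[m]), None)
            if hit is None:                         # stray spike or second spike in a window
                return {i: -eta * lam(i, t) for i in range(n_afferents)}
            used[hit] = True
        elif not used[k]:                           # window closed without a spike
            return {i: +eta * lam(i, t) for i in range(n_afferents)}
    return {}                                       # no error: no weight change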

is higher, it will keep the change and otherwise discard it.

In this way, irregularly spiking neural networks, for example, can estimate the gradient of the reward signal and perform a stochastic gradient ascent on the expected reward [XS04]. In their model, Xie and Seung introduce stochastic spiking neurons i, which receive input from input neurons j. The strength of the connection between j and i is given by the synaptic weight w_ij. The input current into neuron i is then given by

I_i(t) = \sum_j w_{ij} h_{ij}(t) \quad (3.38)

where h_{ij} evolves according to

\tau_s \frac{dh_{ij}}{dt} + h_{ij} = \sum_a \delta(t - T_j^a)\, \zeta_{ij}^a \quad (3.39)

where \zeta_{ij}^a is a binary random variable modelling the stochastic nature of synaptic transmission and T_j^a is the time of the a-th spike in input neuron j. The input current I_i(t) is converted to an instantaneous firing rate \lambda_i(t) by

\lambda_i(t) = f_i(I_i(t)) \quad (3.40)

where f_i is the current-discharge relationship.

The learning rule is then given by

\Delta w_{ij} = \eta\, R\, e_{ij} \quad (3.41)

where R is a reward signal, η is a learning rate and e_{ij} is an eligibility trace given by

e_{ij} = \int_0^T dt\, \Phi_i(I_i)\, \left[ s_i(t) - f_i(I_i) \right] h_{ij} \quad (3.42)

where T is the length of the learning episode and s_i(t) = \sum_a \delta(t - T_i^a) is the spike train of neuron i. \Phi_i(I_i) is a function that scales the weight changes depending on the current firing rate. Learning works because, if the actual activation of the neuron is larger than the instantaneous firing rate and this is rewarded (R > 0), the weights from positively contributing input neurons are increased, which increases the instantaneous firing rate upon repetition. Similarly, if the actual output of the neuron is above the instantaneous firing rate and this is punished (R < 0), the weights are changed so as to decrease the firing rate. The same mechanism also applies for an actual activation below the firing rate, such that the firing rate is changed in the desired direction. By this learning mechanism, a gradient ascent on the reward is performed.

For constant reward and an output spike train that is enforced by a teacher, this is very similar to the δ-rule.
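The following is a rough sketch of how eqs. (3.38)–(3.42) combine into a single episode-based update, assuming that the traces h_ij(t), the spike trains s_i(t) and the input currents I_i(t) have been pre-computed; the array layout, the parameter values and the choice Φ_i = f'_i/f_i are assumptions made here for illustration.

import numpy as np

def xie_seung_episode_update(w, h, s, I, R, f, f_prime, dt, eta=1e-3):
    """Reward-modulated weight update over one episode (sketch of eqs. 3.38-3.42).

    w          : (N_out, N_in) synaptic weights w_ij
    h          : (T, N_out, N_in) filtered presynaptic traces h_ij(t)
    s          : (T, N_out) postsynaptic spike trains s_i(t) (delta spikes as 1/dt)
    I          : (T, N_out) input currents I_i(t)
    R          : scalar reward for the episode
    f, f_prime : current-discharge relationship f_i and its derivative
    """
    rate = f(I)                                   # instantaneous rate lambda_i(t), eq. 3.40
    phi = f_prime(I) / rate                       # scaling Phi_i(I_i); f'/f assumed here
    # eligibility trace e_ij = int dt Phi_i [s_i - f_i] h_ij   (eq. 3.42)
    e = np.einsum("ti,tij->ij", phi * (s - rate), h) * dt
    return w + eta * R * e                        # eq. 3.41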

In a study by Fiete et al., reinforcement learning has been shown to be applicable to more realistic neuron models, where the exploration is done by a perturbation of the conductance of the neuron [FS06].

3.4.3.1 Learning in Recurrent Networks

So far, only feed-forward classifiers of more or less complex patterns were discussed. In this section, I want to discuss learning in recurrent networks. This setup is of particular interest because in biological neuronal networks, at least some degree of feedback input can be expected.

3.4.3.1.1 Hopfield networks

Hopfield networks [Hop82, Hop07] are arguably the simplest setup for a recurrent network: a population of N rate neurons is interconnected by weights w_ij, which define the recurrent input. The activation of each neuron i is binary with S_i ∈ {−1, 1}, where S_i = 1 is the activated state and S_i = −1 the silent state. In this setup, the input into each neuron is given by

h_i = \sum_j w_{ij} S_j \quad (3.43)

The activation of each neuron is then determined by an activation function, which is chosen to be the sign function, such that

S_i := \operatorname{sgn}\left( \sum_j w_{ij} S_j \right) \quad (3.44)

The activations of the neurons can either all be updated at the same time (synchronous update) or one after the other (asynchronous update).

Imprinting P patterns ξ^µ onto this network can generate stable patterns that are self-consistent in the sense that they generate input into each neuron that is compatible with its own state. Patterns can even be completed if a noisy version of a pattern is presented to the network, because an attractor around the original pattern is formed.

In the Hopfield model, a generalized Hebb rule is employed for learning:

w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi_i^\mu \xi_j^\mu \quad (3.45)

With this learning rule, a number of patterns can be stored in such a network, and these patterns form attractors. This is a very simple form of content-addressable memory and associative learning.
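A compact sketch of storage and asynchronous recall in such a network is given below, assuming patterns with entries ±1 stored row-wise in a NumPy array; zeroing the self-couplings and mapping sgn(0) to +1 are common conventions added here, not part of eqs. (3.43)–(3.45).

import numpy as np

def store_patterns(patterns):
    """Hebbian storage, eq. (3.45): w_ij = (1/N) sum_mu xi_i^mu xi_j^mu."""
    P, N = patterns.shape                     # patterns: (P, N) array with entries +/-1
    w = patterns.T @ patterns / N
    np.fill_diagonal(w, 0.0)                  # no self-coupling (common convention)
    return w

def recall(w, state, steps=1000, rng=None):
    """Asynchronous update, eqs. (3.43)-(3.44): one randomly chosen neuron per step."""
    rng = rng or np.random.default_rng()
    S = state.copy()
    for _ in range(steps):
        i = rng.integers(len(S))
        S[i] = 1 if w[i] @ S >= 0 else -1     # S_i := sgn(h_i), with sgn(0) mapped to +1
    return S

Presenting a noisy version of a stored pattern as the initial state and iterating the recall function illustrates the attractor property described above.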

3.4.3.1.2 Temporal sequences of patterns

So far, stationary patterns were discussed. However, in biological neuronal networks, sequences of activation patterns are essential. In modelling studies, these activation sequences are usually considered in closed-loop situations, where the last activation pattern in the sequence restarts the sequence (limit cycles). This process can be seen as a stable trajectory in phase space, where the vector of the states of the neurons is the state of the system. This state vector then follows a closed trajectory.

Recurrent neuronal networks that are able to create continuous, self-sustained patterns of activity are highly sensitive to noise. Hence, any learning algorithm that attempts to teach elongated activity sequences to recurrent networks needs to take this sensitivity into account.

In [LB13], Laje and Buonomano introduce a model that produces robust activity patterns in recurrent neural networks. Their model is based on firing-rate units and generates locally stable trajectories in phase space.

They use innate trajectories through phase space as a starting point, that is, sequences of activation that the network generates due to its initialization. During a training period, they present the network with noisy input and apply a learning rule similar to the delta rule to achieve an output equivalent to the original trajectory. In this way, they generate areas of attraction around the desired trajectory, from which the network evolves back onto the trajectory. Noise during learning is thus beneficial for the stability of learned trajectories.

To derive meaningful output from this recurrent network, they train an output neuron to respond to the learned pattern in a specific way.

Brea et al. [BSP13] introduce another model, based on spike-response-model neurons with stochastic spiking. They use a set of visible neurons that are part of the training pattern and a set of hidden neurons that can spike freely. Hidden neurons enable networks to solve more complex problems that are not solvable with visible neurons only. In their study, they introduce a membrane potential

u_i(t) = u_0 + \sum_{j=1}^{N} w_{ij}\, x_j^\epsilon(t) + x_i^\kappa(t) \quad (3.46)

where w_{ij} is the synaptic strength from neuron j to neuron i, x_k^\alpha(t) = \sum_{s=1}^{\infty} \alpha(s)\, x_k(t-s) represents the convolution of spike train x_k with kernel \alpha, and u_0 is the resting potential. The postsynaptic kernel is given by \epsilon(s) = \frac{1}{\tau_1 - \tau_2} \left( \exp(-s/\tau_1) - \exp(-s/\tau_2) \right) and the adaptation kernel by \kappa(s) = c\, \exp(-s/\tau_r) for s \geq 0. Both kernels are zero for s < 0.

The spiking process is modelled as a stochastic process based on the deterministic membrane potential. The spiking probability of neuron i in time bin t is given by

P(x_i(t) = 1 \mid u_i(t)) = \rho(u_i(t)) \quad (3.47)

with \rho(u) = \frac{1}{1 + \exp(-\beta u)}.

In this setup, they develop a learning rule from the goal that the distribution of output spike trains of the visible neurons, P_w(v), is as similar as possible to the target distribution P(v). To that end, they derive a learning rule that performs a gradient descent on an upper bound of the Kullback-Leibler divergence

D(P(v) \,\|\, P_w(v)) = \left\langle \log \frac{P(v)}{P_w(v)} \right\rangle_{P(v)} \quad (3.48)

The derived learning rule is given by

\Delta w_{ij}^{\text{batch}} = \eta \sum_{t=1}^{T} g_i(t)\, \bigl( x_i(t) - \rho_i(t) \bigr)\, x_j^\epsilon(t) \cdot \begin{cases} 1 & \text{if } i \text{ is visible} \\ \log R_w(v|h) - \bar{r} & \text{if } i \text{ is hidden} \end{cases} \quad (3.49)

where η is the learning rate and g_i(t) = \frac{\rho_i'(t)}{\rho_i(t)\,(1 - \rho_i(t))} with \rho_i'(t) = \frac{d\rho(u)}{du}\big|_{u = u_i(t)}, which implies g_i(t) = \beta for \rho(u) as defined above. R_w(v|h) is the probability of a visible activity pattern given the past hidden pattern, and \bar{r} is a constant. During the learning process, spike trains for the visible neurons are sampled from the target distribution, v \sim P(v), and imposed on the visible neurons. Hidden neurons follow the dynamics of the network. The main part of the learning rule is essentially equivalent to the delta learning rule: the difference between the actual activation x_i(t) and the expected activation \rho_i(t) of each target neuron is correlated with the presynaptic activation.
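The batch update of eq. (3.49) for one sampled episode can be sketched as follows, assuming the spike trains, firing probabilities and presynaptic traces have already been computed; the function signature, in particular how log R_w(v|h) is passed in as a scalar, is an illustrative assumption.

import numpy as np

def brea_batch_update(x, rho, x_eps, visible, log_Rw, r_bar, beta, eta=1e-3):
    """Batch weight change of eq. (3.49) for one sampled episode (sketch).

    x       : (T, N) binary spike trains; visible rows clamped to the target sample
    rho     : (T, N) firing probabilities rho_i(t) = 1 / (1 + exp(-beta * u_i(t)))
    x_eps   : (T, N) presynaptic traces x_j^eps(t)
    visible : boolean mask of length N marking visible neurons
    log_Rw  : scalar log R_w(v|h) for this episode
    r_bar   : constant baseline
    """
    factor = np.where(visible, 1.0, log_Rw - r_bar)   # per-neuron modulation term
    delta = beta * (x - rho)                          # g_i(t) * (x_i(t) - rho_i(t)), g_i = beta
    return eta * np.einsum("ti,tj->ij", delta * factor, x_eps)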

With this learning rule, a network of spiking neurons can learn to approximate a desired output spike pattern distribution.