

6.2 Unifying the MTRNN Model

into a simulated iCub, can adapt well to different motor constellations and can generalise to new permutations of actions and objects. However, it is not clear how we can transfer the model to language acquisition in humans, since a number of assumptions have been made. The action, shape and colour descriptions (in binary form) are already present in the input of the motor and vision networks. Thus, this information is inherently included in the filtered representations that are fed into the network for the linguistic description. Moreover, the linguistic network was designed as a fixed-point classifier that outputs two active neurons per input: one ‘word’ for an object and one for an action. Accordingly, the output assumes a word representation and omits the sequential order.

In a framework for multi-modal integration, Noda et al. [202] suggested integrating visual and sensorimotor features in a deep auto-encoder. The employed time delay neural network can capture features on varying timespans by means of time-shifts and hence can abstract higher-level features to some degree. In their study, both modalities of features stem from the perception of interactions with some toys and form reasonably complex representations in sequences of 30 frames. Although language grounding was not pursued, the shared multi-modal representation in the central layer of the network formed an abstraction of the perceived scenes with a certain internal structuring and provided some robustness to noise.

6.2.1 MTRNN with Context Abstraction

To accomplish such an MTRNN architecture, we can reverse the concept of the context bias (compare chapter 4.4) and thus reverse the processing from the context to the Input-Output (IO) layer⁴. The concept of such an MTRNN with context abstraction is visualised in figure 6.1. For a certain sequential input, provided as a dynamic pattern to the fastest neurons (with the lowest timescale) $I_{IO}$, the network accumulates a common concept in the slowest neurons (with the highest timescale) $I_{Csc} \subseteq I_{Cs}$. Since the timescale characteristic yields a slow adaptation of those Context-controlling (Csc) units, the information in these units accumulates aspects of the pattern from the input sequence (filtered by potential neurons in an intermediate layer). The accumulation is characterised by a logarithmic skew towards the near past and a reach-out to the long past, depending on the timescale values $\tau_{Cs}$ (and $\tau_{Cf}$).

[Figure 6.1 near here: schematic of a dynamic pattern (t+1, …, t = T) entering the Input-Output layer, the Context-fast and Context-slow layers with increasing timescale τ, and the Context-controlling units abstracting the sequence into a concept.]

Figure 6.1: The Multiple Timescale Recurrent Neural Network (MTRNN) with context abstraction architecture, providing three exemplary horizontally parallel layers: Context-slow (Cs), Context-fast (Cf), and Input-Output (IO), with increasing timescale τ, where the Cs layer includes some Context-controlling (Csc) units. While the IO layer processes dynamic patterns over time, the Csc units at the last time step (t = T) abstract the context of the sequence.
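To make the timescale-driven accumulation concrete, the following minimal sketch implements a standard leaky-integrator (CTRNN-style) update over all layers, with external input applied only to the IO units. All names are illustrative, tanh stands in for the sigmoidal activation, and the exact update of the implementation in this thesis may differ in detail.

```python
import numpy as np

def mtrnn_forward(x_seq, W, b, tau, n_io, n_cf, n_cs):
    """Sketch of the forward pass: leaky-integrator units whose timescale tau
    controls how quickly their internal state z follows the net input.

    x_seq : (T, n_io) dynamic input pattern for the IO layer
    W     : (N, N) recurrent weights over all N = n_io + n_cf + n_cs units
    b, tau: (N,) biases and per-unit timescales (small for IO, large for Cs)
    """
    N = n_io + n_cf + n_cs
    T = x_seq.shape[0]
    z = np.zeros(N)               # internal states
    y = np.tanh(z + b)            # outputs (tanh as an example sigmoid f)
    ys = np.empty((T, N))
    for t in range(T):
        ext = np.zeros(N)
        ext[:n_io] = x_seq[t]     # external input only reaches the IO units
        # large tau => small per-step change => slow accumulation of context
        z = (1.0 - 1.0 / tau) * z + (1.0 / tau) * (W @ y + ext)
        y = np.tanh(z + b)
        ys[t] = y
    c_T = z[n_io + n_cf:]         # final internal states of the Cs/Csc units
    return ys, c_T
```

The final internal states of the slowest units are what the architecture reads out as the abstracted context at t = T.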

6.2.2 From Supervised Learning to Self-organisation

The MTRNN with context abstraction can be trained in a supervised fashion to capture a certain concept from the temporally dynamic pattern. This is directly comparable to fixed-point classification with ERNNs or CTRNNs: with a gradient descent method we can determine the error between a desired concept and the activity in the Csc units, and propagate the error backwards through time. However, for an architecture that is supposed to model the processing of a certain cognitive function in the brain, we are also interested in removing the necessity of providing a concept a priori. Instead, the representation of the concept should self-organise based on the regularities latent in the stimuli.
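In the purely supervised setting this amounts to a fixed-point target at the final time step, as in the following sketch; the concept vectors and the 0.5 scaling are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def concept_loss(y_T_csc, concept_target):
    """LMS error between the Csc outputs at the final time step t = T and a
    concept vector that is provided a priori; its gradient would be
    propagated backwards through time as in standard BPTT."""
    return 0.5 * np.sum((y_T_csc - concept_target) ** 2)

# hypothetical a-priori concepts for four sequences and two Csc units
targets = {"aa": np.array([-0.8, -0.8]), "ab": np.array([-0.8,  0.8]),
           "ba": np.array([ 0.8, -0.8]), "bb": np.array([ 0.8,  0.8])}
```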

⁴ Note that in chapter 5.4.2 we observed that the MTRNN self-organised during training towards mostly feed-forward connectivity from Context-slow (Cs) to IO.

For the MTRNN with parametric bias, this was realised in terms of modifying the Csc units’ activity at the first time step (t = 0) backwards by the partial derivatives for the weights connecting from those units. To achieve a similar self-organisation, a semi-supervised mechanism, which allows modifying the desired concept to foster self-organisation, is developed in this thesis. Since we aim at an abstraction from the perception to the overall concept, the Least Mean Square (LMS) error function⁵ is modified for the internal state $z$ at time step $t$ of the neurons $i \in I_{All} = I_{IO} \cup I_{Cf} \cup I_{Cs}$, introducing a self-organisation forcing constant ψ as follows:

$$
\frac{\partial h^{error}}{\partial z_{t,i}} =
\begin{cases}
(1-\psi)\,\bigl(y_{t,i} - f(c_{T,i} + b_i)\bigr)\,f'_{sig}(z_{t,i}) & \text{iff } i \in I_{Csc} \wedge t = T \\[6pt]
\displaystyle\sum_{k \in I_{All}} \frac{w_{k,i}}{\tau_k}\,\frac{\partial h^{error}}{\partial z_{t+1,k}}\,f'(z_{t,i}) + \Bigl(1 - \frac{1}{\tau_i}\Bigr)\frac{\partial h^{error}}{\partial z_{t+1,i}} & \text{otherwise}
\end{cases} \, , \quad (6.1)
$$

where $f$ and $f'$ denote an arbitrary sigmoidal function and its derivative respectively, $b$ and $w$ are the biases and weights, $y$ denotes the neurons’ output, and $c_{T,i}$ are the internal states at the final⁶ time step $T$ of the Csc units $i \in I_{Csc} \subseteq I_{Cs}$.
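A sketch of how the two cases of equation (6.1) could be evaluated in a backward pass is given below, assuming the internal states and outputs of a forward pass have been stored and using tanh as the example sigmoid. Here d[t, i] stands for ∂h^error/∂z_{t,i}; the function and variable names are illustrative rather than the thesis’ implementation.

```python
import numpy as np

def backward_context_abstraction(zs, ys, W, b, tau, csc_idx, c_T, psi):
    """Backward-through-time recursion following equation (6.1).

    zs, ys  : (T, N) internal states and outputs from the forward pass
    csc_idx : index array of the Context-controlling (Csc) units
    c_T     : current final internal-state targets of the Csc units
    psi     : self-organisation forcing constant
    """
    T, N = zs.shape
    fprime = lambda z: 1.0 - np.tanh(z) ** 2      # derivative of tanh
    d = np.zeros((T, N))                          # d[t, i] ~ dh_error/dz_{t,i}
    # case 1: Csc units at t = T are pulled towards f(c_T + b)
    d[T - 1, csc_idx] = ((1.0 - psi)
                         * (ys[T - 1, csc_idx] - np.tanh(c_T + b[csc_idx]))
                         * fprime(zs[T - 1, csc_idx]))
    # case 2: all remaining unit/time combinations follow the usual recursion
    # (any additional prediction-error term for the IO units would be added here)
    for t in range(T - 2, -1, -1):
        d[t] = (W.T @ (d[t + 1] / tau)) * fprime(zs[t]) \
               + (1.0 - 1.0 / tau) * d[t + 1]
    return d
```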

The particularly small self-organisation forcing constant allows the final internal states $c_{T,i}$ of the Csc units to adapt to the data, although they actually serve as a target for shaping the weights of the network. Accordingly, the final internal states $c_{T,i}$ of the Csc units define the abstraction of the input data and are also updated as follows:

$$
c_{u,T,i} = c_{u-1,T,i} - \psi\,\zeta_i\,\frac{\partial h^{error}}{\partial c_{T,i}} = c_{u-1,T,i} - \psi\,\zeta_i\,\frac{1}{\tau_i}\,\frac{\partial h^{error}}{\partial z_{T,i}} \quad \text{iff } i \in I_{Csc} \, , \quad (6.2)
$$

where $u$ denotes the training epoch and $\zeta_i$ denotes the learning rate for the changes.

Similarly to the parametric bias units, the final internal states $c_{T,i}$ of the Csc units self-organise during training, in conjunction with the weights (and biases), towards the highest entropy. We can observe that the self-organisation forcing constant and the learning rate are dependent, since changing ζ would also shift the self-organisation for arbitrary but fixed ψ. However, this is a useful mechanism to self-organise towards concepts that are most appropriate with respect to the structure of the data.
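A corresponding sketch of the epoch-wise adaptation of the Csc targets from equation (6.2), reusing the gradient d from the backward sketch above (again with illustrative names):

```python
def update_csc_targets(c_T, d, tau, csc_idx, psi, zeta):
    """Equation (6.2): shift the final Csc internal-state targets slightly
    along the negative gradient, scaled by psi, the learning rate zeta_i,
    and 1/tau_i, once per training epoch u."""
    return c_T - psi * zeta * (1.0 / tau[csc_idx]) * d[-1, csc_idx]
```

Because ψ is small, the targets $c_{T,i}$ drift only slowly, so most of the adaptation is carried by the weights, which reflects the dependency between ψ and ζ noted above.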

6.2.3 Evaluating the Abstracted Context

To test in a preliminary experiment how the abstracted concepts form for different sequences, the architecture was trained on the cosine task⁷. Similar to the preliminary experiment reported in chapter 4.5.1, the network is supposed to learn four sequences and is set up with $|I_{IO}| = 2$, $\tau_{IO} = 1$, $|I_{Cf}| = 8$, $\tau_{Cf} = 8$, $|I_{Cs}| = |I_{Csc}| = 2$, and $\tau_{Cs} = 32$. Processing a sequence by the MTRNN with context abstraction results in a specific pattern of the final Csc units’ activity as the abstracted context.

⁵ Any other error function can be modified analogously.

⁶ In this thesis, we use the term final to indicate the last time step of a sequence, which is in line with using the term initial to indicate the first time step (compare chapter 4.4).

⁷ For details compare chapter 4.5.1 and appendix D.5.

To determine how those patterns self-organise, the architecture was trained with predefined patterns (chosen randomly: $\forall i \in I_{Csc}: c_{T,i} \in [-1.0, 1.0] \subset \mathbb{R}$) as well as with randomly initialised patterns that adapt during training by means of the varied self-organisation forcing parameter ψ. To measure the result of the self-organisation, two distance measures, $q_{L2\text{-}dist,avg}$ and $q_{L2\text{-}dist,rel}$, are used:

$$
q_{L2\text{-}dist}(c_k, c_l) = \sqrt{\sum_{i \in I_{Csc}} \bigl(c_{k,i} - c_{l,i}\bigr)^2} \, , \quad (6.3)
$$

$$
q_{L2\text{-}dist,avg} = \frac{1}{(|S|-1)\cdot(|S|/2)} \sum_{k=1}^{|S|-1} \sum_{l=k+1}^{|S|} q_{L2\text{-}dist}(c_k, c_l) \, , \quad (6.4)
$$

$$
q_{L2\text{-}dist,rel} = \left( \prod_{k=1}^{|S|-1} \prod_{l=k+1}^{|S|} \frac{q_{L2\text{-}dist}(c_k, c_l)}{q_{L2\text{-}dist,avg}} \right)^{\!\frac{1}{(|S|-1)\cdot(|S|/2)}} \, , \quad (6.5)
$$

where $|S|$ describes the number of sequences and $c_k = (c_{k,T,i})_{i \in I_{Csc}}$ denotes the final Csc units’ internal states for sequence $k$. With $q_{L2\text{-}dist,avg}$, which uses the standard Lebesgue L2 or Euclidean distance, we can estimate the average distance of all patterns, while with $q_{L2\text{-}dist,rel}$ we can describe the relative difference of distances. For example, in case the distances between all patterns are exactly the same, this measure would yield the best possible result⁸ of $q_{L2\text{-}dist,rel} = 1.0$. Comparing both measures for varied settings of ψ provides an insight into how well the internal representation is distributed upon self-organisation.
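The two measures could be computed as in the following sketch over the matrix of final Csc patterns (shape |S| × |I_Csc|); the helper names are illustrative.

```python
import numpy as np
from itertools import combinations

def q_l2_dist(c_k, c_l):
    """Euclidean (L2) distance between two final Csc patterns, eq. (6.3)."""
    return np.sqrt(np.sum((c_k - c_l) ** 2))

def q_l2_avg_rel(C):
    """Average pairwise distance, eq. (6.4), and relative measure, eq. (6.5):
    the geometric mean of all pairwise distances normalised by their average.
    C has shape (|S|, |I_Csc|)."""
    pairs = list(combinations(range(C.shape[0]), 2))   # (|S|-1)*(|S|/2) pairs
    dists = np.array([q_l2_dist(C[k], C[l]) for k, l in pairs])
    q_avg = dists.mean()
    q_rel = np.prod(dists / q_avg) ** (1.0 / len(pairs))
    return q_avg, q_rel
```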

The results for the experiment are presented in figure 6.2. From the plots we can observe that the patterns of the abstracted context show a fair distribution for no self-organisation (the random initialisation) up to especially small values of about ψ = 0.00001, a good distribution for values around ψ = 0.00005, and a degrading distribution for larger ψ. The scatter plots for arbitrary but representative runs in figure 6.2c–f visualise the resulting patterns for no (ψ = 0.0), too small (ψ = 0.00001), good (ψ = 0.00005), and too large self-organisation forcing (ψ = 0.0002). From inspecting the Csc units, we can learn that a “good” value for ψ leads to a marginal self-organisation towards an ideal distribution of the concepts over the Csc space during the training of the weights. Furthermore, a larger ψ drives a stronger adaptation of the Csc patterns than of the weights, thus leading to a convergence to similar patterns for all sequences.

Admittedly, the task in this preliminary experiment is quite simple; thus a random initialisation of the Csc units within a feasible range of values ([−1.0, 1.0]) often already provides a fair representation of the context and allows for convergence to very small error values. However, for larger numbers of sequences, which potentially share some primitives, a random distribution of the respective concept abstraction values is unlikely to provide a good distribution, and thus the self-organisation forcing mechanism can drive the learning.

⁸ Given that the dimensionality of the Csc units is ideal with respect to the number of sequences. For example, when representing four sequences with two Csc units, we can find an optimal $q_{L2\text{-}dist,rel} = 0.9863$.
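As a quick plausibility check of footnote 8, four patterns placed at the corners of a square in the two-dimensional Csc space (an assumed, not necessarily unique, optimal arrangement) reproduce the quoted value:

```python
import numpy as np
from itertools import combinations

# four concept patterns at the corners of a square in the 2-D Csc space
C = np.array([[-0.7, -0.7], [-0.7, 0.7], [0.7, -0.7], [0.7, 0.7]])

dists = np.array([np.linalg.norm(C[k] - C[l])
                  for k, l in combinations(range(len(C)), 2)])
q_avg = dists.mean()
q_rel = np.prod(dists / q_avg) ** (1.0 / len(dists))
print(round(q_rel, 4))   # 0.9863, matching the optimum quoted in footnote 8
```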

[Figure 6.2 near here: (a) training effort, number of epochs u (in 1,000) over the self-organisation forcing ψ; (b) mean $q_{L2\text{-}dist,avg}$ (◦) and mean $q_{L2\text{-}dist,rel}$ over ψ; (c–f) scatter plots of the final Csc states $c_{T,1}$ versus $c_{T,2}$ for ψ = 0.0, 0.00001, 0.00005, and 0.0002.]

Figure 6.2: Effect of the self-organisation forcing mechanism on the development of distinct concept patterns for different sequences in the cosine task: training effort (a) and mean $q_{L2\text{-}dist,avg}$ and $q_{L2\text{-}dist,rel}$ with standard error bars over varied ψ (b), each over 100 runs; representative developed Csc patterns (c–f) for the sequences aa (star), ab (cross), ba (plus), and bb (hexagram) for selected parameter settings of no, small, “good”, and large self-organisation forcing respectively.

6.3 Embodied Language Understanding with