
6.3 Embodied Language Understanding with Unified MTRNN Models

6.3.2 Evaluation and Analysis

The central objective for the comparison of the unified models with the previous embMTRNN model is the generalisation capability. Additionally, the self-organisation forcing mechanism needs to be explored for both its impact on the overall performance of the model and the developed internal representation (the self-organised abstracted context).

To further test the models, the data collection of the previous study was expanded for every scene and every example by a stream of visual input. With this data, the experimental conditions of the previous study were replicated: dividing the samples into a training set and a test set (50:50, each scene is only included in one of the sets), training ten randomly initialised uniMTRNN as well as so-uniMTRNN systems, and repeating this process for a 10-fold cross-validation (thus performing 100 runs for each model, experiment, or meta-parameter variation respectively).
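To make this protocol concrete, the following minimal Python sketch enumerates the run configurations. The function name, the number of scenes, and the per-fold reshuffling are illustrative assumptions, not the exact procedure of the study.

    import random

    def evaluation_runs(scenes, n_folds=10, n_inits=10, seed=42):
        """Yield (fold, init, train_set, test_set): 10 folds x 10 initialisations = 100 runs."""
        rng = random.Random(seed)
        for fold in range(n_folds):
            shuffled = list(scenes)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            # 50:50 split on scene level, each scene ends up in exactly one set.
            train_set, test_set = shuffled[:half], shuffled[half:]
            for init in range(n_inits):
                yield fold, init, train_set, test_set

    # Example: 100 run configurations for 40 hypothetical scene identifiers.
    runs = list(evaluation_runs([f"scene_{i}" for i in range(40)]))
    assert len(runs) == 100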

The parameter settings (meta-parameters) for the additional MTRNNv parts of the refined models are listed in table 6.1, while the parameters for the MTRNNa and the training approach are kept as in the previous study (compare table 5.2).

Training was done for a maximum number of θ = 50,000 epochs or until reaching a minimal average Mean Squared Error (MSE) of 1.0 × 10^−4 on the Cscv units. Since the visual representation has not changed, the number of neurons in the input layer |Iv,IO| is identical to |IEC| from the previous study. The number of Cscv units depends on the number of Csca units in the uniMTRNN model.
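As an illustration of this stopping criterion, the sketch below checks the epoch limit and the Csc error threshold; how the MSE is averaged over sequences and Cscv units is an assumption made here for demonstration.

    import numpy as np

    MAX_EPOCHS = 50_000       # theta
    MSE_THRESHOLD = 1.0e-4    # minimal average MSE on the Csc_v units

    def should_stop(epoch, csc_activations, csc_targets):
        # Squared deviation of the Csc_v activations from their targets,
        # averaged over all sequences and units (averaging scheme assumed).
        mse = float(np.mean((np.asarray(csc_activations) - np.asarray(csc_targets)) ** 2))
        return epoch >= MAX_EPOCHS or mse <= MSE_THRESHOLD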

For the so-uniMTRNN model the same number is kept for the sake of a fair comparison. The timescales for the MTRNNa are based on the resulting values for the embMTRNN model (τa,IO = 2, τa,Cf = 5, and τa,Cs = 70). For the MTRNNv the timescales are not crucial in the case of a scenario without movements (the change of visual perception over time is not assumed to be a composition of primitives).

Table 6.1: Standard parameter settings for evaluation of the unified MTRNN models.

Parameter *   Description                    Domain                   Baseline Value
|Iv,IO|       Number of IO neurons           |Fsha|+|Fcol|+|Fpos|     21
|Iv,Cf|       Number of Cf neurons           N>0                      40
|Iv,Cs|       Number of Cs neurons           N>0                      23
|Iv,Csc|      Number of Csc units            N[1,...,|Iv,Cs|]         12
W0v           Initial weights range          R[−1.0,1.0]              ±0.025
Cv,T          Init. final Csc values range   R[−1.0,1.0]              ±1.00
τv,IO         Timescale of IO neurons        N>0                      2
τv,Cf         Timescale of Cf neurons        N>τv,IO                  5
τv,Cs         Timescale of Cs neurons        N>τv,Cf                  16

* Parameters for the MTRNNa and the training are identical to those in table 5.2.

Nevertheless, based on the previous study a parameter search was conducted (not shown), which confirmed a setting of τv,IO = 2, τv,Cf = 5, and τv,Cs = 16 for a progressive abstraction.

All mechanisms and meta-parameters for training are kept from the previous study. The sole difference is the initialisation of the internal state of the Cscv units in the so-uniMTRNN model. Instead of starting the training with very small values, the values are initialised in [−1.0, 1.0]. Initialising with slightly smaller or larger ranges of random values, or with random values that were subsequently normalised (with respect to the Cs layer), has been tested as well, but does not show a notable change in the properties of the model. A parameter search for good dimensions (Context-fast (Cf) layer) in addition to good timescales (as discussed above) was conducted prior to the actual experiments, but is omitted here for brevity. Compared to the sequences of phonemes, the sequences of visual perception are undemanding, and thus these parameters are less crucial.
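The two initialisation variants mentioned above can be sketched as follows; interpreting "normalised with respect to the Cs layer" as a scaling to unit L2 norm is an assumption made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_csc = 12  # |Iv,Csc| from table 6.1

    # Uniform random initialisation in [-1.0, 1.0], as used for the so-uniMTRNN.
    csc_init = rng.uniform(-1.0, 1.0, size=n_csc)

    # Tested variant: random values that are subsequently normalised
    # (assumed here to mean scaling to unit L2 norm over the Csc units).
    csc_init_normalised = csc_init / np.linalg.norm(csc_init)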

Generalisation with Dynamic Vision

To test whether the refined models provide a similar performance, both models are compared with the results from the previous study on the mixed F1-score as well as the mixed edit distance. For this overall comparison, it is assumed that the appropriate meta-parameters for the architectures and the training were determined beforehand. Most importantly, this includes a study on the self-organisation forcing parameter, which will be reported in detail later in this section.

The performance of each model (using the aforementioned standard parameters) is presented in table 6.2 and figure 6.5a–b. Additionally, for the refined models only, the training effort regarding the visual MTRNNv is given in figure 6.5c. The results show that all models are able to generalise on comparable levels.

Table 6.2: Comparison of F1-score and mean edit distance for different MTRNN models.

                                   qF1-score                qedit-dist
Model *                        1       2       3        1       2       3
training set best              1.000   1.000   1.000    0.000   0.000   0.000
test set best                  0.476   0.476   0.476    0.553   0.545   0.540
training set best average **   1.000   1.000   1.000    –       –       –
test set best average **       0.337   0.320   0.337    –       –       –
training set average           0.999   0.996   0.996    0.001   0.002   0.002
test set average               0.171   0.173   0.172    0.676   0.643   0.640
mixed ***                      0.626   0.620   0.627    0.338   0.322   0.321

* Models: embMTRNN (1), uniMTRNN (2), so-uniMTRNN (3).
** Averaged over all best networks of all data set distributions.
*** For definition compare equations 5.8 and 5.11.

Figure 6.5: Comparison of MTRNN model variants performing on the embodied language understanding scenario: embMTRNN (1), uniMTRNN (2), so-uniMTRNN (3). In (a) the dark/blue bars represent the mean F1-score, while the bright/red bars show the F1-score of the best network for the respective model (larger is better). In (b) the bright/cyan bars show the average mean edit distance with error bars for the standard error of means, while the dark/violet bars provide the mean edit distance of the best network for the respective model (smaller is better, worst possible: 2.0). In (c) the training effort is measured for the MTRNNv only.

The refined models show overall a slightly better performance, with the uniMTRNN performing marginally better on sentence level (average F1-score of 0.173 on the test set) and the so-uniMTRNN slightly better on phoneme level (average mixed edit distance of 0.321). It is remarkable that the number of errors made on phoneme level is significantly smaller (pt-test < 0.001) for both refined models compared to the embMTRNN model. In contrast, the errors made by the uniMTRNN and the so-uniMTRNN models could not be found to be statistically different (pt-test < 0.01 was never reached). The training effort for both refined models was not found to be crucially different: for the setting of the self-organisation forcing parameter with the best performance, the training effort was noticeably, but not significantly, smaller.
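The reported significance statements can, in principle, be reproduced with a two-sample t-test over the per-run error measures. The sketch below uses Welch's t-test from SciPy and dummy data as a stand-in for the 100 runs per model; it is not the exact statistical procedure of the study.

    import numpy as np
    from scipy import stats

    def significantly_different(errors_a, errors_b, alpha=0.001):
        # Welch's two-sample t-test on per-run error measures (e.g. mean edit distances).
        _, p_value = stats.ttest_ind(errors_a, errors_b, equal_var=False)
        return p_value < alpha, p_value

    # Example with dummy data standing in for 100 runs of two model variants.
    rng = np.random.default_rng(1)
    print(significantly_different(rng.normal(0.676, 0.05, 100), rng.normal(0.640, 0.05, 100)))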

When inspecting the weights of a trained MTRNNv (for either of the refined models), it was observed that the weights from the Cf to the IO layer as well as from the Cs to the Cf layer converged to smaller but nonzero values, compared with the weights in the opposite direction. Since the objective during training is to minimise the error on the Csc units, it is logical that a structure similar to the feed-forward layers of the embMTRNN model emerges. Nevertheless, it seems that the existence of (small) recurrent connections might facilitate the processing of related features in the input.

Self-organised Abstracted Visual Context

To analyse how the self-organisation forcing parameter affects the internal representation and the generalisation capability of the so-uniMTRNN model, the parameter was varied on identical instances of the randomly initialised MTRNNv. The central hypothesis is that the self-organisation forcing mechanism can lead to a better distribution of the context patterns in the Csc space. This might eliminate the necessity of an a priori given set of patterns or may even yield an overall higher performance.

The self-organisation forcing parameter ψv was varied over the values listed in table 6.3, yielding the results shown in figure 6.6a–d. Identical to the experiments before, the F1-score and the edit distances were computed and the training effort was measured for testing the whole so-uniMTRNN model. Furthermore, the internal states of the final Csc values of the MTRNNv were collected by activating the MTRNNv with the training sequences (without additional updates of the network). In this way the abstracted context patterns, for which the network was trained, were obtained and could be studied by applying the previously suggested metrics for the average and relative L2 distances. Additionally, the |ICsc|-dimensional context patterns have been reduced to the first two principal components using Principal Component Analysis (PCA) to allow for a visual inspection of these patterns11.
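These analysis steps can be sketched as follows: a PCA computed via an SVD on the mean-centred patterns, plus average pairwise L2 distances. Treating the "relative" distance as the average pairwise distance normalised by the average pattern magnitude is an assumption made here; the metrics actually used are the ones suggested in the previous chapter.

    import numpy as np
    from itertools import combinations

    def analyse_csc_patterns(patterns):
        x = np.asarray(patterns, dtype=float)      # shape: (n_sequences, |ICsc|)
        # PCA via SVD on the mean-centred patterns.
        centred = x - x.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        pc12 = centred @ vt[:2].T                  # projection onto the first two principal components
        # Average pairwise L2 distance between the context patterns.
        dists = [np.linalg.norm(a - b) for a, b in combinations(x, 2)]
        avg_dist = float(np.mean(dists))
        rel_dist = avg_dist / float(np.mean(np.linalg.norm(x, axis=1)))  # normalisation assumed
        return pc12, avg_dist, rel_dist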

The results show that the performance changes only marginally for a range of ψv values. For ψv = 0.0005 both the mixed F1-score and the edit distance reach the best levels, but the difference is not significant (pt-test > 0.01). However, the relative distances for the Csc patterns increase around this ψv value, before they degrade for larger ψv. The visualisation of the Csc patterns of a representative network in figure 6.6e–g shows that they self-organised to distribute themselves better in the Csc space, although their absolute magnitudes decreased. This effect was observed across most runs, and was notably strong in well-performing networks.

At some point, the training effort drops and the general performance also degrades rapidly. On the one hand, the developed target internal states of the final Csc units cv,T tend to approach zero more quickly with a large ψv. On the other hand, the weights of the networks were initialised at random but with rather small values, which results in a small summed internal state z of the neurons due to the gradient descent strategy. As a consequence, the training reached a very small error more quickly and terminated before the weights were actually sufficiently trained.
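As a rough sketch of the mechanism discussed here, one can think of the target final Csc states as being adapted during training in the direction of the backpropagated error, scaled by ψv. Note that this update rule is an illustrative assumption derived from the description above, not the exact equations of the model.

    import numpy as np

    def update_csc_targets(csc_targets, delta_csc, psi_v=0.0005):
        # Shift the (otherwise free) target Csc states by the error signal delta_csc,
        # scaled by the self-organisation forcing parameter psi_v (assumed update rule).
        # With a large psi_v the targets change quickly and, as observed above,
        # tend to approach zero early in the training.
        return np.asarray(csc_targets, dtype=float) - psi_v * np.asarray(delta_csc, dtype=float)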

Table 6.3: Parameter variation of self-organisation forcing in visual perception.

Parameter                        Values
Self-organisation forcing ψv     {1, 2, 5} · 10^−k, k ∈ {2, 3, 4}

11 For the parameter ψv = 0.00005 and the shown example, the first two components explain 68.62% of the variance in the patterns.

Figure 6.6: Effect of the self-organisation forcing mechanism on the development of concept patterns in the so-uniMTRNN model: mean mixed F1-score (a) and edit distance (b), mean of average and relative pattern distances (c), and training effort (d), each with intervals of the standard error over 100 runs and over the varied ψv respectively; representatively developed Csc patterns (e–g) reduced from |ICsc| to two dimensions (PC1 and PC2) and normalised for selected parameter settings of no, "good", and large self-organisation forcing respectively. Different shapes and colours are shown with different coloured markers (black depicts 'position' utterance).

Uncertainty in Visual Perception

While inspecting the recorded data for the dynamic visual perception, it was found that the represented features were nearly identical in each frame. Apparently, the combination of the developed method for visual object perception (compare chapter 3.3) and the (visually) low-noise conditions in the environment for the data recording led to a particularly coherent feature representation of the visual shape, colour, and position characteristics. To study how the semantic context abstraction changes under altered morphology or general perturbation of the sensory input, the training of the uniMTRNN model was also performed with noise added to that input. As the noise model, Gaußian jitter12 was used. The parameter variation for increasing noise σv is provided in table 6.4.
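A minimal sketch of such a perturbation is given below, assuming zero-mean Gaussian noise with variance σv added independently to every feature value at every time step, and clipping to a feature range of [0, 1]. The clipping and the value range are assumptions; the jitter itself is the one defined in chapter 5.4.5.

    import numpy as np

    def add_gaussian_jitter(sequence, sigma_v=0.01, rng=None):
        # sequence: visual input of shape (time_steps, n_features), values assumed in [0, 1].
        rng = rng or np.random.default_rng()
        noise = rng.normal(0.0, np.sqrt(sigma_v), size=np.shape(sequence))  # sigma_v treated as variance
        return np.clip(np.asarray(sequence, dtype=float) + noise, 0.0, 1.0)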

From the results, as presented in figure 6.7a–b, we can see that small degrees of noise slightly facilitate the performance. For a noise level up to σv = 0.01 the average mixed F1-score reaches 0.631, while the average mixed edit distance decreases to 0.321. Beyond this noise level the performance drops rapidly. In figure 6.7c a visualisation of the noisy input perception is presented, which shows the morphology of shape and colour (omitting position features). This inspection shows that for noise levels larger than σv = 0.01 the shape representations – shown exemplarily for the most distinct shapes, the banana and the apple – get increasingly difficult to differentiate. Similarly, the colour feature changes drastically over the course of the sequence of visual perception, thus leading to considerable confusion. For scenes with dice-shaped (square) or phone-shaped (rectangular) as well as green-coloured objects, a differentiation is particularly difficult. As a result, the MTRNNv learns to abstract similar characteristics that regress to the mean of the respective feature values. With respect to the training effort (not plotted), the variation of noise was in line with related work [247, 306]: small degrees of noise speed up the training slightly, while larger degrees of noise first degrade the performance and then also the convergence time.

Overall, the results show that the model is quite robust with respect to perturbation of the perception as long as the noise does not lead to an overlap of the visual feature patterns. To successfully associate with an internal representation that was formed for the language production, it seems sufficient to differentiate the entities on the available dimensions. Increasing the dimensionality of the features thus could allow for a good scaling-up to different perceived scenes, despite reasonably small perturbations by noise.

Table 6.4: Parameter variation of noise in visual perception.

Perturbation model   Parameter     Values
Gaußian jitter       variance σv   {1, 2, 5} · 10^−k, k ∈ {1, 2, 3, 4}

12 Defined in chapter 5.4.5; also compare chapter 4.3.5.

Figure 6.7: Influence of perturbing visual input sequences by Gaußian jitter on training and generalisation: comparison of mixed F1-score and edit distance for the varied variance parameter σv (a–b) with error bars reflecting the standard error, each over 100 runs respectively; comparison of visualised input perception with added noise (c) for arbitrarily but representatively chosen time steps (4, 8, and 12) and scenes (red banana and yellow apple, omitting position).