

6.1 Previous Studies on Grounding in Dynamic Perception

With the insights from the previous model of a Multiple Timescale Recurrent Neural Network (MTRNN), extended for embodied perception (the embMTRNN model¹), we are able to describe language acquisition in a small and static environment. We learned that the recurrent connections can self-organise for the task of producing speech and that the timescales in information processing seem crucial for language.
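As a rough illustration of the timescale principle, the following sketch shows the leaky-integrator update that distinguishes fast from slow MTRNN units; the layer sizes, time constants, and random input are invented for this example and are not the settings used in chapter 5.

```python
import numpy as np

def mtrnn_step(u, x, W, tau):
    """One continuous-time RNN update with per-unit time constants.

    u   : internal (membrane) potentials of the context units
    x   : current input to the units (e.g. from other layers)
    W   : recurrent weight matrix
    tau : per-unit time constants; small tau -> fast units,
          large tau -> slow units (the MTRNN timescale idea)
    """
    y = np.tanh(u)                                          # unit activations
    return (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ y + x)

# Illustrative example: 10 fast units (tau = 2) and 4 slow units (tau = 70)
n_fast, n_slow = 10, 4
tau = np.concatenate([np.full(n_fast, 2.0), np.full(n_slow, 70.0)])
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_fast + n_slow, n_fast + n_slow))
u = np.zeros(n_fast + n_slow)
for t in range(100):                                        # slow units integrate
    u = mtrnn_step(u, rng.normal(size=u.shape), W, tau)     # over long horizons
```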

By refining the architecture for processing dynamic visual perception, auditory perception (comprehension), and multi-modal perception, we can take a richer and more realistic environment and interaction into account. To achieve this, we will adopt additional principles as discussed in chapter 2.1 as well as insights from previous studies in the respective directions.

¹ Compare chapter 5.

6.1.1 Integrating Dynamic Vision

Models for grounding in dynamic vision are supposed to capture the alteration of perceived objects, ranging from changes in morphology caused by external conditions up to motion caused by self-induced manipulation. Due to the large complexity, models were often based on a decoupled preprocessing or a simplification of the visual stream to achieve a feasible level of coherence in the visually perceived features.

For example, Yu developed a model that coupled lexical acquisition with object categorisation [305]. The model learns word-meaning associations from visual data, which is simplified and clustered into colour, shape, and texture features, and from spoken descriptions consisting of a single word or a small number of words.

In particular, visual and auditory data was recorded from subjects reading from a picture book while looking at its pages through a head-mounted camera. The learning processes of visual categorisation and lexical acquisition were modelled in a closed loop and led to the emergence of the most important associations, but also to the development of links between words and categories and thus to the linking of similar fillers for a role. This development occurred over several iterations in which the probabilities for co-occurrences were adapted and thus bootstrapped a shared representation. Despite the aim of explaining early learning, the words were given as wholes, and therefore it was not tested how combinations of sounds (phonemes) could be composed to cover a visual category. The perception in the visual stream stemmed from unchanging shapes in front of a plain background and was preprocessed towards visual features that reflect little morphology over time.
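The cross-situational bootstrapping idea behind such word-meaning associations can be illustrated with a minimal sketch; the word tokens, visual categories, and scenes below are invented toy data and only mimic the co-occurrence counting principle, not the actual model of [305].

```python
from collections import defaultdict

# Invented toy data: each "scene" pairs spoken words with perceived categories
scenes = [
    (["red", "ball"], {"colour:red", "shape:round"}),
    (["blue", "ball"], {"colour:blue", "shape:round"}),
    (["red", "cup"], {"colour:red", "shape:concave"}),
]

# Count word/category co-occurrences across scenes
counts = defaultdict(lambda: defaultdict(int))
word_totals = defaultdict(int)
for words, categories in scenes:
    for w in words:
        word_totals[w] += 1
        for c in categories:
            counts[w][c] += 1

# Co-occurrence probabilities bootstrap word-meaning associations:
# "red" associates with colour:red, "ball" with shape:round, etc.
for w, cs in counts.items():
    best = max(cs, key=cs.get)
    print(w, "->", best, f"(p = {cs[best] / word_totals[w]:.2f})")
```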

Monner and Reggia modelled the grounding of language in visual object properties [194]. Their model is designed for a micro-language that stems from a small context-sensitive grammar and includes two input streams for scene and auditory information as well as an input-output stream for prompts and responses. The scene input is based on a stream of synthetic object properties in a localist representation, discriminating size, colour, shape, and spatial relation. For the auditory input, a stream of phonemes is fed in via a distributed representation.

For the prompts and responses, the object properties, some relation predicates, and one out of four labels are defined. The predicates and labels are presented to the network during training in a supervised manner, or are partially presented (prompts) and need to be produced correctly (responses) during testing. Between the input and the input-output layer, several layers of LSTM blocks are employed that are able to find statistical regularities in the data. This includes the overall meaning of a particular scene in terms of finding the latent symbol system that is inherent in the used grammar and dictionary. Yet, the fed-in object properties are, in principle, present as given prompts for the desired output responses. Therefore, it could also be the case that the emerging symbols in the internal memory layers are determined or shaped by the prompt and response data and are perhaps less latent. The resulting problem is still complex in terms of combinatorial power, but it remains unclear how we can relate the emergence of pre-defined or latent symbols to the problem of grounding natural language in dynamic sensory information, and eventually to understanding how noisily perceived information contributes.
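The difference between the two input encodings can be illustrated with a minimal sketch; the property inventory and the phoneme feature vectors below are invented for illustration and are not those used in [194].

```python
import numpy as np

# Localist scene encoding: one dedicated unit per property value
properties = ["small", "large", "red", "blue", "square", "round"]
def localist(active):
    v = np.zeros(len(properties))
    for p in active:
        v[properties.index(p)] = 1.0
    return v

# Distributed phoneme encoding: each phoneme is a pattern over shared
# feature units (the features here are purely illustrative)
phoneme_features = {
    "b": [1, 0, 1, 0],
    "p": [0, 0, 1, 0],
    "a": [1, 1, 0, 1],
}
def distributed(phonemes):
    return np.array([phoneme_features[p] for p in phonemes], dtype=float)

scene = localist(["small", "red", "round"])   # e.g. a small red ball
audio = distributed(["b", "a"])               # a stream of phoneme vectors
```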

In sum, the studies show that dynamic vision can be integrated as an embodied sensation if the dynamics of the perception can be reasonably abstracted. For the model, however, it is crucial to control the complexity in perception in order to be able to explain the emerging internal representations.

6.1.2 Speech Comprehension and Speech Production

Models for grounding in auditory perception often describe production and comprehension as a closed loop of speech signals of external and ego origin. These models mostly focus on a certain phase of linguistic comprehension and production competence² to reduce complexity.

Plaut and Kello suggested a model for phonological development from auditory comprehension and articulatory production [218]. In an Elman Recurrent Neural Network (ERNN)-based framework, streams of sound inputs are linked over a recurrent hidden layer to a recurrent phonology layer and from the phonology layer via a hidden layer to an articulation layer. Phonetic sounds entering and leaving the framework are represented with particular precision. The acoustic perception is based on perceptual capabilities of infants and includes formant frequencies, frication, bursts, and loudness as well as the visually perceived jaw openness of the speaker. Articulatory production is defined by oral and facial muscle movements for constriction, tongue height and backness, and voicing. With monosyllabic nouns the framework can be trained to comprehend sounds and produce the same sounds in a closed loop. An important insight from the model is support for the hypothesis that comprehension is a basis for forming phonological representations, which are exploited by production, although sharing representations for acoustic perception and articulatory motor codes might be more complex in the human brain (compare chapter 2.1.2). With a comparable model of reduced complexity, but embedded in a social interaction scheme of communicating agents, Oudeyer showed that a certain speech code of sounds can develop that is comparable to human languages [205]. However, since the studies were limited to monosyllabic words (morphemes), the formation of a semantic concept from sequences of morphemes is not covered.
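A minimal sketch of the layer arrangement described above (sound input over a recurrent hidden layer to a recurrent phonology layer, and from there via a hidden layer to articulation) is given below; the layer sizes and random weights are illustrative only, and the sketch shows just a forward pass, not the training procedure of [218].

```python
import numpy as np

rng = np.random.default_rng(1)
def layer(n_in, n_out):                          # random weights for illustration
    return rng.normal(scale=0.1, size=(n_out, n_in))

n_sound, n_hid1, n_phon, n_hid2, n_artic = 12, 20, 16, 20, 8
W_in, W_rec1 = layer(n_sound, n_hid1), layer(n_hid1, n_hid1)
W_h2p, W_recp = layer(n_hid1, n_phon), layer(n_phon, n_phon)
W_p2h, W_h2a = layer(n_phon, n_hid2), layer(n_hid2, n_artic)

h1 = np.zeros(n_hid1)                            # recurrent hidden state
phon = np.zeros(n_phon)                          # recurrent phonology state
for t in range(30):                              # a stream of acoustic frames
    sound = rng.normal(size=n_sound)             # invented acoustic features
    h1 = np.tanh(W_in @ sound + W_rec1 @ h1)
    phon = np.tanh(W_h2p @ h1 + W_recp @ phon)   # shared phonological code
    artic = np.tanh(W_h2a @ np.tanh(W_p2h @ phon))   # articulatory output
```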

To cover an abstraction on the concept level, Rohde proposed a model for language comprehension and prediction based on a similar ERNN-based framework [236]. The semantic part of the model was trained to abstract the meaning, or “the message”, of a sentence from a set of linguistic propositions, while the comprehension part of the network learned to extract this meaning from a sequence of words, which includes the distribution of the propositions. The network can also be used in the opposite direction, in a way that it can predict the first word for a given meaning and then predict the next words based on the feedback of the previous word and its meaning. The underlying claim of the model is that humans may learn to produce language based on the previously learned capability to formulate predictions as well as the simultaneous comprehension of language. In this architecture, the Recurrent Neural Network (RNN) is used as a statistical tool that can predict a sequence based on training with structured representations (predefined role binding) and does not attempt to capture a self-organisation of comprehension and prediction from temporally dynamic input on the sound level. In a similar architecture, Chang et al. showed for single-clause phrases that structural priming³ facilitates the gradual joint development of both comprehension and production capabilities [46].

² Compare chapter 2.1.3.
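The closed-loop production scheme described for Rohde’s model, where each predicted word is fed back as input for the next prediction, can be sketched as follows; the predict_next function and the toy vocabulary are placeholders for illustration, not the trained network of [236].

```python
def produce_sentence(meaning, predict_next, max_len=10, end_token="<eos>"):
    """Generate words by feeding each predicted word back as the next input.

    meaning      : abstract representation of "the message"
    predict_next : stand-in for a trained network mapping
                   (meaning, previous word) -> next word
    """
    words, previous = [], "<bos>"
    for _ in range(max_len):
        word = predict_next(meaning, previous)   # comprehension-trained predictor
        if word == end_token:
            break
        words.append(word)
        previous = word                          # feedback of the produced word
    return words

# Toy stand-in predictor, for illustration only
toy = {"<bos>": "the", "the": "ball", "ball": "rolls", "rolls": "<eos>"}
print(produce_sentence("BALL-ROLLS", lambda m, prev: toy[prev]))
```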

Overall, research on neural models for the integrated production and comprehension of phrases in natural language is sparse because of the inherent complexity and the unknown dynamics in the human brain (compare chapter 2.1.2). In a recent hypothesis, Pickering and Garrod presume a tight coupling of speech comprehension and production and suggest an interwoven processing of both by means of predictive coding [217]. Currently, the degree and level of interactivity remain unknown and are openly disputed [217, open peer commentary on p. 19ff].

6.1.3 Dynamic Multi-modal Integration

Integrating multiple modalities into language acquisition is particularly difficult, because the linked processes in the brain are extraordinarily complex and, in fact, in large parts not yet understood. For this reason, to the best of the author’s knowledge, there is no model available that describes language processing integrated into multi-modal temporally dynamic perception at full spatial and temporal resolution of the cortex without making difficult assumptions or explicit limitations. However, frameworks were studied that include temporally dynamic perception and thus form a basis for the grounding.

Marocco et al. defined a controller for a simulated Cognitive Universal Body (iCub) robot based on RNNs. Placed in front of a desk, the iCub was used to push an object (ball, cube, or cylinder) and to observe the reaction in a sensorimotor way [185]. While the cylinder was not movable, the cube was slidable and the ball just rolled away. The iCub’s neural architecture was trained to receive a linguistic input before the robot started to push the object. In their empirical results, the authors showed that the robot was not only able to distinguish between the objects via the correct “linguistic” tags, but that, even without getting a linguistic input or a correct object description, it reproduced the linguistic tag by observing the dynamics. Despite the simplicity of the perception in the study, the authors concluded that the meaning of the labels is not associated with a static representation of the object, but with its dynamical properties.
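The role of the linguistic input channel in such a setup can be sketched minimally as follows: the tag occupies its own part of the input vector during training and can be zeroed out at test time, so that the tag has to be recovered from the sensorimotor dynamics alone; the dimensions and data are invented and do not reproduce the architecture of [185].

```python
import numpy as np

tags = {"ball": [1, 0, 0], "cube": [0, 1, 0], "cylinder": [0, 0, 1]}

def build_input(sensorimotor_seq, tag=None):
    """Concatenate a (possibly absent) linguistic tag with sensorimotor frames."""
    tag_vec = np.array(tags[tag], dtype=float) if tag else np.zeros(3)
    return [np.concatenate([tag_vec, frame]) for frame in sensorimotor_seq]

# Invented sensorimotor stream, e.g. arm joint angles while pushing an object
rng = np.random.default_rng(2)
seq = [rng.normal(size=6) for _ in range(20)]
training_input = build_input(seq, tag="ball")   # tag given during training
testing_input = build_input(seq, tag=None)      # tag must emerge from dynamics
```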

Farkaš et al. modelled the grounding of words in both object-directed actions and visual object sensations [77]. In the model, motor sequences were learned by continuous actor-critic learning that integrated the joint positions with a linguistic input and the visually perceived position of an object. These objects were learned a priori in a Feed-Forward Network (FFN) and capture the contour and the colour of objects in the field of view. Both networks for the action sequence and the visual perception project onto an Echo State Network (ESN) for learning a description of the specific action. A specific strength of the approach is that the model, embedded into a simulated iCub, can adapt well to different motor constellations and can generalise to new permutations of actions and objects. However, it is not clear how we can transfer the model to language acquisition in humans, since a number of assumptions have been made. The action, shape, and colour descriptions (in binary form) are already present in the input of the motor and vision networks. Thus this information is inherently included in the filtered representations that are fed into the network for the linguistic description. Moreover, the linguistic network was designed as a fixed-point classifier that outputs two active neurons per input: one ‘word’ for an object and one for an action. Accordingly, the output assumes a word representation and omits the sequential order.

³ Compare chapter 2.1.2.
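A minimal sketch of an Echo State Network of the kind used here for the linguistic description is given below, with a fixed random reservoir and a trainable linear readout; the sizes, spectral radius, and data are illustrative and not taken from [77]. Only the readout weights are adapted, which is what makes this class of models quick to train.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_res, n_out = 10, 200, 6

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # scale spectral radius < 1

def run_reservoir(inputs):
    """Collect reservoir states for a sequence of input vectors."""
    x, states = np.zeros(n_res), []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)                 # fixed, untrained dynamics
        states.append(x.copy())
    return np.array(states)

# Only the linear readout is trained (ridge regression on collected states)
inputs = rng.normal(size=(50, n_in))                  # invented input sequence
targets = rng.normal(size=(50, n_out))                # invented target outputs
X = run_reservoir(inputs)
W_out = np.linalg.solve(X.T @ X + 1e-2 * np.eye(n_res), X.T @ targets).T
prediction = X @ W_out.T                              # readout of descriptions
```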

In a framework for multi-modal integration, Noda et al. suggested integrating visual and sensorimotor features in a deep auto-encoder [202]. The employed time delay neural network can capture features over varying timespans by means of time-shifts and hence can abstract higher-level features to some degree. In their study, the features of both modalities stem from the perception of interactions with some toys and form reasonably complex representations in sequences of 30 frames. Although language grounding was not pursued, the shared multi-modal representation in the central layer of the network formed an abstraction of the perceived scenes with a certain internal structuring and provided a certain robustness to noise.
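The integration scheme can be sketched minimally as follows: time-shifted windows of visual and sensorimotor frames are concatenated and compressed into a shared central layer; the layer sizes are invented and the weights below are random rather than trained, so this only illustrates the data flow, not the model of [202].

```python
import numpy as np

rng = np.random.default_rng(4)
n_vision, n_motor, window, n_shared = 30, 10, 5, 16

def time_windows(frames, window):
    """Stack `window` consecutive frames (the time-delay idea)."""
    return [np.concatenate(frames[t:t + window])
            for t in range(len(frames) - window + 1)]

vision = [rng.normal(size=n_vision) for _ in range(30)]   # invented sequences
motor = [rng.normal(size=n_motor) for _ in range(30)]

n_in = (n_vision + n_motor) * window
W_enc = rng.normal(scale=0.1, size=(n_shared, n_in))      # encoder (untrained)
W_dec = rng.normal(scale=0.1, size=(n_in, n_shared))      # decoder (untrained)

for v, m in zip(time_windows(vision, window), time_windows(motor, window)):
    joint = np.concatenate([v, m])                        # multi-modal input
    shared = np.tanh(W_enc @ joint)                       # shared representation
    reconstruction = np.tanh(W_dec @ shared)              # auto-encoder target
```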