
2.3 Grounded Language Understanding

2.3.1 Meaning via Experience

The importance of context for determining the meaning of a word or of longer expressions was explained in Section 2.1.1. Whilst Section 2.1.1 refers to ‘context’ as ‘context words’, here the notion of ‘context’ is extended to ‘context in the world of experience’. Whilst Section 2.1.1 argues that ‘the meaning of a word is determined by the context it appears in’, here this is extended to ‘the meaning of a word is determined by the context of impressions it was experienced in’. This perspective originates from psycholinguistics.

Psycholinguistics. For humans, language understanding or text understanding incorporates different levels of experience and therefore involves many modalities when interpreting a text within its situational context. In psycholinguistics, or psychology of language, there is evidence for humans understanding scenes and also texts via mental simulations (see Barsalou, 1999; Zwaan and Madden, 2005). From the perspective of psycholinguistics, Fillmore’s frames (Fillmore, 2012, as introduced in Section 2.1.2) are regarded as an approach to ‘capture the structure of situations’ (Barsalou, 2008) in the context of amodal symbol systems (Barsalou, 1999). Referring back to the explanatory example in Chapter 1, humans understand words when knowing what they refer to in their world of experiences (e.g., understanding of the entity ‘dog’ is grounded in the visual, acoustic and haptic modality) and they understand descriptions of whole situations when knowing again what they refer to in their world of experiences (e.g., understanding of the expression ‘running after a barking dog’ is grounded in multiple modalities).

‘As people comprehend a text, they construct simulations to represent its perceptual, motor, and affective content.’ (Barsalou, 2008)


This states that the meaning of a word can be inferred from the situational context in which it appears in the world of experiences. It motivates regarding word meaning as experienced impressions from different sensory modalities. In the field of Natural Language Processing, this perspective is implemented in multimodal approaches to embedding learning that integrate information from multiple modalities; these will be described in Section 3.5. The term ‘multimodal’ has been used with a broad range of interpretations. In the common interpretation, modalities refer to sensory input in humans, such as audio, vision, touch, smell, and taste. Further definitions extend to different communicative channels such as language and gesture, or simply to different ‘modes’ of the same modality (e.g., day and night pictures). Grounding in (human) modalities has had different foci in Natural Language Processing so far; Beinborn et al. (2018) partition these foci into concepts, phrases, and whole sentences, which are reviewed in the following based on their survey.

Grounding Concepts. Modeling semantic relations between concepts is foundational for processing language and for generalizing from known concepts to new ones. Beinborn et al. (2018) review the literature on the grounding of concepts in the field of multimodal Natural Language Processing; the parts relevant to this thesis are summarized in the following.

The quality of concept representations, multimodal as well as unimodal, is commonly evaluated by their ability to model semantic properties, for example relations between concepts. The similarity relation is a basic but still challenging semantic property to be modeled: several similarity datasets exist to compare the performance of uni- and multimodal approaches to learning concept representations, e.g., WordSim353 (Finkelstein et al., 2002), SimLex-999 (Hill et al., 2015), MEN (Bruni et al., 2012), and SemSim and VisSim (Silberer and Lapata, 2014). These datasets contain pairs of words that have been annotated with similarity scores for the two concepts, e.g., journey and voyage are rated by humans as highly similar, whereas professor and cucumber are rated as highly different, according to WordSim353. Similarity is easy for humans to judge; however, describing the difference between two concepts using words alone would take longer than simply looking at two images of them.
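
These datasets are commonly used by correlating the model’s similarity scores with the human ratings. The following minimal sketch illustrates this standard evaluation protocol, computing the Spearman correlation between cosine similarities and human scores; the data structures (a list of scored word pairs and a word-to-vector dictionary) are assumptions of the sketch rather than a fixed interface:

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def evaluate_similarity(pairs, embeddings):
        # pairs: (word1, word2, human_score) triples, e.g., from WordSim353;
        # embeddings: dict mapping each word to a numpy vector.
        human, model = [], []
        for w1, w2, score in pairs:
            if w1 in embeddings and w2 in embeddings:
                human.append(score)
                model.append(cosine(embeddings[w1], embeddings[w2]))
        rho, _ = spearmanr(human, model)
        return rho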

Grounding in perception motivates and requires multimodal concept representations. So far, research and corpus creation have mostly focused on combining the textual and the visual modality to ground concept representations. Still, for dedicated tasks, perceptual information from further modalities has also been explored, e.g., the auditory (Kiela and Clark, 2015, 2017) and the olfactory channel (Kiela et al., 2015).

Multimodal (textual plus visual) concept representations are found to outperform unimodal ones in modeling semantic similarity by evaluation studies of semantic models (Feng and Lapata, 2010; Silberer and Lapata, 2012; Bruni et al., 2014; Kiela et al., 2014) and by comparative studies of image sources and architectures (Kiela et al., 2016).
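
A fusion strategy frequently found in this line of work is mid-level fusion: the textual and the visual vector of a concept are each L2-normalised and then concatenated, optionally with a weight trading off the two modalities. The following is a minimal sketch under these assumptions; the weighting parameter alpha is illustrative, as the cited studies tune the modality trade-off in different ways:

    import numpy as np

    def fuse(text_vec, image_vec, alpha=0.5):
        # L2-normalise each modality so neither dominates by scale,
        # then concatenate with a tunable modality weight.
        t = text_vec / np.linalg.norm(text_vec)
        v = image_vec / np.linalg.norm(image_vec)
        return np.concatenate([alpha * t, (1.0 - alpha) * v])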

However, it remains an open question whether multimodal concept representations contribute to an approximation of human conceptual grounding that is cognitively more plausible. Contradictory results by Bulat et al. (2017a) and Anderson et al. (2017) demonstrate the openness and difficulty of this question: both experiment with scans of human brain activity during the perception of concepts and compare different distributional models for concept similarity. On the one hand, Bulat et al. (2017a) observe that visual information is beneficial for modeling concrete concepts; on the other hand, Anderson et al. (2017) conclude that textual models sufficiently integrate visual properties.

Across a broad range of concept-related tasks, where multimodal studies exist, multimodal approaches seem to be advantageous: multimodal information was successfully integrated for the disambiguation of concepts (Xie et al., 2017) and of named entities (Moon et al., 2018).

Imaginability of Abstract Concepts. Conceptual grounding of concrete words is straightforward as they have a direct reference in sensory experience (e.g., ‘cup’ has an obvious visual correspondent). Building multimodal representations for abstract concepts is more challenging due to the lack of perceptual patterns associated with abstract words (Hill et al., 2014). Along the same lines, Bruni et al. (2014) and Hill and Korhonen (2014) find that multimodal representations are helpful for modeling concrete words, but have little to no impact when evaluating abstract words.

Unseen concepts can be modeled in multimodal space when projected into the representation space based on their textual relations to seen concepts. However, it is questionable whether the information about the textual relation is sufficient to infer relations between abstract concepts in multimodal space. Lazaridou et al. (2015) analyze projected abstract concepts and confirm that concrete objects are more likely to be captured adequately by multimodal representations. Still, they also find illustrative examples of situations or objects which represent abstract words surprisingly well (e.g., freedom can be associated with an image of a revolution scene, or theory with an image of a bookshelf full of books, lexica, and papers).
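
Such a projection can be sketched as a regularised linear map, fitted on concepts for which both a textual and a visual vector are available and then applied to text-only concepts. Ridge regression is used here as one common instantiation and stands in for the various mapping functions in the literature; the function names are illustrative:

    import numpy as np
    from sklearn.linear_model import Ridge

    def learn_projection(text_vecs, image_vecs, alpha=1.0):
        # Fit a linear map from textual to visual space on seen concepts;
        # text_vecs and image_vecs are row-aligned matrices.
        mapping = Ridge(alpha=alpha)
        mapping.fit(text_vecs, image_vecs)
        return mapping

    def imagine(mapping, text_vec):
        # Predict a visual vector for a concept without images (e.g., abstract).
        return mapping.predict(text_vec.reshape(1, -1))[0]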

Grounding Phrases: Abstract versus Concrete. In order to ground phrases, combining the meaning of abstract concepts with that of concrete ones is essential. The most straightforward approach to composing phrases is the extension of concepts (nouns) by adjectives (adjective plus noun). In the following, we summarize the parts of the literature review by Beinborn et al. (2018) relevant to this thesis with respect to the grounding of phrases.

With respect to the compositional meaning of adjective-noun combinations involving color adjectives, Bruni et al. (2012) find multimodal representations to be superior to unimodal ones, but difficulties remain regarding literal versus non-literal usage of color adjectives (e.g., green/black cup versus green/black future).

Furthermore, Winn and Muresan (2018) propose grounding comparative adjectives describing colors in RGB space; their approach is able to model unseen colors and comparatives.
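
The underlying idea can be illustrated with a toy sketch in which a comparative acts as a direction in RGB space. The fixed offset below is a deliberate simplification and purely an assumption; Winn and Muresan (2018) learn such mappings from data:

    import numpy as np

    # Hypothetical offset standing in for a learned representation of 'darker'.
    DARKER = np.array([-40.0, -40.0, -40.0])

    def apply_comparative(rgb, offset):
        # Shift a reference colour along the comparative's direction,
        # keeping the result inside the valid RGB range.
        return np.clip(rgb + offset, 0.0, 255.0)

    light_green = np.array([144.0, 238.0, 144.0])
    print(apply_comparative(light_green, DARKER))  # a darker shade of green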

Concerning figurative language, Lakoff and Johnson (1980) argue that abstract concepts can be grounded metaphorically in embodied and situated knowledge. Lakoff and Johnson’s theory of metaphor assumes metaphors to be a mapping from a concrete source domain to a more abstract target domain (e.g., future can be viewed as a place in front of us, which we are approaching or which is flowing towards us).


Turney et al. (2011) implement the theory of metaphor by leveraging the discrepancy in concreteness between source and target term to identify metaphoric phrases.
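
The intuition can be sketched with a concreteness lexicon: a phrase becomes a metaphor candidate when a concrete source term modifies a markedly more abstract target term. The ratings and the threshold below are hypothetical placeholders, not the actual resources or decision rule of Turney et al. (2011):

    # Hypothetical concreteness ratings on a 1-5 scale.
    CONCRETENESS = {"dry": 4.2, "wit": 1.9, "skin": 4.8}

    def is_metaphor_candidate(adjective, noun, threshold=1.5):
        # Flag phrases whose adjective is much more concrete than the noun.
        return CONCRETENESS[adjective] - CONCRETENESS[noun] > threshold

    print(is_metaphor_candidate("dry", "wit"))   # True: metaphorical usage
    print(is_metaphor_candidate("dry", "skin"))  # False: literal usage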

Turney et al.’s approach is in turn applied to adjective-noun combinations: Shutova et al. (2016) and Bulat et al. (2017b) use multimodal models for identifying metaphoric word usage in combinations of adjective plus noun. Their models show that adjectives and nouns used in a metaphorical sense (dry wit) are less similar than words in literal phrases (dry skin).
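
In this setup, metaphoricity is read off the similarity between adjective and noun in multimodal space. The following sketch assumes precomputed multimodal vectors and a similarity threshold calibrated on literal phrases; both are assumptions of the sketch, not details of the cited models:

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def is_metaphorical(adj_vec, noun_vec, threshold=0.2):
        # Low adjective-noun similarity in multimodal space suggests
        # metaphorical usage (cf. 'dry wit' versus 'dry skin').
        return cosine(adj_vec, noun_vec) < threshold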

Taken together, these findings indicate that multimodal compositional grounding is crucial for a more holistic processing of figurative language.

Grounding Sentences. Finally, for the grounding of sentences, we summarize the parts of the literature review by Beinborn et al. (2018) relevant to this thesis.

Multimodal representations of sequences or sentences are crucial when grounding descriptions of actions; following this, Regneri et al. (2013) ground descriptions of actions in videos showing these actions. Studies that require sentence representations go even further in terms of sequence length. Shutova et al. (2017) find promising tendencies regarding the use of multimodal information for the disambiguation of sentences. Still, the underlying compositional principles of combining multiple modalities for sentence comprehension are yet to be understood. Furthermore, interdisciplinary research is required to obtain a deeper understanding of cognitively plausible language processing (Embick and Poeppel, 2015).