
The overall structure of this thesis reflects the structure of our approach and hence breaks down into three main parts: the outline of the model motivation in Part I, the detailed description of the proposed model and its computational implementation in Part II, and the discussion of the experimental results from model validation as well as the summary of the overall conclusions in Part III.

The model motivation in Part I begins with the introduction provided in this chapter, which delineates the thesis topic and defines its topical focus. Chapter 2 reviews the state of the art, both in behavioural research and in computational modelling. We present central publications from the current body of literature on the interaction between vision and language and provide an overview of extant modelling efforts. A small number of more recent modelling implementations are discussed in detail.

An important constraint on our model is the requirement that it be integrable into a more general theory of cognition. To this end, Chapter 3 introduces Ray Jackendoff’s Conceptual Semantics as a theoretical framework which offers an integrated account of the cross-modal interaction between vision and language.

Chapter 4 motivates the use of WCDG, a weighted constraint dependency parser, as the component for linguistic processing in our model. The chapter also outlines the benefits and limitations of approaching natural language parsing as a constraint-satisfaction problem. Chapter 4 concludes our model motivation and the collection of modelling requirements.

Part II provides an in-depth description of our modelling decisions and the implementation-specific aspects of the proposed model. We begin with a detailed description of the functional enhancements to the WCDG parser in Chapter 5. These functional extensions were needed to enable the integration of visual context information into linguistic processing.

Another important aspect of our model is the representation of situation-invariant semantic knowledge and situation-specific visual scene knowledge. We describe our modelling decisions regarding the representation of these types of knowledge in Chapter 6. The chapter also outlines the role of the reasoner in our model and describes the types of inferences it draws.

The PPC is the central component in our model which enables the cross-modal influence of visual context upon linguistic processing. We describe it in detail in Chapter 7. We outline how fundamental cognitive processes in the cross-modal interaction between vision and language, such as grounding and cross-modal matching, are implemented in our model and how visual context information can exert an effect upon linguistic processing.

In Part III, finally, we report the behaviour of our model under various experimental conditions. The capability to perform semantic parsing constitutes a key prerequisite for our model implementation. Chapter 8 describes a pre-experiment in which the coverage of the semantic extension to WCDG’s standard grammar in our model is evaluated on a corpus of unrestricted natural language.

Chapter 9 discusses the first application of our model implementation. The aim of this experiment is to demonstrate that an influence of visual scene information upon syntactic parsing can be enforced in our model. This chapter offers a discussion of the results obtained from enforcing an absolute dominance of visual context over linguistic analysis by integrating contextual information via hard integration constraints.

In the subsequent chapters we report successive refinements to the initial context integration approach. The first improvement consists in turning the context integration constraints into soft constraints on linguistic analysis. Constraint relaxation makes it possible to balance contextual against linguistic preferences such that the absolute dominance of visual context over linguistic analysis is resolved. As a consequence of constraint relaxation, our model can process and diagnose conflicts between linguistic and contextual preferences. The effects of constraint relaxation upon linguistic analysis and syntactic disambiguation are reported in Chapter 10.

Chapter 11 discusses the importance of grounding for the cross-modal influence of visual context upon linguistic processing. In these experiments we relax the assumption that the linguistic and visual modalities provide information of the same degree of conceptual specificity. We investigate the effect upon syntactic parsing that results from integrating conceptually underspecified representations of visual scene context.

Part III of the thesis concludes with Chapter 12, which summarises the central findings and conclusions of this thesis and gives an outlook on future directions of research.

The appendix to this thesis provides additional material to complement the examples given in the argumentative parts of the text. Concretely, it contains the list of all requirements collected, the concept hierarchy used in context modelling, mathematical derivations of some of the more complex formulae quoted, the sentences studied in the experimental runs, all parse trees for the reported experiments, and the empirical data from which the graphs were plotted.

Cross-Modal Interactions between Vision and Language

The scientific investigation of cross-modal interactions between vision and language has been intensifying continually since the first linguistically relevant studies were reported in the 1970s (e.g., Cooper, 1974, 1976; McGurk and MacDonald, 1976, 1978). A comprehensive view of the spectrum of these interactions needs to integrate insights from psycholinguistics, cognitive neuroscience, cognitive psychology, linguistics and cognitive science. It is the purpose of this chapter to provide a phenomenological overview of some of the central aspects of the cross-modal interactions between vision and language. We cite influential empirical reports that form a major source of motivation for the modelling attempt described in this thesis.

In the course of our discussion of the literature we identify relevant requirements for the implementation of a computational model. The empirical observations presented in this chapter are intended to serve as a factual basis that an integrated theory of cognition needs to account for. One such theory will be discussed in Chapter 3.

This chapter begins by establishing the distinction between cross-modal interactions in sensory and representational modalities in Section 2.1. From there we proceed with a focus on the interaction between vision and language and outline cross-modal interaction phenomena at word and sub-word level in Section 2.2. Following the course of historical development in the field, we discuss the findings of some very influential studies on the interaction between vision and language comprehension at the level of linguistically more complex units, such as phrases and entire sentences, in Section 2.3. Section 2.4 reviews investigations aiming to elucidate the nature of the mental representations underlying the cross-modal interaction with language. Section 2.5 provides an overview of existing computational modelling efforts for the cross-modal interaction between vision and language.


2.1 Sensory versus Representational Modalities

For simple auditory-visual stimuli such as combinations of light flashes and beeps, multisensory integration has been reported to commence as early as visual cortical processing, about 46 ms after stimulus onset (Molholm et al., 2002). In comparison, the cross-modal interactions with the cognitively higher levels of linguistic processing, such as language understanding, occur at a much later point in time. EEG studies reveal that specific brain responses to lexical, syntactic and semantic features of linguistic input are observed in the order of magnitude of one to several hundred milliseconds after stimulus onset. These latencies can be accounted for by considering that the linguistic information must first be extracted and decoded from the sensory input via which it has been received in the auditory, visual or haptic modality. Interactions with language understanding hence build on the results of sensory processing and consequently must be temporally posterior to the onset of sensory processing in the sensory input modality.¹ Multisensory integration, on the other hand, occurs during early and cognitively lower-level sensory processing. The empirically observed and significant temporal differences in cross-modal integration responses provide a first indication of the qualitative difference between the cross-modal interactions of purely sensory and linguistic stimuli.

The categorisation of sensory stimulation is performed on the basis of the physical parametrisation of its sensorially detectable properties such as brightness, loudness, pressure, temperature, duration, etc. If the information encoded in the stimulus is non-symbolic in nature, stimulus categorisation results in the formation of a direct link between the internal representation of the stimulus and the conceptual category it activates. If, on the other hand, the stimulus encodes symbolic information, its categorisation results in the identification of the encoded symbol. The retrieval of the symbol’s meaning is a separate process. In contrast to linguistic symbols, which do carry a meaning, non-symbolic percepts have no intrinsic meaning. It is in this respect that cognitive processing of a purely sensory stimulus differs from that of a sensory stimulus which encodes symbols with an intrinsic meaning, such as language. We refer to a modality that encodes and processes the latter type of stimuli as a representational modality. Other, non-linguistic examples of representational modalities are spatial, musical or visual scene understanding. In all of these, low-level sensory perception provides input which, upon categorisation of the encoded symbols, is processed further in higher cognitive processes. We henceforth refer to a stimulus evoking purely sensory stimulation that encodes exclusively non-symbolic information as a sensory stimulus. A stimulus evoking sensory stimulation which encodes symbolic information is referred to as a representational stimulus. A special subset of representational stimuli are linguistic stimuli, in which the encoded information consists of linguistic symbols.

¹ This is not to say, however, that sensory and linguistic processing occur in strict temporal succession; nor do they proceed in complete isolation of each other.
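
The distinction drawn above can be made concrete with a small sketch. It is purely illustrative and not part of the proposed model; the class names, the features and the toy categorisation rules are assumptions introduced for this example only. The non-symbolic route links a percept directly to a conceptual category, whereas the symbolic route first identifies the encoded symbols and only then, in a separate step, retrieves their meaning.

from dataclasses import dataclass

@dataclass
class SensoryStimulus:
    """Non-symbolic stimulus, described only by its physical parameters."""
    brightness: float
    duration_ms: float

@dataclass
class RepresentationalStimulus:
    """Stimulus whose sensory carrier encodes symbols (e.g. letters or phonemes)."""
    encoded_symbols: tuple[str, ...]

def categorise_sensory(stimulus: SensoryStimulus) -> str:
    # One-step route: the percept is linked directly to a conceptual category.
    return "flash" if stimulus.duration_ms < 100 else "steady_light"

def categorise_representational(stimulus: RepresentationalStimulus) -> tuple[str, ...]:
    # First step for symbolic input: only the encoded symbols are identified ...
    return stimulus.encoded_symbols

def retrieve_meaning(symbols: tuple[str, ...], lexicon: dict[str, str]) -> list[str]:
    # ... retrieving their meaning is a separate, subsequent process.
    return [lexicon.get(symbol, "<unknown>") for symbol in symbols]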

Processing a linguistic stimulus results in the categorisation of its sensory input as consisting of discrete² linguistic building blocks or atoms in a temporal sequence. For spoken language, these atoms are the identified phonemes; in reading and touch-reading, they are the individual letters perceived. Combinations of these atoms form arbitrary linguistic symbols, be they morphemes or words, that combine “rulefully” (Harnad, 1990) to make up an utterance. Each of these arbitrary linguistic symbols carries its own meaning that it contributes to the process of evaluating the utterance’s overall meaning. The categorisation of a linguistic stimulus hence gives rise to a discrete symbolic representation.
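
As a purely illustrative aside, the following lines show what such a discrete symbolic representation amounts to: a temporal sequence of atoms is grouped into arbitrary symbols whose individual meaning contributions then enter the evaluation of the utterance's overall meaning. The segmentation, the toy lexicon and the meaning labels are assumptions made for this example and not part of our model.

# Discrete atoms in temporal order, here the letters of a written utterance.
ATOMS = list("thedogbarks")
# Assumed word boundaries; in reality, segmentation is itself a processing step.
SEGMENTATION = [(0, 3), (3, 6), (6, 11)]
# Toy lexicon mapping each arbitrary symbol to a placeholder meaning contribution.
LEXICON = {"the": "DEF", "dog": "DOG(x)", "barks": "BARK(x)"}

symbols = ["".join(ATOMS[i:j]) for i, j in SEGMENTATION]
utterance_meaning = [LEXICON[s] for s in symbols]
print(symbols)            # ['the', 'dog', 'barks']
print(utterance_meaning)  # ['DEF', 'DOG(x)', 'BARK(x)']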

The diverse nature of the information encoded in different modalities, be they sensory or representational in nature, raises the question of whether, and if so how, different modalities can interact with each other at all. An integrated account of cross-modal interaction with language must be expected to provide an answer to this question. The general theory of cognition discussed in Chapter 3 does indeed offer an account of these phenomena.

In the further course of this thesis we refer to an early cross-modal interaction at the stage of sensory processing as multisensory integration. We continue to use the more general term cross-modal interaction for any type of interaction in which two modalities mutually affect each other. For a strictly unidirectional effect of one modality upon another we adopt the term cross-modal influence.

Both multisensory integration and cross-modal interactions between representational modalities serve the purpose of minimising the amount of incompatible information in cognition. How this goal is achieved differs depending on the type of modalities that interact.

In the sensory modalities, multisensory integration produces a single, informationally fused percept from multimodal sensory input whenever possible.³ When the information obtained from the different modalities is mutually compatible, multisensory integration gives rise to superadditive neural response patterns and produces a robust integrated percept of the different sensory inputs. This is observed, for example, in cases where an auditory and a visual stimulus co-occur temporally and spatially within well-defined temporal windows (e.g., Wallace et al., 1998).

² This holds true even if the sensory input via which language is received is encountered as a more or less continuous stream of input. Typical examples are the continuity of human-generated speech or the continuous flow of movements in the production of sign language.

³ A discussion of the boundary conditions under which multisensory integration occurs is beyond the scope of this thesis. Suffice it to say here that certain spatio-temporal constraints apply in order for multisensory integration to occur. Meredith et al. (1987), e.g., investigate the temporal constraints on stimulus co-occurrence required for multisensory integration.

In cases in which the information in the modalities is cross-modally incompatible, sensory processing still attempts to form a single, uniform percept from the sensory input. The physical parameters of that percept are chosen such that the overall perceptual conflict between the modalities is minimised. Interestingly, the percepts thus generated do not truthfully represent the sensory input anymore; they are indeed sensory illusions created by our brain to satisfy the overall cognitive goal of reducing the perceptual conflict that arises from the incompatibility of the sensory inputs. Classic examples of this type of cross-modal conflict resolution by multisensory integration are visual capture phenomena such as the ventriloquist effect or the Shams illusion. In the ventriloquist effect, the presence of a dominant visual stimulus influences the spatial localisation of a co-occurring auditory stimulus (e.g., Bertelson and Aschersleben, 1998). In the Shams illusion, the perceived number of visual stimuli is modulated by a co-occurring auditory stimulus (Shams et al., 2002).
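
One common way of formalising such conflict-minimising fusion, cited here only as an illustration and not as the mechanism proposed in this thesis, is reliability-weighted averaging of the unimodal estimates: the less reliable estimate is pulled towards the more reliable one, which reproduces the visual capture of the perceived sound location in the ventriloquist effect. The function below is a hypothetical sketch under this assumption.

def fuse_locations(visual_deg: float, auditory_deg: float,
                   visual_var: float, auditory_var: float) -> float:
    """Fuse two location estimates, weighting each by its reliability (1/variance)."""
    w_visual = 1.0 / visual_var
    w_auditory = 1.0 / auditory_var
    return (w_visual * visual_deg + w_auditory * auditory_deg) / (w_visual + w_auditory)

# Vision is typically far more precise for spatial location, so the fused percept
# is "captured" by the visual stimulus even though the sound originated elsewhere.
print(fuse_locations(visual_deg=0.0, auditory_deg=10.0,
                     visual_var=1.0, auditory_var=25.0))  # ~0.38 degrees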

In representational modalities, cross-modal integration effects do not occur as part of sensory processing but during the subsequent stages of interpreting already classified symbolic input. To achieve cross-modal integration, an interpretation is generated in which the information from the different modalities is unified into a coherent overall interpretation. As an example, consider a situation in which a deictic pronoun is used in the linguistic modality and a potential referent can be inferred from a pointing gesture in the process of visual understanding. If the properties of the identified referential candidate are compatible with the referent properties expected on the basis of the pronoun, then the integrated interpretation will treat the deictic pronoun and the pointing gesture as co-referential. If visual understanding provides several referential candidates that give rise to equally acceptable interpretations, further referential disambiguation may be required.

If the interpretations of the entities from visual and linguistic processing are incompatible, e.g., because of an apparent number or gender disagreement of the deictic pronoun with the referential candidate pointed at, an alternative interpretation of the multimodal information needs to be found which removes, or at least minimises, these conflicts. Cognitive strategies for conflict resolution can be to initiate a visual search for an alternative referent or to re-analyse the linguistic input in search of an alternative, compatible interpretation (e.g., Spivey et al., 2001).

If no acceptable interpretation can be found, alternative communicative or perceptual strategies may be triggered, depending on which modality’s input appears more reliable. These alternative strategies can be an attempt either to disambiguate the linguistic input, e.g., by means of clarification questions, or to improve the quality of cross-modal perception, e.g., by a modification of the visual perspective.
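
The compatibility check and the fallback strategies described in the last three paragraphs can be summarised schematically as follows. This is a deliberately simplified sketch: the feature set, the class and function names and the decision order are assumptions made for illustration only and do not correspond to the implementation described in Part II.

from dataclasses import dataclass

@dataclass
class Referent:
    name: str
    number: str   # e.g. "sg" or "pl"
    gender: str   # e.g. "masc", "fem", "neut"

def interpret_deixis(pronoun_feats: dict[str, str], pointed_at: Referent,
                     other_candidates: list[Referent]) -> str:
    def compatible(referent: Referent) -> bool:
        # Cross-modal matching: do the referent's properties satisfy the
        # expectations raised by the deictic pronoun?
        return (referent.number == pronoun_feats["number"]
                and referent.gender == pronoun_feats["gender"])

    if compatible(pointed_at):
        # Compatible information: unify into a co-referential interpretation.
        return f"co-referential: pronoun -> {pointed_at.name}"
    for referent in other_candidates:
        # Conflict resolution: search for an alternative, compatible referent.
        if compatible(referent):
            return f"re-analysed: pronoun -> {referent.name}"
    # No acceptable interpretation: fall back on a communicative or perceptual
    # repair strategy (clarification question, change of visual perspective).
    return "trigger repair strategy"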