
Cross-Modal Interaction at Word and Sub-Word Levels

thus generated do not truthfully represent the sensory input anymore; they are indeed sensory illusions created by our brain to satisfy the overall cognitive goal of reducing the perceptual conflict that arises from the incompatibility of the sensory inputs. Classic examples of this type of cross-modal conflict resolution by multi-sensory integration are visual capture phenomena such as the ventriloquist effect or the Shams illusion. In the ventriloquist effect, the presence of a dominant visual stimulus influences the spatial localisation of a co-occurring auditory stimulus (e.g., Bertelson and Aschersleben, 1998). In the Shams illusion, the perceived number of visual stimuli is modulated by a co-occurring auditory stimulus (Shams et al., 2002).

In representational modalities, cross-modal integration effects do not occur as part of sensory processing but during the subsequent stages of interpreting already classified symbolic input. To achieve cross-modal integration, an interpretation is generated in which the information from the different modalities is unified into a coherent overall interpretation. As an example, consider a situation in which a deictic pronoun is used in the linguistic modality and a potential referent can be inferred from a pointing gesture in the process of visual understanding. If the properties of the identified referential candidate are compatible with the referent properties expected based on the pronoun, then the integrated interpretation will treat the deictic pronoun and the pointing gesture as co-referential. If visual understanding provides several referential candidates that give rise to equally acceptable interpretations, further referential disambiguation may be required.

If the interpretations of the entities from visual and linguistic processing are incompatible, e.g., because of an apparent number or gender disagreement of the deictic pronoun with the referential candidate pointed at, an alternative interpretation of the multimodal information needs to be found which removes – or at least minimises – these conflicts. Cognitive strategies for conflict resolution can be to initiate a visual search for an alternative referent or to re-analyse the linguistic input in search of an alternative, compatible interpretation (e.g., Spivey et al., 2001).

If no acceptable interpretation can be found, alternative communicative or perceptual strategies may be triggered, depending on which modality's input appears more reliable. These alternative strategies can be an attempt to either disambiguate the linguistic input, e.g., by means of clarification questions, or to improve the quality of cross-modal perception, e.g., by modification of the visual perspective.
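
The following sketch illustrates this match, re-analyse and fall-back procedure in plain Python. The Candidate class, its feature names and the example objects are hypothetical illustrations introduced here for clarity only; they are not the representations used in the thesis' model.

```python
# A minimal sketch of the match / re-analyse / fall-back procedure described
# above.  The Candidate class, its feature names and the example objects are
# hypothetical illustrations, not the representations used in this thesis.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    number: str  # "sg" or "pl"
    gender: str  # "m", "f" or "n"

def resolve_deictic(pronoun: dict, pointed_at: Candidate, others: list):
    """Integrate a deictic pronoun with a pointing gesture."""
    def compatible(c: Candidate) -> bool:
        return c.number == pronoun["number"] and c.gender == pronoun["gender"]

    # 1. Prefer the candidate singled out by the pointing gesture.
    if compatible(pointed_at):
        return "co-referential", pointed_at

    # 2. Conflict: search the visual context for an alternative, compatible referent.
    alternatives = [c for c in others if compatible(c)]
    if len(alternatives) == 1:
        return "re-analysed", alternatives[0]
    if alternatives:
        return "ambiguous", alternatives  # further referential disambiguation needed

    # 3. No acceptable interpretation: fall back on a communicative strategy.
    return "clarification question", None

# Example: the pronoun expects a singular feminine referent, but the pointing
# gesture picks out a masculine object; a compatible alternative is found.
print(resolve_deictic({"number": "sg", "gender": "f"},
                      pointed_at=Candidate("jug", "sg", "m"),
                      others=[Candidate("cup", "sg", "f")]))
```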

which the colour words were printed. Experiment 2, on the other hand, revealed a substantial increase in response time on ink colour naming for words that denoted a colour different from the ink colour they were printed in. Notably, this interference persisted even with training on the task.

Modern cognitive psychology emphasises the role of attention in the Stroop effect (MacLeod, 1991, p. 187). In the literature, the most common, though not undisputed (MacLeod, 1991, p. 188), explanation for the effect and its inherent asymmetry is the relative-speed-of-processing account. According to this account, words are read and comprehended faster than colours are named. In the Stroop experiments, the two processes compete with each other to trigger a response (response competition). The focus of attention determines which response is desired. Hence, the observed interference between the two processes is larger when the focus of attention is on the completion of the slower process: by the time colour naming is performed, the result of the faster word-reading process is already available. The response to its outcome needs to be suppressed in order to permit the response of the attended-to slower process to come through. Clearly, this suppression is not required when attention is directed to the output of the faster process.

In that case, the attended process returns a result before the slower process has completed, so no inhibition is required.
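
As a toy rendering of this account, the sketch below predicts response times from nothing more than two process latencies and a suppression cost. All numeric values are invented for illustration and carry no empirical weight.

```python
# Toy rendering of the relative-speed-of-processing account.  The latencies
# and the suppression cost below are invented for illustration only and are
# not empirical estimates.

WORD_READING_MS = 350     # hypothetical: reading the word is the faster process
COLOUR_NAMING_MS = 500    # hypothetical: naming the ink colour is slower
SUPPRESSION_MS = 100      # hypothetical cost of inhibiting the competing response

def predicted_rt(attended: str, congruent: bool) -> int:
    """Response time predicted by simple response competition."""
    if attended == "word reading":
        # The attended process is the faster one; nothing has to be suppressed.
        return WORD_READING_MS
    # Attention is on colour naming: by the time it finishes, the word-reading
    # response is already available and must be suppressed if it conflicts.
    return COLOUR_NAMING_MS + (0 if congruent else SUPPRESSION_MS)

for task in ("word reading", "colour naming"):
    for congruent in (True, False):
        label = "congruent" if congruent else "incongruent"
        print(f"{task:13s} {label:11s} {predicted_rt(task, congruent)} ms")
```

As the printed predictions show, the toy model produces interference only when attention is directed to the slower colour-naming process, mirroring the asymmetry described above; it does not, however, predict the facilitation and gradience effects discussed next.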

From the perspective of a cross-modal interaction between vision and language, the relative-speed-of-processing account is somewhat unsatisfactory, as it rests on the assumption that the two processes, word reading and colour naming, occur independently of each other and only differ in the time they require to trigger a response. This account effectively adopts a modular view on processing in the Fodorian sense.1 The relative-speed-of-processing account also cannot explain two important additional observations related to the Stroop effect:

1. The gradience effect of semantic distance upon the strength of the observed Stroop interference reported by Dalrymple-Alford (1972): words that do not denote a colour themselves but are associated with a colour, such as the word sky, produce a stronger interference on the colour-naming task than words that are completely colour-neutral. Their effect is not as strong, however, as that of incongruent colour words proper.

2. Stroop facilitation as reported by Dunbar and MacLeod (1984) and others:

when colour word and ink colour coincide, response times for ink-colour naming are slightly faster than in the control conditions. The observed effect is smaller than the response delays in the incongruent cases, but still has been shown to be statistically significant.

1The modularity of the human language faculty goes back to Fodor (1983). Modules in the Fodorian sense are informationally encapsulated cognitive units that process information individually and in parallel.

The interaction between modules is restricted to an interaction via their input and output, i.e., modules cannot interact with each other in the course of their processing. Modules process their input bottom-up in a strict feed-forward manner such that the higher-level cognitive functions, which Fodor labels central processes, do not influence lower-level processing. Modules process their input automatically, fast and domain-specifically. According to Fodor, each module is associated with a fixed neural architecture and hence exhibits characteristic breakdown patterns.

Successful attempts to model a large number of observations associated with the Stroop effect computationally have been reported (e.g., Roelofs, 2003). However, to date there is no unanimously accepted account of the effect that can explain all related findings listed in MacLeod (1991)1 and Roelofs (2003).

For the purpose of this thesis, suffice it to say that the Stroop effect is the result of an only partly understood complex and asymmetric interaction of reading, visual perception, attention and action at the word level. The interference, facilitation and semantic gradience effects observed in the colour-naming task support the interpretation that, at some stage of visual and linguistic processing, semantic representations arising from different modalities are involved in the cross-modal interaction.

With the advent of eye tracking technology in the early 1970s, the interactions between vision and language have become considerably more accessible to scientific enquiry. The first use of eye tracking technology to study interactions between vision and language was reported by Cooper (1974). Cooper used a camera to monitor eye movements of subjects who were simultaneously exposed to visual stimuli in the form of object depictions and auditory linguistic stimuli. This experimental procedure subsequently became known as the visual-world paradigm.2 Cooper showed that spoken word semantics influenced subjects' fixation patterns on co-present visual stimuli. More specifically, Cooper found that from a selection of nine co-present visual stimuli subjects preferentially fixated those that were either direct depictions of referents denoted by the words presented auditorily or depictions of items semantically related to the words' referents. Cooper concluded that the eye movement patterns are a reflection of the on-line activation of word semantics from speech.

Huettig et al. (2006) point out that Cooper did not control for the type of semantic interaction that gave rise to the observed cross-modal effect. The fixation preference on the semantically related visual stimuli could have arisen from either associative relatedness or genuine semantic similarity. While associative relatedness (e.g., piano and practice)3 does not necessarily link concepts from the same semantic category, semantic similarity holds between members of the same semantic category only (e.g., trumpet and piano)4. The challenge in conducting word-based association task experiments is to disentangle associative relatedness from semantic relatedness. This differentiation becomes important in the light of Huettig and Altmann (2005)'s findings that the degree of conceptual relatedness between concepts activated in vision and language has an effect upon the strength of the influence of word semantics upon fixation patterns. Huettig and Altmann (2005) found that

1This milestone paper provides an extremely detailed and comprehensive review of the first five decades of research on the Stroop effect.

2Cooper had the methodological foresight to realise that this novel technique constituted an experimental paradigm whose "linguistic sensitivity (...) together with its associated small latencies suggests its use as a practical new research tool for the real-time investigation of perceptual and cognitive processes".

3This example is taken from Nelson et al. (1998), a list of association norms for more than 5,000 English word primes and their associated targets. The lists are based on the responses of over 6,000 participants.

4This is a carefully constructed example from Huettig et al. (2006) of a semantically related word pair that is not associatively related according to Nelson et al. (1998).

fixation patterns are influenced by semantically related stimuli – but not by associatively related stimuli. We therefore expect conceptual similarity to play an important role in cross-modal matching, as discussed in further detail in Section 3.7.
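
To make the notion of conceptual similarity in cross-modal matching concrete, the toy sketch below ranks co-present depicted objects by their feature overlap with the concept activated by a spoken word. The hand-coded feature sets and the overlap measure are hypothetical placeholders for whatever semantic representation actually mediates the matching; they are not part of the model developed in this thesis.

```python
# Toy sketch of conceptual similarity mediating cross-modal matching in a
# visual-world setting.  The hand-coded feature sets and the overlap measure
# are hypothetical placeholders for whatever semantic representation actually
# mediates the matching (cf. Requirement R1 below).

depicted_objects = {
    "piano":   {"artefact", "instrument", "music", "keyboard"},
    "trumpet": {"artefact", "instrument", "music", "brass"},
    "hammer":  {"artefact", "tool"},
}

def conceptual_similarity(a: set, b: set) -> float:
    """Feature overlap (Jaccard index) as a stand-in similarity measure."""
    return len(a & b) / len(a | b)

def fixation_ranking(word_concept: set):
    """Rank co-present depicted objects by similarity to the activated word concept."""
    return sorted(((name, round(conceptual_similarity(word_concept, feats), 2))
                   for name, feats in depicted_objects.items()),
                  key=lambda pair: pair[1], reverse=True)

# Hearing "piano": its depiction ranks first, the semantically similar trumpet
# second, and the unrelated hammer last.
print(fixation_ranking({"artefact", "instrument", "music", "keyboard"}))
```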

The findings of Cooper (1974) and Huettig and Altmann (2005) support the view that the interaction between vision and language is mediated by a representation of linguistic meaning. We formulate this as modelling requirement R1.

Requirement R1

In a model for the interaction between visual context and linguistic under-standing, the cross-modal interaction must be mediated by a representation of linguistic meaning.

Another famous and frequently cited interaction between vision and language is the McGurk effect. We briefly discuss it here to make clear why we disregard it in the collection of requirements for our model. In the McGurk effect, the visual perception of lip movements and lip shapes interacts systematically with the auditory perception of concurrently presented phones (McGurk and MacDonald, 1976, 1978). In their classical experiment, McGurk and MacDonald auditorily presented subjects with the phone /ba/ dubbed onto a video of a mouth producing the phone /ga/.

Subjects reported hearing the phone /da/. The McGurk effect hence occurs at the level of individual phones, i.e., at sub-word level. Exploiting the systematicity of the cross-modal effect, Massaro and Stork (1998) report the illusion also to occur for larger phonetic units such as an entire sentence: exposing subjects to /My bab pop me poo brive/ auditorily and /My gag kok me koo grive/ visually induced the auditory percept of My dad taught me to drive.1 The perceived auditory percept can be predicted simply by concatenating the individual cross-modal interactions at phoneme level. The McGurk effect has been studied extensively and is observed robustly over a wide range of languages and conditions, such as speaker-gender incongruence between the visual and auditory modalities.
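
The following deliberately reduced sketch illustrates this concatenation view: a lookup table of phoneme-level fusions is applied segment by segment to two aligned strings. The character-level alignment and the tiny fusion table are simplifications introduced purely for illustration and do not reflect an actual model of speech perception.

```python
# Reduced illustration of the concatenation view: a lookup table of
# phoneme-level McGurk fusions is applied segment by segment to two aligned
# strings.  The character-level alignment and the tiny fusion table are
# simplifications made purely for illustration.

FUSION = {
    ("b", "g"): "d",  # auditory bilabial + visual velar -> alveolar percept
    ("p", "k"): "t",  # voiceless counterpart of the same fusion
}

def fuse_segment(auditory: str, visual: str) -> str:
    """Fuse one aligned segment pair; identical segments pass through unchanged."""
    return auditory if auditory == visual else FUSION.get((auditory, visual), auditory)

def fuse_utterance(auditory: str, visual: str) -> str:
    """Apply segment-wise fusion to two equally long, character-aligned strings."""
    return "".join(fuse_segment(a, v) for a, v in zip(auditory, visual))

print(fuse_utterance("bab", "gag"))      # -> "dad"
print(fuse_utterance("brive", "grive"))  # -> "drive"
```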

The important difference between the McGurk effect and the interactions between vision and language observed in Stroop's experiments is that the McGurk effect is based on an early interaction between vision and the auditory perception of speech. The McGurk interaction affects the perception of phones rather than any later – and cognitively higher – stages of processing that involve language comprehension. This view is consistent with the observation that the McGurk effect also occurs with single syllables, pseudo-words and non-words, none of which have a meaning or semantic representation that could form the basis of this interaction.

The McGurk effect has widely been interpreted as a bottom-up integration of incongruent cross-modal stimuli. More recent studies suggest, however, that top-down effects in the form of sentence context and word semantics can also modulate the strength of the effect (Windmann, 2004; Ali, 2007). The question whether the influence of vision upon audition in the McGurk effect is due to a bottom-up integration

1Massaro and Stork also showed that the corresponding unimodal stimuli on their own were unintelligible: the majority of the subjects gave an accurate phonemic description of the nonsensical audio input and were unable to extract a meaning by lipreading the video when either stimulus was presented in isolation.

in stimulus identification, due to an expectation-modulated interaction in stimulus discrimination, or a combination of both, has not been answered conclusively.1 In analogy to the combination of bottom-up and top-down processes believed to operate during visual object recognition, we hypothesise that the McGurk effect also results from a convergence of bottom-up and top-down processes acting in parallel.

In the context of this thesis we classify the effect as a primarily sensory phenomenon which can experience top-down modulation under special conditions. The robustness of the effect in the absence of expectation- or knowledge-driven top-down effects further supports the interpretation in terms of a bottom-up integration. As such, we choose to exclude it from further consideration in our model of the influence of visual context understanding upon linguistic processing.

Summarising the cross-modal interactions between vision and language at word and sub-word levels, we can say that the Stroop effect and Cooper's visual-world experiments provide convincing evidence for the involvement of a semantic representation in the cross-modal interaction between visual and linguistic processing.

Cooper's experiments suggest that the interaction between modalities is such that visual processing aims to identify entities in the visual context which are conceptually related to the concepts activated linguistically. Huettig and Altmann refined this view, showing that only semantic relatedness gives rise to the effect. The observations of the Stroop effect suggest that the degree of conceptual overlap between the concepts processed in each modality has an effect on the ease with which certain tasks can be performed. For tasks exhibiting a Stroop effect, conceptual congruence results in task facilitation and conceptual incompatibility results in interference.

The following section investigates the effect of non-linguistic information obtained from visual understanding upon the processing of more complex linguistic structures such as phrases and entire sentences.