
advances of the audio or video channels can be integrated by the listener. The visual and auditory modalities of produced syllables are integrated into a fused percept between an audio advance of 30 ms and an audio delay of 170 ms (van Wassenhove et al., 2007). A general AVI of bimodal syllables is possible at asynchronies of ±150 to ±250 ms, while a significant breakdown in the perceptual alignment might be expected between ±250 ms and ±500 ms (Massaro et al., 1996).
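To make these reported windows concrete, the following minimal Python sketch classifies a given audio-visual offset against the thresholds cited above (van Wassenhove et al., 2007; Massaro et al., 1996). The function name, category labels, and the handling of the boundary values are illustrative assumptions and not part of the cited studies.

    # Illustrative sketch: classify an audio offset (in ms; negative = audio leads,
    # positive = audio lags the video) against the integration windows reported above.

    def classify_av_offset(offset_ms: float) -> str:
        """Rough classification of an audio-visual asynchrony for syllable stimuli."""
        if -30 <= offset_ms <= 170:
            # fused percept window reported by van Wassenhove et al. (2007)
            return "fused percept likely"
        if abs(offset_ms) <= 250:
            # general AVI of bimodal syllables (Massaro et al., 1996: +/-150 to +/-250 ms)
            return "integration generally possible"
        if abs(offset_ms) <= 500:
            # region in which perceptual alignment is expected to break down
            return "integration increasingly unlikely"
        return "no integration expected"

    if __name__ == "__main__":
        for offset in (-60, 0, 120, 200, 400, 600):
            print(offset, "->", classify_av_offset(offset))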

While gesture and speech have not been proven to be causally related, they are strongly connected temporally as well as semantically in production (see Chapter 2). The windows of AVI found for speech-lip asynchronies might thus give a tentative indication of which asynchronies between co-produced speech and gestures viewer-listeners will be able to integrate.

4.4.1 Early observational studies

Eye-tracking research, for example by Gullberg and Holmqvist (1999; Chapter 4.2), has shown that listeners will perceive and even fixate speech-accompanying gestures produced by the speaker. One approach to investigating the informational gain from speech-accompanying gestures is to analyze what happens when speech and gesture contain contradictory information, that is, when they semantically "mismatch". In the common case where both modalities communicate congruent or complementary information, it is hard to tell whether the listener used both or just one channel to gather their desired information – the verbal channel will possibly be the dominant source in most cases (e.g., Gullberg & Holmqvist, 1999; cf. Winter & Müller, 2010). However, when speech and gesture express differing information, for example with regard to position, shape, or direction, a successive retelling by the listener can give an impression of which information they integrated more deeply. Mismatching as a research methodology is quite straightforward, and it will be discussed in more detail with regard to the experiments by McNeill et al. (1994) and Cassell et al. (1999) in the following.¹¹ It is useful to note here that semantic speech-gesture mismatches caused by temporal shifts are considered by some to be a separate category of mismatches – since gestures are taken to be utterance-encompassing, holistic providers of information within the context of this dissertation, this distinction will not be made here. Rather, the impact of temporal asynchronies on the general acceptability of the multimodal utterances will be investigated on the level of audiovisual perception.

¹¹ Both publications discuss the same set of experiments and data.

Speech-gesture mismatch experiments are aimed at showing that gestural information is not only perceived by listeners, but also that information is taken from them. In the studies by Cassell et al. (1999), participants watched video recordings of one of the male authors retelling narrations from Canary Row elicited according to McNeill and Levy (1993). The retellings had been recorded twice so "that 14 target phrases accompanied by gestures were produced once with a normal gesture and once with a gesture mismatched to the content of accompanying speech" (Cassell et al., 1999, p. 8). Semantically matching gestures agreed with the co-produced speech, while mismatching gestures expressed contradictory information regarding the dimensions of space ('anaphor'), for example pointing in the wrong direction, perspective ('origo'), for example agent versus patient (Example 2), or manner, for example 'beckoning' versus 'grabbing' (pp. 9ff.).

Granny sees him and says "oh what a nice little monkey". And then she [offers him a penny].
(a) normal: left hand proffers penny in the direction of listener.
(b) mismatched: left hand offers penny to self.
Example 2: Example of origo mismatch (Cassell et al., 1999, p. 10).

Two groups of listener-viewers were asked to retell 21 utterances in three sets from the videos that included all types of matches and mismatches. In the retellings, which were elicited directly after each set of stimuli, participants conveyed information contained in the narrator's gestures that was not mentioned in speech as well as vice versa. For mismatches, participants also tried to accommodate the semantic conflict in either or both modalities, for example by mentioning both manners, even if they were contradictory (see Example 3). Regardless of congruent or incongruent information (Cassell et al., 1999, p. 20), a high percentage of gestures was integrated into the listeners' retellings (54% manner, 50% origo, 32% object).

narrator speech: "and Granny whacked him one"
narrator gesture: punching gesture
listener retelling: "And Granny like punches him or something and you know he whacks him"
Example 3: Example of mismatch accommodation (Cassell et al., 1999, p. 20).

The stimuli used by Cassell et al. (1999) were partly produced spontaneously and partly acted out, but no manipulation of the narration videos took place. While the mismatched speech-gesture stimuli were not naturally co-produced but performed, they were nevertheless perceived and integrated multimodally by the listeners and could be used to interpret the naturally co-produced speech-gesture utterances.

Commenting on this, Gullberg and Holmqvist (1999) remarked that the experiments by Cassell et al. (1999) provided only "a partial measure of the number of gestures that are ... integrated into the cognitive representation" by the listener (p. 25) due to the nature of the retellings. Indeed, multiple-choice or polarized questions might have been additionally informative regarding the gestural information that might have been integrated by the listeners from the stimuli. However, this would not have been informative regarding natural communicative situations, since conversations usually make do without overly detailed questioning for feedback.

The fact that listeners integrated information shown in a video is also useful for further research into speech-gesture perception, because video editing allows for more fine-grained stimulus manipulation than instructing actors. Holler, Shovelton, and Beattie (2009) relate to the findings of Gullberg and Holmqvist (2006) on the general visual perception of screen stimuli on varying screen sizes compared with real-life interaction. Their analysis of gaze behavior shows no greater semantics-related differences in the three conditions. Instead, Gullberg and Holmqvist do find that "[f]ewer gestures are fixated on video than live, but [that] the transition mainly affects gestures that draw fixations for social reasons" (2006, p. 76). The indirect methodology used by Cassell et al. (1999) provided first intuitions on what listeners integrate in a speech-gesture conversational setting, albeit by using quite possibly unnaturally timed and displayed utterances.

Monitoring listeners with an EEG while presenting them with stimuli is another way to inductively gather information on whether and how they perceive matching or mismatching information from speech-gesture utterances. This methodology has already been briefly discussed above in relation to speech-only stimuli (e.g., Winter & Müller, 2010) and will be expanded upon with regard to speech-gesture utterances in the following Chapter 4.4.2.

4.4.2 ERP studies on gesture cognition

Özyürek et al. (2007) monitored participants' ERPs while showing them videos of spoken sentences and accompanying gestures. Their methodology followed McNeill et al. (1994; Cassell et al., 1999; also Holler et al., 2009) in that the stimuli showed an actor performing previously observed iconic gestures. Özyürek et al. (2007) created semantic mismatches of three different kinds: the verb changed but the original gesture remained, the gesture changed but the original verb remained, or both gesture and verb were changed but were semantically congruent with each other (p. 608). The separately recorded gestures were manually synchronized at the stroke with complementing or conflicting verbs within selected sentences "because in 90% of natural speech-gesture pairs the stroke coincide[s] with the relevant speech segment" (Özyürek et al., 2007, p. 610; after McNeill, 1992); some issues with this presumption have been discussed in Chapter 3.3 regarding lexical affiliation. In the stimuli used by Özyürek et al. (2007), the initial part of the sentence served as the prime and the paired prosodic peak (e.g., pitch accent) and gesture stroke as the target for the ERP. At the point of simultaneous exposure, the listeners showed about the same ERP response to all target stimuli: "In all conditions, the N400 component reached its peak around 480 msec" (p. 612), with or without semantic congruency. The researchers interpreted these homogeneous results to indicate a non-sequential AVI of speech and gesture, that is, that the integration of both modalities might happen in parallel, as has been found in speech-lip research (p. 613).
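For illustration, the following Python sketch shows the kind of analysis such ERP results rest on: epochs time-locked to the target (the paired prosodic peak and gesture stroke onset) are averaged across trials, and the mean amplitude in a window around the reported N400 peak (about 480 ms) is read out. The sampling rate, epoch layout, window boundaries, and the simulated data are assumptions for illustration only and do not reproduce the authors' pipeline.

    # Minimal sketch: trial-averaged ERP and mean amplitude in an assumed N400 window.
    import numpy as np

    SRATE = 500                     # samples per second (assumption)
    N400_WINDOW = (0.350, 0.550)    # seconds after target onset, bracketing ~480 ms

    def mean_n400_amplitude(epochs: np.ndarray, epoch_start: float = -0.2) -> float:
        """epochs: (n_trials, n_samples) EEG at one electrode, time-locked to the target.
        Returns the mean amplitude of the trial-averaged ERP in the N400 window."""
        erp = epochs.mean(axis=0)                         # average over trials -> ERP
        times = epoch_start + np.arange(erp.size) / SRATE
        mask = (times >= N400_WINDOW[0]) & (times <= N400_WINDOW[1])
        return float(erp[mask].mean())

    # Usage with simulated data, e.g., comparing a congruent and an incongruent condition.
    rng = np.random.default_rng(0)
    congruent = rng.normal(0.0, 1.0, size=(40, 600))      # 40 trials, 1.2 s epochs at 500 Hz
    incongruent = rng.normal(0.0, 1.0, size=(40, 600))
    print(mean_n400_amplitude(congruent), mean_n400_amplitude(incongruent))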

The findings by Özyürek et al. (2007) are highly relevant for researching the AVI of speech and gesture in that they further supported, using a methodology much different from that of Cassell et al. (1999; McNeill et al., 1994), that listeners will perceive and process co-speech gestures. As with various other studies, the stimuli, which were recorded using actors, consisted of non-natural and deliberate speech and gestures, but even artificially incongruent speech and gestures were integrated as if they were congruent. This agrees, for instance, with the findings by Cassell et al. (1999), who deduced that gestures are not only registered by the listener but that even 'mismatched' information is taken from them.

Habets et al. (2011) followed up on the experimental setup and findings of Özyürek et al. (2007). They added audio offsets, that is, temporal asynchronies, to the stimuli and expanded on the matter of semantic congruency. Their stimuli representing concrete events, for example connecting, were created by combining video clips with separately recorded stand-alone verbs that had been deemed congruent or incongruent with the gestures by the authors (see Example 4). The channels were either synchronized at the prosodic peak and gesture stroke, or the audio was delayed relative to the video (G before S).

Target gesture: (1) The two fists are placed on top of each other, as if to hold a club, and they move away from the body twice.
Target word (match): Battering
Target word (mismatch): Hurdling
Example 4: Example of stimulus construct used by Habets et al. (2011, p. 1849).

Across brain regions, the stimuli produced similar results in the participants for the synchronized condition as for when the audio was delayed by 160 ms (G before S). The authors concluded from the lack of an N400 effect at an audio delay of 360 ms (G before S) that "gesture interpretation might not be influenced by the information carried by speech" (p. 1852). They also claimed that "speech and iconic gestures are most effectively integrated when they are fairly precisely coordinated in time" (p. 1853). The semantic mismatches triggered significantly higher activity, quite possibly due to more complex AVI processes (p. 1851). For combinations of single words and gestures that did not naturally co-occur, the study by Habets et al. (2011) supported the findings by Özyürek et al. (2007) on incongruent speech-gesture signals. Still, the ERP results did not testify to what happens in complete, naturally co-occurring speech-gesture utterances, and the AVI window for single words with gestures might extend to somewhere between an auditory delay of 160 ms and 360 ms (G before S). It is also not quite clear from Habets et al. (2011) what happens to AVI when the speech precedes the gesture. The authors deduce that "the interpretation of the gesture was fixed before the speech onset" in their study (p. 1852), which would be difficult if the channels were shifted to S before G.
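As an illustration of how such audio offsets can be produced in principle, the following Python sketch delays a mono audio track relative to its unchanged video by padding it with leading silence, as would be needed for conditions like the 160 ms and 360 ms (G before S) delays. The sample rate, array format, and helper name are assumptions and are not taken from Habets et al. (2011).

    # Minimal sketch: shift an audio track later in time while keeping its duration,
    # so that speech starts delay_ms after the originally synchronized point.
    import numpy as np

    def delay_audio(samples: np.ndarray, delay_ms: float, sample_rate: int = 44100) -> np.ndarray:
        """Return the audio delayed by delay_ms, padded with leading silence and
        truncated at the end so overall duration (and video alignment) is preserved."""
        n_pad = int(round(sample_rate * delay_ms / 1000.0))
        silence = np.zeros(n_pad, dtype=samples.dtype)
        shifted = np.concatenate([silence, samples])
        return shifted[: samples.size]

    # Usage: a synchronized track and two delayed versions (G before S).
    audio = np.random.default_rng(1).normal(0.0, 0.1, size=44100 * 2)   # 2 s placeholder audio
    audio_160 = delay_audio(audio, 160.0)
    audio_360 = delay_audio(audio, 360.0)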

Özyürek et al. (2007) showed that semantic congruency was not a factor when the modalities were synchronized at prosodic peak and gesture stroke onset, even when a contextual sentence preceded the critical stimulus. This is compatible with van Wassenhove et al. (2007), who found only a minimal difference of about 30 ms between congruent and incongruent signals in the audio advance that could be integrated. Habets et al. (2011) also investigated "the aspect of semantic integration of gesture and speech" (p. 1846). Since they used artificial speech-gesture pairs (p. 1848), their results can only hint at the integration of naturally co-produced utterances. Also, as in Özyürek et al. (2007), the forced synchrony of the modalities was helpful for an ERP analysis but could have made the stimuli seem even more unnatural. Cutting off the preparation phase of the gestures could also have influenced their results.

Transferring these findings to real-life communicative situations, whether with the modalities in their original production synchrony or not, requires investigating complete, naturally co-produced utterances. The research conducted by Cassell et al. (1999; McNeill et al., 1994) and Holler et al. (2009), for instance, has proven the direct as well as the indirect communicative influence of speech-gesture utterances. Özyürek et al. (2007) and Habets et al. (2011) have supported, using a different methodological approach, that listeners perceive speech-accompanying gestures. The studies conceptualized and conducted within the scope of this dissertation will focus on investigating the relevance of timing in speech-gesture production for the perception of such utterances, under the premise that the semantic cooperation of speech and gestures is independent of timing or semantic congruency. Based on the GP theory and how it fits into the cycle of speech-gesture production and reception, the SP hypothesis will be further specified against the background of the previous findings on the perception of speech-gesture utterances in the following Chapter 4.5. The details of the methodologies with which the hypothesis will be tested will be introduced in Chapter 5.