3.2.2   Analysis of gesture

Speech often co-occurs with gestures. These are also crucial for interaction and shall be analyzed in the following. Based on Kendon’s (2004) definition, gestures are seen here as deliberate movements with sharp onsets and offsets. They are an “excursion” in that a part of the body (usually the arm or the head) moves away from and back to a certain position. They start with preparation, continue with the prestroke hold, which is followed by the stroke. The last phase is the poststroke hold (McNeill, 2005). The movement is interpreted as an addressed utterance that conveys information. Participants in an interaction readily recognize gestures as communicative contributions. Goldin-Meadow (2003) further differentiates gestures from functional acts (for example, opening a jar). While functional acts usually have an actual effect, gestures are produced during communicative acts of speaking and do not have such effects.
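
To make this phase structure concrete, a minimal sketch is given below of how a single gesture could be represented as a sequence of timed phases for annotation purposes; the class and field names are purely illustrative and are not part of the coding schemes used in this work.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional, Tuple


class GesturePhase(Enum):
    """Gesture phases following Kendon (2004) and McNeill (2005)."""
    PREPARATION = auto()
    PRESTROKE_HOLD = auto()
    STROKE = auto()
    POSTSTROKE_HOLD = auto()


@dataclass
class PhaseSegment:
    phase: GesturePhase
    start: float  # seconds from the beginning of the recording
    end: float


@dataclass
class Gesture:
    """One gesture 'excursion': the body part leaves and returns to a rest position."""
    segments: List[PhaseSegment]

    def stroke_interval(self) -> Optional[Tuple[float, float]]:
        """Return (start, end) of the stroke, the meaning-bearing phase, if annotated."""
        for seg in self.segments:
            if seg.phase is GesturePhase.STROKE:
                return (seg.start, seg.end)
        return None
```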

If gestures are communicative acts, one could wonder why people still produce them when the other person is not present, as has been shown for telephone communication (see Argyle, 1988). The same seems to be true for the data presented here. Even though the robot cannot display gestures and the users might not be sure whether it can interpret the gestures they produce, most participants will be found to gesture (see Sections 4.2.4 and 5.1.2). This is in line with McNeill’s (2005) premise that speech and gesture combine into one system in which each modality performs its own function and the two modalities support each other, or, in Goldin-Meadow’s words:

“We use speech and other aspects of the communication context to provide a framework for the gestures that the speaker produces, and we then interpret the gestures within that framework.” (Goldin-Meadow, 2003, p.9)

That gesture and speech form an integrated system is supported by the observations that gestures occur with speech, that gestures and speech are semantically coexpressive (each type of gesture has a characteristic type of speech with which it occurs), and that gesture and speech are temporally synchronous (Goldin-Meadow, 2003). The integration of gesture and speech takes place quite early in development, even before children begin to combine words with other words.

However, this claim has been debated. According to McNeill (2005), Butterworth and Beattie (1978) are usually cited to make the point that gesture precedes speech because they found that gestures started during pauses of speech. However, gestures did not merely occur during pauses (McNeill, 2005). By far most gestures occur during phonation, even though more gestures occur per unit of time in pauses (the unit of time is the problem, because there are far fewer pauses than speech segments). Therefore, McNeill (2005) supports the synchrony view, which implies that gesture and speech co-occur and that the stroke coincides with the most important linguistic segment. Argyle (1988) and Cassell (2000) also argue in favor of synchrony between words and gestures, and Kendon (2004) likewise supports the synchrony view, claiming that both modalities are produced under the guidance of a single plan.

While being synchronous, gesture and speech do not convey exactly the same information (Goldin-Meadow, 2003). Gestures carry meaning and are co-expressive with speech but not redundant (McNeill, 2005). Even though both modalities express the same idea, they express it in different ways; in other words, co-expressive symbols are expressed at the same time in both modalities. This is because gesture and speech convey meaning differently.

Speech divides the event into semantic units that need to be combined to obtain the composite meaning. In gesture, the composite meaning is presented in one symbol simultaneously (McNeill, 2005). As Goldin-Meadow (2003) puts it, speech conforms to a codified system and gesture does not. Speakers are constrained to the words and grammatical devices of a language, which sometimes fail them. In contrast, gestures make use of visual imagery to convey meaning. The information can be presented simultaneously, whereas in speech the presentation has to be sequential, because language only varies along the single dimension of time, whereas gesture may vary in the dimensions of space, time, form, trajectory, etc. These different systems allow the speaker to create richer information. The degree of overlap (and divergence) between gesture and speech differs. It may well depend on the function of the gesture. Argyle (1975) characterizes the various roles that gesture can play in HHI (express emotion, convey attitudes, etc.), but he gives gesture no role in conveying the message itself.

Gestures have a function not only for the listener but also for the speaker. They are a link from social action to individual cognition, can lighten cognitive load, and may even affect the course of thought (Goldin-Meadow, 2003; McNeill, 2005). This also explains why people gesture even if the other person cannot see the gestures.

Several categorizations of types of gestures that accompany speech have been introduced in the literature. Goldin-Meadow (2003) depicts the categorizations shown in Table 3-3.

Table 3-3. Categorizations of gesture types (Goldin-Meadow, 2003, p.6)

Krauss, Chen, and Gottesman (2000) | McNeill (1992)      | Ekman and Friesen (1969)
Lexical gestures                   | Iconic gestures     | Kinetographic gestures, Spatial movement gestures, Pictographic gestures
Lexical gestures                   | Metaphoric gestures | Ideographic gestures
Deictic gestures                   | Deictic gestures    | Deictic gestures
Motor gestures                     | Beat gestures       | Baton gestures

The categorizations mainly differ in the number of categories that they use (McNeill, 2005). In the following, the categorization of McNeill (1992) shall be introduced briefly. Iconic gestures represent body movements, movements of objects or people in space, and shapes of objects or people. Iconic gestures are rather concrete and transparent. The transparency depends on the speech that comes with the gesture. Metaphoric gestures present an abstract idea rather than a concrete object. Deictic gestures are used to indicate objects, people, and locations in the real world which do not necessarily have to be present. Beat gestures are beats with the rhythm of speech regardless of content. They are usually short, quick movements in the periphery of the gesture space and can signal the temporal locus in speech of something that seems important to the speaker.

However, to McNeill (2005) the search for categories seems misdirected because most gestures are multifaceted (they belong, to varying degrees, to more than one category). Therefore, he proposes to think in dimensions rather than in categories, which would additionally simplify gesture coding. Goodwin (2003) also stresses that one should analyze the indexical and iconic components of a gesture rather than use categories. Even though the categorization of gesture is useful for statistical summaries, it plays a minor role in recovering gesture meaning and function. In order to determine meaning and function, the form of the gesture, its deployment in space, and the context are more important than gesture type.

Gestures can be differentiated into two basic categories: reinforcing and supplementing (Iverson, Longobardi, & Caselli, 1999). Especially deictic gestures, such as pointing and references to objects, locations, and actions, can be reinforcing. The gesture labels what is pointed at, but the referent can be understood from speech alone. In contrast, when the gesture is supplementing, the referent that is pointed at is not clear without the gesture (for example, if someone says “This”, a pointing gesture is needed to clarify what is referred to with the utterance). The same differentiation has been made for iconic gestures, which depict the physical characteristics of an object or action, and manipulative gestures, which are task-based gestures that facilitate the understanding of the action for the learner. Manipulative gestures were coded as reinforcing when the referent was also mentioned verbally. For Italian mothers interacting with their children at the ages of 16 and 20 months, Iverson, Longobardi, and Caselli (1999) found that very few iconic gestures were produced. Their findings replicate earlier findings of studies in the US. Altogether, mothers gestured relatively infrequently, and when they did gesture, the gestures tended to co-occur with speech and were conceptually simple, i.e., few of the more complex metaphoric gestures and few beat gestures were produced. Rohlfing (to appear) also found that most gestures that the mothers produced in her tasks were deictic. Noncanonical relationships triggered the reinforcing function of deictic gestures. Since the robot, like a child, has a very restricted spatial lexicon, it can be hypothesized that most gestures that the users produce are deictic. Based on this assumption, the research on deictics will be introduced in more depth in the following.

“Deixis refers to those linguistic features or expressions that relate utterances to the circumstances of space and time in which they occur.” (Kendon, 2004, p.222)

Thus, deictic gestures connect utterances to the physical setting. The physical world in which the conversation takes place is the topic of the conversation (Cassell, 2000). Pointing is the most obvious way to create deictic references. When pointing, the “body part carrying out the pointing is moved in a well defined path, and the dynamics of the movement are such that at least the final path of the movement is linear” (Kendon, 2004, p.199f.). Commonly post-stroke holds occur. As can be seen in Figure 3-1, Kendon (2004, p.206) identified different pointing gestures in interaction data. Index finger extended is most commonly used when a speaker singles out a particular object.

Figure 3-1. Pointing gestures according to Kendon (Kendon, 2004, p.206)

Mostly the speaker also uses a deictic word (for example, “here” or “there”). Open hand is used in Kendon’s data if the object being indicated is not itself the primary focus or topic of the discourse but is linked to the topic (either as an exemplar of a class, as the location of some activity under discussion, or because it should be regarded in a certain way that leads to the main topic). Deictic words are observed less often with open hands than when the index finger is used for pointing (Kendon, 2004). Open hand palm up is usually used when the speaker presents the object to the interlocutor as something that should be looked at or inspected. Open hand oblique is used when someone indicates an object about which a comment is being made, either about the object itself or about the relationship between the interlocutors and the object (usually the object is a person in this case). Open hand prone (palm down) is used when the speaker refers to an object in virtue of its spatial extent, or when several objects are considered together as an ensemble. Finally, pointing with the thumb occurs when the objects are either to the side or to the rear of the speaker. It is often used when a precise identification or localization of the object is not necessary (because it is known to both interlocutors or has been referred to before). These findings show that the different pointing gestures are used in different situations. Kendon (2004) found that the gestures used in his two samples (Northamptonshire and Campania) were not exactly the same because of the distinct situations and perhaps in part because of cultural particularities.

Goodwin (2003) also describes pointing as a situated interactive activity:

“Pointing is not a simple act, a way of picking out things in the world that avoids the complexities of formulating a scene through language or other semiotic systems, but is instead an action that can only be successfully performed by tying the act of pointing to the construals of entities and events provided by other meaning making resources as participants work to carry out courses of collaborative action with each other.” (Goodwin, 2003, p.218)

Accordingly, pointing can only be understood with respect to the situation, which must contain a minimum of two participants, and pointing is based at least on the contextualization of the following semiotic resources:

“(a) a body visibly performing an act of pointing;

(b) talk that both elaborates and is elaborated by the act of pointing;

(c) the properties of the space that is the target of the point;

(d) the orientation of relevant participants toward both each other and the space that is the locus of the point; and

(e) the larger activity within which the act of pointing is embedded.” (Goodwin, 2003, p.219)

As has been mentioned before, people also gesture when the other person cannot see it. However, as (a) implies, pointing only has a meaning when the other person is present. According to Clark (2003), speakers can point with any body part with which they can create a vector (finger, arm, head, torso, foot, face, eyes). With regard to (b), Goodwin (2003) has shown with a stroke patient who could only speak three words that pointing is also possible without speech. However, the effort is much greater because speech facilitates the pointing by relating the gesture to the space in which it occurs. Therefore, the space (c) is also an important semiotic resource. The semiotic resource (d) points to the importance of the body orientation of the participants and of the objects in space. Pointing can occur together with a postural orientation. Based on the orientation, pointer and addressee form a participation framework (Goodwin, 2003). The framework includes orientation toward other participants and orientation toward specific phenomena located in the environment. How orientation is treated here will be elaborated on in Section 3.2.3. During the pointing activity, participants also frequently perform gaze shifts because they have to attend to multiple visual fields including each other’s bodies and the region being pointed at. The pointers may initially look at the region or object they point at and then at the addressees to make sure that they respond in the correct way, which would be to look at what is being pointed at. Gaze is further discussed in Section 3.2.4. Finally, (e) stresses the influence of the context, which has been discussed in Section 2.1.2.

Based on the assumption that pointing activities are situated and context sensitive, it can be concluded (a) that gesture behavior has to be analyzed with the semiotic resources in mind, i.e., taking other modalities, the situation, and the context into account by assembling locally relevant multimodal packages (Goodwin, 2003), and (b) that it is not advisable to use a predefined set of gestures for the analysis. Rather, a specific description of the gestures that occur in the data used here has to be derived. A first comparative look at the data from the home tour studies and from the object-teaching studies in the laboratory underlines this finding.

It shall be anticipated here that the gestures used to show objects were quite different in the two settings. While in the home tour mainly pointing gestures were used, in the object-teaching study the participants preferred manipulative gestures. Moreover, the gestures used were clearly task-based. While the only task in the laboratory was to teach objects to the robot, the apartment included the additional tasks of guiding the robot and showing rooms. Yet other gestures were used in these tasks. The findings will be discussed in depth in Section 5.1.2.

Here the situatedness of gestures gives reason to introduce another differentiation that has been suggested by Clark (2003). He distinguishes pointing from placing. Like pointing, placing is used to anchor communication in the real world and it is also a communicative act. While pointing is a form of directing attention to the object with gesture and speech, placing-for means to put the object in the focus of attention. Clark (2003) explains placing-for with the example of the checkout counter in a store where customers position what they want to buy and where clerks expect to find the items that they need to put on the bill. These joint construals are usually established by other means than talk, i.e., items are placed in special places. Besides material things, people can also place themselves (Clark calls these self-objects and other-objects). Placing-for follows the preparatory principle and the accessibility principle.

“Preparatory Principle. The participants in a joint activity are to interpret acts of placement by considering them as direct preparation for the next steps in that activity.” (Clark, 2003, p.260)

“Accessibility Principle: All other things being equal, an object is in a better place for the next step in a joint activity when it is more accessible for the vision, audition, touch, or manipulation required in the next step.” (Clark, 2003, p.261)

Accordingly, the act of placement transmits a certain meaning (for example, I place items on the checkout counter in order to buy them) and has the function of making the object more accessible to the senses of the interlocutors. Placing-for can be divided into three phases:

“1. Initiation: placing an object per se.

2. Maintenance: maintaining the object in place.

3. Termination: replacing, removing, or abandoning the object.”

(Clark, 2003, p.259)

In general, the same phases apply to actions of directing-to, but the maintenance phase is very short, in contrast to placing-for actions, where it is continuing. Therefore, placing-for has certain advantages over directing-to: joint accessibility of the signal (everyone has access to the place of the object for an extended period of time), clarity of the signal (the continuing placement makes it possible to resolve uncertainties about what is being indicated), revocability of the signal (placing is easier to revoke than pointing), memory aid (the continuing presence of the object is an effective memory aid), and preparation for the next joint action (the object is in the optimal place for the next joint action). In contrast, directing-to has the advantages of precise timing (some indications depend on precise timing, which is easier with directing-to because it is quicker), of working with immovable and dispersed objects, and of being able to indicate a direction and complex referents. Clark (2003) explains the latter with the example of pointing at a shampoo bottle while saying “that company” and actually referring to Procter & Gamble. This kind of reference cannot be established with placing-for. These advantages indicate that both behaviors are valuable in different situations. The situations in the home tour and the object-teaching studies will be analyzed regarding these aspects in Sections 4.2.4 and 5.1.2 in order to determine when the participants prefer placing-for or directing-to.

Gestures in HRI

The usage of gestures has also been researched in HRI. For example, Nehaniv et al. (2005) propose five classes of gestures that occur in HRI but also mention that often gestures do not fit in one class only:

1. irrelevant (gestures that do not have a primary interactive function) and manipulative gestures (gestures that involve displacement of objects)

2. side effect of expressive behavior (motions that are part of the communication in general but do not have a specific role in the communication)

3. symbolic gestures/emblems (gestures with a culturally defined meaning)

4. interactional gestures (gestures used to regulate interaction with a partner, for example, to signal turns)

5. referencing/pointing gestures (gestures used to indicate objects or locations)

According to Nehaniv et al. (2005), it needs to be considered to whom or what the gesture is targeted (target) and who is supposed to see it (recipient).

As a basis for their HRI research, Otero, Nehaniv, Syrdal, and Dautenhahn (2006) conducted an experiment in which the participants had to demonstrate to the robot how to lay the table, once with speech and gesture and once using gesture only. They used the same categories as Nehaniv et al. (2005) and analyzed which gestures were used and how often. However, the five categories are not specific about the movements that make up the gestures and their meaning. Hence, the applicability to the questions asked here seems limited. Moreover, from their point of view manipulative gestures are part of the category of irrelevant gestures because they are not considered to have a communicative meaning. This might be true for their research questions and their scenario; however, manipulative gestures were found to be highly important by Rohlfing (to appear), who explicitly defines them as facilitating the interaction for the learner in their scenario. Therefore, they might also be important for the data analysis in the following.

Coding of gestures

The question remains of how the gestures should be coded in the data of the user studies. Goldin-Meadow (2003) points out that the gestures need to be identified in the stream of motor behavior. Approaches to describe gestures are often borrowed from the sign-language literature. Accordingly, Goldin-Meadow (2003) describes the trajectory of the motion, the location of the hand relative to the body, and the orientation of the hand in relation to the motion and the body. Another scheme for coding hand orientation, hand position, and gesture phases is described by McNeill (2005, p.274f.).

Based on this information about the movement, one needs to attribute meaning to the gesture (which according to Goldin-Meadow (2003) is the most difficult task). To identify its meaning, the context in which the gesture occurs is important. Gesture meaning needs to be identified in relation to the task at hand. For the analysis presented below, the question arose whether the movement or the meaning of the gestures should be coded. It was decided that when the meaning was not immediately clear, the movement should be coded. However, for gestures with a clear meaning, a lot of time was saved in the annotation process by coding only this meaning.
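
Purely as an illustration of this decision rule, the sketch below shows how an annotation record could carry either a meaning label (when the meaning is immediately clear) or a movement description along the form features mentioned by Goldin-Meadow (2003); all names are hypothetical and do not correspond to the annotation tool actually used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MovementDescription:
    """Form features in the spirit of Goldin-Meadow (2003)."""
    trajectory: str        # e.g. "arc towards the table"
    hand_location: str     # location of the hand relative to the body
    hand_orientation: str  # orientation relative to motion and body


@dataclass
class GestureAnnotation:
    start: float
    end: float
    meaning: Optional[str] = None                    # coded directly if clear
    movement: Optional[MovementDescription] = None   # coded otherwise

    def __post_init__(self) -> None:
        # Decision rule described above: code the meaning when it is immediately
        # clear, otherwise code the movement for interpretation in a second step.
        if self.meaning is None and self.movement is None:
            raise ValueError("annotate either a meaning or a movement description")
```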

Therefore, for the following analyses, conventionalized and unconventionalized gestures are differentiated (Kendon, 2004). Conventionalized gestures are all gestures that can be clearly associated with a meaning in a certain cultural context (for example, raising both forearms with open palms towards the approaching robot is clearly recognized as a “stop” gesture). They could also be called symbols or emblems (Nehaniv et al., 2005). For these gestures it is sufficient to code the meaning. Unconventionalized gestures do not have such an unambiguous meaning attached to them. Therefore, for these gestures the movements need to be annotated in order to interpret the meaning in a second step. The gestures that the participants produced during the teaching tasks in the laboratory and in the apartment were also coded as movements. Even though these gestures have the unequivocal meaning of guiding the attention of the robot to the object, how this is done might make a difference in function, as Clark (2003) and Kendon (2004) have shown. To ease the annotation process, the movements were categorized. In a data-driven approach, typical movements were identified and categorized in both scenarios, resulting in specific coding schemes for the relevant studies (see Sections 4.2.4, 5.1.2.1, and 5.1.2.2). The coding schemes have been checked for interrater reliability.
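
Interrater reliability for categorical coding schemes of this kind is commonly quantified with Cohen’s kappa. The sketch below, which assumes that the two annotators’ labels are already aligned gesture by gesture, illustrates the computation; it is an illustrative choice, not necessarily the statistic applied to the coding schemes reported here.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items with categories."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label by chance.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Example with hypothetical labels for six gestures annotated by two raters:
rater_1 = ["pointing", "stop", "pointing", "holding_out", "pointing", "stop"]
rater_2 = ["pointing", "stop", "holding_out", "holding_out", "pointing", "stop"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.75
```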

The analysis of the gesture codings was guided by the following questions: which gestures were used, i.e., of which gestures did the participants’ repertoire consist for certain tasks; how often and in which situations were these gestures used; for how long were they used; when did the participants switch between gestures; and what was the meaning of the gestures.
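
The frequency and duration questions can be answered directly from time-aligned annotations. The following sketch assumes a hypothetical annotation format consisting of a label together with start and end times in seconds and computes, per label, how often and for how long a gesture was used, as well as how often a participant switched between gesture labels.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Annotation(NamedTuple):
    label: str    # coded gesture type or meaning
    start: float  # seconds
    end: float    # seconds


def gesture_statistics(annotations: List[Annotation]) -> Dict[str, Tuple[int, float]]:
    """Return {label: (count, total duration in seconds)} for one participant."""
    stats: Dict[str, Tuple[int, float]] = defaultdict(lambda: (0, 0.0))
    for ann in annotations:
        count, duration = stats[ann.label]
        stats[ann.label] = (count + 1, duration + (ann.end - ann.start))
    return dict(stats)


def label_switches(annotations: List[Annotation]) -> int:
    """Count how often the participant switched from one gesture label to another."""
    ordered = sorted(annotations, key=lambda a: a.start)
    return sum(1 for prev, cur in zip(ordered, ordered[1:]) if prev.label != cur.label)
```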

Table 3-4. Overview of analyses of gesture

Object-teaching 1 | no analysis of gestures
Object-teaching 2 | statistical analysis of gestures (Section 4.2.4)
Home tour 1       | no analysis of gestures
Home tour 2       | statistical analysis of gestures (pointing gestures, conventionalized and unconventionalized gestures) (Section 5.1.2)