
3.3 Existing approaches about automatic recognition of prosodic events

3.3.7 Verbmobil

In a series of papers¹² connected with the German Verbmobil¹³ project (cf. Wahlster et al. 1997; Wahlster 2000), a prosody module was developed to improve the output of this automatic speech translation system. The authors claim that the system is “the world wide first and so far only complete speech understanding system, where prosody is really used [...]” (Batliner et al., 2000, p. 108).¹⁴ The prosody module interacts with several other modules of the system (syntactic analysis, dialog processing, semantic construction, translation, and speech synthesis) and provides a number of improvements in these modules. One of the main improvements is achieved in the classification of word boundaries, that is, the decision whether a “full prosodic boundary” or one of the other three boundary classes used in the system (intermediate phrase boundary, normal word boundary, agrammatical boundary, e.g. hesitation) appeared after a word (cf. Nöth et al. 2000, p. 523).

¹² Cf., e.g., Kompe et al. 1995; Hess et al. 1997; Niemann et al. 1997; Batliner et al. 1999; Batliner et al. 2000; Buckow et al. 2000; Nöth et al. 2000; Batliner et al. 2001a.

¹³ Verbmobil was intended to provide speech-to-speech translation (e.g. German-Japanese and vice versa) for appointment scheduling dialogs.

¹⁴ The use of prosodic information in automatic speech recognition (ASR) systems was already proposed in 1980 by Lea; cf. also Vaissière 1988; Waibel 1988; Nöth 1991; Kompe 1997.

The concept of the Verbmobil prosody module is guided by the domain it is used within, that is, the speech translation of appointment scheduling dialogs. Therefore, the prosody module takes both the output of the automatic word recognition module (the so-called “word hypotheses graph (WHG)”, ibid., p. 520) and the speech signal as input. This is motivated by the availability of the phoneme classes and the time alignment. The output of the prosody module is a WHG with annotated probabilities for accent, clause boundary, and “sentence mood”. Two basic classes of prosodic features are extracted: (a) acoustic features from the speech signal like F0, energy, and durational features which are provided by the output of the word recognizer, and (b) linguistic features provided by lexicon lookup, for instance syllable boundaries or position of lexical stress. The authors mention that it is still an open issue, especially for spontaneous speech, which prosodic features are necessary for the classification and how they are connected with each other (cf. Nöth et al. 2000, p. 523). No attempt is made to answer this question by phonetic or linguistic analysis; instead, it is handed over to a statistical classifier. Therefore, as “many relevant prosodic features as possible are extracted from different overlapping windows around the final syllable of a word or a word hypothesis” (Nöth et al., 2000, p. 522).¹⁵ By doing so, they end up with 276 features which consider a context of ±2 words. Although the authors also present a study (Batliner et al., 1999) in which they reduced this enormous feature set to 11 features for boundaries and 6 for accents while keeping the recognition rate at a reasonable level, it is questionable whether the large feature set can be motivated on linguistic grounds.
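The idea of a feature vector that covers a context of ±2 words can be illustrated with a minimal Python sketch. It is not the Verbmobil implementation: the function name `context_features`, the representation of each word as a flat list of numbers, and the zero padding at utterance edges are assumptions made purely for illustration, and the overlapping windows around the final syllable described by Nöth et al. (2000) are simplified here to one feature list per word.

```python
from typing import List, Sequence


def context_features(per_word: Sequence[Sequence[float]],
                     index: int,
                     context: int = 2) -> List[float]:
    """Concatenate the feature lists of the words index-context .. index+context.

    per_word: one feature list per word (hypothetical representation).
    index:    position of the word whose final syllable is the reference.
    context:  number of neighbouring words on each side (2 in the text).
    """
    if not per_word:
        raise ValueError("per_word must not be empty")
    width = len(per_word[0])
    vector: List[float] = []
    for j in range(index - context, index + context + 1):
        if 0 <= j < len(per_word):
            vector.extend(per_word[j])
        else:
            # Assumption: positions outside the utterance are zero-padded.
            vector.extend([0.0] * width)
    return vector


if __name__ == "__main__":
    words = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(context_features(words, 0))
```

With a window of five words, the vector length grows as five times the number of per-word features, which gives an impression of how a feature set of this kind can reach several hundred dimensions.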

The reference point for the computation of the prosodic features is the end of a word. Among the features used are: the duration of each syllable nucleus, syllable, and word; for each syllable and word, the minimum and maximum of F0 (normalized to mean F0) and their positions relative to the reference point; and the absolute and normalized maximum energy and their positions relative to the reference point (for more detailed information about the feature set, see Nöth et al. 2000, p. 522-523). Although Kompe (1997, p. 191-193) lists the error rates of the F0 tracker used in the Verbmobil approach and also mentions that the “fine shape of the contour is not very informative with respect to accentuation and boundaries” (ibid., p. 193), there is no further mention of how the approach deals with faulty or microprosodically affected F0 values.
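A rough sketch of how F0 extrema and their positions relative to the word end might be computed is given below. Several details are assumptions rather than facts from the cited papers: the function name `f0_extrema_features`, the input as (time, F0) pairs, normalization by division by the mean F0, and the treatment of frames with F0 = 0 as unvoiced and therefore skipped.

```python
from typing import Dict, List, Tuple


def f0_extrema_features(frames: List[Tuple[float, float]],
                        word_end: float,
                        mean_f0: float) -> Dict[str, float]:
    """Normalized F0 minimum/maximum and their positions relative to word_end.

    frames:   (time in seconds, F0 in Hz) pairs inside the word or syllable.
    word_end: the reference point, i.e. the end time of the word.
    mean_f0:  mean F0 used for normalization (assumed: simple division).
    """
    # Assumption: F0 == 0 marks an unvoiced frame, which is ignored.
    voiced = [(t, f) for t, f in frames if f > 0.0]
    if not voiced:
        raise ValueError("no voiced frames in this segment")
    if mean_f0 <= 0.0:
        raise ValueError("mean_f0 must be positive")
    t_min, f_min = min(voiced, key=lambda p: p[1])
    t_max, f_max = max(voiced, key=lambda p: p[1])
    return {
        "f0_min_norm": f_min / mean_f0,
        "f0_max_norm": f_max / mean_f0,
        # Negative positions lie before the reference point.
        "f0_min_pos": t_min - word_end,
        "f0_max_pos": t_max - word_end,
    }


if __name__ == "__main__":
    frames = [(0.10, 100.0), (0.20, 0.0), (0.30, 150.0), (0.40, 120.0)]
    print(f0_extrema_features(frames, word_end=0.40, mean_f0=125.0))
```

The sketch takes the F0 values at face value; it does nothing about tracker errors or microprosodic perturbations, which is exactly the gap noted above.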

¹⁵ The feature selection procedure is described in detail in Kießling (1997).

To train the statistical classifiers, reference labels are needed; these are provided by perceptually labeled boundary and accent classes (cf. Reyelt & Batliner, 1994). These classes include four different types of word-based boundary labels: B0: normal word boundary; B2: intermediate phrase boundary with weak intonational marking; B3: full boundary with strong intonational marking; and B9: “agrammatical” boundary, for instance hesitation or repair. The four labels for syllable-based accents are: PA: primary accent; SA: secondary accent; EC: emphatic or contrastive accent; A0: any other syllable, not labeled explicitly (cf. Batliner et al., 1999, p. 2315). However, this set of prosodic labels is reduced to a two-way distinction in both cases, that is, whether or not there is a boundary after a word and whether or not the word is accented. Three classes of “sentence mood” are distinguished: statement, question, and continuation rise.

The prosodic boundary labels were not considered sufficient, which encouraged the authors to develop a new labeling scheme which they call “The Syntactic-Prosodic M-Labels” (cf. Nöth et al. 2000, p. 523 and Batliner et al. 1996, 1998). These labels (placed at the end of the relevant domain) mark different prosodic domains like “main/subordinate clause”, “embedded sentence/phrase”, “constituent, marked prosodically” (cf. Nöth et al. 2000, p. 523 ff) and are, among other things, motivated by their inclusion of labels specific to spontaneous speech. A total of nine classes is differentiated, but these are reduced to three main categories for use in Verbmobil. Pattern recognition methods are used to train models on the basis of the manually labeled data. The average recognition rate for the classification of boundary vs. no boundary is 88.3%, and the one for accented vs. unaccented word is 82.6%. These recognition rates could only be achieved when using the whole feature set. However, when using only F0 features, the recognition rate for accents dropped only slightly to 79.4%. Nöth et al. (2000, p. 525) state that F0 and energy are the most important features for boundary classification, whereas for accent classification F0 is the most important feature and, in contrast to boundary classification, more relevant than the energy features. Interestingly, in a later paper the authors changed the order of relevance of these features to the following hierarchy: duration, energy, pauses, F0 (cf. Batliner et al. 2001a).

The WHGs are subsequently annotated with the probabilities for each prosodic class. These probabilities are then used by other modules in the Verbmobil system, namely syntax, semantic construction, dialog processing, transfer, and speech synthesis. In order to make the annotated prosodic information available to the syntax module, a symbol for a clause boundary is introduced “at positions where either a M3 or a B3 boundary is expected” (Nöth et al., 2000, p. 527). The authors show that the number of different readings as well as the parse time is significantly reduced by the use of this prosodic information. The semantic construction module uses information about accents for the interpretation of particles like “noch” (‘still’) to disambiguate competing discourse representation structures. The module responsible for dialog processing has to identify dialog acts like greeting, confirmation of a place, etc. Astonishingly, the dialog act recognition drops significantly (cf. Kompe 1997, p. 283) when the automatically recognized segment boundaries are used. No detailed explanation is given for this drop, except that there was not enough training data. The transfer module handles the translation from German to English. Here, accent information is used to disambiguate the interpretation of particles, and sentence mood information is used to distinguish questions from non-questions. Finally, it was intended to adapt the synthesized output of the system to the voice characteristics of the original speaker. However, the synthesis is only switched to a male or a female voice depending on the input F0 contour. No details are given about the reliability or benefit of this.
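The step of introducing a clause boundary symbol for the syntax module can be sketched as follows. This is a simplified illustration only: a single word sequence stands in for a path through the WHG, and the class `WordHypothesis`, the symbol `<CB>`, and the decision threshold of 0.5 are hypothetical choices that are not taken from the Verbmobil papers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordHypothesis:
    """One word on a path through the graph (hypothetical structure)."""
    word: str
    p_b3: float  # annotated probability of a B3 boundary after the word
    p_m3: float  # annotated probability of an M3 boundary after the word


def insert_clause_boundaries(path: List[WordHypothesis],
                             threshold: float = 0.5,
                             symbol: str = "<CB>") -> List[str]:
    """Insert a boundary symbol after each word where B3 or M3 is expected.

    Assumption: a boundary counts as "expected" when either probability
    reaches the threshold.
    """
    tokens: List[str] = []
    for hyp in path:
        tokens.append(hyp.word)
        if hyp.p_b3 >= threshold or hyp.p_m3 >= threshold:
            tokens.append(symbol)
    return tokens


if __name__ == "__main__":
    path = [
        WordHypothesis("ja", p_b3=0.9, p_m3=0.1),
        WordHypothesis("am", p_b3=0.1, p_m3=0.2),
        WordHypothesis("Montag", p_b3=0.3, p_m3=0.7),
    ]
    print(insert_clause_boundaries(path))
```

A parser that treats the inserted symbol as a terminal only has to consider readings whose clause structure is compatible with the marked positions, which is one plausible way to read the reported reduction in readings and parse time.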

The prosody module in Verbmobil has shown that prosodic annotations can be of great benefit for individual other modules of such a system, like syntax parsing or dialog processing, but there are still a number of open issues regarding the question of how to integrate prosodic information successfully into existing or future ASR systems. In the Verbmobil approach, prosodic information is acquired by using information extracted from the raw speech signal, like F0 and energy, and simultaneously using the information provided by the automatic speech recognition module, represented by word hypotheses graphs. Extensive statistical processing is used to handle the input parameters, as is expressed inter alia by the fairly large feature set (276 features). In some cases, the intended statistical processing of the input data could not even be conducted since the amount of time it required was prohibitive (cf. Nöth et al. 2000, p. 526 and p. 528). Although Verbmobil demonstrated a number of pros and cons of using prosodic information in an automatic speech translation system, some of its conceptual aspects, especially the enormous statistical effort, are questionable.

In document: Automatic Detection of Prosodic Cues (pages 69-72)