
This study is about an approach that formulates an explicit mapping from continuous acoustic parameters to discrete and abstract phonological entities. The method is implemented in a computer program and uses a linguistic theory about the underlying structure of prosody in speech. The program is designed to automatically detect the position of prosodic events from acoustic speech signals. Such a program can be of great benefit for the linguist working with large acoustic databases.

It enables the researcher to process unlabeled speech material automatically and systematically. The program can search for specific intonational patterns in a given language, test a theory about the underlying structure of prosody against the acoustic reality, or give language learners visual feedback on their freshly acquired foreign language abilities. Furthermore, the program can be used for labeling prosodic events in a speech synthesis corpus and consequently improve the synthesis quality. Last but not least, there are possible applications in the field of automatic speech recognition.

Prosody is used in speech communication as a supplementary knowledge source, providing information not available from the lexical meaning of the words alone.

Prosodic features are variations in pitch, length, loudness, and rhythm during a stretch of speech. Traditionally the term 'prosody' was used to refer to the characteristics and analyses of verse structure. In the present study the analysis of prosody encompasses two 'worlds': on the one side is the physical world, including the acoustic speech signal and its measurable entities fundamental frequency, duration, and intensity.[1] On the other side is the abstract world, including the perceived entities of pitch, length, and loudness as well as linguistic representations that are assumed to play a crucial role in the process of speech understanding.

[1] These three entities are all physically measurable, each having a unit and a fixed definition of how to extract it from a waveform (acoustic speech signal). The units are: fundamental frequency or F0, measured in Hertz [Hz]; duration, measured in milliseconds [ms]; and intensity, measured as RMS (root mean square) amplitude [Pa = Pascal] or decibel RMS amplitude [dB RMS]. See e.g. Reetz (1999, p. 19 ff) for more detailed information about these parameters.
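As a concrete illustration of how the intensity measure in this footnote can be read off a waveform, the following minimal Python sketch computes the RMS amplitude of a window of samples and its decibel value. The calibration of the samples in Pascal and the 20 µPa reference pressure are assumptions made for the example, not requirements stated in this study.

```python
# Minimal sketch: intensity as RMS amplitude and as dB RMS.
# Assumes `samples` holds waveform samples calibrated in Pascal;
# 20e-6 Pa is the conventional reference for sound pressure and is
# an assumption of this example, not part of the study's method.
import numpy as np

def rms_amplitude(samples: np.ndarray) -> float:
    """Root mean square amplitude of a waveform window [Pa]."""
    return float(np.sqrt(np.mean(samples ** 2)))

def db_rms(samples: np.ndarray, reference: float = 20e-6) -> float:
    """RMS amplitude expressed in decibels re `reference` [dB RMS]."""
    return 20.0 * np.log10(rms_amplitude(samples) / reference)
```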

Both 'worlds' are connected in speech recognition and understanding. Utterances are expressed with variations in frequency, duration, and loudness, and these units are the conveyors of information, ideas, instructions, etc. However, to become information the physical parameters have to be interpreted by a listener, and it is a common observation that obviously different acoustic signals can be interpreted by listeners as conveying the same information. For instance, the word "information" uttered by a male and a female speaker in the same context may show clearly different individual acoustic properties, like segment durations, energy contours, F0 movements, etc., but both realizations are usually easily interpretable by human listeners as conveying the same "information". Abstraction from measurable acoustic parameters towards meaningful units is a process that is not easily manageable by machines. This everyday experience is still a controversial subject in the fields of linguistics and automatic speech recognition. The present study focuses on a part of these processes, namely the extraction of prosodic information from the acoustic signal (cf. figure 1.1). The mentioned parameters are the most important for the perception of prosodic events, but additional parameters may contribute as well, as is symbolically expressed by the unfilled boxes in figure 1.1. One of these additional parameters could be, for instance, the formant values, which are the most dominant acoustic correlates of perceived phoneme quality.

This study explicitly uses the term 'prosodic cues' in its title to state that not only variations in F0 are taken into consideration, but also variations in duration and intensity. Although the term 'intonation' is often used interchangeably with 'prosody',[2] it usually refers solely to variations in pitch and subsequently only to variations in F0. Here both terms will be used interchangeably, but when terminological differences appear they will be mentioned.

Intonation is used in communication to express differences of expressive meaning (e.g. happiness, surprise, anger). It is also very important for the naturalness of language, which is of course most obvious in speech synthesis systems.[3] Besides these aspects, intonation serves a grammatical function, distinguishing one type of sentence from another. Thus, a phrase like "Hundred Euro", said by a cashier behind the counter when one has to pay for something that is worth the price, like a DVD player or the newest book about the latest linguistic model, usually begins with a high or medium pitch and ends with a lower one (i.e. a falling melody) and is a simple request. In contrast, "Hundred Euro?", said in response to the same request but for paying something whose value is far below the price demanded, like a bag of popcorn or two lollipops, will usually be expressed with a rising melody (ending in a high pitch) or even a rise-fall-rise melody and increased emphasis, and indicates a surprise question (see also 2.3 and Ladd 1996, p. 43 ff for the discussion of a rising-falling-rising tune). Additionally, these melodies may be used for different purposes in different languages, that is, they are language dependent.

[2] See e.g. Hirst & Di Christo (1998, p. 3 ff) for a more detailed discussion of this terminological problem.

[3] The present study focuses on the automatic analysis of prosody and therefore does not explicitly deal with aspects of prosody for speech synthesis purposes.


Figure 1.1: Depiction of the physical and perceptual levels in the process of intonation perception. Acoustic features are extracted from the acoustic speech signal and related to perceptual dimensions. There is no one-to-one relation between the physical level and the perceived entities. The most dominant relations are marked with thicker lines. Unfilled boxes indicate additional parameters not already depicted.


This example shows that the same sentence can be expressed with different intonational tunes.[4] In phonology a tune is usually characterized as a structured sequence of abstract intonation labels and is associated with a functional aspect.[5] Each of these tunes could have consequential influences on the interpretation of the sentence. The other way around, the same tune can be overlaid on many different sentences (as will be shown in 2.1). Therefore intonation conveys information additional to the selection of words and their lexical meaning, marking communicative purposes like asking a question, emphasizing a specific word or a part within a sentence, structuring the speech in specific ways, or simply sounding funny, humorous, depressed, etc. One of the tasks in linguistic modeling is to set up a satisfying description of a specific subset of intonational phenomena, namely those which do not express some sort of paralinguistic interpretation.[6]

[4] When terms are introduced for the first time they are written in italics.

[5] The labels are called 'abstract' because they are not exactly defined in terms of concrete quantitative limits but are thought of as covering a wide range of acoustic events that build a distinct perceptual class from another abstract label. A specific notational system that describes the structure of tunes is presented in chapter 3.2.1. "[...] tunes are linguistic entities, which have independent identity from the text. Tunes and texts cooccur because tunes are lined up with texts by linguistic rules." (Pierrehumbert, 1980, p. 19).


A linguistic model should be able to explain explicitly the underlying processes and structures in the recognition process. Therefore a purely acoustically based analysis can only give very limited insights. This is reflected by the problems of automatic speech recognition systems in dealing with acoustic variation without including a model of the underlying structure of a given language. With respect to the automatic recognition of prosodic patterns, this means that a purely acoustically based analysis system could achieve only a limited recognition of principally different prosodic patterns. In this thesis the working hypothesis is that the acoustic analysis is the 'igniting device' for a general process of 'prosody recognition'.

The whole process crucially involves the formative influence of a predefined or acquired linguistic structure on the acoustic continuum. One of the aims of this study is to uncover the rules of this process and to formalize them. This confronts us with a number of problems, because we have to deal with strong variation in the acoustic parameters, where the source of variation is often unclear or results from a complex interaction of many factors. The approach presented here tries to take the different sources of variation into account and to handle them in an integrated approach to the automatic detection of prosodic events. It has to be stated, however, that this is only a part of the whole process of speech recognition and understanding. A complete system would have to identify the individual segments, syllables, and words as well. Often this segment detection was the only analysis strategy in former (and still in most current) automatic speech recognition systems, and larger units ('supra-segmentals') were not taken into account. However, prosody is incorporated into some automatic speech recognition systems (e.g., Hess et al. 1997; Batliner et al. 2001b).

What is meant by the title of the thesis: “Automatic detection of prosodic cues”?

First of all, what is presented is an "automatic" procedure; that means no hand labeling is involved. During the development of the algorithm, manually labeled data was used only for the acquisition of selection criteria. All steps in the process are executed in a computer program. The input to the program is a speech signal and the output is a set of labels with information about the type of prosodic event and where it appears in time in the speech signal (see figure 1.2).
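To make this input/output relation concrete, the following sketch shows one possible shape of such a label set in Python; the event class, field names, and ToBI-style label strings are illustrative assumptions, not the actual format produced by the program.

```python
# Hypothetical sketch of the detector's output: each detected prosodic
# event is reduced to an abstract label type and a position in time.
from dataclasses import dataclass

@dataclass
class ProsodicEvent:
    label: str   # abstract prosodic label, e.g. a pitch accent type
    time: float  # position of the event in the signal, in seconds

# A possible result for one utterance: speech signal in,
# a time-ordered list of labeled events out (cf. figure 1.2).
events = [
    ProsodicEvent(label="H*", time=0.42),
    ProsodicEvent(label="L-L%", time=1.87),
]
```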

The procedure involves no segmentation of the speech signal into words, syllables, or phonemes before the abstract prosodic entities are determined. It is solely the (sometimes complex) amalgam of the above-mentioned acoustic parameters that is taken into consideration as an initiation of the search for adequate prosodic entities. Both bottom-up (from acoustic to phonological entities) and top-down (from phonological to acoustic entities) processes are involved to determine the abstract

[6] Paralinguistic intonational phenomena are differences of sex, age, social status, sadness, etc. This distinction is drawn to focus on the underlying linguistic structure and not on speaker-individual or task-specific specialties. However, the distinction is not always clear cut. See also the discussion of this subject in Ladd (1996, p. 33 ff).
