Method of Parameter Assessment - Automatic Detection of Prosodic Cues

In order to obtain concrete quantitative values for the underlying acoustic parameters, a manually labeled corpus was taken as the basis for the investigation. Although it is important to know that “Manually labeled speech corpora may not be sufficiently consistent for successful training or modeling for recognition or TTS systems” (Syrdal & McGory, 2000, p. 238), such a corpus provides a starting point for estimating the acoustic features of the individual tones as well as their individual selectivity. Furthermore, to reduce inconsistencies between labelers, the GToBI training corpus was chosen because it includes prototypical examples of the individual tones, and probably examples that represent a generally agreed mean description within the given framework. The GToBI corpus has the advantage of providing a reasonable number of examples for each individual pitch accent and boundary tone postulated in the underlying phonological model, and it also provides the acoustic material along with the prosodic label files (Grice & Benzmüller, 1997).² The inventory of pitch accents and boundary tones in the GToBI model is listed in table 4.1 and was discussed in section 3.2.3. For each of these pitch accents or boundary tones there are examples in the accompanying acoustic material of the GToBI training corpus. The quantitative analysis of these tones is described in the next section. Here the goal is to get an overview of the possible acoustic features and to assess reliable quantitative criteria for the individual tones that might later serve as selection parameters during the detection process.

Although the GToBI corpus is solely based on German speech material, the basic method of the ProsAlign algorithm should be usable for any language. Possible adaptations in the acoustic features and necessary adaptations in the underlying phonological inventory have to be made in order to adapt the algorithm to another language.

² The GToBI training corpus is available under http://www.coli.uni-sb.de/phonetik/projects/Tobi-/index_training.html

Chapter 4. ProsAlign 4.2 Parameter assessment

Table 4.1: Inventory of pitch accents and boundary tones in the GToBI model (Grice & Benzmüller, 1997).

When analyzing the acoustic features underlying manually labeled pitch accents and boundary tones, it becomes obvious at an early stage that inspecting the course of F0 alone can only result in limited success with regard to automatic detection, because the F0 movements that indicate individual pitch accents vary drastically and often do not provide sufficient selectivity. Moreover, those pitch movements often cannot be separated from movements that are not associated with pitch accents. Categorically different pitch accents like H* and L* cannot simply be detected by searching for maxima (in the case of H*) or minima (in the case of L*) in the F0 contour. First, these pitch accents are not always associated with maxima or minima. Second, the course of F0 often does not clearly distinguish them. Third, maxima and minima come in different sizes, that is, with steeper or flatter increases or decreases before or after them, or with a larger or smaller number of voiced neighbors, and they are not always associated with pitch accents. Without a clear definition of these parameters (and probably others as well) one can expect only limited detection success.

However, when one starts to define threshold values, for instance regarding the amount of increase in F0 for individual pitch accents, it soon becomes apparent that, on the one hand, the search criteria often overlap considerably when they are made broad enough to achieve reasonable coverage and, on the other hand, the recognition rate is very poor when the criteria are defined too restrictively. In addition, F0 movements consisting of faulty or microprosodically affected F0 values (see footnote on page 36 and section 5.1) might erroneously be taken as F0 movements indicating a pitch accent. Therefore, separating those effects from the linguistically meaningful parts of an F0 contour is an important aspect of an automatic detection procedure. With respect to this problem, the voicing parameter plays an important role, since it allows the location of an individual F0 value within a voiced part to be determined. In addition, knowing that F0 values are most often erroneous or microprosodically affected up to 5 periods from the beginning or end of voicing, the voicing parameter could help to decide (together with other parameters) whether an F0 value is likely to be faulty or not.³
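The role of the voicing parameter described above can be illustrated with a minimal sketch. It is not the ProsAlign implementation. It assumes a frame-based F0 track in which a value of 0 marks an unvoiced frame, and it uses a margin counted in frames as a stand-in for the "5 periods" mentioned in the text. The function name `flag_voicing_edges` and the parameter `margin` are hypothetical.

```python
def flag_voicing_edges(f0, margin=5):
    """Mark voiced frames that lie close to a voicing onset or offset.

    f0     : list of F0 values per frame, 0 meaning "unvoiced" (assumption)
    margin : number of frames from each edge of a voiced stretch that are
             treated as potentially faulty (stand-in for "5 periods")
    Returns a list of booleans of the same length as f0.
    """
    flags = [False] * len(f0)
    i = 0
    while i < len(f0):
        if f0[i] <= 0:          # skip unvoiced frames
            i += 1
            continue
        start = i               # first frame of a voiced stretch
        while i < len(f0) and f0[i] > 0:
            i += 1
        end = i                 # one past the last voiced frame
        for k in range(start, end):
            # Flag frames near the onset or near the offset of voicing.
            if k - start < margin or end - 1 - k < margin:
                flags[k] = True
    return flags


if __name__ == "__main__":
    track = [0, 0] + [100] * 12 + [0]
    print(flag_voicing_edges(track))
```

In a detector, such flags would only be one piece of evidence: as the text notes, the decision whether an F0 value is faulty is made together with other parameters.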

Based on the observations from the manually labeled pitch accents and boundary tones, it became clear that the synchronous course of amplitude, represented by the so-called RMS amplitude, is an important additional criterion alongside the course of F0 and may provide sufficient selectivity for automatic detection.⁴ The course of RMS gives information about increasing and decreasing amplitude values as well as the relative height of its maxima and minima. Although this information is not absolutely reliable (since there is no segmental analysis), it provides important cues about the onsets, offsets and centres of vowels, syllables or words, and subsequently of intonation phrases.⁵

The RMS amplitude (root mean square amplitude) of a stretch of n samples, which is said to be a rough estimate of perceived loudness (Reetz 1999, p. 19), is calculated as follows:

\[
\text{RMS amplitude} = \sqrt{\frac{\text{Sum of all squared elongations}}{\text{Number of elongations}}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2}}
\]

In a concrete implementation this formula is applied within a predefined analysis frame of the speech signal. For example, the get_f0 program from the ESPS/waves-tools calculates RMS values “based on a 30 ms hanning window [...]” (Talkin & Lin, 1997, p. 1).⁶
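A frame-wise RMS calculation of this kind could be sketched as follows. This is only an illustration and not the get_f0 code. The 30 ms frame length comes from the quotation above, whereas the 10 ms step size, the way the window weights are applied to the samples before squaring, and all function names are assumptions made for the example. The window follows the cosine form given in footnote 6, with the index centred on the frame.

```python
import math


def rms(samples):
    """Root mean square of a sequence of sample values (elongations)."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def cosine_window(n):
    """Weights w(i) = 0.5 + 0.5*cos(2*pi*i/n), with i centred on the frame."""
    centre = (n - 1) / 2.0
    return [0.5 + 0.5 * math.cos(2.0 * math.pi * (k - centre) / n)
            for k in range(n)]


def framewise_rms(signal, sample_rate, frame_ms=30.0, step_ms=10.0,
                  windowed=True):
    """RMS per analysis frame.

    frame_ms : frame length (30 ms as in the quoted description)
    step_ms  : frame shift (10 ms is an assumption of this sketch)
    windowed : if True, samples are weighted with the cosine window
               before the RMS is taken (one possible design, assumed here)
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if frame_len < 1 or step < 1:
        raise ValueError("frame and step must cover at least one sample")
    weights = cosine_window(frame_len) if windowed else [1.0] * frame_len
    values = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        values.append(rms([w * x for w, x in zip(weights, frame)]))
    return values


if __name__ == "__main__":
    # A 50 Hz sine sampled at 1000 Hz, one second long.
    sine = [math.sin(2.0 * math.pi * 50.0 * t / 1000.0) for t in range(1000)]
    print(framewise_rms(sine, 1000)[:3])
```

Because the window down-weights the frame edges, the windowed values are smaller than the unweighted RMS of the same frame. Only the relative course of the contour matters for the analysis described here.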

As a consequence of the visual inspection of the possible acoustic features of pitch accents and boundary tones the following three parameters were taken as baseline in the present approach:⁷

³ “The probability that an individual period will be markedly erratic from the trend is highest at a point up to five cycles after the onset or before the offset of voicing.” (Laver, 1994, p. 453). See also Viswanathan & Russel, 1984.

⁴ Interestingly, Batliner et al. (1999) note that F0 features are not more important than energy or duration features in their evaluation of prosodic features for pitch accent and boundary classification.

⁵ Relying solely on RMS features is of course not fully sufficient for those purposes, but they nevertheless provide important clues. The importance of the RMS feature is also reduced when there are strong background noises or other disturbing influences on the RMS contour.

⁶ See also the description of get_f0 in Talkin (1995). A ‘hanning window’ is a specific type of analysis window, also called ‘cosine window’ since it uses a cosine function (\(w(i) = 0.5 + 0.5\cos(2\pi i / N)\)). It is used to focus the calculations on the central part of the analysis window while putting less weight on its edges, which reduces the influence of sudden jumps at these edges (see reference before and Reetz 1999, p. 72).

⁷ Actually, a fourth parameter is not listed explicitly here because it is an inherent feature of the parameters mentioned, namely durational parameters such as the duration of increases or decreases in F0 or RMS, or the duration of voicing before the pitch accent location, etc. “Duration” here does not include phoneme, syllable, or word duration, since these units are not recognized in the presented approach.


1. F0,
2. voicing,
3. RMS amplitude.

Since voicing is a necessary requirement for the extraction of F0, it would be possible to subsume it under the F0 parameter. However, since listing it separately makes the criteria more transparent, it was kept as an individual class. Other parameters like formants or phoneme durations would introduce another source of possible errors in the recognition step, and since phoneme identity is not yet easily and reliably recognizable automatically, no attempt was made to detect segment identity or exact segment boundaries.

A parameter acquisition program was designed with the aim of acquiring quantitative criteria for the subsequent implementation of the automatic detection program. But what exactly should be covered within the acoustic parameters? A first approximation towards an answer to this question was provided by the visual inspection of the acoustic features F0, RMS, voicing and duration, combined with auditory control for several instances of each individual tone. The visual inspection of the course of F0 some distance around the tone location allowed us to think about possible strategies for capturing the F0 movements. The simultaneous auditory control served as a further clue. Although it remains an open question how to integrate the latter, especially without knowing the segmental content, it was basically used for estimating the relative importance of possible features.

Several possibilities are conceivable for capturing the acoustic features of tones, for instance, duration of a voiced stretch, duration of F0 increase, amount of F0 increase, etc. Associated questions are: Where to start or stop the duration measurement of an F0 increase? How to calculate the amount of F0 increase (relative or absolute)? Are there correlations between the three parameters, and how to account for them? What temporal domain should be covered? As a starting point, and as a result of the visual and auditory inspection of a number of tones,⁸ it was decided to analyze the manually labeled pitch accents and boundary tones in the GToBI corpus with respect to the following criteria:

• duration of increasing and decreasing parts of F0 and RMS before and after,

• amount of increase since the start of increase and amount of decrease before the end of decrease (see figure 4.3),

⁸ Batliner et al. (1999) present another method of finding the most efficient parameter set for automatic classification of prosodic events by using linear discriminant analysis (LDA) to minimize the number of features while simultaneously preventing too much loss in classification performance. They start with 276 features and reduce this set to 11 for boundaries and 6 for accents. Among the features for accent classification are: lower energy after accent location; more energy variation at accent location; F0 is falling before and rising at accent location.


• duration of voiced or voiceless parts before or after point t0 (see explanation below).

[Figure 4.3: Idealized illustration of an F0 or RMS track showing the method of amount estimation. The sketch marks the start and end of an increase and of a decrease (end of increase = start of decrease). The amount of increase (A) is taken from the value at the end of the increase and the value at the start of the increase; the amount of decrease (B) is taken from the value at the end of the decrease and the value at the start of the decrease.]

Each manually labeled tone was analyzed with respect to the above-mentioned parameters within an interval of ±400 ms around its labeled position. This time window was chosen on the basis of a first inspection of pitch accents and boundary tones and seemed to be a reasonable analysis frame, covering enough contextual material for the selection of acoustic features.
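The duration and amount criteria listed above might be measured along the following lines. This is a sketch under several assumptions and not the parameter acquisition program itself: the contour (F0 or RMS) is given as one value per frame, a value of 0 means "no value" (for example an unvoiced F0 frame), the frame shift is 10 ms, and the amount is reported both as an absolute difference and as a ratio, since the text leaves open whether a relative or an absolute measure is used. The function names are hypothetical.

```python
def _summarise(values, first, last, frame_ms):
    """Duration and amount of the movement between two frame indices."""
    return {
        "from": first,
        "to": last,
        "duration_ms": (last - first) * frame_ms,
        # Absolute amount: value at the end minus value at the start.
        "absolute": values[last] - values[first],
        # Relative amount: ratio end/start (None if the start value is 0).
        "relative": values[last] / values[first] if values[first] else None,
    }


def rise_before(values, t0, frame_ms=10.0, window_ms=400.0):
    """Measure the increase that ends at frame index t0."""
    limit = max(0, t0 - int(window_ms / frame_ms))
    start = t0
    # Walk left while the contour keeps rising towards t0.
    while start > limit and 0 < values[start - 1] < values[start]:
        start -= 1
    return _summarise(values, start, t0, frame_ms)


def fall_after(values, t0, frame_ms=10.0, window_ms=400.0):
    """Measure the decrease that starts at frame index t0."""
    limit = min(len(values) - 1, t0 + int(window_ms / frame_ms))
    end = t0
    # Walk right while the contour keeps falling away from t0.
    while end < limit and 0 < values[end + 1] < values[end]:
        end += 1
    return _summarise(values, t0, end, frame_ms)


if __name__ == "__main__":
    contour = [0, 0, 100, 110, 125, 140, 130, 120]
    print(rise_before(contour, 5))
    print(fall_after(contour, 5))
```

The search is bounded by the ±400 ms window on either side of the labeled position. A strictly monotonic walk like this one is sensitive to small local reversals, so a practical version would probably need some smoothing or tolerance, which is left out here.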

The decision to use the linear Hertz frequency scale was based on the observation that transforming the F0 values into a logarithmic scale did not seem to have a significant influence on the approach chosen here. Although some researchers have explicitly chosen the logarithmic scale (semitones) because it represents human perception of frequency (cf. the IPO model in section 3.1.1 and ’t Hart & Cohen 1973; Silverman 1987), transforming frequency values from a linear scale to a logarithmic one does not solve the problems faced when modelling the F0–phonology interface, and it was decided to stay with the linear Hertz frequency values (cf. Taylor 1994, p. 85-86).
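For readers who want to compare the two scales, the standard conversion from Hertz to semitones is 12 times the base-2 logarithm of the frequency ratio. The short sketch below shows it. The reference frequency of 100 Hz is an arbitrary assumption made for the example and does not come from the text.

```python
import math


def hz_to_semitones(f_hz, ref_hz=100.0):
    """Distance of f_hz from ref_hz in semitones (12 semitones = 1 octave)."""
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f_hz / ref_hz)


if __name__ == "__main__":
    # The same 50 Hz rise spans a different interval depending on register.
    low = hz_to_semitones(150.0, 100.0)
    high = hz_to_semitones(250.0, 200.0)
    print(round(low, 2), round(high, 2))
```

A rise of 50 Hz starting at 100 Hz covers about 7 semitones, whereas the same 50 Hz starting at 200 Hz covers less than 4. This is the kind of difference that the choice of scale affects.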

The parameter analysis should result in the identification of perceptually important F0 movements as well as in the differentiation of these movements from perceptu-
