
6. Acoustic Packaging as a Basis for Feedback on the iCub Robot 69

6.2. Prominence Detection

The role of the prominence detection module in the acoustic packaging system is to identify highlighted parts of the tutor's speech. In this way, the system can pick up the words or expressions for an action, or for another term the tutor has focused on. For example, if the tutor shows a cup and focuses on the cup's color during the tutoring situation, s/he will probably emphasize the color term. Portions of this section were previously published by the author (Schillingmann et al., 2011).

6.2.1. Perceptual Prominence

Perceptual prominence of linguistic units is defined as a unit's degree of standing out of its environment (Tamburini and Wagner, 2007). This definition leads to several aspects that need to be modeled. First, it is necessary to define which type of linguistic unit the module should operate on. Syllables are typically used in prominence detection methods (Tamburini and Wagner, 2007); they have the advantage that speech can be segmented into syllables without models that require a known lexicon. Second, an environment has to be defined within which linguistic units are compared. In this work, syllables are ranked on a per-utterance basis, which is a common approach and integrates well with the speech segmentation the acoustic packaging system already performs. Third, features to rate the level of prominence of each unit and a method to rank the results are required. These features have to be chosen carefully with regard to their robustness in noisy acoustic environments. Additionally, suitable features vary depending on the language. For German, a possible set of features consists of nucleus duration, spectral emphasis, pitch (F0) movement, and overall intensity (Tamburini and Wagner, 2007).

However, for the scenario in this work, the set of features needs to be reduced. One reason is the relatively noisy environment of a tutoring situation. Especially if the module is used for the analysis of adult-child interaction, the features must be robust to noise. This environment makes pitch estimation and nucleus duration features less reliable. A nucleus duration feature would additionally depend on the accuracy of the syllable segmentation.

Furthermore, findings on prosodic event detection show that the difference between combining several features and using single features is relatively small (Rosenberg et al., 2012). Tamburini and Wagner (2007) achieved optimal results using high weighting factors for the spectral emphasis and nucleus duration features. Considering these results, as well as the finding that prosodic features are more exaggerated in adult-child interaction (Brand et al., 2002), the prominence detection module relies on spectral emphasis in its implementation.

6.2.2. The Prominence Detection Module

According to the model described in the previous section, the prominence detection module operates on the utterance level. When a new utterance hypothesis is completed, the module retrieves the acoustic signal from the active memory and performs the following steps. First, the speech stream is segmented into linguistic units, in the present case syllables. Second, these linguistic units are rated according to acoustic parameters that correlate with perceived prominence. The result is a syllable segmentation that includes a prominence rating for each syllable.

The utterance hypothesis is extended with this information and made available to other modules by inserting the updated hypothesis into the active memory. The syllable segmentation method and the implementation of prominence detection are described in more detail in the following.
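The per-utterance processing flow described above can be sketched as follows. The data structures and function names below are hypothetical stand-ins; the actual active-memory interface and hypothesis representation differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical data structures -- stand-ins for the actual
# utterance hypothesis stored in the active memory.
@dataclass
class Syllable:
    start: float              # seconds
    end: float                # seconds
    prominence: float = 0.0   # filled in by the rating step

@dataclass
class UtteranceHypothesis:
    signal: object            # acoustic signal retrieved from active memory
    fs: int                   # sampling rate in Hz
    syllables: List[Syllable] = field(default_factory=list)

def process_utterance(hyp: UtteranceHypothesis,
                      segment: Callable,
                      rate: Callable) -> UtteranceHypothesis:
    """Step 1: segment the utterance into syllables.
    Step 2: rate each syllable's prominence.
    The extended hypothesis would then be re-inserted into active memory."""
    bounds: List[Tuple[float, float]] = segment(hyp.signal, hyp.fs)
    hyp.syllables = [Syllable(s, e) for s, e in bounds]
    for syl in hyp.syllables:
        syl.prominence = rate(hyp.signal, hyp.fs, syl.start, syl.end)
    return hyp
```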

Syllable Segmentation

A modified version of the Mermelstein algorithm (Mermelstein, 1975) is used to segment utterances into syllables. In a first step, the signal is filtered using an equal loudness filter (Robinson, 2011). The filtered signal is then bandpass filtered using a 4th order Butterworth filter with cut-off frequencies at 500 Hz and 4000 Hz. Then, the signal is full-wave rectified and low-pass filtered with a second order Butterworth filter at 40 Hz to obtain an estimate of the signal's envelope. The basic idea of the Mermelstein algorithm is to detect minima in this energy envelope; the locations of these minima are the desired syllable boundaries. Minima detection proceeds as follows: the signal's envelope is approximated by its convex hull, and a syllable boundary is placed at the maximum difference between the convex hull and the envelope (see Figure 6.2). The algorithm is then carried out recursively for the intervals to the left and right of the syllable boundary. The recursion terminates when the maximum difference drops below a certain threshold or the interval between two boundaries falls below a minimal length. Table 6.2 gives an overview of these parameters and typical values; the values were determined by a parameter optimization on a subset of the Verbmobil corpus (Kohler et al., 1994). The general idea behind this approach is to prioritize the most significant minima in the signal's envelope.

[Figure 6.2: plot of the signal envelope with the convex hull and a resulting syllable boundary hypothesis]

Figure 6.2.: Visualization of the Mermelstein convex hull based syllable segmentation algorithm. The convex hull is drawn at multiple iterations to visualize its approximation of the energy envelope.

Parameter                                         Value
Difference threshold between hull and envelope    1.39 dB
Minimal segment duration                          80 ms

Table 6.2.: Values of relevant parameters for the syllable segmentation method in a typical configuration.

Prominence Rating

As discussed in Section 6.2.1, spectral emphasis is used to rate the syllable segments. The syllable segment with the highest spectral emphasis rating is considered the most prominent syllable in the utterance. The spectral emphasis feature is calculated by


Matches    Deletions    Insertions
68.65%     31.35%       31.35%

Table 6.3.: Evaluation results of our syllable detection method on utterances from the Verbmobil corpus.

Matches    Utterances    Words
59.71%     139           4.45

Table 6.4.: Evaluation results of the prominence detection approach on utterances from adult-infant interactions ("Words" gives the average number of words per utterance). The results are 2.7 times better than chance.
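The reported factor over chance is consistent with a baseline that picks one word per utterance uniformly at random (an assumption; the text does not state the baseline explicitly). With on average 4.45 words per utterance:

```latex
P_{\text{chance}} \approx \tfrac{1}{4.45} \approx 22.5\,\%,
\qquad \frac{59.71\,\%}{22.5\,\%} \approx 2.7
```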

bandpass filtering the signal with a 4th order Butterworth filter in the band 500 Hz to 4000 Hz. Then, RMS energy is computed for each syllable segment and normalized per utterance.

6.2.3. Evaluation

Both the syllable segmentation approach and the prominence rating method were evaluated. Syllable segmentation was evaluated on a subset of the Verbmobil corpus (Kohler et al., 1994), since an accurate syllable segmentation is available for it. The subset consists of 2,000 randomly selected utterances containing 68,276 syllables in total. A syllable boundary is considered a match if a boundary hypothesis lies within a distance of 50 ms. Table 6.3 shows the results, with balanced insertion and deletion rates.

The prominence rating algorithm was evaluated on a corpus of adult-infant interactions (Rohlfing et al., 2006). For the evaluation, a subset in which adults explain to children how to stack cups was used. The acoustic channel was recorded with a distant microphone and thus contains environmental noise, e.g., from the cup stacking task and, in some cases, from the child. Word boundaries were determined automatically from a transcription by performing a forced alignment. A human annotator marked the most prominent word in each utterance. A match is counted if the center of the syllable with the highest prominence ranking lies within the boundaries of the annotated word. Utterances with acoustic conditions so poor that even the forced alignment failed were not taken into account. In total, 139 utterances were used in the evaluation. The results are presented in Table 6.4.
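The per-utterance match criterion can be sketched as follows (a hypothetical helper; all times in seconds):

```python
def prominence_match(syllables, ratings, word_start, word_end):
    """True if the center of the highest-rated syllable falls inside
    the boundaries of the human-annotated most prominent word.
    syllables: list of (start, end); ratings: one score per syllable."""
    k = max(range(len(ratings)), key=ratings.__getitem__)
    start, end = syllables[k]
    center = 0.5 * (start + end)
    return word_start <= center <= word_end
```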

6.2.4. Summary

A prominence detection module was described, including its integration into the acoustic packaging system, where it detects semantically relevant information linguistically highlighted by a tutor. Evaluation results on speech data from adult-infant interactions show a 59.7% agreement with human raters. This means that more than half of the stressed words can be obtained by a fully automated approach of syllable segmentation and prominence detection. While this may seem a low recognition rate, it should be noted that the results were achieved on highly realistic data containing considerable noise from toy play and children's interruptions. Although the prominence module's agreement with a human rater is not perfect, the method works under the more difficult acoustic conditions of tutoring scenarios, and it performs considerably better than chance. The results could possibly be improved by using more complex acoustic features; for German, including nucleus duration would likely lead to an improvement, as long as it can be estimated robustly.

[Figure 6.3: block diagram of the acoustic packaging integration framework: the sensory cues (motion segmentation, acoustic segmentation, prominence detection, color saliency based tracking) connect via the active memory to temporal association, visualization and inspection, and the robot feedback module]

Figure 6.3.: System overview with highlighted layers and their relation to the acoustic packaging system.

6.3. Integration of Color Saliency and Prominence Detection