
4 Human-computer communication with in-car speech dialogue systems

4.1 Architecture and functions of a speech dialogue system

A speech dialogue system is an interface that allows communication between a human being and a machine (Wahlster, 2006; Reithinger, 2005; Wahlster, 2007). According to the definition given in Chapter 2, Section 2.2, a typical in-car speech dialogue system is a multimodal interface with at least two input and two output modalities: manual and speech input, and graphical and speech output.

In order to accept spoken input, understand and process it, and answer appropriately while simultaneously synchronising spoken interaction with graphical output, several components have to interact successfully: speech input and output combined with a dialogue management system, as well as a synchronisation component that exchanges parameters between speech control and the graphics-haptics side of the user interface. Figure 4.1 presents the typical architecture of a speech dialogue system.

Figure 4.1: Architecture of a multimodal in-car speech dialogue system

Speech control

Speech control consists of a speech input and output module and a dialogue manager. The module for speech input and output is responsible for recognising spoken input such as commands, digits and spelling, and also for turning system output into speech. Accordingly, the module comprises two components: a module for automatic speech recognition, the results of which are passed on to a parser or a unit for natural language understanding, and a module for text-to-speech synthesis.
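As a purely illustrative sketch, the interplay of these components can be summarised as a small pipeline; the class and method names and the hard-wired results are assumptions, not part of an actual product:

    # Purely illustrative sketch of how the speech-control components could be
    # chained; class and method names and the fixed results are assumptions.
    class SpeechRecogniser:
        def recognise(self, audio):
            return "I want to go to Munich"            # placeholder recognition result

    class NaturalLanguageUnderstanding:
        def parse(self, text):
            return {"intent": "navigate", "destination": "Munich"}

    class DialogueManager:
        def react(self, interpretation):
            return f"Starting guidance to {interpretation['destination']}."

    class TextToSpeech:
        def speak(self, prompt):
            print(f"[TTS] {prompt}")

    def handle_utterance(audio):
        text = SpeechRecogniser().recognise(audio)                # speech input
        interpretation = NaturalLanguageUnderstanding().parse(text)
        prompt = DialogueManager().react(interpretation)          # dialogue control
        TextToSpeech().speak(prompt)                              # speech output

    handle_utterance(audio=None)   # prints: [TTS] Starting guidance to Munich.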

Automatic speech recognition

To be able to recognise spoken input, an automatic speech recogniser (ASR) contains a lexicon and a language model. The lexicon comprises all words the user is allowed to say (see Figure 4.2), including their acoustic models (Hidden Markov Models, see e.g. Schmandt, 1994; Gibbon, 1997). In the case of natural language input the lexicon also includes morphosyntactic features, e.g. form, category, subcategorisation, case, number and gender.
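A hypothetical lexicon entry, sketched in Python only for illustration (the field names and the SAMPA-like transcriptions are assumptions), might combine both kinds of information:

    # Hypothetical lexicon entries: each word is stored with a SAMPA-like phoneme
    # sequence (standing in for its acoustic model) and with morphosyntactic
    # features; all field names are assumptions made for this sketch.
    lexicon = {
        "möchte": {
            "phonemes": ["m", "9", "C", "t", "@"],
            "category": "verb",
            "person": 1,
            "number": "singular",
        },
        "Olgastraße": {
            "phonemes": ["O", "l", "g", "a", "S", "t", "r", "a:", "s", "@"],
            "category": "noun",
            "gender": "feminine",
            "number": "singular",
        },
    }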

Figure 4.2: Process of automatic speech recognition (adapted from Berton, 2004, p. 14)

The language model (LM) can either be a grammar comprising all possible word sequences or, in the case of less restricted input, a statistical model (McTear, 2004, p. 86).

The grammar describes how lexical entries may be combined into phrases and sentences. The grammar format is specified in Augmented Backus-Naur Form (ABNF) (Hunt, 2004; Hunt, 2000). To account for the productivity of language, a grammar is built from so-called rules; in ABNF these rules are structured hierarchically, like a tree diagram.
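As a minimal sketch, a command grammar in this format could look as follows; the rule names and the vocabulary are illustrative assumptions, not those of a particular system:

    #ABNF 1.0 UTF-8;
    language de-DE;
    root $navigate;

    // terminal alternatives for destinations the user may speak
    $city = Hamburg | München | Ulm;

    // the root rule combines optional carrier phrases with a destination
    $navigate = [ich möchte] nach $city [fahren];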

The statistical model is based on N-grams, reducing context to a maximum of N words. Trigrams, as in the example "Ich möchte in die Olgastraße" ("I want to go to Olgastraße") with the probabilities P(in | ich möchte), P(die | möchte in) and P(Olgastraße | in die), are not precise but robust with regard to spoken language input, e.g. when it comes to recognising hesitations, restarts, ungrammatical utterances etc.
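The following minimal Python sketch shows how such trigram probabilities can be estimated as relative frequencies; the two-sentence corpus is purely illustrative:

    from collections import defaultdict

    # Minimal sketch of a trigram model estimated as relative frequencies;
    # the tiny corpus below is purely illustrative.
    corpus = [
        ["ich", "möchte", "in", "die", "Olgastraße"],
        ["ich", "möchte", "in", "die", "Stadtmitte"],
    ]

    bigram_counts = defaultdict(int)
    trigram_counts = defaultdict(int)
    for sentence in corpus:
        for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
            bigram_counts[(w1, w2)] += 1
            trigram_counts[(w1, w2, w3)] += 1

    def p(w3, w1, w2):
        """P(w3 | w1 w2) as a relative frequency; 0.0 for unseen histories."""
        history = bigram_counts[(w1, w2)]
        return trigram_counts[(w1, w2, w3)] / history if history else 0.0

    print(p("in", "ich", "möchte"))       # 1.0 in this toy corpus
    print(p("Olgastraße", "in", "die"))   # 0.5 in this toy corpus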

To recognise spoken input, automatic speech recognition applies a complex search algorithm (see Figure 4.2). During the first step, called signal processing, the incoming digitised speech signal is split into e.g. 10 ms frames (McTear, 2004, p. 83). By means of a Fourier transformation each frame can be analysed according to particular features that describe the frame's spectral energy distribution. These features in turn are subsumed under a feature vector. To reduce the complexity of the represented signal, the feature vectors x1, ..., xZ undergo an additional process of feature transformation and classification, resulting in class indexes and probabilities (Berton, 2004, p. 14). The results of this process are passed to a decoder. In the decoder, a statistical language model is used to determine the most probable transcriptions (e.g. Hamburg, Homburg, Humburg) of a spoken utterance, i.e. a word hypothesis graph (Berton, 2004, p. 14; also see Kuhn, 1996 and Jelinek, 1990). Alternatively, the use of a grammar results in an n-best list that is passed on to the component for parsing/natural language understanding.
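The framing and spectral analysis of the first step can be sketched as follows; the sampling rate and the random placeholder signal are assumptions made only for illustration:

    import numpy as np

    # Minimal sketch of the signal-processing step: the digitised signal is split
    # into 10 ms frames, and each frame is described by its spectral energy
    # distribution (a stand-in for the feature vectors x1, ..., xZ).
    sample_rate = 16000                            # assumed sampling rate in Hz
    frame_len = int(0.010 * sample_rate)           # 10 ms -> 160 samples per frame
    signal = np.random.randn(sample_rate)          # placeholder for one second of speech

    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    spectra = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1)) ** 2
    feature_vectors = np.log(spectra + 1e-10)      # one log-energy feature vector per frame

    print(feature_vectors.shape)                   # (number of frames, frequency bins)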

Natural language understanding

The module for natural language understanding (NLU) receives the result of the speech recogniser. Its task is to analyse the recognition result such that the system is able to 'understand' what the user has said and deal with it accordingly. The analysis process has two stages (Fromkin, 1993, p. 483 et seqq.; McTear, 2004, p. 91 et seqq.). In stage one, syntactic analysis or parsing splits a user utterance into its constituents. A constituent analysis of the sentence "I want to go to Munich", for example, would result in a phrase marker as presented in Figure 4.3.

To ascertain phrase markers the module requires access to two knowledge bases:

• The lexicon, comprising speakable words including morphosyntactic features, semantic features (e.g. [+ direction] or [+ location]) (Schwarz, 1996) and rules for combining meanings on the basis of thematic roles (e.g. whether a noun in a particular context has to be an agent or theme).

• The grammar knowledge base, describing possible combinations of lexical entries resulting in phrases and sentences. The rules could for example be phrase structure rules (e.g. IP → NP I'; I' → I PP; PP → P' VP etc.) (cf. Radford, 1988; Crain, 1999). Phrase structure rules reflect the domination relation within a sentence, categorising nodes that dominate other nodes, until the lowest nodes, i.e. the final nodes of a tree diagram, are reached (Crain, 1999, p. 91). They are recursive rules from which phrase markers like the following can be deduced.

Figure 4.3: Phrase marker of the sentence "I want to go to Munich."
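A toy parser illustrates how such a constituent analysis can be computed, assuming the NLTK toolkit is installed; the simplified category labels (S, NP, VP) are chosen for brevity and do not reproduce the IP/I' analysis of Figure 4.3:

    import nltk

    # Toy constituent analysis of the example sentence; the grammar is an
    # illustrative assumption covering only this one utterance.
    grammar = nltk.CFG.fromstring("""
        S     -> NP VP
        NP    -> PRON | PROPN
        VP    -> V TO VP | V PP
        PP    -> P NP
        PRON  -> 'I'
        PROPN -> 'Munich'
        V     -> 'want' | 'go'
        TO    -> 'to'
        P     -> 'to'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("I want to go to Munich".split()):
        tree.pretty_print()          # prints the phrase marker as an ASCII tree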

The syntactic structure of an utterance is a prerequisite for deducing its overall meaning. For example, the word 'play' has two different functions and meanings in the verb phrase "play music" and in the noun phrase "Shakespeare play". Considering these utterances in the context of a speech dialogue system, the latter 'play' is not a direct command to the system to play any kind of music. Instead it represents part of an audio item the user wants to select. As soon as an utterance has been parsed, the structural combination of words becomes subject to semantic analysis. This process requires the lexicon in order to access the semantic features of words and their defined combinations. Having located the verb 'go' in a sentence such as "I want to go to Munich", a semantic representation as presented in Figure 4.4 could be ascertained.

Figure 4.4: Semantic representation of the sentence “I want to go to Munich.”
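One possible, purely illustrative way of encoding such a representation is a frame of thematic roles; the slot names below are assumptions and need not match the notation used in Figure 4.4:

    # Purely illustrative encoding of the semantic representation as a frame of
    # thematic roles; slot names are assumptions, not taken from Figure 4.4.
    semantic_frame = {
        "predicate": "go",
        "agent": "speaker",       # contributed by "I"
        "goal": "Munich",         # contributed by the PP "to Munich", [+ direction]
        "modality": "want",       # contributed by the matrix verb "want"
    }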

Dialogue manager

As its name implies, the dialogue manager (DM) is in charge of controlling the dialogue. Depending on what the user has entered, this module determines a corresponding system reaction and/or system prompt and is responsible for interacting with modules from the outside world (see Figure 4.5). Examples are to

• Change applications, e.g. from navigation to audio (4, 1)

• Request the user to enter information, e.g. an artist or point of interest (speakable database entries) (3, 1, 6)

• Access a database, e.g. to check all titles (2)

• Request the user to give additional information, e.g. "Which artist?" in case the title is ambiguous (3, 1)

• Perform an action after user input has been completed and all information necessary is available, e.g. to play a particular title (4)

• Process barge-in and timeout (5)

The digits in brackets refer to the interaction steps illustrated in Figure 4.5.

Figure 4.5: Sample tasks of a dialogue manager
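A minimal sketch of such slot-filling behaviour, assuming a simple music database with title and artist fields (all names below are illustrative; barge-in and timeout handling are omitted), could look like this:

    # Minimal slot-filling sketch of a dialogue manager for a music selection
    # task; database contents, prompts and names are illustrative assumptions.
    class MusicDatabase:
        """Illustrative stand-in for the external database in Figure 4.5."""
        def __init__(self, tracks):
            self.tracks = tracks                      # list of (title, artist) pairs

        def lookup(self, title, artist=None):
            return [t for t in self.tracks
                    if t[0] == title and (artist is None or t[1] == artist)]

    class DialogueManager:
        """Collects title and artist, then triggers playback."""
        def __init__(self, database):
            self.database = database
            self.slots = {"title": None, "artist": None}

        def handle(self, user_input):
            # user_input is the parsed NLU result, e.g. {"title": "Yesterday"}
            self.slots.update({k: v for k, v in user_input.items() if v})

            if self.slots["title"] is None:
                return "Prompt: Which title would you like to hear?"   # request information

            matches = self.database.lookup(self.slots["title"], self.slots["artist"])
            if len(matches) == 1:
                return f"Action: play {matches[0]}"                    # all information available
            if len(matches) > 1:
                return "Prompt: Which artist?"                         # title is ambiguous
            return "Prompt: Sorry, I could not find this title."

    db = MusicDatabase([("Yesterday", "The Beatles"), ("Yesterday", "Leona Lewis")])
    dm = DialogueManager(db)
    print(dm.handle({"title": "Yesterday"}))      # -> Prompt: Which artist?
    print(dm.handle({"artist": "The Beatles"}))   # -> Action: play ('Yesterday', 'The Beatles')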

Text-to-speech synthesis

Text-to-speech (TTS) synthesis technology synthesises speech from utterances created by a response generator (McTear, 2004, p. 102), e.g. to return system prompts or give feedback to the user. The technology is recommended for applications that contain unpredictable data, such as audio or telephone applications. First, the text from the response generator is analysed in four steps (McTear, 2004, p. 103):

1. Text segmentation and normalisation (illustrated in the sketch after this list):
   - Splits the text into reasonable units such as paragraphs and sentences
   - Resolves ambiguous markers, for example a full stop that can serve as a sentence marker or as part of a date or acronym

2. Morphological analysis:
   - Reduces the number of words to be stored by unifying morphological variants
   - Assists with pronunciation by applying morphological rules

3. Syntactic tagging and parsing:
   - Determines the parts of speech of the words in the text
   - Permits a limited syntactic analysis

4. Modelling of continuous speech effects to achieve naturally sounding speech:
   - Adjusts weak forms and coarticulation effects
   - Generates prosody, i.e. pitch, loudness, tempo, rhythm and pauses
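A minimal sketch of step 1, assuming a small hand-written abbreviation list, could protect abbreviation full stops before splitting the text into sentences:

    import re

    # Minimal sketch of text segmentation and normalisation: full stops inside
    # known abbreviations are protected before the text is split into sentences.
    # The abbreviation list is an illustrative assumption.
    ABBREVIATIONS = {"Dr.", "St.", "e.g.", "No."}

    def split_sentences(text):
        protected = text
        for abbr in ABBREVIATIONS:
            protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
        sentences = re.split(r"(?<=[.!?])\s+", protected)
        return [s.replace("<DOT>", ".") for s in sentences]

    print(split_sentences("Dr. Smith lives in Munich. Call him back, e.g. tomorrow."))
    # -> ['Dr. Smith lives in Munich.', 'Call him back, e.g. tomorrow.']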

The second step involves generating continuous speech from the above text analysis (McTear, 2004, p. 104). Over the past years enormous advances have been made in the field of TTS synthesis, largely due to a process called concatenative speech synthesis (Cohen, 2004, p. 25). In this approach, a database of recorded sentences is cut into syllables and words. When speech is output, the corresponding system utterance is produced by concatenating a sequence of these prerecorded segments. Boundaries between segments are smoothed out to make the concatenation splices inaudible. As soon as dynamic data from applications such as audio are involved, comprising various languages, speech synthesis needs to be supplemented by grapheme-to-phoneme (G2P) conversion (cf. Chapter 5).
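The splice smoothing can be sketched as a short linear crossfade between two segments; the sine-wave segments below merely stand in for recorded speech units:

    import numpy as np

    # Minimal sketch of joining two prerecorded segments with a short linear
    # crossfade so that the splice is smoothed out.
    sample_rate = 16000
    t = np.arange(int(0.3 * sample_rate)) / sample_rate
    segment_a = np.sin(2 * np.pi * 220 * t)          # placeholder for segment 1
    segment_b = np.sin(2 * np.pi * 330 * t)          # placeholder for segment 2

    def concatenate(a, b, fade_ms=10, rate=16000):
        n = int(fade_ms / 1000 * rate)               # samples in the crossfade region
        overlap = a[-n:] * np.linspace(1.0, 0.0, n) + b[:n] * np.linspace(0.0, 1.0, n)
        return np.concatenate([a[:-n], overlap, b[n:]])

    utterance = concatenate(segment_a, segment_b)
    print(len(utterance))                            # len(a) + len(b) minus the overlap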

Synchronisation module

The synchronisation module (Sync) turns a speech dialogue system into a multimodal system by connecting and synchronising the spoken and the graphics-haptics worlds. It stores data coming from the display and hands over the corresponding parameters to the dialogue manager. These parameters comprise the contents of buttons and lists displayed on the screen, the current state of the Push-To-Activate (PTA) button and actions performed by the user (e.g. change of application, abort etc.). The dialogue manager is then able to initiate a particular dialogue, the results of which are, after successful recognition, returned to the display via the synchronisation module.
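Purely as an illustration, the parameters handed over to the dialogue manager could be grouped into a record such as the following; the field names are assumptions rather than the interface of an actual system:

    from dataclasses import dataclass, field
    from typing import List

    # Illustrative sketch of the parameters the synchronisation module might hand
    # over to the dialogue manager; field names are assumptions.
    @dataclass
    class DisplayState:
        application: str                               # e.g. "navigation" or "audio"
        visible_buttons: List[str] = field(default_factory=list)
        visible_list_items: List[str] = field(default_factory=list)
        pta_pressed: bool = False                      # state of the Push-To-Activate button
        last_user_action: str = ""                     # e.g. "change_application" or "abort"

    state = DisplayState(application="audio",
                         visible_list_items=["Yesterday", "Let It Be"],
                         pta_pressed=True)
    # The dialogue manager could now restrict recognition to the visible list items
    # and return the recognised selection to the display via the same module.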

Graphics-haptics interface

The graphics-haptics control follows the model-view-controller paradigm (Reenskaug, 1979). Models (state charts) and views (widgets) are described in the graphical user interface (GUI) module. The controller module contains the event management and the interface (CAN bus) to the central control switch, which can be pressed, pushed and turned. Such a control switch is the typical control element in advanced cars, such as Audi, BMW and Mercedes-Benz.
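A minimal sketch of this split, with the CAN-bus interface reduced to a plain method call and all names chosen for illustration only, could look like this:

    # Minimal sketch of the model-view-controller split described above;
    # the CAN-bus interface is reduced to a plain method call.
    class Model:
        """State chart: which menu is currently active."""
        def __init__(self):
            self.current_menu = "main"

    class View:
        """Widget: renders the model on the head-unit display."""
        def render(self, model):
            print(f"[Display] menu = {model.current_menu}")

    class Controller:
        """Event management for the central control switch."""
        def __init__(self, model, view):
            self.model, self.view = model, view

        def on_switch_turned(self, new_menu):
            self.model.current_menu = new_menu         # update the model ...
            self.view.render(self.model)               # ... and refresh the view

    controller = Controller(Model(), View())
    controller.on_switch_turned("navigation")          # -> [Display] menu = navigation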