
3 Obstruents in speech production and perception

3.3 Automatic speech recognition

3.3.1 HMM based speech recognition

HMM (Hidden Markov Model) based speech recognition refers to the underlying statistical structure. An HMM is used to model the pronunciation of an acoustic unit, as well as the speaking rate. HMM based speech recognizers need to be trained on a large amount of speech data that fits the application the speech recognizer has to operate in. Training means that the recognizer estimates a mixture distribution for every unit to maximize its ability to cope with variation in the speech signal. The speech recognizer automatically learns its parameters during this statistical training phase, resulting in an acoustic model. The training data should show the same distribution of demographic factors (dialect, age, gender, etc.) and technical factors (phone line, platform, etc.) as the intended application demands. In general, the acoustic model training is performed prior to setting up the complete system. In speaker-dependent speech recognizers, however, users can additionally tune the system to their voice and speaking habits after the application has been implemented, for example in dictation systems.
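
The training step can be sketched as follows, assuming the third-party hmmlearn library (an illustrative choice, not the toolkit discussed here): one mixture-density HMM is estimated per acoustic unit, and a new observation sequence can then be scored against every unit model. The feature vectors are random numbers that merely stand in for real training data.

```python
# Minimal sketch of per-unit acoustic model training (assumed library: hmmlearn).
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def fake_features(n_frames, n_dims=13):
    """Stand-in for the feature vectors of one training utterance."""
    return rng.normal(size=(n_frames, n_dims))

# a handful of training "utterances" per unit (in practice: many speakers)
training_data = {
    "a": [fake_features(80), fake_features(100)],
    "t": [fake_features(60), fake_features(70)],
}

acoustic_model = {}
for unit, sequences in training_data.items():
    X = np.vstack(sequences)                 # all frames of this unit
    lengths = [len(s) for s in sequences]    # utterance boundaries
    m = hmm.GMMHMM(n_components=3, n_mix=2,  # 3 states, 2 Gaussians per state
                   covariance_type="diag", n_iter=20, random_state=0)
    m.fit(X, lengths)                        # Baum-Welch parameter estimation
    acoustic_model[unit] = m

# a new observation sequence is scored against every unit model
test = fake_features(50)
print({unit: round(m.score(test), 1) for unit, m in acoustic_model.items()})
```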

A large amount of speech data needs to be provided to train the acoustic model because all units must be equally well represented for the parameter estimation. The units usually consist of phonemes, triphones or words. For a word-model, the acoustic references are estimated on the word level. Word-models are the most robust ones in automatic speech recognition. As a disadvantage, every word that has to be recognized in the forthcoming application needs to be represented several times by numerous speakers in the acoustic model training data. Thus, word-models are usually employed for small vocabulary applications, such as digit recognition. For large vocabulary systems, monophone- or triphone-models are used. In a monophone model, the smallest units are individual phonemes. For triphone-models, the parameter estimation is performed on phonemes including their left and right phoneme context (resulting in a sequence of three phonemes). Triphone-models also account for co-articulation, because each phoneme is modeled in the context of its possible neighbouring sounds, allowing more accurate recognition, which is essential for large vocabulary applications.
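
As a small illustration of such context-dependent units, the following sketch expands a phoneme sequence into triphones, i.e. each phoneme together with its left and right neighbour. The "left-phone+right" notation and the silence padding at the word boundaries are assumed conventions for the example, not prescribed by the text.

```python
def to_triphones(phonemes):
    """Expand a phoneme sequence into phoneme-in-context (triphone) units."""
    padded = ["sil"] + list(phonemes) + ["sil"]   # assumed word-boundary context
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["t", "a:", "t"]))   # e.g. German "Tat"
# ['sil-t+a:', 't-a:+t', 'a:-t+sil']
```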

A speech recognizer can consist of the following components: an acoustic model, a lexicon, a language model, a grammar and the acoustic front end, which contains the feature extraction algorithm (cf. Figure 2).

Figure 2: Diagram of an HMM based automatic speech recognizer (speech input, ASR engine, acoustic model, language model, lexicon, grammar)

The lexicon contains at least all key words that have to be recognized in the respective application. Words that are not in the lexicon cannot be recognized at all. The language model and the grammar support the acoustic front end in medium or large vocabulary applications. The language model is also a statistical model; it is trained on text data containing the most probable word sequences that could occur in the respective application. If an expression is more likely to occur, the corresponding word hypothesis from the acoustic model receives a higher score than a hypothesis that occurred less frequently in the language model's training data. In addition, the grammar provides a fixed framework of possible utterances.
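
The interplay of acoustic and language model scores can be sketched as follows; the bigram counts, acoustic scores and weighting are invented for illustration and only show how the more probable word sequence can outrank a slightly better acoustic match.

```python
import math

# Toy sketch of language model rescoring (all numbers invented).
bigram_counts = {("call", "home"): 40, ("call", "hone"): 1}
unigram_counts = {"call": 50}

def lm_logprob(prev, word, vocab_size=2):
    # add-one smoothed bigram log probability
    return math.log((bigram_counts.get((prev, word), 0) + 1)
                    / (unigram_counts.get(prev, 0) + vocab_size))

acoustic = {"home": -12.3, "hone": -11.9}   # "hone" matches the audio slightly better
lm_weight = 5.0                             # the language model score is usually weighted

best = max(acoustic, key=lambda w: acoustic[w] + lm_weight * lm_logprob("call", w))
print(best)   # "home": the more frequent word sequence wins despite the acoustic gap
```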

The core of the whole application is the ASR engine. The first step to be performed for speech recognition is again the feature extraction from the incoming speech signal. The acoustic signal is digitized, windowed (for example, every 10 ms) and transformed into a sequence of feature vectors. Various feature extraction techniques exist, such as filterbank analysis, Mel cepstrum and others. The resulting sequence of feature vectors is the input for the acoustic model. The algorithm aims to find the distance between the input feature vectors and those stored in the acoustic model by using HMMs. Hence, the sequence of feature vectors is mapped to a path of consecutive HMM states to generate a word hypothesis. Depending on the speaking rate, the sequence of feature vectors can “stay” in a state (slow speaker), move to the next state (normal speaker) or even skip a complete state (fast speaker and/or co-articulation). For each possible path, a matching probability is calculated. By means of a special algorithm (e.g. Viterbi), the different paths are evaluated and those with too low a probability are discarded. The lexicon supports the search process; for example, a tree-organized search activates only those words that share a particular phoneme at the word beginning. The language model and the grammar join the recognition process whenever a word has been recognized. The language model evaluates the hypothesis on the sentence level. The grammar controls the dialog, as explained above.
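
A minimal sketch of the acoustic front end described above, assuming the librosa library for MFCC (Mel cepstrum) features: the signal is cut into overlapping 25 ms windows, one every 10 ms, and each window becomes a 13-dimensional feature vector. A synthetic tone stands in for recorded speech.

```python
import numpy as np
import librosa   # assumed here; any MFCC/filterbank implementation would do

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
signal = 0.1 * np.sin(2 * np.pi * 220 * t)     # stand-in for a speech signal

features = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms analysis window
                                hop_length=int(0.010 * sr))  # new frame every 10 ms
print(features.T.shape)   # (number of frames, 13): the sequence of feature vectors
```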
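
The path search itself can be illustrated with a small Viterbi decoder over a left-to-right HMM whose transitions allow staying in a state, moving to the next state, or skipping one, mirroring the slow/normal/fast speaking rates mentioned above. The emission scores are invented stand-ins for the acoustic model's frame-wise log-likelihoods.

```python
import numpy as np

def viterbi(log_emissions, log_trans):
    """Best state path through a left-to-right HMM.
    log_emissions: (frames, states); log_trans: (states, states)."""
    T, N = log_emissions.shape
    delta = np.full((T, N), -np.inf)          # best log score ending in each state
    backptr = np.zeros((T, N), dtype=int)
    delta[0, 0] = log_emissions[0, 0]         # the path must start in the first state
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_trans[:, j]
            backptr[t, j] = np.argmax(scores)
            delta[t, j] = scores[backptr[t, j]] + log_emissions[t, j]
    path = [N - 1]                            # the path is forced to end in the last state
    for t in range(T - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return path[::-1]

N = 3
log_trans = np.full((N, N), -np.inf)          # disallowed transitions
for i in range(N):
    log_trans[i, i] = np.log(0.5)                          # "stay": slow speaker
    if i + 1 < N:
        log_trans[i, i + 1] = np.log(0.4)                  # next state: normal rate
    if i + 2 < N:
        log_trans[i, i + 2] = np.log(0.1)                  # skip a state: fast speech

rng = np.random.default_rng(1)
log_emissions = rng.normal(size=(8, N))       # invented frame-wise acoustic scores
print(viterbi(log_emissions, log_trans))      # non-decreasing state path from 0 to 2
```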

In general, HMM based speech recognizers operate only for the language they have been trained on. Several attempts have been made to train acoustic models for several languages at once, but this turned out to be hardly feasible. The reason is that the phoneme inventory, which is needed to build the underlying statistical models, is different for each language. For example, if a German-French automatic speech recognizer had to be trained, the linguistic base units would have to contain all German and French phonemes. The unit models for the French nasal vowels would then compete with those for the German non-nasal vowels, resulting in an awkward mixture of units. A rule of thumb says that keeping the number of units as small as possible contributes to recognition accuracy. A possible workaround is to train two separate acoustic models, one for each language, or to incorporate linguistic information into the speech recognizer, as in the FUL speech recognizer described in the next section.