

Renato De Mori & Fabio Brugnara

1.5.4 Generation of Word Hypotheses

Generation of word hypotheses can result in a single sequence of words, in a collection of the n-best word sequences, or in a lattice of partially overlapping word hypotheses.

This generation is a search process in which a sequence of vectors of acoustic features is compared with word models. In this section, some distinctive characteristics of the computations involved in speech recognition algorithms will be described, first focusing on the case of a single-word utterance and then considering the extension to continuous speech recognition.

In general, the speech signal and its transformations do not exhibit a clear indication of word boundaries, so word boundary detection is part of the hypothesization process, carried out as a search. In this process, all the word models are compared with a sequence of acoustic features. In the probabilistic framework, "comparison" between an acoustic sequence and a model involves the computation of the probability that the model assigns to the given sequence.

This is the key ingredient of the recognition process. In this computation, the following quantities are used:

α_t(y_1^T, i): the probability of having observed the partial sequence y_1^t and being in state i at time t:

\[
\alpha_t(y_1^T, i) \equiv
\begin{cases}
p(X_0 = i), & t = 0 \\
p(X_t = i,\, Y_1^t = y_1^t), & t > 0
\end{cases}
\]

β_t(y_1^T, i): the probability of observing the partial sequence y_{t+1}^T given that the model is in state i at time t:

\[
\beta_t(y_1^T, i) \equiv
\begin{cases}
p(Y_{t+1}^T = y_{t+1}^T \mid X_t = i), & t < T \\
1, & t = T
\end{cases}
\]

ψ_t(y_1^T, i): the probability of having observed the partial sequence y_1^t along the best path ending in state i at time t:

\[
\psi_t(y_1^T, i) \equiv
\begin{cases}
p(X_0 = i), & t = 0 \\
\max_{i_0^{t-1}} p(X_0^{t-1} = i_0^{t-1},\, X_t = i,\, Y_1^t = y_1^t), & t > 0
\end{cases}
\]

α and β can be used to compute the total emission probability p(y_1^T | W) as

\begin{align*}
p(Y_1^T = y_1^T) &= \sum_i \alpha_T(y_1^T, i) \tag{1.1} \\
                 &= \sum_i \pi_i\, \beta_0(y_1^T, i) \tag{1.2}
\end{align*}

An approximation for computing this probability consists of following only the path of maximum probability. This can be done with the ψ quantity:

\[
\Pr[Y_1^T = y_1^T] = \max_i \psi_T(y_1^T, i) \tag{1.3}
\]

The computations of all the above probabilities share a common framework, employing a matrix called a trellis, depicted in Figure 1.6. For the sake of simplicity, we can assume that the HMM in Figure 1.6 represents a word and that the input signal corresponds to the pronunciation of an isolated word.

Every trellis column holds the values of one of the just-introduced probabilities for a partial sequence ending at different time instants, and every interval between two columns corresponds to an input frame. The arrows in the trellis represent model transitions composing possible paths in the model from the initial time instant to the final one. The computation proceeds in a column-wise manner, at every time frame updating the scores of the nodes in a column by means of recursion formulas which involve the values of an adjacent column, the transition probabilities of the models, and the values of the output distributions for the corresponding frame. For the α and ψ coefficients, the computation starts at the leftmost column, whose values are initialized with the values of π_i, and ends at the opposite side, computing the final value with (1.1) or (1.3). For the β coefficients, the computation goes from right to left.
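
By way of illustration, the following Python sketch implements this column-wise computation for the α coefficients and the final summation of equation (1.1), assuming a transition-emitting model in which b_{j,i}(y) denotes the output density attached to the transition from state j to state i; the function name and data layout are illustrative only.

```python
import numpy as np

def forward_total_prob(pi, A, emis, Y):
    """Column-wise forward (alpha) computation over the trellis.

    pi   : (S,) array of initial state probabilities pi_i
    A    : (S, S) array of transition probabilities a[j, i]
    emis : callable emis(j, i, y) -> b_{j,i}(y), the output density
           attached to the transition j -> i (transition-emitting HMM)
    Y    : sequence of observation vectors y_1 ... y_T
    Returns the total emission probability of equation (1.1).
    """
    S = len(pi)
    alpha = np.array(pi, dtype=float)        # trellis column for t = 0
    for y in Y:                              # one new column per input frame
        alpha = np.array([
            sum(alpha[j] * A[j, i] * emis(j, i, y) for j in range(S))
            for i in range(S)
        ])
    return float(alpha.sum())                # eq. (1.1): sum over the last column
```

The β coefficients can be computed with the same structure by sweeping the trellis from the last column to the first.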

The algorithm for computing the ψ coefficients, known as the Viterbi algorithm, can be seen as an application of dynamic programming for finding a maximum-probability path in a graph with weighted arcs. The recursion formula for its computation is the following:

\[
\psi_t(y_1^T, i) =
\begin{cases}
\pi_i, & t = 0 \\
\max_j \psi_{t-1}(y_1^T, j)\, a_{j,i}\, b_{j,i}(y_t), & t > 0
\end{cases}
\]

By keeping track of the state j giving the maximum value in the above recursion formula, it is possible, at the end of the input sequence, to retrieve the states visited by the best path, thus performing a sort of time-alignment of input frames with models’ states.
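
A corresponding sketch of the ψ recursion with backpointers is shown below; it keeps scores in the log domain (a common implementation device, not part of the formulas above) and returns both the approximation of equation (1.3) and the recovered state sequence.

```python
import numpy as np

def viterbi_align(pi, A, emis, Y):
    """psi recursion with backpointers (Viterbi algorithm).

    Same conventions as the forward sketch.  Scores are kept in the
    log domain to avoid numerical underflow.  Returns the approximate
    probability of eq. (1.3) and the best state sequence, i.e., the
    time alignment of input frames with model states.
    """
    S = len(pi)
    tiny = 1e-300                                  # guard against log(0)
    psi = np.log(np.asarray(pi, dtype=float) + tiny)
    backptrs = []                                  # backptrs[t][i] = best predecessor j
    for y in Y:
        scores = np.array([[psi[j] + np.log(A[j, i] * emis(j, i, y) + tiny)
                            for j in range(S)] for i in range(S)])
        backptrs.append(scores.argmax(axis=1))     # best j for every state i
        psi = scores.max(axis=1)
    # eq. (1.3): best final state, then trace the backpointers backwards
    path = [int(np.argmax(psi))]
    for pred in reversed(backptrs):
        path.append(int(pred[path[-1]]))
    path.reverse()
    return float(np.exp(psi.max())), path
```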

All these algorithms have a time complexity O(MT), where M is the number of transitions with non-zero probability and T is the length of the input sequence. M can be at most equal to S², where S is the number of states in the model, but is usually much lower, since the transition probability matrix is generally sparse. In fact, a common choice in speech recognition is to impose severe constraints on the allowed state sequences, for example a_{i,j} = 0 for j < i or j > i + 2, as is the case for the model in Figure 1.6.
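
As a small illustration of why M stays low under such constraints, the fragment below lists the non-zero arcs of a left-to-right topology as an adjacency list, which is what the recursions above would iterate over instead of the full S × S matrix.

```python
def left_to_right_arcs(S):
    """Non-zero transitions of a left-to-right HMM with
    a_{i,j} = 0 for j < i or j > i + 2 (the topology of Figure 1.6),
    stored as an adjacency list.  The forward and Viterbi recursions
    then visit only M arcs per frame instead of S * S."""
    return {i: [j for j in (i, i + 1, i + 2) if j < S] for i in range(S)}

# For S = 5 this gives M = 12 arcs rather than S * S = 25:
# {0: [0, 1, 2], 1: [1, 2, 3], 2: [2, 3, 4], 3: [3, 4], 4: [4]}
```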

In general, recognition is based on a search process which takes into account all the possible segmentations of the input sequence into words and the a priori probabilities that the LM assigns to sequences of words.

Good results can be obtained with simple LMs based on bigram or trigram probabilities. As an example, let us consider a bigram language model. This model can be conveniently incorporated into a finite state automaton as shown in Figure 1.7, where dashed arcs correspond to transitions between words with probabilities of the LM.

Figure 1.7: Bigram LM represented as a weighted word graph. p_{h,k} stands for p(W_k | W_h), p_h stands for p(W_h). The leftmost node is the starting node; the rightmost ones are final.

After substitution of the word-labeled arcs with the corresponding HMMs, the resulting automaton becomes a large HMM itself, on which a Viterbi search for the most probable path, given an observation sequence, can be carried out.
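
The following sketch indicates, under purely illustrative assumptions about the data structures, how such a composite network could be assembled from per-word HMMs and a bigram LM; it is not meant to reproduce the construction used in any of the cited systems.

```python
def build_composite_network(word_hmms, bigram, unigram):
    """Assemble the word graph of Figure 1.7 into one large HMM.

    word_hmms : {word: (n_states, arcs)}, where arcs is a list of
                (src, dst, a, b) tuples internal to the word model
    bigram    : {(w_prev, w_next): p(w_next | w_prev)}  -- dashed arcs
    unigram   : {word: p(word)}  -- arcs leaving the starting node
    Returns a flat arc list; the string 'eps' marks empty transitions,
    i.e., arcs without an associated output distribution.
    """
    offset, arcs, n = {}, [], 1          # global state 0 is the start node
    for w, (n_states, w_arcs) in word_hmms.items():
        offset[w] = n                    # renumber the word's states
        arcs += [(n + s, n + d, a, b) for (s, d, a, b) in w_arcs]
        n += n_states
    for w, p in unigram.items():         # start node -> first state of w
        arcs.append((0, offset[w], p, 'eps'))
    for (wp, wn), p in bigram.items():   # last state of wp -> first state of wn
        arcs.append((offset[wp] + word_hmms[wp][0] - 1, offset[wn], p, 'eps'))
    return arcs
```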

The dashed arcs are to be treated as empty transitions, i.e., transitions without an associated output distribution. This requires some generalization of the Viterbi algorithm. During the execution of the Viterbi algorithm, a minimum of backtracking information is kept to allow the reconstruction of the best path in terms of word labels. Note that the solution provided by this search is suboptimal in the sense that it gives the probability of a single state sequence of the composite model and not the total emission probability of the best word model sequence. In practice, however, it has been observed that the path probabilities computed with the above-mentioned algorithms exhibit a dominance property, consisting of a single state sequence accounting for most of the total probability (Merhav & Ephraim, 1991).
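
One possible way of handling these empty transitions, sketched below under hypothetical score and backpointer representations, is to relax the dashed arcs after the regular update of each frame, recording the word label as backtracking information.

```python
import math

def relax_empty_arcs(score, backptr, eps_arcs, t):
    """Propagate scores along the empty (dashed) transitions after the
    regular Viterbi update for frame t.  Empty arcs consume no input
    frame; storing (t, word) at each crossing is the minimal
    backtracking information needed to recover the best word sequence
    at the end of the utterance.

    score    : {state: accumulated log score at time t}
    backptr  : {state: (frame, word) of the last word boundary}
    eps_arcs : list of (src, dst, log_lm_prob, word_label)
    """
    # a single pass suffices as long as empty arcs do not form chains
    for src, dst, log_p, word in eps_arcs:
        if src in score and score[src] + log_p > score.get(dst, -math.inf):
            score[dst] = score[src] + log_p
            backptr[dst] = (t, word)
```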

The composite model grows with the vocabulary and can lead to large search spaces. Nevertheless, the uneven distribution of probabilities among different paths can help; when the number of states is large, at every time instant, a large portion of states have an accumulated likelihood which is much less than the highest one. It is therefore very unlikely that a path passing through one of these states would become the best path at the end of the utterance.

This consideration leads to a complexity reduction technique called beam search (Ney, Mergel, et al., 1992), consisting of neglecting states whose accumulated score is lower than the best one minus a given threshold. In this way, computation needed to expand bad nodes is avoided. It is clear from the naivety of the pruning criterion that this reduction technique has the undesirable property of not being admissible, possibly causing the loss of the best path. In practice, good tuning of the beam threshold results in a gain in speed by an order of magnitude, while introducing a negligible amount of search errors.
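
A beam pruning step of this kind can be sketched as follows; the log-domain threshold and the dictionary of active scores are illustrative assumptions.

```python
def beam_prune(scores, beam):
    """Beam-search pruning step: keep only the states whose accumulated
    log score is within `beam` of the current best.  The criterion is
    not admissible, so the eventual best path may be lost if the beam
    is set too narrow."""
    best = max(scores.values())
    return {state: s for state, s in scores.items() if s >= best - beam}
```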

When the dictionary is of the order of tens of thousands of words, the network becomes too big and other methods have to be considered.

At present, different techniques exist for dealing with very large vocabularies. Most of them use multi-pass algorithms. Each pass prepares information for the next one, reducing the size of the search space. Details of these methods can be found in Alleva, Huang, et al. (1993); Aubert, Dugast, et al. (1994); Murveit, Butzberger, et al. (1993); Kubala, Anastasakos, et al. (1994).

In a first phase, a set of candidate interpretations is represented in an object called a word lattice, whose structure varies in different systems: it may contain only hypotheses on the location of words, or it may carry a record of acoustic scores as well. The construction of the word lattice may involve only the execution of a Viterbi beam search with memorization of word scoring and localization, as in Aubert, Dugast, et al. (1994), or may itself require multiple steps, as in Alleva, Huang, et al. (1993); Murveit, Butzberger, et al. (1993); Kubala, Anastasakos, et al. (1994). Since the word lattice is only an intermediate result, to be inspected by other detailed methods, its generation is performed with a bigram language model, and often with simplified acoustic models.
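
A minimal representation of a lattice entry covering both variants mentioned above (word location only, or location plus acoustic score) could look like the following sketch; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LatticeArc:
    """One word hypothesis in the lattice produced by the first pass.
    Depending on the system, only the location of the word may be
    recorded, or its acoustic score as well."""
    word: str
    start_frame: int
    end_frame: int
    acoustic_logprob: Optional[float] = None   # kept by some systems only
```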

The word hypotheses in the lattice are scored with a more accurate language model, and sometimes with more detailed acoustic models. Lattice rescoring may require new calculations of HMM probabilities (Murveit, Butzberger, et al., 1993), may proceed on the basis of precomputed probabilities only (Aubert, Dugast, et al., 1994; Alleva, Huang, et al., 1993), or may even exploit acoustic models which are not HMMs (Kubala, Anastasakos, et al., 1994). In Alleva, Huang, et al. (1993), the last step is based on an A* search (Nilsson, 1971) on the word lattice, allowing the application of a long-distance language model, i.e., a model where the probability of a word may depend on more than its immediate predecessor. In Aubert, Dugast, et al. (1994), a dynamic programming algorithm using trigram probabilities is performed.
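
As a rough illustration of such a dynamic programming pass, the sketch below rescores precomputed lattice arcs with a trigram LM; the arc format and the function interface are assumptions rather than the cited systems' actual interfaces.

```python
from collections import defaultdict

def rescore_lattice(arcs, trigram_logprob, last_frame):
    """Dynamic programming over a word lattice with a trigram LM.

    arcs            : list of (start_frame, end_frame, word, acoustic_logprob)
    trigram_logprob : callable (w2, w1, w) -> log p(w | w2, w1)
    The DP state is (frame, last two words), so that the trigram
    context needed for the next extension is preserved.  Returns the
    best word sequence whose last arc ends at last_frame.
    """
    by_start = defaultdict(list)
    for arc in arcs:
        by_start[arc[0]].append(arc)
    # best[(frame, (w2, w1))] = (accumulated log score, word sequence)
    best = {(0, ('<s>', '<s>')): (0.0, [])}
    for t in range(last_frame + 1):
        for (frame, hist), (score, words) in list(best.items()):
            if frame != t:
                continue
            for (_, end, w, ac) in by_start[t]:
                new_score = score + ac + trigram_logprob(hist[0], hist[1], w)
                key = (end, (hist[1], w))
                if key not in best or new_score > best[key][0]:
                    best[key] = (new_score, words + [w])
    finals = [v for (frame, _), v in best.items() if frame == last_frame]
    return max(finals, default=(float('-inf'), []))[1]
```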

A method which does not make use of the word lattice is presented in Paul (1994). Inspired by one of the first methods proposed for continuous speech recognition (CSR) (Jelinek, 1969), it combines both powerful language modeling and detailed acoustic modeling in a single step, performing an A*-based search.

1.5.5 Future Directions

Interesting software architectures for ASR have been recently developed. They provide acceptable recognition performance almost in real time for dictation of large vocabularies (more than 10,000 words). Pure software solutions require, at the moment, a considerable amount of central memory. Special boards make it possible to run interesting applications on PCs.

There are aspects of the best current systems that still need improvement.

The best systems do not perform equally well with different speakers and different speaking environments. Two important aspects, namely recognition in noise and speaker adaptation, are discussed in section 1.4. They have difficulty in handling out-of-vocabulary words, hesitations, false starts, and other phenomena typical of spontaneous speech. Rudimentary understanding capabilities are available for speech understanding in limited domains. Key research challenges for the future are acoustic robustness, use of better acoustic features and models, use of multiple word pronunciations and efficient constraints for the access of a very large lexicon, sophisticated and multiple language models capable of representing various types of contexts, rich methods for extracting conceptual representations from word hypotheses, and automatic learning methods for extracting various types of knowledge from corpora.
