
The telephone network is the most common example of an electronic speech communication system. Due to its scale, a significant amount of research has been directed towards finding efficient methods of transferring the spoken word. Developments in this area have resulted in the gradual replacement of analog components of the network by digital circuits. Digital communication systems outperform analog systems in the presence of noise and also provide the possibility of easily encrypting the data for security purposes. However, they do have a major disadvantage in that the required bandwidth is increased, and so the cost of the network is also increased. Speech coding attempts to reduce the required channel bandwidth while at the same time maintaining the high quality of the input speech.

Speech signals are inherently analog by nature, therefore a transformation from the continuous domain into the discrete domain must be carried out before any digital coding is performed. This transformation utilizes time discretization, or sampling, followed by amplitude discretization, or quantization. Sampling is information preserving if it is carried out at a sufficiently high rate. Quantization, on the other hand, always introduces some distortion into the signal, so the operation of the quantizer is critical to the overall performance of the system.

The goal in speech coding is to design a low complexity coder that produces high quality speech at the lowest data transmission rate possible. The properties of low complexity, high quality output and a low bit rate tend to be mutually exclusive, and a trade-off exists in practical coders.

4.2.1 Speech Production and Perception

The human vocal and auditory organs form one of the most useful and complex communication systems in the animal kingdom. This section briefly describes this system and its properties, with a view to exploiting them when transmitting speech.

4.2.1.1 Speech Production

All speech sounds are formed by blowing air from the lungs through the vocal tract. In the adult male the vocal tract is approximately 17 cm long with a cross-sectional area that varies from zero to about 20 cm² [ISP05].

The sole purpose of the lungs in speech production is to act as an air supply. The vocal folds are two membranes situated in the larynx. These membranes allow the area of the trachea (the glottis) to be varied. For breathing the vocal folds remain open, but during speech production they open and close. Speech can be broken into two broad classes, namely voiced and unvoiced, although in reality this distinction is not well defined and speech tends to be a mixture of the two.

During voiced speech the vocal folds are normally closed. As the lungs force air out, pressure builds up behind them. Eventually the pressure is so great that the folds are forced open. This flow of air causes them to vibrate in a relaxation oscillation. The frequency of this vibration is determined by the length of the folds and their tension. This vibration frequency is known as the pitch frequency and for normal speakers it can be anywhere in the range of 50 to 400 Hz. Women and children tend to produce a higher average pitch frequency than men since their vocal folds are shorter. The effect of this opening and closing of the glottis is that the air passing through the rest of the vocal tract appears as a quasi-periodic pulse train.
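To make this quasi-periodicity concrete, the sketch below estimates the pitch of a voiced frame by picking the autocorrelation peak among lags corresponding to the 50 to 400 Hz range quoted above. The frame length, sampling rate and synthetic test frame are illustrative assumptions, not values taken from this text.

import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Rough pitch estimate of a voiced frame via autocorrelation.

    Only lags corresponding to the 50-400 Hz range are searched. The frame
    is assumed to be voiced; a real coder would also make a voicing decision.
    """
    frame = frame - np.mean(frame)                  # remove DC offset
    corr = np.correlate(frame, frame, mode="full")  # full autocorrelation
    corr = corr[len(corr) // 2:]                    # keep non-negative lags
    lag_min = int(fs / fmax)                        # shortest period of interest
    lag_max = int(fs / fmin)                        # longest period of interest
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / lag

# Illustrative use: a synthetic "voiced" frame at 120 Hz sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)                    # 40 ms analysis frame
frame = np.sign(np.sin(2 * np.pi * 120 * t))        # crude glottal-like pulse train
print(estimate_pitch(frame, fs))                    # approximately 120 Hz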

During unvoiced speech the vocal folds are normally open so air can pass freely into the rest of the vocal tract.

At this point the speech signal consists of a series of pulses or random noise, depending on whether the speech is to be voiced or unvoiced. The spectrum of this signal is essentially flat, although there will be a fine spectral structure in the spectrum of the voiced signal due to the pitch frequency and its harmonics. This flat spectrum is then passed through the rest of the vocal tract, which can be viewed as a spectral shaping filter. The effect of this is that the vocal tract forces its own frequency response on the incoming signal. This frequency response is governed by the size and shape of the vocal tract.

Figure 4.1: The vocal tract (http://www.clas.ufl.edu/users/sshear/)

The nasal cavity is an auxiliary path for sound. It begins at the velum and is about 12 cm in length [ISP05]. When the velum is lowered the nasal tract is acoustically coupled with the rest of the vocal tract, dramatically changing the nature of the sound. It is the voluntary variations in the shape of the vocal tract (due to moving the tongue and the mouth), along with the varying state of the vocal folds, that produce speech.

4.2.1.2 Properties of Speech

Different speech sounds are distinguished by the human ear on the basis of their short-time spectra and how these spectra evolve with time. The effective bandwidth of speech is approximately 7 kHz. For voiced sounds such as vowels, the vocal tract acts as a resonant cavity. For most people the resonance frequencies are centered on 500 Hz and its odd harmonics. This resonance produces large peaks in the resulting speech spectrum. These peaks are known as formants. The formants of the speech contain almost all the information contained in the signal. This fact means that the vocal tract can be effectively modeled using an all-pole linear system. Voiced speech will also exhibit the effects of the vibrating vocal cords. The effect of this vibration is to introduce a quasi-periodicity into the speech.
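A rough sketch of this source-filter view is given below: an all-pole filter, with a resonance placed at each formant, is excited either by a quasi-periodic pulse train (voiced) or by white noise (unvoiced). The formant frequencies, pole radius and pitch period used here are illustrative values, not measurements from the text.

import numpy as np
from scipy.signal import lfilter

fs = 8000                                    # narrowband sampling rate

def all_pole_vocal_tract(formants_hz, radius=0.97):
    """Build an all-pole filter 1/A(z) with a resonance at each formant."""
    poles = []
    for f in formants_hz:
        angle = 2 * np.pi * f / fs
        poles += [radius * np.exp(1j * angle), radius * np.exp(-1j * angle)]
    return np.real(np.poly(poles))           # denominator polynomial A(z)

# Illustrative, roughly vowel-like formant frequencies.
a = all_pole_vocal_tract([500, 1500, 2500])

n = np.arange(int(0.2 * fs))                 # 200 ms of samples

# Voiced excitation: quasi-periodic pulse train at a 100 Hz pitch.
pitch_period = fs // 100
voiced_excitation = np.zeros(len(n))
voiced_excitation[::pitch_period] = 1.0

# Unvoiced excitation: white noise.
unvoiced_excitation = np.random.randn(len(n))

voiced_speech = lfilter([1.0], a, voiced_excitation)
unvoiced_speech = lfilter([1.0], a, unvoiced_excitation)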

The periodic nature of the speech is clearly visible. The first four formants are labeled in the spectrum. The fine spectral information, due to the vibrating vocal folds, is also visible in the spectrum. It should be noted that the formant structure appears to break down above 4 kHz as noise introduced by the turbulent flow of air through the vocal tract begins to dominate. The spectrum also shows the enormous dynamic range and lowpass nature of voiced speech.

Nasals are produced when the oral cavity is blocked and the velum is lowered to couple the nasal cavity with the vocal tract. They are also voiced in nature, but the coupling of the oral and nasal cavities introduces anti-resonances, or nulls, instead of resonances, so the formants disappear.

Unvoiced, hiss-like sounds such as s, f and sh are generated by constricting the vocal tract close to the lips. Both the time and frequency views demonstrate the noise-like nature of the signal. Unvoiced speech tends to have a nearly flat or a highpass spectrum. The formants that are so obvious in voiced speech are gone, and so is the fine pitch structure. The energy in the signal is also much lower than that in voiced speech.

4.2.1.3 Speech Perception

The sense of hearing is the least understood component of the human speech communication system. Little is known about how the brain decodes the acoustic information it receives. However, quite a lot is known about the receiver it uses to detect these signals, the ear.

The human ear (depicted in figure 4.2) consists of three main sections: the outer, the middle and the inner ear. The outer ear consists of the ear lobe (pinna) and the external auditory canal. The function of the ear lobe is to channel sounds into the ear and aid in the localization of sounds. The external auditory canal channels the sound into the middle ear. The canal is approximately 2.7 cm in length and is closed at one end by the eardrum. Hence it can be viewed as an acoustic tube that resonates at 3055 Hz [ISP05].

The eardrum is a hard membrane, approximately 0.1 mm thick, which is flexible at the edge. When a sound wave strikes this membrane it vibrates. This vibration is then transferred to the three-bone structure in the middle ear and from there to the inner ear. These bones act as a transformer and match the acoustic impedance of the inner ear with that of air.

Figure 4.2: The human ear (http://www.owlnet.rice.edu/~psyc351/)

Muscles attached to these bones suppress the vibration if it is too violent and so protect the inner ear. This protection only works for sounds below 2 kHz and it does not work for impulsive sounds. The Eustachian tube connects the middle ear to the vocal tract and removes any static pressure difference between the middle ear and the outer ear. If a significant pressure difference is detected then the Eustachian tube opens and the difference is removed.

The inner ear is composed of the semicircular canals, the cochlea and the auditory nerve terminations. The function of the semicircular canals is to control balance. The cochlea is fluid filled and helical in shape (it resembles the shell of a snail). Inside the cochlea there is a hair-lined membrane called the basilar membrane. This membrane converts the mechanical signal into a neural signal. Different frequencies excite different portions of this membrane, allowing a frequency analysis of the signal to be carried out. So the ear is essentially a spectrum analyzer that responds to the magnitude of the signal. The frequency resolution is greatest at low frequencies.

Like any receiver, there is a limit to the sensitivity of the ear. If sounds are too weak, they will not be detected. This is known as the threshold of audibility. This threshold varies with frequency, and it can be increased at any given frequency by the presence of a large signal at a nearby lower frequency. This phenomenon is called masking and it is widely used in speech coding. If the quantization noise can be concentrated around the formants, then it will be rendered inaudible to the listener.


Figure 4.3: The effect of sampling. (a) the continuous signal and (b) the sampled signal


4.2.2 Sampling

A speech signal is continuous in time. Before it can be processed by digital hardware it must be converted to a signal that is discrete in time. Sampling is a process that converts a continuous time signal into a discrete time signal by measuring the signal at periodic instants in time.

Figure 4.3 shows the effect of sampling on a sinusoidal signal. It is clear that as the number of samples per second (the sampling rate) increases, the sampled signal approximates the continuous signal more closely. In fact, if the sampling rate is high enough, the sampled signal will contain all the information that is present in the continuous signal.

The Nyquist sampling theorem states that a signal may be perfectly reconstructed if it is sampled at a rate greater than twice the frequency of the highest frequency component of the signal.
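As an informal check of this statement, the sketch below samples a 3 kHz tone at one rate above and one rate below twice its frequency, and reads off the apparent frequency of each sampled version from its spectrum. The tone frequency and sampling rates are illustrative choices.

import numpy as np

def apparent_frequency(x, fs):
    """Return the dominant frequency (Hz) of a sampled signal via the FFT."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f_tone = 3000.0                                   # 3 kHz sinusoid
duration = 0.1                                    # 100 ms

for fs in (8000, 4000):                           # above and below 2 * 3 kHz
    t = np.arange(0, duration, 1.0 / fs)
    x = np.sin(2 * np.pi * f_tone * t)
    print(fs, apparent_frequency(x, fs))

# At fs = 8000 Hz the tone appears at 3000 Hz, as expected.
# At fs = 4000 Hz it aliases to roughly 1000 Hz (4000 - 3000),
# so the original signal can no longer be recovered.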

4.2.2.1 The Nyquist Sampling Theorem

It is necessary to determine under what conditions a continuous signal g(t) may be unambiguously represented by a series of samples taken from the signal at uniform intervals of T. It is convenient to represent the sampling process as that of multiplying the continuous signal g(t) by a sampling signal s(t), which is an infinite train of impulse functions δ(t − nT). The sampled signal g_s(t) is:

\[ g_s(t) = g(t)\, s(t) \]

Now the impulse train s(t) is a periodic function and may be represented by a Fourier series:
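\[ s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT) = \frac{1}{T} \sum_{n=-\infty}^{\infty} e^{jn\omega_0 t}, \qquad \omega_0 = \frac{2\pi}{T} \]

since every Fourier coefficient of the unit impulse train equals 1/T.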

The spectrum G_s(ω) of the sampled signal may be determined by taking the Fourier transform of g_s(t). This is most readily achieved by making use of the frequency shift theorem, which states that:

\[ \text{if } g(t) \leftrightarrow G(\omega) \text{ then } g(t)e^{j\omega_0 t} \leftrightarrow G(\omega - \omega_0) \]
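Applying this theorem to each term of the Fourier series of s(t) gives the spectrum of the sampled signal:

\[ G_s(\omega) = \frac{1}{T} \sum_{n=-\infty}^{\infty} G(\omega - n\omega_0) \qquad (4.1) \]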

The above equation shows that the spectrum of the sampled signal is simply the sum of the spectra of the continuous signal repeated periodically at intervals of ω_0 = 2π/T. Thus the continuous signal g(t) may be perfectly recovered from the sampled signal g_s(t) provided that the sampling interval T is chosen so that:

\[ \omega_0 = \frac{2\pi}{T} > 2\omega_B \]

where ω_B is the bandwidth of the continuous signal g(t). If this condition holds then there is no overlap of the individual spectra in the summation of expression 4.1 and the original signal can be perfectly reconstructed using a lowpass filter H(ω) with bandwidth ω_B:

\[ H(\omega) = \begin{cases} 1, & |\omega| < \omega_B \\ 0, & \text{otherwise} \end{cases} \]

This theory underpins the whole of digital audio and means that a digital audio system can in principle be designed which loses none of the information contained in the continuous domain signal g(t). Of course the practicalities are rather different, as a band-limited input signal is assumed, and the reconstruction filter H(ω), the A/D and the D/A converters must be ideal.
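In the time domain the ideal lowpass filter H(ω) corresponds to sinc interpolation of the samples. The sketch below reconstructs a tone from its samples in this way; the tone frequency, sampling rate and evaluation grid are illustrative choices, and a practical system would use a finite, causal approximation of this ideal interpolator.

import numpy as np

def sinc_reconstruct(samples, T, t):
    """Ideal reconstruction: a sum of sinc kernels centred on each sample.

    Implements g(t) = sum_n g(nT) sinc((t - nT)/T), the time-domain
    equivalent of the ideal lowpass filter H(w).
    """
    n = np.arange(len(samples))
    # Outer difference between evaluation times and sample instants.
    return np.dot(np.sinc((t[:, None] - n * T) / T), samples)

fs = 8000                        # sampling rate (Hz)
T = 1.0 / fs
f_tone = 1000.0                  # tone well below the Nyquist frequency

n = np.arange(200)               # 200 samples (25 ms) of the tone
samples = np.sin(2 * np.pi * f_tone * n * T)

t = np.linspace(0.005, 0.02, 500)                  # dense grid away from the edges
reconstructed = sinc_reconstruct(samples, T, t)
original = np.sin(2 * np.pi * f_tone * t)
print(np.max(np.abs(reconstructed - original)))    # small error, limited only by the finite number of samples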

4.2.3 Main Classes of Speech Coding

The central problem in speech coding is to represent the speech signal using as few bits as possible, so that quality and intelligibility suffer as little as possible.

Speech coding is important in digital mobile phone systems, which is why speech coding methods have advanced considerably in the last ten years. Commercially, speech coding is the most important application of the speech processing field.

The requirements for a good speech codec can be summarized as follows:

• the quality of the speech suffers as little as possible

• the speech is compressed into a small number of bits

• coding and decoding introduce only a small delay

• the codec is not sensitive to bit errors during transmission

• coding/decoding is computationally fast

• the codec should perform well with noisy speech (and, if possible, with other signals such as music)

• several consecutive encodings should not impair the quality too much

There are no perfect codecs satisfying all the requirements, because some of the requirements are contradictory. However, by making different compromises, a large number of coding standards for different applications have been developed. For instance, in the speech codec of a mobile phone all the requirements above are essential, whereas when recording speech into databases the delay, computational load and error resiliency are inessential; only the quality and a good compression ratio count.

There are plenty of coding methods, but they can be divided into roughly three main classes:

• Waveform Coding

• Source Coding

• Hybrid Coding

In waveform coding an effort is made to retain the waveform of the original signal, and the coding is based on quantization and the removal of redundancies in the waveform. In source coding the parameters of the speech (the type of excitation, a model of the vocal tract, formant frequencies, etc.) are coded, enabling reconstruction in the decoder. The border line between these classes is not sharp, especially in modern analysis-by-synthesis codecs in which one tries to reconstruct the waveform of the original speech with a good selection of parameters.

In the following it will be assumed that the bandwidth of the speech is the same as in the mobile phone network, i.e. 300-3400 Hz, and that the sampling frequency is 8 kHz. This is so-called narrowband speech. In some applications the narrowband network is not used and a higher sampling frequency may be employed, for instance in video conferencing. In these applications the bandwidth is usually 50-7000 Hz and the sampling frequency is 16 kHz. This is so-called wideband speech.

4.2.3.1 Waveform Coding

In general, waveform codecs are designed to be signal independent. They are designed to map the input waveform of the encoder into a facsimile-like replica of it at the output of the decoder. Because of this, they can also encode secondary types of information such as signaling tones, voice-band data, or even music. Because of this signal transparency, their coding efficiency is usually quite modest. The coding efficiency can be improved by exploiting some statistical signal properties, if the codec parameters are optimized for the most likely categories of input signals, while still maintaining good quality for other types of signals as well.

The waveform codecs can be further subdivided into time-domain waveform codecs and frequency-domain waveform codecs.

Time-Domain Waveform Coding The most well-known representative of signal-independent time-domain waveform coding is the A-law companded pulse code modulation (PCM) scheme. This coding has been standardized by the CCITT at 64 kbps, using non-linear companding characteristics to achieve a near-constant signal-to-noise ratio (SNR) over the total input dynamic range. More explicitly, the non-linear companding compresses large input samples and expands small ones. Upon quantizing this companded signal, large input samples will tolerate higher quantization noise than small samples. Also well known are the 32 kbps adaptive differential PCM (ADPCM) scheme standardized in ITU Recommendation G.721 and the adaptive delta modulation (ADM) arrangement, where usually the most recent signal sample or a linear combination of the last few samples is used to form an estimate of the current one. Their difference signal, the prediction residual, is then computed and encoded with a reduced number of bits, since it has a lower variance than the incoming signal. This estimation process is actually linear prediction with fixed coefficients. However, owing to the non-stationary statistics of speech, a fixed predictor cannot consistently characterize the changing spectral envelope of speech signals.
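A minimal sketch of the companding principle is given below: the A-law compression curve is applied before a uniform quantizer and inverted afterwards, so that small samples are represented relatively finely while large samples tolerate more quantization noise. The 8-bit uniform quantizer and the test samples are illustrative; the standardized 64 kbps codec uses a segmented approximation of this curve rather than the direct formula shown here.

import numpy as np

A = 87.6  # standard A-law compression parameter

def a_law_compress(x):
    """A-law compression of samples normalized to [-1, 1]."""
    ax = np.abs(x)
    y = np.where(ax < 1.0 / A,
                 A * ax / (1.0 + np.log(A)),
                 (1.0 + np.log(A * ax)) / (1.0 + np.log(A)))
    return np.sign(x) * y

def a_law_expand(y):
    """Inverse of the A-law compression curve."""
    ay = np.abs(y)
    x = np.where(ay < 1.0 / (1.0 + np.log(A)),
                 ay * (1.0 + np.log(A)) / A,
                 np.exp(ay * (1.0 + np.log(A)) - 1.0) / A)
    return np.sign(y) * x

def uniform_quantize(y, bits=8):
    """Uniform quantizer on [-1, 1]."""
    scale = 2 ** bits / 2 - 1
    return np.round(y * scale) / scale

# Small and large samples: companding keeps the relative error comparable.
x = np.array([0.001, 0.01, 0.1, 0.9])
decoded = a_law_expand(uniform_quantize(a_law_compress(x)))
print(np.abs(decoded - x) / x)   # roughly similar relative errors across the range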

Adaptive predictive coding (APC) schemes utilize two different time-varying predictors to describe speech signals more accurately: a short-term predictor (STP) and a long-term predictor (LTP). The STP is utilized to model the speech spectral envelope, while the LTP is employed to model the line-spectrum-like fine structure representing the voicing information due to quasi-periodic voiced speech. All in all, time-domain waveform codecs treat the speech signal to be encoded as a full-band signal and attempt to map it into as close a replica of the input as possible. The difference among the various coding schemes lies in their degree and manner of using prediction to reduce the variance of the signal to be encoded, and hence the number of bits necessary to represent it.
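The sketch below illustrates, under simplified assumptions, how such a two-stage predictor reduces the variance of the signal to be encoded: a short-term predictor is derived from the frame autocorrelation with the Levinson-Durbin recursion, and a one-tap long-term predictor then removes the remaining pitch periodicity from the short-term residual by searching over candidate lags. The frame, predictor order and lag range are illustrative choices.

import numpy as np

def levinson_durbin(r, order):
    """Solve for short-term predictor coefficients from autocorrelations r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / e        # reflection coefficient
        a[:i + 1] += k * a[:i + 1][::-1]
        e *= (1.0 - k * k)
    return a                                      # [1, a1, ..., ap] of A(z)

def short_term_residual(frame, order=10):
    """Filter the frame by A(z) to obtain the short-term prediction residual."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = levinson_durbin(r[:order + 1], order)
    return np.convolve(frame, a)[:len(frame)]

def long_term_residual(res, min_lag=20, max_lag=160):
    """One-tap long-term prediction: subtract the best scaled past lag."""
    best_lag, best_gain, best_err = min_lag, 0.0, np.inf
    for lag in range(min_lag, max_lag):
        past, cur = res[:-lag], res[lag:]
        gain = np.dot(cur, past) / (np.dot(past, past) + 1e-12)
        err = np.sum((cur - gain * past) ** 2)
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    out = res.copy()
    out[best_lag:] -= best_gain * res[:-best_lag]
    return out, best_lag, best_gain

# Illustrative use on a synthetic voiced-like frame (40 ms at 8 kHz).
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
frame = np.sign(np.sin(2 * np.pi * 100 * t)) + 0.05 * np.random.randn(len(t))
stp_res = short_term_residual(frame)
ltp_res, lag, gain = long_term_residual(stp_res)
print(np.var(frame), np.var(stp_res), np.var(ltp_res))   # typically decreasing variances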

Frequency-Domain Waveform Coding In frequency-domain waveform codecs, the input signal undergoes a more or less accurate short-time spectral analysis. The signal is split into a number of subbands, and the individual subband signals are then encoded using different numbers of bits, in order to obey rate-distortion theory on the basis of their prominence. The various methods differ in their accuracy of spectral resolution and in the bit-allocation principle (fixed, adaptive, semi-adaptive). Two well-known representatives of this class are subband coding (SBC) and adaptive transform coding (ATC).
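A toy illustration of the subband approach is sketched below: the signal is split into a low band and a high band, each band is decimated, and the number of quantizer bits per band is derived from the band energies. The filters, the two-band split and the bit-allocation rule are simplified assumptions for illustration and are not taken from any particular standard.

import numpy as np
from scipy.signal import butter, lfilter

fs = 8000

def two_band_split(x):
    """Split a signal into a low band and a high band with a crossover at fs/4."""
    b_lo, a_lo = butter(6, 0.5)                   # lowpass, cutoff at fs/4
    b_hi, a_hi = butter(6, 0.5, btype="high")     # highpass counterpart
    low = lfilter(b_lo, a_lo, x)[::2]             # filter and decimate by 2
    high = lfilter(b_hi, a_hi, x)[::2]            # (a real SBC codec would use quadrature mirror filters)
    return low, high

def allocate_bits(bands, total_bits=12):
    """Share the bit budget between bands according to their log energies."""
    log_e = np.array([np.log10(np.mean(b ** 2) + 1e-12) for b in bands])
    weights = log_e - log_e.min() + 1.0
    bits = np.maximum(2, np.round(total_bits * weights / weights.sum()))
    return bits.astype(int)

def quantize(band, bits):
    """Uniform quantizer scaled to the band's own peak value."""
    peak = np.max(np.abs(band)) + 1e-12
    scale = 2 ** bits / 2 - 1
    return np.round(band / peak * scale) / scale * peak

# Illustrative input: a mostly low-frequency, speech-like test signal.
t = np.arange(0, 0.1, 1.0 / fs)
x = np.sin(2 * np.pi * 300 * t) + 0.2 * np.sin(2 * np.pi * 2500 * t)

low, high = two_band_split(x)
bits = allocate_bits([low, high])
coded = [quantize(b, n) for b, n in zip((low, high), bits)]
print(bits)   # the energetic low band receives more bits than the high band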

In summary, waveform coding tries to encode the waveform itself in an efficient way. The signal is stored in such a way that, upon decoding, the resulting signal will have the same general shape as the original. Waveform coding techniques apply to audio signals in general and not just to speech, as they try to encode every aspect of the signal.

4.2.3.2 Source Coding

The philosophy of vocoders is based on a priori knowledge of the way the speech signal to be encoded was generated at the signal source by a speaker. The air compressed by the lungs excites the vocal cords in two typical modes. When generating voiced sounds, they vibrate and generate a quasi-periodic speech waveform, while in the case of lower-energy unvoiced sounds they do not participate in the voice production and the source behaves similarly to a noise generator. The excitation signal, denoted by E(z) in the z-domain, is then filtered through the vocal apparatus, which
