schmidt@informatik.haw-hamburg.de
Introduction to Audio Coding
o Fundamentals of Sound
o Sampling & Quantization
o Compression
o Coding Standards
The Nature of Sound
o Conversion of energy into vibrations in the air (or some other elastic medium)
o Most sound sources vibrate in complex ways leading to sounds with components at several different frequencies
o Sound is characterized at a very small timescale
o Frequency spectrum: relative amplitudes of the frequency components
o Range of human hearing: roughly 20 Hz–20 kHz, falling off with age
Psychoacoustics
Waveforms
o Sounds change over time
- e.g. musical note has attack and decay, speech changes constantly
o Frequency spectrum alters as sound changes
o Waveform is a plot of amplitude against time
- Provides a graphical view of characteristics of a changing sound
- Can identify syllables of speech, rhythm of music, quiet and loud passages, etc
Digitization – Sampling
o Hearing limit of about 20 kHz implies a minimum sampling rate of 40 kHz (sampling theorem: at least twice the highest frequency to be reproduced)
- CD: 44.1kHz
- Sub-multiples often used for low bandwidth - e.g. 22.05kHz for Internet audio
- DAT: 48kHz
- (Hence mixing sounds from CD and DAT will require some resampling, best avoided)
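The 40 kHz minimum can be illustrated by what happens to a tone above half the sampling rate: it cannot be represented and folds back to a lower frequency. The following is a minimal sketch, assuming ideal sampling of a pure tone without an anti-aliasing filter; the function name `alias_frequency` is illustrative and does not come from the slides.

```python
def alias_frequency(f_hz: float, rate_hz: float) -> float:
    """Frequency at which a pure tone of f_hz appears after sampling at rate_hz."""
    # Fold the tone back into the band [0, rate_hz / 2] around the
    # nearest integer multiple of the sampling rate.
    return abs(f_hz - rate_hz * round(f_hz / rate_hz))


if __name__ == "__main__":
    for tone in (1_000, 20_000, 25_000):
        print(tone, "Hz sampled at 44100 Hz appears at", alias_frequency(tone, 44_100), "Hz")
```

Tones up to half the sampling rate come back unchanged, while a 25 kHz tone sampled at 44.1 kHz shows up inside the audible band.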
Digitization – Quantization
o 16 bits, 65536 quantization levels, CD quality
o 8 bits: audible quantization noise, can only use if some distortion is acceptable, e.g. voice communication
o Dithering – introduce small amount of random noise before sampling
- Noise causes samples to alternate rapidly between quantization levels, effectively smoothing sharp transitions
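Quantization and dithering can be sketched in a few lines. This is an illustration only: it assumes samples normalised to the range [-1, 1], a uniform quantizer, and uniformly distributed dither of up to half a quantization step either way. The function names and the dither amplitude are assumptions, not taken from the slides.

```python
import random


def quantize(x: float, bits: int) -> int:
    """Map a sample in [-1.0, 1.0] to one of 2**bits integer levels."""
    levels = 2 ** bits
    x = max(-1.0, min(1.0, x))                       # clip to the valid range
    return int((x + 1.0) / 2.0 * (levels - 1) + 0.5)  # round to the nearest level


def quantize_with_dither(x: float, bits: int, rng: random.Random) -> int:
    """Add uniform noise of up to half a step either way, then quantize."""
    step = 2.0 / (2 ** bits - 1)
    return quantize(x + rng.uniform(-0.5 * step, 0.5 * step), bits)


if __name__ == "__main__":
    rng = random.Random(0)
    # Without dither a constant input always lands on the same level;
    # with dither it alternates between the two neighbouring levels.
    print(quantize(0.1, 2))
    print([quantize_with_dither(0.1, 2, rng) for _ in range(10)])
```

With only 2 bits the effect is easy to see: the dithered output switches between two adjacent levels, and its average approaches the unquantized position of the input between them.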
Undersampling & Dithering
Data Size
o Sampling rate r is the number of samples per second, sample size is s bits
o Each second of digitized audio requires rs/8 bytes
o CD quality: r = 44100, s = 16, hence each second requires just over 86 kbytes (k = 1024), each minute roughly 5 Mbytes (mono)
o CD quality, stereo, 3-minute song requires roughly 30 Mbytes
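The figures on this slide follow directly from rs/8 bytes per second and channel. A small sketch of the arithmetic, using the slide's convention k = 1024; the helper name `audio_bytes` is made up for illustration.

```python
def audio_bytes(rate_hz: int, sample_bits: int, seconds: float, channels: int = 1) -> float:
    """Uncompressed PCM size in bytes: r * s / 8 per second and per channel."""
    return rate_hz * sample_bits / 8 * seconds * channels


KBYTE = 1024            # the slide's convention, k = 1024
MBYTE = 1024 * 1024

print(audio_bytes(44_100, 16, 1) / KBYTE)                # one second, mono, in kbytes
print(audio_bytes(44_100, 16, 60) / MBYTE)               # one minute, mono, in Mbytes
print(audio_bytes(44_100, 16, 180, channels=2) / MBYTE)  # three minutes, stereo, in Mbytes
```

This gives about 86.1 kbytes per second and 5.05 Mbytes per minute for mono, and about 30.3 Mbytes for a 3-minute stereo song.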
Compression
o In general, lossy methods required because of complex and unpredictable nature of audio data
o Difference in perceiving sound and image means different approach from image compression:
+ Sampling & Quantization: PCM – DPCM – ADPCM
+ Special voice coding: Vocoders
+ Transform coding: DFT, DCT, MDCT
+ Perceptual coding
+ Entropy coding
Companding
o Non-linear quantization
o Higher quantization levels spaced further apart than lower ones
o Quiet sounds represented in greater detail than loud ones
o µ-law, A-law (16 → 8 bit)
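The effect of companding can be tried out with the continuous µ-law curve, F(x) = sgn(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255. Note the assumptions: samples are normalised floats in [-1, 1], and the 8-bit code is obtained by plain uniform rounding of the companded value. This is a sketch of the principle and not the segment-based bit layout that G.711 actually specifies; all function names are illustrative.

```python
import math

MU = 255.0  # value commonly used for 8-bit mu-law


def mu_compress(x: float) -> float:
    """Compress a sample in [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x: float) -> int:
    """Compand, then quantize uniformly to a signed code in [-127, 127]."""
    return round(mu_compress(x) * 127)


def decode_8bit(code: int) -> float:
    """Map a signed 8-bit code back to a sample in [-1, 1]."""
    return mu_expand(code / 127)


if __name__ == "__main__":
    for x in (0.01, 0.9):
        companded_error = abs(decode_8bit(encode_8bit(x)) - x)
        linear_error = abs(round(x * 127) / 127 - x)
        print(x, "companded error:", companded_error, "linear error:", linear_error)
```

For the quiet sample the companded code reproduces the input far more accurately than linear 8-bit rounding, while for the loud sample it is less accurate, which is the trade-off described above.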
Improved Pulse Code Modulation
o Differential Pulse Code Modulation
- Similar to video inter-frame compression
- Compute a predicted value for next sample, store the difference between prediction and actual value
o Adaptive Differential Pulse Code Modulation
- Dynamically vary step size used to store quantized differences
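A minimal DPCM sketch, under simplifying assumptions that are not from the slides: the predictor is simply the previously reconstructed sample, and the prediction error is quantized with a fixed step size. The function names are illustrative.

```python
def dpcm_encode(samples: list[int], step: int = 4) -> list[int]:
    """Encode integer samples as quantized differences from a prediction."""
    codes = []
    predicted = 0                               # predictor: previous reconstructed sample
    for s in samples:
        code = round((s - predicted) / step)    # quantized prediction error
        codes.append(code)
        predicted += code * step                # track what the decoder will reconstruct
    return codes


def dpcm_decode(codes: list[int], step: int = 4) -> list[int]:
    """Rebuild the samples by accumulating the dequantized differences."""
    out = []
    predicted = 0
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out


if __name__ == "__main__":
    signal = [0, 10, 18, 25, 30, 28, 20]
    codes = dpcm_encode(signal)
    print(codes)                # small numbers, cheaper to store than the samples
    print(dpcm_decode(codes))   # reconstruction, within step / 2 of the input
```

The encoder predicts from the reconstructed value rather than the original one, so quantization errors do not accumulate in the decoder. An adaptive variant (ADPCM) would change `step` from sample to sample according to the recent signal; that adaptation is not shown here.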
Speech Coders
There is a variety of well-known voice coders, most of them ITU standards:
o G.711 – ‘the’ ISDN codec, µ / A-law companding, 8 bit@8kHz, (2:1)
o G.721, G.722, G.723
o G.726, G.727 – ADPCM with linear prediction, variable codewords 2 – 5 bit@8kHz (≤ 8:1)
o GSM 06.10 - ADPCM with linear prediction, 1.6 bit@8kHz (10:1)
o Current compression values up to 50 : 1, 10 : 1 with decent quality (see www.speex.org)
Perceptually-Based Compression
o Identify and discard data that doesn't affect the perception of the signal
- Needs a psycho-acoustical model, since ear and brain do not respond to sound waves in a simple way
o Threshold of hearing – sounds too quiet to hear
o Masking – sound obscured by some other sound
Compression by Masking
o Split signal into bands of frequencies using filters
- Commonly use 32 bands
o Compute masking level for each band, based on its average value and a psycho-acoustical model
- i.e. approximate masking curve by a single value for each band
o Discard signal if it is below masking level
o Otherwise quantize using the minimum number of bits that will mask quantization noise
o Also: Temporal Masking
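The per-band decision above can be sketched as a simple bit allocation. Two assumptions are made here that the slides do not state: band levels and masking levels are already given in dB (the psycho-acoustical model that produces the masking levels is not shown), and each additional bit is taken to lower the quantization noise by roughly 6 dB. The function name is illustrative.

```python
import math

DB_PER_BIT = 6.02  # assumed rule of thumb: each extra bit lowers quantization noise by about 6 dB


def allocate_bits(band_levels_db: list[float], mask_levels_db: list[float]) -> list[int]:
    """Return bits per band: 0 if the band is masked, else just enough bits
    to push the quantization noise below the masking level."""
    bits = []
    for level, mask in zip(band_levels_db, mask_levels_db):
        if level <= mask:
            bits.append(0)      # inaudible: discard the band
        else:
            bits.append(math.ceil((level - mask) / DB_PER_BIT))
    return bits


if __name__ == "__main__":
    # Three hypothetical bands: well above the mask, below it, slightly above it.
    print(allocate_bits([60.0, 20.0, 45.0], [30.0, 25.0, 40.0]))
```

A band far above its masking level receives many bits, a band just above it receives few, and a masked band receives none.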
MPEG Audio Coding
o MPEG Audio, Layer 3
o Three layers of audio compression in MPEG-1 (MPEG-2 essentially identical)
o Layer 1...Layer 3, encoding process increases in complexity, data rate for same quality decreases
- Layer 2: temporal & frequency masking
- Layer 3: MDCT + …
- e.g. Same quality 192kbps at Layer 1, 128kbps at Layer 2, 64kbps at Layer 3
o 10:1 compression ratio at high quality
o Variable bit rate coding (VBR)
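The quoted bit rates can be related to the compression ratio with a short calculation. The sketch assumes that the rates are per channel and that the reference is uncompressed CD-quality PCM (44.1 kHz, 16 bits); both are assumptions, since the slide does not say so. The function name is illustrative.

```python
CD_RATE_HZ = 44_100
CD_SAMPLE_BITS = 16


def compression_ratio(coded_kbps: float, channels: int = 1) -> float:
    """Ratio of the uncompressed CD-quality PCM bit rate to the coded bit rate."""
    pcm_kbps = CD_RATE_HZ * CD_SAMPLE_BITS * channels / 1000
    return pcm_kbps / coded_kbps


if __name__ == "__main__":
    for layer, kbps in ((1, 192), (2, 128), (3, 64)):
        print("Layer", layer, "at", kbps, "kbps:", round(compression_ratio(kbps), 1), ": 1")
```

Under these assumptions Layer 3 at 64 kbps per channel corresponds to a ratio of about 11:1, in line with the 10:1 figure above.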