Multimedia Databases
Wolf-Tilo Balke Janus Wawrzinek
Institut für Informationssysteme
Technische Universität Braunschweig
• Audio Retrieval
- Basics of Audio Data
- Audio Information in Databases
- Audio Retrieval
7 Previous Lecture
7 Audio Retrieval
7.1 Low-Level Audio Features
7.2 Difference Limen
7.3 Pitch Recognition
7 Audio Retrieval
• Typical low-level features
– Mean amplitude (loudness)
– Frequency distribution, bandwidth
– Energy distribution (brightness)
– Harmonics
– Pitch
• Measured
– In the time domain: each point in time is assigned an amplitude
7.1 Low-level Audio Features
• Fourier analysis:
– Simple characterization by Fourier transform
– Fourier coefficients are descriptive feature vector
• Issues:
– Time-domain does not show the frequency components of a signal
– Frequency-domain does not show when frequencies occur
7.1 Fourier Analysis
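The idea of using Fourier coefficients as a descriptive feature vector can be sketched with numpy. The function name and parameters here are illustrative, not from the lecture:

```python
import numpy as np

def fourier_features(signal, n_coeffs=512):
    """Magnitudes of the first n_coeffs Fourier coefficients as a feature vector."""
    spectrum = np.abs(np.fft.rfft(signal))
    return spectrum[:n_coeffs]

# 440 Hz sine, sampled at 8 kHz for one second (1 Hz bin resolution)
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
features = fourier_features(x)
# The strongest coefficient sits in the bin corresponding to 440 Hz
```

Note that this vector captures which frequencies occur, but, as stated above, not when they occur.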
• Spectrograms: combined representation of the time and frequency domains
– Raster image
– X-axis: time
– Y-axis: frequency components
– Gray value of a point: energy of that frequency at that time
• Allows, for example, analysis of regularity
7.1 Spectrogram
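A spectrogram can be computed with a short-time Fourier transform: slide a window over the signal and take the spectrum of each windowed frame. This is a minimal sketch (window and hop sizes are illustrative choices):

```python
import numpy as np

def spectrogram(signal, window=256, hop=128):
    """Magnitude spectrogram: rows = time windows, columns = frequency bins."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = signal[start:start + window] * np.hanning(window)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

sr = 8000
t = np.arange(sr) / sr
# First half a 400 Hz tone, second half a 1200 Hz tone
x = np.concatenate([np.sin(2 * np.pi * 400 * t[:sr // 2]),
                    np.sin(2 * np.pi * 1200 * t[:sr // 2])])
S = spectrogram(x)
```

In the resulting raster, the dominant frequency bin moves upward between the first and the last time window, which is exactly the time-frequency information a single Fourier transform cannot show.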
• Spectrogram of the spoken word “durst”
7.1 Spectrogram
• Use of different low-level features for automatic classification of audio files
– Different audio classes have typical values for various properties
– Thus, each class has typical feature vectors
7.1 Classification
• Distinguish speech and music
• Exact characteristics are difficult to predict in each case, but there are general trends
• Don’t just use a single feature, but evaluate a combination of all features
• Dependent and independent features
7.1 Example
• Bandwidth
– For speech rather low: 100–7,000 Hz
– In music it tends to be high: 16–20,000 Hz
• Brightness (central point of the bandwidth):
– In speech it is low (mainly due to the low bandwidth)
7.1 Example
• Proportion of silence
– Frequent pauses in speech (between words and sentences)
– Low percentage of silence for music (except for solo instruments)
• Variance of the zero crossing rate (over time)
– In speech there is a characteristic structure of syllables: short and long vowels, therefore a fluctuating zero crossing rate
– Music often has a consistent rhythm, therefore a more stable zero crossing rate
7.1 Example
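The variance of the zero crossing rate can be computed per window and compared across signal types. A minimal numpy sketch with synthetic signals (the window size and test frequencies are illustrative):

```python
import numpy as np

def zcr(frame):
    """Zero crossing rate: fraction of adjacent sample pairs that change sign."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def zcr_variance(signal, window=400):
    """Variance of the per-window zero crossing rate over time."""
    rates = [zcr(signal[i:i + window])
             for i in range(0, len(signal) - window + 1, window)]
    return np.var(rates)

sr = 8000
t = np.arange(2 * sr) / sr
# Steady tone: nearly constant ZCR -> low variance (music-like)
tone = np.sin(2 * np.pi * 220 * t)
# Alternating low/high-frequency segments: fluctuating ZCR (speech-like syllables)
segments = [np.sin(2 * np.pi * (150 if i % 2 else 2500) * t[:800]) for i in range(20)]
speechlike = np.concatenate(segments)
```

As the slide suggests, the speech-like signal shows a much higher ZCR variance than the steady tone.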
• Simple classification algorithm (Lu and Hankinson, 1998)
7.1 Example
[Decision tree: incoming audio is split into Music, Speech, and Solo music by thresholding (high/low) the portion of silence, the brightness, and the variance of the zero crossing rate]
• Quantitative high / low estimates are highly dependent on the collection
– Determine reference vector for each class by a set of training examples
• Assignment of new audio files to classes is based on the minimum distance of their feature vectors to the reference vectors of the respective classes
7.1 Example
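The nearest-reference-vector assignment described above can be sketched as follows; the two-dimensional features and class labels are hypothetical examples, not from the lecture:

```python
import numpy as np

def reference_vectors(training):
    """Reference vector per class: mean of the labeled training examples."""
    return {label: np.mean(vectors, axis=0) for label, vectors in training.items()}

def classify(x, refs):
    """Assign x to the class whose reference vector is closest (Euclidean)."""
    return min(refs, key=lambda label: np.linalg.norm(x - refs[label]))

# Hypothetical 2-D features: (bandwidth in kHz, portion of silence)
training = {
    "speech": [np.array([5.0, 0.4]), np.array([6.0, 0.5])],
    "music":  [np.array([18.0, 0.05]), np.array([20.0, 0.1])],
}
refs = reference_vectors(training)
label = classify(np.array([19.0, 0.08]), refs)
```

Since the reference vectors are trained from the collection itself, the quantitative high/low thresholds adapt to the collection, as noted above.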
• Low-level features for audio retrieval
– For good feature vectors, the audio signal must be divided into time windows
– Compute a vector for each window
• Calculate low-level features in the time window
• Build statistical characteristics of the low-level features
– Perceptual comparison of audio files
7.1 Static Coefficients
• Four statistical characteristics (Wold and others, 1996)
– Loudness (perceived volume)
• Measured as the root mean square (RMS) of the amplitude values (in dB)
• More sophisticated methods take into account differences in the perceptibility of parts of the frequency spectrum (< 50 Hz, > 20 kHz)
7.1 Static Coefficients
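RMS loudness in dB can be sketched in a few lines (0 dB here means full scale, a common but arbitrary reference choice):

```python
import numpy as np

def loudness_db(frame, eps=1e-12):
    """Loudness as the root mean square of the amplitudes, in dB (full scale = 0 dB)."""
    rms = np.sqrt(np.mean(frame ** 2) + eps)  # eps avoids log(0) for silent frames
    return 20 * np.log10(rms)

t = np.arange(8000) / 8000
loud = np.sin(2 * np.pi * 440 * t)        # amplitude 1.0
quiet = 0.1 * np.sin(2 * np.pi * 440 * t) # amplitude 0.1 -> 20 dB lower
```

Scaling the amplitude by a factor of 10 lowers the measured loudness by exactly 20 dB, matching the logarithmic dB definition.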
– Brightness (perceived brightness)
• Defined as the center of gravity of the Fourier spectrum
• Logarithmic scale
• Describes the amount of high frequencies in the signal
– Bandwidth (frequency bandwidth)
• Defined as the weighted average of the differences between the Fourier coefficients’ frequencies and the center of gravity of the spectrum
• Amplitudes are used as weights
7.1 Static Coefficients
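Brightness (spectral center of gravity) and bandwidth (amplitude-weighted deviation from it) can be sketched directly from the definitions above; for simplicity this computes the centroid on a linear rather than logarithmic frequency scale:

```python
import numpy as np

def brightness_bandwidth(frame, sr):
    """Spectral centroid (brightness) and amplitude-weighted bandwidth, both in Hz."""
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1 / sr)
    centroid = np.sum(freqs * mags) / np.sum(mags)
    bandwidth = np.sum(np.abs(freqs - centroid) * mags) / np.sum(mags)
    return centroid, bandwidth

sr = 8000
t = np.arange(sr) / sr
pure = np.sin(2 * np.pi * 1000 * t)                         # single frequency
mix = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 1500 * t)  # spread spectrum
```

Both signals have the same brightness (1000 Hz), but the two-tone mix has a much larger bandwidth, which is exactly the distinction the two features are meant to capture.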
– Pitch (perceived pitch)
• Calculated from the frequencies and amplitudes of the peaks within each interval (pitch tracking)
• Pitch tracking is a difficult problem; simpler systems therefore often approximate it by the fundamental frequency
7.1 Static Coefficients
• Time-dependent function for each quantity in each time window
• E.g., laughter
– Loudness
– Brightness
– Bandwidth
– Pitch
7.1 Static Coefficients
• Aggregate statistical description of the four functions through:
– Expected value (average value)
– Variance (mean square deviation)
– Autocorrelation (self-similarity of the signal)
• Either for each window or for the whole signal (results in a 12-dimensional feature vector, such as in IBM’s QBIC)
7.1 Static Coefficients
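The 12-dimensional vector (mean, variance, and autocorrelation for each of the four trajectories) can be sketched as follows; the per-window measurement values are hypothetical:

```python
import numpy as np

def autocorrelation(x, lag=1):
    """Normalized autocorrelation of a feature trajectory at a given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x * x)
    return np.sum(x[:-lag] * x[lag:]) / denom if denom > 0 else 0.0

def feature_vector(trajectories):
    """Mean, variance and lag-1 autocorrelation of each per-window trajectory
    (loudness, brightness, bandwidth, pitch) -> 12 values."""
    stats = []
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)
        stats.extend([np.mean(traj), np.var(traj), autocorrelation(traj)])
    return np.array(stats)

# Hypothetical per-window measurements for one audio file
loudness   = [0.5, 0.6, 0.55, 0.7]
brightness = [1200, 1250, 1300, 1280]
bandwidth  = [800, 820, 790, 810]
pitch      = [220, 221, 219, 220]
v = feature_vector([loudness, brightness, bandwidth, pitch])
```

The resulting vector is easy to index and compare, which is why this representation works well for query-by-example, as noted later in the evaluation.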
• Example: Laughter
• Each sound has typical values
• Thus we can classify audio files
7.1 Static Coefficients
• Training set of sounds of a class leads to a perceptional model for each class
– Compute the vector of the mean
– Calculate the covariance matrix
7.1 Static Classification
• For every new audio file, compute the Mahalanobis distance to each class: D(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ))
• Assign the file to a class (either by a distance threshold or by minimum distance)
• Determine the probability of correct classification
7.1 Static Classification
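A sketch of this classification scheme: fit a mean vector and covariance matrix per class, then assign new files by minimum Mahalanobis distance. The two-dimensional features and class names are hypothetical:

```python
import numpy as np

def fit_class(samples):
    """Perceptional model of a class: mean vector and covariance matrix."""
    samples = np.asarray(samples, dtype=float)
    return np.mean(samples, axis=0), np.cov(samples, rowvar=False)

def mahalanobis(x, mean, cov):
    """D(x) = sqrt((x - mu)^T Sigma^-1 (x - mu))"""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Hypothetical 2-D training features for two classes
rng = np.random.default_rng(42)
laughter = rng.normal([2.0, 5.0], [0.3, 0.5], size=(50, 2))
music = rng.normal([8.0, 1.0], [0.4, 0.2], size=(50, 2))
models = {"laughter": fit_class(laughter), "music": fit_class(music)}

def classify(x):
    """Assign x to the class with minimum Mahalanobis distance."""
    return min(models, key=lambda c: mahalanobis(x, *models[c]))
```

Unlike plain Euclidean distance, the Mahalanobis distance accounts for how strongly each feature varies within a class and for correlations between features.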
• Classification for laughter (Wold and others, 1996)
7.1 Application in Classification
• Statistical properties work well for retrieval and classification of short audio data
– Parameters statistically represent human perception
– Easy to use, easy to index, query by example
– The only extension found in commercial databases
• DB2 AIV Extenders (development discontinued)
• Oracle Multimedia
7.1 Evaluation
• OK for differentiating between speech and music, or laughter and music
– But purely statistical values are rather unsuitable for classifying and differentiating between musical pieces
– Detection of notes from the audio signal (pitch determination) does not work very well
– How does one define the term “melody”? (especially for queries, query by humming)
7.1 Evaluation
• Recognition of notes from signal
– Variety of instruments
– Overlap of different instruments (and possibly voice)
• Simple, if we have data in MIDI format and the audio signal was synthesized from it
7.1 Problem
• Definition of “melody”
– Melody = sequence of musical notes
– But querying for a melody has to be:
• Invariant under pitch shift (soprano and bass)
• Invariant under time shift
• Invariant under slight variations
• Often, not the sequence of notes themselves is used, but the sequence of their differences
7.1 Problem
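The "sequence of differences" idea can be sketched in a few lines: representing a melody by the intervals between successive notes makes it invariant under transposition (pitch shift).

```python
def intervals(notes):
    """Interval representation: differences of successive MIDI note numbers.
    Invariant under transposition (pitch shift)."""
    return [b - a for a, b in zip(notes, notes[1:])]

# A short melody in MIDI note numbers (C D E C), and the same melody a fifth higher
melody = [60, 62, 64, 60]
transposed = [n + 7 for n in melody]
```

Both versions yield the same interval sequence [2, 2, -4], so a soprano and a bass humming the same tune produce the same query representation.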
• Pitch has something to do with frequency
– Only useful for periodic signals, not for noise
– Harmonic tones have one main oscillation and several harmonics
7.2 Frequencies and Pitch
• Harmonics
• Interference makes the automatic detection of the dominant pitch difficult
• Human perception often differs from physical measurements
– E.g., fundamental frequency ≠ pitch
• However, we need the pitch to extract the melody line
7.2 Problem
• Exactly how do people perceive frequency differences?
– The Difference Limen is the smallest change that is reliably perceived (“just noticeable difference”)
– Accuracy varies with pitch, duration, and volume
– Experimental determination of average values for sine waves (Jesteadt and others, 1977)
7.2 Difference Limen
• Determined through psychological testing
– Two tones with 500 ms duration and a small frequency difference are played one after the other
– Subjects determine whether the second tone was higher or lower
– This results in a psychometric function relating the frequency difference to the accuracy of the classification (50%–100%)
7.2 Difference Limen
• (Jesteadt and others, 1977): Difference Limen of about 0.2%
7.2 Difference Limen
• 0.2% Difference Limen means that most people can distinguish a 1000 Hz tone from a 1002 Hz tone reliably
7.2 Difference Limen
• Quality of separation is not uniform across the frequency band
– Worse at high and low frequencies
• Tone duration is important
– Increasingly better for tones between 0–100 ms, constant for longer durations
• Volume is important
– Increasingly better for tones between 5–40 dB, then constant
7.2 Difference Limen
• ANSI standard (1994)
– “Pitch is that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high. Pitch depends mainly on the frequency content of the sound stimulus, but it also depends on the sound pressure and the waveform of the stimulus”
• Typically, we restrict ourselves to the melody line to distinguish pitch from timbre
– E.g., “s” and “sh” sounds differ mainly in timbre
7.3 Pitch Definition
• Experiments on the frequency scale
– Test subjects adapt a generated sine wave (at 40 dB) to the perceived pitch (Fletcher, 1934)
– Histograms show the pitch (x Hz) and the agreement of all test subjects (x ± y Hz)
– Multimodal distributions indicate several pitches
• E.g., polyphonic music: some persons concentrate on one instrument, others on another
7.3 Pitch Determination
• Location-dependent pitch detection
– The cochlea perceives different frequencies at different places
– High frequencies at the entrance of the cochlea
– Low frequencies at the end of the cochlea
– The brain recognizes which neurons were stimulated
7.3 Theoretical Models
• The stimulation of different neurons along the approximately 35 mm long basilar membrane in the cochlea is a typical pattern for audio coding
• The pitch can be detected from these patterns
7.3 Location-dependent Models
• Frequency-place formula, with z the place of excitation in millimeters (Greenwood, 1990): f(z) = 165.4 · (10^(0.06·z) − 0.88) Hz
7.3 Location-dependent Models
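The Greenwood frequency-place function can be sketched directly; the parameter values used here (A = 165.4, a = 0.06 per mm, k = 0.88) are the commonly cited human constants from Greenwood (1990):

```python
def greenwood_frequency(z_mm, A=165.4, a=0.06, k=0.88):
    """Characteristic frequency (Hz) at distance z_mm from the apex of the
    basilar membrane (human parameters, Greenwood 1990)."""
    return A * (10 ** (a * z_mm) - k)

# Over the ~35 mm membrane this spans roughly the audible range
low = greenwood_frequency(0)    # apex: around 20 Hz
high = greenwood_frequency(35)  # base: around 20 kHz
```

A quick sanity check: the formula maps the ~35 mm membrane mentioned above onto roughly the 20 Hz to 20 kHz range of human hearing.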
• The coding of the sound is not purely location-dependent, but also relies on temporal synchronization of the neurons
– All neurons fire spontaneously in a random sequence depending on their refractory characteristics
– When a sound with some frequency starts, it causes more neurons to fire synchronously
– The brain determines the pitch based on an “autocorrelation function” of the pattern
7.3 Time-dependent Models
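The "autocorrelation" idea behind the time-dependent model also works as a signal-processing sketch: the lag at which a signal is most similar to itself corresponds to the pitch period. The search range and test signal below are illustrative assumptions:

```python
import numpy as np

def autocorrelation_pitch(signal, sr, fmin=80, fmax=400):
    """Estimate pitch as the lag with maximal autocorrelation
    inside the plausible period range [sr/fmax, sr/fmin]."""
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / best

sr = 8000
t = np.arange(2000) / sr  # 0.25 s of signal
# 200 Hz tone with one overtone; the period is 40 samples
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
```

The strongest self-similarity occurs at the lag of the full period (40 samples), not at the half period of the overtone, so the estimate recovers the 200 Hz pitch.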
• The two models address recognizing the pitches of individual sounds
• Pitch detection is more difficult in the case of complex tones
– Groups of neurons are excited in several locations or with interfering synchronization patterns
• Which neuron excitation is the pitch?
7.3 Theoretical Models
• The lowest frequency generates the harmonics… so is the pitch the fundamental frequency?
– Psychological experiments with and without the fundamental frequency in the same note show that the pitch of the note is still rated the same
– Since synchrony remains the same with or without the fundamental frequency, we should consider the time-dependent model
• But how do we evaluate the synchrony?
7.3 Fundamental Frequency
• The auditory system analyses complex sounds in different frequency bands
• Auditory processing organizes and integrates the different impressions
• The pitch is decided by matching against harmonic templates (Goldstein, 1973)
7.3 Auditory Organization
• Experiments favor centralized template matching
– We can perceive pitches even if the signal is split into disjoint parts which are then heard separately with both ears
– The pitch is synthesized (but this does not work for all partitions; they are usually heard as several pitches)
– Listeners can be misled by ambient noise to perceive a false template
7.3 Auditory Organization
• The pitch can also be synthesized at a non-occurring frequency, e.g., 296 Hz for the simultaneous playing of the following non-harmonic tones:
7.3 Auditory Organization
[Figure: apparent 2nd/3rd harmonics]
• Pitch is a feature of the frequency at a particular time
– Pitch tracking in the frequency domain
– Pitch tracking in the time domain
• Length of the time window for the frequency spectrum
– At least twice the length of the estimated period
7.3 Pitch Tracking Algorithms
• Requirements
– Frequency resolution in the range of a semitone, with the correct octave
– Detection of different instruments with well-defined harmonics (e.g., cello, flute)
– (Recognition of pitches for conversion into symbolic notation in real time for interactive systems)
7.3 Pitch Tracking Algorithms
• HPS pitch detection is one of the simplest and most robust methods in the frequency domain (Schroeder, 1968), (Noll, 1969)
– A certain frequency range is analyzed for the fundamental frequency
• E.g., 50–250 Hz for male voices
– All frequencies in this range are analyzed for harmonic overtones
7.3 Harmonic Product Spectrum
• X(ω): strength of the frequency ω in the current time window
• R: number of harmonics to be checked
– E.g., R = 5
• Product spectrum: Y(ω) = ∏ r=1..R X(r·ω)
• The pitch is the maximum of the product spectrum Y over all frequencies ω1, ω2, … in the frequency range to be investigated
7.3 Harmonic Product Spectrum
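HPS can be sketched in numpy by multiplying the spectrum with downsampled copies of itself, so that harmonics of the true fundamental line up. A minimal sketch, assuming a 1 Hz bin resolution and the R = 5 and 50–250 Hz values from the slides:

```python
import numpy as np

def hps_pitch(signal, sr, fmin=50, fmax=250, R=5):
    """Harmonic Product Spectrum: Y(w) = product over r = 1..R of |X(r*w)|;
    the pitch is the frequency maximizing Y in [fmin, fmax]."""
    spectrum = np.abs(np.fft.rfft(signal))
    n = len(spectrum)
    product = np.ones(n // R)
    for r in range(1, R + 1):
        # spectrum[::r][k] is the magnitude at frequency bin k*r
        product *= spectrum[::r][:n // R]
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)[:n // R]
    mask = (freqs >= fmin) & (freqs <= fmax)
    return freqs[mask][np.argmax(product[mask])]

sr = 8000
t = np.arange(sr) / sr
# Harmonic tone at 120 Hz with overtones at 240 and 360 Hz
x = sum(np.sin(2 * np.pi * 120 * r * t) / r for r in range(1, 4))
```

Only at the true fundamental do all R downsampled copies contribute large values, so the product spectrum peaks at 120 Hz.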
• Problems occur due to noise at frequencies below 50 Hz
• Other problems occur due to the frequent octave errors
– Pitch is often recognized an octave too high
• Rule-based solution:
– If the next closest amplitude under the pitch candidate has approx. half the frequency of the pitch candidate, and its amplitude is above a threshold, then select the pitch one octave below the pitch candidate
7.3 Harmonic Product Spectrum
• HPS, example
• The ML algorithm (Noll, 1969) compares possible “ideal” spectra with the measured spectrum and selects the most similar one based on its shape
• Ideal spectra are chains of pulses convolved with the damping function of the signal window
• The damping function
– The signal section represented by each signal window (e.g., length 40 ms) is damped, mainly at the edges
7.3 Maximum Likelihood Estimator
• Generation of an ideal spectrum
7.3 Maximum Likelihood Estimator
[Figure: an ideal spectrum is generated by convolving a pulse train with the damping function]