Multimedia Databases
Wolf-Tilo Balke, Janus Wawrzinek
Institut für Informationssysteme
Technische Universität Braunschweig
Previous Lecture
• Shape-based Features
– Chain Codes
– Area-based Retrieval
– Moment Invariants
– Query by Example
6 Audio Retrieval
6.1 Basics of Audio Data
6.2 Audio Information in Databases
6.3 Audio Retrieval
6 Audio Retrieval
• Information transfer through sound
– Audio (Latin, "I hear")
• Three different types of data:
– Music
– Spoken text
– Noise
6.1 Basics of Audio Data
• Auditory perception happens through pressure fluctuations in the air
• The eardrum vibrates synchronously
• The ear bones amplify and transmit the vibrations
• Auditory hair cells in the cochlea are stimulated by the vibrations
• Neurons produce impulses that are sent to the brain
6.1 Basics
• Our brain only interprets two major properties of sound:
– Pitch
– Volume
6.1 Basics
• Quantitative properties of the sound wave
• Amplitude as volume
– Logarithmic perception (a tenfold increase in amplitude doubles the perceived loudness)
• Frequency as pitch
– Number of periods per unit time
6.1 Basics
• Audio signals are time-dependent (overlapping) waveforms
6.1 Basics
• Constructive and destructive interference
6.1 Basics
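As a small numerical illustration (not part of the slides), summing two sampled sine waves shows both effects; the 440 Hz frequency, 8 kHz rate, and variable names are arbitrary choices:

```python
import math

def sample_wave(freq_hz, phase, n_samples, rate_hz):
    """Sample a unit-amplitude sine wave at the given rate."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz + phase)
            for n in range(n_samples)]

RATE = 8000
a = sample_wave(440, 0.0, 100, RATE)
b_in_phase = sample_wave(440, 0.0, 100, RATE)
b_anti_phase = sample_wave(440, math.pi, 100, RATE)

# Constructive interference: in-phase waves add up to double amplitude
constructive = [x + y for x, y in zip(a, b_in_phase)]
# Destructive interference: anti-phase waves cancel to (near) silence
destructive = [x + y for x, y in zip(a, b_anti_phase)]
```

The sum of the in-phase pair approaches amplitude 2, while the anti-phase pair cancels out almost exactly.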
• Audio examples
6.1 Basics
• Musical instruments are classified by their vibration generator
– E.g., string, wind, or percussion instruments
• The acoustics depend on the vibration generator
– E.g., string, air-column, or membrane instruments
• Synthetic sound creation needs an oscillator
• The oscillator generates voltage oscillations
• A speaker transmits the voltage changes onto a membrane
6.1 Sound Creation
• Influence of the oscillator
– Higher voltage ⟹ Higher frequency (Moog, 1964)
• Amplifier influences the amplitude thus the volume
• ADSR (attack-decay-sustain-release) envelope influences the loudness of a sound in time
6.1 Sound Creation
6.1 Sound Creation
Moog 901B (1964), Modular Moog Synthesizer (1967)
• Emerson, Lake & Palmer:
The Great Gates of Kiev (1974)
6.1 Sound Creation
• Synthesized sounds seem rather metallic. Producing a single synthesized sound involves four typical phases:
– Attack: speed and strength of the signal rise
– Decay: lowering the level
– Sustain: actual pitch
– Release: end of the signal
6.1 ADSR envelope
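The four phases can be sketched as a piecewise-linear gain function; the segment lengths and the sustain level below are made-up example values:

```python
def adsr_envelope(n, attack, decay, sustain_level, sustain_len, release):
    """Piecewise-linear ADSR gain for sample index n (all lengths in samples)."""
    if n < attack:                      # Attack: rise from 0 to peak 1.0
        return n / attack
    n -= attack
    if n < decay:                       # Decay: fall from 1.0 to sustain level
        return 1.0 - (1.0 - sustain_level) * n / decay
    n -= decay
    if n < sustain_len:                 # Sustain: hold the level
        return sustain_level
    n -= sustain_len
    if n < release:                     # Release: fall from sustain level to 0
        return sustain_level * (1.0 - n / release)
    return 0.0                          # Signal has ended

# Example envelope: 10 samples attack, 10 decay, sustain at 0.6 for 30, 20 release
env = [adsr_envelope(n, 10, 10, 0.6, 30, 20) for n in range(80)]
```

Multiplying a raw oscillator waveform by this envelope shapes its loudness over time.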
• Transformation of the continuous sound wave into a discrete representation
• Sampling: at regular intervals, save the current amplitude value of the vibration
• Clearly, we have to reconstruct audio signals from these values
6.1 Digitalization of Audio Data
• Basic characteristics
– Sampling rate: how many times per unit time is the value of the continuous signal tapped?
– Resolution: which accuracy are the values recorded with?
• Often, a resolution of 16 bits is used (2^16 different amplitude values)
• The sampling rate is application dependent:
– Audio CD: 44,100 Hz
– Phone: 8,000 Hz
6.1 Sampling
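A minimal sketch of sampling and quantization with the parameters above; the helper name `record` and the 440 Hz test tone are our choices, not from the slides:

```python
import math

def record(freq_hz, rate_hz, bits, n_samples):
    """Sample a unit sine wave and quantize each amplitude to the given bit resolution."""
    levels = 2 ** (bits - 1) - 1            # e.g. 16 bits -> 32767 positive steps
    samples = []
    for n in range(n_samples):
        x = math.sin(2 * math.pi * freq_hz * n / rate_hz)  # continuous amplitude
        samples.append(round(x * levels))                  # discrete amplitude value
    return samples

cd_quality = record(440, 44100, 16, 1000)   # Audio CD parameters
phone = record(440, 8000, 8, 1000)          # telephone-like parameters
```

The CD version stores far more samples per second and far finer amplitude steps than the phone version.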
• It is very important to uniquely reconstruct the initial oscillation
• The higher the sampling frequency, the more values must be saved
• Minimum sampling frequency?
– Sampling theorem (Nyquist, 1928): the sampling rate must be at least twice as large as the highest frequency occurring in the signal
6.1 Sampling Rate
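The theorem can be illustrated numerically (an example of ours, not from the slides): a 3000 Hz tone sampled at only 4000 Hz produces exactly the same sample values as a phase-inverted 1000 Hz tone, so the original frequency cannot be reconstructed:

```python
import math

RATE = 4000               # sampling rate in Hz; Nyquist limit is RATE/2 = 2000 Hz
F_HIGH = 3000             # tone above the Nyquist limit
F_ALIAS = RATE - F_HIGH   # 1000 Hz alias frequency

high = [math.sin(2 * math.pi * F_HIGH * n / RATE) for n in range(50)]
alias = [-math.sin(2 * math.pi * F_ALIAS * n / RATE) for n in range(50)]

# Sample by sample, the 3000 Hz tone is indistinguishable from the
# phase-inverted 1000 Hz tone: the samples are identical
max_difference = max(abs(h - a) for h, a in zip(high, alias))  # ~0.0
```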
• Phone: 8000 Hz
• DVD audio: 96,000 Hz or 192,000 Hz
• Audio CD: 44,100 measurements per second for two stereo channels with 16 bits per measurement results in 176.4 kB/s
⟹ ca. 10 MB/min, i.e., 635 MB/h
6.1 Sampling Rate
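The CD figures above can be recomputed directly:

```python
RATE = 44_100         # samples per second (Audio CD)
CHANNELS = 2          # stereo
BYTES_PER_SAMPLE = 2  # 16-bit resolution

bytes_per_second = RATE * CHANNELS * BYTES_PER_SAMPLE
kb_per_second = bytes_per_second / 1000        # 176.4 kB/s
mb_per_minute = bytes_per_second * 60 / 1e6    # ~10.6 MB/min
mb_per_hour = bytes_per_second * 3600 / 1e6    # ~635 MB/h
```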
• For space reasons, digital audio data is usually stored in compressed form
• Well-known uncompressed formats:
– AIFF: *.aif (Audio Interchange File Format, Apple)
– Wave: *.wav (Windows)
– IRCAM: *.snd (Institut de Recherche et Coordination Acoustique/Musique)
– AU: *.au (Sun audio)
6.1 Audio Formats
• Data reduction: with (lossy) or without (lossless) information loss
• Lossless compression methods generally don't compress very much; files shrink only to 50–60% of their original size
– Free Lossless Audio Codec (FLAC)
– Apple Lossless
– WavPack
6.1 Compression
• Lossy compression algorithms are typically based on simple transformations
– Modified Discrete Cosine Transform (MDCT) or wavelets
• Encoding: transformation of the sampled waveform into frequency coefficients
• Decoding: reconstruction of the waveform from these values
• What data can we afford losing?
6.1 Compression
• Change the data without changing the subjective perception
– Omit very high/low frequencies
– Save superimposed frequencies with less precision
– Use other effects according to the psychoacoustic model, e.g., low tones right before/after very loud sounds and frequency changes below a minimum distance are impossible to hear
6.1 Compression
• MPEG-1 Audio Layers I, II and III (MP3)
• CD quality at bit rates of 128 kb/s
• Coarse approach to MP3
– Channel coupling of the stereo signal by using the difference
– Cut off inaudible frequencies
– Eliminate redundancy by considering the psychoacoustic effects
– Compress data using Huffman coding
6.1 Compression
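As an isolated illustration of the last step above, here is a minimal generic Huffman coder (a sketch of the technique, not the actual MP3 entropy coder; all names are ours):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix-free Huffman code: frequent symbols get short codewords."""
    freq = Counter(data)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # merge the two rarest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
encoded = "".join(code[s] for s in "aaaabbc")
```

The frequent symbol "a" receives a 1-bit code, so the 7-symbol input compresses to 10 bits instead of a fixed-length 14.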
• AAC (Advanced Audio Coding)
– Industry-driven improvement of the MP3 format (supported by the MPEG)
– Used in TV-/radio broadcasts, Apple iTunes Store, ...
– Better quality at the same file size
– Support for multi-channel audio
– Supports 48 main sound channels with up to 96 kHz sampling rate, 16 low-frequency channels (limited to 120 Hz)
6.1 Compression
• For lossless compression, the important factors are:
– De-/compression speed
– Compression rate
6.1 Compression
• For lossy compression, the important factors are:
– De-/compression speed
– Compression rate
– Most important, the compressed audio quality!
6.1 Compression
• Lossy compression, results
6.1 Compression
• Communication protocol
– For transmission, recording, and playing of musical control information between digital instruments and the PC
– Statements are not sounds, but commands that can be used e.g., by sound cards
– Some commands: Note-on, note-off, key velocity, pitch, type of instrument
• Example:
6.1 The MIDI Format
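For illustration, the raw bytes of a note-on/note-off command pair following the MIDI 1.0 message layout (the channel number and key 69 for standard pitch A are example values of ours):

```python
def note_on(channel, pitch, velocity):
    """Raw 3-byte MIDI note-on message: status byte, key number, key velocity."""
    return bytes([0x90 | channel, pitch, velocity])

def note_off(channel, pitch):
    """Raw 3-byte MIDI note-off message (velocity 0 used here)."""
    return bytes([0x80 | channel, pitch, 0])

# Standard pitch A (440 Hz) is MIDI key number 69
msg = note_on(0, 69, 100) + note_off(0, 69)
```

Six bytes describe a complete note event; the sound itself is produced by the receiving synthesizer, not stored in the data.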
• 10 minutes of music are about 200 KB of MIDI data
– Significant savings compared to sampling, but no original sound
• Data is input to the PC via a keyboard and output via a synthesizer
• A sequencer caches the data and changes
6.1 The MIDI Format
• Audio data
– Music, CDs
– Sound effects, “Earcons”
• Audio data represent a large part of information transfer
– Storage of historical speeches
– Recordings of conversations, phone calls or negotiations
6.2 Audio Information in Databases
• Three main applications of audio signals in the context of databases
– Identification of audio signals (audio as query)
– Classification and search of similar signals (matching of audio)
– Phonetic synchronization
6.2 Special Applications
• Find the title, etc. for this music piece
• Monitoring of audio streams
– Control of broadcasting of advertisements on radio
– Copyright control (GEMA)
– (Remote) diagnosis based on noise
• Audio on Demand
6.2 Identification of Audio Signals
• Find perceptionally similar audio signals
– E.g., similar pieces of music, the same quotation, ...
• Recommendation
– E.g., bands with similar music
• Genre classification (rock, classical, ...)
– E.g., in audio libraries
6.2 Classification and Matching
• Synchronization of audio streams
– Speech ⟺ text, notes ⟺ audio, ...
• Retrieval of text from or to speech
– Find specific points in a speech
– Verbal query to text documents
• Following of audio scores in concerts, etc.
6.2 Synchronization
• Identification
– The simplest of the three problems; successful research in recent years
• Classification and Matching
– Often still manual annotations
– Automatic classification works only roughly and on small collections
– Matching is still largely unresolved
• Synchronization
6.2 State of the Art
• (Compressed) audio files are stored in the database as (smart) BLOBs
• Additionally, metadata and feature vectors are stored to realize the search functionality
– Language: transcription as text
– Music: musical notation or MIDI
6.2 Persistent Storage
• Search in audio data: metadata describe the audio file
– Semantic metadata: difficult to generate, e.g., title, artist, speaker, keywords, ...
– File information: can be automatically generated, e.g., time/place of recording, filename, file size, ...
– Widely used, e.g., in music exchange markets, online shops, ...
6.3 Audio Retrieval
• Manual indexing is extremely labor intensive and expensive
• Information is often incomplete, partial, and subjective (e.g., genre classification)
• No possibility of Query by Example ("Sounds like ...")
• Search only with SQL, approximate string search, etc.
6.3 Metadata-based Search
• Using content of audio files
• Direct comparison, measurement vs. measurement
– Not very promising, inefficient
– Differences in sampling rate and resolution
• Sounds can be differentiated by certain characteristics
– Low Level Features
– Logical Features
6.3 Content-based Search
• Acoustic features
– Same basic idea as in image databases
– Description of signal information by means of characteristic features
– In contrast to image information, we don't use a single feature vector, but a time-dependent vector function
– Acoustic characteristics are tied to points in time, rather than to the audio file as a whole
6.3 Low Level Features
• Typical Low Level Features
– Mean amplitude, loudness
– Frequency distribution
– Pitch
– Brightness
– Bandwidth
• Measured in the ...
– ... time domain (amplitude versus time)
– ... frequency domain (amplitude versus frequency)
6.3 Low Level Features
• Amplitude
– Pressure fluctuations around the zero point
– Silence is equivalent to 0 amplitude
• Average energy
– Characterizes the volume of the signal:
  E = (1/N) · Σ_{n=1..N} x_n²
– with N as the total number of measurements and x_n as the n-th measurement
6.3 Features in the Time Domain
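The average energy can be computed directly; the toy amplitude values below are ours:

```python
def average_energy(samples):
    """E = (1/N) * sum(x_n^2): mean squared amplitude of the signal."""
    return sum(x * x for x in samples) / len(samples)

loud = [0.8, -0.8, 0.8, -0.8]     # large swings -> high energy
quiet = [0.1, -0.1, 0.1, -0.1]    # small swings -> low energy
```

The louder signal yields a larger energy value, matching the feature's role as a volume measure.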
• Zero-Crossing Rate
– Frequency of sign changes in the signal:
  ZCR = (1/(N−1)) · Σ_{n=2..N} |sgn(x_n) − sgn(x_{n−1})| / 2
– with sgn as the sign function (signum)
6.3 Features in the Time Domain
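A direct implementation of this rate (the toy signals are our examples):

```python
def sgn(x):
    """Signum function: -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose sign changes."""
    n = len(samples)
    changes = sum(abs(sgn(samples[i]) - sgn(samples[i - 1])) // 2
                  for i in range(1, n))
    return changes / (n - 1)

fast = [1, -1, 1, -1, 1]    # alternates every sample -> high ZCR
slow = [1, 1, 1, -1, -1]    # one sign change -> low ZCR
```

A high ZCR hints at high-frequency content or noise; a low ZCR at low-frequency or voiced signals.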
• Silence Ratio
– Proportion of values that belong to a period of complete silence
• We must first establish:
– The amplitude value below which a sample is considered to be silent
– The minimum number of consecutive silent readings needed to form a silence period
6.3 Features in the Time Domain
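A sketch with both thresholds as parameters (function name and example values are ours):

```python
def silence_ratio(samples, threshold, min_run):
    """Fraction of samples inside a silence period: a run of at least
    `min_run` consecutive values with |amplitude| below `threshold`."""
    silent_total = 0
    run = 0
    for x in samples + [None]:          # sentinel flushes the final run
        if x is not None and abs(x) < threshold:
            run += 1
        else:
            if run >= min_run:          # only long enough runs count as silence
                silent_total += run
            run = 0
    return silent_total / len(samples)

signal = [0.5, 0.0, 0.01, 0.0, 0.6, 0.02, 0.7]
```

With threshold 0.05 and a minimum run of 3, only the middle run of three quiet samples counts, so the ratio is 3/7; the isolated quiet sample at index 5 is too short to form a silence period.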
• Fourier transform of the signal
– Decomposition into frequency components with coefficients (Fourier coefficients)
– Representation of the frequency spectrum of the signal
• The magnitude of a coefficient represents the amount of energy at that frequency
• Usually measured in decibels (dB)
6.3 Features in the Frequency Domain
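A naive discrete Fourier transform (O(N²), for illustration only; real systems use the FFT) shows how a pure sine concentrates its energy in a single frequency bin:

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT: magnitude of each frequency component up to the Nyquist bin."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# 16 samples of exactly one sine cycle: all energy lands in frequency bin 1
wave = [math.sin(2 * math.pi * t / 16) for t in range(16)]
spectrum = dft_magnitudes(wave)
```

For a real sine of amplitude 1 over N samples, the matching bin has magnitude N/2 (here 8) and every other bin is essentially zero.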
• "Ahhh" sound and Fourier spectrum
6.3 Frequency Spectrum
• Bandwidth
– Interval of occurring frequencies
– Difference between the largest and smallest frequency in the spectrum (the minimum frequency is considered to be the first frequency above the silence threshold)
– Can also be used for classification, e.g., the bandwidth of music is higher than that of voice
6.3 Features in the Frequency Domain
• Power Distribution
– Can be read directly from the spectrum
– Distinction of frequencies with high/low energy
– Calculation of frequency bands with high/low energy
– Centroid as the center of the spectral energy distribution (brightness)
6.3 Features in the Frequency Domain
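The centroid can be sketched as an energy-weighted mean frequency (toy spectra of our choosing):

```python
def spectral_centroid(spectrum, bin_width_hz):
    """Energy-weighted mean frequency of a magnitude spectrum (brightness)."""
    return (sum(k * bin_width_hz * mag for k, mag in enumerate(spectrum))
            / sum(spectrum))

dark = [9, 3, 1, 0]     # energy concentrated at low frequencies
bright = [0, 1, 3, 9]   # energy concentrated at high frequencies
```

The bright spectrum has a higher centroid, which is why the centroid is used as a brightness measure.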
• Harmony
– The lowest of all the "loud" frequencies is called the fundamental frequency
– Harmony of a signal increases when the dominant components in the spectrum are multiples of the fundamental frequency
– E.g., standard pitch A (440 Hz) as the fundamental frequency produced on a violin creates harmonic oscillations at 880 Hz, 1320 Hz, 1760 Hz, etc.
6.3 Features in the Frequency Domain
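The multiple-of-the-fundamental property can be checked with a small helper (function name and tolerance are our choices):

```python
def is_harmonic(fundamental_hz, partials_hz, tolerance=0.01):
    """True if every partial is (nearly) an integer multiple of the fundamental."""
    return all(abs(f / fundamental_hz - round(f / fundamental_hz)) < tolerance
               for f in partials_hz)

# Standard pitch A on a violin: overtones at integer multiples of 440 Hz
violin_a = is_harmonic(440.0, [880.0, 1320.0, 1760.0])
```

A partial such as 1000 Hz over a 440 Hz fundamental would fail this check, indicating a less harmonic (more noise-like) sound.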
6.3 Harmony
Harmonic oscillations and frequency spectrum of a sound played on an instrument