Multimedia Databases
Wolf-Tilo Balke Younès Ghammad
Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
• Shape-based Features - Chain Codes - Area-based Retrieval - Moment Invariants - Query by Example
Multimedia Databases– Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
Previous Lecture
6 Audio Retrieval 6.1 Basics of Audio Data
6.2 Audio Information in Databases 6.3 Audio Retrieval
6 Audio Retrieval
• Information transfer through sound – Audio (Latin, "I hear")
• Three different types of data:
– Music – Spoken text – Noise
6.1 Basics of Audio Data
• Auditory perception through pressure fluctuations in the air
• Eardrum vibrates synchronously
• Ear bones amplify and transmit the vibrations
• Auditory hair cells in the cochlea are stimulated by the vibrations
• Neurons produce electrical impulses
6.1 Basics
• 3D model of the human ear
6.1 Basics
• Our brain only interprets two major properties of sound:
– Pitch – Volume
6.1 Basics
• Quantitative properties of the sound wave
• Amplitude as volume
– Logarithmic perception (a tenfold increase in sound intensity, i.e., +10 dB, roughly doubles the perceived loudness)
• Frequency as pitch
– The number of periods per unit time is known as frequency (measured in hertz)
– Hearing range between 20 Hz and 20 kHz
6.1 Basics
• Audio signals are time-dependent (overlapping) waveforms
6.1 Basics
6.1 Basics
• Constructive and destructive interference
6.1 Basics
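Interference can be illustrated by adding two sine waves sample by sample. The following is a minimal sketch; the helper name `superpose`, the 440 Hz frequency and the 8000 Hz sampling rate are illustrative assumptions, not taken from the slides.

```python
import math

def superpose(freq_hz, phase_shift, sample_rate=8000, n=8):
    """Sum two sine waves of equal frequency and amplitude; return n samples."""
    out = []
    for i in range(n):
        t = i / sample_rate
        a = math.sin(2 * math.pi * freq_hz * t)
        b = math.sin(2 * math.pi * freq_hz * t + phase_shift)
        out.append(a + b)
    return out

constructive = superpose(440, 0.0)      # in phase: amplitudes add up
destructive = superpose(440, math.pi)   # opposite phase: waves cancel

print(max(abs(v) for v in constructive))
print(max(abs(v) for v in destructive))
```

With a phase shift of zero the peaks reach almost twice the single-wave amplitude, while a shift of half a period leaves only floating-point noise.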
• Musical instruments are classified based on the vibration generator
– E.g., string, wind, percussion instruments
• Acoustics depends on the vibration generator
– E.g., string, air, membrane
• Synthetic creation needs an oscillator
• The oscillator generates voltage oscillations
• A speaker transmits the voltage changes to a membrane
6.1 Sound Creation
• Influence of the oscillator
– Higher voltage ⟹ Higher frequency (Moog, 1964)
• The amplifier influences the amplitude and thus the volume
• The ADSR (attack-decay-sustain-release) envelope influences the loudness of a sound over time
6.1 Sound Creation
6.1 Sound Creation
Moog 901B (1964) Modular Moog Synthesizer (1967)
• Emerson, Lake & Palmer:
The Great Gates of Kiev (1974)
6.1 Sound Creation
• Synthesized sounds seem rather metallic. For producing a single synthesized sound, consider four typical phases:
– Attack: speed and strength of the signal rise
– Decay: lowering of the level after the initial peak
– Sustain: level held while the actual tone sounds
– Release: end of the signal
6.1 ADSR envelope
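The four phases can be sketched as a piecewise-linear gain curve that is multiplied onto the oscillator output. This is only one possible shape; the function name `adsr_envelope`, the linear segments and all parameter values are assumptions made for illustration.

```python
def adsr_envelope(n_samples, attack, decay, sustain_level, release):
    """Return gain values in [0, 1] for a piecewise-linear ADSR envelope.

    attack, decay and release are lengths in samples; the sustain phase
    fills whatever remains of n_samples.
    """
    sustain = n_samples - attack - decay - release
    if sustain < 0:
        raise ValueError("attack + decay + release exceed n_samples")
    env = []
    # Attack: rise from 0 towards the peak level 1
    env += [i / attack for i in range(attack)]
    # Decay: fall from 1 towards the sustain level
    env += [1.0 - (1.0 - sustain_level) * i / decay for i in range(decay)]
    # Sustain: hold a constant level
    env += [sustain_level] * sustain
    # Release: fade from the sustain level towards 0
    env += [sustain_level * (1.0 - i / release) for i in range(release)]
    return env

envelope = adsr_envelope(n_samples=100, attack=10, decay=20,
                         sustain_level=0.6, release=30)
```

Multiplying each oscillator sample by the matching envelope value gives the tone its loudness contour over time.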
• Transformation of the continuous sound wave into a discrete representation
• Sampling: store the current amplitude value of the vibration at regular intervals
• Clearly, we must be able to reconstruct the audio signal from these values
6.1 Digitalization of Audio Data
• Basic characteristics
– Sampling rate: how many times per unit time is the value of the continuous signal sampled?
– Resolution: with what accuracy are the values recorded?
• Often, a resolution of 16 bits is used (2^16 = 65,536 different amplitude values)
• The sampling rate is application dependent:
– Audio CD: 44100 Hz – Phone: 8000 Hz
6.1 Sampling
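Sampling rate and resolution can be made concrete with a small sketch that samples a function of time and rounds each value to a signed integer. The helper `sample_and_quantize` and the 440 Hz test tone are illustrative assumptions; real converters work on analog input rather than on a Python function.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate, bits):
    """Sample a continuous signal (function of time, values in [-1, 1])
    and round each sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16 bits
    n = round(duration_s * sample_rate)        # number of samples to take
    return [round(signal(i / sample_rate) * max_level) for i in range(n)]

def tone(t):
    """440 Hz sine wave used as a stand-in for the continuous sound wave."""
    return math.sin(2 * math.pi * 440 * t)

# 10 ms at the telephone rate of 8000 Hz with 16-bit resolution
pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate=8000, bits=16)
print(len(pcm), pcm[:4])
```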
• It is essential that the original oscillation can be reconstructed uniquely
• The higher the sampling frequency, the more values must be saved
• Minimum sampling frequency?
– Sampling theorem (Nyquist, 1928): the sampling rate must be at least twice as large as the highest frequency occurring in the signal
6.1 Sampling Rate
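What happens when the sampling theorem is violated can be shown with a short experiment. In this sketch (the helper `sample_sine` and the chosen frequencies are assumptions for illustration) a 5000 Hz tone is sampled at 8000 Hz, where only frequencies up to 4000 Hz can be represented.

```python
import math

def sample_sine(freq_hz, sample_rate, n):
    """Return n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

rate = 8000                            # highest representable frequency: 4000 Hz
above = sample_sine(5000, rate, 16)    # violates the sampling theorem
alias = sample_sine(3000, rate, 16)    # 8000 - 5000 = 3000 Hz

# The 5000 Hz samples equal those of a phase-inverted 3000 Hz tone,
# so the original oscillation cannot be reconstructed uniquely.
indistinguishable = all(abs(a + b) < 1e-9 for a, b in zip(above, alias))
print(indistinguishable)
```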
6.1 Sampling Rate
• Phone: 8000 Hz
• DVD audio: 96,000 Hz or 192,000 Hz
• Audio CD: 44,100 samples per second for two channels (stereo) with 16 bits per sample results in 176.4 kB/s ⟹ approx. 10 MB/min, i.e., 635 MB/h
6.1 Sampling Rate
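The CD figures follow directly from sampling rate, channel count and resolution. A small sketch of the arithmetic, with the function name `pcm_data_rate` chosen here for illustration and 1 MB taken as 10^6 bytes:

```python
def pcm_data_rate(sample_rate_hz, channels, bits_per_sample):
    """Bytes per second of uncompressed PCM audio."""
    return sample_rate_hz * channels * bits_per_sample // 8

cd_rate = pcm_data_rate(44_100, channels=2, bits_per_sample=16)
print(cd_rate)                 # bytes per second
print(cd_rate * 60 / 1e6)      # MB per minute
print(cd_rate * 3600 / 1e6)    # MB per hour
```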
• For space reasons, digital audio data is usually stored in compressed form
• Known uncompressed formats:
– AIFF: *.aif (Audio Interchange File Format, Apple)
– Wave: *.wav (Windows)
– IRCAM: *.snd (Institut de Recherche et Coordination Acoustique / Musique)
– AU: *.au (Sun audio)
6.1 Audio Formats
• Data reduction: with (lossy) or without information loss (lossless)
• Lossless compression methods generally don’t compress very much
– Free Lossless Audio Codec (FLAC)
• 50–60% of the original size
– Apple Lossless
– WavPack
– ...
6.1 Compression
• Lossy compression algorithms are typically based on simple transformations
– Modified Discrete Cosine Transform (MDCT) or wavelets
• Encoding: transformation of the waveform into frequency sequences (sampling)
• Decoding: reconstruction of the waveform from these values
• What data can we afford to lose?
6.1 Compression
• Change of the data without changing the subjective perception
– Omit very high/low frequencies
– Save superimposed frequencies with less precision
– Use of other effects according to the psychoacoustic model, e.g., quiet tones directly before/after very loud sounds and frequency changes below a minimum distance are impossible to hear ...
6.1 Compression
• MPEG-1 Audio Layers I, II and III (MP3)
• CD quality at bit rates of 128 kbit/s
• Rough outline of the MP3 approach
– Channel coupling of the stereo signal by using the difference
– Cut off inaudible frequencies
– Eliminate redundancy by considering the psychoacoustic effects
– Compress data using Huffman coding
6.1 Compression
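The idea behind coupling the stereo channels via their difference can be sketched with a plain sum/difference transform. This is a simplified illustration under stated assumptions, not the actual MP3 joint-stereo procedure; the function names and sample values are hypothetical.

```python
def to_mid_side(left, right):
    """Encode a stereo pair as mid (average) and side (half the difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Invert to_mid_side and recover the left and right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

left = [0.50, 0.25, -0.75, 0.125]
right = [0.50, 0.25, -0.50, 0.000]
mid, side = to_mid_side(left, right)
print(side)
```

When both channels are similar, the side signal stays close to zero and can be stored with far fewer bits than a second full channel.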
• AAC (Advanced Audio Coding)
– Industry-driven improvement of the MP3 format (supported by MPEG)
– Used in TV/radio broadcasts, Apple iTunes Store, ...
– Better quality at the same file size
– Support for multi-channel audio
– Supports 48 main sound channels with up to 96 kHz sampling rate, 16 low-frequency channels (limited to 120 Hz) and 15 data streams
• Ogg Vorbis, Real Audio, WMA 9, ...
6.1 Compression
• Lossless compression, important factors are:
– De-/compression speed
– Compression rate
6.1 Compression
• Lossy compression, important factors are:
– De-/compression speed
– Compression rate
– Most important: the quality of the compressed audio!
6.1 Compression
• Lossy compression, results
6.1 Compression
• Communication protocol
– For transmitting, recording and playing musical control information between digital instruments and the PC
– The messages are not sounds, but commands that can be used, e.g., by sound cards
– Some commands: note-on, note-off, key velocity, pitch, type of instrument
• Example:
6.1 The MIDI Format
MIDI: Original:
• 10 minutes of music amount to about 200 KB of MIDI data
– Significant savings compared to sampling, but not the original sound
• Data is input to the PC via a keyboard and output via a synthesizer
• A sequencer buffers the data and applies changes
6.1 The MIDI Format
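To show how compact such commands are, the sketch below builds the three-byte Note On and Note Off channel messages and converts a note number to its equal-tempered frequency. The helper names are chosen here for illustration; the status bytes 0x90 and 0x80 and the convention that note 69 is A4 at 440 Hz follow the usual MIDI conventions.

```python
def note_on(channel, note, velocity):
    """Three-byte Note On message: status 0x90 plus channel (0-15)."""
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note, velocity=0):
    """Three-byte Note Off message: status 0x80 plus channel (0-15)."""
    return bytes([0x80 | channel, note, velocity])

def note_to_frequency(note):
    """Equal-tempered frequency in Hz, with note 69 (A4) tuned to 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

message = note_on(channel=0, note=69, velocity=100)
print(message.hex())
print(note_to_frequency(69))
```

A whole note thus costs a few bytes for its start and end, which explains the large savings compared to sampled audio.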
• Audio data
– Music, CDs
– Sound effects, “Earcons”
• Audio data represents a large share of information transfer
– Storage of historical speeches
– Recordings of conversations, phone calls or negotiations
6.2 Audio Information in Databases
• Three main applications of audio signals in the context of databases
– Identification of audio signals (audio as query)
– Classification and search of similar signals (matching of audio)
– Phonetic synchronization
6.2 Special Applications
• Find the title, etc. of a given piece of music
• Monitoring of audio streams
– Control of broadcasting of advertisements on radio
– Copyright control (GEMA)
– (Remote) diagnosis based on noise
• Audio on Demand
6.2 Identification of Audio Signals
• Find perceptually similar audio signals
– E.g., similar pieces of music, the same quotation, ...
• Recommendation
– E.g., bands with similar music
• Genre classification (rock, classical, ...)
– E.g., in audio libraries
6.2 Classification and Matching
• Synchronization of audio streams
– Speech ⟺ text, notes ⟺ audio, ...
• Retrieval of text from or to speech
– Find specific points in a speech
– Verbal query to text documents
• Score following in concerts, etc.
6.2 Synchronization
• Identification
– The simplest of the three problems; research has been successful in recent years
• Classification and Matching
– Often still relies on manual annotations
– Automatic classification only works roughly, on small collections
– Matching is still largely unresolved
• Synchronization
– By now, tolerable error rates (speech ⟺ text)
6.2 State of the Art
• (Compressed) audio files are stored in the database as (smart) BLOBs
• Additionally, metadata and feature vectors are stored to implement the search functionality
– Speech: transcription as text
– Music: musical notation or MIDI
6.2 Persistent Storage
• Search in audio data: metadata describe the audio file
– Semantic metadata (difficult to generate): title, artist, speaker, keywords, ...
– File information (can be generated automatically): time/place of recording, filename, file size, ...
– Widely used, e.g., music exchange markets, online shops, ...
6.3 Audio Retrieval
• Manual indexing is extremely labor intensive and expensive
• Information is often incomplete, partial and subjective (e.g., genre classification)
• No possibility to Query by Example ("Sounds like ...")
• Search only with SQL, approximate string search, etc.
6.3 Metadata-based Search
• Using content of audio files
• Compare sample by sample
– Not very promising, inefficient
– Differences in sampling rate and resolution
• Sounds can be differentiated by certain characteristics
– Low Level Features
– Logical Features
6.3 Content-based Search
• Acoustic features
– Same basic idea as in image databases
– Description of signal information by means of characteristic features
– In contrast to image information we don’t use a single feature vector, but a time-dependent vector function
– What matters is the point in time at which an acoustic characteristic occurs, rather than merely whether it is contained in the audio file
6.3 Low Level Features
• Typical Low Level Features
– Mean amplitude, loudness
– Frequency distribution
– Pitch
– Brightness
– Bandwidth
• Measured in the ...
– ... Time domain (amplitude versus time)
– ... Frequency domain (intensity versus frequency)
6.3 Low Level Features
• Amplitude
– Pressure fluctuations around the zero point
– Silence is equivalent to 0 amplitude
• Average energy
– Characterizes the volume of the signal:
$$E = \frac{1}{N} \sum_{n=1}^{N} x_n^2$$
with $N$ as the total number of measurements and $x_n$ as the $n$-th measurement
6.3 Features in the Time Domain
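A minimal sketch of the average energy, assuming the common definition as the mean of the squared sample values; the function name and the test values are illustrative.

```python
def average_energy(samples):
    """Mean of the squared sample values of a signal."""
    return sum(x * x for x in samples) / len(samples)

loud = [1.0, -1.0, 1.0, -1.0]
quiet = [0.1, -0.1, 0.1, -0.1]
print(average_energy(loud), average_energy(quiet))
```

Squaring removes the sign, so positive and negative pressure fluctuations contribute equally to the measured volume.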
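• Zero-Crossing Rate
– Frequency of the sign change in the signal:
$$ZCR = \frac{1}{2N} \sum_{n=2}^{N} \left| \operatorname{sgn}(x_n) - \operatorname{sgn}(x_{n-1}) \right|$$
with sgn as the sign function (signum)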
6.3 Features in the Time Domain
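The zero-crossing rate can be sketched as follows, assuming the widespread definition that sums |sgn(x_n) - sgn(x_{n-1})| over consecutive samples and divides by 2N. The function names are chosen for illustration.

```python
def sgn(x):
    """Sign function: -1, 0 or 1."""
    return (x > 0) - (x < 0)

def zero_crossing_rate(samples):
    """Sum of |sgn(x_n) - sgn(x_{n-1})| over consecutive pairs, divided by 2N."""
    n = len(samples)
    changes = sum(abs(sgn(samples[i]) - sgn(samples[i - 1]))
                  for i in range(1, n))
    return changes / (2 * n)

print(zero_crossing_rate([1, -1, 1, -1]))   # sign flips at every step
print(zero_crossing_rate([1, 2, 3, 4]))     # no sign change at all
```

Each full sign flip contributes 2 to the sum, which is why the result is divided by 2N rather than N.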
• Silence Ratio
– Proportion of values that belong to a period of complete silence
• We must first establish:
– The amplitude value below which a sample is considered to be silent
– The minimum number of consecutive silent samples needed to form a silence period
6.3 Features in the Time Domain
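With both parameters fixed, the silence ratio can be computed in a single pass. This is a sketch under stated assumptions: the names `threshold` and `min_run` and the example values are hypothetical, and the amplitude is compared by absolute value.

```python
def silence_ratio(samples, threshold, min_run):
    """Fraction of samples lying in runs of at least min_run consecutive
    samples whose absolute amplitude is below threshold."""
    silent = 0
    run = 0
    for x in samples:
        if abs(x) < threshold:
            run += 1
        else:
            if run >= min_run:      # the run that just ended was long enough
                silent += run
            run = 0
    if run >= min_run:              # a silent run may extend to the end
        silent += run
    return silent / len(samples)

signal = [0.5, 0.0, 0.01, 0.0, 0.6, 0.0, 0.7, 0.0, 0.0, 0.0]
ratio = silence_ratio(signal, threshold=0.05, min_run=3)
print(ratio)
```

The single quiet sample between 0.6 and 0.7 is not counted because it is shorter than the required minimum run.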
• Fourier transform of the signal
– Decomposition into frequency components with coefficients (Fourier coefficients)
– Representation of the frequency spectrum of the signal
• The magnitude of a coefficient represents the amount of energy at that frequency
• Usually measured in decibels (dB)
6.3 Features in the Frequency Domain
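The decomposition can be sketched with a naive discrete Fourier transform in pure Python. It is quadratic in the number of samples and meant only as an illustration (a real system would use an FFT); the function names, the 1000 Hz test tone and the normalization by N are assumptions.

```python
import cmath
import math

def magnitude_spectrum(samples, sample_rate):
    """Naive O(N^2) discrete Fourier transform.

    Returns (frequency_hz, magnitude) pairs for the bins from 0 Hz up to
    half the sampling rate.
    """
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(coeff) / n))
    return spectrum

def to_decibel(magnitude, reference=1.0):
    """Convert a positive magnitude to decibels relative to reference."""
    return 20 * math.log10(magnitude / reference)

rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(64)]
spectrum = magnitude_spectrum(tone, rate)
peak_freq, peak_mag = max(spectrum, key=lambda fm: fm[1])
print(peak_freq, to_decibel(peak_mag))
```

For the pure test tone, practically all energy ends up in the single bin at 1000 Hz.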
• "Ahhh" sound and Fourier spectrum
6.3 Frequency Spectrum
• Bandwidth
– Interval of occurring frequencies
– Difference between the largest and smallest frequency in the spectrum (the minimum frequency is considered to be the first frequency above the silence threshold)
– Can also be used for classification, e.g., bandwidth in music is higher than for voice
6.3 Features in the Frequency Domain
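A minimal sketch of the bandwidth feature, assuming the spectrum is given as a list of (frequency, magnitude) pairs and that the silence threshold applies to both ends of the interval. The spectrum values below are made up for illustration.

```python
def bandwidth(spectrum, silence_threshold):
    """Difference between the highest and lowest frequency whose magnitude
    exceeds the silence threshold; 0.0 if no bin qualifies."""
    audible = [f for f, mag in spectrum if mag > silence_threshold]
    if not audible:
        return 0.0
    return max(audible) - min(audible)

spectrum = [(0.0, 0.00), (100.0, 0.02), (200.0, 0.40), (300.0, 0.10),
            (400.0, 0.30), (500.0, 0.01)]
print(bandwidth(spectrum, silence_threshold=0.05))
```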
• Power Distribution
– Can be read directly from the spectrum
– Distinction of frequencies with high/low energy
– Calculation of frequency bands with high/low energy
– Centroid as the center of the spectral energy distribution (brightness)
6.3 Features in the Frequency Domain
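The centroid can be sketched as an energy-weighted mean frequency. The sketch assumes a spectrum of (frequency, magnitude) pairs and takes energy as the squared magnitude; some definitions weight by magnitude instead, so this is one possible choice rather than the definitive one.

```python
def spectral_centroid(spectrum):
    """Energy-weighted mean frequency; energy is magnitude squared."""
    total = sum(mag ** 2 for _, mag in spectrum)
    if total == 0:
        return 0.0
    return sum(f * mag ** 2 for f, mag in spectrum) / total

dull = [(100.0, 1.0), (300.0, 1.0)]
bright = [(100.0, 1.0), (3000.0, 1.0)]
print(spectral_centroid(dull), spectral_centroid(bright))
```

A higher centroid means more energy in the upper part of the spectrum, which is perceived as a brighter sound.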
• Harmony
– The lowest of all the “loud” frequencies is called the fundamental frequency
– The harmony of a signal increases when the dominant components in the spectrum are integer multiples of the fundamental frequency
– E.g., the standard pitch A (fundamental frequency 440 Hz) played on a violin creates harmonic oscillations at 880 Hz, 1320 Hz, 1760 Hz, etc.
6.3 Features in the Frequency Domain 6.3 Harmony
[Figures: harmonic oscillations; frequency spectrum of a sound played on an instrument]
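One way to turn the notion of harmony into a number is to measure how much of the peak energy sits at near-integer multiples of the fundamental. The measure below is a hypothetical illustration, not a definition from the slides; the tolerance and the peak lists are assumptions.

```python
def harmonicity(peaks, tolerance=0.03):
    """Share of peak energy at (near-)integer multiples of the fundamental.

    peaks: (frequency_hz, magnitude) pairs of the loud spectral components.
    The fundamental is taken as the lowest peak frequency.
    """
    fundamental = min(f for f, _ in peaks)
    total = sum(mag ** 2 for _, mag in peaks)
    harmonic = 0.0
    for f, mag in peaks:
        ratio = f / fundamental
        if abs(ratio - round(ratio)) <= tolerance:
            harmonic += mag ** 2
    return harmonic / total

violin_a = [(440.0, 1.0), (880.0, 0.5), (1320.0, 0.25), (1760.0, 0.125)]
noisy = [(440.0, 1.0), (610.0, 1.0), (1000.0, 1.0)]
print(harmonicity(violin_a), harmonicity(noisy))
```

The violin-like peak list scores 1.0 because every component is a multiple of 440 Hz, while the second list keeps only the energy of its lowest peak.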
• Pitch
– Can be approximated by means of the Fourier spectrum
– The value is calculated from the frequencies and amplitudes of the peaks
– Related to the fundamental frequency, which is often used as an approximation
6.3 Features in the Frequency Domain
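Using the fundamental frequency as an approximation of the pitch can be sketched as picking the lowest sufficiently loud frequency in the spectrum. The relative threshold of 10% of the strongest component and the example spectrum are assumptions for illustration; practical pitch detectors are considerably more robust.

```python
def estimate_pitch(spectrum, relative_threshold=0.1):
    """Approximate the pitch by the fundamental: the lowest non-zero
    frequency whose magnitude reaches relative_threshold times the
    strongest magnitude. Returns None for an all-zero spectrum."""
    strongest = max(mag for _, mag in spectrum)
    if strongest == 0:
        return None
    loud = [f for f, mag in spectrum
            if f > 0 and mag >= relative_threshold * strongest]
    return min(loud) if loud else None

spectrum = [(0.0, 0.9), (220.0, 0.02), (440.0, 0.6),
            (880.0, 1.0), (1320.0, 0.3)]
print(estimate_pitch(spectrum))
```

Note that the strongest peak (880 Hz here) is not necessarily the fundamental; the estimate returns the lowest loud component instead.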
• Audio Retrieval - Basics of Audio Data
- Audio Information in Databases - Audio Retrieval
This Lecture
• Classification and Retrieval of Audio – Low level Audio Features
– Difference Limen – Pitch Detection