Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de


• Shape-based Features
– Chain Codes
– Area-based Retrieval
– Moment Invariants
– Query by Example

Multimedia Databases – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Previous Lecture

6 Audio Retrieval
6.1 Basics of Audio Data
6.2 Audio Information in Databases
6.3 Audio Retrieval

6 Audio Retrieval

• Information transfer through sound – Audio (Latin, "I hear")

• Three different types of data:

– Music – Spoken text – Noise


6.1 Basics of Audio Data

• Auditory perception through pressure fluctuations in the air

• Eardrum vibrates synchronously

• Ear bones amplify and transmit the vibrations

• Auditory hair cells in the cochlea are stimulated by the vibrations

• Neurons produce electrical impulses

6.1 Basics

• 3D model of the human ear

6.1 Basics


• Our brain only interprets two major properties of sound:

– Pitch – Volume

6.1 Basics

• Quantitative description of the sound wave

• Amplitude as volume
– Logarithmic perception (a tenfold increase in sound intensity, i.e., +10 dB, roughly doubles the perceived loudness)

• Frequency as pitch
– The number of periods per unit time is known as frequency (measured in hertz)
– Hearing range between 20 Hz and 20 kHz

6.1 Basics

• Audio signals are time-dependent (overlapping) waveforms

6.1 Basics

• Constructive and destructive interference

6.1 Basics

• Musical instruments are classified based on the vibration generator
– E.g., string, wind, and percussion instruments

• Acoustics depends on the vibration generator
– E.g., strings, air columns, membranes

• Synthetic creation needs an oscillator

• The oscillator generates voltage oscillations

• The speaker transmits the voltage changes to a membrane

6.1 Sound Creation


• Influence of the oscillator

– Higher voltage ⟹ Higher frequency (Moog, 1964)

• The amplifier influences the amplitude and thus the volume

• The ADSR (attack-decay-sustain-release) envelope influences the loudness of a sound over time

6.1 Sound Creation

Moog 901B (1964) Modular Moog Synthesizer (1967)

• Emerson, Lake & Palmer: The Great Gates of Kiev (1974)

6.1 Sound Creation

• Synthesized sounds seem rather metallic. For producing a single synthesized sound, consider four typical phases:
– Attack: speed and strength of the signal rise
– Decay: lowering of the level
– Sustain: level held during the actual tone
– Release: end of the signal

6.1 ADSR envelope
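The four phases can be sketched as a gain curve that is multiplied onto the oscillator output. This is a minimal sketch assuming piecewise-linear segments and phase lengths given in samples; the function name `adsr_envelope` and all parameter values are illustrative choices, not taken from the lecture.

```python
def adsr_envelope(n_samples, attack, decay, sustain_level, release):
    """Piecewise-linear ADSR gain curve with values in [0, 1].

    attack, decay and release are lengths in samples (an assumption of this
    sketch); the sustain phase fills whatever remains of n_samples.
    """
    sustain = n_samples - attack - decay - release
    if sustain < 0:
        raise ValueError("attack + decay + release exceed n_samples")
    env = []
    # Attack: rise from 0 towards the peak level 1.0
    env += [i / attack for i in range(attack)]
    # Decay: fall from the peak to the sustain level
    env += [1.0 - (1.0 - sustain_level) * i / decay for i in range(decay)]
    # Sustain: hold the level while the tone sounds
    env += [sustain_level] * sustain
    # Release: fade from the sustain level towards silence
    env += [sustain_level * (1.0 - i / release) for i in range(release)]
    return env


envelope = adsr_envelope(n_samples=1000, attack=100, decay=200,
                         sustain_level=0.6, release=300)
```

Multiplying each sample of a raw oscillator signal by the matching entry of `envelope` shapes its loudness over time.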

• Transformation of the continuous sound wave into a discrete representation

• Sampling: record the current amplitude value of the vibration at regular intervals

• Clearly, we must be able to reconstruct the audio signal from these values

6.1 Digitalization of Audio Data


• Basic characteristics

– Sampling rate: how many times per unit time is the value of the continuous signal tapped?

– Resolution: with what accuracy are the values recorded?

• Often, a resolution of 16 bits is used (2¹⁶ = 65,536 different amplitude values)

• The sampling rate is application dependent:

– Audio CD: 44100 Hz – Phone: 8000 Hz

6.1 Sampling
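Sampling rate and resolution can be illustrated by digitizing a pure tone. The sketch below assumes a unit-amplitude sine as the "continuous" signal and signed-integer quantization; `sample_sine` is a hypothetical helper, and the 440 Hz tone and 10 ms duration are arbitrary.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, duration_s, bits=16):
    """Sample a unit-amplitude sine and quantize it to signed integers."""
    max_int = 2 ** (bits - 1) - 1              # 32767 for 16 bits
    n_samples = round(sample_rate_hz * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                 # time of the n-th sample in seconds
        value = math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(value * max_int))  # quantization to the bit depth
    return samples


phone = sample_sine(440, 8000, 0.01)    # phone sampling rate
cd = sample_sine(440, 44100, 0.01)      # audio CD sampling rate
```

The same 10 ms of sound yields 80 values at the phone rate and 441 values at the CD rate.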


• It must be possible to reconstruct the original oscillation uniquely

• The higher the sampling frequency, the more values must be saved

• Minimum sampling frequency?

– Sampling theorem (Nyquist, 1928): the sampling rate must be at least twice as large as the highest frequency occurring in the signal

6.1 Sampling Rate
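A small experiment shows what happens when the sampling theorem is violated. It is a sketch under assumed values: at a sampling rate of 8000 Hz, a 5000 Hz cosine lies above half the sampling rate and produces exactly the same samples as a 3000 Hz cosine, so the two can no longer be told apart. The helper names are hypothetical.

```python
import math


def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Samples of a unit-amplitude cosine taken at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, sample_rate/2] that the sampled tone coincides with."""
    f = freq_hz % sample_rate_hz
    return min(f, sample_rate_hz - f)


rate = 8000
too_high = sample_cosine(5000, rate, 16)   # above rate / 2
alias = sample_cosine(3000, rate, 16)      # below rate / 2
max_diff = max(abs(a - b) for a, b in zip(too_high, alias))
```

Here `max_diff` is zero up to floating-point rounding, which is why the original oscillation cannot be reconstructed uniquely from an undersampled signal.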

• Phone: 8000 Hz

• DVD audio: 96,000 Hz or 192,000 Hz

• Audio CD: 44,100 measurements per second for two stereo channels with 16 bits per measurement results in 176.4 kB/s ⟹ approx. 10 MB/min, i.e., 635 MB/h

6.1 Sampling Rate
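The audio CD figures follow directly from sampling rate, resolution and channel count. The short calculation below reproduces them, assuming decimal units (1 MB = 1,000,000 bytes); `pcm_data_rate` is a hypothetical helper name.

```python
def pcm_data_rate(sample_rate_hz, bits_per_sample, channels):
    """Bytes per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels // 8


cd_bytes_per_s = pcm_data_rate(44_100, 16, 2)          # 176,400 B/s = 176.4 kB/s
cd_mb_per_min = cd_bytes_per_s * 60 / 1_000_000        # about 10.6 MB per minute
cd_mb_per_hour = cd_bytes_per_s * 3600 / 1_000_000     # about 635 MB per hour
```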

• For space reasons, digital audio data is usually stored in compressed form

• Known uncompressed formats:

– AIFF: *.aif (Audio Interchange File Format, Apple)
– Wave: *.wav (Windows)
– IRCAM: *.snd (Institut de Recherche et Coordination Acoustique / Musique)
– AU: *.au (Sun audio)

6.1 Audio Formats

• Data reduction: with (lossy) or without information loss (lossless)

• Lossless compression methods generally don’t compress very much

– Free Lossless Audio Codec (FLAC)
  • 50–60% of the original size
– Apple Lossless
– WavPack
– ...

6.1 Compression

• Lossy compression algorithms are typically based on simple transformations
– Modified Discrete Cosine Transform (MDCT) or wavelets

• Encoding: transformation of the waveform into frequency sequences (sampling)

• Decoding: reconstruction of the waveform from these values

• What data can we afford to lose?

6.1 Compression


• Change the data without changing the subjective perception
– Omit very high/low frequencies
– Save superimposed frequencies with less precision
– Exploit other effects according to the psychoacoustic model, e.g., quiet tones just before/after very loud sounds and frequency changes below a minimum distance are impossible to hear ...

6.1 Compression

• MPEG-1 Audio Layers I, II and III (MP3)

• CD quality at bit rates of 128 kb/s

• Rough outline of the MP3 approach

– Channel coupling of the stereo signal by encoding the difference between the channels

– Cut off inaudible frequencies

– Eliminate redundancy by considering the psychoacoustic effects

– Compress data using Huffman coding

6.1 Compression
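One common way to couple stereo channels through their difference is mid/side coding, sketched below. This is an illustration of the idea only, with hypothetical function names and made-up sample values; the joint-stereo modes of the actual MP3 standard are more involved.

```python
def encode_mid_side(left, right):
    """Replace left/right by their average (mid) and half-difference (side)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def decode_mid_side(mid, side):
    """Recover the original left/right channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical stereo samples in which both channels are nearly identical
left = [0.50, 0.25, -0.75, 0.00]
right = [0.50, 0.25, -0.50, 0.25]
mid, side = encode_mid_side(left, right)
restored = decode_mid_side(mid, side)
```

When the two channels are similar, the side signal stays close to zero and can be stored with far fewer bits than a full second channel.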

• AAC (Advanced Audio Coding)
– Industry-driven improvement of the MP3 format (supported by the MPEG)
– Used in TV/radio broadcasts, Apple iTunes Store, ...
– Better quality at the same file size
– Support for multi-channel audio
– Supports 48 main sound channels with up to 96 kHz sampling rate, 16 low-frequency channels (limited to 120 Hz) and 15 data streams

• Ogg Vorbis, Real Audio, WMA 9, ...

6.1 Compression

• Lossless compression, important factors are:

– De-/compression speed

– Compression rate

6.1 Compression

• Lossy compression, important factors are:

– De-/compression speed – Compression rate

– Most importantly, the quality of the compressed audio!

6.1 Compression

• Lossy compression, results

6.1 Compression


• Lossy compression, results

6.1 Compression

• Communication protocol
– For transmitting, recording and playing musical control information between digital instruments and the PC
– Statements are not sounds, but commands that can be used, e.g., by sound cards
– Some commands: note-on, note-off, key velocity, pitch, type of instrument

• Example: MIDI rendition vs. original recording (audio samples not included)

6.1 The MIDI Format

• 10 minutes of music amount to about 200 KB of MIDI data

– Significant savings compared to sampling, but no original sound

• Data are input to the PC via a keyboard and output via a synthesizer

• A sequencer buffers the data and changes to it

6.1 The MIDI Format
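To make the command character of MIDI concrete, the sketch below builds the raw bytes of a note-on and a note-off message and converts a note number to its pitch. It assumes MIDI 1.0 channel messages (status bytes 0x90 and 0x80, zero-based channel numbers 0 to 15, data bytes 0 to 127) and equal temperament with A4 = 440 Hz; the helper names are hypothetical.

```python
def note_on(channel, note, velocity):
    """Three-byte note-on message: status byte, note number, key velocity."""
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("value out of MIDI range")
    return bytes([0x90 | channel, note, velocity])


def note_off(channel, note, velocity=0):
    """Three-byte note-off message for the same key."""
    if not (0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("value out of MIDI range")
    return bytes([0x80 | channel, note, velocity])


def note_to_frequency(note):
    """Equal-tempered frequency in Hz, with note 69 tuned to A4 = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


message = note_on(channel=0, note=69, velocity=100)
```

A whole note costs only a handful of bytes, which is consistent with the small file sizes quoted for MIDI; the actual sound is produced by whatever synthesizer interprets the commands.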

• Audio data
– Music, CDs
– Sound effects, “Earcons”

• Audio data carry a large share of information transfer

– Storage of historical speeches

– Recordings of conversations, phone calls or negotiations

6.2 Audio Information in Databases

• Three main applications of audio signals in the context of databases
– Identification of audio signals (audio as query)
– Classification and search of similar signals (matching of audio)
– Phonetic synchronization

6.2 Special Applications

• Find the title, etc. for this music piece

• Monitoring of audio streams

– Control of broadcasting of advertisements on radio – Copyright Control (GEMA)

– (Remote) diagnosis based on noise

• Audio on Demand

6.2 Identification of Audio Signals


• Find perceptually similar audio signals
– E.g., similar pieces of music, the same quotation, ...

• Recommendation
– E.g., bands with similar music

• Genre classification (rock, classical, ...)

– E.g., in audio libraries

6.2 Classification and Matching

• Synchronization of audio streams
– Speech ⟺ text, notes ⟺ audio, ...

• Retrieval of text from or to speech
– Find specific points in a speech
– Verbal query to text documents

• Score following in concerts, etc.

6.2 Synchronization

• Identification

– The simplest of the three problems; research has been successful in recent years

• Classification and Matching – Often still manual annotations

– Automatic classification only works roughly, on small collections

– Matching is still largely unresolved

• Synchronization

– By now, tolerable error rates (speech ⟺ text)

6.2 State of the Art

• (Compressed) audio files are stored in the database as (smart) BLOBs

• Additionally, metadata and feature vectors are stored to implement the search functionality
– Speech: transcription as text
– Music: musical notation or MIDI

6.2 Persistent Storage
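A storage layout along these lines can be sketched with SQLite. The table and column names, the comma-separated feature encoding and the placeholder bytes are all assumptions made for illustration; a production system would use its database's own (smart) BLOB and vector facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE audio (
           id INTEGER PRIMARY KEY,
           title TEXT,
           artist TEXT,
           transcript TEXT,   -- speech: transcription as text
           features TEXT,     -- feature vector, here simply comma-separated
           data BLOB          -- the (compressed) audio file itself
       )"""
)

fake_audio = bytes(range(16))  # placeholder bytes standing in for an audio file
conn.execute(
    "INSERT INTO audio (title, artist, transcript, features, data) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Example Speech", "Unknown Speaker", "hello world",
     "0.12,0.80,440.0", fake_audio),
)

# Metadata search: the BLOB itself is never inspected, only the metadata
row = conn.execute(
    "SELECT title, features, data FROM audio WHERE title LIKE ?",
    ("%speech%",),
).fetchone()
features = [float(v) for v in row[1].split(",")]
```

The query finds the record through its metadata and returns the stored feature vector and the untouched audio bytes.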

• Search in audio data: metadata describe the audio file

– Semantic metadata: difficult to generate, e.g., title, artist, speaker, keywords, ...

– File information: can be automatically generated e.g., time/place of recording, filename, file size, ...

– Widely used, e.g., music exchange markets, online shops, ...

6.3 Audio Retrieval

• Manual indexing is extremely labor intensive and expensive

• Information is often incomplete, partial and subjective (e.g., genre classification)

• No possibility of query by example ("Sounds like ...")

• Search only with SQL, approximate string search, etc.

6.3 Metadata-based Search


• Using content of audio files

• Compare sample by sample
– Not very promising, inefficient
– Differences in sampling rate and resolution

• Sounds can be differentiated by certain characteristics

– Low Level Features – Logical Features

6.3 Content-based Search

• Acoustic features

– Same basic idea as in image databases
– Description of signal information by means of characteristic features
– In contrast to image information, we don’t use a single feature vector, but a time-dependent vector function
– Acoustic characteristics are tied to a point in time rather than to the audio file as a whole

6.3 Low Level Features

• Typical Low Level Features – Mean amplitude, loudness – Frequency distribution – Pitch

– Brightness – Bandwidth

• Measured in the ...
– ... time domain (amplitude versus time)
– ... frequency domain (intensity versus frequency)

6.3 Low Level Features

• Amplitude

– Pressure fluctuations around the zero point – Silence is equivalent to 0 amplitude

• Average energy

– Characterizes the volume of the signal

$$E = \frac{1}{N} \sum_{n=1}^{N} x_n^2$$

with $N$ as the total number of measurements and $x_n$ as the $n$-th measurement

6.3 Features in the Time Domain
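Average energy is commonly computed as the mean of the squared sample values. The sketch below uses that definition and also evaluates it frame by frame, which yields a time-dependent feature function instead of a single number. The function names, the frame size and the sample values are illustrative assumptions.

```python
def average_energy(samples):
    """Mean of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def frame_energies(samples, frame_size):
    """Average energy of consecutive, non-overlapping frames."""
    return [average_energy(samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


# Hypothetical signal: a quiet passage followed by a loud one
signal = [0.1, -0.1, 0.1, -0.1, 0.8, -0.8, 0.8, -0.8]
energies = frame_energies(signal, frame_size=4)
```

The first frame has a much lower energy than the second, reflecting the jump in volume halfway through the signal.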

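• Zero-Crossing Rate

– Frequency of sign changes in the signal

$$ZCR = \frac{1}{2N} \sum_{n=2}^{N} \left| \operatorname{sgn}(x_n) - \operatorname{sgn}(x_{n-1}) \right|$$

with sgn as the sign function (signum)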

6.3 Features in the Time Domain
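A zero-crossing rate can be computed by comparing the signs of neighboring samples. The sketch below assumes one common normalization, dividing the summed sign differences by twice the number of samples; other texts normalize slightly differently. The names and sample values are illustrative.

```python
def sgn(x):
    """Sign function: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def zero_crossing_rate(samples):
    """Summed sign differences of neighboring samples, divided by 2N."""
    n = len(samples)
    changes = sum(abs(sgn(samples[i]) - sgn(samples[i - 1]))
                  for i in range(1, n))
    return changes / (2 * n)


alternating = zero_crossing_rate([1, -1, 1, -1])   # sign flips at every step
steady = zero_crossing_rate([1, 2, 3, 4])          # no sign change at all
```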

• Silence Ratio

– Proportion of values that belong to a period of complete silence

• We must first establish:

– The amplitude value below which a measurement is considered to be silence

– The minimum number of consecutive readings that need to be silent, to form a silence period

6.3 Features in the Time Domain
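Both parameters appear directly in a straightforward implementation. This is a minimal sketch with a hypothetical function name and made-up threshold, run length and sample values.

```python
def silence_ratio(samples, threshold, min_run):
    """Fraction of samples that lie in silence periods.

    A silence period is a run of at least min_run consecutive samples whose
    absolute amplitude is below threshold.
    """
    silent = 0
    run = 0
    for x in samples:
        if abs(x) < threshold:
            run += 1
        else:
            if run >= min_run:
                silent += run
            run = 0
    if run >= min_run:      # a silent run may extend to the end of the signal
        silent += run
    return silent / len(samples)


signal = [0.5, 0.0, 0.01, 0.0, 0.6, 0.0, 0.7, 0.0, 0.0, 0.0]
ratio = silence_ratio(signal, threshold=0.05, min_run=3)
```

The single quiet sample between 0.6 and 0.7 is too short to count as a silence period, so only the two runs of three samples contribute.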


• Fourier transform of the signal
– Decomposition into frequency components with coefficients (Fourier coefficients)
– Representation of the frequency spectrum of the signal
  • Magnitude of the coefficient of each frequency (represents the amount of energy per frequency)
  • Usually measured in decibels (dB)

6.3 Features in the Frequency Domain
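The decomposition can be tried out with a naive discrete Fourier transform. The sketch below is written for clarity rather than speed (a real system would use an FFT library); the 8000 Hz sampling rate, the 64-sample window and the 1 kHz test tone are assumptions chosen so that the tone falls exactly on one frequency bin.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the DFT coefficients for bins 0 .. N/2 (naive O(N^2))."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        mags.append(abs(coeff))
    return mags


def to_decibels(magnitude, reference=1.0, floor=1e-12):
    """Magnitude relative to a reference, in dB; the floor avoids log(0)."""
    return 20 * math.log10(max(magnitude, floor) / reference)


rate = 8000          # assumed sampling rate in Hz
n = 64               # window length in samples
signal = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(n)]
spectrum = dft_magnitudes(signal)
bin_width_hz = rate / n                      # frequency spacing between bins
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_frequency = peak_bin * bin_width_hz
```

The strongest coefficient sits in the bin for 1000 Hz, and `to_decibels` converts any coefficient magnitude to the dB scale.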

• "Ahhh" sound and Fourier spectrum

6.3 Frequency Spectrum

• Bandwidth

– Interval of occurring frequencies
– Difference between the largest and smallest frequency in the spectrum (the minimum frequency is considered to be the first frequency above the silence threshold)

– Can also be used for classification, e.g., bandwidth in music is higher than for voice

6.3 Features in the Frequency Domain
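Given a magnitude spectrum, the bandwidth can be read off as the span of frequencies above the silence threshold. The sketch below assumes the spectrum is already available as parallel lists of frequencies and magnitudes; the two example spectra and the threshold are invented for illustration.

```python
def bandwidth(frequencies, magnitudes, threshold):
    """Difference between the highest and lowest frequency above threshold."""
    audible = [f for f, m in zip(frequencies, magnitudes) if m > threshold]
    if not audible:
        return 0
    return max(audible) - min(audible)


# Hypothetical spectra on a common frequency grid (in Hz)
freqs = [0, 125, 250, 375, 500, 625, 750, 875, 1000]
speech_like = [0.0, 0.1, 2.0, 3.0, 1.5, 0.2, 0.0, 0.0, 0.0]
music_like = [0.0, 1.0, 2.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.6]

speech_bw = bandwidth(freqs, speech_like, threshold=0.5)
music_bw = bandwidth(freqs, music_like, threshold=0.5)
```

With these made-up values the music-like spectrum spans a much wider interval than the speech-like one, which is the kind of difference a classifier could exploit.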

• Power Distribution
– Can be read directly from the spectrum
– Distinction of frequencies with high/low energy
– Calculation of frequency bands with high/low energy
– Centroid as the center of the spectral energy distribution (brightness)

6.3 Features in the Frequency Domain
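The centroid can be sketched as an energy-weighted mean frequency. This version assumes squared magnitudes as the energy per frequency; some definitions weight by the magnitudes themselves. Function name and spectra are illustrative.

```python
def spectral_centroid(frequencies, magnitudes):
    """Energy-weighted mean frequency (squared magnitudes used as energy)."""
    energies = [m * m for m in magnitudes]
    total = sum(energies)
    if total == 0:
        return 0.0
    return sum(f * e for f, e in zip(frequencies, energies)) / total


freqs = [100, 200, 300, 400]                    # Hz
dull = spectral_centroid(freqs, [2, 1, 0, 0])   # energy in the low bins
bright = spectral_centroid(freqs, [0, 0, 1, 2]) # energy in the high bins
```

A sound whose energy is concentrated at high frequencies gets a higher centroid and is perceived as brighter.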

• Harmony

– The lowest of all the “loud” frequencies is called the fundamental frequency

– Harmony of a signal increases when the dominant components in the spectrum are multiples of the fundamental frequency

– E.g., standard pitch A, as the fundamental frequency (440 Hz) produced on a violin creates harmonic oscillations at 880 Hz, 1320 Hz, 1760 Hz, etc.

6.3 Features in the Frequency Domain

[Figures: harmonic oscillations; frequency spectrum of a sound played on an instrument]

6.3 Harmony


• Pitch

– Can be approximated by means of the Fourier spectrum

– The value is calculated from the frequencies and amplitudes of the peaks

– Related to the fundamental frequency, which is often used as an approximation
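A rough pitch estimate along these lines takes the lowest "loud" peak as the fundamental frequency and then checks how much of the spectral energy sits at its integer multiples. Everything below is a sketch under assumptions: the loudness criterion (half of the strongest peak), the 5 Hz tolerance, the function names and the peak list are illustrative, and practical pitch trackers are considerably more robust.

```python
def fundamental_frequency(frequencies, magnitudes, loud_fraction=0.5):
    """Lowest frequency whose magnitude reaches loud_fraction of the maximum."""
    threshold = loud_fraction * max(magnitudes)
    return min(f for f, m in zip(frequencies, magnitudes) if m >= threshold)


def harmonic_share(frequencies, magnitudes, f0, tolerance_hz=5.0):
    """Share of spectral energy lying at integer multiples of f0."""
    total = sum(m * m for m in magnitudes)
    harmonic = 0.0
    for f, m in zip(frequencies, magnitudes):
        k = round(f / f0)                      # nearest multiple of f0
        if k >= 1 and abs(f - k * f0) <= tolerance_hz:
            harmonic += m * m
    return harmonic / total


# Hypothetical spectral peaks of a violin playing A (440 Hz), plus one
# weak non-harmonic component at 660 Hz
peak_freqs = [440, 660, 880, 1320, 1760]
peak_mags = [1.0, 0.1, 0.8, 0.6, 0.5]

f0 = fundamental_frequency(peak_freqs, peak_mags)
share = harmonic_share(peak_freqs, peak_mags, f0)
```

For this example the estimate is 440 Hz and almost all of the energy lies at multiples of it, so the signal would be considered highly harmonic.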