Multimedia Databases
Wolf-Tilo Balke, Janus Wawrzinek
Institut für Informationssysteme
Technische Universität Braunschweig
Previous Lecture
• Shape-based Features
– Chain Codes
– Area-based Retrieval
– Moment Invariants
– Query by Example
6 Audio Retrieval
6.1 Basics of Audio Data
6.2 Audio Information in Databases
6.3 Audio Retrieval
6 Audio Retrieval
• Information transfer through sound
– Audio (Latin, "I hear")
• Three different types of data:
– Music
– Spoken text
– Noise
6.1 Basics of Audio Data
• Auditory perception happens through pressure fluctuations in the air
• The eardrum vibrates synchronously
• The ear bones amplify and transmit the vibrations
• Auditory hair cells in the cochlea are stimulated by the vibrations
• Neurons produce impulses that are sent to the brain
6.1 Basics
• Our brain only interprets two major properties of sound:
– Pitch
– Volume
6.1 Basics
• Quantitative properties of the sound wave
• Amplitude as volume
– Logarithmic perception (a tenfold increase in amplitude doubles the perceived loudness)
• Frequency as pitch
– Number of periods per unit time
6.1 Basics
• Audio signals are time-dependent (overlapping) waveforms
6.1 Basics
• Constructive and destructive interference
6.1 Basics
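As a small numerical illustration (not part of the slides), summing two sampled sine waves shows both effects; the 440 Hz frequency, 8 kHz rate, and variable names are arbitrary choices:

```python
import math

def sample_wave(freq_hz, phase, n_samples, rate_hz):
    """Sample a unit-amplitude sine wave at the given rate."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz + phase)
            for n in range(n_samples)]

RATE = 8000
a = sample_wave(440, 0.0, 100, RATE)
b_in_phase = sample_wave(440, 0.0, 100, RATE)
b_anti_phase = sample_wave(440, math.pi, 100, RATE)

# Constructive interference: in-phase waves add up to double amplitude
constructive = [x + y for x, y in zip(a, b_in_phase)]
# Destructive interference: anti-phase waves cancel to (near) silence
destructive = [x + y for x, y in zip(a, b_anti_phase)]
```

The sum of the in-phase pair approaches amplitude 2, while the anti-phase pair cancels out almost exactly.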
• Audio examples
6.1 Basics
• Musical instruments are classified by their vibration generator
– E.g., string, wind, or percussion instruments
• The acoustics depend on the vibration generator
– E.g., string, air-column, or membrane instruments
• Synthetic sound creation needs an oscillator
• The oscillator generates voltage oscillations
• A speaker transmits the voltage changes onto a membrane
6.1 Sound Creation
• Influence of the oscillator
– Higher voltage ⟹ Higher frequency (Moog, 1964)
• Amplifier influences the amplitude thus the volume
• ADSR (attack-decay-sustain-release) envelope influences the loudness of a sound in time
6.1 Sound Creation
6.1 Sound Creation
Moog 901B (1964), Modular Moog Synthesizer (1967)
• Emerson, Lake & Palmer:
The Great Gates of Kiev (1974)
6.1 Sound Creation
• Synthesized sounds seem rather metallic. Producing a single synthesized sound involves four typical phases:
– Attack: speed and strength of the signal rise
– Decay: lowering the level
– Sustain: actual pitch
– Release: end of the signal
6.1 ADSR envelope
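The four phases can be sketched as a piecewise-linear gain function; the segment lengths and the sustain level below are made-up example values:

```python
def adsr_envelope(n, attack, decay, sustain_level, sustain_len, release):
    """Piecewise-linear ADSR gain for sample index n (all lengths in samples)."""
    if n < attack:                      # Attack: rise from 0 to peak 1.0
        return n / attack
    n -= attack
    if n < decay:                       # Decay: fall from 1.0 to sustain level
        return 1.0 - (1.0 - sustain_level) * n / decay
    n -= decay
    if n < sustain_len:                 # Sustain: hold the level
        return sustain_level
    n -= sustain_len
    if n < release:                     # Release: fall from sustain level to 0
        return sustain_level * (1.0 - n / release)
    return 0.0                          # Signal has ended

# Example envelope: 10 samples attack, 10 decay, sustain at 0.6 for 30, 20 release
env = [adsr_envelope(n, 10, 10, 0.6, 30, 20) for n in range(80)]
```

Multiplying a raw oscillator waveform by this envelope shapes its loudness over time.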
• Transformation of the continuous sound wave into a discrete representation
• Sampling: at regular intervals, save the current amplitude value of the vibration
• Clearly, we have to reconstruct audio signals from these values
6.1 Digitalization of Audio Data
• Basic characteristics
– Sampling rate: how many times per unit time is the value of the continuous signal tapped?
– Resolution: which accuracy are the values recorded with?
• Often, a resolution of 16 bits is used (2^16 different amplitude values)
• The sampling rate is application dependent:
– Audio CD: 44,100 Hz
– Phone: 8,000 Hz
6.1 Sampling
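A minimal sketch of sampling and quantization with the parameters above; the helper name `record` and the 440 Hz test tone are our choices, not from the slides:

```python
import math

def record(freq_hz, rate_hz, bits, n_samples):
    """Sample a unit sine wave and quantize each amplitude to the given bit resolution."""
    levels = 2 ** (bits - 1) - 1            # e.g. 16 bits -> 32767 positive steps
    samples = []
    for n in range(n_samples):
        x = math.sin(2 * math.pi * freq_hz * n / rate_hz)  # continuous amplitude
        samples.append(round(x * levels))                  # discrete amplitude value
    return samples

cd_quality = record(440, 44100, 16, 1000)   # Audio CD parameters
phone = record(440, 8000, 8, 1000)          # telephone-like parameters
```

The CD version stores far more samples per second and far finer amplitude steps than the phone version.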
• It is very important to uniquely reconstruct the initial oscillation
• The higher the sampling frequency, the more values must be saved
• Minimum sampling frequency?
– Sampling theorem (Nyquist, 1928): the sampling rate must be at least twice as large as the highest frequency occurring in the signal
6.1 Sampling Rate
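The theorem can be illustrated numerically (an example of ours, not from the slides): a 3000 Hz tone sampled at only 4000 Hz produces exactly the same sample values as a phase-inverted 1000 Hz tone, so the original frequency cannot be reconstructed:

```python
import math

RATE = 4000               # sampling rate in Hz; Nyquist limit is RATE/2 = 2000 Hz
F_HIGH = 3000             # tone above the Nyquist limit
F_ALIAS = RATE - F_HIGH   # 1000 Hz alias frequency

high = [math.sin(2 * math.pi * F_HIGH * n / RATE) for n in range(50)]
alias = [-math.sin(2 * math.pi * F_ALIAS * n / RATE) for n in range(50)]

# Sample by sample, the 3000 Hz tone is indistinguishable from the
# phase-inverted 1000 Hz tone: the samples are identical
max_difference = max(abs(h - a) for h, a in zip(high, alias))  # ~0.0
```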
• Phone: 8000 Hz
• DVD audio: 96,000 Hz or 192,000 Hz
• Audio CD: 44,100 measurements per second for two stereo channels with 16 bits per measurement results in 176.4 kB/s
⟹ ca. 10 MB/min, i.e., 635 MB/h
6.1 Sampling Rate
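The CD figures above can be recomputed directly:

```python
RATE = 44_100         # samples per second (Audio CD)
CHANNELS = 2          # stereo
BYTES_PER_SAMPLE = 2  # 16-bit resolution

bytes_per_second = RATE * CHANNELS * BYTES_PER_SAMPLE
kb_per_second = bytes_per_second / 1000        # 176.4 kB/s
mb_per_minute = bytes_per_second * 60 / 1e6    # ~10.6 MB/min
mb_per_hour = bytes_per_second * 3600 / 1e6    # ~635 MB/h
```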
• For space reasons, digital audio data is usually stored in compressed form
• Well-known uncompressed formats:
– AIFF: *.aif (Audio Interchange File Format, Apple)
– Wave: *.wav (Windows)
– IRCAM: *.snd (Institut de Recherche et Coordination Acoustique/Musique)
– AU: *.au (Sun audio)
6.1 Audio Formats
• Data reduction: with (lossy) or without (lossless) information loss
• Lossless compression methods generally don't compress very much; files shrink only to 50–60% of their original size
– Free Lossless Audio Codec (FLAC)
– Apple Lossless
– WavPack
6.1 Compression
• Lossy compression algorithms are typically based on simple transformations
– Modified Discrete Cosine Transform (MDCT) or wavelets
• Encoding: transformation of the sampled waveform into frequency coefficients
• Decoding: reconstruction of the waveform from these values
• What data can we afford losing?
6.1 Compression
• Change the data without changing the subjective perception
– Omit very high/low frequencies
– Save superimposed frequencies with less precision
– Use other effects according to the psychoacoustic model, e.g., low tones right before/after very loud sounds and frequency changes below a minimum distance are impossible to hear
6.1 Compression
• MPEG-1 Audio Layers I, II and III (MP3)
• CD quality at bit rates of 128 kb/s
• Coarse approach to MP3
– Channel coupling of the stereo signal by using the difference
– Cut off inaudible frequencies
– Eliminate redundancy by considering the psychoacoustic effects
– Compress data using Huffman coding
6.1 Compression
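As an isolated illustration of the last step above, here is a minimal generic Huffman coder (a sketch of the technique, not the actual MP3 entropy coder; all names are ours):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Build a prefix-free Huffman code: frequent symbols get short codewords."""
    freq = Counter(data)
    # Heap entries: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # merge the two rarest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
encoded = "".join(code[s] for s in "aaaabbc")
```

The frequent symbol "a" receives a 1-bit code, so the 7-symbol input compresses to 10 bits instead of a fixed-length 14.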
• AAC (Advanced Audio Coding)
– Industry-driven improvement of the MP3 format (supported by the MPEG)
– Used in TV-/radio broadcasts, Apple iTunes Store, ...
– Better quality at the same file size
– Support for multi-channel audio
– Supports 48 main sound channels with up to 96 kHz sampling rate, 16 low-frequency channels (limited to 120 Hz)
6.1 Compression
• For lossless compression, the important factors are:
– De-/compression speed
– Compression rate
6.1 Compression
• For lossy compression, the important factors are:
– De-/compression speed
– Compression rate
– Most important, the compressed audio quality!
6.1 Compression
• Lossy compression, results
6.1 Compression
• Communication protocol
– For transmission, recording, and playing of musical control information between digital instruments and the PC
– Statements are not sounds, but commands that can be used e.g., by sound cards
– Some commands: Note-on, note-off, key velocity, pitch, type of instrument
• Example:
6.1 The MIDI Format
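For illustration, the raw bytes of a note-on/note-off command pair following the MIDI 1.0 message layout (the channel number and key 69 for standard pitch A are example values of ours):

```python
def note_on(channel, pitch, velocity):
    """Raw 3-byte MIDI note-on message: status byte, key number, key velocity."""
    return bytes([0x90 | channel, pitch, velocity])

def note_off(channel, pitch):
    """Raw 3-byte MIDI note-off message (velocity 0 used here)."""
    return bytes([0x80 | channel, pitch, 0])

# Standard pitch A (440 Hz) is MIDI key number 69
msg = note_on(0, 69, 100) + note_off(0, 69)
```

Six bytes describe a complete note event; the sound itself is produced by the receiving synthesizer, not stored in the data.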
• 10 minutes of music are about 200 KB of MIDI data
– Significant savings compared to sampling, but no original sound
• Data is input to the PC via a keyboard and output via a synthesizer
• A sequencer caches the data and changes
6.1 The MIDI Format
• Audio data
– Music, CDs
– Sound effects, “Earcons”
• Audio data represent a large part of information transfer
– Storage of historical speeches
– Recordings of conversations, phone calls or negotiations
6.2 Audio Information in Databases
• Three main applications of audio signals in the context of databases
– Identification of audio signals (audio as query)
– Classification and search of similar signals (matching of audio)
– Phonetic synchronization
6.2 Special Applications
• Find the title, etc. for this music piece
• Monitoring of audio streams
– Control of broadcasting of advertisements on radio
– Copyright control (GEMA)
– (Remote) diagnosis based on noise
• Audio on Demand
6.2 Identification of Audio Signals
• Find perceptionally similar audio signals
– E.g., similar pieces of music, the same quotation, ...
• Recommendation
– E.g., bands with similar music
• Genre classification (rock, classical, ...)
– E.g., in audio libraries
6.2 Classification and Matching
• Synchronization of audio streams
– Speech ⟺ text, notes ⟺ audio, ...
• Retrieval of text from or to speech
– Find specific points in a speech
– Verbal query to text documents
• Following of audio scores in concerts, etc.
6.2 Synchronization
• Identification
– The simplest of the three problems; successful research in recent years
• Classification and Matching
– Often still manual annotations
– Automatic classification works only roughly and on small collections
– Matching is still largely unresolved
• Synchronization
6.2 State of the Art
• (Compressed) audio files are stored in the database as (smart) BLOBs
• Additionally, metadata and feature vectors are stored to realize the search functionality
– Language: transcription as text
– Music: musical notation or MIDI
6.2 Persistent Storage
• Search in audio data: metadata describe the audio file
– Semantic metadata: difficult to generate, e.g., title, artist, speaker, keywords, ...
– File information: can be automatically generated, e.g., time/place of recording, filename, file size, ...
– Widely used, e.g., in music exchange markets, online shops, ...
6.3 Audio Retrieval
• Manual indexing is extremely labor intensive and expensive
• Information is often incomplete, partial, and subjective (e.g., genre classification)
• No possibility of Query by Example ("Sounds like ...")
• Search only with SQL, approximate string search, etc.
6.3 Metadata-based Search
• Using content of audio files
• Direct comparison, measurement vs. measurement
– Not very promising, inefficient
– Differences in sampling rate and resolution
• Sounds can be differentiated by certain characteristics
– Low Level Features
– Logical Features
6.3 Content-based Search
• Acoustic features
– Same basic idea as in image databases
– Description of signal information by means of characteristic features
– In contrast to image information, we don't use a single feature vector, but a time-dependent vector function
– Acoustic characteristics are tied to points in time, rather than to the audio file as a whole
6.3 Low Level Features
• Typical Low Level Features
– Mean amplitude, loudness
– Frequency distribution
– Pitch
– Brightness
– Bandwidth
• Measured in the ...
– ... time domain (amplitude versus time)
– ... frequency domain (amplitude versus frequency)
6.3 Low Level Features
• Amplitude
– Pressure fluctuations around the zero point
– Silence is equivalent to 0 amplitude
• Average energy
– Characterizes the volume of the signal:
  E = (1/N) · Σ_{n=1..N} x_n²
– with N as the total number of measurements and x_n as the n-th measurement
6.3 Features in the Time Domain
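The average energy can be computed directly; the toy amplitude values below are ours:

```python
def average_energy(samples):
    """E = (1/N) * sum(x_n^2): mean squared amplitude of the signal."""
    return sum(x * x for x in samples) / len(samples)

loud = [0.8, -0.8, 0.8, -0.8]     # large swings -> high energy
quiet = [0.1, -0.1, 0.1, -0.1]    # small swings -> low energy
```

The louder signal yields a larger energy value, matching the feature's role as a volume measure.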
• Zero-Crossing Rate
– Frequency of sign changes in the signal:
  ZCR = (1/(N−1)) · Σ_{n=2..N} |sgn(x_n) − sgn(x_{n−1})| / 2
– with sgn as the sign function (signum)
6.3 Features in the Time Domain
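A direct implementation of this rate (the toy signals are our examples):

```python
def sgn(x):
    """Signum function: -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def zero_crossing_rate(samples):
    """Fraction of consecutive sample pairs whose sign changes."""
    n = len(samples)
    changes = sum(abs(sgn(samples[i]) - sgn(samples[i - 1])) // 2
                  for i in range(1, n))
    return changes / (n - 1)

fast = [1, -1, 1, -1, 1]    # alternates every sample -> high ZCR
slow = [1, 1, 1, -1, -1]    # one sign change -> low ZCR
```

A high ZCR hints at high-frequency content or noise; a low ZCR at low-frequency or voiced signals.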
• Silence Ratio
– Proportion of values that belong to a period of complete silence
• We must first establish:
– The amplitude value below which a sample is considered to be silent
– The minimum number of consecutive silent readings needed to form a silence period
6.3 Features in the Time Domain
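A sketch with both thresholds as parameters (function name and example values are ours):

```python
def silence_ratio(samples, threshold, min_run):
    """Fraction of samples inside a silence period: a run of at least
    `min_run` consecutive values with |amplitude| below `threshold`."""
    silent_total = 0
    run = 0
    for x in samples + [None]:          # sentinel flushes the final run
        if x is not None and abs(x) < threshold:
            run += 1
        else:
            if run >= min_run:          # only long enough runs count as silence
                silent_total += run
            run = 0
    return silent_total / len(samples)

signal = [0.5, 0.0, 0.01, 0.0, 0.6, 0.02, 0.7]
```

With threshold 0.05 and a minimum run of 3, only the middle run of three quiet samples counts, so the ratio is 3/7; the isolated quiet sample at index 5 is too short to form a silence period.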
• Fourier transform of the signal
– Decomposition into frequency components with coefficients (Fourier coefficients)
– Representation of the frequency spectrum of the signal
• The magnitude of a coefficient represents the amount of energy at that frequency
• Usually measured in decibels (dB)
6.3 Features in the Frequency Domain
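A naive discrete Fourier transform (O(N²), for illustration only; real systems use the FFT) shows how a pure sine concentrates its energy in a single frequency bin:

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT: magnitude of each frequency component up to the Nyquist bin."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

# 16 samples of exactly one sine cycle: all energy lands in frequency bin 1
wave = [math.sin(2 * math.pi * t / 16) for t in range(16)]
spectrum = dft_magnitudes(wave)
```

For a real sine of amplitude 1 over N samples, the matching bin has magnitude N/2 (here 8) and every other bin is essentially zero.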
• "Ahhh" sound and Fourier spectrum
6.3 Frequency Spectrum
• Bandwidth
– Interval of occurring frequencies
– Difference between the largest and smallest frequency in the spectrum (the minimum frequency is considered to be the first frequency above the silence threshold)
– Can also be used for classification, e.g., the bandwidth of music is higher than that of voice
6.3 Features in the Frequency Domain
• Power Distribution
– Can be read directly from the spectrum
– Distinction of frequencies with high/low energy
– Calculation of frequency bands with high/low energy
– Centroid as the center of the spectral energy distribution (brightness)
6.3 Features in the Frequency Domain
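The centroid can be sketched as an energy-weighted mean frequency (toy spectra of our choosing):

```python
def spectral_centroid(spectrum, bin_width_hz):
    """Energy-weighted mean frequency of a magnitude spectrum (brightness)."""
    return (sum(k * bin_width_hz * mag for k, mag in enumerate(spectrum))
            / sum(spectrum))

dark = [9, 3, 1, 0]     # energy concentrated at low frequencies
bright = [0, 1, 3, 9]   # energy concentrated at high frequencies
```

The bright spectrum has a higher centroid, which is why the centroid is used as a brightness measure.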
• Harmony
– The lowest of all the "loud" frequencies is called the fundamental frequency
– Harmony of a signal increases when the dominant components in the spectrum are multiples of the fundamental frequency
– E.g., standard pitch A (440 Hz) as the fundamental frequency produced on a violin creates harmonic oscillations at 880 Hz, 1320 Hz, 1760 Hz, etc.
6.3 Features in the Frequency Domain
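The multiple-of-the-fundamental property can be checked with a small helper (function name and tolerance are our choices):

```python
def is_harmonic(fundamental_hz, partials_hz, tolerance=0.01):
    """True if every partial is (nearly) an integer multiple of the fundamental."""
    return all(abs(f / fundamental_hz - round(f / fundamental_hz)) < tolerance
               for f in partials_hz)

# Standard pitch A on a violin: overtones at integer multiples of 440 Hz
violin_a = is_harmonic(440.0, [880.0, 1320.0, 1760.0])
```

A partial such as 1000 Hz over a 440 Hz fundamental would fail this check, indicating a less harmonic (more noise-like) sound.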
6.3 Harmony
Harmonic oscillations and frequency spectrum of a sound played on an instrument