Conclusion - Across-frequency processing in convolutive blind source separation

particular the AM decorrelation algorithm almost completely eliminates the speech signal in the output corresponding to the TV set, while some crosstalk from the TV set is still audible in the output corresponding to the single talker. This result is attributed to the speaker being closer to the microphones than the interfering TV set. The impulse responses from the speaker to the microphones can therefore be assumed to exhibit the better direct-to-reverberant energy ratio, which may make them easier for the algorithm to cancel.

Very long filters were used to separate the data. The Hanning window length was 3584 samples, the DFT length 4096 samples, and the window shift 1024 samples, at a sampling rate of 8 kHz.
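The analysis parameters quoted above can be made concrete with a minimal sketch of the short-time spectrum computation. Only the four numerical parameters are taken from the text. The function name `stft`, the white-noise test signal, and zero-padding at the end of each windowed frame are assumptions made for illustration, since the original implementation is not described here.

```python
import numpy as np

FS = 8000        # sampling rate in Hz (from the text)
WIN_LEN = 3584   # Hanning window length in samples (from the text)
NFFT = 4096      # DFT length in samples (from the text)
HOP = 1024       # window shift in samples (from the text)

def stft(signal):
    """Return short-time spectra with shape (num_frames, NFFT // 2 + 1)."""
    window = np.hanning(WIN_LEN)
    num_frames = 1 + (len(signal) - WIN_LEN) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + WIN_LEN] * window
                       for i in range(num_frames)])
    # rfft appends zeros to each windowed frame to reach NFFT samples
    # (assumed padding scheme).
    return np.fft.rfft(frames, n=NFFT, axis=1)

# Hypothetical input: two seconds of white noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(2 * FS)
X = stft(x)

window_duration_s = WIN_LEN / FS   # 0.448 s per analysis window
bin_spacing_hz = FS / NFFT         # about 1.95 Hz between DFT bins
```

With these settings each analysis window spans 0.448 s, which is what allows the frequency-domain filters to model long room impulse responses.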

3.6 Conclusion

of fourth order cross cumulants (Nikias and Petropulu, 1993) which, for zero-mean random variables x(f) and y(f), can be defined as

\[
c_4\bigl(x(f_k), y(f_l)\bigr) = E\{|x(f_k)|^2 |y(f_l)|^2\} - E\{|x(f_k)|^2\}\,E\{|y(f_l)|^2\} - \bigl|E\{x(f_k)\,y^{*}(f_l)\}\bigr|^2 - \bigl|E\{x(f_k)\,y(f_l)\}\bigr|^2. \tag{3.26}
\]

For speech spectrogram data, the fourth term on the r.h.s. of (3.26) is zero, and the third term on the r.h.s. is essentially a diagonal contribution for f_k = f_l. Hence, the fourth order cross cumulant expression (3.26) is very similar to the proposed measure of AM correlation (3.8).
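A sample estimate of the cross cumulant (3.26) can be sketched as follows, with expectations replaced by averages over time frames. The function name `cross_cumulant4` and the synthetic test signals are assumptions for illustration and do not come from the original work. The second test case uses two channels that share a common amplitude envelope, a crude stand-in for the amplitude modulation of speech.

```python
import numpy as np

def cross_cumulant4(x, y):
    """Sample estimate of (3.26) for zero-mean complex sequences x and y."""
    x = np.asarray(x, dtype=complex)
    y = np.asarray(y, dtype=complex)
    px, py = np.abs(x) ** 2, np.abs(y) ** 2
    return (np.mean(px * py)                         # first term
            - np.mean(px) * np.mean(py)              # second term
            - np.abs(np.mean(x * np.conj(y))) ** 2   # third term
            - np.abs(np.mean(x * y)) ** 2)           # fourth term

rng = np.random.default_rng(0)
n = 100_000

def complex_noise():
    # Circular complex Gaussian noise with unit variance.
    return (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

# Case 1: two independent channels, no common structure.
c_indep = cross_cumulant4(complex_noise(), complex_noise())

# Case 2: independent carriers multiplied by one common envelope (assumed).
envelope = rng.uniform(0.5, 1.5, n)
c_mod = cross_cumulant4(envelope * complex_noise(), envelope * complex_noise())
```

In the second case the third and fourth terms vanish on average because the carriers are independent and zero-mean, so the estimate is dominated by the covariance of the two power envelopes. This is the quantity that the AM correlation measures.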

This analogy permits the interpretation of AM correlation as a quantity which measures higher-order statistical dependencies between Fourier transform coefficients in different frequency channels. The present paper has shown that taking into account this higher-order structure in speech signals results in an improved algorithm for blind source separation.

Chapter 4

Separation of multidimensional sources

4.1 Introduction

The aim of blind source separation (BSS, Jutten and Hérault, 1991) is to recover independent source signals from knowledge of their superpositions only. One typical example is the ‘Cocktail-Party’ situation, i.e., mixed signals of several speakers are recorded with multiple microphones whereas the signals of interest are the individual speaker signals, which BSS tries to reconstruct. Methods based on different principles have been proposed to achieve this goal. Their common basis is the assumption that the sources are independent systems. By reconstructing signals which are as independent as possible, an attempt is made to recover the original sources. Since little additional knowledge is assumed to be known, the methods are termed ‘blind’.

Probably the most widely employed class of methods relies on the notion of ‘independence’ in the sense of statistical independence, and is also referred to as independent component analysis (ICA). This method decomposes mixed signals into statistically independent source signals by exploiting the assumed non-Gaussian probability density functions of the sources (e.g. Jutten and Hérault, 1991; Comon, 1994; Bell and Sejnowski, 1995; Cardoso and Laheld, 1996).

Another group of algorithms employs methods based on second order statistics and recovers the sources by requiring that the cross-correlation functions of different unmixed signals must vanish (e.g. Weinstein et al., 1993; Molgedey and Schuster, 1994; Belouchrani et al., 1997).

Finally, algorithms have been proposed that separate mixed signals based on the non-stationarity of the underlying sources (e.g. Matsuoka et al., 1995; Parra and Spence, 2000a).

An assumption common to all of the algorithms mentioned is that the N underlying sources s_i(t), i = 1, ..., N, are essentially one-dimensional, i.e., they depend on a single variable t only, where t may denote time. Even in applications where the raw data is higher dimensional, it is rearranged into a one-dimensional feature vector. For blind source separation of two-dimensional images, for example, the data is reordered into a one-dimensional vector which contains the concatenation of all pixel values (Bell and Sejnowski, 1997; Wachtler et al., 2001). This is justified by the assumption of a translation invariant mixing process so that the sources’ pixels are superimposed in the same way at all spatial positions. Furthermore, the data is assumed to be stationary with respect to two-dimensional space.

In the context of blind source separation, the case of multidimensional signals (e.g. Priestley, 1981) is encountered in frequency-domain based approaches to the separation of acoustically mixed sound sources. Due to time-delays and reverberation, the acoustic medium causes the convolutive mixing of sources. By computing consecutive short-time spectra, the data is transformed into the time-frequency spectrogram representation, and the convolutive mixing in the time domain factorizes into instantaneous mixing (i.e., mixing without time-delays) in each frequency band (for details cf. section 4.4.3). Hence, each source i, i = 1, ..., N, is no longer represented by the one-dimensional signal s_i(t), but rather by the two-dimensional spectrogram s_i(t, f), where the coordinates t and f correspond to the time and frequency dimension, respectively.
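The factorization of convolutive mixing into one instantaneous mixing problem per frequency can be checked numerically with a small sketch. To make the identity exact, the sketch assumes circular convolution over a single block, whereas with short-time spectra of real recordings the relation holds only approximately. The array names, sizes, and random filters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, T = 2, 8, 256                   # sources/sensors, filter taps, block length

s = rng.standard_normal((N, T))       # source signals s_j(t)
h = rng.standard_normal((N, N, L))    # h[i, j]: filter from source j to sensor i

# Time domain: x_i(t) = sum_j sum_tau h_ij(tau) s_j(t - tau), taken circularly.
x = np.zeros((N, T))
for i in range(N):
    for j in range(N):
        for tau in range(L):
            x[i] += h[i, j, tau] * np.roll(s[j], tau)

# Frequency domain: one N x N mixing matrix A(f) per frequency bin.
S = np.fft.fft(s, axis=1)             # shape (N, T)
A = np.fft.fft(h, n=T, axis=2)        # shape (N, N, T)
X = np.fft.fft(x, axis=1)             # shape (N, T)

# Instantaneous mixing in each bin: X(f) = A(f) S(f).
X_inst = np.einsum('ijf,jf->if', A, S)
max_err = np.max(np.abs(X - X_inst))
```

The residual `max_err` is at the level of floating-point rounding. The mixing matrix `A[:, :, f]` differs from bin to bin, which is the frequency dependence of the mixing discussed in the text.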

In contrast to the situation for image data outlined above, two aspects of the problem prohibit its simplification to a single one-dimensional problem. First, the signal mixing varies with frequency, which is a result of the convolution operation in the time domain. Second, the data is non-stationary with respect to the frequency dimension, since the power of, e.g., speech signals varies considerably across frequency. Since both the mixing and the data are non-stationary with respect to frequency, this problem is truly multidimensional.

Several researchers have proposed to split the multidimensional problem into a set of K independent one-dimensional BSS problems, one for each frequency f = 1, ..., K, and separate source components independently in each frequency (e.g. Capdevielle et al., 1995; Ehlers and Schuster, 1997; Murata et al., 1998; Parra and Spence, 2000a). However, this approach has the particular drawback that the sources’ components are recovered in a different (unknown) order in different frequencies, which makes the direct assignment of unmixed components to the corresponding sources impossible. The need to sort the source components’ permutations has led to additional post-processing steps which incorporate further prior knowledge that had not been assumed for the sake of separation. In contrast, it has recently been shown by Anemüller and Kollmeier (2000) for the case of separating convolutively mixed speech signals that taking into account the multidimensional nature of the source signals by modeling the statistical dependencies between different frequency channels resolves the permutation problem without post-processing and results in improved quality of separation.
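The permutation ambiguity can be illustrated with a toy sketch. It is not the algorithm of Anemüller and Kollmeier (2000) and not any of the cited post-processing methods. It only shows that amplitude envelopes shared across frequency carry the ordering information that per-frequency statistics lack. The synthetic sources, the random swaps, and the envelope-correlation test are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, T = 2, 16, 2000                 # sources, frequency bins, time frames

# Hypothetical sources: each has one slowly varying envelope common to all bins.
env = np.abs(rng.standard_normal((N, 1, T // 50))).repeat(50, axis=2)
carrier = rng.standard_normal((N, K, T)) + 1j * rng.standard_normal((N, K, T))
S = env * carrier                     # spectrograms s_i(t, f), shape (N, K, T)

# Ideal per-bin separation recovers the components in an unknown order per bin;
# this is simulated by swapping the two outputs in randomly chosen bins.
swapped = rng.random(K) < 0.5
Y = S.copy()
Y[:, swapped, :] = S[::-1, swapped, :]

def corr(a, b):
    """Correlation coefficient of two real sequences."""
    return np.corrcoef(a, b)[0, 1]

# Compare magnitude envelopes of every bin with those of bin 0.
mag = np.abs(Y)
ref = mag[:, 0, :]
detected = np.zeros(K, dtype=bool)
for k in range(1, K):
    keep = corr(ref[0], mag[0, k]) + corr(ref[1], mag[1, k])
    swap = corr(ref[0], mag[1, k]) + corr(ref[1], mag[0, k])
    detected[k] = swap > keep

# Order of each bin relative to bin 0, known here only because it was simulated.
truth = swapped ^ swapped[0]
```

In this toy setting `detected` matches `truth`: the across-frequency envelope dependence identifies which bins are swapped relative to bin 0, whereas within a single bin both orderings are equally valid.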

The aim of the present paper is to suggest that proper consideration of multidimensional sources can be beneficial also in applications other than the separation of acoustically mixed sound signals. Furthermore, a novel solution to the separation of mixed multidimensional sources is given, which is based on second-order statistics.

The outline of the paper is as follows. In section 4.2 the notion of multidimensional source signals and the assumed mixing model are specified. Examples of situations exhibiting multidimensional source signals are given. Section 4.3 is dedicated to a
