

3.5 Affective Computing

3.5.2 Emotion Recognition and Learning

generalization capability [134]. The use of over-sized auditory feature representations has been heavily criticized [307, 87] because these descriptors are unable to capture the nuances of emotional expression in speech.

As with facial expressions, implicit descriptor methods, mostly CNNs, have recently been used to describe emotions in human speech [138, 204]. These systems apply a series of pre-processing techniques, mainly to remove noise from the audio signal, and let the CNN learn how to represent the data and classify it into emotion concepts [264]. In this strategy, the CNN learns to represent the auditory information in the way that is most effective for the task. Initial research on music [190] and speech recognition [272] showed that this approach can avoid overfitting. Its drawback is that it needs an extensive amount of data to learn high-level audio representations, especially when applied to natural sounds, which can include speech, music, ambient sounds and sound effects [65].
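
To make this implicit-descriptor strategy concrete, the following sketch shows a small CNN that maps a log-mel spectrogram directly to emotion classes. It is a minimal illustration in Python/PyTorch, not a model from the cited works; the class name SpeechEmotionCNN, the layer sizes, and the input dimensions are all assumptions.

    # Minimal sketch: a CNN that learns its own representation from a
    # log-mel spectrogram and outputs emotion-class logits.
    import torch
    import torch.nn as nn

    class SpeechEmotionCNN(nn.Module):
        def __init__(self, n_classes=6):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
            )

        def forward(self, spectrogram):            # (batch, 1, n_mels, n_frames)
            return self.classifier(self.features(spectrogram))

    # Example: a batch of 8 spectrograms with 64 mel bands and 128 frames.
    logits = SpeechEmotionCNN()(torch.randn(8, 1, 64, 128))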

Although there are models that systematically describe the emotional components of speech [56], there is no consensus on how to identify affective components in the human voice; the evolution of emotional auditory descriptors is therefore limited and much more work is necessary.

Many other models that are not based on facial expressions or speech have been proposed. The use of physiological signals, derived from muscle movements, skin temperature, and electrocardiograms, was investigated [236, 156, 155] and delivered good performance. However, these approaches relied on uncomfortable, specialized sensors, such as bracelets or localized electrodes, to obtain the signals, making them unsuitable for real-world scenarios.

In a similar direction, electroencephalogram (EEG) signals have been used as emotion descriptors [25, 194, 197]. Using non-invasive EEG devices, brain activity was captured while different emotional experiences were presented to a human subject. The initial results showed poor accuracy and generalization, but opened a new direction for emotion descriptors.

Another way to describe emotions is through body movement and posture. Studies show a relation between different body postures and movements and emotional concepts [55], and even micro-expressions [83]. Most computational models in this area use the body shape [43] or common movement descriptors [40] to represent the expression, and they achieved good accuracy and a certain level of generalization. However, the best-performing systems relied on additional sensors, such as depth cameras or gloves, to capture the movements [158].


FACS [192], and use if-then rules to identify emotion expressions. Such systems present a simple and efficient scheme for identifying pre-defined expressions. Most of these works recognize the six universal emotions and usually do not generalize well, because the emotional concepts must be very clearly separable and identifiable. Applications of such systems to speech emotion recognition delivered good performance in very restricted word scenarios [301], but achieved poor performance in more complex or natural cases, such as real-world interaction [68].
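
The if-then scheme can be illustrated with a short sketch in Python. The mapping from FACS action units (AUs) to emotion labels below is a simplified, illustrative rule set rather than a validated one, and the function name classify_expression is hypothetical.

    # Minimal sketch: rule-based mapping from active FACS action units to an
    # emotion label; the AU combinations here are illustrative only.
    def classify_expression(active_aus):
        rules = {
            frozenset({6, 12}): "happiness",       # cheek raiser + lip corner puller
            frozenset({1, 4, 15}): "sadness",      # inner brow raiser + brow lowerer + lip corner depressor
            frozenset({1, 2, 5, 26}): "surprise",  # brow raisers + upper lid raiser + jaw drop
        }
        for aus, emotion in rules.items():
            if aus <= set(active_aus):             # rule fires if all its AUs are active
                return emotion
        return "neutral"

    print(classify_expression([6, 12, 25]))        # -> "happiness"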

Models based on stochastic approaches usually deal with sequence problems. Among them, those based on Hidden Markov Models (HMMs) became very popular in recent decades. Such models use different descriptors to represent the expression and feed them to one [193] or several HMMs [50].

Each HMM introduces sequence-dependent processing, creating a Markov chain that represents the changes in the expression and is used to classify them as an emotional concept. Such models are very popular for speech emotion recognition, owing to the good performance HMMs deliver in speech recognition tasks [173].

The biggest problem with these models is that they are limited in the amount of information they can learn, and they do not generalize well when new information is presented. They also tend to be computationally expensive, especially for larger datasets.
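
A common formulation is to train one HMM per emotion class on frame-level feature sequences (e.g. MFCC frames) and to label a new sequence with the class whose model yields the highest likelihood. The sketch below illustrates this in Python with the hmmlearn library; the function names, the Gaussian emissions, and the number of hidden states are assumptions for illustration only.

    # Minimal sketch: one Gaussian HMM per emotion class.
    import numpy as np
    from hmmlearn import hmm

    def train_class_hmms(sequences_per_class, n_states=3):
        models = {}
        for label, sequences in sequences_per_class.items():
            X = np.vstack(sequences)                  # stack all frames of this class
            lengths = [len(seq) for seq in sequences] # frames per utterance
            model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
            models[label] = model.fit(X, lengths)
        return models

    def classify(models, sequence):
        # The class whose HMM assigns the highest log-likelihood wins.
        return max(models, key=lambda label: models[label].score(sequence))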

Another popular method for emotion recognition is neural networks. Since the earliest work in this field, neural networks have been used to recognize emotional concepts [159, 257]. These models are inspired by human neural behavior and range from very theoretical [105] to practical approaches [272].

Neural networks have been used for everything from single-instance classification [141] to sequence processing and prediction with recurrent neural networks [271]. These models tend to be more complex to design and understand, which limits the implementation and development of applications. Usually, a large amount of data is necessary to train neural networks, and generalization can be a problem for some architectures.
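
As an illustration of sequence processing with a recurrent network, the sketch below (PyTorch; the class name EmotionGRU and all dimensions are illustrative) consumes a stream of per-frame expression features and classifies the whole sequence from the final hidden state.

    # Minimal sketch: a GRU that classifies a sequence of frame features.
    import torch
    import torch.nn as nn

    class EmotionGRU(nn.Module):
        def __init__(self, n_features=40, hidden=64, n_classes=6):
            super().__init__()
            self.rnn = nn.GRU(n_features, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_classes)

        def forward(self, frames):                 # (batch, n_frames, n_features)
            _, last_hidden = self.rnn(frames)      # last_hidden: (1, batch, hidden)
            return self.out(last_hidden[-1])       # one label per sequence

    logits = EmotionGRU()(torch.randn(4, 120, 40))  # 4 sequences of 120 frames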

With the advance of deep learning, most recent work involves neural architectures. Those with the best performance and generalization apply different classifiers to different descriptors [193, 45, 147], and there is no consensus on a universal emotion recognition system. Most of these systems are applied to one modality only and perform well for specific tasks.

By using a single modality, in most cases either vision or audition, these systems impose a domain-specific constraint and are sometimes insufficient to identify spontaneous and/or natural expressions [274].

Multimodality has been shown to improve emotion recognition in humans [35], but also in automatic recognition systems [118]. Usually, such systems use several descriptors to represent one expression and then one or several classifiers to recognize it [45]. Gunes et al. [118] evaluate the efficiency of facial expression, body motion, and a fused representation for an automatic emotion recognition system. They conduct two experiments, each extracting specific features from face and body motion from the same corpus, and compare the recognition accuracy. For facial expressions, they track the face and extract a series of features based on facial landmarks. For body motion, they track the position of the

shoulders, arms, and head of the subject and extract 3522 feature vectors, using dozens of different specific feature extraction techniques. These feature vectors are classified with general classification techniques, such as Support Vector Machines and Random Forests. Finally, they fuse all feature vectors extracted from both experiments and classify them. The results obtained when fusing face and motion features were better than when either modality was classified alone.
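
The fusion step can be pictured as simple feature-level concatenation followed by a standard classifier. The sketch below (Python with scikit-learn) is a hypothetical illustration of that idea, not the authors' actual pipeline; fuse_and_classify and the toy feature dimensions are assumptions.

    # Minimal sketch: feature-level fusion of face and body descriptors.
    import numpy as np
    from sklearn.svm import SVC

    def fuse_and_classify(face_feats, body_feats, labels):
        fused = np.hstack([face_feats, body_feats])   # (n_samples, d_face + d_body)
        return SVC(kernel="rbf").fit(fused, labels)

    # Usage with toy data: 100 samples, 50 face features, 80 body features.
    rng = np.random.default_rng(0)
    clf = fuse_and_classify(rng.normal(size=(100, 50)),
                            rng.normal(size=(100, 80)),
                            rng.integers(0, 6, size=100))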

The same conclusion was reached by Chen et al. [45]. They apply a series of techniques to pre-process and extract specific features from face and body motion, similarly to Gunes et al. The differences are that they use fewer features in the final representation and that the two approaches represent temporal variation differently.

Gunes et al. use frame-based classification, where each frame is classified individually and a stream of frames is later scored to identify which emotional state is present. Chen et al. analyze two temporal representations: one based on a bag-of-words model and another based on temporal normalization through linear interpolation of the frames. Both approaches rely on manual feature fusion, which does not take into account the inner correlation between facial expression and body motion, but instead fuses the two modalities through a fixed methodological scheme.
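
The temporal normalization idea can be sketched as resampling a variable-length sequence of frame features to a fixed number of frames by linear interpolation, so that sequences of different durations become directly comparable. The Python/NumPy sketch below is illustrative; normalize_length and the target length are assumptions.

    # Minimal sketch: temporal normalization by linear interpolation.
    import numpy as np

    def normalize_length(frames, target_len=50):
        frames = np.asarray(frames)                    # (n_frames, n_features)
        src = np.linspace(0.0, 1.0, num=len(frames))
        dst = np.linspace(0.0, 1.0, num=target_len)
        # Interpolate each feature dimension independently.
        return np.stack([np.interp(dst, src, frames[:, d])
                         for d in range(frames.shape[1])], axis=1)

    resampled = normalize_length(np.random.rand(73, 12))   # -> shape (50, 12)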

Observing different modalities, such as body posture, motion, and speech intonation, improved the determination of the emotional state of different subjects, increasing the generalization of the models. This was demonstrated in the computational system of Castellano et al. [43], which processes facial expression, body posture, and speech, extracting a series of features from each modality and combining them into one feature vector. Although they show that processing different modalities together leads to better recognition accuracy, extracting each modality individually does not model the correlation between them, which could be found by processing the modalities together as one stream.

The same principle was also found for visual-only modalities [118, 45] and audio-only modalities [251, 147, 195]. However, all these works deal with a restricted set of expression categories, which means that if a new emotion expression is presented to these systems, they must be re-trained and the whole system must be re-evaluated and re-validated.

3.6 Summary on Emotion Learning and Affective Computing

Much research has been done on understanding how emotions affect humans on many different levels. From neural modulation in perception and learning systems to behavioral aspects of emotional concepts and affective computing, the most important notions and methods were discussed in this chapter. In each of these areas, researchers have discussed, proposed, and described many different mechanisms, models, and theories, and yet we are far from a unified model of emotions. This can be explained by the wide range of systems and mechanisms in which emotions play an important role, and it indicates how important it is to continue research in this field.

This chapter also presented an overview of several approaches and models for different emotional tasks in affective computing. Although these models offer solutions for various tasks, none of them has led to an integrative solution for emotion recognition, processing, and learning, which is the primary focus of this thesis.

Drawing inspiration for some of the solutions in our models from the neural-psychological mechanisms presented in this chapter allows us to address important aspects of our research questions and to contribute to the field of affective computing.

Chapter 4

Neural Network Concepts and Corpora Used in this Thesis

This chapter discusses the most important neural network concepts and techniques used in the development of the models proposed in this thesis. The techniques are presented in their original form, and any necessary modifications are indicated in the chapters of the respective models.

To evaluate the proposed models, a number of different corpora are necessary; they are described in section 4.6. During this work, a new corpus was recorded, and all details concerning the design, collection, and analysis of the recorded data are presented in section 4.7.