
5.2 Cross-channel Convolution Neural Network

5.2.1 Visual Representation

Inspired by the primate visual cortex model [91], the visual stream of the proposed model has two channels. The first channel is responsible for learning and extracting information about facial expressions, comprising the contour, shape and texture of a face, and mimics the encoding of information in the ventral area of the primate visual cortex. The second channel codes information about the orientation, direction and speed of changes within the torso of a person in a sequence of images, similar to the information coded by the dorsal area.


Although facial expressions are an important cue for determining expressed emotions, there is evidence that, in some cases, facial expressions and body posture and movements contradict each other and carry different meanings.

This phenomenon was first observed by Darwin [60] and is referred to as micro expressions. Although micro expressions occur in other modalities as well, the face is the one in which this behavior is most easily perceptible [243].

Ekman [83] demonstrates that facial micro expressions last from 40 to 300 milliseconds and are composed of an involuntary pattern of the face, sometimes not directly related to the expression the person intended to perform. He also shows that micro expressions are too brief to convey an emotion on their own, but are usually signs of concealed behavior, giving the expression a different meaning. For example, facial micro expressions are often what reveals whether someone is angry while displaying a happy, sarcastic expression. In this case, the addition of facial micro expressions as an observable modality can enhance the capability of the model to distinguish spontaneous expressions, but the observation of a facial micro expression alone does not carry any meaning [231].

Our architecture is tuned to deal with facial expressions and micro expressions.

Our architecture is fed with frames comprising 1 second of video, and our Face channel receives a smaller sequence representing 300 milliseconds. These intervals were found by experimenting with different sequence lengths as input to the network, and the chosen values are congruent with the evidence from [83]. This means that our network is able to recognize common facial expressions, but also takes micro expressions into consideration.

To feed our visual stream, we must first find the faces in the images. To do so, we use the Viola-Jones face detection algorithm [291], which uses AdaBoost-based detection. Wang [297] discusses the Viola-Jones algorithm and shows that it is robust and effective when applied to general face detection datasets. In our experiments, the Viola-Jones algorithm proved to be reliable in controlled environments. After finding the face, we create a bounding box to describe the torso movement. We extract the face and torso from a sequence of frames corresponding to 1 second and use them as input to the network.
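As an illustration, a minimal sketch of this pre-processing step could look as follows. It assumes OpenCV's Haar-cascade implementation of the Viola-Jones detector; the torso heuristic (a fixed region below the detected face) and the exact crop sizes are illustrative assumptions, not the precise procedure used here.

```python
import cv2

# OpenCV ships a Haar-cascade (Viola-Jones) frontal-face detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_and_torso(frame):
    """Return a 50x50 face crop and a 128x96 torso crop (or None, None)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = faces[0]                                  # first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], (50, 50))    # Face channel input
    # Assumption: the torso spans roughly three face heights below the face
    # and three face widths around it.
    ty, tx = y + h, max(0, x - w)
    torso = gray[ty:ty + 3 * h, tx:tx + 3 * w]
    torso = cv2.resize(torso, (128, 96))                   # Movement channel input
    return face, torso
```

Such a routine would be applied to every frame of the 1-second window before the crops are passed on to the two channels.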

To define the input of the Movement channel, we use a motion representation.

Feeding this channel with the motion representation, and not the whole image, allows us to specialize it in learning motion descriptors. This way, we can train the network with a smaller amount of data and use a shallow network to obtain high-level descriptors.

To obtain the motion representation, an additional layer is used to pre-process the input of the Movement channel. This layer receives 10 gray-scale frames and creates a representation based on the difference between each pair of consecutive frames. This approach was used in previous work to learn gestures and was shown to be successful [14]. The layer computes the absolute difference of each pair of frames and sums the resulting frames into a stack. This operation generates one image representing the motion of the sequence, defined here as M:

Figure 5.1: Example of input for the visual stream. We feed the network with 1 second of expressions, which is processed into 3 movement frames and 9 facial expression frames.

M = \sum_{i=1}^{N} |F_{i-1} - F_i| \, (i/t),   (5.1)

where N is the number of frames, F_i represents the current frame and (i/t) represents the weighted shadow. The weighted shadow is used to create different gray-scale shadows in the final representation according to the time at which each frame is presented, which means that every frame contributes a different gray tone to the final image. The weight starts at 0 for the first frame and increases over time, so each frame has a different weight. The absolute difference of each pair of frames removes the non-changing parts of the image, discarding the background and any other detail that is not important to the motion representation. By summing up the absolute differences of all pairs of frames, it is possible to create a shape representation of the motion and, with the help of the weighted shadows, of the time at which each single posture occurred.
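A minimal NumPy sketch of this pre-processing step, following Equation 5.1, could look as follows; normalizing the weight term as i/N and clipping the result to the 8-bit range are assumptions made for illustration.

```python
import numpy as np

def motion_representation(frames):
    """Weighted absolute-difference motion image for a sequence of
    gray-scale frames (Eq. 5.1). The i/N weighting is an assumed
    form of the 'weighted shadow' term."""
    frames = [f.astype(np.float32) for f in frames]
    n = len(frames)
    motion = np.zeros_like(frames[0])
    for i in range(1, n):
        # The absolute difference removes static parts of the scene;
        # the weight gives later frames a brighter shadow than earlier ones.
        motion += np.abs(frames[i - 1] - frames[i]) * (i / n)
    return np.clip(motion, 0, 255).astype(np.uint8)
```

Applied to each group of 10 torso frames, this yields the 3 motion representations per second that feed the Movement channel.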

Figure 5.1 displays a common input of our visual stream, containing examples of the Face and Movement channels.

The Face channel is composed of two convolution and pooling layers. The first convolution layer implements 5 filters with cubic receptive fields, each one with a dimension of 5x5x3. The second layer implements 5 filter maps, also with a dimension of 5x5, and a shunting inhibitory field. Both layers implement max-pooling operators with a receptive field of 2x2. In our experiments, we use a rate of 30 frames per second, which means that the 300 milliseconds are represented by 9 frames. Each frame is resized to 50x50 pixels.
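A sketch of the Face channel in PyTorch could look as follows; the ReLU activations are assumed, and the shunting inhibitory field is approximated by a parallel convolution acting as divisive inhibition, which may differ from the exact formulation used here.

```python
import torch
import torch.nn as nn

class FaceChannel(nn.Module):
    """Sketch of the Face channel: two convolution + pooling stages over
    a stack of 9 gray-scale 50x50 face frames."""
    def __init__(self):
        super().__init__()
        # Layer 1: 5 filters with cubic 5x5x3 receptive fields (depth first
        # in PyTorch's (D, H, W) kernel order), followed by 2x2 max-pooling.
        self.conv1 = nn.Conv3d(1, 5, kernel_size=(3, 5, 5))
        # Layer 2: 5 filter maps of 5x5 with approximate shunting inhibition.
        self.conv2 = nn.Conv3d(5, 5, kernel_size=(1, 5, 5))
        self.inhib = nn.Conv3d(5, 5, kernel_size=(1, 5, 5))
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):            # x: (batch, 1, 9, 50, 50)
        x = self.pool(torch.relu(self.conv1(x)))
        # Divisive (shunting) inhibition: excitation / (1 + inhibition).
        x = torch.relu(self.conv2(x)) / (1.0 + torch.relu(self.inhib(x)))
        return self.pool(x)          # 9x9 spatial output per filter map
```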

The Movement channel implements three convolution and pooling layers. The first convolution layer implements 5 filters with cubic receptive fields, each one with a dimension of 5x5x3. The second and third layers implement 5 filters, each one with a dimension of 5x5, and all layers implement max-pooling with a receptive field of 2x2. We feed this channel with 1 second of expressions, meaning that we feed the network with 30 frames. We compute the motion representation for every 10 frames, meaning that we feed the Movement channel with 3 motion representations. All the images are resized to 128x96 pixels.
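A corresponding sketch of the Movement channel, under the same assumptions as above, could be:

```python
import torch
import torch.nn as nn

class MovementChannel(nn.Module):
    """Sketch of the Movement channel: three convolution + pooling stages
    over the 3 motion representations computed from 1 second of video."""
    def __init__(self):
        super().__init__()
        # Layer 1: 5 filters with cubic 5x5x3 receptive fields across the
        # 3 motion images; layers 2 and 3 use 5 filters of 5x5 each.
        self.conv1 = nn.Conv3d(1, 5, kernel_size=(3, 5, 5))
        self.conv2 = nn.Conv3d(5, 5, kernel_size=(1, 5, 5))
        self.conv3 = nn.Conv3d(5, 5, kernel_size=(1, 5, 5))
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))   # after every layer

    def forward(self, x):            # x: (batch, 1, 3, 96, 128)
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        return self.pool(torch.relu(self.conv3(x)))
```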


Figure 5.2: The Visual stream of our network is composed of two channels: the Face and the Movement channels. The Face channel implements two layers, each one with convolution and pooling, and applies inhibitory fields in the second layer, while the Movement channel implements three layers, each with convolution and pooling. Both channels implement cubic receptive fields in the first layer. The final output of each channel is fed to a Cross-channel which implements convolution and pooling and produces a final visual representation.

We apply a Cross-channel to the visual stream. This Cross-channel receives as input the outputs of the Face and Movement channels, and is composed of one convolution layer with 10 filters, each with a dimension of 3x3, and one max-pooling operation with a receptive field of 2x2. We have to ensure that both inputs of the Cross-channel have the same dimensions; to do so, we resize the output representation of the Movement channel to 9x9, the same size as the output of the Face channel. Figure 5.2 illustrates the visual stream of the network.
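Continuing the two channel sketches above, the Cross-channel could be sketched as follows; the use of nn.LazyConv2d (to infer the number of stacked input maps) and of bilinear resizing are implementation conveniences assumed here, not necessarily the original choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannel(nn.Module):
    """Sketch of the visual Cross-channel: both channel outputs are brought
    to the same 9x9 spatial size, stacked, and passed through one
    convolution (10 filters of 3x3) and one 2x2 max-pooling."""
    def __init__(self):
        super().__init__()
        self.face = FaceChannel()
        self.movement = MovementChannel()
        self.conv = nn.LazyConv2d(10, kernel_size=3)   # infers input maps
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, faces, motion):
        f = self.face(faces).flatten(1, 2)             # merge filter/depth dims
        m = self.movement(motion).flatten(1, 2)
        # Resize the Movement representation to 9x9, the Face channel size.
        m = F.interpolate(m, size=f.shape[-2:], mode="bilinear",
                          align_corners=False)
        x = torch.cat([f, m], dim=1)                   # stack both channels
        return self.pool(torch.relu(self.conv(x)))

# Example shapes: CrossChannel()(torch.randn(2, 1, 9, 50, 50),
#                                torch.randn(2, 1, 3, 96, 128))
```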