
7.5 Results

7.5.1 Experiment 1: Emotional Attention

Our results for the Emotional Attention corpus are shown in Table 7.1. It is possible to see that when no expression was present, the model's top-20 error was 75.5%. In this case, the model should output a distribution filled with zeros, indicating that no expression was presented.

After evaluating the model with one expression, we can see how each modality performs. When only the face was present, the top-20 accuracy was 87.7%. In this case, the presence of a facial expression was effective in identifying a point of interest, and the use of happy or neutral facial expressions led the network to strong assumptions about the regions of interest. When evaluating only body movement, there was a slight drop in the top-20 accuracy, which became 85.0%. This was likely caused by the presence of movements in the image without a clear identification of which movement was performed (either happy or neutral). Finally, the best results were reached by the combination of both cues, with a top-20 accuracy of 93.4%, showing that the integration of facial expression and body movement was more accurate than processing either modality alone.
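The comparisons above all rely on a top-20 criterion computed over the predicted attention map; the exact evaluation protocol is described earlier in this chapter. Purely as an illustration, a top-k hit rate over a saliency map could be computed along the following lines (the function names and the single-point ground truth are assumptions, not the thesis' implementation):

```python
import numpy as np

def topk_hit(saliency_map, target_point, k=20):
    """Return True if the annotated point of interest falls among the k most
    salient locations of the predicted attention map (hypothetical criterion)."""
    flat = saliency_map.ravel()
    # Indices of the k highest-scoring locations in the predicted map.
    top_idx = np.argpartition(flat, -k)[-k:]
    top_coords = np.column_stack(np.unravel_index(top_idx, saliency_map.shape))
    hit = (top_coords == np.asarray(target_point)).all(axis=1).any()
    return bool(hit)

def topk_accuracy(saliency_maps, target_points, k=20):
    """Fraction of frames, in percent, whose annotated point is a top-k hit."""
    hits = [topk_hit(m, t, k) for m, t in zip(saliency_maps, target_points)]
    return 100.0 * np.mean(hits)
```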

Presenting two expressions from two persons at the same time caused a drop in attention precision while making the relevance of using both modalities clearer. Using only facial expressions reached a top-20 accuracy of 76.4%, while using only body movement reached 68.2%. In this scenario, the identification of the type of expression became important: the distinction between happy and neutral expressions represents a key point, and the facial expression became more dominant in this case. However, when presenting both modalities at the same time, the model obtained a top-20 accuracy of 84.4%. This shows that both modalities together were better at distinguishing between happy and neutral expressions than using one modality alone.


Table 7.2: Reported top-20 accuracy, in percentage, and standard deviation (in parentheses) for different modalities of the WTM Emotion Interaction corpus.

No Expression     63.2% (1.2)

                  Face          Body Movement   Both
One Expression    96.2% (2.3)   91.0% (1.4)     98.45% (1.2)

The results for the KT Emotional Interaction Corpus are shown in Table 7.2.

It is possible to see a similar behavior, where the combination of body movement and face obtained a better result, reaching 98.45%. An interesting aspect is that the “no expression” experiment achieved a low accuracy of 63.2%, mostly because this corpus contains very little data in which no expression is performed.

The second round of experiments deals with the use of the attention modulation for emotion recognition. Table 7.3 shows the results obtained when training the attention-modulated recognition model with the FABO corpus, in comparison with the common CCCNN architecture. It is possible to see that the mean general recognition rate increased from 93.65% to 95.13% through the use of the attention modulation. Although some of the expressions presented higher recognition rates, expressions such as “Boredom”, “Fear”, and “Happiness” presented slightly lower accuracy. A likely cause is that these expressions presented hand-over-face occlusions or very slight movements, which were captured by the CCCNN but ruled out by the attention mechanism.
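The precise way in which the attention modulation is injected into the CCCNN is defined in the architecture description earlier in this thesis. Purely as a hypothetical sketch of the general idea of spatially re-weighting intermediate feature maps with the attention output, one possible formulation is the following (the shapes, the normalization, and the function name are assumptions, not the actual implementation):

```python
import numpy as np

def modulate_features(feature_maps, attention_map, eps=1e-8):
    """Hypothetical attention modulation: re-weight convolutional feature maps
    with a normalized attention map before they reach the classifier.

    feature_maps : array of shape (channels, height, width)
    attention_map: array of shape (height, width), same spatial size assumed
    """
    # Normalize the attention map to [0, 1] so it acts as a soft spatial gate.
    att = attention_map - attention_map.min()
    att = att / (att.max() + eps)
    # Element-wise re-weighting of every channel by the attention values.
    return feature_maps * att[np.newaxis, :, :]
```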

Evaluating the CCCNN with and without the attention modulation produced the results shown in Table 7.4. It is possible to see that the attention mechanism increased the accuracy of expressions such as “Fear”, “Happiness”, and “Sadness” more than the others. This mostly happens because these expressions present a high degree of variety in the dataset, which is easily perceived by the attention model and passed to the CCCNN as an important representation.

7.5.2 Experiment 2: Affective Memory

Training our Emotional (deep) Neural Circuitry without memory modulation gave us a direct correlation between what was expressed and the stored memory. This means that the memory learned how to create neurons that code for the presented emotional concepts. We fed all the videos of the HHI and HRI scenarios to the model and proceeded to calculate the interclass correlation coefficient per subject and per topic between the network representation and each of the annotators' labels.
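The agreement values reported below are interclass correlation coefficients computed per subject and per topic; the exact coefficient variant is specified in the evaluation methodology. Purely as an illustration of this family of agreement measures, a one-way intraclass correlation ICC(1,1) between the network output and the annotators' labels could be sketched as follows (the function and the data layout are assumptions, not the thesis' implementation):

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: array of shape (n_targets, n_raters), e.g. one column holding the
    network's per-video output and one column per human annotator.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    # Between-target and within-target mean squares.
    msb = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((ratings - target_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```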

The interclass correlation coefficients per topic for the HHI scenario are presented in Table 7.5. It is possible to see high correlations for both dimensions, valence and arousal, for at least two scenarios: Lottery and Food.

Table 7.3: Reported accuracy, in percentage, for the visual stream channels of the CCCNN trained with the FABO corpus with and without attention modulation.

Class               Without Attention   With Attention
Anger               95.9                95.7
Anxiety             91.2                95.4
Uncertainty         86.4                92.1
Boredom             92.3                90.3
Disgust             93.2                93.3
Fear                94.7                94.5
Happiness           98.8                98.0
Negative Surprise   99.6                99.7
Positive Surprise   89.6                94.8
Puzzlement          88.7                93.2
Sadness             99.8                99.5
Mean                93.65               95.13

Table 7.4: Reported accuracy, in percentage, for the visual stream channels of the CCCNN trained with the KT Emotion Interaction Corpus with and without attention modulation.

Class       Without Attention   With Attention
Anger       85.4                89.7
Disgust     91.3                91.2
Fear        79.0                81.2
Happiness   92.3                94.5
Neutral     80.5                84.3
Surprise    86.7                87.9
Sadness     87.1                90.2
Mean        86.0                88.4

These two scenarios were also the ones with stronger correlations among the annotators themselves, and possibly the ones where the expressions were most easily distinguishable for all the subjects. The emotion concept correlation also shows a strong agreement value for these two scenarios, while showing a slight disagreement on the School scenario.

The correlation coefficients for the HRI scenario are presented in Table 7.6. It is possible to see that, similarly to the HHI scenario, the topics with the highest correlation were Lottery and Food, while the lowest ones were Pet and Family.

Here the correlation values are slightly smaller than in the HHI scenario, indicating that these expressions were more difficult to annotate, which is reflected in our model's behavior.


Table 7.5: Interclass correlation coefficient of our Emotional (deep) Neural Circuitry per topic in the HHI scenario.

Characteristic      Lottery   Food   School   Family   Pet
Valence             0.65      0.64   0.41     0.67     0.57
Arousal             0.67      0.72   0.42     0.56     0.49
Emotional Concept   0.84      0.71   0.47     0.52     0.53

Table 7.6: Interclass correlation coefficient per topic in the HRI scenario.

Characteristic    Lottery   Food   School   Family   Pet
Valence           0.78      0.67   0.31     0.58     0.47
Arousal           0.72      0.61   0.57     0.49     0.42
Emotion Concept   0.79      0.75   0.62     0.51     0.57

Table 7.7: Interclass correlation coefficient of our Emotional (deep) Neural Circuitry per subject in the HHI scenario.

Session              2            3            4            5
Subject              S0    S1     S0    S1     S0    S1     S0    S1
Valence              0.63  0.54   0.67  0.59   0.69  0.67   0.54  0.59
Arousal              0.55  0.57   0.67  0.59   0.67  0.60   0.57  0.53
Emotional Concepts   0.79  0.67   0.74  0.79   0.61  0.74   0.67  0.59

Session              6            7            8
Subject              S0    S1     S0    S1     S0    S1
Valence              0.57  0.61   0.64  0.61   0.49  0.68
Arousal              0.54  0.61   0.50  0.87   0.71  0.84
Emotional Concepts   0.68  0.87   0.68  0.63   0.64  0.76

The correlation coefficients calculated on the Affective Memory for the HHI scenario are shown in Table 7.7. Here, it is possible to see that for most of the subjects the network presented a moderately good correlation, while only a few presented a very good one. It is also possible to see that the correlations obtained for the emotion concepts were again the highest.

For the subjects in the HRI scenario, the correlation coefficients are presented in Table 7.8. It is possible to see that, differently from the HHI scenario, the correlations between the model's results on recognizing emotion concepts and the annotators' labels are rather small, showing that for this scenario the network could not represent the expressions as well as in the HHI scenario. The other dimensions show a similar behavior, indicating that our Affective Memory can learn and represent subject behaviors in a manner similar to how the annotators perceived them.

Table 7.8: Interclass correlation coefficient of our Emotional (deep) Neural Circuitry per subject in the HRI scenario.

Subject             S1     S2     S3     S4     S5     S6     S7     S8     S9
Valence             0.58   0.42   0.67   0.59   0.42   0.80   0.45   0.61   0.78
Arousal             0.52   0.62   0.57   0.60   0.67   0.62   0.59   0.48   0.58
Emotional Concept   0.74   0.57   0.62   0.61   0.57   0.59   0.57   0.69   0.72