
Data and Experimental Design


In this section, we discuss the performance of the person-independent gaze estimation model and perform a pilot test to validate the effectiveness of the proposed mutual gaze detection system.

There are few publicly available datasets devoted to in-the-wild 3D gaze estimation, as most of them focus on human-screen interaction with a limited range of head and gaze directions. We therefore decided to use the EYEDIAP dataset [13], which consists of 3-minute videos of 16 young adult subjects (12 male, 4 female, mostly Caucasian, age not specified) looking at a specific target (a moving floating ball or a moving point on a screen), with two different lighting conditions for some of the subjects. The videos are further divided into static and moving head-pose sequences for each subject. Note that this dataset does not contain participants wearing glasses or other face accessories.

Our model, pre-trained on the VGG-Face dataset [12] (a dataset of faces comprising 2622 identities), was fine-tuned with the floating-target EYEDIAP videos (both static and moving head pose) of all subjects. We filtered out the frames where the face or landmarks were not properly detected by [9] or where the ground-truth gaze vectors were not valid, which resulted in a total of 88770 frames (see Figure 3). We fine-tuned all the layers except the first convolutional block.
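As an illustration of this fine-tuning setup, the sketch below freezes the first convolutional block of a VGG-style backbone before training the remaining layers. It is a minimal, hypothetical reconstruction: the backbone loader, the layer names and the regression head are our assumptions, not the authors' released code.

```python
# Minimal sketch, assuming a tf.keras VGG-style backbone: freeze the first
# convolutional block and fine-tune the rest. In the actual model the weights
# come from VGG-Face pre-training; VGG16 is used here only for illustration.
import tensorflow as tf

base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))
for layer in base.layers:
    # Block 1 layers are named 'block1_conv1', 'block1_conv2', 'block1_pool'.
    layer.trainable = not layer.name.startswith("block1")

# Hypothetical regression head producing a 3D gaze vector.
x = tf.keras.layers.Flatten()(base.output)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
gaze = tf.keras.layers.Dense(3, name="gaze_vector")(x)
model = tf.keras.Model(base.input, gaze)
```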


Figure 3. Distribution of ground-truth gaze directions and head poses in terms of yaw (horizontal axis) and pitch (vertical axis) angles, with respect to the camera coordinate system, for the filtered floating-target EYEDIAP subset.
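For reference, the yaw and pitch angles plotted in Figure 3 can be derived from a unit gaze vector expressed in the camera coordinate system roughly as follows; the sign conventions in this sketch are our assumption and may differ from the ones used by EYEDIAP.

```python
# Hedged sketch: convert a 3D gaze vector g = (gx, gy, gz), expressed in the
# camera coordinate system, to yaw (horizontal) and pitch (vertical) angles.
# The sign conventions are assumed, not taken from the EYEDIAP documentation.
import numpy as np

def gaze_to_yaw_pitch(g):
    g = np.asarray(g, dtype=float)
    g = g / np.linalg.norm(g)
    yaw = np.degrees(np.arctan2(-g[0], -g[2]))  # rotation around the vertical axis
    pitch = np.degrees(np.arcsin(-g[1]))        # elevation above/below the horizontal
    return yaw, pitch
```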

To validate the accuracy of the person-independent gaze model, we performed leave-four-subjects-out cross-validation. We trained the model using the ADAM optimizer [14] with an initial learning rate of 0.0001, dropout of 0.3, and batch size of 64 frames. The number of training epochs, which is the number of passes of the full training set through the network, was set to 10. To extend the filtered dataset we applied on-the-fly data augmentation to the training set, with the following random transformations: horizontal and vertical shifts, zoom, illumination changes, horizontal flip, and additive Gaussian noise. The final model used in the mutual gaze detection system was trained using all the subjects.
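A minimal sketch of this training configuration is given below, assuming a tf.keras pipeline; the augmentation ranges, the noise standard deviation and the `model` object are placeholders chosen for illustration, not values reported in the paper.

```python
# Sketch of the training setup: ADAM with learning rate 1e-4, batch size 64,
# 10 epochs, and on-the-fly augmentation (shifts, zoom, illumination changes,
# horizontal flips, additive Gaussian noise). Ranges are illustrative.
import numpy as np
import tensorflow as tf

def add_gaussian_noise(img):
    # Additive Gaussian noise; the standard deviation is an arbitrary choice.
    return img + np.random.normal(0.0, 5.0, img.shape)

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1,        # horizontal shifts
    height_shift_range=0.1,       # vertical shifts
    zoom_range=0.1,               # zoom
    brightness_range=(0.7, 1.3),  # illumination changes
    horizontal_flip=True,         # NOTE: gaze labels must be mirrored as well
    preprocessing_function=add_gaussian_noise,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# With face crops X_train and gaze vectors y_train (not shown here):
# model.compile(optimizer=optimizer, loss="mean_squared_error")
# model.fit(augmenter.flow(X_train, y_train, batch_size=64), epochs=10)
```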

Furthermore, to assess the performance of the mutual gaze detection system, we conducted a pilot validation at KOCA (Koninklijk Orthopedagogisch Centrum Antwerpen) in Antwerp, Belgium. KOCA is a home guidance service that supports parents in raising their deaf child and has a standardized laboratory living room for video recordings of face-to-face interaction sessions [1]. We recorded a 4-minute video with two participants: one female within the age range of the training dataset (Participant 1) and one male outside that range (Participant 2). Participant 1 wore glasses during the last 2 minutes of the interaction. The interaction was recorded using two calibrated monocular RGB cameras placed about 2 meters in front of each participant, who were seated around a table at 90º to one another. The video signals from both cameras were merged in the control room into one split-screen image by a mixer (see Figure 4). The final video was recorded at 25 frames per second with a resolution of 720x576 pixels. The interaction session was manually annotated with frame-level accuracy using The Observer XT 12.5 [15] with the following behaviors: ‘Looking at other’s face’, ‘Looking at object’, ‘Looking at hand’, and ‘Looking elsewhere’. We set the head thresholds to X = 20 cm, Y = 40 cm, and Z = 20 cm for both participants, and the window filter size to 7 frames.
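The following sketch illustrates how we understand the per-frame decision and the window filter; the intersection test, the function names and the majority-vote smoothing are our assumptions, since the exact rule is not reproduced here.

```python
# Hedged sketch: participant A is labelled as looking at participant B's face
# if A's gaze ray passes within a box of half-sizes (X, Y, Z) = (20, 40, 20) cm
# around B's head position; the per-frame decisions are then smoothed with a
# 7-frame sliding window (majority vote). This is our illustration of the rule.
import numpy as np

THRESHOLDS = np.array([0.20, 0.40, 0.20])  # X, Y, Z in metres
WINDOW = 7                                 # window filter size in frames

def looking_at_head(eye_pos, gaze_dir, head_pos, thresh=THRESHOLDS):
    """True if the gaze ray from eye_pos passes within the threshold box."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    t = max(np.dot(head_pos - eye_pos, gaze_dir), 0.0)  # closest point on the ray
    closest = eye_pos + t * gaze_dir
    return bool(np.all(np.abs(closest - head_pos) <= thresh))

def smooth(decisions, window=WINDOW):
    """Majority vote over a centred sliding window of `window` frames."""
    half = window // 2
    padded = np.pad(np.asarray(decisions, dtype=int), half, mode="edge")
    return np.array([padded[i:i + window].sum() > half
                     for i in range(len(decisions))])
```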

Figure 4. Setup of the pilot validation at KOCA. The interaction was recorded using two calibrated monocular RGB cameras placed about 2 meters in front of each participant, who were seated around a table at 90º to one another. The video signals from both cameras were merged into one split-screen image by a mixer.

Results

The trained gaze estimation model achieved an average error of 5.5 degrees after 10 epochs in the leave-four-subjects-out cross-validation experiment (see Figure 5). It is difficult to compare this result with current state-of-the-art person-independent, calibration-free gaze estimation methods, since they are generally validated on the screen-target EYEDIAP subset. Nevertheless, the lowest reported error for that subset is 6 degrees [8].

Taking into account that the screen-target subset has less variability in head and eye movements than the floating-target subset, this shows that our model is competitive with state-of-the-art approaches, obtaining better results in a less constrained setting.

Figure 5. Epoch progression for leave-four-subjects-out cross validation, aggregating the results of each cross-validation run.

Figure 6 shows examples of qualitative results of the mutual gaze detection system for both participants in the second evaluation scenario. For Participant 1, facial landmarks and the line of gaze are in general visibly well estimated. Problems usually arise with near-to-profile head poses, as facial landmarks cannot be properly located and the gaze model has not been trained for that angle range. We also see a clear difference when glasses are worn: the gaze predictions are smoother when the participant is not wearing them, whereas they are less precise and fluctuate more with glasses, especially when the glasses frame occludes a larger part of the eye region. For Participant 2 we observe a particular effect: while the X component of the gaze vector is usually correct, the Y component points upwards most of the time, most likely due to the difference in appearance of the participant’s eye region with respect to the gaze model’s training subjects. This can also be observed in some frames where the eye landmarks are detected below the eyes.

Figure 6. Examples of qualitative results for Participant 1 (left) and Participant 2 (right): (1) looking at an object; (2) looking at the other’s face; (3) looking at an object, but landmarks and gaze are not correctly estimated, either because of the glasses or the more near-to-profile position; (4) looking at the other’s face; (5) looking at the object; (6) looking at the other’s face, but eye landmarks are not correctly estimated.

Finally, the event predictions were exported to The Observer XT to compare them to the annotated ground truth. The certainty values were converted to behavior modifiers. To make both coding schemes comparable, the ‘Looking at object’ and ‘Looking at hand’ behaviors used in the manual annotation were treated as ‘Looking elsewhere’. We observed that, even though the Y component of Participant 2’s line of gaze was not correctly estimated, the event predictions were mostly right thanks to the head threshold parameter, which mitigates this deviation. Reliability analysis was carried out omitting the modifiers. A percentage of agreement of 70.2% was found using the duration-sequence method, with a Kappa value of 0.62, which can be regarded as a good level of agreement. Manual examination of the List of Comparisons generated by the software revealed that most disagreements between predictions and ground truth occurred when a subject very briefly looked elsewhere, or when one of the participants was looking at an object located between them.
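As a rough, frame-level approximation of this reliability analysis (The Observer XT’s duration-sequence method is more elaborate), percentage agreement and Cohen’s Kappa over two label sequences can be computed as sketched below; the example labels are made up for illustration.

```python
# Illustrative frame-level approximation of the agreement analysis: percentage
# agreement and Cohen's Kappa between predicted and annotated label sequences.
from collections import Counter

def percent_agreement(pred, truth):
    return 100.0 * sum(p == t for p, t in zip(pred, truth)) / len(truth)

def cohens_kappa(pred, truth):
    n = len(truth)
    p_obs = sum(p == t for p, t in zip(pred, truth)) / n
    cp, ct = Counter(pred), Counter(truth)
    p_exp = sum(cp[k] * ct[k] for k in set(cp) | set(ct)) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp)

# Toy example with the collapsed coding scheme used for the comparison:
pred  = ["face", "elsewhere", "face", "face", "elsewhere"]
truth = ["face", "elsewhere", "elsewhere", "face", "elsewhere"]
print(percent_agreement(pred, truth), round(cohens_kappa(pred, truth), 2))
```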
