
4.3 Experimental estimation of Cognitive Workload

4.3.4 Discussion

as no CW. All remaining levels are taken as present CW. With this new binary target, the average accuracy in 10-fold CV across all classifiers is found to be 90.22±4.06 %.
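A minimal sketch of this binary reformulation and the 10-fold CV evaluation is given below; the feature matrix X (EDA/HR features), the label vector y_levels, the classifier settings, and the helper name evaluate_binary_cw are illustrative assumptions and do not reproduce the exact pipeline used in this work.

```python
# Illustrative sketch: binarize the 5-level CW labels (level 1 = no CW) and
# evaluate the tested classifiers with 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def evaluate_binary_cw(X, y_levels):
    # Level 1 is treated as "no CW"; all remaining levels as "CW present".
    y_binary = (np.asarray(y_levels) > 1).astype(int)

    classifiers = {
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(n_neighbors=4),
        "GP": GaussianProcessClassifier(),
        "NB": GaussianNB(),
        "DT": DecisionTreeClassifier(),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    results = {}
    for name, clf in classifiers.items():
        pipe = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(pipe, X, y_binary, cv=cv)
        results[name] = (scores.mean(), scores.std())
    return results
```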

Considering the binary classification task, the ranking of the tested classifiers remains largely unchanged. However, in the binary setting, GP outperforms SVM and KNN.

Again, the difference between SVM and GP (0.7 %) is minimal with respect to the variability found within the CV (1.1 %). In contrast to the fine-grained task, the NB- and DT-based models also provide acceptable classification results. The accuracies found with NB and DT are 85.10±1.44 % and 87.51±2.17 %, respectively.

The absolute distance between the 2 lowest-ranked classifiers (NB, DT) and the 3 top-ranked classifiers (SVM, KNN, GP) is lower in the binary task (6.53 %) compared to the fine-grained task (16.41 %). This indicates that the mid-level CW variation dominates the complexity of the classification. This assumption is further supported by the results found with the LOGO CV, where the mean accuracy is 82.21±1.25 %.

The best results are found with GP or SVM (absolute difference in accuracy of 0.05 %). Compared to the 10-fold CV, this is an absolute difference of 8.01 %, or a relative deterioration of 9.29 %. Again, this supports the assumption of confusion among the mid-level CW classes.
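The LOGO evaluation can be sketched in the same manner, assuming a per-sample group label that identifies the participant; the names evaluate_logo and participant_ids are hypothetical and the SVM default is only a placeholder.

```python
# Illustrative sketch: leave-one-group-out CV with participants as groups,
# so every fold tests on a person unseen during training.
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_logo(X, y, participant_ids, clf=None):
    clf = clf if clf is not None else SVC()
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, groups=participant_ids,
                             cv=LeaveOneGroupOut())
    return scores.mean(), scores.std()
```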

By relaxing the fine-grained constraint and thus separating binary targets only, the estimation is improved and comparable to that found with 10-fold CV. This result also suggests that generalization across a more diverse set of participants is possible, even without using personalized characteristics.


[Figure 4.9: two 5×5 confusion matrices of actual vs. predicted workload level (levels 1–5); (a) training set accuracy – 79.8 %, (b) test set accuracy – 74.5 %]

Figure 4.9: Comparison results of a 5-level CW estimation in an autonomous driving scenario, as reported by Manawadu et al. [155]. Similar accuracy and variability patterns with respect to confusion were found in this work. Reprinted with permission, ©2018, IEEE.

Comparable rankings could be found for temporal demands and frustration. For both items, it was also found that the mixed memory and reaction phase was rated the highest. In general, the items of the NASA-TLX score were ranked equally for all phases, except for the Stroop test: with regard to effort and physical demand, it was given the highest rating, whereas for all other items it was given the lowest ranking.

It can be concluded that the Stroop test differs significantly from all other phases of the experiment, while the remaining phases of the experiment are interchangeable to some extent. In summary, no difference could be found between the uni-modal CW measure (Likert scale) and the median of all TLX-items. Thus, for further investigation, additional studies to examine or classify the different tasks would be interesting. Based on these insights, an experiment could then be controlled in more detail so that it concentrates on a single item (e.g. frustration or effort) only. Also, it could be helpful to challenge the participants more and thus allow for a wider or more evenly distributed range of perceived difficulties. Besides the more detailed view on the dimensions of CW, additional performance measures like error rate or time-on-task could help to further clarify the variation in subjectively perceived CW.
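A simple way to check the reported agreement between the uni-modal Likert rating and the median of the NASA-TLX items is a paired non-parametric test, sketched below; the scale maxima (likert_max, tlx_max) and the function name are assumptions made purely for illustration.

```python
# Illustrative sketch: compare the single-item Likert rating with the
# per-trial median of the NASA-TLX items using a paired, non-parametric test.
import numpy as np
from scipy.stats import wilcoxon

def compare_likert_vs_tlx(likert_scores, tlx_items, likert_max=7, tlx_max=20):
    # Rescale both ordinal measures to [0, 1] to make them comparable
    likert = np.asarray(likert_scores, dtype=float) / likert_max
    tlx_median = np.median(np.asarray(tlx_items, dtype=float), axis=1) / tlx_max
    # Wilcoxon signed-rank test on the paired differences
    stat, p_value = wilcoxon(likert, tlx_median)
    return stat, p_value
```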

Considering the observed mid-level CW confusion, it can be inferred that the (self-reported) target values, the predictors (EDA, HR), or both are affected by noise.

Interestingly, there is no conclusive answer as to whether the noise lies in the (self-reported) target values, in the predictors (EDA, HR), or in both. Nevertheless, it can be seen that even the most advanced methods suffer from this mid-level CW confusion.

An example is found in the work of Manawadu et al. [155]. They implemented a fine-grained estimation model for CW in the setting of semi-autonomous driving. The reported accuracy for the 5-level fine-grained estimation is similar to that presented in this chapter (74.5 %, Figure 4.9).

Strikingly, even with the deep ANN-based classification approach presented by Manawadu et al. [155], and additional sensory information based on EEG signals, the confusion for the mid-level CW (i.e. levels 2-5) remains. Manawadu et al. [155] work around this “human error” by altering the ground-truth CW labels and using a soft-threshold approach. With this approach, misclassifications within a range of ±1 level are accepted as correct. As a result, accuracy is improved to 96.5 %. This again is in close agreement with the binary classification result presented in this chapter (sec. 4.3.3.3).
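Such a soft-threshold evaluation can be expressed concisely; the sketch below counts a prediction as correct if it deviates by at most one level, with function and variable names chosen for illustration only.

```python
# Illustrative sketch of a soft-threshold (±1 level) accuracy measure.
import numpy as np

def soft_threshold_accuracy(y_true, y_pred, tolerance=1):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # A prediction counts as correct if |predicted - actual| <= tolerance
    return np.mean(np.abs(y_pred - y_true) <= tolerance)
```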

Taking the assumption of noisy predictors and target values into account, GP was included in the comparison. This is because GPs are well known to act as linear smoothers and, therefore, generally provide good performance in noisy settings [187]. Due to the soft-margin approach, SVM can also act as a linear smoother, given appropriate regularization. Both methods are known to provide a good trade-off regarding the bias-variance dilemma.
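For reference, the sketch below shows how these two smoothing behaviours surface as hyper-parameters in a typical implementation; the kernel choice and parameter values are assumptions for demonstration, not the settings used in this work.

```python
# Illustrative configuration of the two "smoothing" classifiers.
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Soft-margin SVM: a smaller C tolerates more margin violations and thus
# smooths over noisy samples instead of fitting them exactly.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# GP classifier: the RBF length scale acts as a smoothing bandwidth and is
# optimized during fitting by maximizing the marginal likelihood.
gp = GaussianProcessClassifier(kernel=ConstantKernel(1.0) * RBF(length_scale=1.0))
```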

Indeed, GP and SVM-based classification outperformed DT and NB classification.

However, the best results were found using KNN. KNN-based classification, in turn, is known to be prone to noise (over-fitting), especially if the value of k is chosen very small. It was observed that, without regularization through CV, a neighborhood of k=1 gave the best accuracy. In the final model, the neighborhood size is still small (k=4). It cannot be excluded that the high accuracy found is an effect of over-fitting to noise. This also agrees with the finding of a low accuracy using NB classification, as NB is known to be robust against noise by trading off towards a bias error. Nevertheless, another reason why NB could have performed worst is the lack of independence among the features in the feature set.
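How the neighborhood size k can be regularized through CV, instead of accepting the noise-prone k=1 optimum, is sketched below; the parameter grid and the helper name tune_knn are assumptions for illustration.

```python
# Illustrative sketch: select k for KNN via 10-fold cross-validation.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def tune_knn(X, y):
    pipe = Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
    grid = {"knn__n_neighbors": list(range(1, 16))}
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    search = GridSearchCV(pipe, grid, cv=cv)
    search.fit(X, y)
    # A very small best k (e.g. k=1) hints at over-fitting to noisy samples.
    return search.best_params_["knn__n_neighbors"], search.best_score_
```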

For this reason, it is of interest to have a more detailed view of the correlation between self-reported CW and objective measures, e.g. error rate or time-on-task.

Furthermore, the question arises whether, using bodily reactions alone, a separation among different dimensions of CW (i.e. mental demands, effort, time pressure, frustration) is possible at all [91]. It remains an open challenge for future research to investigate and thoroughly clarify the distortion of self-reported CW. To do so, at least a more controlled experimental setup is required. This is based on the observation that no significant difference could be found between the NASA-TLX and the simplistic uni-modal Likert scale. Hence, it could be concluded that a setting that exclusively targets a single dimension, e.g. time pressure, is needed.

Nevertheless, according to the results for the 3 top-ranked models, misclassification rarely exceeded one class (or level). Therefore, despite the lower overall accuracy, the fine-grained estimation should be favored because it facilitates a detailed specification of the perceived CW. Although comparable accuracy was found for the top 3 models (GP, SVM, and KNN), they differ in implementation details.
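The claim that misclassification rarely exceeds one level can be quantified directly from a confusion matrix, as sketched below; the function name and the assumption that rows hold the actual levels are illustrative.

```python
# Illustrative sketch: distribution of the absolute distance between actual
# (rows) and predicted (columns) workload level in a confusion matrix.
import numpy as np

def level_distance_distribution(confusion):
    conf = np.asarray(confusion, dtype=float)
    n = conf.shape[0]
    dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    total = conf.sum()
    # Fraction of samples at each level distance 0, 1, ..., n-1
    return {d: conf[dist == d].sum() / total for d in range(n)}
```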