

In the document Modeling Driver Distraction (pages 109–113)


4.6. Results and Discussion

4.6.2. Issue 1 – Predictive Quality of the Model

Metric                 | Unit | MAE     | Pearson's r | MAPE  | CntError<10% | CntError<20% | CntError<40%
NOG eyes-off-road      | –    | 1.30    | .890        | 17.5% | 1            | 6            | 10
SGD eyes-off-road      | s    | 0.20    | .490        | 14.4% | 5            | 6            | 10
DRT deterioration      | %    | 22 p.p. | .843        | 25.2% | 0            | 3            | 10
DLP deterioration      | %    | 23 p.p. | .724        | 19.3% | 4            | 7            | 9
DFH deterioration      | %    | 20 p.p. | -.232       | 55.7% | 0            | 1            | 4
TSOT P85               | s    | 2.42    | .795        | 19.2% | 3            | 4            | 9
TGT IVIS P85           | s    | 1.56    | .897        | 12.5% | 4            | 8            | 10
SGD IVIS P85           | s    | 0.45    | .704        | 18.3% | 3            | 6            | 9
TEORT P85              | s    | 1.57    | .905        | 11.7% | 5            | 9            | 10
SGD eyes-off-road P85  | s    | 0.26    | .551        | 14.5% | 4            | 8            | 9

Table 4.2.: Evaluation overview

Table 4.2 presents an overview of the evaluation results. More details and plots for each metric can be found in Appendix D. The upper part of the table holds the evaluation of 13 metrics, based on predicted and measured medians. The lower part displays additional information from evaluating the 85th percentile (P85) for some metrics. The mean absolute error (MAE) column reports the average absolute deviation between prediction and measurement.

Pearson’s r gives the correlation between prediction and measurement (N = 10 tasks).

The mean absolute percentage error (MAPE) is presented in the next column. The last three columns, CntError<x%, show how often the percentage error between prediction and measurement was below x% (for x = 10, 20, 40). A quickly increasing count across the 10%, 20% and 40% columns is therefore desirable; the maximum achievable is ten (tasks).
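The columns described above can be computed as in the following sketch. This is not the author's original script; in particular, the denominator of the percentage error is assumed to be the measured value:

```python
import numpy as np

def evaluate(pred, meas):
    """Compute the Table 4.2 columns for one metric across the tasks.

    pred, meas: predicted and measured values, one per task.
    Assumption: percentage error is relative to the measurement.
    """
    pred, meas = np.asarray(pred, float), np.asarray(meas, float)
    mae = np.mean(np.abs(pred - meas))            # mean absolute error
    r = np.corrcoef(pred, meas)[0, 1]             # Pearson's r
    pct_err = np.abs(pred - meas) / meas * 100.0  # per-task % error
    mape = pct_err.mean()                         # mean absolute % error
    cnt = {x: int(np.sum(pct_err < x)) for x in (10, 20, 40)}  # CntError<x%
    return mae, r, mape, cnt
```

With N = 10 tasks, each CntError<x% entry can reach at most ten, as stated above.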

The TTT unoccluded and TSOT results are based on averaging two trials for each person. For R, the averaged TSOT is divided by the averaged TTT unoccluded. The TTT while driving, TGT (to IVIS) and NOG IVIS (fractional) are averaged results of two trials during AAM testing. For SGD IVIS, the SGD for each trial is calculated based on the fractional approach (TGT/NOG) and then averaged. The TEORT, NOG (eyes-off-road) and SGD (eyes-off-road) are also averaged over two trials; the NOG (eyes-off-road) and SGD (eyes-off-road, TEORT/NOG) are not based on the fractional approach. To measure the DRT deterioration, the median reaction time of the two trials is calculated separately, then averaged and related to the median baseline reaction time (driving with the TDRT).

Regarding the DLP and DFH deterioration, the driving performance during AAM testing is averaged over the two trials and related to baseline driving.
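Two of the aggregation rules above can be sketched as follows. The function names and the exact deterioration formula (relative change in percent) are assumptions for illustration, not the author's original computation:

```python
import numpy as np

def sgd_ivis(tgt_trials, nog_trials):
    # Fractional approach: SGD = TGT / NOG per trial, then averaged.
    return float(np.mean([t / n for t, n in zip(tgt_trials, nog_trials)]))

def deterioration(aam_trials, baseline):
    # DLP/DFH-style deterioration: averaged AAM trials related to
    # baseline driving (assumed here: relative change in percent).
    return (np.mean(aam_trials) - baseline) / baseline * 100.0
```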

The 85th percentiles (P85) of the measurements are calculated with the interpolating Excel function (quantile 0.85).
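Excel's interpolating percentile uses the inclusive, linearly interpolating quantile definition, which corresponds to NumPy's default method; a quick sketch with hypothetical data:

```python
import numpy as np

# NumPy's default 'linear' interpolation matches Excel's inclusive
# interpolating percentile for quantile 0.85.
data = [1.0, 2.0, 3.0, 4.0]
p85 = np.percentile(data, 85)  # rank (n-1)*0.85 = 2.55 -> 3 + 0.55*(4-3)
print(p85)  # 3.55
```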

The tasks and modeling for the predictions are documented in Section 4.2 (p. 74). While some tasks are modeled in a relatively complex manner, Task 3 and Task 9 are mapped to basic subtasks (enter a phone number on a touchscreen and enter a phone number with a rotary knob). Therefore, these two tasks can also be used as a kind of retest check and reference; the detailed results are reported in Appendix D. When considering these detailed tables, it is also advisable to keep in mind that the first six tasks (Task 1 – Task 6) are touchscreen tasks, and Task 7 – Task 10 are rotary knob tasks. Task 3, Task 4 and Task 5 are essentially the same task (entering a phone number) with specific modifications.

Discussion

For the TTT unoccluded, the MAPE (23.6%) would be slightly above the accepted 20% limit (cf. Section 2.4). A deeper examination of the results (Appendix D.1) reveals that the difference primarily originates from the rotary knob tasks. In general, the (static) TTT unoccluded is not too important for driver distraction assessments. While the main difference for the TSOT (MAPE 19.8%) still stems from differences in the rotary knob predictions (Appendix D.2), this underestimation is diminished for the dynamic TTT while driving (MAPE 13.4%; Appendix D.4), TGT (MAPE 15.1%; Appendix D.5) and TEORT (MAPE 15.4%; Appendix D.8). The reason for the surprisingly slow performance (TTT unoccluded and TSOT) of the subjects in rotary knob tasks is unclear. The congruency for TTT while driving indicates that the test subjects would be able to perform similarly to the subjects in the subtask database under the given experimental conditions.

A possible explanation could be that the subjects had chosen an individually slower user pace on the rotary knob for TTT unoccluded and TSOT. The additional driving task may accelerate the user pace and render it similar to the subtask database. Therefore, the driving task might have a beneficial experimental impact by interacting with the user pace and diminishing differences between experiments. The (static) TTT unoccluded surprisingly seems to be one of the hardest metrics to predict. The R-metric benefits from the cancellation of the user pace by the division (TSOT/TTT unoccluded), since the user pace affects both the numerator and the denominator.

In addition, the R-metric and the Single Glance Durations have typical ranges (e.g., R: 0.7–1; SGD: 1–2 s). These also limit deviations in MAE and MAPE.
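The cancellation argument can be made concrete: if an individual pace factor scales both TSOT and TTT unoccluded by the same amount, R is unchanged. A toy illustration with hypothetical values:

```python
def r_metric(tsot, ttt_unoccluded):
    # R = TSOT / TTT unoccluded (cf. Section 2.4).
    return tsot / ttt_unoccluded

base_tsot, base_ttt = 8.0, 10.0   # hypothetical task times in seconds
pace = 1.3                        # a 30% slower individual user pace

# The pace factor scales numerator and denominator alike and cancels:
assert r_metric(pace * base_tsot, pace * base_ttt) == r_metric(base_tsot, base_ttt)
```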

When considering the CntError<10% column of the NOG IVIS, it is visible that six tasks were predicted with a deviation below 10% (all touchscreen tasks; Appendix D.6).

The NOG eyes-off-road seems harder to predict. A short check was carried out: the glance visualization in the online interface of the model shows that, during the 10-digit input subtask on the touchscreen, two speedometer glances were registered (across all 24 subjects). In the comparable evaluation of Task 3, the subjects together glanced 16 times at the speedometer during the first trial of the touchscreen phone input; in the second trial, 16 glances were also observed. More (short) unpredicted glances lower the SGDs (eyes-off-road). This could be a reasonable explanation for why the SGD eyes-off-road is over-predicted (Appendix D.10). This indicates that eyes-off-road metrics can make experiments more susceptible to disturbances and can lead to counterintuitive results: for example, the SGD IVIS P85 (Appendix D.14.3) for Task 3 is about 3.5 s, while the SGD eyes-off-road P85 (Appendix D.14.5) would report 2.3 s. The glances to the IVIS therefore appear longer than the glances away from the road.

The evaluation results for the 85th percentiles (P85) in the final part of Table 4.2 appear no worse (Pearson's r and MAPE) than the predictions of the median in the upper part of the table. It is questionable whether the relaxed acceptance criterion of 40% for higher percentiles (cf. Section 2.4) is actually necessary.

The DRT, DLP and DFH deteriorations display considerable variability (Appendices D.11, D.12 and D.13).

Predictions of the DFH deterioration are unacceptable: MAPE 55.7%, weak correlation (r = -0.23) and only four tasks could be predicted with < 40% deviation.

The DRT predictions are slightly beyond the acceptance criterion (MAPE 25.2%).

Seven tasks deviate 20–40% (difference of the last two columns). The high correlation (r = 0.84) is a benefit. A closer investigation of the detailed table (Appendix D.11) reveals that all tasks were under-predicted. This explains why a high correlation, combined with an unfortunate error (MAPE), can be observed. The reasons for the offset are unclear.

The online interface of the prediction model also includes bootstrap indicators. The model bootstraps a sample of 24 persons from the N = 24 subjects 1,000 times. These bootstrapped data sets are compared to the guideline criteria. Based on this result, an indicator is calculated as the percentage of how often the result passed the criteria. The comparison of the indicators to the measurement outcomes can be found in Appendix D.
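The bootstrap indicator described above can be sketched as follows. The pass condition (resampled mean below the criterion) is an assumption for illustration; the actual guideline criteria differ per metric:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def bootstrap_indicator(subject_values, criterion, n_boot=1000):
    """Resample the N = 24 per-subject results with replacement n_boot
    times and report how often the resampled mean passes the guideline
    criterion (assumed here: mean below the criterion)."""
    values = np.asarray(subject_values, float)
    passes = 0
    for _ in range(n_boot):
        sample = rng.choice(values, size=values.size, replace=True)
        if sample.mean() < criterion:  # pass condition is an assumption
            passes += 1
    return 100.0 * passes / n_boot  # indicator in percent
```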

The indicators for TSOT, TGT and TEORT appear valuable, while it must still be kept in mind that the model should present an approximate idea and estimation. For SGD IVIS (AAM), the indication can be helpful. The mean SGD eyes-off-road (NHTSA) indicator is questionable. This can also be due to the potentially unreliable eyes-off-road metric.

The DFH bootstrap indicator is based on a likely less reliable metric and is therefore judged useless. The DLP bootstrap indicator demonstrates a positive performance (Appendix D.14.6), with a correlation of -0.75. Nevertheless, the DLP also includes one of the worst predictions: Task 4 (Phone Delay) with an over-prediction of 74% (Appendix D.12).

The delay subtasks were measured embedded in a complex application (see the construction of the prediction model, Section 3.2). It is possible that some subjects used longer delays to adjust their lane positions, which can cause higher DLP values. However, these DLP values are still lower than those of typical visual/manual subtask interactions (e.g., dialing a number). In the evaluation of Task 4, the delay is at the beginning of the task. At the beginning, there should be no reason to make larger adjustments to the lane position. This probably resulted in the very low measured DLP values during the evaluation experiment.

This is an indication that the position and combination (order) of subtasks within the prediction might sometimes be important. Measuring and storing all of these (hidden) potential interdependences between subtask combinations appears hardly possible. For the current model, the order of the subtasks is neglected.

The selection of subtasks to model a task is, to some extent, subjective. This topic is not assessed in this evaluation and thesis. It is foreseeable that different persons may choose slightly different subtasks. The description of the modeling for the ten tasks (Section 4.2) may help to find a reasonable mapping. It must also be kept in mind that the model handles visual/manual interfaces. When a delay (e.g., after manually dialing a phone number) is ended by an acoustic event (e.g., a ringing tone), this is a mixture of visual/manual and auditory interfaces that cannot be predicted with the current model.

Overall, the model makes generally reasonable predictions. The aim of offering approximate estimates for prototypes is achieved. The DFH deterioration metric should be ignored, disabled or hidden in a future version of the online interface. When DFH is excluded, the mean coefficient of determination of Table 4.2 would be R² = .614. The overall average MAPE without DFH is 16% (min 9.3%, max 25.2%).

Comparing this result to some other evaluation experiments, already reported in Section 2.4, helps in judging the performance and emphasizes the distinctions to other models.

Pettitt's method was evaluated in Kang et al. (2013) to model occlusion task times, with fits of R² = 0.88 and R² = 0.92. Salvucci (2005) reports a fit of R² > .99 for modeling the task time while driving of four short tasks. For a TEORT prediction, Purucker et al. (2017) report r = 0.58, which corresponds to R² = 0.34.

Compared to these results, the final outcome of this thesis (R² = .614) lies between the impressively high fits of approximately R² = 0.9 and the improvable R² = 0.34. The other models are typically restricted to predicting one or a few metrics and usually do not provide data for, e.g., 85th percentiles. The model built in this thesis provides predictions for different assessment methods and uses distributions to derive, e.g., 85th percentiles.
