Model Performance on Classification

All three age estimation methods of this work (M1, M2, and M3) were also evaluated on the majority classification (18-year limit) with an “extended” 5-fold cross-validation. This included evaluating the models on all ten training rounds (all) or only on the best one of each fold (best). The reference statistical evaluation on the training set is designated as stat in the tables and represents a naive classifier with 100% sensitivity and 0% specificity, i.e. it applies the principle of “in dubio pro reo”.
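The stat baseline can be reproduced with a trivial rule that labels every subject a minor. A minimal sketch, assuming minors are encoded as the positive class 1 (the label encoding and the toy label vector are illustrative, not taken from the thesis):

```python
import numpy as np

# Hypothetical label encoding: 1 = minor (positive class), 0 = adult.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_stat = np.ones_like(y_true)  # "in dubio pro reo": predict minor for everyone

tp = np.sum((y_stat == 1) & (y_true == 1))
tn = np.sum((y_stat == 0) & (y_true == 0))
fp = np.sum((y_stat == 1) & (y_true == 0))
fn = np.sum((y_stat == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)      # equals the prevalence of minors
sensitivity = tp / (tp + fn)            # 1.0 by construction
specificity = tn / (tn + fp)            # 0.0 by construction
# A constant predictor ranks adults and minors identically, hence AUC = 50%.
```

This makes explicit why the stat accuracy in the tables simply mirrors the share of minors in each training set.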

The designations of the model variants from regression are extended to include the ML algorithms for classification: k-nearest neighbours classifier (KNC), support-vector classifier (SVC), decision tree classifier (DTC), random forests classifier (RFC), extremely randomized trees classifier (ETC), and gradient tree boosting classifier (GBC).
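Assuming a scikit-learn implementation (the library is not named in the text), these abbreviations map onto standard estimators roughly as follows; all hyperparameters are left at illustrative defaults:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)

# Hypothetical mapping from the thesis abbreviations to estimators.
CLASSIFIERS = {
    "KNC": KNeighborsClassifier(),
    "SVC": SVC(probability=True),   # probability estimates are needed for AUC
    "DTC": DecisionTreeClassifier(),
    "RFC": RandomForestClassifier(),
    "ETC": ExtraTreesClassifier(),
    "GBC": GradientBoostingClassifier(),
}
```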

7.3 Age Estimation Results

The age distribution of the subjects analyzed with Method 1 was slightly imbalanced towards minors and thus resulted in an accuracy of 61.33% for the statistical evaluation of the training set. To account for the imbalance, class weights were passed to the ML algorithms before training. All fitted classifiers outperformed stat, but none surpassed 80% in all metrics (Table 7.8).
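One common weighting scheme, the “balanced” rule used by scikit-learn, can be sketched as follows; the 61/39 split is a hypothetical stand-in mirroring the reported 61.33% minor prevalence:

```python
import numpy as np

# Hypothetical training labels mirroring the slight surplus of minors.
y_train = np.array([1] * 61 + [0] * 39)   # 1 = minor, 0 = adult

# "Balanced" scheme: w_c = n_samples / (n_classes * n_c)
classes, counts = np.unique(y_train, return_counts=True)
weights = len(y_train) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
# The rarer class (adults) receives a weight > 1, so misclassifying an
# adult costs more during training; minors receive a weight < 1.
```

Such a dictionary can then be passed, e.g., as the `class_weight` parameter of tree-based scikit-learn classifiers.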

Learning from AM alone delivered insufficient results. Using OS instead of SKJ as a feature for growth plate maturation gave slightly better metrics. The combination of AM and SKJ did not improve the results as it did in the regression task, but rather hurt accuracy and sensitivity. The best model for M1 was a GBC based on OS as input data, with an average accuracy of 81.14%, sensitivity of 82.73%, specificity of 78.46%, and AUC of 83.18%.
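The “all” vs. “best” aggregation over the ten training rounds per fold can be sketched as below. Note that the criterion for picking the best round is not specified in this section; selecting it by validation accuracy is an assumption, and all scores are randomly generated placeholders:

```python
import numpy as np

# Hypothetical validation/test accuracies: 5 folds x 10 training rounds.
rng = np.random.default_rng(0)
val_acc = rng.uniform(0.6, 0.9, size=(5, 10))
test_acc = rng.uniform(0.6, 0.9, size=(5, 10))

# "all": average the test metric over every round of every fold.
all_score = test_acc.mean()

# "best": per fold, keep only the round that scored highest on validation
# (assumed selection criterion), then average those five test scores.
best_rounds = val_acc.argmax(axis=1)            # one round index per fold
best_score = test_acc[np.arange(5), best_rounds].mean()
```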

Table 7.8: Performance on majority classification of several model variants from Method 1 (M1) on the test sets in an “extended” 5-fold cross-validation using AM, OS, and SKJ

Rounds  Data    Classifier  Acc.   Sens.   Spec.  AUC
-       -       stat        61.33  100.00   0.00  50.00
all     AM      KNC         70.29   76.36  60.00  73.81
all     OS      GBC         80.57   81.82  78.46  83.36
all     SKJ     GBC         80.00   80.91  78.46  83.15
all     AM+SKJ  ETC         74.29   69.09  83.08  83.92
best    AM      KNC         77.71   80.00  73.85  76.92
best    OS      GBC         81.14   82.73  78.46  83.18
best    SKJ     GBC         80.00   80.91  78.46  83.15
best    AM+SKJ  ETC         76.74   71.55  85.54  85.87

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

All listed classifiers of Method 2 achieved above 80% in accuracy, sensitivity, specificity, and AUC (Table 7.9). The best-performing classifiers on coronal MRIs were RFCs and incorporated either only the coronal MRIs or all data; both surpassed 89% in accuracy. The RFC on MRIs only had a slightly higher average sensitivity and AUC, which could prove advantageous compared to the RFC on all data with its higher specificity, depending on the preferred outcome.

Method 2 was also trained on a larger number of sagittal MRIs, with the distribution of the training set marginally inclined towards minors (52.26%). The models performed better than stat when evaluated on all ten training rounds and on the best one per fold (Table 7.10). The RFC trained only on sagittal MRIs attained the best average metrics, with an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% over all folds.

Table 7.9: Performance on majority classification of several model variants from Method 2 (M2) on the test sets in an “extended” 5-fold cross-validation using coronal MRIs, AM, and SKJ

Rounds  Data        Classifier  Acc.   Sens.   Spec.  AUC
-       -           stat        49.25  100.00   0.00  50.00
all     COR         CNN+SVC     85.71   86.36  84.62  90.82
all     COR+AM+SKJ  CNN+RFC     83.49   81.36  87.08  89.55
best    COR         CNN+RFC     89.14   89.09  89.23  92.52
best    COR+AM+SKJ  CNN+RFC     89.71   88.18  92.31  91.99

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

Table 7.10: Performance on majority classification of several model variants from Method 2 (M2) on the test sets in an “extended” 5-fold cross-validation using sagittal MRIs

Rounds  Data  Classifier  Acc.   Sens.   Spec.  AUC
-       -     stat        52.26  100.00   0.00  50.00
all     SAG   CNN+SVC     87.47   88.41  86.13  94.33
best    SAG   CNN+RFC     90.93   88.64  94.19  94.38

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

The last method evaluated on majority classification of the 18-year limit was Method 3. The gradient tree boosting classifier (GBC) achieved metrics under 80%, except for the AUC, when considering all ten training rounds (Table 7.11). In contrast, the RFC was the best model, with an average accuracy of 86.86%, a sensitivity of 85.46%, a specificity of 89.23%, and an AUC of 88.53% over all folds.

Table 7.11: Performance on majority classification of several model variants from Method 3 (M3) on the test sets in an “extended” 5-fold cross-validation using coronal MRIs, AM, and SKJ

Rounds  Data        Classifier  Acc.   Sens.   Spec.  AUC
-       -           stat        49.25  100.00   0.00  50.00
all     COR+AM+SKJ  CNN+GBC     76.34   74.91  78.77  83.96
best    COR+AM+SKJ  CNN+RFC     86.86   85.46  89.23  88.53

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

Summary

In summary, several successful models were trained for the age regression and majority classification tasks. Comparing the three methods of this work (M1, M2, and M3), Method 2 proved to be the optimal approach to solve both tasks.

For regression, M2 was best configured as a CNN trained on coronal MRIs followed by an extremely randomized trees regressor. The ETR used the CNN age predictions per image slice, the AM, and the SKJ to regress the final chronological age of an individual. It achieved an average MAE of 0.69±0.47 years and a maximum AE of 2.15 years on the test sets of five different folds. Each fold included 35 test subjects, amounting to a total of 175 different subjects evaluated with the aforementioned model. The predictions of the ETR are plotted over the true chronological ages of all test subjects (Fig. 7.18). The green central line highlights a perfect prediction, while the two parallel grey lines encompass 95% of the model predictions. The predictions lie relatively close to and evenly distributed along the green line, except for a few outliers outside the area between the grey lines.
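The stacking of per-slice CNN predictions with AM and SKJ into an ETR can be sketched as below. All data here are synthetic stand-ins, and the slice count and feature scales are assumptions, not values from this work:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(42)
n_subjects, n_slices = 175, 8      # slice count per subject is an assumption

# Hypothetical stand-ins: true ages, noisy per-slice CNN age predictions,
# and two ordinal staging features (AM, SKJ) per subject.
age = rng.uniform(13, 25, n_subjects)
cnn_slice_preds = age[:, None] + rng.normal(0.0, 1.0, (n_subjects, n_slices))
am = rng.integers(1, 5, n_subjects)
skj = rng.integers(1, 5, n_subjects)

# Second-stage feature matrix: one row per subject.
X = np.column_stack([cnn_slice_preds, am, skj])
etr = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, age)
mae = np.mean(np.abs(etr.predict(X) - age))   # in-sample, illustration only
```

The in-sample MAE here only demonstrates the data flow; the thesis figures come from held-out test folds.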

For classification, the best method was M2 as well, but using sagittal instead of coronal MRIs. RFC proved to be the best ML algorithm to learn from the age predictions of the CNN to discriminate between adults and minors. It achieved a high performance on this task, with an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% over all folds. The ROC curve suggests that the model can increase its sensitivity at the cost of specificity, or conversely (Fig. 7.19).
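This sensitivity/specificity trade-off corresponds to moving the decision threshold along the ROC curve. A minimal sketch with synthetic scores (the score distributions are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical "probability of minor" scores from a classifier.
rng = np.random.default_rng(1)
y_true = np.array([1] * 50 + [0] * 50)          # 1 = minor, 0 = adult
scores = np.clip(np.concatenate([rng.normal(0.75, 0.15, 50),   # minors
                                 rng.normal(0.30, 0.15, 50)]), # adults
                 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
# Each threshold is one point on the curve: lowering it raises the
# sensitivity (tpr) while the specificity (1 - fpr) drops, and vice versa.
t_sensitive = thresholds[np.argmax(tpr >= 0.95)]  # first point with sens >= 0.95
```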

Figure 7.18: Predicted vs. true chronological age of test subjects from all five folds (n = 5 × 35 = 175) using a CNN followed by an ETR based on Method 2. The green central line highlights a perfect prediction, while the two parallel grey lines encompass 95% of the data.

Figure 7.19: ROC curve for the best model on majority classification. The random forest classifier (RFC) attains an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% averaged over five distinct folds. The high mean AUC suggests that shifting the threshold can further improve the sensitivity at the expense of the specificity, or vice versa.

8 Discussion