Model Performance on Classification

All three age estimation methods of this work (M1, M2, and M3) were also evaluated on the majority classification (18-year limit) with an “extended” 5-fold cross-validation. This included evaluating the models on all ten training rounds (all) or only on the best one of each fold (best). The reference statistical evaluation on the training set is designated as stat in the tables and represents a naive classifier with 100% sensitivity and 0% specificity, i.e. it applies the principle of “in dubio pro reo”.
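The stat baseline can be reproduced with a trivial rule that labels every subject a minor. A minimal sketch, assuming minors are encoded as the positive class 1 (the label encoding and the toy label vector are illustrative, not taken from the thesis):

```python
import numpy as np

# Hypothetical label encoding: 1 = minor (positive class), 0 = adult.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_stat = np.ones_like(y_true)  # "in dubio pro reo": predict minor for everyone

tp = np.sum((y_stat == 1) & (y_true == 1))
tn = np.sum((y_stat == 0) & (y_true == 0))
fp = np.sum((y_stat == 1) & (y_true == 0))
fn = np.sum((y_stat == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)      # equals the prevalence of minors
sensitivity = tp / (tp + fn)            # 1.0 by construction
specificity = tn / (tn + fp)            # 0.0 by construction
# A constant predictor ranks adults and minors identically, hence AUC = 50%.
```

This makes explicit why the stat accuracy in the tables simply mirrors the share of minors in each training set.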

The designations of the model variants from regression are extended to include the ML algorithms for classification: k-nearest neighbours classifier (KNC), support-vector classifier (SVC), decision tree classifier (DTC), random forests classifier (RFC), extremely randomized trees classifier (ETC), and gradient tree boosting classifier (GBC).
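Assuming a scikit-learn implementation (the library is not named in the text), these abbreviations map onto standard estimators roughly as follows; all hyperparameters are left at illustrative defaults:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)

# Hypothetical mapping from the thesis abbreviations to estimators.
CLASSIFIERS = {
    "KNC": KNeighborsClassifier(),
    "SVC": SVC(probability=True),   # probability estimates are needed for AUC
    "DTC": DecisionTreeClassifier(),
    "RFC": RandomForestClassifier(),
    "ETC": ExtraTreesClassifier(),
    "GBC": GradientBoostingClassifier(),
}
```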

7.3 Age Estimation Results

The age distribution of the subjects analyzed with Method 1 was slightly imbalanced towards minors and thus resulted in an accuracy of 61.33% for the statistical evaluation of the training set. To account for the imbalance, class weights were passed to the ML algorithms before training. All fitted classifiers outperformed stat, but none surpassed 80% in all metrics (Table 7.8).
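One common weighting scheme, the “balanced” rule used by scikit-learn, can be sketched as follows; the 61/39 split is a hypothetical stand-in mirroring the reported 61.33% minor prevalence:

```python
import numpy as np

# Hypothetical training labels mirroring the slight surplus of minors.
y_train = np.array([1] * 61 + [0] * 39)   # 1 = minor, 0 = adult

# "Balanced" scheme: w_c = n_samples / (n_classes * n_c)
classes, counts = np.unique(y_train, return_counts=True)
weights = len(y_train) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
# The rarer class (adults) receives a weight > 1, so misclassifying an
# adult costs more during training; minors receive a weight < 1.
```

Such a dictionary can then be passed, e.g., as the `class_weight` parameter of tree-based scikit-learn classifiers.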

Learning from AM alone delivered insufficient results. Using OS instead of SKJ as a feature for growth plate maturation gave slightly better metrics. The combination of AM and SKJ did not improve the results as it did in the regression task, but rather hurt accuracy and sensitivity. The best model for M1 was a GBC based on OS as input data, with an average accuracy of 81.14%, sensitivity of 82.73%, specificity of 78.46%, and AUC of 83.18%.
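The “all” vs. “best” aggregation over the ten training rounds per fold can be sketched as below. Note that the criterion for picking the best round is not specified in this section; selecting it by validation accuracy is an assumption, and all scores are randomly generated placeholders:

```python
import numpy as np

# Hypothetical validation/test accuracies: 5 folds x 10 training rounds.
rng = np.random.default_rng(0)
val_acc = rng.uniform(0.6, 0.9, size=(5, 10))
test_acc = rng.uniform(0.6, 0.9, size=(5, 10))

# "all": average the test metric over every round of every fold.
all_score = test_acc.mean()

# "best": per fold, keep only the round that scored highest on validation
# (assumed selection criterion), then average those five test scores.
best_rounds = val_acc.argmax(axis=1)            # one round index per fold
best_score = test_acc[np.arange(5), best_rounds].mean()
```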

Table 7.8: Performance on majority classification of several model variants from Method 1 (M1) on the test sets in an “extended” 5-fold cross-validation using AM, OS, and SKJ

Rounds  Data    Classifier  Acc.   Sens.   Spec.  AUC
-       -       stat        61.33  100.00   0.00  50.00
all     AM      KNC         70.29   76.36  60.00  73.81
all     OS      GBC         80.57   81.82  78.46  83.36
all     SKJ     GBC         80.00   80.91  78.46  83.15
all     AM+SKJ  ETC         74.29   69.09  83.08  83.92
best    AM      KNC         77.71   80.00  73.85  76.92
best    OS      GBC         81.14   82.73  78.46  83.18
best    SKJ     GBC         80.00   80.91  78.46  83.15
best    AM+SKJ  ETC         76.74   71.55  85.54  85.87

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

All listed classifiers of Method 2 achieved above 80% in accuracy, sensitivity, specificity, and AUC (Table 7.9). The best-performing classifiers on coronal MRIs were RFCs and incorporated either only the coronal MRIs or all data; both surpassed 89% in accuracy. The RFC on MRIs only had a slightly higher average sensitivity and AUC, which could prove advantageous compared to the RFC on all data with its higher specificity, depending on the preferred outcome.

Method 2 was also trained on a larger number of sagittal MRIs, with the distribution of the training set marginally inclined towards minors (52.26%). The models performed better than stat when evaluated on all ten training rounds and on the best one per fold (Table 7.10). The RFC trained only on sagittal MRIs attained the best average metrics, with an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% over all folds.

Table 7.9: Performance on majority classification of several model variants from Method 2 (M2) on the test sets in an “extended” 5-fold cross-validation using coronal MRIs, AM, and SKJ

Rounds  Data        Classifier  Acc.   Sens.   Spec.  AUC
-       -           stat        49.25  100.00   0.00  50.00
all     COR         CNN+SVC     85.71   86.36  84.62  90.82
all     COR+AM+SKJ  CNN+RFC     83.49   81.36  87.08  89.55
best    COR         CNN+RFC     89.14   89.09  89.23  92.52
best    COR+AM+SKJ  CNN+RFC     89.71   88.18  92.31  91.99

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

Table 7.10: Performance on majority classification of several model variants from Method 2 (M2) on the test sets in an “extended” 5-fold cross-validation using sagittal MRIs

Rounds  Data  Classifier  Acc.   Sens.   Spec.  AUC
-       -     stat        52.26  100.00   0.00  50.00
all     SAG   CNN+SVC     87.47   88.41  86.13  94.33
best    SAG   CNN+RFC     90.93   88.64  94.19  94.38

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

The last method evaluated on majority classification of the 18-year limit was Method 3. The gradient tree boosting classifier (GBC) achieved metrics under 80%, except for the AUC, when considering all ten training rounds (Table 7.11). In contrast, the RFC was the best model, with an average accuracy of 86.86%, a sensitivity of 85.46%, a specificity of 89.23%, and an AUC of 88.53% over all folds.

Table 7.11: Performance on majority classification of several model variants from Method 3 (M3) on the test sets in an “extended” 5-fold cross-validation using coronal MRIs, AM, and SKJ

Rounds  Data        Classifier  Acc.   Sens.   Spec.  AUC
-       -           stat        49.25  100.00   0.00  50.00
all     COR+AM+SKJ  CNN+GBC     76.34   74.91  78.77  83.96
best    COR+AM+SKJ  CNN+RFC     86.86   85.46  89.23  88.53

stat: predicts all subjects in the training set as minors
all/best: all ten or best training rounds per fold are included

Summary

In summary, several successful models were trained for the age regression and majority classification tasks. Comparing the three methods of this work (M1, M2, and M3), Method 2 proved to be the optimal approach to solve both tasks.

For regression, M2 was best configured as a CNN trained on coronal MRIs followed by an extremely randomized trees regressor. The ETR used the CNN age predictions per image slice, the AM, and the SKJ to regress the final chronological age of an individual. It achieved an average MAE of 0.69±0.47 years and a maximum AE of 2.15 years on the test sets of five different folds. Each fold included 35 test subjects, amounting to a total of 175 different subjects evaluated with the aforementioned model. The predictions of the ETR are plotted over the true chronological ages of all test subjects (Fig. 7.18). The green central line highlights a perfect prediction, while the two parallel grey lines encompass 95% of the model predictions. The predictions lie relatively close to and evenly distributed along the green line, except for a few outliers outside the area between the grey lines.
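The stacking of per-slice CNN predictions with AM and SKJ into an ETR can be sketched as below. All data here are synthetic stand-ins, and the slice count and feature scales are assumptions, not values from this work:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(42)
n_subjects, n_slices = 175, 8      # slice count per subject is an assumption

# Hypothetical stand-ins: true ages, noisy per-slice CNN age predictions,
# and two ordinal staging features (AM, SKJ) per subject.
age = rng.uniform(13, 25, n_subjects)
cnn_slice_preds = age[:, None] + rng.normal(0.0, 1.0, (n_subjects, n_slices))
am = rng.integers(1, 5, n_subjects)
skj = rng.integers(1, 5, n_subjects)

# Second-stage feature matrix: one row per subject.
X = np.column_stack([cnn_slice_preds, am, skj])
etr = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, age)
mae = np.mean(np.abs(etr.predict(X) - age))   # in-sample, illustration only
```

The in-sample MAE here only demonstrates the data flow; the thesis figures come from held-out test folds.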

For classification, the best method was M2 as well, but using sagittal instead of coronal MRIs. RFC proved to be the best ML algorithm to learn from the age predictions of the CNN to discriminate between adults and minors. It achieved a high performance on this task, with an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% over all folds. The ROC curve suggests that the model can increase its sensitivity at the cost of specificity, or conversely (Fig. 7.19).
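This sensitivity/specificity trade-off corresponds to moving the decision threshold along the ROC curve. A minimal sketch with synthetic scores (the score distributions are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical "probability of minor" scores from a classifier.
rng = np.random.default_rng(1)
y_true = np.array([1] * 50 + [0] * 50)          # 1 = minor, 0 = adult
scores = np.clip(np.concatenate([rng.normal(0.75, 0.15, 50),   # minors
                                 rng.normal(0.30, 0.15, 50)]), # adults
                 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
# Each threshold is one point on the curve: lowering it raises the
# sensitivity (tpr) while the specificity (1 - fpr) drops, and vice versa.
t_sensitive = thresholds[np.argmax(tpr >= 0.95)]  # first point with sens >= 0.95
```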

Figure 7.18: Predicted vs. true chronological age of test subjects from all five folds (n = 5 × 35 = 175) using a CNN followed by an ETR based on Method 2. The green central line highlights a perfect prediction, while the two parallel grey lines encompass 95% of the data.

Figure 7.19: ROC curve for the best model on majority classification. The random forest classifier (RFC) attains an accuracy of 90.9%, a sensitivity of 88.6%, a specificity of 94.2%, and an AUC of 94.4% averaged over five distinct folds. The high mean AUC suggests that shifting the threshold can further improve the sensitivity at the expense of the specificity, or vice versa.

8 Discussion