Detailed description of the data analysis

Data were analyzed using the R software package (version 3.4.1 for Linux; http://CRAN.R-project.org/ 1) on an Intel Xeon® computer running Ubuntu Linux 16.04.3 64-bit. The analysis was performed in five main steps comprising (i) data preprocessing, (ii) feature selection, (iii) classifier creation from each set of questionnaire items, followed by (iv) the creation of a combined classifier from the identified item subsets of the questionnaires, with the inclusion of (v) classical measures of test and classification performance.

Data preprocessing

Questionnaires with at least 80% of the items completed were required for subject inclusion into the data analysis. This resulted in n = 761, 752, 754 and 779 subjects available for the analyses of the BDI, STAI-State, STAI-Trait and STAXI-2 Anger-inhibition, respectively. In the data fulfilling this criterion, 3, 4, 3 and 0 missing values were found in the BDI, STAI-State, STAI-Trait and STAXI-2 Anger-inhibition data subsets, respectively. Missing values were imputed using a k-nearest-neighbor algorithm with k = 3 2, applying the weighted average method implemented in the “DMwR” R library (https://cran.r-project.org/package=DMwR 3). Pain persistence group assignments (classification of patients to groups) were then included into the data space. The obtained data space, D = {(x_i, y_i) | x_i ∈ X, y_i ∈ Y, i = 1, …, n}, had an input space X comprising vectors x_i = <x_i,1, …, x_i,d> with d > 0 different parameters (here, the psychological questionnaire items) acquired from n > 0 cases, and an output space Y comprising the classes y_i (here, the “persistent” versus “non-persistent pain”

groups of the patients). The data space D consisted of four matrices of size 22 x 761, 21 x 752, 21 x 754 and 9 x 779 for the BDI, STAI-State, STAI-Trait and STAXI-2 data subsets, respectively. These data sets were split into a 2/3 sized training set and a 1/3 sized test set that both contained the two core groups of pain persistence classes in size-proportional counts.
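The preprocessing described above was done in R with the “DMwR” and “sampling” libraries. As a rough illustration only, the two steps (k-nearest-neighbor imputation with k = 3 and a class-proportional 2/3–1/3 split) can be sketched in Python with scikit-learn on simulated data; the distance weighting used here is an assumption standing in for DMwR's weighted-average method, and none of the variable names refer to the study data:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated questionnaire data: 100 subjects x 9 items, ratings 0..3,
# with a few missing values, plus a binary pain-persistence label.
X = rng.integers(0, 4, size=(100, 9)).astype(float)
y = rng.integers(0, 2, size=100)
X[rng.integers(0, 100, 5), rng.integers(0, 9, 5)] = np.nan

# k-nearest-neighbor imputation with k = 3; distance weighting is an
# assumed stand-in for DMwR's weighted-average method.
imputer = KNNImputer(n_neighbors=3, weights="distance")
X_imputed = imputer.fit_transform(X)

# 2/3 training / 1/3 test split, stratified so that both parts contain
# the pain-persistence classes in size-proportional counts.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=1 / 3, stratify=y, random_state=0
)
```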


Feature selection

Feature (psychological questionnaire item) selection was performed on the training data sets, applying random forests followed by computed ABC analysis as introduced recently 4. Specifically, a classifier (persistent versus non-persistent pain) was developed for each questionnaire using supervised machine learning 5, 6 and feature selection techniques 7. Feature selection was started with a random forest analysis, which is an ensemble learning method using the bagging of many weak classifiers into a strong classifier. Specifically, it employs a multitude of decision trees to learn a highly irregular combination of features 8, 9. These trees are obtained by random splits of the features, and the classifier relies on the majority vote for class membership provided by a large number of decision trees.

In the present analysis, 300 decision trees were built, each using features randomly drawn from the questionnaire items. The number of trees was based on visual analysis of the relationship between the number of decision trees and the classification accuracy, which indicated that beyond 100 trees the balanced classification accuracy remained stable and a larger number merely consumed available computation time (for an example figure of a similar analysis, see Supplementary Figure 1 of 10). From these trials, features (questionnaire items) were chosen to be included into the final classifier based on the importance of the feature in the random forest classifier, which is provided as the mean decrease in classification accuracy when the respective parameter was excluded from forest building. These calculations were done using the R library “randomForest” (Liaw A; https://cran.r-project.org/package=randomForest).
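The actual computation used R's randomForest(). As a hedged sketch in Python on simulated data (not the authors' code), a permutation importance can play the role of the mean decrease in accuracy when a feature's information is removed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Simulated stand-in for one questionnaire's items and the binary
# pain-persistence grouping; sizes are arbitrary, not the study data.
X, y = make_classification(n_samples=300, n_features=9, n_informative=3,
                           random_state=1)

# 300 trees, mirroring the number used in the analysis.
forest = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

# Permutation importance approximates the mean decrease in accuracy
# when a feature is effectively excluded.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```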

Subsequently, the above-obtained values of the mean decrease in tree classification accuracy when the parameter was excluded from random forest analysis were submitted to computed ABC analysis 11. This is a categorization technique for the selection of the most important subset among a larger set of items. ABC analysis aims at dividing a set of positive data, here the set of responses to the items of the psychological questionnaires and their classification performances in the preceding random forest analysis, into three disjoint subsets called “A”, “B” and “C”. Subset “A” comprises the profitable values, i.e., “the important few”, that were retained for subsequent classifier establishment, whereas the opposite subset “C” comprises the non-profitable values, i.e., “the trivial many” 12, 13. These calculations were done using the R package “ABCanalysis” (http://cran.r-project.org/package=ABCanalysis 11).

Random forest analysis with subsequent computed ABC analysis was applied to 1,000 data subsets randomly drawn from the original data sets by means of core-class proportional bootstrap resampling from the training data set 14, using the R library “sampling” (Y. Tillé, A. Matei, https://cran.r-project.org/package=sampling). The final size of the feature set was equal to the most frequent size of set “A” in the 1,000 runs. The final members of the feature set were chosen in decreasing order of their appearances in ABC set “A” among the 1,000 runs.
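The consolidation rule across bootstrap runs can be expressed compactly. The sketch below is a minimal Python illustration on a toy list of per-run "A" sets (the item names and the five runs are invented; the study used 1,000 runs on resampled training data):

```python
from collections import Counter

def consolidate(runs):
    """Combine per-bootstrap ABC sets "A": the final feature-set size is
    the most frequent size of "A" across runs, and the members are taken
    in decreasing order of how often they appeared in set "A"."""
    size = Counter(len(r) for r in runs).most_common(1)[0][0]
    frequency = Counter(feature for r in runs for feature in r)
    return [feature for feature, _ in frequency.most_common(size)]

# Toy stand-in for five bootstrap runs of random forest + ABC analysis.
runs = [{"item1", "item2"}, {"item1", "item2"}, {"item1", "item2"},
        {"item1", "item3"}, {"item1", "item2", "item4"}]
print(consolidate(runs))  # → ['item1', 'item2']
```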

Classifier creation

The classifier creation process aimed at finding the item sum from the given questionnaire at which the assignment of a subject to either the “non-persistent pain” or the “persistent pain” group was most accurately possible. That is, the final classifier was created by identifying a classification rule ψ: N_ABC set “A” → Y, where N denotes the features in ABC set “A”, that assigned a class label to the data on the basis of the sum score of the features selected in the previous analytical steps. Thus, at this step of the analysis the sum score of the selected subset of items of a given questionnaire was tested for its classification capacity. Therefore, all possible sums of the selected items, which can take values of n ∈ N, [0, …, 3] for BDI and STAXI-2 (anger-inhibition), and n ∈ N, [1, …, 4] for STAI-State and STAI-Trait, were iteratively tested with respect to their classification performance. The main criterion was the product of sensitivity and specificity. The procedure was repeated 1,000 times on randomly generated bootstrap subsamples of the training data subset.
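The search for the best cut-off on the sum score can be sketched as follows; this is an illustrative Python version on invented toy scores, assuming (as the text implies) that label 1 (“persistent pain”) is predicted when the sum score reaches the cut-off:

```python
def best_cutoff(sum_scores, labels):
    """Scan all candidate cut-offs on the item sum score and return the
    one maximizing sensitivity * specificity (label 1 = persistent
    pain, predicted when the sum score is >= the cut-off)."""
    best_t, best_product = None, -1.0
    for t in sorted(set(sum_scores)):
        tp = sum(1 for s, y in zip(sum_scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(sum_scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(sum_scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(sum_scores, labels) if s >= t and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        if sensitivity * specificity > best_product:
            best_t, best_product = t, sensitivity * specificity
    return best_t

# Invented sum scores for non-persistent (0) and persistent (1) subjects.
scores = [1, 2, 2, 3, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_cutoff(scores, labels))  # → 5
```

In the study this scan was additionally repeated on 1,000 bootstrap subsamples of the training data rather than run once.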

Creation of combined classifiers

From the item subsets of the questionnaires identified in the previous analytical steps, an attempt at a combined classifier from the four different reduced questionnaires was made. This analysis used the items so far identified as best suited for the correct assignment of a subject to either the “non-persistent pain” or the “persistent pain” group. They were submitted to feature selection and classifier creation in the same manner as described above. To obtain similarly scaled items, STAI-State and STAI-Trait were rescaled into n ∈ N, [0, …, 3] by subtracting a value of 1 from each rating.

Assessment of classification performance

Classifier performance was assessed on the training and on the test data subsets. Specifically, test sensitivity and specificity were calculated as mentioned above; the negative and positive predictive values were obtained as NPV [%] = 100 · true negative / (true negative + false negative) and PPV [%] = 100 · true positive / (true positive + false positive), respectively 15. The analysis was performed similarly as applied on the reduced item sets, i.e., using 1,000 data subsets randomly drawn from the original data sets by means of core-class proportional bootstrap resampling 14 from the training data set.
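The two predictive values defined above translate directly into code; the confusion counts in this Python sketch are hypothetical example numbers, not study results:

```python
def predictive_values(tp, fp, tn, fn):
    """Positive and negative predictive values in percent, as defined
    in the text: PPV = 100 * TP / (TP + FP), NPV = 100 * TN / (TN + FN)."""
    ppv = 100.0 * tp / (tp + fp)
    npv = 100.0 * tn / (tn + fn)
    return ppv, npv

# Hypothetical confusion counts from a test-set evaluation.
print(predictive_values(tp=8, fp=2, tn=9, fn=1))  # → (80.0, 90.0)
```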

References

1 R Development Core Team. R: A Language and Environment for Statistical Computing. 2008
2 Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theor 1967; 13: 21-7
3 Torgo L. Data Mining with R: Learning with Case Studies. Chapman & Hall/CRC, 2010

4 Lötsch J, Ultsch A. Random forests followed by ABC analysis as a feature selection method for machine-learning. Conference of the International Federation of Classification Societies. Tokyo, 2017; 170

5 Murphy KP. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012

6 Shalev-Shwartz S, Ben-David S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014

7 Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003; 3: 1157-82

8 Ho TK. Random decision forests. Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1). IEEE Computer Society, 1995; 278
9 Breiman L. Random Forests. Mach Learn 2001; 45: 5-32

10 Kringel D, Geisslinger G, Resch E, et al. Machine-learned analysis of the association of next- generation sequencing based human TRPV1 and TRPA1 genotypes with the sensitivity to heat stimuli and topically applied capsaicin. Pain 2018

11 Ultsch A, Lötsch J. Computed ABC Analysis for Rational Selection of Most Informative Variables in Multivariate Data. PLoS One 2015; 10: e0129767

12 Pareto V. Manuale di economia politica, Milan: Società editrice libraria, revised and translated into French as Manuel d’économie politique. Paris: Giard et Brière, 1909

13 Juran JM. The non-Pareto principle; Mea culpa. Quality Progress 1975; 8: 8-9


14 Efron B, Tibshirani RJ. An introduction to the bootstrap. San Francisco: Chapman and Hall, 1995

15 Altman DG, Bland JM. Diagnostic tests 2: Predictive values. BMJ 1994; 309: 102
