Due to the low number of samples and high number of features, various statistical and machine learning methods are applied to reduce the number of uninformative features and get a list of informative features that can be used as a potential signature for biomarkers.
Figure 8.5-1 Feature selection
In order to obtain informative miRNAs (154) and piRNAs (43), a Measure of Relevance (MoR) procedure [275] was applied to the 61 samples of cohort 1. Given the high dimensionality of the features relative to the very small number of samples, the MoR procedure reduces the feature set to a small informative subset by evaluating the distribution overlap (assuming that the samples are independent), the biological difference, the dispersion parameters of the samples, and a weighting factor common to all features. The MoR value for the j-th smallRNA feature, MoR_j, is defined in [275] in terms of μ_1j, μ_2j, σ²_1j and σ²_2j, the means, variances and covariance of the rank-transformed values of the j-th smallRNA feature for the two subsamples 1 and 2.
The absolute MoR values for each feature are then sorted in decreasing order and a suitable selection and evaluation criterion is applied to the information chain as described in [275].
We also applied reliability analysis (RiA) [276] to obtain a reduced set of informative features: the MoR procedure was applied to a randomly selected subgroup of samples and features over 500 iterations, and features with a relative selection frequency higher than 0.25 across those iterations were retained. Features common to the MoR and RiA results were chosen as reliable features for further analysis.
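The MoR statistic itself is defined in [275] and is not reproduced here; the resampling scheme of the reliability analysis can, however, be sketched as follows. A simple rank-based group-mean difference stands in for the MoR score, and the subgroup fractions and the size of the per-iteration top list are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def relevance_score(x, labels):
    # Rank-transform x, then take the absolute difference of the group
    # mean ranks -- a simple stand-in for the MoR statistic of [275]
    r = x.argsort().argsort()
    return abs(r[labels == 0].mean() - r[labels == 1].mean())

def reliability_analysis(X, labels, n_iter=500, keep_top=20,
                         sample_frac=0.8, feature_frac=0.8,
                         min_freq=0.25, seed=0):
    # Repeat: draw a random subgroup of samples and features, score the
    # drawn features, and count how often each feature lands in the top list
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)
    drawn = np.zeros(p)
    for _ in range(n_iter):
        rows = rng.choice(n, size=int(sample_frac * n), replace=False)
        cols = rng.choice(p, size=int(feature_frac * p), replace=False)
        drawn[cols] += 1
        scores = np.array([relevance_score(X[rows, c], labels[rows])
                           for c in cols])
        selected[cols[np.argsort(scores)[::-1][:keep_top]]] += 1
    # Keep features whose relative selection frequency exceeds min_freq
    freq = selected / np.maximum(drawn, 1)
    return np.flatnonzero(freq > min_freq)
```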
Figure 8.5-2 miRNAs and piRNAs
8.5.1 Variable ranking and removal of low ranked variables
Given the very low number of samples, the reliable features were further ranked [277] by sorting them according to the values of a scoring function S [188, 278]:
S: F → Ω
where S is the scoring function computed from the values of the training data with the set of features F = (F1, …, Fn), and Ω is the probability space consisting of the set of possible classifications {C1, …, Ck}.
A combination of machine learning algorithms was used, and the mean of their scoring-function values was calculated to rank the reliable features.
First, Ridge regression [279], also known as Tikhonov regularization, was used. Here the loss function is a linear least-squares function with ℓ2 regularization. The linear least-squares problem is solved using singular value decomposition (SVD) of the input matrix X(n, p), with complexity O(np²) where n ≥ p. The Ridge algorithm with default parameters, given by
Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)
is used. The output is the set of coefficients obtained by minimizing the residual sum of squares with a penalty on the size of the coefficients. The minimization objective is given by
min_w ‖Xw − y‖₂² + α‖w‖₂²
where α is the regularization strength, y is the output variable, X is the input matrix and w are the coefficients of the linear model.
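As a minimal sketch of this ranking step, scikit-learn's Ridge can be fitted and the features ordered by the absolute value of their coefficients. The synthetic data and the use of |coef| as the ranking score are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Fit ridge regression with the alpha used in the text
model = Ridge(alpha=0.5).fit(X, y)

# Score each feature by the absolute value of its coefficient
ranking = np.argsort(np.abs(model.coef_))[::-1]
```

On this data the two informative features end up at the top of the ranking.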
Unlike Ridge, where the ℓ2 regularization parameter is set in advance, Bayesian regression can estimate this parameter by tuning it to the data. Bayesian Ridge regression [280-283], given by
BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300, normalize=False, tol=0.001, verbose=False)
estimates a probabilistic model in which the output y is assumed to be Gaussian distributed around Xw:
p(y | X, w, α) = N(y | Xw, α)
where α is treated as a random variable to be estimated from the data.
The prior for w is given by a spherical Gaussian:
p(w | λ) = N(w | 0, λ⁻¹ I_p)
where λ is the estimated precision of the weights.
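A minimal sketch of this estimation on synthetic data (the data and tolerances are illustrative); after fitting, alpha_ (noise precision) and lambda_ (weight precision) hold the values estimated from the data.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# alpha_ and lambda_ are estimated from the data rather than fixed in advance
model = BayesianRidge().fit(X, y)
```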
Then, univariate linear regression tests [284] (f_regression()) were used to test the effect of each single regressor, sequentially for many regressors. The test first computes the correlation between each regressor X[:, i] and the target y:

((X[:, i] − mean(X[:, i])) · (y − mean(y))) / (std(X[:, i]) · std(y))

The correlation is then converted to an F score and finally to a p-value.
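A short sketch of the univariate test with scikit-learn's f_regression on synthetic data (the data are an illustrative assumption): each regressor gets its own F statistic and p-value.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: only feature 3 carries signal
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 3] + rng.normal(scale=0.5, size=60)

# One F statistic and one p-value per regressor, from the univariate test
F, pval = f_regression(X, y)
best = int(np.argmax(F))
```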
A random forest regressor (RF) [168, 169] given by
RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)
results in a meta-estimator that fits a number of decision trees on sub-samples of the dataset. The final prediction is obtained by averaging the predictions of the individual trees.
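For the ranking step, the forest's impurity-based importances provide one score per feature. A sketch on synthetic data (the data and the tree count are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only feature 2 carries signal
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 6))
y = 4.0 * X[:, 2] + rng.normal(scale=0.3, size=80)

# Impurity-based importances: one score per feature, summing to 1
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top = int(np.argmax(rf.feature_importances_))
```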
Then, LassoLarsCV [285], a cross-validated Lasso using the LARS algorithm, was applied; the underlying estimator is given by
LassoLars(alpha=1.0, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.220446049250313e-16, copy_X=True, fit_path=True, positive=False)
The optimization objective to minimize is:

min_w (1 / (2 · n_samples)) ‖Xw − y‖₂² + α‖w‖₁

where α is the regularization strength, y is the output variable and w are the coefficients of the linear model. The Lasso estimate thus solves the minimization of the least-squares penalty with α‖w‖₁ added, where α is a constant and ‖w‖₁ is the ℓ1 norm of the coefficient vector.
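A minimal sketch of the cross-validated variant on synthetic data (the data and cv setting are illustrative): cross-validation chooses α along the LARS regularization path, and uninformative features are driven to exactly zero.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Cross-validation selects alpha along the LARS path
model = LassoLarsCV(cv=5).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
```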
In the end, the mean rank is calculated by averaging the scores from all algorithms, and features with a mean rank lower than 0.30 are filtered out.
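The thesis does not specify how the per-model scores are normalized before averaging; the sketch below assumes min-max scaling of each model's scores to [0, 1] before taking the mean and applying the 0.30 cutoff.

```python
import numpy as np

def mean_rank(score_lists, cutoff=0.30):
    # Min-max scale each model's scores to [0, 1] (assumed normalization),
    # average across models, and keep features at or above the cutoff
    S = np.asarray(score_lists, dtype=float)       # shape: (models, features)
    lo = S.min(axis=1, keepdims=True)
    span = S.max(axis=1, keepdims=True) - lo
    S = (S - lo) / np.where(span == 0, 1, span)
    mean = S.mean(axis=0)
    return np.flatnonzero(mean >= cutoff), mean

# Two models scoring three features: feature 1 is dropped by the cutoff
kept, mean = mean_rank([[0.9, 0.1, 0.5],
                        [0.8, 0.0, 0.6]])
```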
8.5.2 Multivariate analysis of covariance
We also filtered out miRNAs and piRNAs confounded by age and gender by applying multivariate analysis of covariance (MANCOVA) [286, 287] to the reliable features, removing features with significance (p) values less than 0.05.
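As a simplified stand-in for the full MANCOVA of [286, 287] (which models all features jointly), the sketch below screens each feature separately with an F-test of an age + gender linear model against an intercept-only model; this univariate approximation is an assumption for illustration, not the thesis procedure.

```python
import numpy as np
from scipy import stats

def confounder_pvalues(X, age, gender):
    # Per-feature F-test of an age + gender linear model against an
    # intercept-only model (a univariate simplification, not full MANCOVA)
    n, p = X.shape
    Z = np.column_stack([np.ones(n), age, gender])
    q = Z.shape[1] - 1                     # degrees of freedom of covariates
    pvals = np.empty(p)
    for j in range(p):
        y = X[:, j]
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = ((y - Z @ beta) ** 2).sum()  # residual sum of squares
        tss = ((y - y.mean()) ** 2).sum()  # total sum of squares
        F = ((tss - rss) / q) / (rss / (n - q - 1))
        pvals[j] = stats.f.sf(F, q, n - q - 1)
    return pvals
```

Features with a p-value below 0.05 would then be removed as confounded.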
8.5.3 Model selection and performance
The performance of the selected features is then evaluated in an independent test cohort. The average error and the average number-of-trees parameter were calculated using 10-fold cross-validation on CV data obtained from the training data.
Figure 8.5-3 Optimal model selection and performance evaluation
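The model selection step can be sketched in Python (the thesis workflow below uses the R randomForest package; here scikit-learn stands in, with synthetic data and an illustrative grid of tree counts): the tree count with the lowest average 10-fold CV error is retained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary-classification data driven by feature 0
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Average 10-fold CV error for each candidate number of trees
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = {}
for n_trees in (50, 100, 200):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    errors[n_trees] = 1.0 - cross_val_score(clf, X, y, cv=cv).mean()
best_n_trees = min(errors, key=errors.get)
```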
The random forest model implemented in R (v 3.2.2) with the package randomForest (v 4.6-14) is trained with stratified sampling, and class weights of 0.5 and 1.0 were used for the control class and the Alzheimer's disease class, respectively, to minimize false negatives. Prediction on the training and test data is performed with the predict function from randomForest, and the roc function is used to obtain the test performance on the untouched test-cohort data. The AUC values are plotted using the pROC package [288] with 500 stratified bootstrap iterations [289] to provide the confidence interval computed with DeLong's method [290] for the AUC values. Smoothing of the AUROC curve was performed by calculating the α and β coefficients of a linear regression line of the smoothed curve, which is given by
Φ⁻¹(SE) = α + β · Φ⁻¹(SP)
where Φ⁻¹ is the normal quantile function applied to the sensitivities (SE) and specificities (SP).
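This binormal smoothing can be sketched with SciPy: regress the probit-transformed sensitivities on the probit-transformed specificities and rebuild the smooth curve from the fitted line. The SE/SP values below are illustrative, not study data.

```python
import numpy as np
from scipy import stats

# Illustrative (not study) sensitivity/specificity pairs along a ROC curve
se = np.array([0.95, 0.90, 0.80, 0.65, 0.45])
sp = np.array([0.40, 0.55, 0.70, 0.85, 0.95])

# Fit the probit-scale line Phi^-1(SE) = alpha + beta * Phi^-1(SP)
beta, alpha = np.polyfit(stats.norm.ppf(sp), stats.norm.ppf(se), 1)

# Rebuild a smooth curve over a grid of specificities
sp_grid = np.linspace(0.01, 0.99, 99)
se_smooth = stats.norm.cdf(alpha + beta * stats.norm.ppf(sp_grid))
```

Because sensitivity falls as specificity rises, the fitted slope β is negative and the reconstructed curve decreases monotonically over the grid.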
The variable importance is calculated with the importance function of the randomForest package. The total decrease in node impurity is measured by averaging the impurity decrease over every split on a variable across all trees generated in the random forest model. Node impurity is measured by the Gini index (I_G), which is given by

I_G(p) = 1 − Σ_{i=1}^{C} p_i²

where C is the number of classes in the total sample set, i ∈ {1, 2, …, C}, and p_i is the proportion of samples belonging to class i. The Gini index for each variable is obtained, and a bar plot of the variables in decreasing order of importance is produced.
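The Gini index above can be computed directly from a node's class labels:

```python
import numpy as np

def gini_index(labels):
    # I_G = 1 - sum_i p_i**2, with p_i the fraction of samples in class i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```

A balanced two-class node gives the maximal two-class impurity of 0.5, while a pure node gives 0.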