
8.5 Statistical and machine learning analysis

Due to the low number of samples and the high number of features, various statistical and machine learning methods are applied to remove uninformative features and to obtain a list of informative features that can serve as a potential biomarker signature.

Figure 8.5-1 Feature selection

To obtain informative miRNAs (154) and piRNAs (43), the Measure of Relevance (MoR) procedure [275] was applied to the 61 samples of cohort 1. Given the high dimensionality of the features relative to the very small number of samples, the MoR procedure reduces the feature set to a small informative subset by evaluating distribution overlap (assuming that samples are independent), biological difference, the dispersion parameters of the samples, and a weighting factor common to all features. The MoR_j value for the jth smallRNA feature is computed from the means, variances and covariance of the rank-transformed values of that feature in the two subsamples 1 and 2 [275].

The absolute MoR values for each feature are then sorted in decreasing order and a suitable selection and evaluation criterion is applied to the information chain as described in [275].

We also applied reliability analysis (RiA) [276] to obtain a reduced set of informative features: the MoR procedure was run for 500 iterations, each time on a randomly selected subgroup of samples and features, and features with a relative selection frequency higher than 0.25 across the 500 iterations were retained. Features common to the MoR and RiA selections were kept as reliable features for further analysis.
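A minimal sketch of the RiA resampling step in Python is given below; the measure_of_relevance() helper (standing in for the MoR procedure of [275] and returning the positions of the selected features within the drawn subset) and the 80% subsampling fraction are assumptions for illustration, since the exact subgroup sizes are not reproduced here.

import numpy as np
from collections import Counter

def reliability_analysis(X, y, measure_of_relevance, n_iter=500, threshold=0.25, frac=0.8, seed=0):
    """Count how often each feature is kept by MoR on random sample/feature subsets."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    n_samples, n_features = X.shape
    for _ in range(n_iter):
        # draw a random subgroup of samples and a random subset of features
        s_idx = rng.choice(n_samples, size=int(frac * n_samples), replace=False)
        f_idx = rng.choice(n_features, size=int(frac * n_features), replace=False)
        # positions (within f_idx) of the features MoR selects on this subset
        selected = measure_of_relevance(X[np.ix_(s_idx, f_idx)], y[s_idx])
        counts.update(f_idx[selected])
    # keep features selected in more than `threshold` of the iterations
    return [f for f, c in counts.items() if c / n_iter > threshold]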

Figure 8.5-2 miRNAs and piRNAs

8.5.1 Variable ranking and removal of low ranked variables

Given the very small number of samples, the reliable features were further ranked [277] by sorting them according to the values of a scoring function S(i), given by [188, 278]:

S(i): F → Ω

where S(i) is the scoring function computed from the values of the training data with the feature set F = (F_1, …, F_q), and Ω is the probability space, i.e. the set of possible classifications {c_1, …, c_q}.

A combination of several machine learning algorithms was used, and the mean of their scoring-function values was calculated and used to rank the reliable features.

First, Ridge regression [279], also known as Tikhonov regularization, was used. Here the loss function is a linear least-squares function with l2 regularization. The least-squares solution is computed using a singular value decomposition (SVD) of the input matrix M(n, p), with complexity O(np²) for n ≥ p. The Ridge algorithm with default parameters given by

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)

is used. The output is the vector of coefficients, which the Ridge algorithm obtains by minimizing the residual sum of squares with a penalty on the size of the coefficients. The minimization objective is given by

min_ω ‖Mω − y‖₂² + α‖ω‖₂²

where α is the regularization strength, y is the output variable and ω are the coefficients of the linear model.
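For illustration, this objective has the closed-form solution ω = (MᵀM + αI)⁻¹ Mᵀy. A minimal NumPy sketch of this solution is shown below; it ignores the intercept that scikit-learn fits by default, and M_train / y_train are assumed variable names.

import numpy as np

def ridge_coefficients(M, y, alpha=0.5):
    """Closed-form minimizer of ||M w - y||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = M.shape[1]
    # solve the regularized normal equations (M^T M + alpha * I) w = M^T y
    return np.linalg.solve(M.T @ M + alpha * np.eye(n_features), M.T @ y)

# absolute coefficient magnitudes can then serve as per-feature ranking scores
ridge_scores = np.abs(ridge_coefficients(M_train, y_train, alpha=0.5))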

While Ridge regression requires the l2 regularization parameter to be set in advance, Bayesian regression can estimate this parameter from the data. Bayesian Ridge regression [280-283], given by

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True, fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300, normalize=False, tol=0.001, verbose=False)

essentially estimates a probabilistic model in which the output y is assumed to be Gaussian distributed around Mω:

p(y | M, ω, α) = N(y | Mω, α)

where α is treated as a random variable to be estimated from the data.

The prior over the coefficients ω is given by a spherical Gaussian:

p(ω | λ) = N(ω | 0, λ⁻¹ I_p)

where λ is the estimated precision of the weights.

Then, univariate linear regression tests [284] (f_regression()) were used to test the effect of a single regressor, sequentially for all regressors. The test first computes the correlation between each regressor i and the target as

((M[:, i] − mean(M[:, i])) · (y − mean(y))) / (std(M[:, i]) · std(y))

This correlation is first converted to an F-score and then to a p-value.
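As a sketch of this conversion (not the library implementation itself; M and y denote the assumed training matrix and target), the F-score and p-value for a single regressor i can be computed as follows:

import numpy as np
from scipy import stats

def univariate_f_test(M, y, i):
    """Correlation of column i with y, converted to an F-score and a p-value."""
    r = np.corrcoef(M[:, i], y)[0, 1]          # Pearson correlation of regressor i with the target
    dof = len(y) - 2                           # residual degrees of freedom
    f_score = r ** 2 / (1.0 - r ** 2) * dof    # correlation converted to an F-statistic
    p_value = stats.f.sf(f_score, 1, dof)      # upper tail of the F(1, dof) distribution
    return f_score, p_value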

A random forest regressor (RF) [168, 169] given by

RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

results in a meta-estimator that fits a number of decision trees on sub-samples of the data set and averages the predictions of the individual trees to obtain the final prediction.

Then, LassoLarsCV [285], a cross-validated Lasso fitted with the LARS algorithm, was used; the underlying estimator is given by

LassoLars(alpha=1.0, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.220446049250313e-16, copy_X=True, fit_path=True, positive=False)

The optimization objective function to minimize is:

min πœ” 1

2π‘›π‘ π‘žπ‘šπ‘ π‘žπ‘žπ‘ β€–π‘€π‘€ βˆ’ 𝑦‖22+𝛼‖𝑀‖1

where α is the regularization strength, y is the output variable and ω are the coefficients of the linear model. The lasso estimate thus solves the least-squares minimization with the penalty α‖ω‖₁ added, where α is a constant and ‖ω‖₁ is the l1 norm of the coefficient vector.

Finally, a mean rank is calculated by averaging the scores from all of these algorithms, and features with a mean rank lower than 0.30 are filtered out.
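A condensed sketch of how the scores of the algorithms described above could be combined into a mean rank is shown below; the min-max rescaling of each score vector and the variable names M_train / y_train are assumptions, since the normalization applied before averaging is not spelled out here.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import f_regression
from sklearn.linear_model import BayesianRidge, LassoLarsCV, Ridge

def rescale(scores):
    """Map absolute scores to [0, 1] so that scores from different algorithms are comparable."""
    scores = np.abs(np.asarray(scores, dtype=float))
    return (scores - scores.min()) / (scores.max() - scores.min())

score_sets = [
    rescale(Ridge(alpha=0.5).fit(M_train, y_train).coef_),
    rescale(BayesianRidge().fit(M_train, y_train).coef_),
    rescale(f_regression(M_train, y_train)[0]),    # univariate F-scores
    rescale(RandomForestRegressor(n_estimators=10).fit(M_train, y_train).feature_importances_),
    rescale(LassoLarsCV().fit(M_train, y_train).coef_),
]

mean_rank = np.mean(score_sets, axis=0)
kept = np.where(mean_rank >= 0.30)[0]   # drop features whose mean rank falls below 0.30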

8.5.2 Multivariate analysis of covariance

We also filtered out miRNAs and piRNAs that were confounded by age and gender by applying a multivariate analysis of covariance (MANCOVA) [286, 287] to the reliable features and removing features with significance values below 0.05 for these covariates.
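A hedged sketch of such a covariate check using the MANOVA class of statsmodels (which accepts continuous covariates and can therefore serve as a MANCOVA) is shown below; the data frame df, the column names age and gender, and the placeholder feature names are assumptions for illustration.

from statsmodels.multivariate.manova import MANOVA

# df: one row per sample, one column per reliable feature plus 'age' and 'gender'
feature_cols = ["hsa_miR_1", "hsa_miR_2", "piR_1"]           # placeholder feature names
formula = " + ".join(feature_cols) + " ~ age + C(gender)"

fit = MANOVA.from_formula(formula, data=df)
print(fit.mv_test())   # Pillai's trace / Wilks' lambda tests for the age and gender terms
# A significant covariate effect (p < 0.05) indicates confounding; per-feature follow-up
# tests would then identify which features to remove from the candidate signature.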

8.5.3 Model selection and performance

The performance of the selected features is then evaluated in an independent test cohort. The average error and the average number of trees were determined by 10-fold cross-validation on CV data drawn from the training data.
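An illustrative sketch of such a 10-fold cross-validation in Python (a scikit-learn stand-in rather than the R pipeline used here; M_cv, y_cv, the candidate tree counts and the 0/1 class labels are assumptions) could look like:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# class weights of 0.5 (control, label 0) and 1.0 (Alzheimer's disease, label 1)
search = GridSearchCV(
    RandomForestClassifier(class_weight={0: 0.5, 1: 1.0}, random_state=0),
    param_grid={"n_estimators": [100, 250, 500, 1000]},   # candidate numbers of trees
    cv=10,                                                 # 10-fold cross-validation
    scoring="roc_auc",
)
search.fit(M_cv, y_cv)            # CV data drawn from the training cohort
print(search.best_params_)        # number of trees giving the best cross-validated AUC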

Figure 8.5-3 Optimal model selection and performance evaluation

The random forest model, implemented in R (v 3.2.2) with the package randomForest (v 4.6.14), is trained with stratified sampling; class weights of 0.5 for the control class and 1.0 for the Alzheimer's disease class were used to minimize false negatives. Prediction on the training and test data is performed with the predict function from randomForest, and the roc function is used to obtain the test performance on the untouched test cohort data. The AUC values are plotted using the pROC package [288] with 500 stratified bootstrap iterations [289] to provide confidence intervals computed with DeLong's method [290]. Smoothing of the AUROC curve was performed by estimating the α and β coefficients of the linear regression line of the smoothed curve, which is given by

πœ™βˆ’1(𝑆𝐸) =𝛼+π›½πœ™βˆ’1(𝑆𝑆)

Where, πœ™is the normal quantile function value of sensitivities (SE) and specificities (SP).

The variable importance is calculated with the importance function of the randomForest package.

The total decrease in node impurity is measured by averaging the impurity decrease over every split on the variable, across all trees generated in the random forest model. The node impurity is measured by the Gini index (I_G), which is given by

I_G(p) = 1 − Σ_{i=1}^{C} p_i²

where C is the number of classes in the total sample set, i ∈ {1, 2, …, C}, and p_i is the fraction of samples that belong to class i. The Gini index-based importance is obtained for each variable and shown as a bar plot in decreasing order of variable importance.
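For example, a node that contains two classes in equal proportion has p_1 = p_2 = 0.5 and thus I_G = 1 − (0.25 + 0.25) = 0.5, whereas a pure node yields I_G = 0; variables whose splits produce the largest average decrease in I_G across the forest are therefore ranked as most important.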