
5.2 Classification of the Flexibility of Side Chains

5.2.4 Classification of Residues using Support Vector Machines

The results of the threshold-based prediction of flexibility (see section 7.2.1) show that a simple linear classifier works quite well for the χ1 torsion angle, but for the higher torsion angles the classification performance is low. In order to improve the classification performance, on the one hand additional features have to be selected; on the other hand, a method superior to linear classification has to be chosen, such as a support vector machine (SVM). Before outlining the enhanced classifier, a short introduction to support vector machines is given.

Introduction to Support Vector Machines

Support vector machines were introduced in 1992 by Vapnik and co-workers (Boser et al., 1992). In the field of bioinformatics, SVMs are used in several applications, e.g. for classifying monomer and dimer structures of proteins (Neumann, 2003; Zhang et al., 2003) or for homology modelling of proteins (see Cristianini & Shawe-Taylor, 2000, chapter 8). The following introduction (including the figures) is based on the book by Cristianini and Shawe-Taylor (Cristianini & Shawe-Taylor, 2000).

Support vector machines try to learn a separating hyperplane that optimally discriminates two classes⁷. During learning, the hyperplane is tuned so that the SVM's generalisation error is minimised, meaning that this method finds an optimal solution and does not end up in a local minimum. Besides this, the computational effort is very low, so that support vector machines can usually handle large datasets efficiently.

Figure 5.16: Scheme of linear classification in two dimensions. The separating hyperplane is given by (w, b); x and o denote examples from the two classes.

SVMs are based on linear classification machines. In linear classification, a binary decision is performed by a real-valued function f : X ⊆ R^n → R. The input x = (x_1, ..., x_n) is assigned to the positive class if f(x) > 0 and to the negative class otherwise. In this case f is a linear function and for x ∈ X it can be written as:

f(x) = \langle w, x \rangle + b \qquad (5.22)

Thus, f(x) defines a hyperplane (see Fig. 5.16) with parameters w (the direction perpendicular to the hyperplane) and b (the offset determining the position of the hyperplane). Support vector machines are trained on a labelled data set of size M; the SVM divides the input samples into two classes. The training set can be written as {(x_i, y_i)}, i = 1, ..., M, with y_i ∈ {−1, 1}.
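
To make the decision rule concrete, the following R sketch evaluates equation 5.22 for a single input; the values of w, b and x are made up for illustration and are not taken from the data used in this thesis.

    # Linear decision rule f(x) = <w, x> + b (equation 5.22); illustrative values only.
    w <- c(0.8, -0.5)                    # direction perpendicular to the hyperplane
    b <- 0.1                             # offset determining the position of the hyperplane
    f <- function(x) sum(w * x) + b      # inner product plus offset
    x <- c(1.2, 0.7)                     # an example input from R^2
    label <- if (f(x) > 0) +1 else -1    # positive class if f(x) > 0, negative otherwise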

Often the classification performed this way is difficult, since an optimal hyperplane cannot be found. The idea within the theory of support vector machines is to transform the original input space into a higher dimensional space (see Fig. 5.17). Usually, a high dimensional space is sparse; mapping the input space into a higher dimensional space thus makes it easier to find separating hyperplanes. In order to map the data, so-called kernel functions are searched for, so that the hyperplane optimally discriminates the two classes. Imagine that φ is a non-linear function that maps the input x ∈ R^n to a higher dimensional space R^N in which the input is linearly separable. Then the function separating the input can be written as:

f(x) = \langle w, \phi(x) \rangle + b \qquad (5.23)

⁷ SVMs can be extended to separate n classes, as each n-class problem can be defined as (n−1) two-class problems. Here, only the two-class case is considered.

Figure 5.17: Scheme of mapping the input space to the feature space using the kernel function φ.

This equation can be reformulated, resulting in the perceptron learning rule:

f(x) = \sum_{i=1}^{N} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b \qquad (5.24)

In equation 5.24, the inner product \langle \phi(x_i), \phi(x) \rangle is calculated. Here, kernel functions can be applied: their value on the original inputs, K(x_i, x), equals the inner product in feature space. Using this kernel trick it is possible to build a classifier with an implicit mapping into feature space:

f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \qquad (5.25)
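
The kernel trick can be checked directly on a small example: for the homogeneous polynomial kernel of degree two in R^2, the explicit feature map φ(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2) reproduces K(x, z) = ⟨x, z⟩^2. The following R sketch, using arbitrary example vectors, verifies this equality; it illustrates the principle only and is not part of the method described in this thesis.

    # Explicit feature map of the degree-2 polynomial kernel in two dimensions.
    phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
    K   <- function(x, z) (sum(x * z))^2        # kernel evaluated in the input space
    x <- c(1.0, 2.0); z <- c(0.5, -1.5)         # arbitrary example inputs
    all.equal(sum(phi(x) * phi(z)), K(x, z))    # TRUE: equals the inner product in feature space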

Kernel functions have important properties; for example, a new kernel function can be constructed from other kernels. The properties of kernel functions are discussed in detail in chapter 3 of Cristianini and Shawe-Taylor (2000). In practice, a special transformation function φ to build a kernel is usually not searched for; instead, already known kernels are used. A second feature of an SVM is that the classifier can be trained very well, so that the generalisation error is reduced at the same time. This is achieved by optimising the distance between the function separating the classes and the input examples (the functional margin). The functional margin \tilde{\gamma}_i of the training example (x_i, y_i) is defined as

\tilde{\gamma}_i = y_i (w^T x_i + b) \qquad (5.26)

and \tilde{\gamma}_i > 0 if x_i is classified correctly. The functional margin can be transformed into the geometric margin \gamma_i by normalising with the length of the weight vector, ||w||:

\gamma_i = \frac{\tilde{\gamma}_i}{||w||} \qquad (5.27)

In case of the hard margin classifier, the margin is then defined as the minimum over all I examples in the training set:

\gamma = \min_{1 \le i \le I} \gamma_i \qquad (5.28)
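
A short R sketch, using a hypothetical toy training set and a hand-picked hyperplane (w, b), illustrates equations 5.26 to 5.28: the functional margins, the geometric margins and the hard margin γ.

    # Hypothetical toy data: rows of X are training inputs, y the labels in {-1, +1}.
    X <- rbind(c(2, 2), c(3, 3), c(-1, -1), c(-2, -3))
    y <- c(1, 1, -1, -1)
    w <- c(1, 1); b <- 0                         # hand-picked separating hyperplane
    functional <- y * (X %*% w + b)              # equation 5.26: positive if classified correctly
    geometric  <- functional / sqrt(sum(w^2))    # equation 5.27: normalise by ||w||
    gamma      <- min(geometric)                 # equation 5.28: hard margin over all examples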

Figure 5.18 shows the margin for an example training set. The generalisation error is high if the training set is not compact, while a larger training set leads to a smaller generalisation error. Furthermore, the greater the distance of the input samples to the margin γ, the smaller the generalisation error.

In order to optimise the SVM, a discriminating function is searched for that maximises the margin. In case of the hard margin classifier, γ is given by \gamma = \tilde{\gamma} / ||w||, so maximising γ is the same as minimising ||w|| for a fixed \tilde{\gamma}. Setting \tilde{\gamma} = 1, the optimisation problem can be formulated as

\text{minimise} \quad w^T w \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, N \qquad (5.29)

and can be solved by using Lagrange multipliers and the Karush–Kuhn–Tucker (KKT) theorem.
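
For completeness, the step connecting problem 5.29 to equation 5.30 can be sketched as follows; this is the standard derivation from the SVM literature and is not spelled out in the text above:

L(w, b, \alpha) = \frac{1}{2}\, w^T w - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0

Setting the derivatives with respect to w and b to zero gives w = \sum_{i=1}^{N} \alpha_i y_i x_i and \sum_{i=1}^{N} \alpha_i y_i = 0; the KKT complementarity condition for this Lagrangian is exactly equation 5.30.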

Figure 5.18: Margin for an example training set.

Solving the equations above results in:

\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0, \quad i = 1, \ldots, N \qquad (5.30)

Equation 5.30 implies that only those inputs x_i contribute to the solution for which the functional margin is one. These data points lie closest to the separating hyperplane, and their corresponding \alpha_i are non-zero. All other parameters \alpha_j are zero and the corresponding input vectors are not involved. Therefore, the contributing inputs are called support vectors. The optimal hyperplane for separating the classes is then given by:

f(x, \alpha, b) = \sum_{i=1}^{N} y_i \alpha_i \langle x_i, x \rangle + b = \sum_{i \in sv} y_i \alpha_i \langle x_i, x \rangle + b \qquad (5.31)

Here, the \alpha_i are the Lagrange multipliers used for the optimisation, b is the offset, and sv denotes the set of support vectors. In figure 5.19, a maximal margin (bold line) is shown that optimally separates the two classes (x, o). The highlighted examples in the plot mark the support vectors.
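
The following R sketch evaluates equation 5.31 for a new input using only the support vectors; the support vectors, Lagrange multipliers and offset are hypothetical placeholders, not values obtained in this thesis.

    # Hypothetical support vectors (rows), their labels, Lagrange multipliers and offset.
    SV    <- rbind(c(1.0, 1.5), c(-0.5, -1.0))
    y_sv  <- c(1, -1)
    alpha <- c(0.7, 0.7)
    b     <- -0.2
    f <- function(x) sum(alpha * y_sv * (SV %*% x)) + b   # equation 5.31 with a linear kernel
    sign(f(c(2.0, 0.5)))                                  # predicted class of a new input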

Classifying Residue Flexibility using SVMs

After having outlined the principles of support vector machines, the application of an SVM as a classifier is now described.

Figure 5.19: Scheme of a maximal margin. The maximal margin hyperplane (bold) separates the input (x, o) optimally. The bold-marked inputs denote the support vectors that define the margin hyperplane.

In this thesis, a support vector machine is used to classify amino acid side chains as flexible or non-flexible. Because every amino acid side chain has specific properties (e.g. charge, polarity, size and number of torsion angles), a separate SVM is trained for each torsion angle and side chain. In section 5.2.2, several residue-specific features have been outlined. From these, a feature vector has to be created. In order to choose those features that represent the side chains best, a data-driven approach is applied: principal component analysis (PCA) is used to select appropriate features and to reduce the dimension of the feature vector at the same time. A reduction of the dimensionality of the feature vector can increase the classification power of the SVM.

Figure 5.20: Spectrum of the eigenvalues of all features for ARG and χ1. Here, only the first 10 (of 21) eigenvalues are shown.

In PCA, the aim is to produce a set of uncorrelated variables representing the original information. To this end, the input data ({x} ∈ R^M) is rotated onto the principal axes using an orthogonal linear transformation. For dimension reduction, the first n principal components are then selected.

The selection of principal components can be guided by analysing the spectrum of the eigenvalues (see Fig. 5.20). A hint for cutting off after the first n components can be obtained by comparing the differences in variance from left to right.

In figure 5.20, the variance within the first four components is obviously greater than in the rest of the coefficients. Thus, selecting the first four coefficients is reasonable. Another possibility is to take only the first two or the first component because of the differences in the variances. In case of several possible cut-offs, the SVM is trained with different numbers of principal components, and the combination which achieves the highest classification accuracy is then taken.

Figure 5.21: Plot of the total classification accuracy of classifying LYS and SER for different numbers of principal components used as features.

Figure 5.21 shows the classification accuracy for different numbers of principal components for lysine and serine. In both cases, the eigenvalue spectrum supports several possible cut-offs (see Fig. 5.22(a) and 5.22(b)).

Figure 5.22: Eigenvalue spectra of the features for (a) SER and (b) LYS. Here, only the first ten eigenvalues are plotted; possible cut-offs are marked.
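
A minimal R sketch of this selection step, assuming the residue features are available as rows of a numeric matrix called features (a random placeholder here, not the thesis data set): prcomp() performs the orthogonal transformation, the variance spectrum corresponds to the eigenvalue plots in figures 5.20 and 5.22, and the first n scores form the reduced feature vectors.

    # Hypothetical stand-in for the residue feature matrix (one row per example, 21 features).
    set.seed(1)
    features <- matrix(rnorm(200 * 21), nrow = 200, ncol = 21)

    pca <- prcomp(features, center = TRUE, scale. = TRUE)   # rotation onto the principal axes
    variances <- pca$sdev^2                                 # eigenvalue spectrum (cf. Fig. 5.20)
    screeplot(pca, npcs = 10)                               # visual aid for choosing a cut-off

    n <- 4                                                  # candidate cut-off read off the spectrum
    reduced <- pca$x[, 1:n]                                 # first n principal components as features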

Initially, for all residues the feature vector consists of the following components (cf. section 5.2.2 and Fig. 5.23): a set of wavelet coefficients, the energy difference, the original conformation of the residue, the secondary structure information as well as the solvent accessible surface area value and the temperature factor. The order of the components within the feature vector is not important because the feature vector is processed by the PCA and transformed into a lower dimensional space. The features get merged by this transformation, so a mapping back from the reduced dimensions to individual features is no longer possible.

Figure 5.23: Initial feature vector before the PCA is applied. The number of wavelet coefficients depends on the level of the MSA they are taken from. The order of the components within the feature vector is arbitrary because the vector is processed further by a PCA.

For the first torsion angle (χ1), the first three to five principal components are chosen according to the analysis of the eigenvalue spectra. The concrete numbers for each residue and torsion angle are given in table 5.1. The eigenvalue spectra for each residue and torsion angle are shown in appendix B.4.

The resulting low dimensional feature vectors are then used to classify the residues' side chains as flexible or non-flexible using the support vector machine. Here, the support vector machine implemented in the R package (Ihaka & Gentleman, 1996; Dimitriadou et al., 2004) is trained. Because the number of training examples for some residues is small, the SVM is trained and evaluated using cross-validation: within each iteration, the input material is divided randomly into a small test set and a larger training set, the SVM is presented with the training set and afterwards evaluated on the test set. Here, a 10-fold cross-validation is chosen.
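
As a sketch of the training step, assuming the R implementation referred to above is the e1071 package (which provides svm() with built-in k-fold cross-validation); the reduced feature matrix and the flexibility labels below are random placeholders:

    library(e1071)                          # assumed SVM implementation for R
    set.seed(2)
    # Hypothetical reduced feature matrix (first n principal components) and flexibility labels.
    reduced  <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)
    flexible <- factor(sample(c("flexible", "rigid"), 200, replace = TRUE))

    model <- svm(x = reduced, y = flexible,
                 kernel = "radial",         # assumption; the kernel used is not stated in this section
                 cross  = 10)               # 10-fold cross-validation during training
    model$tot.accuracy                      # total cross-validation accuracy in percent
    predict(model, reduced[1:5, , drop = FALSE])   # classify a few examples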

residue   No. of principal components
           χ1   χ2   χ3   χ4
ARG         3    4    5    3
ASN         4    5    –    –
ASP         4    6    –    –
CYS         3    –    –    –
GLN         3    4    4    –
GLU         3    5    4    –
HIS         3    3    –    –
ILE         3    5    –    –
LEU         5    5    –    –
LYS         5    4    3    3
MET         3    6    5    –
PHE         5    5    –    –
SER         5    –    –    –
THR         4    –    –    –
TRP         4    6    –    –
TYR         5    5    –    –
VAL         5    –    –    –

Table 5.1: Number of the first n principal components selected for each residue and torsion angle. The principal components are taken as features to train the SVM.