
Validation

7.3 Classification

7.3.2 Investigation of Geometric Approaches

Approaches based on a geometric representation are, on the one hand, quite rigid, a property that can cause problems, especially if such measures are applied to noisy data that is subject to structural flexibility and mutations. On the other hand, such approaches do not cause a loss of information, scale very well, and do not lead to exorbitantly large search spaces that even contain geometrically infeasible solutions. From this perspective, they should be very powerful measures. The LPCS method as introduced in this thesis comes with a parameter λ that influences the obtained similarity scores.

This parameter is evaluated in the following using a linear (one-dimensional) grid from 0 to 1 in steps of 0.1. For λ = 0 a subset relation is considered, whereas for λ = 1 the concept of equivalence is used to measure similarity (cf. Section 2.4.1). The results summarized in Table 7.5 indicate that LPCS is a very good measure on protein binding sites, at least on the binding sites contained in the ATP/NADH dataset, since classification rates above 90% were reached on this non-trivial classification problem. As illustrated in Figure 7.5, the parameter λ has some influence on the classification rates. Even though this influence is not very strong, a trade-off between equivalence and inclusion performs best. This is in particular the case for settings of λ in the interval [0.5, 0.8]. Such a setting has benefits compared to λ = 1.0, especially when two structures differ in size. Under strict equivalence, similarity is defined by the minimum of the scores of two superpositions; however, one of the two calculated superpositions will place the smaller one in the middle of the other

Table 7.5: Classification rates on the ATP/NADH dataset obtained by using LPCS scores for different values of λ and k.

λ \ k    1      3      5      7      9
0.0    0.896  0.873  0.865  0.834  0.825
0.1    0.899  0.873  0.868  0.837  0.828
0.2    0.904  0.879  0.870  0.837  0.831
0.3    0.907  0.878  0.868  0.831  0.831
0.4    0.910  0.885  0.870  0.842  0.825
0.5    0.907  0.893  0.879  0.859  0.856
0.6    0.901  0.901  0.896  0.882  0.854
0.7    0.907  0.910  0.870  0.873  0.848
0.8    0.913  0.899  0.865  0.854  0.870
0.9    0.904  0.885  0.879  0.868  0.867
1.0    0.890  0.873  0.854  0.845  0.828

one in order to minimize the sum of distances between equally labeled points.

Even though the sum of distances is minimized, the score this superposition returns is low. Hence, even if the other superposition returns a high score, the overall score will be artificially low. However, compared to majority voting, which would lead to a classification rate of 60.28% on the ATP/NADH dataset, LPCS performs significantly better regardless of the chosen λ value.
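The evaluation scheme behind a table such as Table 7.5 can be sketched as follows. This is only a minimal illustration, assuming that a matrix of pairwise similarity scores is already available and that a leave-one-out protocol is used; the evaluation protocol is not stated in this section, and the function names, the toy similarity matrix, and the toy labels are hypothetical and not taken from the ATP/NADH dataset.

```python
from collections import Counter


def knn_loo_rate(sim, labels, k):
    """Leave-one-out k-NN classification rate from a precomputed similarity matrix."""
    n = len(labels)
    correct = 0
    for i in range(n):
        # Rank all other items by decreasing similarity to item i.
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        # Majority vote among the k most similar items.
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / n


def majority_rate(labels):
    """Rate achieved by always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy data: five binding sites with made-up similarity scores.
labels = ["ATP", "ATP", "NADH", "NADH", "ATP"]
sim = [
    [1.0, 0.9, 0.2, 0.1, 0.8],
    [0.9, 1.0, 0.3, 0.2, 0.7],
    [0.2, 0.3, 1.0, 0.9, 0.1],
    [0.1, 0.2, 0.9, 1.0, 0.2],
    [0.8, 0.7, 0.1, 0.2, 1.0],
]

rates = {k: knn_loo_rate(sim, labels, k) for k in (1, 3)}
baseline = majority_rate(labels)
```

Repeating this computation for every λ on the grid, each with its own similarity matrix, and for every k would fill one cell of the table per (λ, k) pair; the majority-class rate serves as the baseline against which the rates are judged.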

This behavior needs to be examined in more detail, in particular because LPCS has the drawback of operating on a very rigid representation in the form of labeled point clouds. Although the classification rates of the other methods are not known so far, one can easily imagine that improving on classification rates of more than 90% on a non-trivial classification problem is not simple. To give some insight into the LPCS measure, Figure 7.6 shows a representative superposition of two protein binding sites.

Due to the rigid representation of protein binding sites, one cannot expect a perfect match; however, in contrast to concepts such as the maximum common subgraph, LPCS does not require perfect matches. Instead, imperfectly matching points are also considered, and their deviations are penalized, which reduces the score LPCS returns for the comparison. From this point of view, LPCS can

Figure 7.5: Classification rates of the k-NN classifier versus λ for different choices of k (x-axis: λ from 0.0 to 1.0; y-axis: classification rate from 0.82 to 0.92; one curve each for k = 1, 3, 5, 7, 9).

even be considered as a kind of extended rmsd value of two structures that does not require a one-to-one correspondence between points and that even takes physicochemical properties into account. Theoretically, however, this approach (or, more precisely, the underlying representation) also has a central disadvantage, which is illustrated in Figure 7.7, where two sets of points are given. Considering them as point clouds allows no transformation of the structures themselves (at least when applying the algorithm presented in this thesis). Hence, the LPCS approach tries to minimize the distances between pairs of points from different point clouds, leading to the result depicted in Figure 7.7. By allowing modifications of the

Figure 7.6: Superposition of the structures 1dy3.1 (red) and 1j1z.10 (black).

Figure 7.7: Changing the angle between the two lines connecting the outer balls would allow a perfect match; an algorithm working on a rigid representation, however, must accept a mismatch and must try to minimize it. (Edge labels shown in the figure: e_a, e_b, e_c.)

model (e.g., as in the case of methods following the concept of graph edit distance), a simple adaptation of the angle between the lines connecting the outer nodes would allow for a perfect match. Such an adaptation could be realized by changing the edge weight E(e_c) to E(e_a) + E(e_b).
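The situation of Figure 7.7 can be reproduced with a small planar toy example. The sketch below is not the LPCS algorithm; it assumes a fixed one-to-one correspondence between three points and uses the closed-form least-squares rigid superposition that exists in 2D. The helper names (chain, outer_distance, rigid_rmsd_2d) and the chosen edge lengths and angle are hypothetical.

```python
import math


def chain(ea, eb, angle_deg):
    """Three points (outer, middle, outer) with edge lengths ea, eb and the given angle at the middle point."""
    phi = math.pi - math.radians(angle_deg)
    return [(-ea, 0.0), (0.0, 0.0), (eb * math.cos(phi), eb * math.sin(phi))]


def outer_distance(points):
    """Length of the edge connecting the two outer points."""
    (x0, y0), _, (x2, y2) = points
    return math.hypot(x2 - x0, y2 - y0)


def rigid_rmsd_2d(mobile, target):
    """rmsd after the least-squares optimal rotation and translation in 2D."""
    n = len(mobile)
    cmx = sum(x for x, _ in mobile) / n
    cmy = sum(y for _, y in mobile) / n
    ctx = sum(x for x, _ in target) / n
    cty = sum(y for _, y in target) / n
    # Center both point sets on their centroids (optimal translation).
    p = [(x - cmx, y - cmy) for x, y in mobile]
    q = [(x - ctx, y - cty) for x, y in target]
    # Closed-form optimal rotation angle in the plane.
    dot = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(p, q))
    cross = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(p, q))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    sq = sum((c * px - s * py - qx) ** 2 + (s * px + c * py - qy) ** 2
             for (px, py), (qx, qy) in zip(p, q))
    return math.sqrt(sq / n)


bent = chain(1.0, 1.0, 120.0)      # outer edge shorter than ea + eb
straight = chain(1.0, 1.0, 180.0)  # outer edge equals ea + eb

mismatch = rigid_rmsd_2d(bent, straight)      # rigid fit leaves a residual
after_edit = rigid_rmsd_2d(straight, straight)  # no residual once the angle is adapted
```

With the angle opened to 180 degrees, the outer edge has length E(e_a) + E(e_b), which corresponds to the edge-weight change described above; a purely rigid superposition of the bent and the straight chain can only minimize the remaining deviation, not remove it.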

Consideration of the Constructed Alignment

So far it was shown that LPCS scores lead to very high accuracies when used as inputs for a classifier, even though the underlying model is very rigid. To explain this behavior further, the calculated superpositions are inspected more closely in the next step by transforming them into alignments. Although the similarity scores obtained by LPCS do not depend on the alignments, the alignments strongly depend on the superpositions. Moreover, the conserved pattern of the alignment makes the largest contribution to the LPCS score obtained.

Having found the optimal superposition, graph matching (cf. Section 4.2) can be applied to obtain an alignment. Finally, a conserved pattern can be derived from this alignment. Conserved patterns derived from the alignment of the structures 1dy3.1 and 1j1z.10, both of which are taken from the ATP set, are visualized in Figure 7.8 after an additional superposition calculated by means of the Kabsch (1976) algorithm. It is worth mentioning that the two structures are not “cherry-picked”; instead, the results are representative of many structures that were tried, and hence they will be used throughout the whole study.

As can be seen, even a rigid geometric approach is able to superimpose two point clouds in such a way that a common pattern of the two clouds is located spatially close. Hence, the distances between pairs of points from the different

(a) Geometric alignment, k=0.75 Å; rmsd=0.43 Å

(b) Geometric alignment, k=4.00 Å; rmsd=1.88 Å

Figure 7.8: Visualization of the conserved pattern (ω = 1, ξ = 1) obtained by constructing geometric alignments of the structures 1dy3.1 (red) and 1j1z.10 (black). One-to-one correspondences are visualized as straight lines.

clouds that participate in this pattern are quite small. As a result, they contribute strongly to the score. On the other hand, points that do not have a spatially close counterpart in the other cloud contribute very little to the score. Thus, the final score is determined in a similar way as in the mcs measure, namely by dividing the size of the common subgraph by the size of the larger graph. In contrast to the classical mcs measure, however, LPCS also takes into consideration the points that do not match, but down-weights them according to the exponential function of the negative distance. Figure 7.8 shows alignments in which assignments with a distance above 0.75 Å and 4 Å, respectively, are removed. In the alignment depicted in Figure 7.8 (a), a conserved pattern can indeed be recognized that matches almost perfectly. As will be shown later, the largest common pattern that matches up to an error of 0.2 Å has a size of 7 pseudocenters, too. Hence, LPCS was able to superimpose the structures in such a way that a pattern of the same size as in the case of the maximum common subgraph approach (see results on graph-based methods) was placed spatially close. Moreover, by increasing the maximally allowed distance between assigned pseudocenters, an even larger approximate common pattern is found. Using the assignments to derive an edit path, an error-tolerant approach is finally obtained. However, the error tolerance does not affect the structure completely: the transformation rules consist only of some slight movements, while the overall structure of the protein binding sites

remains the same. This is underpinned by the obtained rmsd value, which is below 2 Å. When the distance threshold k is reduced to 0.75 Å, the rmsd becomes 0.43 Å and the alignment does not tolerate larger deviations; hence, the number of pseudocenters in the conserved pattern is reduced.
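The effect of the distance threshold on the conserved pattern can be illustrated with a short sketch. It is a simplification under stated assumptions: the distances of the assigned pseudocenter pairs are hypothetical values in Å, the rmsd is computed directly from these distances without the additional Kabsch superposition used for Figure 7.8, and exp(-d) merely illustrates the exponential down-weighting mentioned above rather than reproducing the exact LPCS score. All function names are hypothetical.

```python
import math


def filter_assignments(distances, cutoff):
    """Keep only assignments whose distance does not exceed the cutoff (in Å)."""
    return [d for d in distances if d <= cutoff]


def rmsd(distances):
    """Root mean square deviation over the retained assignments."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def exponential_weights(distances):
    """Illustrative down-weighting: close pairs get weights near 1, distant pairs near 0."""
    return [math.exp(-d) for d in distances]


# Hypothetical distances (Å) between assigned pseudocenters after superposition.
distances = [0.1, 0.3, 0.5, 0.7, 1.5, 2.5, 3.9, 6.0]

tight = filter_assignments(distances, 0.75)  # strict pattern, few pairs
loose = filter_assignments(distances, 4.0)   # tolerant pattern, more pairs
weights = exponential_weights(distances)
```

Because every retained distance is at most the cutoff, the rmsd of the retained pairs can never exceed the cutoff; a smaller cutoff therefore yields a smaller but more tightly matching pattern, while a larger cutoff admits more pairs at the cost of a higher rmsd.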

Summary

The results obtained can be summarized as follows: Methods based on point clouds are very interesting tools for processing protein binding sites. It was shown that the similarity measure LPCS leads to high classification rates. On a non-trivial classification problem such as the one used here, a rate above 90%

clearly indicates a very effective measure. By choosing λ values different from 1, LPCS allows the concept of equivalence to be relaxed. This may be beneficial because the LigSite algorithm, which is used to detect cavities on the surface of proteins, has some problems identifying the border of a cavity. Here, a measure based on the trade-off between equivalence and inclusion clearly has advantages, which were also indicated by the increased classification rates. A further technique allowing for the calculation of alignments is also available for this representation.

Among other things, such alignments can be considered as a transformation rule for transforming the first point cloud into the second one, or vice versa. Hence, this technique introduces, at least to some degree, flexibility to the labeled point cloud representation. The rmsd values obtained when considering conserved patterns grow with the parameter k. Hence, geometric alignment allows the user to adjust the degree of structural flexibility a priori. This allows the construction of rigid alignments as well as of alignments that are more flexible in terms of deformation of the structures, while always avoiding a complete deformation of the structure.