
5.2 Parameter influence on algorithmic performance

5.2.2 Influence of the scoring parameter

As mentioned earlier, it is unclear which measure is more suitable for the task of comparing protein binding sites. To assess the influence of α on the predictive performance of both the global and the semi-global approach, both algorithms were evaluated for different values of α.

Footnote 2: In the case of SEGA, a high α emphasizes mutual inclusion.

[Plots: accuracy against nneigh (2 to 16) for k = 1, 3, 5, 7, 9; panel (a) α = 0, panel (b) α = 1.]

Figure 5.6: Performance of k-nearest neighbor classification in a ten-fold stratified cross-validation on the four-class dataset for α = 1 and α = 0. The accuracy is plotted for different values of nneigh.

To this end, pairwise classifications were carried out on the four-class dataset used in the previous experiment, this time with a fixed parameter setting but variable α.

Each pair of classes of this dataset was analyzed separately, comparing each class with every other. This was done to assess whether one specific setting of α proves superior on different problems.
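As a small illustration of this pairwise setup, the two-class problems can be enumerated from the four ligand classes. The class labels are taken from the figure legends of this section; the variable names are hypothetical and serve only this sketch.

```python
from itertools import combinations

# Ligand classes of the four-class dataset (labels as used in the figure legends).
classes = ["ATP", "FAD", "Heme", "NADH"]

# Every unordered pair of classes defines one two-class problem.
pairwise_problems = list(combinations(classes, 2))

for first, second in pairwise_problems:
    print(f"{first} vs. {second}")
```

Four classes yield six unordered pairs, which matches the six curves shown in the per-problem plots.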

Intuitively, one would assume that a conjunctive measure would perform better for


[Plot: runtime in ms (logarithmic axis, 10^0 to 10^5) against nneigh (2 to 16).]

Figure 5.7: Runtimes obtained for 1000 random comparisons for different values of nneigh

nneigh   2       3       4       5       6       7       8
µ        3.6269  1.4160  0.5749  0.3172  0.2814  0.3656  0.5529
σ        2.2753  0.6996  0.2196  0.0925  0.0650  0.0776  0.1139

nneigh   9       10      11      12      13      14      15      16
µ        0.8179  1.2414  1.7499  2.4743  3.3990  4.5653  5.8799  7.0440
σ        0.1653  0.2460  0.3484  0.4900  0.6717  0.9020  1.1534  1.3680

Table 5.4: Runtime requirements [s] of SEGA on the test dataset for different values of nneigh.

groups of globally similar binding sites with a high degree of structural conservation, whereas the disjunctive measure should be more suitable if only a fraction of the binding sites correspond to each other. In the latter case, the non-matching part of the larger cavity will have less influence on the similarity score.
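The intuition behind the two extremes can be made concrete with a toy score. The following is only a sketch under stated assumptions: it is not the SEGA measure and not measure (4.9). The function name `blended_similarity`, the normalization by the larger or smaller site, and the direction in which α interpolates are all hypothetical choices made for illustration.

```python
def blended_similarity(n_matched, size_a, size_b, alpha):
    """Hypothetical blend of a conjunctive and a disjunctive overlap score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if min(size_a, size_b) <= 0:
        raise ValueError("both sites must be non-empty")
    # Conjunctive extreme (assumption): normalize by the larger site, so the
    # non-matching part of the larger cavity lowers the score.
    conjunctive = n_matched / max(size_a, size_b)
    # Disjunctive extreme (assumption): normalize by the smaller site, so it
    # suffices that the smaller site is contained in the larger one.
    disjunctive = n_matched / min(size_a, size_b)
    return alpha * conjunctive + (1.0 - alpha) * disjunctive


# A small pocket (10 elements) almost fully contained in a large one (30 elements).
for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(blended_similarity(9, 10, 30, alpha), 3))
```

In this toy setting the score of the partially matching pair drops as α moves towards the conjunctive extreme, whereas for two sites of equal size the choice of α makes no difference.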

In this respect, solving all two-class problems separately might be particularly interesting, since three classes consist of binding sites that interact with rather flexible ligands (ATP, NAD, FAD), whereas the fourth group consists of binding sites with rigid porphyrin rings. Hence, it is more likely for the first three classes to show structural diversity, since the ligands can be bound in different conformations. The fourth class, on the other hand, should be more homogeneous, as the porphyrins will

[Plot: accuracy (0.6 to 1) against α (0 to 1) for the two-class problems ATP−FAD, ATP−Heme, ATP−NADH, FAD−Heme, NADH−FAD and NADH−Heme.]

Figure 5.8: Mean accuracy on different two-class problems for different α derived from 10-fold stratified cross-validation using SEGA (nneigh = 10, k = 1).

not vary in conformation and thus are more likely to enforce a certain topology of the binding site. Of course, whether this is actually the case strongly depends on the directional interactions between the ligands and the cavities.

5.2.2.1 Influence of the scoring parameter on the performance of SEGA

Classification experiments were again carried out for each two-class problem using SEGA with nneigh = 10. Results were obtained by 10-fold stratified cross-validation using a k-nearest neighbor classifier and compared with respect to different α. Performance was assessed in terms of classification accuracy. Fig. 5.8 shows the mean accuracy for different values of α; the results depicted are those for k = 1.
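The evaluation protocol can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: it assumes a precomputed pairwise distance matrix (which would have to be recomputed for every value of α), and all function names as well as the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold so that class proportions are kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


def knn_predict(distances, labels, train, query, k=1):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(train, key=lambda i: distances[query][i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(distances, labels, k=1, n_folds=10, seed=0):
    """Mean accuracy over the folds of a stratified cross-validation."""
    folds = stratified_folds(labels, n_folds, seed)
    accuracies = []
    for test in folds:
        if not test:
            continue
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        hits = sum(
            knn_predict(distances, labels, train, q, k) == labels[q] for q in test
        )
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


# Toy stand-in for one two-class problem: two well-separated groups on a line.
points = [0.1 * i for i in range(20)] + [10.0 + 0.1 * i for i in range(20)]
labels = ["A"] * 20 + ["B"] * 20
distances = [[abs(p - q) for q in points] for p in points]

print(cross_validated_accuracy(distances, labels, k=1))
```

Working on a distance matrix rather than on feature vectors reflects that binding sites are compared pairwise; repeating the call for each α and each two-class problem would give one curve per problem.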

In this case, no single value of α is preferable. For some problems, α apparently has little influence on the classification performance, whereas for the ATP/FAD and the NADH/FAD problems a high α appears to be detrimental, which favors one-sided inclusion. This might indicate that the proteins of these classes are more heterogeneous in structure, which is not unreasonable, given that FAD-, ATP- and NADH-binding proteins carry out a variety of different functions and exhibit different folds.

In the case of pairwise classification involving heme, a tendency towards mutual inclusion can be observed, although the effect is minimal. This might be explained by the porphyrin-containing pockets being more homogeneous in size and structure,


[Plot: accuracy (0.7 to 1) against α (0 to 1) for the two-class problems ATP−FAD, ATP−Heme, ATP−NADH, FAD−Heme, NADH−FAD and NADH−Heme.]

Figure 5.9: Mean accuracy on different two-class problems for different α derived from 10-fold stratified cross-validation.

so that mutual inclusion is more likely to occur in this class. Apparently, α is a problem-specific parameter.

5.2.2.2 Influence of the scoring parameter on the performance of GAVEO

Similar experiments were carried out for the GAVEO and GAVEOc approaches, since the corresponding similarity measure (4.9) is again a conjunction of two extremes controlled by a trade-off parameter α. The algorithms were parametrized according to the parameter setting obtained previously. Classification results were obtained from 10-fold stratified cross-validation using k-nearest neighbor classification. The results are given in Fig. 5.9 and Fig. 5.10.

As expected, both GAVEO variants yield similar results, in both cases indicating a slight preference for low α values. In general, however, the influence of α appears to be low. Note that the measure used here is a similarity measure, whereas the previous section used a distance measure. Thus, a low value of α here indicates a preference for the conjunctive measure.

Apparently, in the case of the GAVEO approach, the performance does not vary greatly across the different two-class problems, in contrast to SEGA. While this seems to be at odds with the previous results, one has to keep in mind that GAVEO measures a global alignment of graphs, i.e., the similarity of the graphs as a whole, whereas the

[Plot: accuracy (0.7 to 1) against α (0 to 1) for the two-class problems ATP−FAD, ATP−Heme, ATP−NADH, FAD−Heme, NADH−FAD and NADH−Heme.]

Figure 5.10: Mean accuracy on different two-class problems for different α derived from 10-fold stratified cross-validation.

SEGA measure quantifies the aggregated distances between local subgraphs. Thus, the results here are not directly comparable to those of the previous section, since the two measures are based on different concepts of similarity.