4.1.2 Capabilities of State-of-the-Art Approaches

SCOP Id SCOP Superfamily Name                                        # Samples   Length, Mean (Std. Deviation)
                                                              Training    Test   Training        Test
a.1.1   Globin-like                                                 60      30   150.3 (13.6)    151.6 (11.1)
a.3.1   Cytochrome c                                                44      22   102.6 (24.1)    118.4 (32.6)
a.39.1  EF-hand                                                     49      25   138.1 (48.0)    122.0 (39.3)
a.4.5   "Winged helix" DNA-binding domain                           49      25   93.8 (26.6)     92.9 (23.1)
b.1.1   Immunoglobulin                                             207     104   108.9 (15.3)    106.7 (12.3)
b.10.1  Viral coat and capsid proteins                              64      32   278.0 (92.9)    262.1 (85.2)
b.29.1  Concanavalin A-like lectins/glucanases                      52      27   221.2 (51.2)    220.8 (72.9)
b.40.4  Nucleic acid-binding proteins                               47      24   113.1 (36.6)    111.5 (47.2)
b.47.1  Trypsin-like serine proteases                               55      28   231.4 (29.5)    226.0 (30.1)
b.6.1   Cupredoxins                                                 50      26   143.9 (34.6)    139.0 (31.5)
c.1.8   (Trans)glycosidases                                         62      31   376.5 (76.4)    397.8 (84.0)
c.2.1   NAD(P)-binding Rossmann-fold domains                       102      51   204.3 (58.9)    211.5 (75.1)
c.3.1   FAD/NAD(P)-binding domain                                   45      23   226.1 (93.3)    223.3 (86.3)
c.37.1  P-loop containing nucleotide triphosphate hydrolases       127      64   259.3 (120.4)   253.4 (85.6)
c.47.1  Thioredoxin-like                                            56      28   111.6 (38.2)    105.6 (35.3)
c.69.1  Alpha/Beta-Hydrolases                                       51      26   350.1 (103.7)   323.7 (25.0)
Total:                                                            1120     566

Table 4.1: Overview of the SCOPSUPER95_66 corpus, created to assess the capabilities of current Profile HMMs and to compare the effectiveness of the methods developed in this thesis. For every superfamily, the alphanumeric SCOP Id and its name as defined in the database are given. The last row summarizes the total numbers of samples.

[Figure 4.1 plot: histogram for the SCOPSUPER95_66 corpus; axes: Similarity Ranges [%] (bins 0−10, 10−15, ..., 90−95) and Sequences [%] (0 to 12); series: Training, Test]

Figure 4.1: Histogram of similarity ranges for the SCOPSUPER95_66 corpus, averaged over all 16 superfamilies involved (black: Training / blue: Test). It illustrates the almost uniform distribution of similarities over the whole range, with one exception at 20−25%. Note that the lower limit of the identity ranges as defined by SUPERFAMILY is 10%. Thus, the first bin covers a broader range of 10%, compared to 5% for all other bins.

[Kar98], the scores are calculated with respect to scores obtained by aligning the reverse sequence. Within the SAM package this null model is called the reversed null model (the general motivation for this choice of null model was explained in section 3.2.2 on page 61). In order to determine the statistical significance of alignment scores for discriminating between target hits and misses during homology detection, extreme value distributions are evaluated as described in section 3.1.1.¹ For both tasks, all models are evaluated independently for all appropriate test sequences, which is rather unusual compared to general pattern recognition applications, at least for the classification task.

¹ In fact, SAM's E-values are derived from log-odds scores S using the following formula: $E(S) = \frac{N}{1 + \exp(-\lambda S)}$, where N denotes the database size (SCOPSUPER95_66: 566), and λ is a scaling parameter which is 1 when using natural logarithms [Hug96, p. 91].

Classification Accuracy

In general pattern recognition applications, the capabilities of approaches to classification tasks are usually measured by the classification accuracy C and, more commonly, by its complement, the classification error E. They are defined as the ratio of correct (for C) or false (for E) classifications to the overall number of decisions. Usually, the numerator of this ratio is further subdivided into the number of substitutions, deletions, and insertions. For protein sequence classification, i.e. global string alignment as defined for Dynamic Programming techniques, this subdivision is not important because usually complete sequences are analyzed exclusively, i.e. without concatenations which could include insertions or deletions in the above meaning. Thus, both measurements are defined as follows:

\[
C = \frac{N_{\text{correct classifications}}}{N_{\text{overall decisions}}}, \qquad
E = \frac{N_{\text{false classifications}}}{N_{\text{overall decisions}}} = 1 - C. \tag{4.1}
\]
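As a minimal sketch of how equation (4.1) could be evaluated when every test sequence is scored against every superfamily model: the decision rule (pick the model with the smallest E-value), the helper names `classify` and `accuracy_and_error`, and the toy values are illustrative assumptions, not the procedure or data used in the thesis.

```python
from typing import Dict, List, Tuple


def classify(scores: Dict[str, float]) -> str:
    """Pick the superfamily whose model scores best for one sequence.

    Assumption (not stated in the text): smaller values are better, as for
    E-values. Use max() instead for higher-is-better scores.
    """
    return min(scores, key=scores.get)


def accuracy_and_error(decisions: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (C, E) from (predicted, true) label pairs, with E = 1 - C."""
    if not decisions:
        raise ValueError("at least one decision is required")
    n_correct = sum(1 for predicted, true in decisions if predicted == true)
    c = n_correct / len(decisions)
    return c, 1.0 - c


# Hypothetical toy data: E-values of four test sequences against three models.
toy_scores = [
    ({"a.1.1": 1e-12, "b.1.1": 0.3, "c.2.1": 2.0}, "a.1.1"),
    ({"a.1.1": 0.8, "b.1.1": 1e-7, "c.2.1": 0.5}, "b.1.1"),
    ({"a.1.1": 0.02, "b.1.1": 0.4, "c.2.1": 0.1}, "c.2.1"),  # misclassified
    ({"a.1.1": 5.0, "b.1.1": 1.2, "c.2.1": 1e-4}, "c.2.1"),
]
decisions = [(classify(scores), true) for scores, true in toy_scores]
C, E = accuracy_and_error(decisions)
print(f"C = {C:.3f}, E = {E:.3f}")
```

Since complete sequences are classified one at a time, every decision is either correct or false, so no insertion or deletion counts are needed.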

When applying state-of-the-art discrete Profile HMMs as created by SAM to the SCOPSUPER95_66 corpus, the classification accuracy is 67.1 percent. This corresponds to a classification error of 32.9 percent. For these values, the particular size of the test set, and a selected level of confidence of 95%, changes are statistically significant when the percentages differ by more than 3.9 percentage points, i.e. the confidence interval is [±3.9%] (cf. figure 4.2). To re-emphasize: for this representative task of remote homology classification, almost one third of the decisions regarding the appropriate superfamily affiliation are wrong when using the currently most powerful sequence analysis techniques, namely Profile HMMs. This is problematic especially for target validation applications within the drug discovery pipeline.

In table 4.2 the results for the classification task are summarized.

Measure                      Results [%]
Classification Accuracy C    67.1
Classification Error E       32.9
95% confidence interval      ±3.9

Table 4.2: Summary of the classification results for the SCOPSUPER95_66 corpus when applying discrete Profile HMMs (SAM).

Following this assessment of the capabilities of current methodologies for remote homology classification, the next section evaluates their actual effectiveness for detection tasks.

Detection Performance

Usually, target detection is performed by comparing alignment scores against some threshold which discriminates between target hits and target misses. As discussed for pairwise alignment techniques (cf. section 3.1 on pages 31ff), the significance of scores is usually judged by means of so-called E-values, which denote the expected number of false predictions occurring by chance at a particular score. Especially for remote homology detection, the major difficulty is the actual determination of a suitable threshold for the target or non-target decision.
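The threshold decision can be sketched as follows, assuming the E-value form quoted in the footnote on SAM above, E(S) = N / (1 + exp(−λS)) with database size N and λ = 1. The function names and the threshold of 0.01 are hypothetical choices for illustration and are not part of SAM's interface.

```python
import math


def e_value(score: float, database_size: int, lam: float = 1.0) -> float:
    """E(S) = N / (1 + exp(-lambda * S)), the form quoted in the footnote."""
    exponent = -lam * score
    if exponent > 700.0:
        # exp() would overflow; for such scores 1 + exp(x) is effectively exp(x).
        return database_size * math.exp(-exponent)
    return database_size / (1.0 + math.exp(exponent))


def is_hit(score: float, database_size: int, threshold: float) -> bool:
    """Accept a sequence as a target hit if its E-value stays within the threshold."""
    return e_value(score, database_size) <= threshold


N = 566  # test-set size of the corpus, as given in the footnote
for s in (-30.0, -10.0, 0.0, 10.0):
    # The threshold 0.01 is an arbitrary illustrative value.
    print(f"S = {s:6.1f}  E = {e_value(s, N):.3e}  hit: {is_hit(s, N, 0.01)}")
```

Under this formula, more negative scores yield smaller E-values, and a score of zero yields E = N/2, so the choice of threshold directly trades missed targets against false alarms.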

For the comparison of different techniques, Receiver Operating Characteristic (ROC) curves are often used (cf. e.g. [Mou04, pp. 192ff]). Here, the number of false positive predictions is plotted as a function of the corresponding number of false negative predictions. The threshold selected for discriminating between target hit and miss, which is usually defined with respect to e.g. E-values, is implicitly given as a particular point on the ROC curve. By means of certain criteria, e.g. the costs for false positive predictions vs. the costs for false negative predictions, the optimum threshold can be determined by analyzing ROC plots. Generally, the closer the ROC curve is located to the lower and left borders of the diagram, the better the detection performance of the underlying approach.
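A minimal sketch of how such a curve and a cost-based threshold could be obtained from scored sequences is given below. It assumes that a sequence counts as a hit when its E-value does not exceed the threshold; the names `roc_points` and `best_threshold`, the cost weights, and the toy data are hypothetical.

```python
from typing import List, Tuple

RocPoint = Tuple[float, int, int]  # (threshold, false negatives, false positives)


def roc_points(e_values: List[float], is_target: List[bool]) -> List[RocPoint]:
    """Sweep the threshold over all observed E-values and count both error types."""
    # Start below every E-value so that the "reject everything" point is included.
    thresholds = [float("-inf")] + sorted(set(e_values))
    points = []
    for t in thresholds:
        fn = sum(1 for e, tgt in zip(e_values, is_target) if tgt and e > t)
        fp = sum(1 for e, tgt in zip(e_values, is_target) if not tgt and e <= t)
        points.append((t, fn, fp))
    return points


def best_threshold(points: List[RocPoint], cost_fp: float, cost_fn: float) -> RocPoint:
    """Return the ROC point with minimal weighted cost of false predictions."""
    return min(points, key=lambda p: cost_fn * p[1] + cost_fp * p[2])


# Hypothetical toy data: six sequences, three of them true targets.
toy_e_values = [1e-9, 1e-4, 0.02, 0.5, 3.0, 40.0]
toy_is_target = [True, True, False, True, False, False]

points = roc_points(toy_e_values, toy_is_target)
for threshold, fn, fp in points:
    print(f"threshold = {threshold:8.2e}  false negatives = {fn}  false positives = {fp}")
print("cheapest point for cost_fp=1, cost_fn=5:", best_threshold(points, 1.0, 5.0))
```

Raising the cost of false negatives moves the selected point towards looser thresholds, which is the trade-off read off a ROC plot.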

Classification error rates which are determined using a finite test set are, strictly speaking, only estimates of the general capabilities of a recognition system. In order to obtain correct rates for the sample space of the underlying statistical process, theoretically, an infinite number of experiments needs to be performed. Instead, for the error probabilities Ê, estimated using the finite test set, so-called confidence intervals [E_l, E_u] are defined. Such an interval defines a range containing the actual probability E with a given statistical evidence, the so-called level of confidence.

A classification experiment can be interpreted as a Bernoulli process where only two events may occur: A (correctly recognized) and Ā (not or falsely recognized). Thus, the confidence interval for a given probability E needs to be determined by evaluating the Binomial distribution. According to the de Moivre-Laplace theorem (local limit theorem), the normalized deviation of the estimate Ê from E, obtained from N experiments, can be interpreted as asymptotically normally distributed:

\[
\frac{E - \hat{E}}{\sqrt{E(1-E)/N}} \approx \mathcal{N}(0,1).
\]

When performing N experiments, the lower and the upper boundaries of the confidence interval are defined as follows:

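\[
E_l = \frac{N}{N+z^2}\left[\hat{E} + \frac{z^2}{2N} - z\sqrt{\frac{\hat{E}(1-\hat{E})}{N} + \frac{z^2}{4N^2}}\,\right],
\]

\[
E_u = \frac{N}{N+z^2}\left[\hat{E} + \frac{z^2}{2N} + z\sqrt{\frac{\hat{E}(1-\hat{E})}{N} + \frac{z^2}{4N^2}}\,\right].
\]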

Here, z designates the (1 − α/2)-quantile of the standard normal distribution, whose value is documented in tables. Usually, for classification experiments as performed in this thesis, a level of confidence of 1 − α = 95% is used, which means that the actual classification error rate E is within the given confidence range with a probability of 95 percent.

Based on confidence intervals, changes in classification error rates, caused by e.g. changes in the parameters of the underlying classification system, can additionally be evaluated regarding their statistical significance. If the changed classification error rate is outside the confidence interval, the change can be interpreted as statistically significant. Otherwise, it was most likely caused by chance.

Figure 4.2: General estimation of confidence intervals for probabilities, i.e. classification error or accuracy rates (cf. e.g. [Bro91, Sch96]). The derivation given is adopted from [Wie03].
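The boundaries E_l and E_u from figure 4.2 can be evaluated numerically as in the following sketch. The function names are hypothetical, and z = 1.96 is used as the usual rounded 0.975-quantile of the standard normal distribution for a 95% level of confidence. For the reported error of 32.9% on 566 test sequences, the interval should come out at roughly 29.2% to 36.9%, i.e. a half-width close to the ±3.9% quoted above.

```python
import math
from typing import Tuple


def confidence_interval(e_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Lower and upper boundary (E_l, E_u) for an estimated error rate e_hat.

    e_hat is a fraction in [0, 1], n the number of experiments, and z the
    (1 - alpha/2)-quantile of the standard normal distribution.
    """
    if n <= 0 or not 0.0 <= e_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= e_hat <= 1")
    centre = e_hat + z * z / (2 * n)
    spread = z * math.sqrt(e_hat * (1.0 - e_hat) / n + z * z / (4 * n * n))
    scale = n / (n + z * z)
    return scale * (centre - spread), scale * (centre + spread)


def is_significant(new_rate: float, e_hat: float, n: int, z: float = 1.96) -> bool:
    """A changed error rate is significant if it lies outside the interval."""
    lower, upper = confidence_interval(e_hat, n, z)
    return not (lower <= new_rate <= upper)


lower, upper = confidence_interval(0.329, 566)
print(f"E_l = {100 * lower:.1f}%, E_u = {100 * upper:.1f}%")
print("28.0% significant:", is_significant(0.280, 0.329, 566))
print("31.0% significant:", is_significant(0.310, 0.329, 566))
```

Note that the interval is not exactly symmetric around the estimate, so the ± notation is a convenient approximation.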

For the assessment of the detection performance of discrete Profile HMMs, the scores obtained by aligning the sequences of the complete SUPERFAMILY database to the appropriate superfamily models were analyzed. The complete SUPERFAMILY hierarchy of SCOP sequences with a maximum of 95% residue-level similarity contains approximately 8,000 entries. The number of false predictions was determined by means of the E-values generated by SAM. For a general overview of the detection performance of discrete Profile HMMs, figure 4.3 summarizes the results of all alignments of SCOP sequences to all 16 superfamilies in a single ROC curve. Since detection decisions are based on absolute scores, which are analyzed regarding their statistical significance, the number of false predictions can usually be limited to some reasonable number. Thus, a working area corresponding to such an exemplary limitation is shown separately as a gray shaded rectangle.

It can be seen that the weak performance measured for the classification task carries over to the detection task. The number of false positive predictions remains rather high for reasonable numbers of false negatives. This is problematic especially for drug discovery applications because of the implied increased costs for useless further investigations within the drug design pipeline. Additionally, the number of false positives cannot be reduced unless the number of false negative predictions increases significantly, which is again problematic for drug discovery because suitable candidate sequences might be missed.

[Figure 4.3 plot: ROC curve for the SUPERFAMILY95_66 corpus; x-axis: # false negatives, y-axis: # false positives; curve: discrete Profile HMMs (SAM); inset: working area]

Figure 4.3: ROC curve illustrating the combined results for remote homology detection based on the SCOPSUPER95_66 corpus using discrete Profile HMMs (SAM). The gray shaded rectangle highlights the detection performance for the biologically most relevant working area when limiting the number of false predictions reasonably.

For a further family-specific analysis of the detection performance, the detection results are inspected individually for every superfamily contained in the SCOPSUPER95_66 corpus. Generally, the results are comparable for all superfamilies (cf. figure 4.4; for clarity the presentation of the ROC curves is split into two diagrams). Some exceptions exist, performing either better than average (a.39.1, b.6.1, b.47.1) or worse (b.1.1, b.40.4, c.2.1). An analysis of these exceptional cases regarding the sequence similarities of the respective training and test sets, as well as regarding the respective number of training samples, does not reveal any peculiarities. Figure 4.5 shows the histograms of sequence similarities for the exceptional superfamilies mentioned above individually, illustrating the generally uniform distribution of the sequence similarities for the specific superfamilies. Thus, the reason for the non-average performance seems to be intrinsic.

In order to judge the effectiveness of target detection methods for practical applications, e.g. within pharmaceutical tasks, usually some characteristic values are extracted from the respective ROC curves. When fixing the maximum number of false predictions allowed for one type (e.g. false positives), the corresponding number of false predictions of the other type (here false negatives) is a good measure. Usually, the percentage of allowed false predictions is set to 5%. Table 4.3 summarizes these characteristic values, illustrating the still rather weak performance of current discrete Profile HMMs for remote homology detection tasks.

False Negative Predictions [%]    False Positive Predictions [%]
for 5 % False Positives           for 5 % False Negatives
26.1                              57.6

Table 4.3: Characteristic values for SCOPSUPER95_66 detection experiments: At fixed working points of the ROC curve allowing 5% false predictions of one type, the corresponding percentages of false predictions of the other type are given. It can be seen that improvements are needed since rather large percentages of false predictions occur.
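Reading such characteristic values off a ROC curve can be sketched as follows. The sketch assumes the curve is available as (false negative %, false positive %) pairs, with each percentage taken relative to the targets and non-targets respectively; this normalization, the function name `rate_at_limit`, and the toy points are assumptions for illustration and do not reproduce the values of table 4.3.

```python
from typing import List, Tuple


def rate_at_limit(points: List[Tuple[float, float]], limit: float, limited: str) -> float:
    """Smallest rate of one error type while the other stays within the limit.

    points holds (false_negative_pct, false_positive_pct) pairs of a ROC curve.
    limited names the error type that is capped: "fp" or "fn".
    """
    if limited == "fp":
        admissible = [fn for fn, fp in points if fp <= limit]
    elif limited == "fn":
        admissible = [fp for fn, fp in points if fn <= limit]
    else:
        raise ValueError('limited must be "fp" or "fn"')
    if not admissible:
        raise ValueError("no working point satisfies the limit")
    return min(admissible)


# Hypothetical ROC points, not the measured curve from figure 4.3.
toy_curve = [(100.0, 0.0), (60.0, 1.0), (30.0, 4.0), (20.0, 8.0), (5.0, 40.0), (0.0, 100.0)]

fn_at_5_fp = rate_at_limit(toy_curve, 5.0, "fp")
fp_at_5_fn = rate_at_limit(toy_curve, 5.0, "fn")
print(f"false negatives at 5% false positives: {fn_at_5_fp}%")
print(f"false positives at 5% false negatives: {fp_at_5_fn}%")
```

The sketch picks the best measured working point within the limit instead of interpolating between points, which keeps the reported value conservative.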

In document: Advanced stochastic protein sequence analysis (pages 95-100)