

6.3 Advanced Stochastic Protein Family Models for Small Training Sets

6.3.1 Effectiveness of Sub-Protein Unit based Models


where the classification errors for both Profile HMMs and SPU based protein family models are compared. Relative to discrete Profile HMMs, the classification error of the SPU based models decreases by almost 29% (from 32.9% to 23.5%). This improvement is close to the one achieved by semi-continuous feature based (SCFB) Profile HMMs (37.1% relative), i.e. the proof of concept for the alternative protein family modeling approach is given.

Modeling Variant         Classification Error E [%]   Relative Change ∆E [%]
                                                      (Base: Discrete Profile HMMs)
-----------------------------------------------------------------------------------
Discrete Profile HMMs    32.9                         –
SCFB Profile HMMs        20.7                         -37.1
SCFB SPU HMMs            23.5                         -28.6

Table 6.5: Classification performance for discrete Profile HMMs, their semi-continuous feature based counterparts (MLLR adapted feature space representation), and the new modeling approach using SPUs. Target models estimated using feature based building blocks significantly outperform state-of-the-art models for SCOPSUPER95 66 while reaching performance comparable to SCFB Profile models.
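The relative change ∆E in table 6.5 can be reproduced from the absolute error rates. The following minimal sketch assumes that ∆E is defined as the difference to the baseline error divided by the baseline error; the function name `relative_change` is chosen here for illustration only.

```python
def relative_change(error: float, baseline: float) -> float:
    """Relative change of an error rate w.r.t. a baseline, in percent.

    Assumed definition: (E - E_base) / E_base * 100.
    """
    return (error - baseline) / baseline * 100.0


# Error rates taken from table 6.5; the baseline is discrete Profile HMMs.
baseline = 32.9
for name, err in [("SCFB Profile HMMs", 20.7), ("SCFB SPU HMMs", 23.5)]:
    print(f"{name}: {relative_change(err, baseline):+.1f} %")
```

Rounded to one decimal place, the two values agree with the last column of the table.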

SCOPSUPER95 66: Evaluation of Remote Homology Detection

Following the assessment of the classification capabilities of the new SPU based protein family models, their applicability to remote homology detection is evaluated on the SCOPSUPER95 66 corpus. As in the preceding presentations of results for similar experiments, ROC curves are shown in figure 6.4, and the corresponding characteristic values at fixed working points of the curves are given in table 6.6.
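The ROC curves used here plot absolute numbers of false positives against false negatives. As a rough illustration of how such points arise, the following sketch sweeps a decision threshold over a list of scores. It is a generic procedure under stated assumptions, not the evaluation code of this work: the function name `roc_counts`, the convention "accept if score >= threshold", and the example scores are all hypothetical.

```python
from typing import List, Tuple


def roc_counts(scores: List[float],
               is_member: List[bool]) -> List[Tuple[float, int, int]]:
    """Return (threshold, #false positives, #false negatives) per threshold.

    Assumption: a sequence is accepted as a family member when its
    score is greater than or equal to the threshold.
    """
    points = []
    for threshold in sorted(set(scores)):
        # Non-members that are accepted.
        fp = sum(1 for s, m in zip(scores, is_member)
                 if s >= threshold and not m)
        # Members that are rejected.
        fn = sum(1 for s, m in zip(scores, is_member)
                 if s < threshold and m)
        points.append((threshold, fp, fn))
    return points


# Made-up scores for five sequences (not data from the experiments).
example_scores = [4.0, 2.5, 1.0, -0.5, -2.0]
example_labels = [True, True, False, True, False]
for threshold, fp, fn in roc_counts(example_scores, example_labels):
    print(threshold, fp, fn)
```

Raising the threshold trades false positives for false negatives, which is the trade-off the curves visualize.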

Note that, based on the results of informal experiments on different data, the UBM architecture was changed from the structured UBM containing 30 states to the classical UBM of Douglas Reynolds consisting of a single state (cf. section 5.2.4). The reason for this decision is as follows: When the structured UBM is applied, the selectivity is perfect, i.e. no false positive predictions are obtained at all. However, compared to the single state UBM, rather large numbers of false negative predictions occur. Since the competitive model evaluation delivers a hard decision for one particular model (either target or UBM), this number cannot be reduced in any later step: the affected sequences are rejected before the threshold based decision on the log-odds scores is performed. With a single state UBM, slightly more false positives are obtained, but the number of false negatives is reduced drastically.
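The two-stage decision described above can be sketched as follows. This is a simplified illustration under assumptions, not the implementation used for the experiments: the record type `Scored`, its field names, and the example numbers are hypothetical, and the log-odds score is treated as a separately supplied value.

```python
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    target_loglik: float  # log-likelihood under the target family model
    ubm_loglik: float     # log-likelihood under the UBM
    log_odds: float       # log-odds score used for the threshold decision
    is_member: bool       # ground truth


def detect(seq: Scored, threshold: float) -> bool:
    # Stage 1: hard competitive decision. If the UBM wins, the sequence
    # is rejected and never reaches the threshold test.
    if seq.ubm_loglik >= seq.target_loglik:
        return False
    # Stage 2: threshold based decision on the log-odds score.
    return seq.log_odds >= threshold


def count_errors(seqs: List[Scored], threshold: float) -> Tuple[int, int]:
    """Return (#false positives, #false negatives) for one threshold."""
    fp = sum(1 for s in seqs if detect(s, threshold) and not s.is_member)
    fn = sum(1 for s in seqs if not detect(s, threshold) and s.is_member)
    return fp, fn


# Made-up example: the second member is lost to the UBM in stage 1.
examples = [
    Scored(-10.0, -12.0, 3.0, True),
    Scored(-15.0, -14.0, 1.0, True),
    Scored(-20.0, -25.0, -1.0, False),
    Scored(-30.0, -20.0, 0.5, False),
]
print(count_errors(examples, float("-inf")))
print(count_errors(examples, 0.0))
```

Even with the most permissive threshold, the member rejected in stage 1 remains a false negative. This is why the ROC curves of UBM based evaluations end at some distance from the axis.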

Figure 6.4 and table 6.6 show that the improved performance of SPU based protein family models compared to discrete Profile HMMs, which was demonstrated for the classification task, carries over to the detection task. The percentage of false positive predictions is substantially reduced, and the sensitivity is also improved in terms of a reduced number of false negatives.

Note that the radically changed modeling approach still offers some optimization potential, since the effectiveness of semi-continuous feature based Profile HMMs currently cannot be reached. However, since the evaluations presented here serve as the general proof of concept of the new modeling technique, and the state-of-the-art discrete Profile HMMs are significantly outperformed, SPU based protein family modeling is very promising.

[ROC plot; axes: # false positives and # false negatives; curves: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR adaptation), semi-continuous feature based Profile HMMs (MLLR adaptation); inset: working area]

Figure 6.4: ROC curves illustrating the improved remote homology detection performance when applying SPU based protein family models (green curve) instead of current discrete Profile HMMs (red curve). The SPU based models are evaluated competitively to a single state UBM. False rejections are the reason why the curve's endpoint (marked with '+') lies apart from the y-axis. The ROC curves for semi-continuous feature based Profile modeling are given as reference (pink), illustrating their still better performance for detection tasks. Since the aim was the proof of concept for the SPU based modeling approach, the remaining optimization potential is promising.

Modeling Variant           False Negative Predictions [%]   False Positive Predictions [%]
                           for 5 % False Positives          for 5 % False Negatives
------------------------------------------------------------------------------------------
Discrete Profile HMMs      26.1                             57.6
SCFB Profile HMMs + UBM     5.1                              5.5
SCFB SPU HMMs + UBM        18.2                              0.0 (17.4)

Table 6.6: Characteristic values for SCOPSUPER95 66 detection experiments for Profile HMMs (discrete, and semi-continuous) and SPU based protein family models evaluated competitively to a single state UBM. The specificity as well as the sensitivity can substantially be improved compared to the state-of-the-art when using the SPU approach. For SCFB evaluations the corresponding limits of 5% false negative predictions were not reached. Thus, the appropriate global maxima at the end-points of the ROC curves are given in parentheses. Both feature based modeling variants include an MLLR adapted feature space representation.

SCOPSUPER95 44f: Evaluation of Remote Homology Classification

The previous sections provided the general proof of concept for the applicability of SPU based protein family models using the SCOPSUPER95 66 corpus. In the following, the effectiveness of the new modeling approach is evaluated for reduced training sets. Again, the evaluation concentrates on the general applicability of the new technique, i.e. on the proof of concept for the paradigm shift in protein sequence analysis towards building blocks. Since the SPU framework can be configured, and thus enhanced, in various ways, the results presented here give hints for further developments.

The SCOPSUPER95 44f corpus is used to explicitly assess the robustness of SPU based models with respect to the number of available training samples. The first set of experiments evaluates the effectiveness of SPU based models for remote homology classification. Figure 6.5 shows the classification error rates as a function of the number of training samples used for model estimation. The upper diagram gives the results for the original test set, whereas the lower diagram covers the extended test set of SCOPSUPER95 44f.

The actual training sets are obtained by randomly selecting sequences from the SCOP pool of the particular superfamily (cf. section 6.1 on page 151 for details about the corpus definition). Thus, a certain amount of "statistical noise" occurs when measuring the classification errors for the particular subsets of training samples. To allow an easy comparison of the general effectiveness of the modeling methods for remote homology classification, the actual values are smoothed using Bezier interpolation. This results in continuous curves which can be analyzed by visual inspection as well as numerically to estimate the overall trend of the effectiveness.
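The following minimal sketch illustrates one common form of Bezier smoothing. The exact procedure and tool used for the diagrams are not stated here, so the sketch rests on an assumption: the measured points themselves serve as control points of a single Bezier curve, evaluated via Bernstein polynomials. The function name `bezier_smooth` and the example values are hypothetical.

```python
from math import comb
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_smooth(points: List[Point], n_samples: int = 50) -> List[Point]:
    """Evaluate the Bezier curve whose control points are the data points.

    The curve passes through the first and the last point and only
    approximates the points in between. Requires n_samples >= 2.
    """
    n = len(points) - 1  # degree of the curve
    curve = []
    for k in range(n_samples):
        t = k / (n_samples - 1)
        x = y = 0.0
        for i, (px, py) in enumerate(points):
            # Bernstein basis polynomial B_{i,n}(t)
            b = comb(n, i) * t ** i * (1.0 - t) ** (n - i)
            x += b * px
            y += b * py
        curve.append((x, y))
    return curve


# Made-up (training set size, error rate) pairs, not measured values.
noisy = [(20, 30.0), (25, 26.0), (30, 27.0), (35, 22.0), (40, 21.0), (44, 18.0)]
smooth = bezier_smooth(noisy)
print(smooth[0], smooth[-1])
```

Because every curve point is a convex combination of the control points, the smoothed values stay within the range of the measured values while local fluctuations are damped.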

The curves show that SPU based protein family models are generally suitable for remote homology classification with small training sets, provided that a certain minimum number of sample sequences is available. If the training sets contain at least approximately 20 sequences, the classification error obtained for SPU based protein family models is comparable to, or slightly smaller than, the error of current discrete Profile HMMs. Although the better performance of SCFB Profile HMMs cannot be reached yet, the results are very promising for the new concept of protein family modeling.

SCOPSUPER95 44f: Evaluation of Remote Homology Detection

The classification results given in the previous section provide an overview of the general effectiveness of SCFB SPU based protein family HMMs when only small amounts of training data are available. To evaluate the effectiveness for remote homology detection in the same way as for classification, 44 ROC curves would be necessary. Since presenting such a large number of diagrams for randomly selected training sets is out of proportion to the knowledge that can be gained from them, the following presentations are limited to three representative training sets of SCOPSUPER95 44f.

For the first set of experiments, the subsets containing 20 training samples are selected, whereas the training sets of the second kind contain 30 sequences each. Finally, the upper limit of the number of available training samples is used, namely 44 sequences. Note that these training sets are not identical to the ones from SCOPSUPER95 66, since the number of training samples is fixed to 44 for all 16 superfamilies (whereas 44 is the minimum number of training sequences in SCOPSUPER95 66). The ROC curves are presented in figures 6.6 and 6.7.

[Two diagrams of classification error [%] over # training samples; upper diagram: SCOPSUPER95_44f (original testset), lower diagram: SCOPSUPER95_44f (extended testset); curves in both: discrete Profile HMMs, SCFB SPU HMMs (MLLR), SCFB Profile HMMs (MLLR)]

Figure 6.5: Classification error rates obtained when applying SPU based protein family models to the task of remote homology classification (SCOPSUPER95 44f). With respect to current discrete Profile HMMs, the diagrams illustrate the comparable and slightly reduced classification error rates obtained when a minimum of approximately 20 training sequences is available. Since the training samples are randomly picked, the actual classification error rates (marked by '+') are smoothed using Bezier splines (solid lines). The results for semi-continuous feature based Profile HMMs are given here, too, illustrating their still better performance.

For remote homology detection, the ROC curves confirm the results obtained in the classification experiments. If fewer than the suggested minimum of approximately 20 training sequences are available, the detection performance of SPU based target models is worse than that of state-of-the-art discrete Profile HMMs (upper diagram of figure 6.6). However, when 10 more training samples are used for model estimation, the detection performance improves on average and becomes comparable to the state-of-the-art. Furthermore, when 44 sample sequences are available for model training, SPU based protein family models in combination with a single state UBM significantly outperform discrete Profile HMMs. In fact, 44 sequences is a reasonably small number. Thus, the SPU based modeling approach is very promising. For completeness, table 6.7 gives the characteristic values at fixed working points of the particular ROC curves.

Although the differences between SCFB Profile HMMs and SPU based protein family models are substantial, the detection performance of the new modeling approach is promising, since the results only represent a successful proof of concept. Further research should be directed at this new kind of protein family modeling in order to improve its capabilities.

Modeling Variant           False Negative Predictions [%]   False Positive Predictions [%]
                           for 5 % False Positives          for 5 % False Negatives
------------------------------------------------------------------------------------------
20 Training Samples
Discrete Profile HMMs      36.3                             80.2
SCFB Profile HMMs + UBM    33.1                              0.0 (13.6)
SCFB SPU HMMs + UBM        45.7                              0.0 (8.6)

30 Training Samples
Discrete Profile HMMs      31.6                             78.2
SCFB Profile HMMs + UBM    19.0                              0.0 (25.9)
SCFB SPU HMMs + UBM        34.0                              0.0 (19.3)

44 Training Samples
Discrete Profile HMMs      27.8                             68.7
SCFB Profile HMMs + UBM    12.9                             26.6
SCFB SPU HMMs + UBM        21.9                              0.0 (29.1)

Table 6.7: Comparison of characteristic values for SCOPSUPER95 44f detection experiments (20, 30, and 44 training samples) at fixed working points for SPU based protein family models (MLLR adapted feature space representation) vs. Profile HMMs (discrete and semi-continuous feature based variants). For those values where the corresponding limit of 5% false predictions was not reached, the appropriate global maxima at the endpoints of the ROC curves are given in parentheses.

[Two ROC plots; axes: # false positives and # false negatives; curves in both: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR), semi-continuous feature based Profile HMMs (MLLR); inset in both: working area]

Figure 6.6: ROC curves illustrating the detection performance of SPU based target models competitively evaluated to a single state UBM for SCOPSUPER95 44f (upper diagram: 20 training sequences; lower diagram: 30 samples). A minimum of approximately 20 training sequences is required for reaching state-of-the-art (red), or outperforming it. The endpoints of the curves not crossing the

[ROC plot; axes: # false positives and # false negatives; curves: discrete Profile HMMs, semi-continuous feature based SPU HMMs (MLLR), semi-continuous feature based Profile HMMs (MLLR); inset: working area]

Figure 6.7: SCOPSUPER95 44f based comparison of detection results for the third set of experiments based on the use of 44 training samples each for the estimation of target family models. The combination of SCFB SPU models and single state UBMs performs best while obtaining small amounts of false rejections due to the UBM. Again the endpoints of UBM based curves not crossing the y-axis caused by false rejections are marked with ’+’.

To summarize, the proof of concept for the general applicability of the new approach to protein family modeling using automatically derived building blocks was given. Both the classification and the detection performance of SPU based target models are improved in comparison to current discrete Profile HMMs when at least approximately 20 training sequences are available. If the amount of sample data is below this minimum, the resulting SPUs are expected to be less representative of the overall protein family than necessary. Thus, more flexibility of target models is required.