
Effectiveness of Bounded Left-Right Models

6.3 Advanced Stochastic Protein Family Models for Small Training Sets

6.3.2 Effectiveness of Bounded Left-Right Models

[Figure: ROC curves; axes: # false positives and # false negatives; inset: working area. Curves: discrete Profile HMMs; semi-continuous feature based SPU HMMs (MLLR estimation); semi-continuous feature based Profile HMMs (MLLR estimation).]

Figure 6.7: SCOPSUPER95 44f based comparison of detection results for the third set of experiments based on the use of 44 training samples each for the estimation of target family models. The combination of SCFB SPU models and single state UBMs performs best while obtaining small amounts of false rejections due to the UBM. Again the endpoints of UBM based curves not crossing the y-axis caused by false rejections are marked with ’+’.

To summarize, the proof of concept for the general applicability of the new approach for protein family modeling using automatically derived building blocks was given. Both classification and detection performance of SPU based target models are improved in comparison to current discrete Profile HMMs when approximately 20 training sequences are available. If the amount of sample data is below this minimum, the resulting SPUs are expected to be less representative of the overall protein family than necessary. Thus, more flexibility of target models is required.

Similarly to the evaluation of SPU based HMMs, the assessment of the effectiveness of the BLR approach is based on the SCOPSUPER95 66 as well as on the SCOPSUPER95 44f corpus. In the following, the particular evaluation results are presented separately.

SCOPSUPER95 66: Evaluation of Remote Homology Classification

Table 6.8 contains the results of the experimental evaluation of the classification task performed for the SCOPSUPER95 66 corpus. The most effective variants of feature space adaptation as discussed in the previous section were applied and appropriate semi-continuous feature based Bounded Left-Right protein family HMMs were estimated – SCFB BLR HMMs (MAP/MLLR).

Analyzing the classification error rates of both SCFB Profile HMMs and SCFB BLR HMMs, it becomes clear that the new models with reduced complexity significantly outperform the feature based Profile models containing the standard three-state topology. For the best configuration, namely BLR HMMs based on the MLLR adapted feature space representation, the classification error decreases by approximately 20 percent relative. Compared to the corresponding state-of-the-art discrete Profile HMMs, this implies almost halving the classification error.

Note that the Bounded Left-Right model topology requires the new feature representation of protein sequences. In addition to the comparison of SCFB HMMs, table 6.8 gives the classification error rate for discrete BLR models. Discrete BLR models perform significantly worse than their feature based counterparts and even worse than state-of-the-art discrete Profile HMMs. When processing discrete amino acid data, state-of-the-art discrete Profile HMMs are without doubt the methodology of choice.

Modeling Variant           Classification   Relative Change ∆E [%]
                           Error E [%]      Base: Discrete    Base: SCFB
                                            Profile HMMs      Profile HMMs
Discrete Profile HMMs      32.9             −                 −
SCFB Profile HMMs (MAP)    24.0             −27.1             −
SCFB Profile HMMs (MLLR)   20.7             −37.1             −
Discrete BLR HMMs          38.9             +15.4             −
SCFB BLR HMMs (MAP)        21.7             −34.0             −9.6
SCFB BLR HMMs (MLLR)       16.8             −48.9             −18.8

Table 6.8: Classification results for SCOPSUPER95 66 comparing Profile HMMs (both discrete and semi-continuous feature based) with protein family models with reduced model complexities based on the Bounded Left-Right architecture (mean confidence range: approximately ±3.5%). Using the best configuration (BLR models including MLLR based feature space adaptation), the classification error can be further decreased by approximately 20 percent relative (compared to SCFB Profile HMMs), which implies almost halving the classification error obtained when using standard discrete Profile HMMs. The significantly worse results for discrete BLR models are given separately.
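The relative changes reported for the SCFB models in table 6.8 follow directly from the absolute error rates. The following is a minimal sketch, assuming the relative change is defined as (E_new − E_base) / E_base in percent; the function name `relative_change` is introduced here for illustration only.

```python
def relative_change(error_new: float, error_base: float) -> float:
    """Relative change of an error rate in percent (negative = improvement)."""
    return (error_new - error_base) / error_base * 100.0

# Classification errors E [%] copied from table 6.8.
errors = {
    "Discrete Profile": 32.9,
    "SCFB Profile (MAP)": 24.0,
    "SCFB Profile (MLLR)": 20.7,
    "SCFB BLR (MAP)": 21.7,
    "SCFB BLR (MLLR)": 16.8,
}

# Change of the best BLR configuration against both baselines.
vs_discrete = relative_change(errors["SCFB BLR (MLLR)"], errors["Discrete Profile"])
vs_scfb = relative_change(errors["SCFB BLR (MLLR)"], errors["SCFB Profile (MLLR)"])
print(f"{vs_discrete:.1f} {vs_scfb:.1f}")  # prints: -48.9 -18.8
```

The first value corresponds to the "almost halving" relative to discrete Profile HMMs, the second to the roughly 20 percent relative gain over SCFB Profile HMMs.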

In figure 6.8 the transition probabilities of the SCOPSUPER95 66 SCFB BLR models are visualized using a greyvalue representation. According to the definition of Bounded Left-Right models (cf. page 130f.) the number of transitions varies between models. The dominating state transition for all models is the direct connection between adjacent states (almost white stripes within all sub-images in the second rows). Additionally, the majority of states in all models contain non-vanishing transition probabilities to themselves and to farther adjacent states. The model topology appears suitable, since transitions to states which are not locally adjacent are very unlikely to occur (the lower rows of all sub-images are almost black).

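[Figure: greyvalue images of the transition probabilities, one sub-image per superfamily model (number of transitions in parentheses): a.39.1 (5), a.4.5 (5), b.1.1 (5), b.29.1 (5), b.10.1 (5), b.40.4 (5), b.47.1 (5), a.1.1 (4), b.6.1 (4), a.3.1 (4), c.1.8 (4), c.2.1 (12), c.37.1 (5), c.47.1 (4), c.69.1 (5), c.3.1 (5). Axes: Models, States.]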

Figure 6.8: Visualization of transition probabilities for all superfamily BLR models of the SCOPSUPER95 66 corpus: Transition probabilities are mapped to greyvalues as illustrated by the inset (upper right); the lighter the shades, the higher the corresponding transition probabilities. The model names are written on the left side of the figure including the number of transitions created (cf. page 130f.) whereas the x-axis represents the states of the particular superfamily models.
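The banded structure visible in figure 6.8 can be sketched in a few lines. The definition of Bounded Left-Right models on page 130f. is not repeated here, so the code assumes that each state may reach itself and at most `num_transitions - 1` following states. The function name, the uniform initialization, and the example sizes are illustrative assumptions, not the estimation procedure used in the experiments.

```python
def blr_transition_matrix(num_states: int, num_transitions: int) -> list[list[float]]:
    """Uniformly initialized transition matrix with a bounded forward band.

    Assumption: state i may move to states i, i+1, ..., i+num_transitions-1
    (clipped at the last state); all other transitions are forbidden (0.0).
    """
    matrix = []
    for i in range(num_states):
        last_target = min(i + num_transitions - 1, num_states - 1)
        allowed = last_target - i + 1
        row = [0.0] * num_states
        for j in range(i, last_target + 1):
            row[j] = 1.0 / allowed
        matrix.append(row)
    return matrix

A = blr_transition_matrix(num_states=6, num_transitions=3)
for row in A:
    print(" ".join(f"{p:.2f}" for p in row))
```

Training would then redistribute the probability mass inside the band; under this assumption, backward transitions and jumps beyond the bound stay at zero by construction.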

SCOPSUPER95 66: Evaluation of Remote Homology Detection

In the previous section, evaluation results were presented which illustrate the superior performance of Bounded Left-Right HMMs for protein sequence classification. The effectiveness of the new modeling alternative for remote homology detection is now evaluated by means of the same methodology and datasets as previously mentioned for SCFB Profile HMMs. As suggested by the improved specificity of UBM based model evaluation (cf. figure 6.3), target BLR models based on either MAP or MLLR based feature space adaptation were evaluated competitively against the single state Universal Background Model (UBM).

In figure 6.9 the results for the evaluation of the detection performance are presented by means of ROC curves. The curves for SCFB BLR models (blue and green) are directly compared to the curves of the corresponding SCFB Profile HMMs (cyan and pink). Furthermore, the baseline for all SCOPSUPER95 66 experiments discussed in this thesis, namely the ROC curve for discrete Profile HMMs, is shown in red. Note that some small amounts of false rejections are produced by the single state UBM. Thus, the corresponding ROC curves do not cross the y-axis.

[Figure: ROC curves; axes: # false positives and # false negatives; inset: working area. Curves: discrete Profile HMMs; semi-continuous feature based BLR HMMs (MAP estimation); semi-continuous feature based BLR HMMs (MLLR adaptation); semi-continuous feature based Profile HMMs (MAP adaptation); semi-continuous feature based Profile HMMs (MLLR adaptation).]

Figure 6.9: SCOPSUPER95 66 based comparison of the detection performance for both SCFB Profile HMMs (cyan and pink ROC curves) and SCFB BLR models (blue and green ROC curves) when evaluating the particular models competitively against a UBM. Both variants of the feature based BLR models show excellent performance for target identification tasks (red ROC curve: corresponding standard discrete Profile HMMs). Due to small amounts of false rejections produced by the single state UBM applied, both BLR MAP and BLR MLLR curves do not cross the y-axis. For clarity the particular endpoints are marked with ’+’.
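Why the UBM based curves end before reaching the y-axis can be illustrated with a small sketch. It assumes that a sequence is accepted only if its target model score reaches a decision threshold and also exceeds the UBM score; the function name and the toy scores are hypothetical and not taken from the experiments.

```python
def roc_points(target_scores, ubm_scores, is_member, use_ubm=True):
    """Return (false_positives, false_negatives) for every decision threshold.

    A sequence is accepted if its target score reaches the threshold and,
    when use_ubm is set, also exceeds the score of the background model.
    """
    points = []
    thresholds = sorted(set(target_scores))
    for threshold in thresholds:
        fp = fn = 0
        for target, ubm, member in zip(target_scores, ubm_scores, is_member):
            accepted = target >= threshold and (not use_ubm or target > ubm)
            if accepted and not member:
                fp += 1
            elif not accepted and member:
                fn += 1
        points.append((fp, fn))
    return points

# Hypothetical log-likelihood scores for six sequences (toy data).
target = [-10.0, -12.0, -15.0, -11.0, -20.0, -18.0]
ubm = [-14.0, -13.0, -14.0, -16.0, -17.0, -19.0]
member = [True, True, True, False, False, False]

with_ubm = roc_points(target, ubm, member, use_ubm=True)
without_ubm = roc_points(target, ubm, member, use_ubm=False)
print(min(fn for _, fn in with_ubm), min(fn for _, fn in without_ubm))  # prints: 1 0
```

In the toy data one family member scores below the UBM, so it is rejected at every threshold: the curve with UBM never reaches zero false negatives, while its maximum number of false positives is smaller than without the UBM.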

It can be seen that Bounded Left-Right models are well suited for remote homology detection. The target models with reduced complexity, i.e. containing significantly fewer parameters which need to be trained, perform better than Profile HMMs. Thus, improved performance can be expected for smaller training sets, too. The combination of BLR target models and a UBM is superior for the SCOPSUPER95 66 based classification task.

In table 6.9 the corresponding percentages of false predictions for fixed points within the ROC curves allowing five percent false predictions are summarized. When SCFB BLR protein family models are applied to the practical task of remote homology detection, e.g. for pharmaceutical applications where certain percentages of false classifications are allowed, the percentages of corresponding false classifications can be decreased further (compared to the SCFB Profile HMMs).

Modeling Variant           False Negative Predictions [%]   False Positive Predictions [%]
                           for 5 % False Positives          for 5 % False Negatives
Discrete Profile HMMs      26.1                             57.6
SCFB Profile HMMs (MAP)    7.9                              16.0
SCFB Profile HMMs (MLLR)   5.1                              5.5
SCFB BLR HMMs (MAP)        8.9                              11.9
SCFB BLR HMMs (MLLR)       4.9                              4.7

Table 6.9: Characteristic values for SCOPSUPER95 66 UBM based detection experiments: At the working points of 5 percent allowed false predictions, only few corresponding false predictions are obtained when using the new SCFB BLR based protein family models.
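The characteristic values in table 6.9 are read off the ROC curves at fixed working points. A minimal sketch of such a read-off follows, assuming the curve is available as a list of (false positive rate, false negative rate) pairs in percent; the function names and the sample curve are hypothetical.

```python
def fn_at_fp_limit(curve, fp_limit=5.0):
    """Smallest false negative rate among points whose false positive
    rate does not exceed fp_limit; None if no point satisfies the limit."""
    admissible = [fn for fp, fn in curve if fp <= fp_limit]
    return min(admissible) if admissible else None

def fp_at_fn_limit(curve, fn_limit=5.0):
    """Mirror image: smallest false positive rate at a bounded false negative rate."""
    admissible = [fp for fp, fn in curve if fn <= fn_limit]
    return min(admissible) if admissible else None

# Hypothetical ROC curve, rates in percent (not measured values).
curve = [(0.5, 30.0), (2.0, 12.0), (5.0, 6.0), (9.0, 4.0), (20.0, 1.0)]
print(fn_at_fp_limit(curve), fp_at_fn_limit(curve))  # prints: 6.0 9.0
```

Returning `None` when no point meets the limit keeps the case of a curve that never reaches the working point distinguishable from a genuine value of 0.0.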

The evaluation of both classification and detection performance for SCFB BLR protein family HMMs using the SCOPSUPER95 66 corpus represents the successful proof of concept for the new modeling approach. Following this, the amount of training samples is explicitly reduced in order to evaluate the effectiveness of SCFB BLR HMMs for the sparse data problem. To this end, experimental evaluations based on the SCOPSUPER95 44f corpus are presented.

SCOPSUPER95 44f: Evaluation of Remote Homology Classification

In order to illustrate how the classification error obtained for remote homology classification depends on the amount of training material available for model estimation, figure 6.10 presents these values for both SCFB Profile HMMs and SCFB BLR models. The baseline for discrete Profile HMMs is shown for completeness as well. As previously mentioned, two test sets exist for the SCOPSUPER95 44f corpus. Thus, the results are shown for both of them in separate diagrams.

Inspecting the curves in both diagrams of figure 6.10, it becomes clear that SCFB protein family HMMs show superior performance for almost the whole range of training subsets. Only for very small sample sets (fewer than five sequences) do discrete Profile HMMs outperform the new techniques in relative terms. However, in this area the absolute classification error is out of any reasonable range (more than 60 percent). When applying SCFB BLR protein family HMMs, the classification error can be halved for almost all subsets of training samples compared to the SCFB Profile HMMs.

[Figure: two diagrams, SCOPSUPER95_44f (original testset) and SCOPSUPER95_44f (extended testset); axes: # Training Samples and Classification Error [%]. Curves in both diagrams: discrete Profile HMMs; SCFB BLR (MAP); SCFB Profile (MAP); SCFB BLR (MLLR); SCFB Profile (MLLR).]

Figure 6.10: Illustration of the SCOPSUPER95 44f based classification errors (top: original testset; bottom: extended testset) depending on the number of training samples used for model training for Profile HMMs (MAP and MLLR based SCFB, and discrete) as well as for BLR models. The models with reduced complexity show superior classification performance especially when less training material is available. Since the training samples are randomly picked, the actual classification error rates (marked by ’+’) are smoothed using Bezier splines (solid lines).
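The smoothing mentioned in the caption of figure 6.10 can be sketched as follows. How the Bezier splines were computed is not stated; the code assumes the common plotting convention in which the n measured points serve as control points of a single Bezier curve of degree n − 1, evaluated with de Casteljau's algorithm. The function names and the data points are hypothetical.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] (de Casteljau)."""
    points = list(control_points)
    while len(points) > 1:
        # Replace each neighbouring pair by its linear interpolation at t.
        points = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(points, points[1:])
        ]
    return points[0]

def smooth(control_points, num_samples=50):
    """Sample the smoothed curve at evenly spaced parameter values."""
    return [bezier_point(control_points, i / (num_samples - 1))
            for i in range(num_samples)]

# Hypothetical (number of training samples, classification error in %) pairs.
measured = [(5, 62.0), (10, 48.0), (20, 35.0), (30, 33.0), (44, 27.0)]
curve = smooth(measured)
print(curve[0], curve[-1])
```

Under this convention the smoothed curve passes through the first and last measured point only; the intermediate points pull the curve towards them without being interpolated, which damps the scatter caused by randomly picked training samples.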

SCOPSUPER95 44f: Evaluation of Remote Homology Detection

After the assessment of the classification capabilities of SCFB BLR protein family HMMs, their effectiveness for detection tasks is now considered. In analogy to the argumentation given on page 162, the presentation of ROC curves is limited to three kinds of experiments.

In figures 6.11 and 6.12 the detection results are presented by means of ROC curves illustrating the mutual dependencies of false predictions. All diagrams contain ROC curves for discrete Profile HMMs (serving as baseline reference), SCFB Profile HMMs evaluated competitively against a structured UBM, SCFB BLR HMMs applied in combination with an unstructured UBM, and the results for standalone evaluation of SCFB BLR HMMs, i.e. not competing with any kind of UBM.

In the diagrams the improved performance of advanced stochastic protein family modeling techniques for remote homology detection can be seen even for small training sets. The ROC curves corresponding to semi-continuous feature based approaches lie almost everywhere below the reference curves for state-of-the-art discrete Profile HMMs.

Furthermore, it can be seen that SCFB Bounded Left-Right protein family models outperform their Profile topology based counterparts when evaluating them competitively against a UBM. Note that the specificity of these model combinations degrades in proportion to the decreasing number of training samples used. The smaller the number of training samples, the larger the percentage of false negative predictions. In the diagrams this is illustrated by the positions of the endpoints of the particular ROC curves not crossing the y-axis (marked with ’+’). The larger the horizontal distance of the endpoints from the point of origin, the larger the number of (fixed) false rejections.

However, the number of false positive predictions, which is extremely relevant for e.g. pharmaceutical applications due to the enormous costs linked to further expensive analysis of erroneously selected candidates (e.g. in wet-lab experiments), can be reduced substantially. In addition to identifying the limits of advanced stochastic modeling approaches developed in this thesis, the experiments clearly highlight their substantial benefits: Already when using 20 training samples (which is in fact a very small number) great improvements for remote homology detection tasks can be obtained. With further reductions of the amount of training samples, the capabilities of SCFB BLR protein family models tend towards the detection performance of current discrete Profile HMMs, with slight improvements remaining.

Similar to the presentations given in previous sections, table 6.10 summarizes the characteristic values of the particular ROC curves, which concretely specify the mutual dependencies of false predictions at reasonable working points. Due to the great effectiveness of the UBM the limit of five percent false positive predictions is often not reached. When keeping consistency regarding the definition of the characteristic values, the percentage of corresponding false negative predictions is 0.0. However, due to the (occasionally obtained) larger numbers of false rejections caused by the UBM, the percentages of these false rejections are additionally given in parentheses. In analogy to the argumentation given above, when the limit of five percent false negatives is not reached, the percentage of false positive predictions for the endpoint of the particular ROC curve is given in parentheses, too.

[Figure: two ROC diagrams; axes: # false positives and # false negatives; inset: working area. Curves: discrete Profile HMMs; semi-continuous feature based BLR HMMs (MAP estimation); semi-continuous feature based BLR HMMs (MLLR estimation); semi-continuous feature based Profile HMMs (MAP estimation); semi-continuous feature based Profile HMMs (MLLR adaptation); semi-continuous feature based BLR HMMs, no UBM (MAP estimation); semi-continuous feature based BLR HMMs, no UBM (MLLR adaptation).]

Figure 6.11: Detection results for SCOPSUPER95 44f experiments based on target family models estimated using 20 (upper diagram) and 30 training samples (lower diagram). Even for small training sets, new modeling techniques outperform discrete Profile HMMs (red). The endpoints of UBM based curves not crossing the y-axis caused by false rejections due to the UBM are marked with ’+’.

[Figure: ROC curves; axes: # false positives and # false negatives; inset: working area. Curves: discrete Profile HMMs; semi-continuous feature based BLR HMMs (MAP estimation); semi-continuous feature based BLR HMMs (MLLR estimation); semi-continuous feature based Profile HMMs (MAP estimation); semi-continuous feature based Profile HMMs (MLLR adaptation); semi-continuous feature based BLR HMMs, no UBM (MAP estimation); semi-continuous feature based BLR HMMs, no UBM (MLLR adaptation).]

Figure 6.12: SCOPSUPER95 44f based comparison of detection results for the third set of experiments based on the use of 44 training samples each for the estimation of target family models. The combination of SCFB BLR models and single state UBMs performs best while obtaining small amounts of false rejections due to the UBM. Again the endpoints of UBM based curves not crossing the y-axis caused by false rejections are marked with ’+’.

To summarize, semi-continuous feature based protein family HMMs with Bounded Left-Right topology are well suited for remote homology detection tasks where only little training data is available. Especially in combination with a single state UBM, substantial reductions of false positive prediction rates are achievable, which is relevant for e.g. pharmaceutical applications. Generally, advanced stochastic models of protein families as developed in this thesis outperform current discrete Profile HMMs, particularly when only small sets of training samples are available.