
Effectiveness of Semi-Continuous Feature Based Profile HMMs

with respect to Pfam-domains and modeled using stochastic models. When SWISSPROT sequences are evaluated against the domain HMMs, the models must find sequences in which only parts (namely the appropriate domains) match. This corresponds to a common application in molecular biology: searching for occurrences of certain biological functions (represented by the particular domain) within larger protein environments.

Evaluation of the Specificity: Since SWISSPROT contains substantial amounts of data (approximately 90 000 sequences), the specificity of the new models including the effectiveness of the Universal Background Model (UBM) can be evaluated.

Table 6.2 summarizes the characteristics of PFAMSWISSPROT. Three representative protein domains were selected.

Pfam Id   Pfam Name                                # Samples   # Occurrences
                                                   Training    in SWISSPROT
PF00001   7tm_1 – GPCR 7 Transmembrane Receptor    64          1078
PF00005   PKinase – Protein Kinase Domain          63          507
PF00069   ABC_tran – ABC Transporter               54          1202

Table 6.2: Overview of the PFAMSWISSPROT corpus for final system evaluation. Members of the three Pfam domains listed here will be searched within the approximately 90 000 sequences of the general protein database.

6.2 Effectiveness of Semi-Continuous Feature Based Profile HMMs

SCOPSUPER95 66: Evaluation of Remote Homology Classification

Table 6.3 reports the capabilities of the semi-continuous feature based (SCFB) Profile HMMs for the SCOPSUPER95 66 based classification task. The results are presented as a comparison of the classification errors obtained with the variants of SCFB Profile HMMs and with their discrete counterparts.

Modeling Variant   Classification Error   Relative Change ∆E [%]
(Profile HMMs)     E [%]                  (Base: Discrete Profile HMMs)
Discrete           32.9                   −
SCFB (ML)          37.6                   +14.3
SCFB (MAP)         24.0                   −27.1
SCFB (MLLR)        20.7                   −37.1

Table 6.3: Classification results for SCOPSUPER95 66 comparing discrete Profile HMMs to semi-continuous feature based Profile HMMs (SCFB) obtained by the three variants of feature space adaptation (ML/MAP/MLLR). Whereas the ML variant performs worse than the state-of-the-art discrete models, both MAP and MLLR based models significantly outperform them (the mean confidence range for this corpus is approximately ±3.5%).

Given that the underlying model architecture is identical for all experiments, namely the complex three-state Profile topology, the relative changes of the classification error ∆E make clear that the new feature representation is very effective. When applying semi-continuous Profile HMMs which were estimated using MAP or MLLR adaptation, the classification error decreases significantly. In the best case (MLLR adaptation), the classification error is reduced by more than one third relative to the discrete baseline. The classification capabilities of semi-continuous Profile HMMs based on ML estimation are worse than those of the standard models since the number of adaptation samples is too small.
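The relative changes ∆E listed in table 6.3 can be reproduced from the absolute classification errors. The following is a minimal sketch; the function name and the dictionary layout are illustrative assumptions and not part of the original evaluation tooling.

```python
def relative_change(error: float, baseline: float) -> float:
    """Relative change of a classification error w.r.t. a baseline, in percent."""
    return 100.0 * (error - baseline) / baseline


# Classification errors E [%] as reported in table 6.3.
errors = {
    "Discrete": 32.9,
    "SCFB (ML)": 37.6,
    "SCFB (MAP)": 24.0,
    "SCFB (MLLR)": 20.7,
}
baseline = errors["Discrete"]

# Relative change of every SCFB variant against the discrete baseline,
# rounded to one decimal place as in the table.
changes = {
    name: round(relative_change(err, baseline), 1)
    for name, err in errors.items()
    if name != "Discrete"
}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f} %")
```

A negative value denotes a reduction of the classification error compared to the discrete Profile HMMs.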

SCOPSUPER95 66: Evaluation of Remote Homology Detection

The evaluation based on the SCOPSUPER95 66 classification task gave a first clue regarding the general capabilities of the new feature based protein family modeling techniques.

For the second major application field addressed by this thesis, namely target identification, the results of detection experiments are presented in figure 6.2. The complete 95% sequence identity based SUPERFAMILY hierarchy of SCOP was searched for occurrences of the 16 superfamilies. As another new concept developed in this thesis, the particular target models and a structured Universal Background Model (UBM), which captures general protein data, are evaluated competitively during the search.
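The competitive evaluation of a target model against the UBM can be sketched as a simple score comparison. The example below assumes that both models deliver one log-likelihood score per sequence and that a sequence is accepted when the target score exceeds the UBM score by more than a margin. The function name, the margin parameter, and all score values are hypothetical; the actual scoring procedure is not specified in this section.

```python
from typing import Dict, List


def detect_members(target_scores: Dict[str, float],
                   ubm_scores: Dict[str, float],
                   margin: float = 0.0) -> List[str]:
    """Return the ids of all sequences whose target-model log score
    beats the UBM log score by more than `margin`."""
    return [
        seq_id
        for seq_id, target in target_scores.items()
        if target - ubm_scores[seq_id] > margin
    ]


# Hypothetical log-likelihood scores (illustrative only, not experimental data).
target_scores = {"seq_a": -120.4, "seq_b": -310.9, "seq_c": -98.2}
ubm_scores = {"seq_a": -150.0, "seq_b": -295.3, "seq_c": -101.7}

hits = detect_members(target_scores, ubm_scores)
print(hits)
```

Raising the margin makes the decision more conservative, which trades false positive predictions for false negative ones.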

The four ROC curves confirm the superior performance of both MAP and MLLR based SCFB Profile HMMs for the detection task as well. As suggested by the classification experiments, ML adaptation is also less effective for target identification.

[Figure 6.2: ROC plot; axes: # false positives and # false negatives, with the working area marked. Curves: discrete Profile HMMs; semi-continuous feature based Profile HMMs with ML estimation, with MAP adaptation, and with MLLR adaptation.]

Figure 6.2: ROC curves illustrating the superior performance of feature based Profile HMMs compared to standard discrete models (red curve). The underlying experimental evaluation was performed using the SCOPSUPER95 66 corpus. It can be seen that all semi-continuous feature based Profile HMMs estimated using the particular adaptation techniques produce better detection results than their discrete counterparts; the area below the ROC curve is significantly smaller.

In addition to the evaluation of the overall detection capabilities, the effectiveness of the explicit background model can be evaluated using ROC curves (figure 6.3). The ordinate of the diagram represents the number of false positive predictions, i.e. those sequences which are actually not members of the particular target family but, given the appropriate threshold, were falsely classified as members. The smaller the maximum number, the more effective the UBM. The ideal case is that the UBM based evaluation scores are better for all non-family sequences whereas the particular target family scores are better for all actual members. Traditionally, no explicit background model is applied for remote homology detection. Instead, usually some kind of post-processing of the detection results is performed by analyzing the significance of alignment scores and filtering accordingly. It can be seen that the combination of UBM and MLLR based SCFB Profile HMMs is superior: the maximum number of false positive predictions is reduced by almost 66 percent.

The characteristic values which concretely specify the mutual dependencies of false positive and false negative predictions are summarized in table 6.4.
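How such characteristic values are read from a ROC curve can be illustrated with a small threshold sweep. The sketch below assumes that a higher score indicates family membership, that false negatives are normalized by the number of true members, and that false positives are normalized by the number of non-members. These normalizations, the function names, and the example scores are assumptions for illustration and are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def roc_counts(member_scores: Sequence[float],
               other_scores: Sequence[float]) -> List[Tuple[float, int, int]]:
    """For every observed score used as threshold, count false negatives
    and false positives. A sequence is predicted as a family member when
    its score is greater than or equal to the threshold."""
    points = []
    for threshold in sorted(set(member_scores) | set(other_scores)):
        false_neg = sum(s < threshold for s in member_scores)
        false_pos = sum(s >= threshold for s in other_scores)
        points.append((threshold, false_neg, false_pos))
    return points


def fn_rate_at_fp_rate(member_scores: Sequence[float],
                       other_scores: Sequence[float],
                       max_fp_rate: float = 0.05) -> float:
    """Smallest false negative rate among all thresholds whose false
    positive rate does not exceed `max_fp_rate`."""
    best = 1.0  # rejecting every sequence is always an admissible working point
    for _, false_neg, false_pos in roc_counts(member_scores, other_scores):
        if false_pos / len(other_scores) <= max_fp_rate:
            best = min(best, false_neg / len(member_scores))
    return best


# Hypothetical scores: 4 family members and 20 non-members.
members = [30, 25, 22, 10]
others = list(range(20))

print(fn_rate_at_fp_rate(members, others, max_fp_rate=0.05))
```

The complementary value, the false positive rate at 5% false negatives, follows from the same sweep with the roles of the two counts exchanged.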

[Figure 6.3: ROC plot; axes: # false positives and # false negatives, with the working area marked. Curves: discrete Profile HMMs; SCFB Profile HMMs (MAP) with UBM based evaluation; SCFB Profile HMMs (MAP) without UBM; SCFB Profile HMMs (MLLR) with UBM based evaluation; SCFB Profile HMMs (MLLR) without UBM.]

Figure 6.3: Direct comparison of SCFB Profile HMMs’ detection performance when applying target models standalone (cyan and pink ROC curves) and when competing with an explicit background model (blue and green ROC curves). The superior performance of SCFB Profile HMMs (compared to state-of-the-art discrete models, red ROC curve) can be improved further when using a UBM.

Modeling Variant   False Negative Predictions [%]   False Positive Predictions [%]
(Profile HMMs)     for 5 % False Positives          for 5 % False Negatives
discrete           26.1                             57.6
SCFB (ML)          17.7                             64.4
SCFB (MAP)         7.9                              16.0
SCFB (MLLR)        5.1                              5.5

Table 6.4: Characteristic values for SCOPSUPER95 66 UBM based detection experiments illustrating the relevance of the new feature based sequence processing for e.g. pharmaceutical applications: At fixed working points of the ROC curves allowing 5% false predictions, the numbers of corresponding false predictions decrease significantly for the MAP and MLLR variants of semi-continuous feature based Profile HMMs.

Conclusion

The experimental evaluation of the feature based representation of protein sequences presented in this section demonstrates the superior performance of semi-continuous feature based Profile HMMs. It could be shown that the richer sequence representation developed in this thesis is a basic foundation for enhanced probabilistic protein family models. Both application fields relevant for e.g. drug discovery, namely target verification and target identification, benefit from the new approach. For the latter task the effectiveness of an explicit background model (UBM) which covers all non-target data could be shown.

The third major outcome of these experiments is that ML estimation is not a suitable method for specializing protein family HMMs towards the target they represent. Thus, ML based specialization will no longer be considered in any of the further evaluations. Note that a detailed presentation of the experimental evaluation of this section is given in appendix D.

6.3 Advanced Stochastic Protein Family Models for Small