4.1.1 Task: Homology Detection at the Superfamily Level

To evaluate the capabilities of current Profile HMM based approaches to protein sequence analysis, and to compare them with the enhanced concepts developed in this thesis, the following task is defined:

For a given set of superfamilies, Hidden Markov Models will be established using reasonable amounts of training data. By means of these models, classification tasks representative of target validation as well as detection tasks representative of target identification within the process of general drug discovery are evaluated. The general focus will be put on classification accuracy for the first case and on sensitivity and specificity for the latter.

Here, classification means assigning a family affiliation to sequences which are known to originate from a closed set of protein families. This application is especially important for target validation tasks, where sequences are annotated with respect to certain models.

Additionally, the general discrimination power of family models can be assessed, since in general only the Profile HMMs themselves are involved and no background models. Thus, the results of this kind of evaluation give a first indication of the effectiveness of the respective methods.
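The closed-set classification described above can be sketched as follows. This is a minimal illustration and not the evaluation code of this thesis: it assumes that every family model is available as a function returning a score (for instance a log-likelihood) for a sequence, and that a sequence is assigned to the best-scoring model. The names `Scorer`, `classify` and `accuracy` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer type: maps a sequence to its score under one family model
# (e.g. a log-likelihood). How the score is computed is left open here.
Scorer = Callable[[str], float]


def classify(sequence: str, models: Dict[str, Scorer]) -> str:
    """Assign the sequence to the family whose model scores it highest.

    Assumption: the decision rule is a plain maximum over the model scores;
    no background model is involved.
    """
    return max(models, key=lambda family: models[family](sequence))


def accuracy(labelled: List[Tuple[str, str]], models: Dict[str, Scorer]) -> float:
    """Fraction of (sequence, true_family) pairs that are classified correctly."""
    correct = sum(1 for seq, family in labelled if classify(seq, models) == family)
    return correct / len(labelled)
```

With one scorer per superfamily, `accuracy` applied to the labelled test sequences would yield the classification accuracy that the task definition focuses on.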

In the second use case, homologous sequences of a particular protein family are searched for in a general database of protein sequences, which is typical for target identification tasks:

For a therapeutically relevant family, sequences belonging to it are searched for by comparing query data to the probabilistic model of that family. The decision regarding affiliation to the target family is made by means of a threshold-based analysis of the log-odds alignment scores.
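The threshold-based decision can be illustrated with a short sketch. It assumes that a log-odds score is already available for every query sequence, that a query is accepted if its score reaches the threshold, and that sensitivity and specificity are defined in the usual way as TP/(TP+FN) and TN/(TN+FP). The function names and the inclusive comparison are assumptions made for illustration.

```python
from typing import List, Tuple


def detect(log_odds: float, threshold: float) -> bool:
    """Accept a query as a family member if its log-odds score reaches the threshold.

    Assumption: the comparison is inclusive (>=).
    """
    return log_odds >= threshold


def sensitivity_specificity(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for (log_odds, is_member) pairs."""
    tp = fn = tn = fp = 0
    for score, is_member in scored:
        predicted = detect(score, threshold)
        if is_member and predicted:
            tp += 1
        elif is_member:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Sweeping the threshold over the observed score range shows the trade-off between the two measures: raising it typically improves specificity at the expense of sensitivity.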

Datasets

For an objective assessment of sequence analysis techniques, standardized datasets would be the ideal basis for benchmarking. In other pattern recognition domains such datasets are very common, and new methods are almost always assessed by comparison to state-of-the-art techniques on this data. Unfortunately, the situation in bioinformatics is rather different. Hardly any standardized datasets are generally used within the community for such an assessment. New developments are often directed explicitly at some specialized biological problem, and the datasets used for evaluation are gathered correspondingly. Thus, the datasets for training and evaluating both the state-of-the-art probabilistic sequence analysis techniques and the advanced stochastic protein family models developed in this thesis were defined by the author. The basic criterion for data selection was to maximize objectivity, i.e. the remote homology data used is as unbiased as possible with respect to particular biological specialties.

According to Rainer Spang and colleagues [Spa02], who cite the “chicken and egg” problem of evaluating the effectiveness of annotation and search methods as formulated by Steven Brenner and coworkers [Bre98], it is rather difficult to assess the power of Profile HMMs by means of unknown data. The actual family affiliation of the data processed needs to be known in advance. Thus, existing sequence annotations obtained from one of the major public databases are used for the analysis of current approaches as well as for the comparison to the new techniques developed. Care needs to be taken in selecting the reference database. If the database was created using automatic clustering techniques, the annotation will certainly not be completely error-free.

An accurate annotation of the complete database would presuppose a successful solution of the computational sequence analysis problem, which is currently unrealistic. However, when the results of Profile HMM based predictions are compared to such databases, the reference annotation created by an alternative classifier is accepted as optimal. The prediction results can thus only be compared to a potentially erroneous reference annotation obtained using alternative automatic classification approaches. In principle, false predictions need to be examined further with biological expertise, since it is not clear whether the reference annotation or the new prediction was wrong. Unfortunately, many current databases were created automatically and can therefore hardly be used for serious assessments.

In order to properly simulate the actual situation for remote homology detection, the analysis needs to cover highly divergent sequence data. Many databases contain large amounts of redundant data or almost identical sequences. Such close homologues can be processed by means of standard pairwise techniques as described in section 3.1. Probabilistic approaches address remote homologues, i.e. highly diverging but related sequences.

On these premises, the SUPERFAMILY hierarchy [Gou01] of the SCOP database [Mur95] is used for the assessments, at the level of at most 95% sequence similarity.

Here, sequences belonging to a distinct superfamily must not have similarity values above 95%. This means that even data with sequence identities of only a few percent may belong to these superfamilies. Since the structural classification of the protein sequences contained in SCOP was performed manually, drawing extensively on well-founded biological expertise, the quality of the labeling of the data is extraordinarily good (cf. the description of SCOP in section 2.3.2). Thus, it is well suited to the general assessment of the capabilities of sequence analysis techniques.

Due to the complex model architecture of current Profile HMMs, including the rather large number of parameters to be trained, the minimum number of training sequences was set to 44. This minimum was chosen according to an analysis of the average length of sequences belonging to superfamilies and a rule of thumb for the number of examples per model parameter to be trained. Together with at least 22 sequences for the test case, 16 superfamilies containing at least 66 sequences each were selected for the evaluation. The subdivision of the sequences into training and test sets is fixed, i.e. no further leave-N-out tests etc. are considered. In the following, the resulting corpus of training and test data is called SCOPSUPER95 66.
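The selection criterion can be sketched as follows. The minimum sizes (44 training and 22 test sequences, i.e. at least 66 per superfamily) are taken from the text. The split rule is an assumption: assigning one third of each superfamily (rounded up) to the test set reproduces the sample counts listed in table 4.1, but neither this rule nor the choice of which sequences go into which set is stated here. The function name and the input format are hypothetical.

```python
from typing import Dict, List, Tuple

MIN_TRAIN = 44  # minimum number of training sequences per superfamily (from the text)
MIN_TEST = 22   # minimum number of test sequences per superfamily (from the text)


def select_and_split(
    superfamilies: Dict[str, List[str]]
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Keep superfamilies with at least 66 sequences and split each one once.

    Assumptions (not stated in the text): the test set holds ceil(n / 3)
    sequences, taken from the end of the list; the rest is used for training.
    """
    corpus: Dict[str, Tuple[List[str], List[str]]] = {}
    for scop_id, sequences in superfamilies.items():
        n = len(sequences)
        if n < MIN_TRAIN + MIN_TEST:
            continue  # too few sequences for a reliable model and test
        n_test = (n + 2) // 3  # integer ceiling of n / 3
        corpus[scop_id] = (sequences[: n - n_test], sequences[n - n_test:])
    return corpus
```

Because the split is computed once and deterministically, every experiment sees the same training and test sets, which matches the fixed subdivision described above.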

Table 4.1 gives a general overview of the corpus, whereas figure 4.1 illustrates the distribution of the similarity percentages over the datasets by means of a histogram covering all superfamilies included.

The general corpus overview shows that the sequences vary substantially in length in both the training and the test sets. This is the usual case for proteins subsumed in superfamilies, which contain data with low sequence identity percentages but whose structures and major functional features suggest a probable common evolutionary origin (according to the superfamily definition of SCOP, cf. section 2.3.2).

In fact, the sequence similarities, which are limited to at most 95 percent, are distributed rather uniformly over the whole range, with a slight preference for the first third of the histogram, both for the training sets and for the assigned test sets. The bins of the similarity histogram are defined via the SUPERFAMILY hierarchy; each covers a range of five percent, with the exception of the first bin (0-10%).
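The binning scheme of the histogram can be written down explicitly. The sketch below assumes bin edges at 0, 10, 15, ..., 95 percent, which gives one ten-percent bin followed by seventeen five-percent bins. It further assumes that each bin includes its upper edge; the text does not specify how boundary values are assigned. All names are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List

# Bin edges in percent: 0, 10, 15, 20, ..., 95 (first bin spans 0-10%).
BIN_EDGES: List[int] = [0] + list(range(10, 100, 5))


def bin_index(similarity_percent: float) -> int:
    """Return the index of the histogram bin for a similarity value.

    Assumption: a bin includes its upper edge, so 10.0 falls into the
    first bin (0-10%) and 95.0 into the last one (90-95%).
    """
    if not 0 <= similarity_percent <= BIN_EDGES[-1]:
        raise ValueError("similarity must lie between 0 and 95 percent")
    # Search from the second edge on, so that 0.0 also lands in the first bin.
    return bisect_left(BIN_EDGES, similarity_percent, 1) - 1


def histogram(similarities: Iterable[float]) -> List[int]:
    """Count how many similarity values fall into each bin."""
    counts = [0] * (len(BIN_EDGES) - 1)
    for value in similarities:
        counts[bin_index(value)] += 1
    return counts
```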

                                                               # Samples       Length: Mean (Std. Deviation)
SCOP Id  SCOP Superfamily Name                                 Training  Test  Training        Test
a.1.1    Globin-like                                           60        30    150.3 (13.6)    151.6 (11.1)
a.3.1    Cytochrome c                                          44        22    102.6 (24.1)    118.4 (32.6)
a.39.1   EF-hand                                               49        25    138.1 (48.0)    122.0 (39.3)
a.4.5    "Winged helix" DNA-binding domain                     49        25    93.8 (26.6)     92.9 (23.1)
b.1.1    Immunoglobulin                                        207       104   108.9 (15.3)    106.7 (12.3)
b.10.1   Viral coat and capsid proteins                        64        32    278.0 (92.9)    262.1 (85.2)
b.29.1   Concanavalin A-like lectins/glucanases                52        27    221.2 (51.2)    220.8 (72.9)
b.40.4   Nucleic acid-binding proteins                         47        24    113.1 (36.6)    111.5 (47.2)
b.47.1   Trypsin-like serine proteases                         55        28    231.4 (29.5)    226.0 (30.1)
b.6.1    Cupredoxins                                           50        26    143.9 (34.6)    139.0 (31.5)
c.1.8    (Trans)glycosidases                                   62        31    376.5 (76.4)    397.8 (84.0)
c.2.1    NAD(P)-binding Rossmann-fold domains                  102       51    204.3 (58.9)    211.5 (75.1)
c.3.1    FAD/NAD(P)-binding domain                             45        23    226.1 (93.3)    223.3 (86.3)
c.37.1   P-loop containing nucleotide triphosphate hydrolases  127       64    259.3 (120.4)   253.4 (85.6)
c.47.1   Thioredoxin-like                                      56        28    111.6 (38.2)    105.6 (35.3)
c.69.1   Alpha/Beta-Hydrolases                                 51        26    350.1 (103.7)   323.7 (25.0)
Total:                                                         1120      566

Table 4.1: Overview of the SCOPSUPER95 66 corpus created for the assessment of current Profile HMM capabilities as well as for the comparison of the effectiveness of the methods developed in this thesis. For every superfamily, the alphanumeric SCOP Id and its name as defined in the database are given, together with the number of training and test samples and the mean sequence length (standard deviation in parentheses). The last row gives the total numbers of samples.

From the document: Advanced stochastic protein sequence analysis (pages 92-95)