

5.3 Protein Family HMMs with Reduced Complexity

5.3.2 Protein Family Modeling using Sub-Protein Units (SPUs)


Figure 5.12: Bounded Left-Right model architecture for robust feature based protein family modeling using small amounts of training samples. The model length (here 8) is determined by the median of the lengths of the training sequences, and the number of transitions per state is automatically bounded (here 4). For local alignments, arbitrary state entrances and exits are optionally allowed (dashed lines), with their probabilities fixed according to a uniform distribution. Exit probabilities are likewise set to some small value.
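The construction of the BLR transition structure described in the caption can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation; the function name, the fixed transition bound, and the uniform "1/n" weighting of the bounded forward transitions are assumptions, and the optional entry/exit transitions for local alignment are omitted for brevity.

```python
import statistics

def blr_transitions(train_lengths, bound=4):
    """Build a Bounded Left-Right transition matrix (sketch).

    The model length is the median of the training sequence lengths;
    each state may reach at most `bound` states (itself plus forward
    jumps), each with equal probability.  Entry/exit transitions for
    local alignment are left out of this sketch.
    """
    n = int(statistics.median(train_lengths))
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = range(i, min(i + bound, n))   # i, i+1, ... (bounded)
        p = 1.0 / len(targets)                  # uniform "1/n" weights
        for j in targets:
            A[i][j] = p
    return A
```

With training lengths `[7, 8, 8, 9, 10]` the sketch yields an 8-state model, matching the figure's example.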

Given the BLR topology for protein family HMMs, feature based models of semi-continuous type are initialized and trained using standard algorithms as discussed in section 3.2.2 on pages 47ff. For model initialization, the feature data corresponding to a particular protein sequence is aligned to the model's states using simple interpolation.
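The interpolation-based initialization can be pictured as a linear mapping of feature frames onto model states. The following is a minimal stand-in sketch (function name and integer-division scheme are my own, not taken from the thesis):

```python
def interpolate_alignment(num_frames, num_states):
    """Linearly map each feature frame to a model state, a simple
    stand-in for interpolation-based HMM initialization: frame t of
    num_frames is assigned to the proportionally corresponding state."""
    return [min(num_states - 1, (t * num_states) // num_frames)
            for t in range(num_frames)]
```

For example, eight frames distributed over four states yields two frames per state, which is then sufficient to compute initial state-wise emission statistics.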

Standard Baum-Welch training as described on pages 51ff is performed for model optimization. Note that the process of BLR model estimation is significantly less complex and can thus be performed much faster than its counterpart for Profile HMMs. Together with the reduced number of parameters involved, this is the major outcome of the new approach.

As already mentioned earlier, the procedures for general feature space estimation as well as model specialization by adaptation remain identical to those described in the previous sections. To summarize: by means of the richer feature based protein sequence representation, less complex global protein family models become possible, namely the Bounded Left-Right models presented here. These models contain significantly fewer parameters that need to be trained for robust protein family modeling. By means of effective adaptation techniques, powerful probabilistic models of protein families can be estimated in a completely data-driven manner without incorporating explicit prior knowledge.

conceptual character, various modifications and especially enhancements aiming at certain concrete tasks within molecular biology research beyond protein family analysis are imaginable. However, these applications and especially their evaluation with respect to biological (e.g. pharmaceutical) relevance are beyond the scope of this thesis.

Comparing current protein family modeling with state-of-the-art approaches in different pattern recognition applications, another fundamental difference becomes obvious.

Usually, protein families are covered by global probabilistic models capturing complete sequences. Even when estimating models for the functionally smallest units, i.e. the protein domains treated in this thesis, very large models consisting of more than a hundred states are not exceptional. Especially for remote homologue sequences, rather dissimilar parts of a particular protein family are integrated into a single probabilistic model. Contrary to this, in e.g. speech recognition systems, word models are established by concatenating significantly smaller building blocks (usually triphones). This becomes reasonable when analyzing complex languages containing large numbers of words which cannot all be trained directly, since there are usually too few training samples available. Triphones, however, are suitable building blocks for even the most complex words and can be trained "easily".

According to the literature, there are hardly any protein family modeling approaches following the paradigm of concatenating building blocks. One exception is the MEME system of William Grundy and colleagues, which was already discussed in section 3.2.2 on pages 63ff, including its general applicability to the remote homology analysis problem. MEME heuristically combines rather primitive motif HMMs into protein family models.

In principle, the idea of motif based protein family HMMs is very promising for tackling the sparse data problem as formulated on page 92. Since the new feature representation of protein sequence data is the fundamental approach behind the enhancements developed in this thesis, defining building blocks directly at the residue level seems counterproductive.

Instead, building blocks are defined directly on the feature data. In analogy to sub-word models in automatic speech recognition applications, these building blocks are called Sub-Protein Units (SPUs). The straightforward approach to modeling protein families using concatenations of such building blocks is to train models on training sets which are annotated with respect to the particular SPUs. Following this, variants of the protein family are extracted by analyzing the most frequent combinations of SPU concatenations. When classifying unknown sequences, all protein family variants obtained during training are evaluated in parallel.

Unfortunately, SPU based annotations of training sequences are generally not available.

The basic dilemma can be described as a kind of "chicken and egg" problem: SPU models (later serving as building blocks for whole protein family models) can only be trained when SPU based annotations are available. However, these annotations can only be generated if suitable SPU models are available. In this thesis, this principal problem is tackled using an iterative approach which allows combined SPU detection and model training in an unsupervised and data-driven manner.

Once SPUs are found which cover only the "interesting" or dominant parts of a protein family relevant for successful sequence classification, they are modeled using standard HMMs with reduced model complexity. Biochemical properties of the protein data analyzed are explicitly considered, and thus the resulting building blocks do not necessarily correspond to motifs. Since the overall protein family model is reduced to small essential parts, significantly fewer training samples are sufficient. By means of the most frequent SPU occurrences within the particular training sets, protein family models are derived automatically by concatenation of the building blocks.

The overall process of modeling protein families using SPU based HMMs can be divided into three parts which are described in the following, namely SPU Candidate Selection, Estimation of SPU Models, and Building Protein Family Models from Sub-Protein Units.

SPU Candidate Selection

The feature extraction method developed in section 5.1 provides a richer sequence representation aiming at better remote homology analysis when using Profile HMMs. The selection of SPU candidates is directly based on the 99-dimensional feature vectors. In the first step, general SPU candidates need to be extracted from the protein sequences, i.e. training sequences are annotated with respect to the binary decision whether the underlying frames are SPU or General. The SPU based annotation of the sample data will be used for SPU model training and protein family creation.

Various criteria for classifying parts of the overall feature representation of protein sequences as SPU or non-SPU (so-called General Parts, GP) are imaginable. As already mentioned earlier, the SPU based modeling approach discussed here represents a general framework for protein family modeling based on building blocks which are conceptually below the level of direct biological functions. In the exemplary version shown here, SPUs are defined as high-energy parts of the protein sequence, which will be motivated below. Note that alternative approaches for SPU candidate selection can be used equivalently within the overall framework.

All parts of the original training sample whose feature vectors' energy is below the average energy of the particular protein sequence are treated as General parts GP. The actual discrimination method based on the feature vectors' energy becomes reasonable when analyzing the feature extraction method in more detail (cf. its general description in section 5.1.2). In order to extract reasonable features, a Discrete Wavelet Transformation is applied to the signal-like representation of the biochemical properties of residues in their local neighborhood. Following this, the approximation and some detail coefficients are used as the basis for further analysis. One fundamental property of the wavelet transformation is the concentration of signal energy in the upper coefficients (cf. appendix A). Thus, high feature vector energy is a reasonable indicator of relevance.

For robust SPU candidate selection, the energy signal of the feature vectors corresponding to a particular protein sequence is post-processed using smoothing techniques (DWT smoothing and median filtering). By means of these techniques, protein sequences are principally sub-divided into SPUs and General parts GP, as can be seen in figure 5.13 for an exemplary Immunoglobulin (d1f5wa_). SPUs are extracted from the energy signal of the protein sequence (solid line) where the actual feature vector energy exceeds the average protein energy (dotted line). By means of post-processing, two SPU candidates (dashed rectangles) are selected.
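The energy criterion together with the median-filter post-processing can be sketched as follows. This is a simplified stand-in, assuming per-frame scalar energies as input; the DWT smoothing step is replaced by plain median filtering here, and the `min_len` and `window` parameters are hypothetical, not taken from the thesis.

```python
def spu_candidates(energies, min_len=3, window=5):
    """Energy-based SPU candidate selection (illustrative sketch):
    median-smooth the per-frame feature-vector energies, then mark
    maximal runs that lie above the sequence's average energy as
    SPU candidates; everything else is a General part (GP)."""
    half = window // 2
    padded = [energies[0]] * half + list(energies) + [energies[-1]] * half
    # median filter: median of each length-`window` neighborhood
    smoothed = [sorted(padded[i:i + window])[half]
                for i in range(len(energies))]
    avg = sum(energies) / len(energies)
    spus, start = [], None
    for i, e in enumerate(smoothed):
        if e > avg and start is None:
            start = i                              # run begins
        elif e <= avg and start is not None:
            if i - start >= min_len:
                spus.append((start, i))            # run ends; keep if long enough
            start = None
    if start is not None and len(smoothed) - start >= min_len:
        spus.append((start, len(smoothed)))
    return spus
```

Applied to a toy energy signal with two high-energy stretches, the sketch returns exactly two candidate intervals, mirroring the two SPU candidates selected for the Immunoglobulin example.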


Figure 5.13: Example of SPU candidate selection via the energy criterion: all contiguous parts of the feature representation of an exemplary Immunoglobulin (d1f5wa_) whose energy (solid line) is higher than the sequence's average (dashed line) are marked as SPU candidates (rectangles).

Estimation of SPU-Models

In the first step of the new protein family modeling approach, protein sequences are annotated with respect to the SPU candidate or General decision. Following this, corresponding SPUs need to be identified in order to train HMMs for a non-redundant set of SPUs relevant for the particular protein family.

The SPUs estimated for the protein family model, and the General model, which is unique for every protein family, are modeled using linear, semi-continuous HMMs. Once the training set is finally annotated using the non-redundant set of SPUs, these models are trained with the standard Baum-Welch algorithm.
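The linear topology mentioned here, i.e. a strict left-to-right chain, has a particularly simple transition structure. A minimal sketch (the even 0.5/0.5 split is only an initialization assumption; the actual probabilities result from Baum-Welch training):

```python
def linear_hmm_transitions(n):
    """Transition matrix of a linear (strict left-to-right) HMM as used
    for SPU and General models: each state allows only a self-transition
    and a step to the next state; the last state absorbs."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A[i][i] = 0.5          # stay in state i
        A[i][i + 1] = 0.5      # advance to state i+1
    A[n - 1][n - 1] = 1.0      # final state
    return A
```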

In the approach presented here, the final set of SPUs relevant for a particular protein family is obtained by applying a variant of the EM algorithm for agglomerative clustering of the initial (unique) SPU candidates. To this end, model evaluation and training of SPU HMMs are alternated until convergence. Here, convergence means a "stable" SPU based annotation of the training set, i.e. only minor differences between the annotations obtained in two succeeding iteration steps. During the iterative SPU determination, unique models for corresponding SPUs are estimated, since redundant models will not be hypothesized.

Thus, the set of effective SPU candidates is reduced stepwise, and the most frequent SPUs are used for the final annotation of the training set. This procedure, which is comparable to the k-means clustering for HMMs proposed in [Per00b], is summarized in figure 5.14.

1. Initialization

Obtain the initial set S0 of SPU candidates by, e.g., energy based annotation of the training sequences.

2. Training

Perform Baum-Welch training for SPU candidate models (linear HMMs) using the current annotation of the training set.

3. Annotation

Use the updated SPU models to obtain a new annotation of the training set (recognition phase).

4. Termination

Terminate if two subsequent annotations of the training set do not (substantially) differ, i.e. convergence; continue with step 2 otherwise.

5. SPU Candidate List Reduction

Reduce the set of SPU candidates by discarding those elements which are not included in the final annotation of the training sequences. Perform the final annotation of the training set using the remaining list of SPU candidates.

6. Final Training

Perform steps 2-4 until convergence and train the SPU models using the final annotation of the training set.

Figure 5.14: Algorithm for obtaining a non-redundant set of SPUs which are used for final protein family modeling (comparable to [Per00b]).
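The alternation of training and re-annotation in steps 2-4 of the algorithm can be expressed as a generic loop skeleton. In this sketch the `train` and `annotate` callbacks stand in for Baum-Welch training and the recognition phase respectively; they, the function name, and the `max_iters` safeguard are illustrative assumptions, not part of the thesis.

```python
def iterate_spu_annotation(train, annotate, initial_annotation,
                           max_iters=20):
    """Skeleton of the iterative SPU determination of figure 5.14:
    alternate model training (step 2) and re-annotation (step 3) until
    two successive annotations of the training set agree (step 4),
    then keep only the SPUs that survive in the final annotation
    (step 5)."""
    annotation = initial_annotation
    for _ in range(max_iters):
        models = train(annotation)          # step 2: Baum-Welch stand-in
        new_annotation = annotate(models)   # step 3: recognition stand-in
        if new_annotation == annotation:    # step 4: convergence check
            break
        annotation = new_annotation
    # step 5: discard SPU candidates absent from the final annotation
    surviving = {spu for seq in annotation for spu in seq}
    return annotation, surviving
```

Because candidates that never win the recognition phase drop out of the annotation, the effective SPU set shrinks monotonically over the iterations, as described in the text.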

Building Protein Family Models from Sub-Protein Units

Given the non-redundant set of SPUs relevant for the particular protein family, the global protein family model is finally created. The protein family model itself consists of variants of SPU concatenations obtained during training. The N variants most frequently found within the annotation of the particular training sets are extracted for the conceptual family definition. Here, optional parts as well as looped occurrences are possible. For actual protein sequence classification, all variants are evaluated in parallel and determine the final classification decision. Comparable to e.g. speech recognition applications, the variants are mutually connected using pseudo-states which gather transitions.
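The parallel evaluation of variants reduces, conceptually, to taking the best variant score and comparing it against a decision threshold. A minimal sketch, where `score_fn` stands in for HMM evaluation (e.g. a Viterbi log-odds score) and all names are illustrative:

```python
def classify_by_variants(feature_seq, variants, score_fn, threshold):
    """Evaluate all SPU-concatenation variants of a protein family in
    parallel and accept the sequence if the best-scoring variant
    reaches the threshold; also return which variant won."""
    best_variant, best_score = None, float("-inf")
    for v in variants:
        s = score_fn(v, feature_seq)
        if s > best_score:
            best_variant, best_score = v, s
    return best_score >= threshold, best_variant
```

In a real system the variants would share SPU models via the connecting pseudo-states, so the parallel evaluation costs little more than a single large model.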

In figure 5.15 the complete concept for estimating SPU based protein family models is summarized graphically. The three steps described above directly correspond to the three parts marked. In the first row, SPU candidates are highlighted in red. For clarity, the amino acid representation is shown; however, the selection is performed using the 99-dimensional feature vectors. Based on the initial list of SPU candidates, a non-redundant set of SPUs is obtained by applying the iterative SPU estimation algorithm. In the middle part of the sketch, the corresponding linear SPU HMMs are symbolized. For the global protein family model, these SPUs are concatenated. The most frequently occurring SPU concatenations within the training set serve as the basis for the protein family variants, which are evaluated in parallel when classifying unknown protein sequences (lower part of figure 5.15).


Figure 5.15: Overview of the SPU based protein family modeling process: 1) SPU candidate selection; 2) estimation of the non-redundant set of SPU models; 3) protein family modeling using variants of the most frequently occurring SPU combinations within the training set. All optional parts of the model variants are marked with '?'.