5.3 Protein Family HMMs with Reduced Complexity

5.3.1 Beyond Profile HMMs

Current Profile HMMs are direct derivatives of the classical Dynamic Programming approach. Based on the analysis of symbolic representations of protein data, the essentials of a protein family are captured by a stochastic model. In fact, the more or less soft position-dependent modeling of amino acid distributions is the basic advantage of Profile HMMs compared to classical sequence-to-sequence alignment techniques. Similar to classical pairwise alignment techniques, special states (Insert and Delete) are incorporated for flexibility regarding residue insertion and deletion. This is also important for local sequence-to-model alignments where only parts of the model match (parts of) the sequence. Insert states also contain (more or less) position-dependent amino acid distributions.10 The classical Profile HMM architecture and its derivatives guarantee the highest flexibility for sequence-to-family alignment. Generally, the basic principle of Profile HMM evaluation corresponds to “probabilistic Dynamic Programming”.

9 Note that this thesis is directed at protein family models, which means that at least protein domains are covered by stochastic models. Compared to motif-based Profile HMMs, the modeling base is significantly larger for protein domains in the majority of applications.

In this thesis, the usual application of Profile HMMs is considered. By means of representative sample sets of protein sequences, stochastic models of protein families are estimated. Using these models, the affiliation of new protein sequences with a particular protein family is predicted, either by local or by global alignment. In classical three-state Profile HMMs the consensus of a multiple sequence alignment is modeled via a linear chain of Match states.

Since the consensus represents the conserved parts of a protein family, the Match states contain the most relevant information for a particular protein family of interest. Thus, the amino acid distributions assigned to Match states are position specific for the columns of the MSA they represent. The amino acid distributions of Insert states are usually not position specific, since that would assume conservation which is already covered by the appropriate Match state. However, Insert states’ amino acid distributions are model specific. Since current Profile HMMs are based on discrete amino acid data, and every non-silent state emits amino acid symbols, the states’ distributions are usually rather specific – they are estimated for a single column of an MSA. Thus, the global decision regarding family affiliation of a particular protein sequence (or parts of it) requires the complex three-state model topology for “probabilistic Dynamic Programming”. In several informal experiments based on the SCOPSUPER95 66 corpus (cf. section 4.1.1), the superior performance of three-state Profile HMMs compared to various alternative model topologies was demonstrated when processing discrete amino acid data.

Complexity Reduction due to Feature Representation: The central idea of this thesis is the explicit consideration of biochemical properties of residues in their local neighborhood for protein data classification (cf. section 5.1). By means of features covering these properties, a richer sequence representation is used for protein sequence analysis. Emissions of protein family HMMs are now based on the mixture density representation of the new feature space. The resulting continuous emission probability distributions are much broader than the discrete amino acid distributions of current Profile HMMs while keeping the specificity necessary for sequence classification. If features properly match the emission probability distribution of a particular state, the resulting contribution to the overall classification score is rather high, which corresponds to the Match case of Dynamic Programming. Contrary to this, if the features do not match the state’s probability distribution, the local score will be small, which corresponds to an Insertion. Thus, the explicit discrimination between Insert and Match states is no longer needed because it is already performed implicitly at the emission level. Furthermore, explicit Deletes are only conceptual and can be replaced by direct jumps skipping direct neighbors. This means that at least two thirds of the states contained in Profile HMMs can be discarded. One third of the states, namely the Insert states, contain transition probabilities as well as emission probabilities. When skipping the Inserts, the model architecture becomes less complex and thus the number of parameters which need to be trained can be decreased substantially.

10 For Insert states, often more general, e.g. family specific, amino acid distributions are used instead of strictly position-dependent distributions.
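The reduction of the state inventory can be sketched in a few lines of Python. This is purely illustrative – the function names and the omission of begin/end states are simplifications, not part of the thesis:

```python
def profile_hmm_state_count(num_columns):
    """Classical three-state Profile HMM: one Match, one Insert, and one
    Delete state per consensus column (begin/end states ignored here)."""
    return 3 * num_columns

def reduced_state_count(num_columns):
    """Feature-based chain model: a single emitting state per column;
    Inserts are absorbed by broad emissions, Deletes by skip transitions."""
    return num_columns

cols = 100
print(profile_hmm_state_count(cols), "->", reduced_state_count(cols))
# two thirds of the states are discarded
```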

To summarize, due to the feature-based representation, the strict position specificity of Profile HMMs for global protein family models can in principle be discarded, and alternative model topologies with reduced complexity can be used for protein family modeling – beyond Profile HMMs.

Bounded Left-Right Models for Protein Families: In figure 3.10 on page 45, various standard HMM topologies which are well known from different general pattern recognition applications were presented. The flexibility of the models depends on their complexity. Simple linear models, where every state is adjacent to itself and to its immediate neighbor to the right, are very common e.g. in speech recognition applications for modeling small acoustic units (so-called triphones) which do not significantly vary in length. Bakis models, where every state is connected to three adjacent states (including itself), are the methodology of choice in the domain of automatic handwriting recognition, where letters are modeled which can contain moderate variations in length depending on the actual writer. The most flexible model architecture that is not ergodic, i.e. not fully connected, is the Left-Right topology.11 Here, every state is connected to all states adjacent to the right. Thus, arbitrary forward jumps within the model are possible (including self-transitions), which allows covering signals of arbitrary length.
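The three topologies differ only in how far to the right each state may reach. A minimal sketch (the representation as a Boolean adjacency matrix is an implementation choice, not from the thesis):

```python
def topology_mask(L, kind):
    """mask[i][j] is True if a transition from state i to state j is
    allowed. States are ordered left to right; kind selects the topology."""
    reach = {"linear": 2, "bakis": 3, "left-right": L}[kind]
    return [[i <= j < min(i + reach, L) for j in range(L)]
            for i in range(L)]

def num_transitions(mask):
    """Count the allowed transitions of a topology mask."""
    return sum(sum(row) for row in mask)

# For 8 states: linear 15, Bakis 21, Left-Right 36 transitions.
for kind in ("linear", "bakis", "left-right"):
    print(kind, num_transitions(topology_mask(8, kind)))
```

The transition count grows linearly with the number of states for linear and Bakis models, but quadratically for the plain Left-Right topology.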

When modeling protein families containing related but highly diverging sequences, global models need to offer high flexibility for covering the length variance of the family members. Generally, (almost) arbitrary jumps within the model must therefore be possible for directly accessing matching parts of the model and skipping parts which are irrelevant for particular sequences of interest. Thus, Left-Right topologies are in principle the methodology of choice for protein family modeling beyond Profile HMMs.

However, if arbitrary jumps within a protein family model are allowed, as defined for plain Left-Right topologies, especially for models covering larger protein families the number of parameters to be trained is still rather high. Since every state is connected to all states adjacent to the right and to itself, the number of transition probabilities $N_t$ for a model containing $L$ states is defined as follows:

$$N_t = \sum_{i=1}^{L} i + 1 = \frac{L}{2}(L+1) + 1.$$

The additional offset is due to the model exit transition from the last state. If, furthermore, every state contains a direct model exit transition (which can be favorable for local alignments), $N_t$ needs to be increased by $L-1$. For an exemplary protein family model consisting of 100 states, the number of transition probabilities is 5 051 for the basic Left-Right architecture and 5 150 if model exit is possible from every state.

11 Ergodic models are of minor importance for the vast majority of applications because either backward jumps are useless (especially when signals evolving in time are analyzed), or the number of parameters is simply exorbitant and thus models cannot be trained robustly.
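These counts follow directly from the formula and can be checked with a few lines of Python (a sketch; the function name is hypothetical):

```python
def n_transitions(L, exit_from_every_state=False):
    """Transition count of a plain Left-Right model with L states:
    state i (1-based) reaches itself and the L - i states to its right,
    plus one model-exit transition from the last state."""
    n = sum(range(1, L + 1)) + 1   # = L/2 * (L + 1) + 1
    if exit_from_every_state:
        n += L - 1                 # extra exit transitions for states 1..L-1
    return n

print(n_transitions(100))        # 5051
print(n_transitions(100, True))  # 5150
```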

Even for extremely diverging sequences belonging to a particular protein family, it is rather unrealistic to assume arbitrary alignment paths through the appropriate protein family model, as allowed by the plain Left-Right topology. The majority of state transitions will not be observed and can, therefore, not be trained. Thus, a variant of standard Left-Right models is developed for protein family modeling – so-called Bounded Left-Right (BLR) models. The basic idea is to restrict direct state transitions to a local context of a particular state, finding a reasonable compromise between linear and complete Left-Right models and resulting in significantly fewer transition parameters to be trained. The number of state transitions depends on the length of the underlying protein family model, which is determined as follows.

When applying BLR models to global alignment tasks, where protein families are completely evaluated for the whole sequence of interest, the model needs to ensure that even the smallest relevant sample sequence can be completely assigned to it. In Profile HMMs and standard Left-Right models, in principle every state is connected to the model exit, either via Delete states or explicit transitions as mentioned earlier. However, in the majority of cases these are only fallback solutions, since the transitions certainly cannot be trained using real sample sets. Protein family models need to cover the majority of family members. Thus, the length of the models is determined by analyzing the training samples: the length of a BLR model is set to the median of the lengths of the training data. Compared to e.g. the arithmetic mean, the median is more robust against outliers. Given the length of the model and the rule explained above that it must be possible to align even the smallest relevant sample sequence to it, the number of direct state transitions $N_s^F$ for a particular state $s$ of a given protein family $F$ is adjusted as follows:

$$N_s^F = \min\left(\frac{\mathrm{median}(\text{length of sample sequences})}{\min(\text{length of sample sequences})},\ \#\text{ states adjacent to the right}\right). \tag{5.12}$$

At the end of the model, the number of successors can be smaller than the number of transitions calculated. By means of the selection in equation 5.12 (’min’), it is technically guaranteed that all transitions of a particular state point to states which actually exist.
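One possible reading of equation 5.12 in code. Two details are assumptions, since the thesis does not fix them here: the median-to-minimum ratio is truncated to an integer, and the self-transition is counted among the outgoing transitions. The sample lengths are hypothetical:

```python
from statistics import median

def blr_transitions_per_state(lengths, s, L):
    """Outgoing transitions of state s (0-based) in a BLR model of L states:
    the median/min length ratio bounds the jump range (truncation to an
    integer is an assumption), capped near the model end so that all
    transitions point to existing states."""
    bound = int(median(lengths) // min(lengths))
    return min(bound, L - s)

lengths = [100, 90, 110, 20, 95]  # hypothetical training-set lengths
L = 8
print([blr_transitions_per_state(lengths, s, L) for s in range(L)])
# median 95, minimum 20 -> at most 4 transitions per state
```

Note how the cap becomes active only for the last few states, exactly as described for the model end above.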

For local alignments, optionally every state can serve as model entrance and exit. The corresponding transition probabilities are fixed by assuming uniform distributions, which is reasonable according to [Dur98, p. 113ff]. The BLR architecture of protein family models is illustrated in figure 5.12.

Compared to the approximately 5 000 transitions for the complete Left-Right model architecture of the exemplary protein family given above, the number of parameters to be trained for the BLR topology is decreased to approximately 500 when assuming a median length of 100 and a minimum length of 20. For the three-state Profile HMM architecture, the number of transition parameters for the given example is approximately 2 700. Additionally, the number of emitting states in BLR models is halved compared to standard three-state Profile HMMs. Note that, due to respecting local amino acid contexts already at the level of emissions, feature-based BLR models are usually significantly shorter than Profile HMMs.


Figure 5.12: Bounded Left-Right model architecture for robust feature-based protein family modeling using small amounts of training samples. The model length (here 8) is determined by the median of the lengths of the training sequences, and the number of transitions per state is automatically bounded (here 4). For local alignments, optionally arbitrary state entrances and exits are allowed (dashed lines), whose probabilities are fixed according to a uniform distribution (1/n). Exit probabilities are equally set to some small value.

Given the BLR topology for protein family HMMs, feature-based models of semi-continuous type are initialized and trained using standard algorithms as discussed in section 3.2.2 on pages 47ff. For model initialization, the feature data corresponding to a particular protein sequence is reasonably aligned to the model’s states using simple interpolation.
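Such an initial alignment can be sketched as follows. This is one straightforward way to realize "simple interpolation" – the exact scheme used in the thesis may differ:

```python
def linear_alignment(num_frames, num_states):
    """Assign each feature frame t to a model state by linear interpolation
    over the sequence, yielding an initial frame-to-state alignment that can
    seed the emission statistics before Baum-Welch training."""
    return [min(int(t * num_states / num_frames), num_states - 1)
            for t in range(num_frames)]

print(linear_alignment(12, 4))  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```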

Standard Baum-Welch training as described on pages 51ff is performed for model optimization. Note that the process of BLR model estimation is significantly less complex and can thus be performed much faster than its analogue for Profile HMMs. In addition to the reduced number of parameters involved, this is the major outcome of the new approach.

As already mentioned earlier, the procedures of general feature space estimation as well as model specialization by adaptation remain identical to those described in the previous sections. To summarize, by means of the richer feature-based protein sequence representation, less complex global protein family models become possible, namely the Bounded Left-Right models presented here. These models contain significantly fewer parameters which need to be trained for robust protein family modeling. By means of effective adaptation techniques, powerful probabilistic models of protein families can be estimated in a completely data-driven manner without incorporating explicit prior knowledge.