4.2 Improving the Quality of HMM Based Sequence Analysis

4.2.3 Accelerating the Model Evaluation

In the “Introduction for the impatient” of the manual for the Wise tools, a commonly used framework for general sequence analysis using HMMs, the opinion currently widespread in the research community regarding the efficiency of model evaluation is characterized by the following (hypothetical) dialog between a Wise user and the developers [Bir01]:

“[Question:] It goes far too slow

[Answer:] Well . . . I have always had the philosophy that if it took you over a month to sequence a gene, then 4 hours in a computer is not an issue.”

Most current HMM based approaches for sequence analysis were developed with exclusive regard to the general method, i.e. neglecting the efficiency of the actual model evaluation. Contrary to the argumentation of the developers of Wise cited above, “4 hours in a computer” is indeed an issue. As discussed in chapter 1, modern molecular biology is strongly influenced by the paradigm shift brought about by computational sequence analysis. The gain in biological knowledge obtained by broad screening of large databases for particular protein families became possible through powerful and efficient techniques like BLAST. Thus, efficiency is also important when applying more sensitive and powerful techniques like Profile HMMs. Some of the present techniques run on specialized and distributed hardware solutions (e.g. HMMER was ported to the massively parallel PARACEL GeneMatcher™ architecture). Although such “brute-force” accelerations allow HMM based sequence analysis to be performed on large databases, the principal problem of inefficient model evaluation remains. Increasing the computational power for faster evaluation only treats the symptoms. Algorithmic accelerations of the model evaluation are required, especially when more complex procedures for emission estimation, like feature based approaches, are applied to improve the classification accuracy for remote homologue sequences.

Inspired by general pattern recognition applications of Hidden Markov Models, this thesis adopts concepts for accelerating the evaluation of protein family models and transfers them to the bioinformatics domain. Following the argumentation given above, such acceleration techniques focus on algorithmic changes within the evaluation process. For annotation tasks (e.g. for drug target validation), multiple Profile HMMs are currently evaluated sequentially, i.e. every query sequence is aligned serially to every Profile HMM considered, and the classification decision is determined by comparing alignment scores. Such tasks are generally comparable to automatic speech recognition applications, where signal parts are classified with respect to a fixed (large) inventory of words.
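The sequential scheme described above can be sketched as follows. This is a minimal illustration only: a plain Viterbi score on small discrete HMMs stands in for Profile HMM alignment, and the model layout (`start`, `trans`, `emit`), the model names and all probabilities are hypothetical values chosen for the example, not taken from this thesis or from any existing tool.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(model, sequence):
    """Log probability of the single best state path for `sequence`."""
    start, trans, emit = model["start"], model["trans"], model["emit"]
    n_states = len(start)
    # Initialisation with the first symbol.
    scores = [safe_log(start[s]) + safe_log(emit[s].get(sequence[0], 0.0))
              for s in range(n_states)]
    # Recursion: every state is recomputed for every remaining symbol.
    for symbol in sequence[1:]:
        scores = [
            max(scores[p] + safe_log(trans[p][s]) for p in range(n_states))
            + safe_log(emit[s].get(symbol, 0.0))
            for s in range(n_states)
        ]
    return max(scores)


def classify_sequentially(models, sequence):
    """Score the query against every model in turn, then compare scores."""
    scored = {name: viterbi_log_score(m, sequence) for name, m in models.items()}
    return max(scored, key=scored.get), scored


# Two hypothetical toy "family" models over a two-letter alphabet.
MODELS = {
    "family_a": {"start": [1.0, 0.0],
                 "trans": [[0.5, 0.5], [0.0, 1.0]],
                 "emit": [{"A": 0.9, "L": 0.1}, {"A": 0.8, "L": 0.2}]},
    "family_b": {"start": [1.0, 0.0],
                 "trans": [[0.5, 0.5], [0.0, 1.0]],
                 "emit": [{"A": 0.1, "L": 0.9}, {"A": 0.2, "L": 0.8}]},
}

best_family, all_scores = classify_sequentially(MODELS, "AAAL")
print(best_family, all_scores)
```

The cost of this scheme grows linearly with the number of models, and the decision is only available after the last model has been scored completely, which is the property the parallel evaluation is meant to avoid.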

In particular, online speech recognition is not possible when sequential model evaluation is combined with a posterior decision, as in the bioinformatics case. Instead, all models are evaluated in parallel, which makes the overall process much faster when sophisticated model combinations based on combined state spaces and pruning techniques are used.

In this thesis, protein family models are similarly evaluated in parallel and certain pruning techniques are applied allowing for fast model evaluation.

Furthermore, effective pruning techniques are applied which significantly limit the average computational effort for model evaluation. The basic motivation for such pruning techniques is the observation that complex Hidden Markov Models (like Profile HMMs) capture very different local properties of the data. However, when evaluating the models using the Viterbi or the Forward (Backward) algorithm, a large number of possible paths through the state space is analyzed. Paths covering certain local characteristics are very probable when data with these characteristics is observed, whereas paths representing alternative local characteristics are very unlikely to match this data. Although the latter hardly contribute to the global solution, they are also considered in the general evaluation scheme. Pruning such “irrelevant” paths accelerates the model evaluation significantly. In fact, when multiple models are combined into a common state space for parallel protein family HMM evaluation and pruning techniques are applied, significant parts of complete models can be skipped during evaluation.
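One common way to realize such pruning is a beam search variant of the Viterbi algorithm, sketched below. The sketch assumes a relative threshold in log units (`beam_width`, a hypothetical parameter name) below the best hypothesis at each position; the concrete pruning criterion used in this thesis may differ. The toy model, with one A-rich and one L-rich pair of states, is likewise invented for illustration.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def beam_viterbi(start, trans, emit, sequence, beam_width):
    """Viterbi scoring that expands only states whose score lies within
    `beam_width` (log units) of the best score at the current position.

    Returns (best log score, total number of states expanded).
    """
    n_states = len(start)
    active = {}
    for s in range(n_states):
        score = safe_log(start[s]) + safe_log(emit[s].get(sequence[0], 0.0))
        if score > NEG_INF:
            active[s] = score
    expanded = 0
    for symbol in sequence[1:]:
        if not active:
            return NEG_INF, expanded
        # Pruning step: drop hypotheses that fall too far behind the best one.
        best = max(active.values())
        survivors = {s: v for s, v in active.items() if v >= best - beam_width}
        expanded += len(survivors)
        nxt = {}
        for prev, score in survivors.items():
            for s in range(n_states):
                cand = (score + safe_log(trans[prev][s])
                        + safe_log(emit[s].get(symbol, 0.0)))
                if cand > nxt.get(s, NEG_INF):
                    nxt[s] = cand
        active = nxt
    if not active:
        return NEG_INF, expanded
    return max(active.values()), expanded


# Hypothetical model: states 0/1 prefer "A", states 2/3 prefer "L".
START = [0.5, 0.0, 0.5, 0.0]
TRANS = [[0.6, 0.4, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.6, 0.4],
         [0.0, 0.0, 0.0, 1.0]]
EMIT = [{"A": 0.9, "L": 0.1}, {"A": 0.8, "L": 0.2},
        {"A": 0.1, "L": 0.9}, {"A": 0.2, "L": 0.8}]

exact_score, exact_expanded = beam_viterbi(START, TRANS, EMIT, "AAAAAA", float("inf"))
pruned_score, pruned_expanded = beam_viterbi(START, TRANS, EMIT, "AAAAAA", 3.0)
print(exact_score, exact_expanded, pruned_score, pruned_expanded)
```

With an infinite beam the function reduces to ordinary Viterbi scoring. On this toy input the pruned run finds the same best score while expanding fewer states, because the L-rich states drop out of the beam early. In general, beam pruning is a heuristic: a beam that is too narrow can discard the path that would have turned out best.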

Efficient model evaluation techniques generally address the third basic issue relevant for the successful application of enhanced probabilistic models for protein sequence analysis (cf. page 92). In this thesis it is considered for all parts of the developments.

Overview of the Concepts

The basic concepts for general improvements of HMM based sequence analysis developed in this thesis are summarized graphically in figure 4.6. The fundamental approach for improved classification performance of protein family HMMs, namely the feature based sequence representation, is illustrated in the upper frame. For all considered amino acid sequences, relevant feature vectors based on biochemical properties of the sequences are extracted. The resulting high-dimensional and in principle continuous feature space is represented using a mixture density (from left to right).

Based on the new feature representation, semi-continuous Profile HMMs are developed, as shown in the next frame (second from top). The model topology of Profile HMMs is kept fixed, while the discrete emissions are replaced with semi-continuous ones obtained using the mixture density representation of the feature space and the feature vectors extracted from the respective protein sequences.
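In the usual semi-continuous formulation, all states share one set of mixture components and differ only in their mixture weights. Assuming Gaussian components, the emission density of a state j for a feature vector x can be written as below. The symbols (b_j, c_jk, K, and the component parameters) are notation chosen here for illustration and are not necessarily the ones used elsewhere in this thesis.

```latex
b_j(\mathbf{x}) = \sum_{k=1}^{K} c_{jk}\,
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad c_{jk} \ge 0, \quad \sum_{k=1}^{K} c_{jk} = 1
```

Because the component densities do not depend on the state, they need to be computed only once per feature vector and can then be reused by every Match and Insert state.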

[Figure 4.6, four frames from top to bottom. *) Rich feature based protein sequence representation: protein sequences, feature vector representation, mixture density representation of feature space. 1) Feature based semi-continuous Profile HMMs: Profile HMM model architecture with semi-continuous emissions for Match and Insert states based on the mixture density representation of the feature space. 2) Protein family HMMs with reduced model complexity: less complex models for the complete protein family (PF) or small conserved regions (Sub-Protein Units: SPUs) and weaker conserved parts (General); conservation and emissions of all states based on the mixture density representation of the feature space. Comment: filled squares represent pseudo-states, which are for better readability only. *) Efficient model evaluation: parallel evaluation of all models in a combined state space V containing all states S; acceleration of model evaluation by pruning techniques (skipping non-relevant Viterbi paths early). Both concepts marked with "*)" are the fundamentals for 1) and 2).]

Figure 4.6: Concepts for improved HMM based sequence analysis developed in this thesis (see text for explanation).

The third frame shows models with reduced complexity, either for the whole protein family of interest (PF) or for Sub-Protein Units (SPU1 and SPU2) and weaker conserved parts of the protein family (General). All models are based on, and only possible due to, the new feature representation. Generally, several variants of a certain protein family or of parts of it are possible, and these are evaluated in parallel. The particular models are, therefore, conceptually combined using so-called pseudo-states (filled squares). These pseudo-states gather the transitions from any variant to any other.

The lower frame illustrates the overall acceleration of the model evaluation using a combined state space and general pruning techniques. The states of three exemplary models are combined in the common state space, and during evaluation irrelevant paths (black circles) are excluded from further analysis. Only the states marked red are actually evaluated.
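The idea of the lower frame can be sketched in a few lines. The sketch makes several simplifying assumptions: small discrete HMMs replace the semi-continuous Profile HMMs, hypotheses are keyed by a (model name, state index) pair to form the common state space, there are no pseudo-states or transitions between models, and a single relative threshold `beam_width` (a hypothetical parameter) is shared by all models. All names and probabilities are invented for illustration.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def evaluate_in_parallel(models, sequence, beam_width):
    """Score all models in one pass over `sequence` under a shared beam.

    Returns (name of best model, best log score, names of models that still
    have at least one active state after the last symbol).
    """
    active = {}
    for name, m in models.items():
        for s, p in enumerate(m["start"]):
            score = safe_log(p) + safe_log(m["emit"][s].get(sequence[0], 0.0))
            if score > NEG_INF:
                active[(name, s)] = score
    for symbol in sequence[1:]:
        if not active:
            break
        best = max(active.values())
        nxt = {}
        for (name, prev), score in active.items():
            if score < best - beam_width:
                continue  # pruned: this hypothesis is not expanded further
            m = models[name]
            for s, p in enumerate(m["trans"][prev]):
                cand = score + safe_log(p) + safe_log(m["emit"][s].get(symbol, 0.0))
                if cand > nxt.get((name, s), NEG_INF):
                    nxt[(name, s)] = cand
        active = nxt
    if not active:
        return None, NEG_INF, set()
    (best_name, _), best_score = max(active.items(), key=lambda kv: kv[1])
    return best_name, best_score, {name for name, _ in active}


# Two hypothetical toy "family" models over a two-letter alphabet.
MODELS = {
    "family_a": {"start": [1.0, 0.0],
                 "trans": [[0.5, 0.5], [0.0, 1.0]],
                 "emit": [{"A": 0.9, "L": 0.1}, {"A": 0.8, "L": 0.2}]},
    "family_b": {"start": [1.0, 0.0],
                 "trans": [[0.5, 0.5], [0.0, 1.0]],
                 "emit": [{"A": 0.1, "L": 0.9}, {"A": 0.2, "L": 0.8}]},
}

best_name, best_score, surviving = evaluate_in_parallel(MODELS, "AAAAAA", beam_width=3.0)
print(best_name, best_score, surviving)
```

Because all hypotheses compete under the same beam, a model that matches the data poorly loses all of its active states after a few symbols and causes no further cost, which corresponds to skipping significant parts of complete models.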