
5.4 Accelerating the Model Evaluation by Pruning Techniques

5.4.1 State-Space Pruning

Analyzing the state-of-the-art in HMM-based protein sequence analysis and reconsidering the basic theory of Hidden Markov Models as summarized in section 3.2.2, it becomes clear that the evaluation of probabilistic protein family models is usually rather straightforward. This means that no optimizations, either heuristic or theoretic, are applied at all. Especially for automatic speech recognition the situation is rather different: already in the late 1970s Bruce Lowerre proposed the so-called Beam-Search algorithm for heuristic state-space pruning during model evaluation [Low76]. By means of this technique, substantial accelerations in HMM evaluation become possible.

In the following, the Beam-Search approach is introduced and its general transfer to the bioinformatics domain is presented. Note that this optimization technique can be applied to all kinds of protein family HMMs, including discrete, i.e. state-of-the-art, and semi-continuous feature-based models. Due to the substantial additional computational effort required for mixture density evaluation, the relevance of state-space pruning is extraordinary.

The Beam-Search Algorithm for Accelerated HMM Evaluation

In order to find the most probable path $\vec{s}$ through the whole state space $V$ of an HMM $\lambda_k$ producing the observation sequence $\vec{o}$, the Viterbi algorithm is widely used, as discussed in section 3.2.2. Reconsidering the main idea, each step $t$ of the incremental algorithm basically consists of the calculation of maximally achievable probabilities $\delta_t(i)$ for partial emission sequences $\vec{o}_1 \ldots \vec{o}_t$ and state sequences $s_1 \ldots s_t$:

\[
\delta_t(i) = \max_{s_1 \ldots s_{t-1}} \left\{ P(\vec{o}_1, \ldots, \vec{o}_t, s_1 \ldots s_t \mid \lambda_k) \;\middle|\; s_t = S_i \right\}.
\]

Since the dependencies of the HMM states are restricted to their immediate predecessors (the so-called Markov property), the calculation of $\delta_{t+1}(j)$ is limited to the estimation of the maximum of the product of the preceding $\delta_t(i)$ and the appropriate transition probability. Additionally, the local contribution of the emission probability $b_j(\vec{o}_{t+1})$ is considered.

Stepping through the state space recursively, all $\delta_{t+1}(j)$ are calculated using the following rule:

\[
\delta_{t+1}(j) = \max_i \left\{ \delta_t(i)\, a_{ij} \right\} \, b_j(\vec{o}_{t+1}).
\]

Figure 5.17 illustrates the recursive calculation of $\delta_t(i)$ during the Viterbi algorithm.
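As a minimal sketch, the recursion above can be written down directly for a discrete HMM. The function name `viterbi_scores` and the interface (transition matrix `A`, emission matrix `E`, initial distribution `pi`) are illustrative assumptions, not the thesis' actual implementation, which also covers semi-continuous emission densities:

```python
import numpy as np

def viterbi_scores(A, E, pi, obs):
    """Partial path probabilities delta_t(i) of a discrete HMM.

    A   : (N, N) transition matrix, A[i, j] = a_ij
    E   : (N, M) emission matrix, E[j, o] = b_j(o) for discrete symbols
    pi  : (N,) initial state probabilities
    obs : sequence of T observation symbol indices
    """
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    delta[0] = pi * E[:, obs[0]]  # initialization: delta_1(i) = pi_i * b_i(o_1)
    for t in range(T - 1):
        # delta_{t+1}(j) = max_i { delta_t(i) * a_ij } * b_j(o_{t+1})
        delta[t + 1] = np.max(delta[t][:, None] * A, axis=0) * E[:, obs[t + 1]]
    return delta
```

Backtracking the actual state sequence would additionally require storing the maximizing predecessor of each cell, which is omitted here for brevity.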

When analyzing the necessary computational effort for the Viterbi algorithm, it becomes clear that after only a few steps a large number of possible paths needs to be considered. Figure 5.18 gives an idea of the overall number of explored states while stepping through the state space for two different model architectures, namely a classical linear model as used for speech recognition applications (upper part) and the standard three-state Profile HMM topology (lower part). The more states have to be explored at each step, the more continuations of all paths possible so far become reasonable. Thus, the number of paths traced overall increases dramatically and, as a consequence, so does the processing time necessary for model evaluation. Alleviating the constraints on the model architecture implies increasing the decoding effort!


Figure 5.17: Illustration of the recursive calculation scheme for estimating the path probabilities by recombination (courtesy of [Fin03]).

Ergodic HMMs (cf. figure 3.10 on page 45) in principle allow arbitrary paths through the state space. Since the number of state combinations which need to be explored reaches the extreme value of $N^2$, the computational effort is substantial. Linear model architectures moderately bound the number of possible paths by strongly restricting the successors of a given state to the state itself and its immediate successor. However, due to self-transitions, the number of states to be considered increases rapidly when stepping through the model. Since discrete protein data covered by Profile HMMs usually varies in both length and constitution, the more flexible three-state Profile HMM architecture was developed. Due to the Delete states, principally every state (adjacent to the right of a particular one) of the HMM can be reached from anywhere in the model, which is comparable to the ergodic topology. Thus, a huge number of paths generally needs to be evaluated. On the right side of the sketches in figure 5.18, the substantially differing number of states that need to be explored is shown for both architectures. The more states are explored, the more computational effort is required for model evaluation.
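The growth of the explored state set can be illustrated with a small reachability sketch. The adjacency-matrix representation and the helper `explored_states_per_step` are hypothetical, chosen only to mimic, for a toy linear topology, the counting shown in figure 5.18:

```python
import numpy as np

def explored_states_per_step(adj, start, T):
    """Number of states reachable after each of T steps through the model.

    adj   : (N, N) boolean adjacency, adj[i, j] = True iff a_ij > 0
    start : index of the initial state
    """
    active = np.zeros(adj.shape[0], dtype=bool)
    active[start] = True
    counts = []
    for _ in range(T):
        # a state j becomes active if any active state i has a transition i -> j
        active = (active.astype(int) @ adj.astype(int)) > 0
        counts.append(int(active.sum()))
    return counts
```

For a linear chain with self-transitions the count grows by one per step, whereas a Profile HMM adjacency with its Delete-state chain would activate almost the whole model after very few steps.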

In order to bound the problem, Bruce Lowerre introduced the Beam-Search algorithm, establishing a dynamic criterion for search-space pruning based on the relative differences of the partial path scores [Low76]. The state space is pruned following the basic idea of restricting the search to "promising" paths only. For obvious reasons, the processing time for model evaluation is proportional to the number of promising paths. Especially in large models consisting of numerous states, many of these states cover quite different pattern families.

Thus, nearly always at least parts of the state space are practically irrelevant for one particular final solution. The remaining states are activated with respect to the most probable path. Formally, all states $\{s_t\}$ are activated whose $\delta_t(i)$ are more or less similar to the locally optimal solution


Figure 5.18: States that need to be explored at each step of the evaluation (right) of HMMs with different model architectures (left) – dashed lines: Viterbi paths through state spaces $V$ for hypothetical sequences. Especially for Profile HMMs, after very few evaluation steps a large number of states can be reached and thus needs to be explored when using the standard Viterbi algorithm.

$\delta^*_t = \max_j \delta_t(j)$. The threshold of just acceptable differences in the local probabilities is defined proportionally to $\delta^*_t$ by the parameter $B$. So the set of activated states $A_t$ at a given time is located in a beam around the optimal solution and determined by:

\[
A_t = \left\{ i \;\middle|\; \delta_t(i) \geq B\, \delta^*_t \right\} \quad \text{with} \quad \delta^*_t = \max_j \delta_t(j) \quad \text{and} \quad 0 < B < 1.
\]

The only parameter of this optimization technique is the Beam-width $B$. When exploring the Viterbi matrix $V$ at the next step $t+1$, only these active states are treated as possible predecessors for the estimation of local path probabilities. Thus, at every Viterbi step the following modified rule is used for the estimation of all $\delta_{t+1}(j)$:

\[
\delta_{t+1}(j) = \max_{i \in A_t} \left\{ \delta_t(i)\, a_{ij} \right\} \, b_j(\vec{o}_{t+1}).
\]

Note that the Beam-Search algorithm represents a heuristic approximation of the standard Viterbi procedure. Consequently, it yields a sub-optimal solution of the decoding problem which is, however, of sufficient quality.
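Under the same illustrative assumptions as before (discrete emissions, toy interface), a pruned Viterbi recursion might look as follows; the name `beam_viterbi_scores` and the parameter `beam_width` (the Beam-width $B$) are again hypothetical:

```python
import numpy as np

def beam_viterbi_scores(A, E, pi, obs, beam_width=1e-3):
    """Viterbi recursion with Beam-Search pruning (simplified sketch).

    Only states i whose delta_t(i) lies within a factor beam_width
    (0 < B < 1) of the locally best score delta*_t are kept as
    predecessors for step t+1.
    """
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    delta[0] = pi * E[:, obs[0]]
    for t in range(T - 1):
        # active set A_t = { i | delta_t(i) >= B * delta*_t }
        active = delta[t] >= beam_width * delta[t].max()
        # inactive states contribute nothing as predecessors
        pruned = np.where(active, delta[t], 0.0)
        delta[t + 1] = np.max(pruned[:, None] * A, axis=0) * E[:, obs[t + 1]]
    return delta
```

A very small `beam_width` reproduces the exact Viterbi scores, while larger values prune more aggressively and may drop the globally optimal path; real implementations typically iterate over the sparse active set instead of masking dense arrays, which is where the actual speed-up comes from.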

Acceleration of Protein Family Model Evaluation

In alternative pattern recognition domains, classification is normally the primary application for HMMs (comparable to target validation for protein sequence analysis). Usually, small basic models are evaluated in parallel, which technically corresponds to a combination of all states involved into a global state space (see section 5.4.2). This allows global state-space pruning. However, if large patterns are modeled using HMMs, the states within a particular model are likely to cover mutually different parts. Thus, Beam-Search is expected to be effective already for single model evaluation.

This thesis is directed at the enhancement of probabilistic protein family models. As previously mentioned, the smallest protein unit to which a biological function can be assigned is usually the domain. Thus, model lengths of dozens or even hundreds of conserved parts are very common. There is strong evidence that parts of the model that are further apart do not necessarily interfere, even for remote homologies. However, the complex three-state Profile topology implicitly assumes otherwise, since the chain of Delete states allows almost arbitrary state transitions within a particular model. It can be expected that most of the paths through a Profile HMM are not relevant for the final solution and that state-space pruning is very effective when concentrating the model evaluation on the most promising paths only.

All modeling approaches developed in this thesis include the state-space pruning described above. The Beam-Search algorithm needs to be configured specifically for the protein sequence analysis domain. This means that a suitable Beam-width needs to be determined which enables efficient model evaluation while keeping the detection or classification accuracy as high as possible. In section 6.4, the experiments which give hints for a proper choice of the Beam-width, together with their results, are presented.