

5.2 Robust Feature Based Profile HMMs and Remote Homology Detection

5.2.1 Feature Space Representation

For discrete data the distribution of a random variable Y can generally be defined tabularly by assigning the probability P(y) to every discrete event y that Y can take.⁶ This is the usual case for protein data where, given a set of sample sequences, the relative frequencies of every amino acid define the discrete distribution representing the data space of amino acids.
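As a simple illustration, the following Python sketch computes such a tabular distribution as relative amino acid frequencies. The helper name and the sample sequences are purely illustrative and not taken from the thesis; in practice the sequences would come from a protein sequence database.

```python
from collections import Counter

# The 20 standard amino acid symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_distribution(sequences):
    """Tabular discrete distribution P(y): relative frequency of every amino acid."""
    counts = Counter()
    for seq in sequences:
        counts.update(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# Hypothetical sample sequences for demonstration only.
samples = ["MKTAYIAKQR", "MKLVINGKTLKG"]
print(amino_acid_distribution(samples))
```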

Due to their non-discrete character, the underlying probability density distributions of continuous data cannot be defined tabularly. Thus, parametric models are required for proper descriptions of continuous densities, usually resulting in compact representations of the particular data spaces. According to the central limit theorem, most natural processes can be described using normal distributions if they are observed for a reasonably long time.

Normal distributions N are mathematically very attractive since they contain only a small number of parameters for the effective representation of unimodal densities:

\[
\mathcal{N}(\vec{x}\,|\,\vec{\mu}, \mathbf{C}) = \frac{1}{\sqrt{|2\pi\mathbf{C}|}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^{T}\mathbf{C}^{-1}(\vec{x}-\vec{\mu})},
\]

where µ denotes the mean vector of the data and C the appropriate covariance matrix. General continuous density functions p(x) can be approximated with arbitrary precision using linear combinations of normal distributions, so-called mixture density models [Yak70]. Usually, only a finite sum of K mixture components is used, and the parameters of the model, namely the mixture weights, i.e. their prior probabilities c_i, the mean vectors µ_i, and the covariance matrices C_i of the particular normal densities, are summarized in a set of parameters θ:

\[
p(\vec{x}\,|\,\theta) = \sum_{i=1}^{\infty} c_i\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i, \mathbf{C}_i) \approx \sum_{i=1}^{K} c_i\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i, \mathbf{C}_i).
\]
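For illustration, the following minimal numpy sketch evaluates such a finite mixture density p(x|θ). The two-dimensional example parameters are purely illustrative and are not part of the feature space model described here.

```python
import numpy as np

def normal_density(x, mu, C):
    """Multivariate normal density N(x | mu, C)."""
    diff = x - mu
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * C))
    return np.exp(-0.5 * diff @ np.linalg.inv(C) @ diff) / norm

def mixture_density(x, weights, means, covariances):
    """Finite mixture p(x | theta) = sum_i c_i * N(x | mu_i, C_i)."""
    return sum(c * normal_density(x, mu, C)
               for c, mu, C in zip(weights, means, covariances))

# Illustrative two-component mixture in two dimensions.
weights = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 1.0])]
covariances = [np.eye(2), 0.5 * np.eye(2)]
print(mixture_density(np.array([1.0, 0.5]), weights, means, covariances))
```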

Feature vectors extracted from protein sequences as introduced in the previous section generally represent continuous data. Admittedly, they originate from discrete amino acid symbols, but after their enrichment using biochemical property mapping and their signal-like treatment by applying Wavelet transformation and Principal Components Analysis to their local neighborhood, the character of the data becomes continuous. Thus, the 99-dimensional feature space is represented using mixture density distributions.

Estimation of the General Feature Space Representation

As previously mentioned, the general feature space representation can be estimated independently from the actual protein family modeling. In the approach presented here, mixture components are estimated by exploiting general protein data from major protein sequence databases. For the developments of this thesis, the SWISSPROT database [Boe03] was used for obtaining the general protein feature space representation. By means of its almost 90 000 sequences, 1024 normal densities are estimated which provide a sufficiently accurate feature space representation. The actual limitation to 1024 mixture components represents a good compromise between suitably covering the general feature space and allowing further specialization as described later.

⁶ The argumentation regarding the representation of general densities is mainly influenced by [Fin03, p. 46f].

Due to this limitation the accuracy of the data space representation is strongly dependent on the actual choice of the mixture components, i.e. their “position” within the data space and their shape. For high-dimensional data these parameters correspond to mean vectors and covariance matrices of N-dimensional normal densities. Usually, they are estimated using the well-known Expectation Maximization (EM) procedure [Dem77]. By means of this iterative technique, both mean vectors and covariance matrices of a predefined number of mixture components are estimated by maximizing the probability of the training data X = {x_1, x_2, ..., x_T} with respect to the parameters of the mixture density model. Although widely used, the EM algorithm has severe drawbacks. First, it is in no way guaranteed that the global optimum for the data space representation will be obtained. Instead, the local optimum which is closest to the initialization will be found. Thus, the quality of the solution depends severely on the initialization of the procedure; especially random assignment of mean vectors and covariance matrices is problematic. Furthermore, and most critically, the algorithm converges rather slowly, i.e. usually many iterations are required for estimating suitable mixture components. In figure 5.5 the EM algorithm for estimating mixture density models of probability distributions is summarized.

The EM algorithm is closely related to general vector quantization techniques where clusters within data spaces are searched. However, during clustering the data vectors are assigned deterministically to particular clusters, whereas EM uses probabilistic assignments. Unfortunately, the EM algorithm is computationally expensive. Thus, in several practical applications a much faster procedure is applied instead for obtaining suitable parametric representations of high-dimensional continuous data spaces – a slightly modified version of the k-means vector quantization procedure which was originally developed by James MacQueen [Mac67].

The k-means algorithm is a non-iterative technique for estimating vector quantizers where, based on a single-pass approach, both cluster representatives, so-called prototypes, and the corresponding data space partition can be estimated very efficiently. In several informal experiments using different kinds of data vectors it turned out that the quality of the vector quantizers estimated using the k-means approach is at least comparable to those obtained when using iterative clustering techniques like the well-known Lloyd or LBG algorithms (cf. e.g. [Ger92, Fin03]). The crucial prerequisite for equivalent or even better vector quantizer design is the existence of a reasonable number of sample data. This is the case when using general protein data from SWISSPROT. In figure 5.6 the standard k-means algorithm is summarized.

Once a vector quantizer has been obtained for the general data space using k-means, a mixture density can easily be estimated by exploiting the resulting prototypes as well as the partition of the data space. The prototypes of the clusters correspond to the mean vectors of the mixture components, and the appropriate covariance matrices C_i can be calculated using the data vectors x assigned to the particular cells of the data space partition (clusters) R_i:

\[
\mathbf{C}_i = \frac{1}{|R_i|} \sum_{\vec{x} \in R_i} (\vec{x} - \vec{\mu}_i)(\vec{x} - \vec{\mu}_i)^{T}.
\]
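A possible realization of this conversion step is sketched below. The mixture weights c_i are assumed here to be the relative cluster sizes; this choice, like the function and variable names, is an assumption for the sketch and not a prescription from the text.

```python
import numpy as np

def mixture_from_partition(prototypes, clusters):
    """Derive mixture parameters from a k-means result.

    prototypes: list of K mean vectors (the cluster prototypes)
    clusters:   list of K arrays R_i, each holding the data vectors assigned to cluster i
    """
    total = sum(len(R_i) for R_i in clusters)
    weights, means, covariances = [], [], []
    for mu, R_i in zip(prototypes, clusters):
        diffs = R_i - mu
        C_i = diffs.T @ diffs / len(R_i)     # covariance of the vectors assigned to cluster i
        weights.append(len(R_i) / total)     # prior c_i assumed to be the relative cluster size
        means.append(np.asarray(mu))         # the prototype serves as mean vector of the component
        covariances.append(C_i)
    return weights, means, covariances
```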

Given a set of sample data X = {x_1, x_2, ..., x_T}, the number K of desired base normal densities, and a lower limit ΔL_min for the relative improvement of the likelihood function

\[
L(\theta) = \ln P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T \,|\, \theta) = \sum_{\vec{x} \in X} \ln p(\vec{x}\,|\,\theta),
\]

where θ represents the parameter vector describing the mixture density model.

1. Initialization: Choose initial parameters θ^0 = (c_i^0, µ_i^0, C_i^0) of the mixture density model, and initialize the iteration counter m ← 0.

2. Expectation Estimation (E-step): Determine estimates for the posterior probabilities of all mixture components ω_i (given the current model θ^m) for every data vector x ∈ X:

\[
P(\omega_i \,|\, \vec{x}, \theta^m) = \frac{c_i^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_i^m, \mathbf{C}_i^m)}{\sum_j c_j^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_j^m, \mathbf{C}_j^m)}.
\]

Calculate the likelihood of the data given the current model θ^m:

\[
L(\theta^m) = \ln P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T \,|\, \theta^m) = \sum_{\vec{x} \in X} \ln \sum_j c_j^m\, \mathcal{N}(\vec{x}\,|\,\vec{\mu}_j^m, \mathbf{C}_j^m).
\]

3. Maximization (M-step): Calculate updated parameters θ^{m+1} = (c_i^{m+1}, µ_i^{m+1}, C_i^{m+1}):

\[
c_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)}{|X|}
\]

\[
\vec{\mu}_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)\, \vec{x}}{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)}
\]

\[
\mathbf{C}_i^{m+1} = \frac{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)\, \vec{x}\vec{x}^{T}}{\sum_{\vec{x} \in X} P(\omega_i \,|\, \vec{x}, \theta^m)} - \vec{\mu}_i^{m+1} (\vec{\mu}_i^{m+1})^{T}.
\]

4. Termination: Calculate the relative change of the likelihood since the last iteration:

\[
\Delta L^m = \frac{L(\theta^m) - L(\theta^{m-1})}{L(\theta^m)}.
\]

If ΔL^m > ΔL_min, set m ← m + 1 and continue with step 2; terminate otherwise.

Figure 5.5: EM algorithm for estimating mixture density models (adopted from [Fin03, p.64], cf. also [Dem77, Ger92]).
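For illustration, a compact numpy sketch of this EM iteration is given below. The convergence check is simplified (relative improvement of the log-likelihood), initial parameters must be supplied by the caller (e.g. from a k-means result), and the helper names are not taken from the thesis.

```python
import numpy as np

def multivariate_normal_pdf(X, mu, C):
    """N(x | mu, C) evaluated for every row of X."""
    diff = X - mu
    inv = np.linalg.inv(C)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * C))
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def em_gmm(X, weights, means, covs, delta_L_min=1e-4, max_iter=100):
    """EM re-estimation of a Gaussian mixture (cf. figure 5.5).

    X: (T, d) data matrix; weights, means, covs: initial mixture parameters.
    """
    T, _ = X.shape
    prev_L = -np.inf
    for _ in range(max_iter):
        # E-step: posterior probabilities P(omega_i | x, theta^m) for every data vector.
        densities = np.array([w * multivariate_normal_pdf(X, mu, C)
                              for w, mu, C in zip(weights, means, covs)])  # shape (K, T)
        totals = densities.sum(axis=0)
        posteriors = densities / totals
        L = np.log(totals).sum()                # log-likelihood of the data

        # M-step: update weights, mean vectors and covariance matrices.
        Nk = posteriors.sum(axis=1)
        weights = Nk / T
        means = [(posteriors[i] @ X) / Nk[i] for i in range(len(Nk))]
        covs = [(posteriors[i][:, None] * X).T @ X / Nk[i]
                - np.outer(means[i], means[i]) for i in range(len(Nk))]

        # Termination: stop when the relative likelihood improvement becomes small.
        if prev_L > -np.inf and (L - prev_L) / abs(L) < delta_L_min:
            break
        prev_L = L
    return weights, means, covs
```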

Without further optimization, a high-quality representation of the biochemical feature space created for protein sequence data can be obtained, which serves as the starting point for feature based protein family HMMs.

In figure 5.7 the process of mixture density estimation for feature space representation using general protein data is summarized graphically.

Given a set of sample data X = {x_1, x_2, ..., x_T} and the number K of desired clusters.

1. Initialization: Choose the first K vectors of the sample set as initial cluster prototypes Y^0 = {y_1, y_2, ..., y_K} = {x_1, x_2, ..., x_K}. Alternatively, choose K data vectors randomly distributed over the sample set. Initialize the partition of the data set to R_i^0 = {y_i}, i = 1, ..., K.

Initialize the counter m ← 0.

2. Iteration: For all data vectors x_t, K + 1 ≤ t ≤ T, not processed so far:

(a) Classification: Determine the optimal prototype y_i^m within the current set of cluster prototypes Y^m for x_t using some metric d (usually the Euclidean distance, cf. equation 3.1 on page 22):

\[
\vec{y}_i^{\,m} = \operatorname*{argmin}_{\vec{y} \in Y^m} d(\vec{x}_t, \vec{y}).
\]

(b) Re-partitioning: Change the partition of the data space by updating the cluster R_i belonging to the previously determined prototype y_i:

\[
R_j^{m+1} = \begin{cases} R_j^m \cup \{\vec{x}_t\}, & \text{if } j = i \\ R_j^m, & \text{otherwise.} \end{cases}
\]

(c) Prototypes Update: Update the prototype y_i of the cluster which was changed in the previous step and keep all others:

\[
\vec{y}_j^{\,m+1} = \begin{cases} \operatorname{cent}(R_j^{m+1}), & \text{if } j = i \\ \vec{y}_j^{\,m}, & \text{otherwise,} \end{cases}
\]

where cent(R_j^{m+1}) designates the centroid of the current j-th cluster:

\[
\operatorname{cent}(R_j^m) = \operatorname*{argmin}_{\vec{y}} E\{ d(\vec{x}, \vec{y}) \,|\, \vec{x} \in R_j^m \},
\]

which corresponds to:

\[
\operatorname{cent}(R_j^m) = E\{\vec{x} \,|\, \vec{x} \in R_j^m\} = \int_{R_j^m} \vec{x}\, p(\vec{x})\, d\vec{x}
\]

when using elliptically symmetrical metrics like e.g. the Euclidean distance.

Update the counter m ← m + 1.

Figure 5.6: k-means algorithm for vector quantization (adopted from [Fin03, p. 61], cf. also [Mac67, Ger92]).
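A minimal Python sketch of this single-pass procedure, assuming the Euclidean distance as metric d and recomputing the cluster mean as centroid, could look as follows; function and variable names are illustrative only.

```python
import numpy as np

def single_pass_kmeans(X, K):
    """Single-pass k-means following figure 5.6 (Euclidean distance assumed).

    X: (T, d) data matrix; returns the prototypes and the final partition.
    """
    # Initialization: the first K vectors serve as initial prototypes and clusters.
    prototypes = [X[i].copy() for i in range(K)]
    clusters = [[X[i]] for i in range(K)]

    # Iteration over all data vectors not processed so far.
    for x in X[K:]:
        # Classification: find the closest prototype.
        i = int(np.argmin([np.linalg.norm(x - y) for y in prototypes]))
        # Re-partitioning: assign the vector to the winning cluster.
        clusters[i].append(x)
        # Prototype update: the new prototype is the centroid (mean) of the updated cluster.
        prototypes[i] = np.mean(clusters[i], axis=0)

    return prototypes, [np.array(c) for c in clusters]

# Hypothetical usage on random 99-dimensional data with K = 4 clusters.
data = np.random.rand(1000, 99)
prototypes, clusters = single_pass_kmeans(data, K=4)
```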