

5.1 Feature Extraction from Protein Sequences

5.1.1 Rich Signal-Like Protein Sequence Representation

In section 5.3 the complex model architecture of Profile HMMs is discarded and replaced by topologies with reduced complexity. The focus is on an alternative model architecture, namely Bounded Left-Right models. Furthermore, a new concept of protein family modeling is discussed that uses small building blocks, so-called Sub-Protein Units, which are extracted automatically and in an unsupervised manner from training samples. Approaches for a general acceleration of the evaluation of protein family HMMs are presented in section 5.4. The assignment of the new developments to the three basic issues relevant for the successful application of new protein family modeling approaches is given at the appropriate passages in the text.

Descriptions of some of the new approaches developed for enhanced probabilistic protein family modeling can also be found in [Plö04].

properties of protein domains and furthermore of the complete protein is determined via the local combination of the residues’ properties.

By means of the symbolic description of protein data using amino acid sequences, the general biochemical properties of the residues are summarized. Every amino acid has its specific biochemical characteristics. These numerous specialties are only implicitly considered by discriminating between the 20 standard amino acids. However, neither local contextual relationships between residues nor specific mutual relationships between actual biochemical properties, which might be important for the overall function of the protein (domain), are considered.² It seems unrealistic that all biochemical properties change radically from one residue to another, which is what adjacent but different amino acid symbols suggest. Instead, less abrupt changes and especially higher-level mutual relationships between different biochemical properties are to be expected.³ One hypothetical example could be that from one residue to the next the hydrophobicity does not change substantially while the electric charge does.

On these premises, an alternative protein sequence representation is developed which is based on protein-specific primary structure data and general knowledge about biochemical properties of amino acids. Biologically meaningful biochemical properties of residues are explicitly considered.

Protein Sequence Encoding

The biochemical characteristics of amino acids have been well investigated over the years. Numerous researchers performed countless wet-lab experiments measuring various properties of the standard amino acids. In the literature, some general sequence analysis techniques were described that exploit selected individual biochemical properties (cf. section 3.3.1 on page 75). However, most of these techniques use biochemical properties for more or less technical reasons (like applying spectral analysis based on numerical sequence mapping), as an alternative but completely equivalent representation compared to the conventional amino acid sequences. Mostly, no further information is incorporated into the protein data representation since the 20 standard amino acids are mapped to 20 (or slightly fewer) different numerical values.

Basically, it is impossible even for very experienced molecular biologists to determine a single biochemical property that is exclusively responsible for protein family affiliations. Instead, multiple properties are responsible for the biological function of proteins and thus their family affiliation. Certainly hydrophobicity, electric charge, and residue size are rather important for the three-dimensional structure and the surface properties of proteins. Additionally, other biochemical properties are important, too.

² Profile HMMs for protein (domain) families cover global contextual relationships of the residues via the model architecture. However, the emissions are determined completely neglecting any residual neighborhood.

³ For standard pairwise sequence analysis techniques, substitution groups are implicitly defined using substitution matrices, which alleviates the abrupt character of symbol changes by assigning similar scores to similar amino acids.

Biochemical properties of amino acids are usually collected in so-called amino acid indices. Such indices contain mapping rules for every amino acid to numerical values representing the appropriate biochemical property. By exploiting these indices, amino acid sequences are numerically encoded, resulting in “signal”-like representations. As previously mentioned (section 3.3.1 on page 75), Shuichi Kawashima and Minoru Kanehisa compiled a large number of such amino acid indices [Kaw00]. For a rich protein sequence representation, in the new approach presented here, the particular amino acids are mapped to multiple properties carefully selected from the almost 500 indices. The selection was performed in cooperation with several biologists, i.e. incorporating expert knowledge. It turned out that 35 indices are sufficient for describing the biochemical properties of amino acids which are assumed most relevant for family affiliation. The authors of the amino acid indices compilation additionally performed a clustering of their indices regarding certain categories of amino acid properties:

• Hydrophobicity indices,

• Composition indices,

• Indices covering α and turn propensities,

• β propensity indices,

• Indices for physicochemical properties, and

• Indices covering other properties.

The 35 indices selected for the biochemical property based sequence representation developed in this thesis cover these clusters reasonably. A more detailed description of the amino acid indices used for sequence mapping can be found in appendix C.

As the result of the mapping procedure, protein sequences are encoded into multi-channel signal representations, i.e. given the linear chain of amino acids for a particular protein, its biochemical properties are represented in a 35-channel signal of the same length. Note that all components of the resulting feature vectors, i.e. the numerical mappings for all channels, are normalized to the range [−1, 1], which is mandatory for further processing.
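The mapping step can be illustrated with a minimal sketch. It is not the implementation used in this thesis: only two channels are shown instead of 35, the hydrophobicity values are the Kyte-Doolittle scale, the charge table is a crude illustrative assignment, and the min-max scaling in `normalize_index` is an assumption, since the text only states that all channels are normalized to [−1, 1]. All function and variable names are hypothetical.

```python
import numpy as np

# Illustrative property tables only; the thesis uses 35 selected indices (appendix C).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Crude side-chain charge at neutral pH (assumption for illustration).
CHARGE = {aa: 0.0 for aa in KYTE_DOOLITTLE}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})


def normalize_index(index):
    """Min-max scale one index so that its values span [-1, 1] (assumed scheme)."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: 2.0 * (v - lo) / (hi - lo) - 1.0 for aa, v in index.items()}


def encode(sequence, indices):
    """Map a sequence to an array of shape (len(sequence), len(indices))."""
    tables = [normalize_index(ix) for ix in indices]
    return np.array([[t[aa] for t in tables] for aa in sequence])


# Six residues, two channels: one row per residue, one column per property.
signal = encode("MKVLDE", [KYTE_DOOLITTLE, CHARGE])
```

Each row of `signal` is the feature vector of one residue, so the encoded protein has the same length as the amino acid sequence, with one column per selected index.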

The general process of representation change by amino acid mapping is graphically summarized in figure 5.1. It is the basis for all further developments, namely the extraction of relevant features describing the essentials of protein sequences.

Analysis of Local Neighborhoods

The previously described sequence encoding method combines information about biochemical properties of the amino acids from various sources in a multi-channel representation. Generally, when using state-of-the-art Profile HMMs for protein family modeling, two levels of residual context are relevant. The classification decision, i.e. whether the analyzed sequence probably belongs to a particular protein family, is made using the complete sequence, i.e. the complete residual neighborhood. Thus, the global context is captured by the HMM. In contrast, no residual context is used at all for the estimation of the emission probabilities.

Neglecting any residual context for the estimation of the HMM state emissions seems rather critical since the biochemical characteristics certainly do not change abruptly from one residue to another. Instead, they depend on their local environment since adjacent residues strongly interfere with each other. As one example, the electric charge of a residue depends on the electric charge of its immediate environment. Since amino acid indices, like the symbolic description of amino acids themselves, do not respect the residues’ environment, these local characteristics are captured alternatively in the new approach for HMM state emissions.

Figure 5.1: Overview of the approach for obtaining a biochemical property based protein sequence representation: Based on protein-specific primary structure data (sequence at the upper part) and general knowledge about relevant biochemical properties of amino acids (left-hand side, e.g. hydrophobicity and electric charge), the new representation is obtained by mapping the biochemical properties of residues to multiple numerical values (lower part). Note that all sequence and biochemical property data is hypothetical, does not correspond to reality, and is for illustration purposes only.

In order to respect local signal characteristics already at the level of emission probabilities, the feature extraction procedure considers local contexts of residues. The emissions estimated on the basis of the local neighborhood of a particular residue contain much more information since they cover the residue’s environment. Note that the global context, i.e. the structure of the protein family data, is still captured by the appropriate HMM. Generally, the HMM now describes the structure of protein data on the basis of residues in their local neighborhood. In the new approach, consecutive samples of the 35-channel signals are analyzed using a sliding window technique (extracting frames). These frames are used for short-length signal analysis.

There are two general parameters for configuring the sliding window based context analysis. First, the length of the context analyzed for emission estimation needs to be determined. Certainly, it depends on the actual data analyzed. Thus, it generally needs to be treated as a parameter to be learned during training. Unfortunately, especially for HMM based modeling, there are no techniques available for such parameter training. However, informal experiments heuristically yielded a window size of 16, which is suitable for the majority of protein data. Thus, it was kept fixed for the general procedure.

The sliding window containing the context of a particular residue is (almost) symmetrically organized, i.e. the context of an amino acid consisting of seven neighbors to the left and eight neighbors to the right is considered.
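As a minimal sketch of this frame extraction, the following code cuts a multi-channel signal into windows of 16 samples (seven neighbors to the left, the residue itself, eight neighbors to the right). It only returns frames whose window lies completely inside the sequence; the function name `interior_frames` and the two-channel toy signal are hypothetical and serve illustration only.

```python
import numpy as np

LEFT, RIGHT = 7, 8          # neighbors to the left / right, as stated in the text
WINDOW = LEFT + 1 + RIGHT   # 16 samples per frame


def interior_frames(signal):
    """Return frames for residues whose whole window lies inside the sequence.

    signal: array of shape (T, C).
    Result: array of shape (T - WINDOW + 1, WINDOW, C); frame k belongs to
    residue k + LEFT, which sits at position LEFT inside the frame.
    """
    signal = np.asarray(signal)
    n = signal.shape[0] - WINDOW + 1
    if n <= 0:
        # Sequence shorter than one window: no interior frame exists.
        return np.empty((0, WINDOW) + signal.shape[1:])
    return np.stack([signal[k:k + WINDOW] for k in range(n)])


toy = np.arange(40, dtype=float).reshape(20, 2)  # 20 residues, 2 channels
frames = interior_frames(toy)                    # 5 frames of 16 samples each
```

With 20 residues only 5 residues have a completely filled window, which shows why the treatment of the sequence borders has to be settled separately.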

The second parameter determines the actual treatment of the sequence borders. Due to the (almost) centered context of a residue, the sliding window cannot be filled with real data at the beginning and at the end of a sequence, because there is no data. This border problem is also known from general signal filtering approaches, e.g. within image processing applications when convolving images with specialized denoising filters. Since all residues have a special meaning for the particular protein, the borders cannot be skipped (as one common solution for image processing applications suggests), but three general “solutions” exist:

1. Zero-padding: The borders are filled with 0.

2. Distinct values-padding: The borders are filled with distinct values, optionally discriminating between beginning and end of the sequence.

3. Prior-padding: The sequence is extended at the beginning as well as at the end using prior knowledge about the general distribution of amino acids (i.e. filled with the amino acid wildcard ’X’).

Note that all three options incorporate artificial, i.e. potentially erroneous, data into the protein sequence. Since the first two options seem to distort the sequence data too much, in the actual feature extraction approach the borders are treated using prior-padding. Generally, when modeling protein families the influence of the borders is not negligible but also not dramatic.
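The three border treatments can be compared in a short sketch. It is an illustration under assumptions, not the procedure of this thesis: the numerical encoding of the wildcard ’X’ is not specified in the text, so it is passed in as the hypothetical parameter `prior` (for instance a per-channel average over the amino acids), and the default fill values for distinct values-padding are arbitrary.

```python
import numpy as np

LEFT, RIGHT = 7, 8  # window layout from the text: 7 left, 8 right neighbors


def pad_borders(signal, mode="prior", prior=None, begin_value=-1.0, end_value=1.0):
    """Extend a (T, C) signal by LEFT rows at the start and RIGHT rows at the end."""
    signal = np.asarray(signal, dtype=float)
    channels = signal.shape[1]
    if mode == "zero":
        head = np.zeros((LEFT, channels))
        tail = np.zeros((RIGHT, channels))
    elif mode == "distinct":
        # Arbitrary marker values that differ for beginning and end.
        head = np.full((LEFT, channels), begin_value)
        tail = np.full((RIGHT, channels), end_value)
    elif mode == "prior":
        if prior is None:
            raise ValueError("prior mode needs the encoding of the wildcard 'X'")
        # 'prior' is the assumed per-channel encoding of 'X', shape (C,).
        head = np.tile(prior, (LEFT, 1))
        tail = np.tile(prior, (RIGHT, 1))
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return np.vstack([head, signal, tail])


def all_frames(signal, **padding):
    """Return one 16-sample frame per residue, shape (T, 16, C)."""
    padded = pad_borders(signal, **padding)
    width = LEFT + 1 + RIGHT
    return np.stack([padded[i:i + width] for i in range(len(signal))])


toy = np.ones((5, 2))                      # 5 residues, 2 channels
x_encoding = np.array([0.5, -0.5])         # hypothetical encoding of 'X'
zero_frames = all_frames(toy, mode="zero")
prior_frames = all_frames(toy, mode="prior", prior=x_encoding)
```

After padding, every residue, including the first and the last, receives a complete frame, so the number of frames equals the sequence length.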

In figure 5.2 the sliding window approach, which is the prerequisite for the analysis of the residues’ local neighborhoods, is schematically illustrated.