

3.3 Signal Processing based Sequence Comparison

3.3.1 Alternative Representations of Protein Sequences

Almost all of the powerful signal processing techniques were developed for the analysis of natural signals, i.e. real-valued functions. Thus, they cannot be applied to protein sequences without first transforming the data. The trivial numerical representation of protein sequences assigns an (arbitrary) number to every amino acid, resulting in discrete but real-valued “signals”. One possible assignment is A = 1, R = 2, N = 3, ..., V = 20. The major drawback of this scheme is that it introduces an artificial and unfounded ordering relation: there is no reason for the distance between Alanine (A) and Valine (V) to be dramatically larger than the distance between Alanine (A) and Asparagine (N). Consequently, numerical encoding schemes must not distort the relationships between amino acids.
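The distortion introduced by the trivial assignment can be made concrete with a minimal Python sketch. It assumes the common ordering of the 20 amino acids by their three-letter names (the text only fixes A, R, N and V); the names `AMINO_ACIDS`, `NAIVE_CODE`, `encode_naive` and `naive_distance` are illustrative only.

```python
# Assumed ordering of the 20 standard amino acids (A, R, N, ..., V);
# only the first three and the last position are given in the text.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Trivial assignment: A = 1, R = 2, N = 3, ..., V = 20.
NAIVE_CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_naive(sequence):
    """Map a protein sequence to its trivial integer 'signal'."""
    return [NAIVE_CODE[aa] for aa in sequence]


def naive_distance(a, b):
    """Distance between two amino acids implied by the trivial assignment."""
    return abs(NAIVE_CODE[a] - NAIVE_CODE[b])


print(encode_naive("ANV"))       # [1, 3, 20]
print(naive_distance("A", "N"))  # 2
print(naive_distance("A", "V"))  # 19
```

The two distances differ by almost an order of magnitude although nothing in the biochemistry of the residues justifies this; the difference is purely a result of the arbitrary numbering.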

The signal based representations of protein sequences reported in the literature can be divided into two categories:

1. Based on theoretical considerations of the discrete symbol set, sequences are encoded into signals preserving the inter-symbol distances by means of various vector representations or statistical properties (cf. “Vector Space Derived Encoding Schemes”).

2. For the second encoding method, the actual biochemical properties of amino acids are considered. Real-valued signals are obtained by means of direct mappings of protein sequences’ residues to numerical values representing some biochemical property (cf. “Encoding based on Biochemical Properties”).

In the following, both categories of sequence encoding, including their most important representatives, are discussed.

Vector Space Derived Encoding Schemes

The first category of encoding schemes was already introduced in the discussion of neural network based sequence classification approaches (cf. page 66). There, the simplest correct, i.e. distance conserving, encoding method was described as the creation of 20-dimensional orthogonal vectors. Generally, symbolic data is mapped to some N-dimensional vector space. For the case of orthogonal base vectors described above, N designates the number of different symbols, i.e. for protein sequences N = 20.
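The orthogonal base vector encoding can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the cited approaches; the amino acid ordering and the names `one_hot`, `encode_one_hot` and `euclidean` are assumptions made for this example.

```python
import math

# Assumed ordering of the 20 standard amino acids.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def one_hot(aa):
    """Return the 20-dimensional orthogonal base vector of one amino acid."""
    vec = [0.0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1.0
    return vec


def encode_one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional vectors."""
    return [one_hot(aa) for aa in sequence]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Collect the distances between all pairs of different amino acids.
distances = {
    euclidean(one_hot(a), one_hot(b))
    for a in AMINO_ACIDS
    for b in AMINO_ACIDS
    if a != b
}
print(distances)  # a single value, sqrt(2), for every pair
```

Since every pair of different symbols has the same distance, no artificial ordering is introduced. The price is a 20-dimensional vector per residue.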

Besides the simple base vectors, alternative definitions are conceivable. The most important point is the conservation of distances between the symbols, which does not necessarily imply equal distances. Depending on prior knowledge about the mutual pairwise relations between the amino acids (e.g. their biochemical properties, see next section), the distances can be adjusted individually. As one example of a vector space for DNA data, Dimitris Anastassiou in [Ana00] and [Ana01] defined the base vectors as the complex conjugate pairs T = A* and G = C*, e.g.:

A = 1 + i,  T = 1 − i,  C = −1 − i,  G = −1 + i.    (3.36)

By means of these definitions and the genetic code, i.e. the well defined mapping of nucleotide codons to amino acids, numerical values can also be assigned to amino acids using the following procedure. Generally, the protein coding process can be modeled as a FIR digital filter⁷, in which the input x[n] is the numerical nucleotide sequence and the output y[n] represents the possible numerical amino acid sequence:

y[n] = h[0] x[n] + h[1] x[n−1] + h[2] x[n−2].

If h[0] = 1, h[1] = 1/2, and h[2] = 1/4, and x[n] is defined by the parameters in equation 3.36, then y[n] can only take one out of 64 possible values. Thus, the entire genetic code, consisting of 64 codons which encode the 20 amino acids (or STOP), can be drawn on the complex plane as shown in figure 3.20. There, as an example, Methionine (labeled Met) corresponds to the complex number (1 + i) + 0.5(1 − i) + 0.25(−1 + i) = 1.17 + 0.88i [Ana01]. Note that this encoding scheme is partially redundant since some of the amino acids are represented by several points. Compared to the 20-dimensional vector space of the simple orthogonal base vector approach, two-dimensional representations of the 20 amino acids are sufficient here.
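The codon mapping can be sketched in a few lines of Python. The sketch assumes the nucleotide values of equation 3.36 and the coefficients h[0] = 1, h[1] = 1/2, h[2] = 1/4. The names `NUCLEOTIDE_VALUES`, `fir_filter` and `codon_value` are illustrative and do not come from [Ana01]. Taking the filter output at the third nucleotide of a codon as the codon's value is one possible reading of the filter equation.

```python
from itertools import product

# Nucleotide values as in equation 3.36.
NUCLEOTIDE_VALUES = {"A": 1 + 1j, "T": 1 - 1j, "C": -1 - 1j, "G": -1 + 1j}

# Filter coefficients h[0], h[1], h[2] as stated in the text.
H = (1.0, 0.5, 0.25)


def fir_filter(sequence, h=H):
    """Compute y[n] = h[0]x[n] + h[1]x[n-1] + h[2]x[n-2] for a nucleotide string.

    Samples before the start of the sequence are treated as zero (assumption).
    """
    x = [NUCLEOTIDE_VALUES[s] for s in sequence]
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def codon_value(codon, h=H):
    """Filter output at the third nucleotide of a codon (assumed convention)."""
    return fir_filter(codon, h)[-1]


# All 64 nucleotide triplets yield pairwise different complex values.
all_values = {codon_value("".join(c)) for c in product("ATCG", repeat=3)}
print(len(all_values))  # 64

# The Methionine expression exactly as printed in the text.
printed_met = (1 + 1j) + 0.5 * (1 - 1j) + 0.25 * (-1 + 1j)
print(printed_met)  # (1.25+0.75j)
```

With these coefficients the 64 values form a regular 8 × 8 grid on the complex plane. Evaluated literally, the expression given for Methionine yields 1.25 + 0.75i, so the value 1.17 + 0.88i quoted from [Ana01] may rest on details that are not reproduced here.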

As an alternative to the complex number representation of amino acids, Paul Cristea proposed a tetrahedral representation of the genetic code and thus of the amino acids [Cri01]. He argued that the classic Cartesian representation of the genetic code, which is organized by the nucleotides at the first, second and third codon positions that together determine the amino acid, does not correctly reflect the natural structure of the genetic code. He developed optimal symbolic-to-digital mappings for nucleotides as well as for amino acids, which are apparently suitable for the comparison of whole genomes (cf. [Cri03]).

Besides the encoding schemes described above, which are all more or less related to some kind of vector space analysis, further encoding schemes based on theoretical considerations were proposed in the literature. As one example, Gerhard Kauer and Helmut Blöcker in [Kau03] adopted an early approach of Kenneth Breslauer and colleagues regarding an enthalpy based signal representation of DNA sequences [Bre86]. Here, the enthalpies of residues with their respective neighboring nucleotides are taken as signal values. Such enthalpy data can be obtained from the literature, and in principle the mapping approach can be extended to sequences of amino acids, too.⁸

[Figure 3.20: plot of the genetic code on the complex plane (axes: Real Part, Imaginary Part); each point is labeled with its amino acid or Stop, and the marked point is 1.17 + 0.88i.]

Figure 3.20: Numerical amino acid representation using the complex plane; the marked point designates the numerical representation of Methionine (adopted from [Ana01]).

⁷ FIR is the acronym for Finite Impulse Response, which designates a filter whose impulse response is finite in length. For details regarding general digital signal processing including filtering, the standard work of Alan Oppenheim and Ronald Schafer [Opp89] is a good starting point.

Encoding based on Biochemical Properties

Natural signals evolving in time usually consist of measurements of some physical quantity captured at well defined, discrete time steps. As one example, acoustic signals as processed in automatic speech recognition applications represent the progression of the sound pressure during an utterance. Other examples of natural signals are the progression of air pressure or of temperature values, which are important for weather forecasting applications. All such signals have one characteristic in common: they represent the progression of some kind of physical, i.e. natural, property.

Accordingly, for the second category of signal representation approaches for protein sequences, biochemical properties of the sequences’ residues are used to obtain the required numerical representations. This is a natural choice since the biological function of proteins follows from their biochemical properties. In so-called “wet-lab” analysis, i.e. in actual biochemical investigations of the biological matter of proteins, biochemical properties like hydrophobicity, charge, pH value etc. are in fact measured.

Of course, the biochemical properties are inherent in the respective amino acids, but the symbolic data “shields” these details through abstraction. Over the years many researchers have analyzed the biochemical properties of amino acids in great detail. The results of these experimental evaluations are usually reported in so-called amino acid indices, which serve as a mapping scheme from amino acids to numerical values. Shuichi Kawashima and Minoru Kanehisa in [Kaw00] compiled a large number of such amino acid indices, which are the basis for most of the encoding schemes based on biochemical properties.

⁸ Enthalpy (symbolized H, also called heat content) is a thermodynamic quantity which represents the sum of the internal energy U of matter (here the particular residue) and the product of its volume V and the pressure P: H = U + PV [Her89].

As one example, the Electron Ion Interaction Potential (EIIP) is used for obtaining numerical representations of protein sequences. The EIIP was originally proposed by physicists in the 1970s (cf. [Vel72]); based on this mapping, Irena Cosic and colleagues developed various approaches for the classification of DNA as well as protein sequences [Cos97, De 00, De 02]. The EIIP describes the average energy of all valence electrons in a particular amino acid. Every amino acid or nucleotide, irrespective of its actual position in a sequence, can be represented by a single number; these values are summarized for both nucleotides and amino acids in table 3.1.

Nucleotide   EIIP
A            0.1260
G            0.0806
T            0.1335
C            0.1340
U            0.0289

Amino Acid   EIIP
Ala (A)      0.0373
Arg (R)      0.0959
Asn (N)      0.0036
Asp (D)      0.1263
Cys (C)      0.0829
Gln (Q)      0.0761
Glu (E)      0.0058
Gly (G)      0.0050
His (H)      0.0242
Ile (I)      0.0000
Leu (L)      0.0000
Lys (K)      0.0371
Met (M)      0.0823
Phe (F)      0.0946
Pro (P)      0.0198
Ser (S)      0.0929
Thr (T)      0.0941
Trp (W)      0.0548
Tyr (Y)      0.0516
Val (V)      0.0057

Table 3.1: The Electron Ion Interaction Potential (EIIP) values for nucleotides and amino acids (cf. [Cos97, p.13]).
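The direct mapping from residues to EIIP values can be sketched as follows. The amino acid values are copied from table 3.1; the names `EIIP_AMINO_ACID` and `encode_with_index` are illustrative, and raising an error for residues that are not in the index is an assumption of this sketch, not a rule taken from the cited work.

```python
# EIIP values for the 20 amino acids, copied from table 3.1.
EIIP_AMINO_ACID = {
    "A": 0.0373, "R": 0.0959, "N": 0.0036, "D": 0.1263, "C": 0.0829,
    "Q": 0.0761, "E": 0.0058, "G": 0.0050, "H": 0.0242, "I": 0.0000,
    "L": 0.0000, "K": 0.0371, "M": 0.0823, "F": 0.0946, "P": 0.0198,
    "S": 0.0929, "T": 0.0941, "W": 0.0548, "Y": 0.0516, "V": 0.0057,
}


def encode_with_index(sequence, index=EIIP_AMINO_ACID):
    """Map every residue of a sequence to its value in an amino acid index."""
    signal = []
    for position, residue in enumerate(sequence):
        if residue not in index:
            # Assumption: unknown symbols are rejected instead of guessed.
            raise ValueError(f"no index value for {residue!r} at position {position}")
        signal.append(index[residue])
    return signal


print(encode_with_index("MKV"))  # [0.0823, 0.0371, 0.0057]
```

Any other amino acid index, e.g. a hydrophobicity scale, can be passed as the `index` argument in the same dictionary form.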

The process of obtaining a signal based representation when utilizing biochemical properties can be summarized as follows.

1. Depending on the actual application of sequence analysis, a proper amino acid index is selected. This step is crucial and requires expert knowledge. Very common choices are the EIIP values explained above or hydrophobicity indices (cf. e.g. [Man97, Mur02, Qiu03]).

2. All residues of all sequences (training-, query- and comparison data) are directly mapped to numerical representations using the amino acid index selected in the first step.