Encoding Scheme of VONSEA - A Multi-objective Genetic Algorithm for Peptide Optimization

Algorithm 2: Pseudocode of front-based SUS in VONSEA

numberOfPointers = 3;
FastNondominatedSorting(population, fronts);
FrontSortingMagnitude(population);
distance = (numberOfPointers)⁻¹;
ptr(0) = randomNumber · distance;
for i = 0, ..., numberOfPointers-1 do
    ptr(i+1) = ptr(0) + i · distance;
    index = (int) (ptr(i+1) · population_size);
    parent_population.add(population.getIndividuals(index));
end
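The pointer arithmetic of Algorithm 2 can be illustrated with a minimal Python sketch. It assumes that the population list is already ordered by front (the result of the two sorting steps), that randomNumber is drawn uniformly from [0, 1), and that the integer truncation applies to the population index. The function name front_based_sus and its parameters are hypothetical and not taken from VONSEA.

```python
import random


def front_based_sus(sorted_population, number_of_pointers=3, random_number=None):
    """Pick parents with equally spaced pointers over a front-sorted population.

    Assumption: sorted_population is already ordered by front rank.
    """
    if random_number is None:
        random_number = random.random()  # uniform in [0, 1)
    distance = 1.0 / number_of_pointers
    start = random_number * distance  # first pointer lies in [0, distance)
    parents = []
    for i in range(number_of_pointers):
        pointer = start + i * distance  # equally spaced pointers in [0, 1)
        index = int(pointer * len(sorted_population))
        index = min(index, len(sorted_population) - 1)  # guard against rounding
        parents.append(sorted_population[index])
    return parents


population = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
selected = front_based_sus(population, random_number=0.5)
print(selected)
```

Passing a fixed random_number makes the selection reproducible; with the default, a single random draw places all pointers, which is the defining feature of stochastic universal sampling.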

as small as possible while still allowing a natural representation of the variables.

The principle of meaningful building blocks is motivated by the schema theorem [21]. The principle of minimal alphabet advises increasing the potential number of schemata by reducing the cardinality of the alphabet. These principles are formulated for binary strings and are not advisable as design criteria for non-binary strings. Kershenbaum formulates more precise and applicable guidelines. The recommendations were originally tailored to tree representations, but they are also applicable to other representation types. Five possibly conflicting properties are advised for an ideal encoding [91]:

1. The encoding scheme has to represent all feasible solutions.

2. The encoding scheme has to represent only feasible solutions.

3. All feasible solutions have an equal probability of being represented.

4. The encoding scheme has to represent a useful scheme in a small number of genes that are close to one another in the chromosome.

5. The encoding scheme has to possess locality in the way that small changes to the chromosome result in small changes in the solution.

The first recommendation is a property that is usually easily satisfied. The second recommendation sometimes requires a compromise: a small number of infeasible solutions is better than a high number, as a high number increases the probability that the variation operators create infeasible solutions and makes a GA ineffective. The third recommendation ensures the creation of diverse random starting solutions and makes the GA more effective in exploring the entire solution space. The fourth recommendation is the most difficult property to achieve, and it is generally challenging to develop a suitable encoding according to this property. The locality property in the fifth recommendation refers to the genotype-phenotype mapping: it describes how well neighboring genotypes correspond to neighbors in the phenotype space. This fifth recommendation ensures that the GA is able to perform a guided local search, whereas low locality results in a more random search.

Goldberg further classified the encoding schemes according to whether the fitness functions for each encoding scheme depend on the factor ’value’, the factor ’order’ or both:

(i) a scheme where fitness depends on order only.

(ii) a scheme where fitness depends on order and value.

(iii) a scheme where fitness depends on value only.

In the following, the most common encoding schemes in GAs for each category are presented:

The best-known encoding scheme of category (i) is the permutation and it is used in combinatorial optimization problems like the Traveling Salesman Problem [139]. This problem is specified by a list of cities. The goal of the search process is to find a route that visits each city only once and has minimal length. A natural representation is an ordered list of city numbers:

Example 1 Individual: 1 2 4 3 5 6 9 8 7.

The permutation is also used in an application from bioinformatics [173], where it serves as the encoding scheme in a GA that predicts the secondary structure of RNA molecules. The secondary structure is encoded as a permutation and the GA predicts the specific canonical base pairs that form hydrogen bonds and build helices. Permutation encoding requires specific variation operators: recombination and mutation corrections have to be performed to keep the chromosome consistent [100]. The recombination operators associated with permutation are the Partially Mapped Crossover (PMC) [77], the Cycle Crossover (CX) [34] and the Order Crossover (OX) [119]. The mutation operator associated with permutation is the inversion, which changes the location of characters. The disadvantage of these recombination operators is their high implementation complexity [100].
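To illustrate why permutation operators need extra bookkeeping, the following Python sketch shows one common formulation of the Order Crossover and of the inversion mutation. It is a sketch under stated assumptions rather than the operators of the cited works: the function names are hypothetical, and the cut points are passed in explicitly so that the example is deterministic, whereas a GA would usually draw them at random.

```python
def order_crossover(parent1, parent2, start, end):
    """Order Crossover (OX): keep parent1[start:end], fill the rest in parent2's order."""
    size = len(parent1)
    child = [None] * size
    child[start:end] = parent1[start:end]
    copied = set(parent1[start:end])
    # Genes of parent2 that are still missing, read from the position after the
    # copied segment and wrapping around the end of the chromosome.
    remaining = [parent2[(end + k) % size] for k in range(size)
                 if parent2[(end + k) % size] not in copied]
    # Fill the open positions, again starting right after the copied segment.
    for k, gene in enumerate(remaining):
        child[(end + k) % size] = gene
    return child


def inversion_mutation(individual, start, end):
    """Reverse the order of the genes in individual[start:end]."""
    mutated = list(individual)
    mutated[start:end] = reversed(mutated[start:end])
    return mutated


parent1 = [1, 2, 4, 3, 5, 6, 9, 8, 7]  # the individual of Example 1
parent2 = [9, 8, 7, 6, 5, 4, 3, 2, 1]
child = order_crossover(parent1, parent2, start=3, end=6)
mutant = inversion_mutation(parent1, start=2, end=5)
print(child)
print(mutant)
```

Both operators return a valid permutation by construction, so no repair step is needed afterwards; a plain one-point crossover on the same parents could duplicate some cities and drop others.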

The most common encoding scheme of category (ii) is the binary encoding. Each individual is represented as a binary string of the bits 0 and 1. The following example shows a binary encoding:

Example 2 Individual: 1101011101101.

Moreover, each allele represents a value. An advantage of binary encoding is its support of a wide range of recombination operators. Furthermore, it best fulfills the design principles of Goldberg [75]. Binary encoding causes problems in the case of a continuous search space with large dimension [82]. In addition, if a variable has a finite number of discrete valid values that is not a power of two, some of the binary codes are redundant.

The most common encoding scheme of category (iii) is the value encoding. Individuals are represented as strings of values such as integers, real numbers or characters:

Example 3 Individual (real encoding): 1.23 2.54 3.55 6.73 2.12.

Individual (character encoding): ABDKGWUFEKZWBS

Real encoding is of increasing interest in the field of real-world optimization problems, for example in chemometrics [108] or biotechnology [133]. The popularity of real coding is due to the following advantages [82]: Firstly, real coding is very close to the natural representation of the variables for many optimization problems. As there is no difference between genotype and phenotype in these cases, a genotype-phenotype mapping is not necessary. Secondly, real coding is very natural in optimization problems with variables in continuous domains. Thirdly, real coding has the potential to exploit the graduality of fitness functions with continuous variables, meaning that small changes in the variables correspond to small changes in the fitness function values.

4.3.2 Encoding of Individuals in VONSEA

The individuals in VONSEA represent short peptide sequences composed of 20 amino acids. These amino acids are the 20 canonical amino acids as listed in Table 4.1. Three different individual encodings of these peptide sequences within a MOEA are conceivable: firstly, the encoding of a peptide sequence as a character string, secondly, an encoding of the single amino acids as nucleotide triples, and thirdly, an encoding of the amino acids by bit strings. These encoding approaches are discussed in the following.

Several tools providing molecular functions to determine physicochemical or structural properties of peptides make use of a single-letter code for the amino acids, depicted in the right column of Table 4.1 (e.g. see [123], [81], and the amino acid substitution matrices from protein blocks used within the Needleman-Wunsch algorithm for global sequence alignment). The individuals within VONSEA are encoded as character strings composed of 20 different characters according to the single-letter code. This provides the input structure required by the molecular fitness functions and avoids a conversion into this input structure before the evaluation of every fitness function.

Example 4 Individual in VONSEA: ADIHMNLKFPSTVWYRCEQG

Therefore, this encoding represents a value encoding and is classified in category (ii) ’a scheme where fitness depends on order and value’ in a broader sense: in the case of molecular functions predicting peptide properties, each amino acid or character is identified with physicochemical property values, and the molecular functions work on these single characters as well as on the ordering of the amino acids (or characters) in a sequence, which is a decisive factor for several peptide properties.

Amino acid       Char code
Alanine          A
Arginine         R
Asparagine       N
Aspartic acid    D
Cysteine         C
Glutamic acid    E
Glutamine        Q
Glycine          G
Histidine        H
Isoleucine       I
Leucine          L
Lysine           K
Methionine       M
Phenylalanine    F
Proline          P
Serine           S
Threonine        T
Tryptophan       W
Tyrosine         Y
Valine           V

Table 4.1: List of the 20 canonical amino acids and the established one-letter code used for the encoding in VONSEA

In the following, the properties of the proposed character encoding are summarized and related to the recommendations of Kershenbaum:

• every peptide of the solution space is exactly represented by a character string

• every feasible character string represents exactly one peptide

• all peptides are equally represented

• a genotype-phenotype mapping is not necessary

• small changes performed by a variation operator on the character strings preserve similarity of the created offspring to their parents

The first two properties ensure that all feasible solutions, and only the feasible ones, are represented by this character encoding. The characters used have an equal probability of being represented in a solution and therefore each feasible solution has the same probability of being represented. The phenotype representation allows a depiction as a string with the letter code of the canonical amino acids; therefore a genotype-phenotype mapping is not necessary. The last property allows assessing whether small changes in the molecule structure are related to similar molecule properties [55].
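The properties listed above can be made concrete with a short Python sketch of the character encoding. The alphabet is the single-letter code of Table 4.1; the helper names and the single-residue point mutation are hypothetical illustrations and are not the variation operators of VONSEA.

```python
import random

# Single-letter codes of the 20 canonical amino acids (Table 4.1).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptide(length, rng):
    """Draw every position uniformly, so all peptides of this length are equally likely."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def point_mutation(peptide, rng):
    """Replace exactly one residue by a different amino acid (illustrative operator)."""
    position = rng.randrange(len(peptide))
    alternatives = [aa for aa in AMINO_ACIDS if aa != peptide[position]]
    return peptide[:position] + rng.choice(alternatives) + peptide[position + 1:]


def hamming_distance(a, b):
    """Number of positions at which two equally long peptides differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(42)  # fixed seed only to make the example reproducible
parent = random_peptide(20, rng)
child = point_mutation(parent, rng)
print(parent, child, hamming_distance(parent, child))
```

Because genotype and phenotype coincide, every string over this alphabet is directly a peptide, and a single-character change yields an offspring that differs from its parent in exactly one residue.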

Fig. 4.2: Peptide representation as character string and nucleotide string encoding.

Another potential encoding scheme of the peptide strings is the representation of the canonical amino acids by code triples. The single amino acids are represented by nucleotide triples consisting of the four nucleotides A, U, G and C. Figure 4.2 depicts the genotype-phenotype representation of a peptide of 5 amino acids and the corresponding code triple encoding. With 4³ = 64, the number of nucleotide combinations is much higher than the number of coded amino acids, thus revealing a high number of infeasible combinations. Therefore, this encoding scheme has the following undesirable properties:

• the peptides have different representation forms as most of the single amino acids are encodable by differing nucleotide triples (see Table 4.2 in section 4.5)

• a very high number of nucleotide triple encoded peptides are infeasible

• a genotype-phenotype mapping is necessary

• the peptides have a differing number of representation forms and therefore a different probability of being represented

Nucleotide triple encoding entails a higher implementation complexity and higher storage requirements. Therefore, the character string encoding is preferred over nucleotide triple encoding.
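The redundancy of the triple encoding can be checked with a small Python sketch. It assumes the standard genetic code (Table 4.2 of the thesis is not reproduced here) and, as a further assumption, treats a nucleotide string as infeasible when it contains a stop codon; the helper feasible_fraction is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code with the first base varying slowest (UUU, UUC, UUA, ...).
# "*" marks a stop codon.
CODE = ("FFLLSSSSYY**CC*W"
        "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR"
        "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), CODE)}

# Number of different triples that encode each amino acid.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
stop_codons = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")


def feasible_fraction(peptide_length):
    """Share of random codon strings of this length without any stop codon."""
    coding = len(CODON_TABLE) - len(stop_codons)
    return (coding / len(CODON_TABLE)) ** peptide_length


print(len(CODON_TABLE), stop_codons)
print(degeneracy["L"], degeneracy["M"], degeneracy["W"])
print(round(feasible_fraction(20), 3))
```

Under these assumptions the number of triples per amino acid ranges from one to six, so peptides differ in their number of representation forms, and the share of stop-free strings shrinks geometrically with the peptide length.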

A similar approach is the encoding of the single amino acids as bit strings. As 20 amino acids have to be coded, the bit strings require a length of at least 5: 2⁴ = 16 combinations are too few, whereas 2⁵ = 32 combinations are sufficient. According to the advice of Goldberg referring to binary encoding, this encoding scheme is not advisable: the disadvantage of binary encoding for the presented purpose of peptide optimization is the representation of infeasible peptides, which is a general disadvantage of binary encoding as mentioned above. Excluding these infeasible encodings implies a higher implementation complexity. Like nucleotide triple encoding, bit string encoding also requires a genotype-phenotype mapping. On the other hand, all bit string encoded peptides are equally represented and every feasible bit string represents exactly one peptide.
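A minimal Python sketch of such a bit string encoding shows where the infeasible codes come from. The assignment of 5-bit codes to amino acids (the alphabetical index of the one-letter code) is a hypothetical choice made only for this illustration, as are the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of Table 4.1
BITS_PER_RESIDUE = 5  # 2**4 = 16 < 20 <= 2**5 = 32


def encode(peptide):
    """Genotype: concatenate the 5-bit index of every residue."""
    return "".join(format(AMINO_ACIDS.index(aa), "05b") for aa in peptide)


def decode(bits):
    """Genotype-phenotype mapping; returns None for an infeasible bit string."""
    residues = []
    for i in range(0, len(bits), BITS_PER_RESIDUE):
        value = int(bits[i:i + BITS_PER_RESIDUE], 2)
        if value >= len(AMINO_ACIDS):
            return None  # codes 20..31 do not correspond to any amino acid
        residues.append(AMINO_ACIDS[value])
    return "".join(residues)


infeasible_codes = 2 ** BITS_PER_RESIDUE - len(AMINO_ACIDS)
print(encode("AC"), decode(encode("ADIHM")), decode("11111"), infeasible_codes)
```

A single bit flip can turn a valid code into one of the unused codes, which is why a repair or rejection step would be needed on top of the mapping.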
