The edge score $es$ of two columns of the alignment is then given by

$$
es\left(
\begin{pmatrix} a_{i1}\\ \vdots\\ a_{im} \end{pmatrix},
\begin{pmatrix} a_{j1}\\ \vdots\\ a_{jm} \end{pmatrix}
\right)
= \sum_{1 \le k < l \le m}
\begin{cases}
es_{mm} & \text{if } (a_{ik}, a_{jk}) \in E_k \text{ and } (a_{il}, a_{jl}) \notin E_l,\\
es_{mm} & \text{if } (a_{ik}, a_{jk}) \notin E_k \text{ and } (a_{il}, a_{jl}) \in E_l,\\
es_{m}  & \text{if } d_{i,j}^{k,l} \le \epsilon,\\
es_{mm} & \text{if } d_{i,j}^{k,l} > \epsilon,
\end{cases}
\tag{6.12}
$$

where the last two cases apply if $(a_{ik}, a_{jk}) \in E_k$ and $(a_{il}, a_{jl}) \in E_l$, $d_{i,j}^{k,l} = |\,\ell_E(a_{ik}, a_{jk}) - \ell_E(a_{il}, a_{jl})\,|$ denotes the difference of the labels of the two mapped edges, and $\epsilon$ is the tolerance up to which two edge labels are considered to match. Again, the constants $es_m$ and $es_{mm}$ are used to reward or penalize matches or mismatches.
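For concreteness, the following Python sketch evaluates the case distinction of (6.12) for one pair of alignment columns. The data layout (one dictionary per graph mapping node pairs to edge labels, None for dummies) and the parameter names es_m, es_mm, and eps are illustrative assumptions rather than the original implementation; scoring the conserved absence of an edge as a match follows the weight-∞ convention for dummy edges used in the acceleration described later in this section.

from itertools import combinations

def edge_score(col_i, col_j, edges, es_m=1.0, es_mm=-1.0, eps=0.2):
    # Hypothetical sketch of the edge score (6.12) for two columns.
    # col_i[k], col_j[k]: node of graph k in columns i and j (None = dummy);
    # edges[k]: dict mapping frozenset({u, v}) to the edge label of graph k.
    score = 0.0
    m = len(col_i)
    for k, l in combinations(range(m), 2):         # all pairs 1 <= k < l <= m
        w_k = edges[k].get(frozenset((col_i[k], col_j[k])))
        w_l = edges[l].get(frozenset((col_i[l], col_j[l])))
        if w_k is None and w_l is None:
            score += es_m      # both edges absent: treated as a match (assumption)
        elif w_k is None or w_l is None:
            score += es_mm     # edge present in only one of the two graphs
        else:
            # both edges present: match iff the labels differ by at most eps
            score += es_m if abs(w_k - w_l) <= eps else es_mm
    return score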

6.3.2 Evolutionary Algorithm for Solving the Multiple Graph Alignment Problem

A: 1 2 4 3 5 6 ⊥ ⊥ ⊥
B: 6 1 2 3 4 5 7 9 8
C: 4 3 2 1 ⊥ ⊥ 5 ⊥ ⊥
D: ⊥ 3 4 2 1 7 5 6 ⊥

Figure 6.5: Matrix representation of a multiple graph alignment. Dummies are represented by a ⊥. Note that the order of the columns is arbitrary.

Representation of Individuals

In evolutionary algorithms, individuals correspond to potential solutions of the problem considered; accordingly, individuals here represent multiple alignments. Given a fixed numbering of the nodes of graph $G_i$ from 1 to $|V_i|$, a multiple graph alignment can be represented in a unique way by a two-dimensional matrix, where the rows correspond to the graphs and the columns to the aligned nodes of these graphs or, possibly, dummies. Figure 6.5 shows an example of such a matrix for the case of 4 graphs of size 6, 9, 5, and 7, respectively. The first column indicates a mutual assignment of the first node of graph A, the sixth node of graph B, and the fourth node of graph C, while there is no matching partner other than a dummy, indicated by a ⊥, in graph D.
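Written down in code, such an alignment is nothing more than a list of rows. The following literal mirrors Figure 6.5, using None for ⊥ and following the (partly reconstructed) dummy positions shown above; this simplified representation is also used in the sketches later in this section.

# The alignment matrix of Figure 6.5, one row per graph, None = dummy.
alignment = [
    [1, 2, 4, 3, 5, 6, None, None, None],     # graph A, |V_A| = 6
    [6, 1, 2, 3, 4, 5, 7, 9, 8],              # graph B, |V_B| = 9
    [4, 3, 2, 1, None, None, 5, None, None],  # graph C, |V_C| = 5
    [None, 3, 4, 2, 1, 7, 5, 6, None],        # graph D, |V_D| = 7
]
# Column 0 aligns node 1 of A, node 6 of B, and node 4 of C, while
# graph D contributes only a dummy in this column.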

The number of rows in an individual is known a priori, since it corresponds to the number of graphs to be aligned. The optimal number of columns, however, is a priori unknown. It ranges between the two extremes $\max_{i=1,\dots,m} |V_i|$ and $|V_1| + \dots + |V_m|$. On the one hand, the upper bound will usually be far larger than necessary and may come along with an excessive increase of the runtime needed to solve the multiple graph alignment problem. From an optimization point of view, a small number of columns is hence preferable. On the other hand, using the lower bound, flexibility is lost and the optimal solution is excluded with high probability, since only a small part of the search space is considered during the evolutionary search. Generally, it is quite difficult to find a trade-off between both extremes.

Therefore, to avoid this problem, a self-adaptation technique is employed.

To this end, an adaptive representation is used that does not require the a priori specification of the number of columns. The matrix scheme is initialized with m rows and $n_{\max} + 1$ columns, where $n_{\max} = \max_{i=1,\dots,m} |V_i|$. Hence, large parts of the search space are neglected at first, but they can be brought into consideration according to an update rule, sketched in code after the following list: at randomly chosen intervals, it is checked whether further dummy columns are needed or whether existing ones have become unnecessary. To this end, all individuals in the population are considered and the number of dummy columns is determined. Three cases can occur:

1. Every individual in the population has at least one dummy column, and at least one individual has exactly one dummy column: the current length is still optimal.

2. All individuals have more than one dummy column: apparently, a number of dummy columns are obsolete and can be removed, retaining at least one dummy column in all individuals and exactly one dummy column in at least one individual.

3. At least one individual has no dummy column left: the dummy column has been "consumed" by mapping dummies to real nodes. Therefore, a new dummy column has to be inserted in all individuals of the population.
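These three cases translate directly into a simple adjustment routine. A minimal sketch, assuming individuals are plain m-row matrices with None as dummy entry (all names hypothetical):

def adjust_dummy_columns(population):
    # Sketch of the self-adaptive column-count rule. Invariant afterwards:
    # every individual has at least one dummy column, and at least one
    # individual has exactly one.
    def dummy_cols(ind):
        return [c for c in range(len(ind[0]))
                if all(row[c] is None for row in ind)]

    counts = [len(dummy_cols(ind)) for ind in population]
    if min(counts) == 0:
        # Case 3: a dummy column was "consumed" somewhere, so a fresh
        # dummy column is appended to every individual.
        for ind in population:
            for row in ind:
                row.append(None)
    elif min(counts) > 1:
        # Case 2: surplus columns; delete min(counts) - 1 dummy columns
        # from each individual, keeping all dimensions equal.
        surplus = min(counts) - 1
        for ind in population:
            for c in sorted(dummy_cols(ind)[:surplus], reverse=True):
                for row in ind:
                    del row[c]
    # Case 1 (min(counts) == 1): the current width is still optimal.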

This self-adaptation step is applied to the whole population, since the recombination operator in particular requires equal dimensionality of all individuals in the population. As a result, this adaptation technique decreases the runtime dramatically: in most cases it is not necessary to consider the upper bound to find the optimal alignment, so starting with a small size and increasing it only when necessary prunes large parts of the search space and thus yields a more efficient search.

The efficiency of an evolutionary algorithm can be increased further by storing the fitness in the individuals, avoiding repeated evaluations of the same individual. Therefore, individuals are extended by an additional real number representing the fitness of the individual.

Moreover, a self-adaptation technique (Beyer and Schwefel, 2002) for the step sizes of the mutation operator is applied, allowing a simpler adjustment of the mutation strength; this procedure is introduced later together with the related mutation operator. Here, it is sufficient to know that the individual is extended by a further integer to allow such an automatic adjustment.

Evolutionary Loop

The evolutionary loop is taken unchanged from evolution strategies (Beyer and Schwefel, 2002) and depicted in Algorithm 2.1. Its mating selection, selection, and termination criteria also remain unchanged. Due to the changed representation, the recombination and mutation operators as well as the fitness evaluation are adapted.
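For orientation, a compact sketch of such a loop is given below. The population sizes, the plus-selection, and the probabilistic adjustment interval are assumptions standing in for Algorithm 2.1; initialize, recombine, mutate, fitness, and adjust_dummy_columns refer to the operators described in the surrounding paragraphs.

import random

def gaveo_loop(graphs, mu=15, lam=100, rho=3, generations=1000):
    # Hypothetical sketch of the evolutionary loop of GAVEO.
    population = [initialize(graphs) for _ in range(mu)]
    for gen in range(generations):
        offspring = []
        for _ in range(lam):
            parents = random.sample(population, rho)   # mating selection
            offspring.append(mutate(recombine(parents)))
        # plus-selection: keep the mu best of parents and offspring
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:mu]
        if random.random() < 0.1:     # randomly chosen adjustment intervals
            adjust_dummy_columns(population)
    return max(population, key=fitness)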

Initialization

The initialization of one individual is performed by first determining the dimensionality $(m, n)$ of the matrix representing the alignment. This dimensionality is given by the number of graphs considered and the number of nodes in the graphs (cf. the representation of an alignment above). Having initialized the matrix, each row i is filled with a random permutation of length n. Since row i represents graph $G_i$, which by definition has fewer than n nodes, entries $j > |V_i|$ are replaced by dummies. Another interesting technique that can be applied is based on additional knowledge: here, the solution obtained by the greedy heuristic can be added to the population or, alternatively, the local cliques can be used as starting points for the initialization of individuals.
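A sketch of this initialization, under the same simplified representation as above (graphs given by their node counts; all names hypothetical):

import random

def initialize(node_counts, n=None):
    # Sketch: one individual for m graphs with |V_1|, ..., |V_m| nodes.
    if n is None:
        n = max(node_counts) + 1                  # n_max + 1 columns
    individual = []
    for size in node_counts:
        row = random.sample(range(1, n + 1), n)   # random permutation of 1..n
        # entries j > |V_i| do not refer to a node of G_i and become dummies
        individual.append([x if x <= size else None for x in row])
    return individual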

Recombination

The recombination operator is a mapping $I^\rho \to I$ that takes the individuals chosen by the mating selection and creates an offspring. To recombine the $\rho$ chosen individuals, $\rho - 1$ random numbers $r_i$, $i = 1, \dots, \rho - 1$, are generated with $1 \le r_1 < r_2 < \dots < r_{\rho-1} < m$, and an offspring individual is constructed by combining the sub-matrices consisting, respectively, of the rows $\{r_{i-1}+1, \dots, r_i\}$ from the i-th parent individual, where $r_0 = 0$ and $r_\rho = m$ by definition. Simply stitching together complete sub-matrices is not possible, however, since the nodes are not ordered in a uniform way. Therefore, in merging step i, the ordering of the $r_i$-th row is used as a reference.

This procedure is illustrated in Figure 6.6 for the case $\rho = 3$. Three individuals 'Individual 1', 'Individual 2', and 'Individual 3' and two integers $r_1$ and $r_2$ in the range $\{1, \dots, m\}$, with $m = 7$, are chosen at random. In this example the random integers 2 and 4 were drawn, hence all individuals are split at the rows $r_1 = 2$ and $r_2 = 4$. The resulting blocks are merged into a new individual (the offspring). To preserve the ordering, columns are rearranged according to the rows $r_1$ and $r_2$, respectively, whose indices serve as pivot elements: for example, the first framed subcolumn in 'Individual 1' is copied to the offspring, and since the index in the pivot row $r_1$ is 2, one has to search for the same index in this row in 'Individual 2'. This subcolumn (framed) is also copied into the offspring. This procedure is repeated for all individuals and columns.

[Figure panels: 'Individual 1', 'Individual 2', 'Individual 3' (split at $r_1$ and $r_2$) and 'Offspring'.]

Figure 6.6: Visualization of the recombination operator of the graph alignment via evolutionary optimization approach.
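The pivot-based column matching can be sketched as follows; the representation (m x n matrices with None as dummy) and the arbitrary matching of dummy pivots are assumptions, the latter an ambiguity the figure glosses over.

import random

def recombine(parents):
    # Sketch of the rho-parent recombination with pivot-row column matching.
    m, n = len(parents[0]), len(parents[0][0])
    rho = len(parents)
    r = [0] + sorted(random.sample(range(1, m), rho - 1)) + [m]

    offspring = [row[:] for row in parents[0][:r[1]]]   # block 1 verbatim
    for i in range(1, rho):
        parent, pivot = parents[i], r[i] - 1   # pivot row: last row copied so far
        # column c of the offspring corresponds to the parent column whose
        # pivot-row entry carries the same node index
        dummies = [c for c in range(n) if parent[pivot][c] is None]
        col_map = []
        for c in range(n):
            v = offspring[pivot][c]
            col_map.append(dummies.pop() if v is None
                           else parent[pivot].index(v))
        for row in parent[r[i]:r[i + 1]]:
            offspring.append([row[col_map[c]] for c in range(n)])
    return offspring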

Mutation

The mutation operator is a mapping $I \to I$ that selects one row and two columns at random and swaps the entries in the corresponding cells. Obviously, this procedure realizes only small step sizes. To enable large mutation steps, the procedure is therefore repeated multiple times for each individual. As the optimal number of repetitions was unknown in the design phase of the algorithm, it was specified as a strategy component adjusted by a self-adaptation mechanism (Beyer and Schwefel, 2002).

Here, the self-adaptation technique was realized as follows: the integer stored for this purpose in the individual is mutated first by adding a normally distributed random number. After taking the ceiling, again an integer is obtained that specifies the mutation size, that is, the number of swaps of pairs of randomly chosen cells performed for each individual that is subject to mutation. Using this parameter allows one to apply, in addition to the simple mutation in which only two cells are swapped, a mutation with a much higher impact on the solution. In particular, self-adaptation allows for an automatic adjustment that requires neither human intervention nor problem-specific knowledge.
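A sketch of this mechanism, here modeling the individual as a dictionary that carries the matrix together with the stored strategy integer (the names and the noise scale tau are assumptions):

import math
import random

def mutate(ind, tau=1.0):
    # Sketch of the self-adaptive swap mutation.
    # ind: {'matrix': m x n rows, 'strength': int}.
    # 1. mutate the strategy parameter with normally distributed noise
    #    and take the ceiling, enforcing at least one swap
    strength = max(1, math.ceil(ind['strength'] + random.gauss(0.0, tau)))
    matrix = ind['matrix']
    m, n = len(matrix), len(matrix[0])
    # 2. perform `strength` elementary mutations: pick one row (graph)
    #    and two columns at random and swap the two cells
    for _ in range(strength):
        row = random.randrange(m)
        c1, c2 = random.sample(range(n), 2)
        matrix[row][c1], matrix[row][c2] = matrix[row][c2], matrix[row][c1]
    ind['strength'] = strength
    return ind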

Fitness Function and Acceleration

To calculate the fitness of each individual, the sum-of-pairs measure (6.10) is used, where dummy columns are of course excluded from scoring, i.e., the insertion or deletion of dummy columns has no influence on the fitness. This measure was introduced by Weskamp et al. (2007) to measure the quality of the solutions of their greedy heuristic; there, it was evaluated only $m \cdot (m-1)$ times. In the case of evolutionary optimization, however, the fitness function is evaluated many times, so that this measure becomes the bottleneck of the whole approach. Therefore, a modification of (6.10) is used that still leads to the same mapping.

The naive implementation of the sum-of-pairs fitness function (6.10) comes with a complexity of $\mathcal{O}(n^2 \cdot m^2)$, since the number of summands is $n \cdot m^2$ in (6.11) and $n^2 \cdot m^2$ in (6.12). While of the same theoretical complexity, the runtime can nevertheless be reduced considerably in practice by using information about the distribution of edge labels. To this end, all pairs of columns in the matrix scheme are considered, each of which specifies a set of edges mapped onto each other. The edge weights are sorted in ascending order in time $\mathcal{O}(m \log m)$, where dummy edges are assumed to have weight $\infty$. The sorted vector of weights allows the identification of the index $i_d$ from which on dummy edges follow. Thus, the case distinction of the edge score (6.12) has to be evaluated only until $i_d$ is reached. The remaining summands can be calculated in closed form as $(m - i_d) \cdot i_d \cdot es_{mm} + \binom{m - i_d}{2} \cdot es_m$. The theoretical runtime remains the same; however, since the graphs are rather sparse due to their construction³, this procedure leads in practice to a considerable gain in efficiency.
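The following sketch spells out this accelerated evaluation for one pair of columns. It reuses the hypothetical data layout of the earlier edge-score sketch, assumes weight-∞ dummy edges as described above, and counts the dummy/dummy pairs with the binomial coefficient.

import math
from itertools import combinations

def edge_score_fast(col_i, col_j, edges, es_m=1.0, es_mm=-1.0, eps=0.2):
    # Sketch of the accelerated evaluation of (6.12) for two columns.
    m = len(col_i)
    weights = []
    for k in range(m):
        if col_i[k] is None or col_j[k] is None:
            weights.append(math.inf)                    # dummy edge
        else:
            weights.append(edges[k].get(
                frozenset((col_i[k], col_j[k])), math.inf))
    weights.sort()                                      # O(m log m)
    # i_d: index from which on only dummy edges (weight inf) follow
    i_d = next((i for i, w in enumerate(weights) if math.isinf(w)), m)
    # case distinction only among the i_d real edge weights
    score = sum(es_m if w2 - w1 <= eps else es_mm
                for w1, w2 in combinations(weights[:i_d], 2))
    # closed-form contribution of all pairs involving a dummy edge
    d = m - i_d
    score += d * i_d * es_mm            # one real, one dummy edge: mismatch
    score += d * (d - 1) // 2 * es_m    # two dummy edges: match
    return score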

Complexity

The complexity of the GAVEO approach is clearly dominated by the evaluation of the fitness function (6.10), which has complexity $\mathcal{O}(n^2 \cdot m^2)$. Although the complexity of the fitness evaluation is known, the overall complexity of GAVEO cannot be determined, for the reasons already mentioned in Section 4.1. Therefore, the time complexity of GAVEO is again given as a function of l, where l is the random variable specifying the number of iterations the algorithm must perform until the termination criterion holds. The resulting complexity hence becomes $l \cdot \mathcal{O}(n^2 \cdot m^2)$.

³ If one follows the recommendation of Weskamp et al. (2007) and specifies δ = 11 Å.

The space complexity of this approach is quite low: since no complex calculations are performed, in particular no calculations requiring the product graph, GAVEO's space complexity is given by the matrix needed to represent a potential solution of the MGA problem. In the worst case, this matrix has a size of $m \times (n \cdot m)$.

6.3.3 Combining Evolutionary Optimization and Pairwise