
4.1.1 GAVEO - Global Graph Alignment Via Evolutionary Optimization

4.1.1.1 An evolutionary strategy for the calculation of graph alignments

As outlined above, the problem of finding an optimal graph alignment can be formulated as an optimization problem. An evolutionary algorithm (EA) offers the benefit that every point in the search space can be reached, which allows for a more thorough exploration of the search space. Moreover, evolutionary algorithms have proven to be relatively versatile, leading to (near-)optimal solutions for a variety of different optimization problems (Spears et al., 1993). However, while EAs are in principle capable of finding an optimal solution in finite time, they cannot guarantee to find it in a reasonable amount of time; in fact, high runtime requirements are the major downside of EAs (Ashlock, 2006).

Evolutionary algorithms use computational models of evolutionary processes inspired by Nature to solve optimization problems. As a result, the terminology of evolutionary computation draws heavily on terms used in biology in the context of evolution. EAs typically maintain a set of possible candidate solutions for a given problem, called individuals, that are iteratively refined by applying a reproduction and selection regime (Bäck et al., 1997, 2000; Spears et al., 1993). In each iteration, individuals are perturbed by applying search operators, typically referred to as mutation and recombination, that serve as heuristics to explore the search space. Subsequently, individuals are subjected to selection by evaluating their perceived performance, measured by a fitness function, and selecting certain individuals according to a specified selection scheme as offspring for the next iteration. The set of individuals is referred to as population and an iteration, in accordance with the evolutionary symbolism, is also called generation.
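To make this terminology concrete, the following minimal Python sketch shows how population, fitness function, mutation, recombination and selection interact over the generations of a generic evolutionary algorithm. The operators passed in are placeholders and the sketch is not specific to GAVEO.

import random

# Minimal generic evolutionary loop; the representation of an individual and
# the operators fitness, recombine and mutate are placeholders.
def evolve(initial_population, fitness, recombine, mutate, generations=100):
    population = list(initial_population)            # current set of individuals
    mu = len(population)                             # population size
    for _ in range(generations):                     # one iteration = one generation
        offspring = []
        for _ in range(mu):
            parents = random.sample(population, 2)   # mating selection
            child = mutate(recombine(*parents))      # exploration via search operators
            offspring.append(child)
        # selection: evaluate perceived performance and keep the fittest mu individuals
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
    return max(population, key=fitness)              # best individual found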

The GAVEO algorithm builds upon the framework of (Weskamp, 2007). More precisely, GAVEO calculates a global graph alignment as defined in Chapter 3 by maximizing a global scoring function s that serves as a fitness function. However, instead of starting from precalculated seed solutions, GAVEO offers the possibility to calculate graph alignments “from scratch” starting from randomly generated align-ments.

Given a set of graphs G = {G1, ..., Gm}, a graph alignment A is calculated that maximizes the optimization function s(A). The objective function used by GAVEO is identical to the one used by Weskamp and represents a quality measure based on a sum-of-pairs scheme, generalized for the case of multiple graph alignments. The score of a multiple alignment A= (a1, . . . , an) is calculated by summing over the scores of all induced pairwise alignments:

\[
s(A) = \sum_{i=1}^{n} ns(a_i) + \sum_{1 \le i < j \le n} es(a_i, a_j) \tag{4.1}
\]

The function consists of two parts, considering nodes and edges separately. The node score ns evaluates the correspondence of all mutually assigned nodes within a column a_i of the alignment A and is summed up over the length of the alignment. Matching node labels are rewarded by a positive score ns_m, mismatches or the


assignment of gaps are penalized by the negative values ns_mm and ns_gap, respectively:

\[
ns\begin{pmatrix} a_{i1} \\ \vdots \\ a_{im} \end{pmatrix}
= \sum_{1 \le j < k \le m}
\begin{cases}
ns_m & \ell(a_{ij}) = \ell(a_{ik}) \\
ns_{mm} & \ell(a_{ij}) \ne \ell(a_{ik}) \\
ns_{gap} & a_{ij} = \perp,\ a_{ik} \ne \perp \\
ns_{gap} & a_{ij} \ne \perp,\ a_{ik} = \perp
\end{cases}
\tag{4.2}
\]
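To make the node score concrete, the following Python sketch evaluates Eq. (4.2) for a single alignment column. A column is represented as a tuple with one entry per graph, the gap symbol ⊥ is encoded as None, and the label function ell as well as the parameter values are illustrative placeholders rather than the settings used in the thesis.

# Node score ns of one alignment column, following Eq. (4.2).
NS_M, NS_MM, NS_GAP = 1.0, -1.0, -0.5      # example values only

def node_score(column, ell):
    score = 0.0
    m = len(column)
    for j in range(m):
        for k in range(j + 1, m):          # all pairs 1 <= j < k <= m
            a, b = column[j], column[k]
            if a is None and b is None:
                continue                   # two gaps are assumed to contribute nothing
            if a is None or b is None:
                score += NS_GAP            # a node aligned against a gap
            elif ell(a) == ell(b):
                score += NS_M              # matching node labels
            else:
                score += NS_MM             # mismatching node labels
    return score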

The function es evaluates the assignment of edges. Tolerance towards deviations of the edge weights is again realized by a tolerance threshold ε. Thus, the assignment of two edges e1 and e2 is considered a match if the respective weights deviate by at most ε, otherwise a mismatch is presumed:

\[
es\left(\begin{pmatrix} a_{i1} \\ \vdots \\ a_{im} \end{pmatrix},
\begin{pmatrix} a_{j1} \\ \vdots \\ a_{jm} \end{pmatrix}\right)
= \sum_{1 \le k < l \le m}
\begin{cases}
es_{mm} & (a_{ik}, a_{jk}) \in E_k,\ (a_{il}, a_{jl}) \notin E_l \\
es_{mm} & (a_{ik}, a_{jk}) \notin E_k,\ (a_{il}, a_{jl}) \in E_l \\
es_m & d_{ijkl} \le \varepsilon \\
es_{mm} & d_{ijkl} > \varepsilon
\end{cases}
\tag{4.3}
\]

where d_{ijkl} = |w(a_{ik}, a_{jk}) − w(a_{il}, a_{jl})|. The parameters (i.e., ns_m, ns_mm, ns_gap, es_m, es_mm) are constants used to reward or penalize matches, mismatches and gap assignments, respectively².

² In the experimental part, the scoring parameters and ε will be initialized according to Weskamp (2007).
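Analogously, the following sketch evaluates the edge score of Eq. (4.3) for a pair of alignment columns and combines both parts into the sum-of-pairs score of Eq. (4.1); it reuses the node_score function from the previous sketch. The edge_weight lookup on the graph objects and the parameter values are assumptions made purely for illustration.

# Edge score es of two alignment columns (Eq. 4.3) and total score s (Eq. 4.1).
ES_M, ES_MM, EPSILON = 1.0, -1.0, 0.2      # example values only

def edge_score(col_i, col_j, graphs):
    def weight(graph, u, v):
        # weight of edge (u, v), or None if an endpoint is a gap or no edge exists;
        # edge_weight is an assumed lookup method on the graph objects
        if u is None or v is None:
            return None
        return graph.edge_weight(u, v)

    score = 0.0
    m = len(col_i)
    for k in range(m):
        for l in range(k + 1, m):                   # all graph pairs 1 <= k < l <= m
            w_k = weight(graphs[k], col_i[k], col_j[k])
            w_l = weight(graphs[l], col_i[l], col_j[l])
            if w_k is None and w_l is None:
                continue                            # no edge in either graph
            if w_k is None or w_l is None:
                score += ES_MM                      # edge present in only one graph
            elif abs(w_k - w_l) <= EPSILON:         # d_ijkl <= epsilon: match
                score += ES_M
            else:                                   # d_ijkl > epsilon: mismatch
                score += ES_MM
    return score

def alignment_score(alignment, graphs, ell):
    # Sum-of-pairs score of Eq. (4.1): node scores of all columns plus
    # edge scores of all pairs of columns.
    n = len(alignment)
    total = sum(node_score(col, ell) for col in alignment)
    total += sum(edge_score(alignment[i], alignment[j], graphs)
                 for i in range(n) for j in range(i + 1, n))
    return total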

Having defined the fitness function, an evolutionary algorithm is employed to find a globally optimal alignment. More precisely, the GAVEO approach consists of an iterative process following Beyer and Schwefel (2002). Initially, a population consisting of µ individuals is generated randomly, with µ denoting the population size. Each individual represents a graph alignment as a candidate solution. Then, the following iterative loop is performed until a certain stopping criterion is met:

1. At the beginning of each generation, λ = ν·µ new offspring individuals are generated. This is achieved by selecting ρ parent individuals from the initial population by means of a mating-selection operator, which in the case of GAVEO is simply realized as a random selection according to a uniform distribution. The parent individuals are then recombined to yield a new individual by means of a recombination operator.

2. Each individual is subjected to a mutation operator that (slightly) alters the individual (an illustrative sketch of such an operator is given after this list).

3. The offspring individuals are subsequently evaluated using the fitness function (4.1), and a temporary population T is formed as the union of the initial population and the newly generated offspring. A selection operator is then applied to select the fittest µ individuals, which form the population of the next iteration.
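Since the concrete search operators are not specified at this point, the following sketch shows one hypothetical mutation operator on the alignment representation introduced above: it exchanges the assignments of a single graph between two columns, which keeps every node in exactly one column. It merely illustrates what slightly altering an individual can look like and is not necessarily the operator used by GAVEO.

import random

# Hypothetical mutation: swap the entries of one randomly chosen graph between
# two randomly chosen columns of the alignment (columns are lists with one
# entry per graph, gaps encoded as None).
def swap_mutation(alignment):
    mutant = [list(column) for column in alignment]   # copy, leave the parent intact
    i, j = random.sample(range(len(mutant)), 2)       # two distinct columns
    g = random.randrange(len(mutant[0]))              # one graph index
    mutant[i][g], mutant[j][g] = mutant[j][g], mutant[i][g]
    return mutant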

As possible stopping criteria, the elapsed runtime, the number of generations, the stall time or the number of stall generations (the elapsed time, respectively the number of generations, without improvement of the fitness value), or a fixed fitness value can be used. The complete GAVEO algorithm is summarized in Algorithm 1.

Algorithm 1 The GAVEO algorithm

Require: G set of graphs, µ population size, λ number of offspring, ρ number of parents, s fitness function
  stop = false
  P = initialize_population(G, µ)
  while stop = false do
    O = recombine_offspring(P, λ)
    O = mutate_offspring(O)
    evaluate_population(O ∪ P)
    P = select_best(O ∪ P, µ)
    if stop criterion is met then
      stop = true
    end if
  end while
  return individual A* ∈ P with A* = arg max_{A∈P} s(A)
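For illustration, the control flow of Algorithm 1 can be transcribed into Python as follows. The operators initialize_population, recombine and mutate as well as the default parameter values are placeholders; only the plus-selection loop described above is shown, with the generation count as the sole stopping criterion.

import random

def gaveo(graphs, s, initialize_population, recombine, mutate,
          mu=50, lam=100, rho=2, max_generations=1000):
    population = list(initialize_population(graphs, mu))
    for _ in range(max_generations):                  # stopping criterion: generations
        offspring = []
        for _ in range(lam):
            # mating selection: rho parents drawn uniformly at random
            parents = random.sample(population, rho)
            offspring.append(mutate(recombine(parents)))
        # evaluate parents and offspring, keep the fittest mu individuals
        population = sorted(population + offspring, key=s, reverse=True)[:mu]
    return max(population, key=s)                     # best alignment found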

The optimization process can start from arbitrary alignments, which reduces the memory requirements compared to the greedy approach, as mentioned above. However, since the search space is relatively large, starting from a random


alignment might not be the best choice. Thus, in the case of smaller graphs, where the calculation of a maximum clique is unlikely to cause problems, the greedy solution can first be calculated as a starting point and included in the initial population.

In principle, any number of graphs can be aligned directly in this manner, which is an additional benefit of the GAVEO approach. The greedy approach, in contrast, can only be used to calculate pairwise alignments, which are subsequently combined into a multiple graph alignment via star alignment. This, however, introduces another source of inaccuracy, as two heuristics are used in combination instead of a single procedure, as is the case for the GAVEO approach.

On the other hand, the search space of the multiple graph alignment problem grows exponentially with the number of graphs. This is, of course, problematic from an optimization point of view, as an efficient exploration of the search space becomes more and more difficult. Thus, decomposing the multiple graph alignment problem into several pairwise ones and resorting to subsequent merging might allow quality to be traded for speed.

In doing so, one would achieve a reduction of the search space by simplifying the problem, although at the price of a potential loss of quality incurred by the suboptimal merging of the pairwise alignments. It is difficult to judge in advance which effect would have the greater impact, and it might be more advisable to avoid the risk of going astray in a huge search space. However, as the focus of this thesis is on the pairwise case, this is of minor importance here.