Deriving a global alignment - SEGA - SEmi-global Graph Alignment

4.3 Semi-global graph comparison

4.3.1 SEGA - SEmi-global Graph Alignment

4.3.1.2 Deriving a global alignment

to a certain assignment of nodes will later be used to derive a quality measure for the alignment. Squaring the difference (4.29) will serve to increase the influence of node assignments with a nearly perfect accordance regarding the spatial constellation of the associated node neighborhoods. Once the distance matrix D is derived, a graph alignment is calculated in a second step.

more likely to represent a conserved and thus functionally important region of the binding pocket. While this problem is somewhat mitigated by using the squared number of non-matching triangles as local distance measure, creating a preference for highly similar neighborhoods, it is nevertheless possible.

A high neighborhood similarity is only achieved if a conserved common substructure exists. The size of this substructure directly affects the number of highly affine node pairs, and the most similar pairs should always correspond to nodes located in the center of the associated common subgraph, where a high neighborhood similarity is most likely. Hence, one should consider such nodes first. Consequently, SEGA assembles an assignment by starting with the nodes exhibiting the lowest observed distance value and then stepping through the possible distance values in increasing order (note that only a fixed number of values is possible), making all possible unambiguous assignments before advancing to the next level.

If more than one assignment for a given node is possible with the same cost, SEGA resorts to global information from graph topology to resolve such ambiguities. More specifically, an initial seed solution is constructed in the form of a partial assignment of nodes which will serve as a reference frame. To this end, only nodes v_i⁽¹⁾ ∈V₁ and v_j⁽²⁾ ∈V₂ having a distance of 0 and, hence, being highly affine, are considered. If such nodes exist and can be mutually assigned without ambiguities, these assignments are realized. With

f_c(v_i⁽¹⁾) ={v_j⁽²⁾ ∈V₂|d_ij ≤c} , g_c(v_j⁽²⁾) ={v_i⁽¹⁾ ∈V₁|d_ij ≤c} ,

(where e.g., f_c(v_i⁽¹⁾) denotes the set of vertices in G₂ whose distance to v_i⁽¹⁾ is not greater than c) those pairs v_i⁽¹⁾ and v_j⁽²⁾ satisfying f₀(v_i⁽¹⁾) = {v⁽²⁾_j } and g₀(v_j⁽²⁾) = {v_i⁽¹⁾} are assigned, as they represent unambiguous choices for constructing the seed solution. Those nodes v_i⁽¹⁾ with |f₀(v⁽¹⁾_i )| > 1 (and v_j⁽²⁾ with |g₀(v_j⁽²⁾)| > 1) are not yet assigned, as for these nodes multiple conflicting assignments are possible. Such conflicting choices are later resolved by drawing on the seed solution as reference frame.
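The selection of unambiguous seed pairs can be illustrated with a minimal Python sketch. It assumes that the distance matrix D is given as a nested list whose rows index the nodes of G₁ and whose columns index the nodes of G₂; the function names and the toy matrix are hypothetical and not taken from SEGA itself.

```python
def candidate_sets(D, c):
    """Return f_c and g_c as dictionaries mapping a node index to an index set."""
    n_rows, n_cols = len(D), len(D[0])
    f = {i: {j for j in range(n_cols) if D[i][j] <= c} for i in range(n_rows)}
    g = {j: {i for i in range(n_rows) if D[i][j] <= c} for j in range(n_cols)}
    return f, g


def unambiguous_pairs(D, c=0):
    """Pairs (i, j) with f_c(i) == {j} and g_c(j) == {i}."""
    f, g = candidate_sets(D, c)
    return sorted((i, j) for i, js in f.items() for j in js
                  if js == {j} and g[j] == {i})


# Toy distance matrix (hypothetical values): rows are nodes of G1, columns nodes of G2.
D = [
    [0, 4, 9],
    [4, 0, 0],
    [9, 1, 4],
]
print(unambiguous_pairs(D, c=0))  # node 1 of G1 has two partners at distance 0 and is skipped
```

In this toy example only the pair (0, 0) is assigned, because node 1 of G₁ has two candidates at distance 0 and therefore remains open for the later, seed-based resolution.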

The seed solution thus obtained must satisfy the constraint that the set of mapped points for each graph contains a basis of R³ to determine the relative position of a new node in three-dimensional space in an unambiguous way. To ensure this, at least four pairs of points are needed, provided these points can be used to define a spanning set of vectors for R³ that are linearly independent. If this condition is not met, a sufficient number of candidate pairs is collected by relaxing the distance constraint, i.e., a maximal local distance c >0 is allowed.
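The basis condition can be checked directly: four points define a reference frame for R³ exactly if the three difference vectors from one point to the other three are linearly independent, i.e., if the points are not coplanar. The following sketch tests this via a 3×3 determinant; the function name and the absolute tolerance are assumptions made for illustration.

```python
def spans_r3(points, tol=1e-9):
    """True if four 3D points are not coplanar.

    The three difference vectors p1-p0, p2-p0, p3-p0 are linearly
    independent exactly if their 3x3 determinant is non-zero.
    `tol` is an assumed absolute tolerance for floating-point input.
    """
    p0, p1, p2, p3 = points
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    c = [p3[k] - p0[k] for k in range(3)]
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
           - a[1] * (b[0] * c[2] - b[2] * c[0])
           + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return abs(det) > tol


tetrahedron = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
square = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
print(spans_r3(tetrahedron), spans_r3(square))
```

A seed would have to pass this test in both graphs, since the relative position of a new node is evaluated against the mapped points of each graph separately.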

If even the seed solution cannot be constructed unambiguously, the following strategy is employed: Let S₁ ⊆ V₁ and S₂ ⊆ V₂ denote the nodes occurring in these candidates. SEGA then constructs all possible candidate assignments

{(s_1⁽¹⁾, s_1⁽²⁾), (s_2⁽¹⁾, s_2⁽²⁾), (s_3⁽¹⁾, s_3⁽²⁾), (s_4⁽¹⁾, s_4⁽²⁾)} ⊆ S₁ × S₂

of size four that represent a unique three-dimensional geometry and are unambiguous in the sense that s_i⁽²⁾ ∈ f_c(s_i⁽¹⁾) and s_j⁽²⁾ ∉ f_c(s_i⁽¹⁾) as well as s_i⁽¹⁾ ∈ g_c(s_i⁽²⁾) and s_j⁽¹⁾ ∉ g_c(s_i⁽²⁾) for all 1 ≤ i ≠ j ≤ 4. As final seed solution, the candidate minimizing the spatial deviation

Σ_{1≤i<j≤4} | e(s_i⁽¹⁾, s_j⁽¹⁾) − e(s_i⁽²⁾, s_j⁽²⁾) |

is selected to match the candidates that are most similar in terms of geometry.
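As an illustration, the selection of the final seed can be sketched as an exhaustive search over all candidate quadruples. The sketch assumes that e(·,·) is the Euclidean distance between two nodes of the same graph and that the deviation is a sum of absolute differences; node coordinates X1 and X2, the function names, and the example data are hypothetical. The check for a unique three-dimensional geometry (non-coplanarity) is omitted here for brevity.

```python
from itertools import combinations
from math import dist


def deviation(quad, X1, X2):
    """Sum over all node pairs of |e1 - e2|, with e taken as Euclidean distance."""
    return sum(abs(dist(X1[a], X1[c]) - dist(X2[b], X2[d]))
               for (a, b), (c, d) in combinations(quad, 2))


def is_unambiguous(quad, D, c):
    """Matched pairs lie within distance c, all cross pairs lie above c."""
    return all((D[a][d] <= c) == (p == q)
               for p, (a, _) in enumerate(quad)
               for q, (_, d) in enumerate(quad))


def best_seed(pairs, D, c, X1, X2):
    """Unambiguous quadruple of candidate pairs with minimal spatial deviation."""
    quads = [q for q in combinations(pairs, 4) if is_unambiguous(q, D, c)]
    return min(quads, key=lambda q: deviation(q, X1, X2), default=None)


# Hypothetical data: G2 is G1 shifted along x, except for a displaced fifth node.
X1 = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
X2 = [(5, 0, 0), (6, 0, 0), (5, 1, 0), (5, 0, 1), (6, 1, 3)]
D = [[0 if i == j else 9 for j in range(5)] for i in range(5)]
pairs = [(i, i) for i in range(5)]
print(best_seed(pairs, D, 1, X1, X2))
```

The quadruple built from the first four nodes has deviation zero and is selected; every quadruple containing the displaced fifth node deviates and is rejected. The number of quadruples grows quickly with the number of candidates, so this brute-force form is only meant for small candidate sets.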

Now, suppose that a current seed is given in the form of a partial alignment. Some nodes may still not have been assigned unambiguously. To solve this problem, one can again formulate an optimal assignment problem, this time augmented by drawing upon global information. In the k-th iteration, nodes having a distance of at most c_k are assigned, where c_k is the k-th smallest cost value in the matrix D. More specifically, let W₁ ⊂ V₁ (W₂ ⊂ V₂) denote the set of nodes from V₁ (V₂) that have already been assigned in a previous iteration. Moreover, let

U₁^k = {v_i⁽¹⁾ ∈ V₁ | f_c_k(v_i⁽¹⁾) ≠ ∅} \ W₁ , U₂^k = {v_j⁽²⁾ ∈ V₂ | g_c_k(v_j⁽²⁾) ≠ ∅} \ W₂ .

Then a (partial) assignment of nodes in U₁^k and U₂^k is derived by applying the Hungarian algorithm to a cost matrix defined as follows. The matrix contains an entry for each pair of nodes v_i⁽¹⁾ ∈ U₁^k and v_j⁽²⁾ ∈ U₂^k. If v_j⁽²⁾ ∉ f_c_k(v_i⁽¹⁾), the corresponding cost value is set to a sufficiently high constant C (indicating that these two nodes should not be assigned). Otherwise, the cost value is determined by resorting to information from the (global) graph structure, namely by comparing the position of v_i⁽¹⁾ relative to the current seed nodes W₁ with the position of v_j⁽²⁾ relative to W₂. More precisely, the cost is defined by

Σ_{q=1}^{|W₁|} | |v_i⁽¹⁾ − w_q⁽¹⁾| − |v_j⁽²⁾ − w_q⁽²⁾| | ,

where w_q⁽¹⁾ and w_q⁽²⁾ denote, respectively, the q-th node in W₁ and W₂ (which are mutually assigned), and |v_i⁽¹⁾ − w_q⁽¹⁾| is the Euclidean distance between v_i⁽¹⁾ and w_q⁽¹⁾. Applying the Hungarian algorithm again yields a cost-minimal assignment. If v_i⁽¹⁾ and v_j⁽²⁾ participate in this assignment, i.e., have been assigned to each other, v_i⁽¹⁾ is added to W₁ and v_j⁽²⁾ to W₂, provided that v_j⁽²⁾ ∈ f_c_k(v_i⁽¹⁾), i.e., that the corresponding cost value is smaller than C. Intuitively, the main idea behind this step is to choose, from all ambiguous choices, only those node assignments for which the corresponding nodes are oriented in roughly the same manner towards the reference frame given by the seed solution, or at least show the least deviation.
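The construction of the cost matrix and the subsequent filtering of the assignment can be sketched as follows. This is an illustration under several assumptions: node coordinates X1 and X2 are available, the seed is a list of already assigned index pairs, the constant C is represented by an arbitrary large value, and the deviation is taken as a sum of absolute differences. To keep the example self-contained, the Hungarian algorithm is replaced by an exhaustive search over permutations, which returns the same cost-minimal assignment but is only feasible for very small matrices.

```python
from itertools import permutations
from math import dist

BIG = 1e9  # stands in for the "sufficiently high constant C"; the value is an assumption


def cost_matrix(U1, U2, D, c_k, seed, X1, X2):
    """Rows follow U1, columns follow U2; seed is a list of assigned pairs (w1, w2)."""
    M = []
    for i in U1:
        row = []
        for j in U2:
            if D[i][j] > c_k:   # j is not in f_{c_k}(i): forbid this pair
                row.append(BIG)
            else:               # compare positions relative to the seed nodes
                row.append(sum(abs(dist(X1[i], X1[w1]) - dist(X2[j], X2[w2]))
                               for w1, w2 in seed))
        M.append(row)
    return M


def min_cost_assignment(M):
    """Exhaustive search, used here as a stand-in for the Hungarian algorithm."""
    n, m = len(M), len(M[0])
    if n <= m:
        best = min(permutations(range(m), n),
                   key=lambda cols: sum(M[r][cols[r]] for r in range(n)))
        return [(r, best[r]) for r in range(n)]
    best = min(permutations(range(n), m),
               key=lambda rows: sum(M[rows[s]][s] for s in range(m)))
    return sorted((best[s], s) for s in range(m))


def extend_alignment(U1, U2, D, c_k, seed, X1, X2):
    """Keep only assigned pairs whose cost is below the forbidding constant."""
    M = cost_matrix(U1, U2, D, c_k, seed, X1, X2)
    return [(U1[r], U2[s]) for r, s in min_cost_assignment(M) if M[r][s] < BIG]


# Hypothetical example: nodes 4 and 5 are locally indistinguishable (distance 1 to
# both partners), but their positions relative to the seed resolve the ambiguity.
X1 = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 0, 0), (0, 2, 0)]
X2 = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 2, 0), (2, 0, 0)]
seed = [(0, 0), (1, 1), (2, 2), (3, 3)]
D = [[9] * 6 for _ in range(6)]
for i, j in [(4, 4), (4, 5), (5, 4), (5, 5)]:
    D[i][j] = 1
print(extend_alignment([4, 5], [4, 5], D, 1, seed, X1, X2))
```

In the example, node 4 of the first graph is matched with node 5 of the second and vice versa, because these are the partners that occupy the same position relative to the four seed nodes.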

This procedure iterates until all nodes of one graph are assigned or until a predefined upper cost value c_max has been reached, with the remaining nodes assigned to gaps. If no such upper limit is set, SEGA calculates a global graph alignment that considers all nodes of a graph. A limit below c_max can be regarded as a stringency constraint for the partial alignment that controls the tolerated amount of structural deviation. From a biological point of view, it might be reasonable to set such an upper limit and retrieve just a partial alignment, e.g., for two binding sites that share a similar subpocket while being globally dissimilar. Of course, a proper threshold is not clear in advance and should be chosen based on the application at hand. The complete procedure is summarized in pseudo-code in Algorithm 2.
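The role of the cost levels c_k and of the cut-off c_max can be made concrete with a small sketch. It assumes that the levels are the distinct values occurring in D, indexed from k = 0, and that a gap is represented by None; both choices, as well as the function names, are assumptions made for illustration.

```python
def cost_levels(D, c_max=None):
    """Distinct values of D in increasing order (c_0 < c_1 < ...), cut off at c_max."""
    levels = sorted({d for row in D for d in row})
    return levels if c_max is None else [c for c in levels if c <= c_max]


def with_gaps(alignment, n1, n2):
    """Append (i, None) or (None, j) for every node that was left unassigned."""
    used1 = {i for i, _ in alignment}
    used2 = {j for _, j in alignment}
    gaps1 = [(i, None) for i in range(n1) if i not in used1]
    gaps2 = [(None, j) for j in range(n2) if j not in used2]
    return list(alignment) + gaps1 + gaps2


D = [[0, 4], [1, 9]]
print(cost_levels(D))             # all levels: global alignment
print(cost_levels(D, c_max=4))    # stricter cut-off: partial alignment
print(with_gaps([(0, 1)], 2, 3))  # unassigned nodes are paired with gaps
```

With a smaller c_max fewer levels are visited, so fewer node pairs can be assigned and more nodes end up aligned to gaps.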

While it is theoretically more appealing, the question remains whether the strategy suggested above will yield an improvement over simply generating the alignment by

Algorithm 2 SEGA: Constructs a global alignment A for the graphs G₁, G₂
Require: distance matrix D, graph G₁ = {V₁, E₁}, graph G₂ = {V₂, E₂}
S = ∅, W₁ = ∅, W₂ = ∅
k = 0
while c_k ≤ c_max do
    for all (v_i⁽¹⁾, v_j⁽²⁾) with d_ij ≤ c_k do
        f_c_k(v_i⁽¹⁾) ← {v_j⁽²⁾ ∈ V₂ | d_ij ≤ c_k}
        g_c_k(v_j⁽²⁾) ← {v_i⁽¹⁾ ∈ V₁ | d_ij ≤ c_k}
    for all (v_i⁽¹⁾, v_j⁽²⁾) do
        if f_c_k(v_i⁽¹⁾) = {v_j⁽²⁾} and g_c_k(v_j⁽²⁾) = {v_i⁽¹⁾} then
            add (v_i⁽¹⁾, v_j⁽²⁾) to A, add v_i⁽¹⁾ to W₁, add v_j⁽²⁾ to W₂
        else
            add (v_i⁽¹⁾, v_j⁽²⁾) to S
    if |A| > 4 then
        if S ≠ ∅ then
            U₁^k ← {v_i⁽¹⁾ ∈ V₁ | f_c_k(v_i⁽¹⁾) ≠ ∅} \ W₁
            U₂^k ← {v_j⁽²⁾ ∈ V₂ | g_c_k(v_j⁽²⁾) ≠ ∅} \ W₂
            M, C ← construct matrix(U₁^k, U₂^k, S, D)
            A_H ← hungarian algorithm(M)
            for all (v_i⁽¹⁾, v_j⁽²⁾) ∈ A_H, M_ij < C do
                add (v_i⁽¹⁾, v_j⁽²⁾) to A, add v_i⁽¹⁾ to W₁, add v_j⁽²⁾ to W₂
            S ← ∅
    else
        S ← S ∪ A, A ← ∅, W₁ ← ∅, W₂ ← ∅
        S₄ = {X ⊂ S | |X| = 4}
        if S₄ ≠ ∅ then
            S_min ← X ⊂ S₄, dev(X) ≤ dev(X′), X′ ⊂ S₄ {select S_min with minimal spatial deviation dev(S_min)}
            for all (v_i⁽¹⁾, v_j⁽²⁾) ∈ S_min do
                add (v_i⁽¹⁾, v_j⁽²⁾) to A, add v_i⁽¹⁾ to W₁, add v_j⁽²⁾ to W₂
    k = k + 1
return Alignment A for the graphs G₁, G₂

Algorithm 3 construct matrix: Constructs a cost matrix from ambiguous mapping candidates
Require: S set of candidate mappings of nodes, U₁^k, U₂^k
C = MaxValue
for all v_i⁽¹⁾ ∈ U₁^k do
    for all v_j⁽²⁾ ∈ U₂^k do
        if (v_i⁽¹⁾, v_j⁽²⁾) ∉ S then
            M_ij = C
        else
            M_ij = Σ_{q=1}^{|W₁|} | |v_i⁽¹⁾ − w_q⁽¹⁾| − |v_j⁽²⁾ − w_q⁽²⁾| | , w_q⁽¹⁾ ∈ W₁, w_q⁽²⁾ ∈ W₂
return cost matrix M, constant C

solving a weighted optimal assignment problem. Thus, the SEGAHA (SEmi-global Graph Alignment - Hungarian Algorithm) variant is proposed as an alternative, in which the incremental assignment of nodes is replaced by the Hungarian algorithm (Kuhn, 2005). The question of which of the two approaches is more suitable for protein binding site comparison is addressed in Chapter 5.

From the document: Graph-Based Approaches to Protein Structure Comparison - From Local to Global Similarity (pages 129-134)