

3.1 Geometric Approaches

models also play an important role beyond the domain of bioinformatics. For example, graphs can be used to model other kinds of networks, such as social networks (Wassermann and Faust, 1994), HTML/XML documents (Page et al., 1998), or the internet itself (Borgwardt, 2007).

Another class of modeling concept is sets of points in the Euclidean space, where points are enriched with additional information. Surprisingly, this representation is not often used for the analysis of proteins. Instead, it is applied in image processing, pattern recognition, cartography, industrial inspection and robotics (Thompson et al., 1999; Fabio, 2003; Höfle et al., 2007; Irfanoglu et al., 2004; Munoz et al., 2008), though it has some advantages compared with graphs: Data is often given as a set of observations that are distributed in the Euclidean space, hence a set of points becomes the natural representation, guaranteeing no loss of information that can occur during transformation from one representation to another. Moreover, processing on sets of points may allow more efficient calculations, since the combinatorial character of many algorithmic problems on graphs, which often leads to NP-hard problems, usually does not appear for algorithms operating on point sets.

Both representations, graphs as well as point clouds, are appropriate for modeling protein binding sites. Algorithms from the literature that operate on these representations are therefore introduced in the following.

Besides general algorithms which operate on arbitrary graphs and arbitrary point clouds, specialized methods are presented that exploit the additional information given by proteins.

must be extended before being applied on protein binding sites.

For point clouds, different measures have been proposed. Some are exact measures, often combined with the ability to establish a one-to-one mapping and thus an alignment of points. Other methods, which originate from the need to allow a certain error tolerance, are often based on the Hausdorff distance, a standard concept to measure the distance between two subsets of a metric space. Other works use geometric hashing or ε-neighborhoods, both especially for the calculation of an approximate alignment. Measures establishing correspondences between points are usually quite expensive. An interesting approach is therefore to transform a set of points into a feature vector and to compare pairs of feature vectors afterwards. A promising way of doing this was presented by Kupas et al. (2007), who used wavelet functions for this purpose. The point set was decomposed into circular patches and wavelet functions approximating the patches were fitted. The resulting coefficients were subsequently used to describe the geometry. This approach can be easily adapted to labeled point clouds by extending the feature vector, e.g., by the percentage of points within a patch exhibiting a certain label, as proposed by the aforementioned authors.

3.1.1 Exact Point Matching

Interesting methods following the exact matching concept were developed by Atkinson (1987); Alt et al. (1988); Sprinzak and Werman (1994). In (Atkinson, 1987) an efficient technique is used that reduces the problem of point matching to the problem of string matching, for which efficient algorithms are known (Chao and Zhang, 2009). This technique, however, considers point clouds in two dimensions only. The extension by Alt et al. (1988) considers planar graphs instead of strings and calculates an isomorphism of these planar graphs by using an efficient algorithm (Hopcroft and Wong, 1974), leading to a matching of 3-dimensional point clouds. Also in (Atkinson, 1987), coordinate vectors are used to find an exact matching of points in three dimensions. Sprinzak and Werman (1994) use canonical forms for this purpose. Generally, exact point matching is comparable to graph isomorphism, hence suffers from the same problems: Although exact point matching can be calculated efficiently, it is not appropriate as a measure for protein binding sites since these concepts allow absolutely no error tolerance and do not take labels into consideration. A promising alternative to exact point matching therefore is approximate point matching.

3.1.2 Approximate Point Matching

The problem of approximate point matching is typically defined as finding the minimal number ε such that, after proper translation and rotation, each point in the first cloud has a counterpart in the second cloud that falls into its ε-neighborhood, and vice versa. Obviously, the minimal ε can serve as a distance measure between point clouds. Different concepts were introduced: those that guarantee that each point has exactly one counterpart, and those that allow a point to match more than one point in the other set. In the latter case the Hausdorff distance is often used as a distance measure.
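One way to write the one-to-one variant of this definition down formally is sketched below. The symbols $A$, $B$, $T$, $\pi$ and $\varepsilon^{*}$ are notation assumed here for illustration; they do not come from the cited works.

```latex
% A, B: finite point sets in R^3 with |A| = |B| (assumed notation)
% T: a rigid motion (rotation followed by translation)
% \pi: a bijection assigning each point of A its counterpart in B
\varepsilon^{*}(A, B) \;=\;
  \min_{T}\;
  \min_{\pi : A \to B \text{ bijective}}\;
  \max_{a \in A}\; \lVert T(a) - \pi(a) \rVert
```

Dropping the bijectivity requirement on $\pi$, so that several points may share a counterpart, gives the many-to-one variant mentioned in the text.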

To solve the one-to-one correspondence problem, Alt et al. (1988) combined algebraic curves traced by certain points with an algorithm solving the bipartite graph matching problem (Kuhn, 2005). This approach unfortunately suffers from its complexity, which is polynomial of order eight. More efficient approaches are given by Efrat and Itai (1996) and Arkin et al. (1992), which require, respectively, that only translation is considered or that the ε-neighborhood regions are disjoint. Decision and approximate decision algorithms for this kind of problem are given in (Heffernan and Schirra, 1992), which solve the problem e.g. with network flow algorithms (Corman et al., 2001). The concepts presented here seem very interesting since they allow the calculation of a distance between point clouds and, moreover, an alignment that is given by the calculated correspondences. Unfortunately, all these concepts come with high complexity that can be reduced only by making assumptions that do not allow the resulting methods to be used on protein binding sites. Other problems of these methods are the requirement of equal-sized point clouds and the inability to consider label information.

Algorithms that are based on the Hausdorff distance do not require point sets of equal cardinality, hence are in this regard more appropriate for the purpose of a protein binding site comparison. Moreover, the calculation of this distance can be performed very efficiently using Voronoi diagrams (Alt et al., 1991). Unfortunately, the Hausdorff distance is not invariant under translation and rotation, thus an optimal transformation must be calculated for sets having different origins. Huttenlocher et al. (1993a) use properties of Voronoi diagrams to calculate an optimal transformation. To calculate an optimal rotation, however, dynamic Voronoi diagrams must be used, which leads to a polynomial complexity of order six. Modifications of this approach consider the one-way Hausdorff distance, hence seek partial matches (Huttenlocher et al., 1993b). As Alt and Guibas (1996) note, these approaches are numerically unstable and hard to implement, which makes a further extension to labeled points very difficult. More promising algorithms were proposed by Goodrich et al. (1994); Hoffmann et al. (2010): While the former is an approximation of the Hausdorff distance with approximation factor eight, the latter defines its own distance measure (different from the Hausdorff distance) and optimizes it with gradient descent methods to find the optimal translation and rotation. This approach already takes point labels into account, hence it can be used directly for the comparison of protein binding sites.
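To make the Hausdorff distance concrete, the following is a minimal brute-force sketch in Python for a fixed placement of both point sets. It is not the Voronoi-based method cited above and performs no search over transformations. The function names `directed_hausdorff` and `hausdorff` and the example coordinates are chosen here for illustration only.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def directed_hausdorff(a, b):
    """One-way Hausdorff distance: the largest distance from a point in a
    to its nearest neighbor in b."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two one-way distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Two point sets of different cardinality (illustrative data).
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

print(directed_hausdorff(a, b))  # every point of a also occurs in b
print(hausdorff(a, b))           # dominated by the extra point in b

# Translating one set changes the value, so the measure is not
# invariant under rigid motions unless a transformation is optimized.
b_shifted = [(x + 5.0, y, z) for x, y, z in b]
print(hausdorff(a, b_shifted))
```

The sketch takes quadratic time in the number of points. The one-way function corresponds to the partial-match setting mentioned above, where only one set needs to be covered by the other.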

Another interesting approach, namely geometric hashing, allows calculation of a partial alignment between point clouds independent of a certain distance measure (Shatsky et al., 2006; Bachar et al., 1993; Leibowitz et al., 1999; Lamdan and Wolfson, 1988; Wolfson and Rigoutsos, 1997; Leibowitz et al., 2001). These approaches have in common the use of a hash table that contains k-tuples drawn from a certain point set. This hash table is used in the recognition phase, where k-tuples of the other point sets are drawn and looked up in the table. If two matching k-tuples are found, one from the first and one from the second point set, they can be superimposed and thus define a transformation that can be used to derive an alignment. This approach allows one to enrich the points with additional information. Moreover, it allows one to calculate multiple alignments. Unfortunately, the hash table can become quite large, which often leads to a high runtime and additionally to a failure of the whole approach on large inputs due to a memory overflow.
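The two phases described above can be illustrated with a simplified Python sketch for k = 3. As an assumption made here, each ordered triple is keyed by its quantized side lengths, which are invariant under rotation and translation. The cited methods use more elaborate reference frames and voting schemes. The names `triple_key`, `build_table`, `candidate_matches` and the parameter `bin_width` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, permutations
from math import dist


def triple_key(p, q, r, bin_width):
    """Quantized side lengths of an ordered triple of points."""
    sides = (dist(p, q), dist(q, r), dist(p, r))
    return tuple(round(d / bin_width) for d in sides)


def build_table(points, bin_width=0.1):
    """Preprocessing phase: index every ordered triple of the model set."""
    table = defaultdict(list)
    for idx in permutations(range(len(points)), 3):
        p, q, r = (points[i] for i in idx)
        table[triple_key(p, q, r, bin_width)].append(idx)
    return table


def candidate_matches(table, points, bin_width=0.1):
    """Recognition phase: look up triples of the query set in the table and
    yield (model_triple, query_triple) index pairs with equal keys."""
    for idx in combinations(range(len(points)), 3):
        p, q, r = (points[i] for i in idx)
        for model_idx in table.get(triple_key(p, q, r, bin_width), []):
            yield model_idx, idx


# Illustrative data: the query is a rotated and translated copy of the
# first three model points.
model = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
query = [(10.0, 10.0, 0.0), (10.0, 13.0, 0.0), (6.0, 10.0, 0.0)]

table = build_table(model)
matches = list(candidate_matches(table, query))
print(matches)
```

Each reported pair is only a candidate: quantization can produce false positives and can miss triples whose side lengths fall on opposite sides of a bin boundary, so a subsequent superposition and verification step is needed. The table stores n(n-1)(n-2) ordered triples for n model points, which illustrates the memory growth noted above.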

In (Wang and Wang, 2000), point clouds are transformed into 3D graphs and hashing is applied for a fast similarity search. In a recent paper, Bach (2008) proposes transforming a point cloud into a graph and applying kernels on such graphs afterwards, an approach that has already long been used, e.g. in bio- and chemoinformatics (Borgwardt et al., 2005).

3.1.3 Superposition based on One-to-One Correspondences

If the one-to-one correspondence between points is known and the goal is to find the optimal superposition of point sets, different algorithms can be used. A well-known method often applied in chemoinformatics is described in (Kabsch, 1976), which minimizes the root mean squared deviation of two point sets by using simple matrix operations. Another approach analyzes the lower envelope of multivariate functions, i.e., functions of more than one variable, to superimpose two point sets (Imai et al., 1989). At first sight, such methods seem worthless since they already require an alignment. However, such tools can be used e.g. to improve the score obtained from a (partial) alignment, as done in CavBase. Moreover, some approaches require such techniques as a sub-procedure. An example is the geometric hashing approach, which looks for approximately equivalent k-tuples that are used to determine the transformation.
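As an illustration of superposition with known correspondences, the sketch below uses the SVD-based formulation that is commonly associated with Kabsch's method. It is an assumption that this matches the cited paper in detail, since the original derivation may differ. NumPy is assumed to be available, and the names `kabsch` and `rmsd` as well as the test data are chosen here for illustration.

```python
import numpy as np


def kabsch(p, q):
    """Return rotation r and translation t such that p @ r.T + t
    approximates q with minimal RMSD. Row i of p corresponds to row i of q."""
    p_centroid, q_centroid = p.mean(axis=0), q.mean(axis=0)
    p_c, q_c = p - p_centroid, q - q_centroid
    h = p_c.T @ q_c                         # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # -1 would indicate a reflection
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = q_centroid - r @ p_centroid
    return r, t


def rmsd(p, q):
    """Root mean squared deviation between corresponding rows of p and q."""
    return float(np.sqrt(np.mean(np.sum((p - q) ** 2, axis=1))))


# Illustrative data: q is p rotated by 30 degrees about the z-axis and shifted.
angle = np.deg2rad(30.0)
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0,            0.0,           1.0]])
shift = np.array([1.0, -2.0, 0.5])
p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
q = p @ rot.T + shift

r, t = kabsch(p, q)
print(rmsd(p @ r.T + t, q))  # close to zero for noise-free data
```

In a geometric hashing pipeline, such a routine could be applied to a matched pair of k-tuples to obtain the transformation, and then again to all resulting correspondences to refine the alignment.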