

1.3 Goals of this Thesis

Considering CavBase and its architecture (cf. Figure 1.2), there are several possibilities for improvement. Kuhn (2004), for example, considered the second level of Figure 1.2 and developed another representation of protein binding sites, in which each pseudocenter is described by a vector specifying the degree to which each of the seven physicochemical properties is present. The first level of CavBase was investigated by various researchers, who proposed different methods for the detection of protein binding sites based on different algorithmic and biological concepts (Pérot et al., 2010).

In this thesis, similarity retrieval, i.e., the last level of Figure 1.2, is considered, which is the bottleneck of CavBase. In contrast to the other levels, this step is applied many times for each protein binding site and therefore requires a very efficient realization. However, searching for common subgraphs by means of clique detection on the product graph comes with some drawbacks: The product graph of two graphs $G = (V, E)$ and $G' = (V', E')$ consists of $O(|V| \cdot |V'|)$ nodes and hence of $O(|V|^2 \cdot |V'|^2)$ edges, which becomes too large even for medium-sized graphs. Thus, such approaches often cannot be applied to the comparison of protein binding sites. Even if the product graph fits into memory, the identification of maximum cliques in a graph is known to be NP-hard, hence all algorithms solving the problem are a compromise between efficiency and quality of the solution. CavBase realizes the clique detection by the algorithm of Bron and Kerbosch (1973), which is exact, yet exponential in runtime (Tomita et al., 2006). However, the time needed to detect the cliques usually does not dominate the whole approach. In fact, the calculation of the set of values $\{S\}$ derived from the cliques is the most expensive part of the similarity retrieval in CavBase, since the expensive scoring must be performed 100 times.
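To make the size argument concrete, the following minimal sketch builds a product graph for two small, complete, distance-weighted graphs and enumerates its maximal cliques with the basic Bron-Kerbosch recursion (without pivoting). The data layout, the label names, the tolerance `eps` and the function names are assumptions chosen for illustration; this is not the CavBase implementation.

```python
from itertools import combinations


def product_graph(labels_a, dist_a, labels_b, dist_b, eps):
    """Build a product graph of two complete, distance-weighted graphs.

    labels_x: list of node labels (hypothetical label names).
    dist_x:   dict mapping frozenset({i, j}) to the edge weight.
    eps:      assumed tolerance within which two weights count as equal.
    """
    # One product node per pair of equally labeled nodes: at most |V| * |V'|.
    nodes = [(i, j)
             for i in range(len(labels_a))
             for j in range(len(labels_b))
             if labels_a[i] == labels_b[j]]
    adj = {node: set() for node in nodes}
    # Up to |V|^2 * |V'|^2 pairs of product nodes have to be inspected.
    for (i, j), (k, m) in combinations(nodes, 2):
        if i == k or j == m:
            continue  # each original node may be matched only once
        if abs(dist_a[frozenset((i, k))] - dist_b[frozenset((j, m))]) <= eps:
            adj[(i, j)].add((k, m))
            adj[(k, m)].add((i, j))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


labels_a = ["donor", "acceptor", "donor"]
dist_a = {frozenset((0, 1)): 4.0, frozenset((1, 2)): 5.0, frozenset((0, 2)): 6.0}
labels_b = ["donor", "acceptor", "donor"]
dist_b = {frozenset((0, 1)): 4.1, frozenset((1, 2)): 5.2, frozenset((0, 2)): 6.1}

adj = product_graph(labels_a, dist_a, labels_b, dist_b, eps=0.3)
cliques = list(bron_kerbosch(adj))
largest = max(cliques, key=len)  # corresponds to the largest common subgraph
```

In this toy case the label filter keeps 5 of the 9 possible product nodes, and the largest clique matches every node of the first graph to the node with the same index in the second. For realistic binding sites the quadratic growth of the node set and the exponential worst case of the clique enumeration are exactly the drawbacks described above.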

The goal of this work is therefore, on the one hand, to enable more efficient similarity measures between protein binding sites. As in CavBase, this problem can be considered on the level of graphs, the representation used so far for calculations on protein binding sites. However, the authors of CavBase already reported that this representation is not optimal, which is why CavBase performs additional steps that take the surface points into consideration. Therefore, on the other hand, a more robust representation based on labeled point clouds is considered. CavBase already stores protein binding sites as such labeled point clouds in 3-dimensional Euclidean space, which can be used directly for calculations. In contrast to graphs, this representation does not suffer from the loss of information that occurs when distances are considered instead of coordinates, as demonstrated in Figure 1.4.

[Figure 1.4, panels: (a) graph representing a structure A; (b) graph representing a structure B different from A]

Figure 1.4: Example of two equivalent graph descriptors (circles and lines) describing geometrically different structures (circles).
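One well-known way in which distances lose information relative to coordinates is mirror symmetry: a structure and its mirror image have identical pairwise distances, although no rotation maps one onto the other. The sketch below illustrates this with four hypothetical points; it is meant as an illustration of the general effect and is not necessarily the case drawn in Figure 1.4.

```python
from itertools import combinations
from math import dist, isclose


def pairwise_distances(points):
    """Distances between all point pairs, i.e. the edge weights of the graph."""
    return [dist(p, q) for p, q in combinations(points, 2)]


def signed_volume(a, b, c, d):
    """Determinant of (b-a, c-a, d-a); its sign encodes the handedness."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


# Hypothetical coordinates, chosen only for illustration.
structure = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
mirrored = [(-x, y, z) for x, y, z in structure]  # reflection at the yz-plane

same_distances = all(
    isclose(d1, d2)
    for d1, d2 in zip(pairwise_distances(structure), pairwise_distances(mirrored))
)
volume_original = signed_volume(*structure)
volume_mirrored = signed_volume(*mirrored)
```

Both point sets yield the same graph descriptor, while the opposite signs of the two volumes show that the structures differ in handedness. A method working on coordinates can detect this difference; a method working on distances alone cannot.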

A widely used technique to measure similarity between objects is to represent each object as a vector and to apply a measure to pairs of vectors. Therefore, in addition to the graph-based representation and the novel geometric representation of protein binding sites discussed so far, vector representations are also developed. Since the entries of such a vector are called features, these approaches are also referred to as feature-based approaches. Even though the transformation of a protein binding site into a vector is usually more complex than the transformation into a graph, this representation has the benefit that the subsequent comparison of vectors is much more efficient than a comparison of graphs. Since the transformation must be applied only once for each protein binding site, whereas comparisons are performed many times, a vector representation that is stored alongside the protein binding site will lead to an enormous gain in efficiency.
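The transform-once, compare-often pattern can be sketched as follows. The feature used here (a histogram over node labels) and the label names are placeholders chosen for brevity, not the features developed in this thesis; binding sites are assumed to be given as lists of `(label, coordinates)` pairs.

```python
from collections import Counter
from math import sqrt

# Hypothetical label set; the actual physicochemical properties may differ.
LABELS = ("donor", "acceptor", "aliphatic", "aromatic")


def to_vector(site):
    """Map a binding site, given as (label, coordinates) pairs, to a label histogram."""
    counts = Counter(label for label, _coords in site)
    return tuple(counts[label] for label in LABELS)


def euclidean(u, v):
    """Cheap vector comparison, linear in the number of features."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


sites = {
    "site_1": [("donor", (0, 0, 0)), ("acceptor", (1, 0, 0)), ("aromatic", (0, 1, 0))],
    "site_2": [("donor", (0, 0, 1)), ("acceptor", (2, 0, 0)),
               ("aromatic", (0, 3, 0)), ("aliphatic", (1, 1, 1))],
    "site_3": [("aliphatic", (0, 0, 0)), ("aliphatic", (1, 0, 0)), ("aliphatic", (0, 1, 0))],
}

# The transformation is applied once per site and stored with it.
index = {name: to_vector(site) for name, site in sites.items()}

# Every later query only needs fast vector comparisons.
query = to_vector([("donor", (5, 5, 5)), ("acceptor", (6, 5, 5)), ("aromatic", (5, 6, 5))])
ranking = sorted(index, key=lambda name: euclidean(query, index[name]))
```

The cost of `to_vector` is paid once per stored site, whereas each query costs only one vector comparison per database entry, independent of the size of the underlying graph.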

Independent of the issue of representation, more advanced techniques are required to handle noise and mutations, which often appear in the life sciences. In the case of protein binding sites, error tolerance is moreover required to handle different conformations caused by an induced fit. In CavBase, some error tolerance was enabled by an appropriate definition of the product graph, in which edges are assumed equivalent if they agree within a predefined threshold. Moreover, the concept of subgraph isomorphism provides an algorithmic error tolerance, since similarity can be obtained even if the graphs do not match exactly. The error tolerance provided by these concepts, however, is quite low, which necessitates the development of more robust methods and the investigation of their behavior in comparison to less flexible methods.

Finally, motivated by the work of Weskamp et al. (2007), the problem of multiple structural alignment is also considered in this thesis. So far, this problem has been tackled in a way similar to the calculation of similarity in CavBase: the common subgraph is calculated, which forms a partial alignment that is greedily extended to a complete alignment. A multiple alignment is then formed using star alignment (Böckenhauer and Bongartz, 2007), the state-of-the-art technique for merging pairwise alignments into multiple ones.

This greedy procedure does not perform optimally in most cases, hence other techniques should be applied to calculate such a graph alignment. Moreover, motivated by the drawbacks of graphs, an analogous approach is required for the geometric representation of a protein binding site.
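The star-alignment step mentioned above can be sketched as follows: one structure is chosen as the center, and the pairwise alignments between the center and all other structures are merged row by row. The data layout (dictionaries mapping center nodes to nodes of another structure), the similarity scores and all names are assumptions made for illustration. Nodes of non-center structures that are not matched to any center node are simply dropped in this simplified version.

```python
def choose_center(names, score):
    """Pick the structure with the highest total pairwise similarity."""
    return max(
        names,
        key=lambda c: sum(score[frozenset((c, o))] for o in names if o != c),
    )


def star_alignment(center, center_nodes, pairwise):
    """Merge pairwise alignments against the center into a multiple alignment.

    pairwise[other] maps a center node to its partner in `other`;
    a missing entry is treated as a gap (None).
    """
    others = sorted(pairwise)
    header = [center] + others
    rows = [
        [node] + [pairwise[other].get(node) for other in others]
        for node in center_nodes
    ]
    return header, rows


# Hypothetical pairwise similarity scores between three structures.
names = ["A", "B", "C"]
score = {
    frozenset(("A", "B")): 3.0,
    frozenset(("A", "C")): 2.0,
    frozenset(("B", "C")): 1.0,
}
center = choose_center(names, score)

# Hypothetical pairwise alignments of the center against the other structures.
pairwise = {
    "B": {0: "b0", 1: "b2", 2: "b1"},
    "C": {0: "c1", 2: "c0"},
}
header, rows = star_alignment(center, [0, 1, 2], pairwise)
```

Because every row is anchored at a center node, the quality of the result depends entirely on the pairwise alignments with the center. Errors made by a greedy pairwise procedure are therefore carried over into the multiple alignment.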

2 Preliminaries

In the previous chapter, important problems that pharmaceutical chemists have to tackle were discussed, namely the determination of similarity and the construction of alignments. Before presenting algorithms capable of solving these tasks, this chapter introduces the fundamental tools used in the remainder of this thesis. The discussion comprises techniques for modeling protein binding sites, but also techniques for optimizing complex functions for which gradient-based approaches fail, and techniques for processing imprecise and vague information.

For modeling protein binding sites, at least two approaches can be applied. As already done in CavBase, a model based on graphs can be used. Another method, introduced in this thesis, adopts a novel representation based on points in 3-dimensional Euclidean space, which offers some advantages compared to graphs. The optimization problems arising in this thesis often have properties that make them hard to solve, so that evolutionary algorithms become an important tool. Here, evolutionary strategies turn out to be especially useful. Another useful concept is provided by fuzzy logic, which offers a large set of tools to model and process vague and imprecise data.
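As a rough illustration of why such methods help when gradients are unavailable, the following sketch minimizes a non-differentiable function with a simple (1+1) strategy and a success-based step-size rule. The variant, the adaptation factors, the toy objective and all names are assumptions made for illustration; the strategies actually used in this thesis may differ.

```python
import random


def one_plus_one_es(f, start, sigma=1.0, iterations=1000, seed=0):
    """Minimize f with a (1+1) strategy and success-based step-size adaptation."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    parent, parent_value = list(start), f(start)
    for _ in range(iterations):
        # Mutation: add Gaussian noise to every coordinate of the parent.
        child = [x + rng.gauss(0.0, sigma) for x in parent]
        child_value = f(child)
        if child_value <= parent_value:
            # Selection: the child replaces the parent, the step size grows.
            parent, parent_value = child, child_value
            sigma *= 1.5
        else:
            # Failure: shrink the step size (equilibrium at a success rate of 1/5).
            sigma *= 1.5 ** -0.25
    return parent, parent_value


def objective(x):
    """Toy objective, not differentiable at its minimum (1, 2)."""
    return abs(x[0] - 1.0) + abs(x[1] - 2.0)


best, best_value = one_plus_one_es(objective, start=[5.0, -3.0])
```

The search only evaluates the objective function and never requires derivatives, which is what makes this family of methods applicable to objectives that are discontinuous or non-differentiable.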

Parts of this chapter were already discussed and published in (Fober et al., 2007), (Fober et al., 2009c), (Fober et al., 2009d), (Fober et al., 2011) and (Hüllermeier et al., 2013).