Adding Substructures to the Database - Results and Discussion of NAOMInova

7. Results and Discussion of NAOMInova

7.3. Adding Substructures to the Database

The performance of the substructure-adding step is evaluated using three different substructures. The exact definitions of these substructures can be found in Section 5.3.4. A schematic depiction of the SMARTS patterns for all three substructures is shown in Figures 7.2a-d. All three substructures share the same fragment part in their SMARTS, which is 'CCO' (see Figure 7.2a). They differ only in their recursive description of the first carbon (see Figures 7.2b-d).

SMARTS patterns shown in Figure 7.2:

Hydroxyethyl 1: CC[OH] (no recursion)

Hydroxyethyl 2: [C$(C[CR1])]C[OH]

Hydroxyethyl 3: [C$(CC1CCCCC1)]C[OH]

Figure 7.2.: Three different substructures are added to six different databases. a)-d) Schematic depiction of the three substructures. Pictures were generated with SMARTSviewer [122]. a) The fragment part of all three substructures. b) Surrounding part of Hydroxyethyl 1. c) Surrounding part of Hydroxyethyl 2. d) Surrounding part of Hydroxyethyl 3. e) Runtimes for the complete substructure-adding process on databases containing different data sets of protein-ligand complexes. f)-h) Number of detected hits in the SMARTS matching step plotted against the runtime for adding one of the three substructures to different databases, respectively. In each plot, a linear regression curve is shown as a green dotted line.

The runtimes for adding each of the three substructures to databases containing different sets of PDB files are shown in Figure 7.2e. Overall, the runtime for adding Hydroxyethyl 1 is longer than for Hydroxyethyl 2 and Hydroxyethyl 3, whose runtimes are almost identical. The complete procedure can be divided into two steps: (1) preparation of protein-ligand complexes and EDIA values and (2) data collection.

For Hydroxyethyl 1, the data collection takes about 50% of the complete runtime. For Hydroxyethyl 2 and Hydroxyethyl 3, this share is only about 20%. The data collection step can in turn be divided into the SMARTS matching procedure and the handling of all detected hits.

Interestingly, for Hydroxyethyl 1, SMARTS matching accounts for only about 20% of the data collection step. For Hydroxyethyl 2 and Hydroxyethyl 3, SMARTS matching requires about 78% and about 85% of the data collection time, respectively. The probable reason is that the SMARTS pattern of Hydroxyethyl 1 matches very frequently, so a large number of matches has to be handled. Hence, handling the results requires much more runtime than detecting them. In contrast, only a few results are detected for Hydroxyethyl 2 and 3, so more time is required for their detection than for their subsequent preparation.
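The shares quoted above can be combined into an approximate split of the complete runtime. The following is a minimal sketch, assuming that preparation and data collection together make up the complete runtime and that data collection consists only of SMARTS matching and hit handling, as stated in the text. The function and variable names are illustrative and not part of NAOMInova.

```python
# Approximate shares quoted in the text (as fractions).
shares = {
    "Hydroxyethyl 1": {"collection_of_total": 0.50, "matching_of_collection": 0.20},
    "Hydroxyethyl 2": {"collection_of_total": 0.20, "matching_of_collection": 0.78},
    "Hydroxyethyl 3": {"collection_of_total": 0.20, "matching_of_collection": 0.85},
}


def breakdown(collection_of_total, matching_of_collection):
    """Split the complete runtime into preparation, SMARTS matching, and hit handling."""
    matching = collection_of_total * matching_of_collection
    handling = collection_of_total - matching
    preparation = 1.0 - collection_of_total
    return {"preparation": preparation, "matching": matching, "handling": handling}


for name, s in shares.items():
    parts = breakdown(**s)
    # Print each part as a rounded percentage of the complete runtime.
    print(name, {step: f"{value:.0%}" for step, value in parts.items()})
```

Under these assumptions, SMARTS matching accounts for roughly 10% of the complete runtime for Hydroxyethyl 1 and for roughly 16% to 17% for Hydroxyethyl 2 and 3, while preparation dominates in all three cases.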

In Figures 7.2f-h, the overall runtime for the complete substructure-adding process is plotted against the number of detected hits for each substructure, respectively. The correlation can be described with a linear regression line, indicated by the green dotted line in Figures 7.2f-h. The lines differ strongly in their slope, which indicates the time required for the detection and preparation of one hit. As expected, the slope for Hydroxyethyl 1 is very small, indicating that the data collection step per hit is very fast here. However, the number of detected hits is very large, which results in a long overall runtime. Interestingly, the overall runtimes for Hydroxyethyl 2 and Hydroxyethyl 3 are almost identical despite the much larger number of hits detected for Hydroxyethyl 2. Accordingly, the slope of the regression line for Hydroxyethyl 3 is much larger than the slope for Hydroxyethyl 2. This is a result of the more time-consuming SMARTS matching per hit for Hydroxyethyl 3.

These results show that the runtime required for adding a substructure grows linearly with the number of detected hits. Since every hit has to be handled individually, this linear scaling cannot be avoided.
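The linear relationship between hits and runtime can be illustrated with an ordinary least-squares fit. The sketch below uses invented (hits, runtime) pairs for six databases; these numbers are purely hypothetical and are not the measurements behind Figure 7.2. The helper name fit_line is likewise an assumption for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data for six databases (NOT taken from the thesis).
hits = [1_000, 5_000, 20_000, 50_000, 100_000, 200_000]
runtime_s = [12.0, 20.0, 50.0, 110.0, 210.0, 410.0]

slope, intercept = fit_line(hits, runtime_s)
# The slope is the runtime attributed to one additional hit.
print(f"time per hit: {slope * 1000:.2f} ms")
```

In this reading, a small slope combined with a very large number of hits can still produce a long overall runtime, which is the situation described for Hydroxyethyl 1.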

For all three substructures, the initial preparation of the data is one of the most time-consuming steps. This step includes the reconstruction of protein-ligand complexes from the database and the reconstruction of the EDIA values for all atoms. During this step, the same reconstruction procedures are used for each of the protein-ligand complexes. Hence, even a slight per-complex improvement can lead to a much shorter runtime for the overall reconstruction step. This could be achieved by using a more efficient way of storing the reconstructed data. However, this procedure is performed only once each time new substructures are added to the database. Thus, it is favorable to add several substructures at a time.
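The benefit of adding several substructures at once can be made concrete with a simple cost model. This is a minimal sketch, assuming that the preparation time per run does not depend on how many substructures are added and that the data collection times simply add up. All numbers and the function name total_runtime are hypothetical.

```python
def total_runtime(prep_s, collection_s, batched):
    """Total runtime in seconds for adding several substructures.

    prep_s:       preparation time of one run (complexes and EDIA values)
    collection_s: data collection time per substructure
    batched:      True if all substructures are added in a single run
    """
    runs = 1 if batched else len(collection_s)
    return runs * prep_s + sum(collection_s)


# Hypothetical timings in seconds (NOT measured values).
prep = 100.0
collection = [10.0, 20.0, 30.0]

print("one run per substructure:", total_runtime(prep, collection, batched=False))
print("single batched run:      ", total_runtime(prep, collection, batched=True))
```

Under this model, the saving grows with the number of substructures per run, because the preparation cost is paid only once.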

| Filter criteria for partner points | Hydroxyethyl 1 | Hydroxyethyl 2 | Hydroxyethyl 3 |
|---|---|---|---|
| no filter | 2.8·10⁷ pps, 134 s ± 5 s | 7·10⁵ pps, 3.7 s ± 0.4 s | 2 235 pps, < 1 s |
| element type = oxygen | 1.9·10⁷ pps, 96 s ± 2 s | 4.6·10⁵ pps, 2.2 s ± 0.1 s | 1 636 pps, < 1 s |
| element type = oxygen, location = ligand | 2.4·10⁵ pps, 4.4 s ± 0.2 s | 5.2·10⁴ pps, < 1 s | 143 pps, < 1 s |

Table 7.1.: Runtimes and number of received partner points (pps) for three different filters on three different substructures based on the PDB_2.5 data set.

For more complicated SMARTS patterns, the SMARTS matching is also a highly time-consuming step. Accelerating this step could therefore also lead to shorter overall runtimes. This could be achieved by using fingerprint techniques. Since the fragment part of the substructure description used here always describes a unique molecular fragment, fingerprints that store the occurrence of specific substructures in molecules can be applied. Using such a fingerprint, the number of candidate molecules in which the SMARTS pattern may occur could be reduced rapidly, and the exact matching algorithm would have to be performed only on the relevant molecules. A similar approach is applied by Relibase and Relibase+ for substructure mining in small molecules.
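The fingerprint idea can be sketched as a bitwise prescreen that runs before the exact matcher. The example below is a toy illustration, not the approach of NAOMInova or Relibase: the key set, the hand-assigned fragment sets, and the function names are all assumptions, and a real system would derive the fingerprint bits from the molecular graph.

```python
# Hypothetical fragment keys; each key is assigned one bit.
KEYS = ["CCO", "C=O", "c1ccccc1", "N"]
BIT = {key: 1 << position for position, key in enumerate(KEYS)}


def fingerprint(fragments):
    """Encode a set of fragment keys as an integer bitmask."""
    fp = 0
    for fragment in fragments:
        fp |= BIT[fragment]
    return fp


def prescreen(query_fp, library):
    """Return the names of molecules whose fingerprint contains all query bits."""
    return [name for name, fp in library.items() if (fp & query_fp) == query_fp]


# Toy library with hand-assigned fragment sets (illustrative only).
library_fragments = {
    "ethanol": {"CCO"},
    "acetone": {"C=O"},
    "benzene": {"c1ccccc1"},
    "ethanolamine": {"CCO", "N"},
}
library = {name: fingerprint(frags) for name, frags in library_fragments.items()}

# Only molecules that pass the prescreen need the expensive exact SMARTS matching.
candidates = prescreen(fingerprint({"CCO"}), library)
print(candidates)
```

The prescreen can only rule molecules out. Every candidate still has to pass the exact matching step, in particular for the recursive part of the pattern, which the fragment bit does not capture.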

However, the results presented in this section show that about 5·10⁶ substructures can be detected and inserted into the database in about 160 minutes. In the typical use case, this step is performed only once and the database is afterwards used for several analyses. Hence, interactive behavior is not as important as in the filtering step, and runtimes of about 160 minutes are probably tolerable.
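As a rough orientation, the quoted figures translate into an average throughput of roughly 500 substructures per second. The short calculation below uses only the two approximate values from the text and ignores that the time per hit differs between substructures.

```python
substructures = 5e6   # "about 5·10⁶" substructures, from the text
minutes = 160         # "about 160 minutes", from the text

seconds = minutes * 60
per_second = substructures / seconds
ms_per_substructure = 1000 * seconds / substructures

print(f"{per_second:.0f} substructures/s, {ms_per_substructure:.2f} ms per substructure")
```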

Source document: Mining of Interaction Geometries in Collections of Protein Structures (pages 120-123)