Calculation and Storage of relevant Data - NAOMI nova - Deduction of Preferred Interaction Dire


4.4. NAOMInova - Deduction of Preferred Interaction Directions

4.4.2. Calculation and Storage of relevant Data

The data calculated here is stored in an SQLite database. The MoleculeDB, ProteinDB, and ComplexDB from the NAOMI library (see Section 4.2.7) are used to store the relevant information about proteins and ligands. Additionally, partner points are stored in the database.

These tables are governed by the PartnerPointDB, which is described in the following section. The process of detecting central substructures and storing partner points is described thereafter.

PartnerPointDB: stores 3D information and attributes about partner points.

Table 'points'

    column                     type
    point_key                  primary key
    edia center                float
    origin center              integer
    conformation_key_center    foreign key
    instance_key_center        foreign key
    origin partner             integer
    element partner            integer
    coordx partner             integer
    coordy partner             integer
    coordz partner             integer
    resolution partner         float
    distance partner           float
    edia partner               float
    substructures partner      blob
    inter/intra                integer
    amino acid partner         integer

Table 'pointmapping'

    column                     type
    point_key                  foreign key
    conformation_key           foreign key
    instance_key               foreign key
    atom position              integer

The foreign keys of both tables reference ProteinDB.proteins and MoleculeDB.instances.

Figure 4.13: Schematic depiction of the two tables of the PartnerPointDB. Arrows indicate cross references between tables. Herein, black arrows represent cross references within the PartnerPointDB, whereas blue and red arrows show cross references to the ProteinDB and to the MoleculeDB, respectively.

The PartnerPointDB

The PartnerPointDB stores all partner points and their attributes. Its main purpose is to quickly find all partner points fulfilling the query constraints and to provide a link to the original atom for each partner point. Figure 4.13 shows a schematic depiction of the tables which belong to the PartnerPointDB.

The table 'points' is generated for each added substructure. In this table, the main attributes of a partner point are stored, e.g., its 3D coordinates and its distance to the central substructure. Similar to the storing of the PRPs in the InteractionDB (see Section 4.3.2), the coordinates of a partner point are stored as integer values. Moreover, attributes of the central substructure for which a partner point was detected are stored in the table 'points', e.g., the combined EDIA value and a link to the original molecule.

Each partner point is identified via a unique id, called 'point key'. This id is also used in the second table, called 'pointmapping'. Here, a reference to the original molecule and the exact atom position is stored for each partner point. These values are required if the original structure in which a partner point was detected is to be displayed using the backlink functionality.

The table 'points' is required for every filtering process (explained in Section 4.4.3). The table 'pointmapping' is only used if a backlink to the original structure of a partner point is requested.
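The division of labor between the two tables can be illustrated with a minimal sketch using Python's built-in sqlite3 module. It is not the NAOMInova schema: only a subset of the columns from Figure 4.13 is used, the column names are written with underscores, and the fixed-point factor of 1000 for the integer coordinates as well as all inserted values are assumptions made for illustration.

```python
import sqlite3

# Assumption: coordinates are stored as fixed-point integers. The factor
# 1000 (three decimal places) is chosen for illustration only.
SCALE = 1000


def to_fixed(value):
    """Convert a float coordinate to its integer representation."""
    return round(value * SCALE)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE points (
    point_key        INTEGER PRIMARY KEY,
    edia_center      REAL,
    coordx_partner   INTEGER,
    coordy_partner   INTEGER,
    coordz_partner   INTEGER,
    distance_partner REAL,
    edia_partner     REAL
);
CREATE TABLE pointmapping (
    point_key        INTEGER REFERENCES points(point_key),
    conformation_key INTEGER,
    instance_key     INTEGER,
    atom_position    INTEGER
);
""")

# Two made-up partner points and their mappings to the original atoms.
conn.execute(
    "INSERT INTO points VALUES (1, 0.9, ?, ?, ?, 2.9, 0.85)",
    (to_fixed(1.234), to_fixed(-0.5), to_fixed(2.0)),
)
conn.execute(
    "INSERT INTO points VALUES (2, 0.7, ?, ?, ?, 4.1, 0.60)",
    (to_fixed(3.0), to_fixed(0.25), to_fixed(-1.5)),
)
conn.executemany(
    "INSERT INTO pointmapping VALUES (?, ?, ?, ?)",
    [(1, 10, 100, 7), (2, 11, 101, 3)],
)

# Filtering only touches the table 'points'.
close_keys = [
    row[0]
    for row in conn.execute(
        "SELECT point_key FROM points "
        "WHERE distance_partner <= ? AND edia_partner >= ?",
        (3.5, 0.8),
    )
]

# The backlink consults 'pointmapping' only for the points that passed.
backlinks = [
    conn.execute(
        "SELECT conformation_key, instance_key, atom_position "
        "FROM pointmapping WHERE point_key = ?",
        (key,),
    ).fetchone()
    for key in close_keys
]
```

Keeping the mapping in a separate table means that the frequently executed filter queries never read the backlink columns.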

Database construction

The construction of a database consists of three steps. In the first step, PDB files are entered into the database. As for Pelikan, the two file formats '.pdb' and '.mmCIF' are accepted. The procedure works for apo protein structures as well as for holo structures, i.e., structures of protein-ligand complexes. Apo structures are simply handled as protein-ligand complexes without a ligand. In the following, the procedure of constructing a NAOMInova database is therefore explained only for structures of protein-ligand complexes. The perception of molecules from PDB files is explained in Section 4.2.1. Upon entering the structures into the database, the proteins and molecules are stored in the tables of the ProteinDB and the MoleculeDB.

In the second step, substructures have to be registered. During this process, the SMARTS of each substructure is checked for validity and the name is checked for uniqueness. Afterwards, the central substructure is stored in a specific database table. The table contains one column for each of the five attributes of a central substructure. The name of the central substructure is used as primary key.

In the third step, the partner points are detected and stored. A schematic overview of this process is presented in Algorithm 3.

In principle, all hits of each substructure in each protein-ligand complex are searched using an outer loop over all protein-ligand complexes and an inner loop over all substructures. The SMARTS matching step is performed in Line 6 of Algorithm 3. Then, for each hit, the combined EDIA value EDIA_hit of the detected atoms and the affine transformation for its superimposition onto the substructure's template atoms are calculated. If EDIA_hit < EDIA_min or if the RMSD of the transformation is larger than 0.2, the hit is discarded. Otherwise, partner points in the vicinity of the detected atoms are collected in Line 16. Within this function, all atoms of the protein-ligand complex whose minimal distance to any of the detected atoms is below 4.5 Å are collected. Moreover, only atoms from the organic subset are considered as partner points here, except for carbon and phosphorus. These are atoms with an element type of oxygen, nitrogen, sulfur, halogen, or metal. The collection is performed using the spatial atom index from the NAOMI library (see Section 4.2.5). To this end, a k-D tree is created for all relevant atoms of each protein-ligand complex already in Line 4 of Algorithm 3. Partner points are then detected using range queries on the k-D tree. Finally, for each partner point whose EDIA value is greater than or equal to EDIA_min, all necessary attributes are calculated. These attributes are mainly the values stored in the table 'points' in the PartnerPointDB.

After all central substructures have been detected in one protein-ligand complex, the data calculated so far is stored in the database.
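The collection and filtering of partner points around a single hit can be sketched as follows. This is a simplified illustration and not the NAOMI implementation: a linear scan replaces the k-D tree range query, atoms are plain dictionaries, the hit atoms themselves are excluded from the result by assumption, and the EDIA threshold of 0.8 in the usage example is a made-up value.

```python
import math

# Elements excluded from partner points according to the text.
EXCLUDED_ELEMENTS = {"C", "P"}
MAX_DISTANCE = 4.5  # distance cutoff in Å, as quoted in the text


def collect_partner_points(hit_atoms, complex_atoms, edia_min):
    """Return (atom id, minimal distance) for all partner points of a hit.

    A linear scan over all atoms stands in for the k-D tree range query.
    Assumption: atoms belonging to the hit are not partner points.
    """
    hit_ids = {atom["id"] for atom in hit_atoms}
    partners = []
    for atom in complex_atoms:
        if atom["id"] in hit_ids or atom["element"] in EXCLUDED_ELEMENTS:
            continue
        # Minimal distance to any of the detected atoms of the hit.
        min_dist = min(math.dist(atom["xyz"], h["xyz"]) for h in hit_atoms)
        if min_dist < MAX_DISTANCE and atom["edia"] >= edia_min:
            partners.append((atom["id"], min_dist))
    return partners


# Hypothetical carbonyl hit (O=C) with four surrounding atoms.
hit = [
    {"id": 0, "element": "O", "xyz": (0.0, 0.0, 0.0), "edia": 0.9},
    {"id": 1, "element": "C", "xyz": (1.2, 0.0, 0.0), "edia": 0.9},
]
surrounding = [
    {"id": 2, "element": "N", "xyz": (0.0, 3.0, 0.0), "edia": 0.9},   # kept
    {"id": 3, "element": "C", "xyz": (0.0, 2.0, 0.0), "edia": 0.9},   # carbon
    {"id": 4, "element": "O", "xyz": (0.0, 0.0, 6.0), "edia": 0.9},   # too far
    {"id": 5, "element": "O", "xyz": (0.0, -3.0, 0.0), "edia": 0.5},  # low EDIA
]
partners = collect_partner_points(hit, hit + surrounding, edia_min=0.8)
```

With a k-D tree, the scan over all atoms is replaced by one range query per detected atom, which matters once complexes contain thousands of atoms.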

The upper bound of the algorithm's runtime is determined, on the one hand, by the runtime of the SMARTS matching algorithm. The problem of subgraph isomorphism is known to be NP-complete [96], and thus all known algorithms have an exponential worst-case runtime relative to the size of the graphs used. On the other hand, all steps in the loop starting in Line 7 of Algorithm 3 are performed for each detected hit. In terms of absolute runtime, the loop in Line 7 may therefore take longer than the SMARTS matching step in Line 6 if a large number of hits is detected. Which of these steps is more relevant for the overall runtime depends strongly on the SMARTS and the protein-ligand complex used.

Algorithm 3 Detection and Storage of Partner Points

 1: procedure detectAndStorePartnerPoints(database)
 2:   for all plc ∈ database.getPLCs do                  ▷ plc = protein-ligand complex
 3:     Let data be an empty list of partner points with attributes
 4:     tree = buildKD-tree(plc.getAllOrganicAtoms)
 5:     for all sub ∈ database.getSubs do                ▷ sub = central substructure
 6:       hits = findAllHits(sub.SMARTS, plc)
 7:       for all hit ∈ hits do
 8:         EDIA_hit = getCombinedEdia(hit)
 9:         if EDIA_hit < sub.EDIA_min then
10:           continue with next hit
11:         end if
12:         rmsd, trafo = getTransformation(hit, sub.templateAtoms)
13:         if rmsd > 0.2 then
14:           continue with next hit
15:         end if
16:         partnerPoints = getAllCloseAtoms(hit, tree)
17:         partnerPointsTrans = transform(partnerPoints, trafo)
18:         for all pp ∈ partnerPointsTrans do
19:           EDIA_pp = getEdia(pp)
20:           if EDIA_pp ≥ sub.EDIA_min then
21:             pp_data = getAttributesForPartnerPoint(pp)
22:             data.append(pp_data)
23:           end if
24:         end for
25:       end for
26:       addDataToDatabase(database, data)
27:     end for
28:   end for
29: end procedure

Due to possible symmetries of a central substructure, Algorithm 3 might behave ambiguously. Firstly, linear symmetry occurs in cases where the substructure's template atoms resemble a straight line, e.g., for the SMARTS 'O=C'. After transformation, the partner points are randomly distributed around the symmetry axis. No further projection of the partner points is performed here. Secondly, substructure symmetry might occur, e.g., for the SMARTS 'O=C(C)C'. In these cases, the SMARTS matching algorithm detects several matchings on the same set of atoms. In principle, each hit is treated individually and the partner points for each hit are collected. However, if the detected atoms of a hit have been found before, all of its partner points are marked with a flag in the database. Thus, it is possible to distinguish between all hits of a substructure and only its arbitrary first hits.
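The flagging of symmetric hits can be illustrated with a short sketch. It assumes that a hit is represented as a tuple of matched atom indices in SMARTS order. This representation and the function name are hypothetical and not taken from the NAOMI library.

```python
def flag_repeated_hits(hits):
    """Pair each hit with a flag telling whether its atom set was seen before.

    Assumption: a hit is a tuple of matched atom indices in SMARTS order.
    Two hits on the same atoms in a different order count as the same set.
    """
    seen = set()
    flagged = []
    for hit in hits:
        atom_set = frozenset(hit)
        flagged.append((hit, atom_set in seen))
        seen.add(atom_set)
    return flagged


# 'O=C(C)C' on one ketone group: the two terminal carbons (atoms 2 and 3)
# are interchangeable, so the same four atoms are matched twice.
hits = [(0, 1, 2, 3), (0, 1, 3, 2), (4, 5, 6, 7)]
result = flag_repeated_hits(hits)
```

Queries that should count each occurrence of a substructure once can then ignore all partner points whose hit carries the flag.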

Database extension

Besides building a new database, an existing database can also be extended by protein-ligand complexes as well as by new substructures. If protein-ligand complexes are added to a database, the same procedure as explained in Algorithm 3 is used. The only difference is in Line 2: in this case, the loop covers only the newly added protein-ligand complexes.

If new substructures are added to an existing database, Algorithm 3 is executed as well. However, the loop in Line 5 uses only the new set of substructures. Besides this difference, an additional step is required when an existing database is extended by new substructures.

During the database construction, attributes are calculated for each partner point in Line 21 of Algorithm 3. One part of this step is another SMARTS matching step: the substructures a partner point is part of are determined. For this step, all currently registered substructures are used. If a substructure is added after other substructures, an earlier detected partner point might be part of the new substructure. Hence, after a new substructure has been added, the detected atoms are temporarily stored for each protein-ligand complex. If one of these atoms has earlier been detected as a partner point, its attributes are updated.
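The additional update step for new substructures can be sketched with in-memory data structures. The dictionaries below stand in for the database tables, and the identifiers "1abc", "carbonyl", and "amide" are made up for illustration. How the substructure membership is actually encoded in the blob column is not covered here.

```python
def update_memberships(partner_points, new_sub_name, matched_atoms):
    """Add a new substructure to the membership of existing partner points.

    partner_points: dict mapping (complex id, atom id) to a set of
        substructure names the partner point is part of.
    matched_atoms: dict mapping complex id to the set of atom ids detected
        for the new substructure (the temporarily stored atoms).
    Returns the keys of all partner points that were updated.
    """
    updated = []
    for complex_id, atom_ids in matched_atoms.items():
        for atom_id in sorted(atom_ids):
            key = (complex_id, atom_id)
            if key in partner_points:
                partner_points[key].add(new_sub_name)
                updated.append(key)
    return updated


# Atom 5 was stored as a partner point earlier, atom 6 was not.
partner_points = {("1abc", 5): {"carbonyl"}, ("1abc", 9): set()}
matched = {"1abc": {5, 6}}
updated = update_memberships(partner_points, "amide", matched)
```

Atoms matched by the new substructure that were never stored as partner points are simply skipped, so the update never creates new rows.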

From: Mining of Interaction Geometries in Collections of Protein Structures (pages 70-74)