Limitations - Mining of Interaction Geometries in Collections of Protein Structures


8.2. Limitations

Despite the achievements presented in the previous section, both tools also have limitations. Based on their main source, these can be divided into four groups. First, there are limitations that derive from the database technology used. Second, the use of the NAOMI library leads to restrictions. The third group comprises limitations that derive from the data used. Finally, the specific algorithms implemented in both tools lead to restrictions. In the following, the limitations are discussed in detail for each group.

Database technology

SQLite was chosen because it makes it convenient to exchange databases between different users and platforms. These advantages, however, come at a cost. In general, it could be shown that queries returning a large number of results have longer retrieval times. This is in part due to the runtime of the database queries. Given the growth in the number of available macromolecular structures, this will become increasingly relevant in the near future. However, the targeted application scenario of Pelikan is the search for a specific pattern, which yields only a small number of results. Hence, even with growing data sets, the retrieval times might remain acceptable. The limitation has a stronger effect on NAOMInova, since querying large numbers of partner points is a typical task there.
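The dependence of retrieval time on result size can be explored with a small experiment. The following is a minimal sketch using Python's built-in sqlite3 module; the table `partner_point`, its columns, and the synthetic distances are assumptions made for illustration and do not reflect the actual schema of Pelikan or NAOMInova.

```python
import sqlite3
import time


def build_demo_db(n_points: int) -> sqlite3.Connection:
    """Create an in-memory table of synthetic partner points (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE partner_point ("
        "id INTEGER PRIMARY KEY, element TEXT, distance REAL)"
    )
    elements = ("N", "O", "C")
    # Synthetic distances that cycle from 2.00 up to 5.99 in steps of 0.01.
    rows = ((elements[i % 3], 2.0 + (i % 400) / 100.0) for i in range(n_points))
    conn.executemany(
        "INSERT INTO partner_point (element, distance) VALUES (?, ?)", rows
    )
    conn.execute("CREATE INDEX idx_distance ON partner_point (distance)")
    conn.commit()
    return conn


def timed_fetch(conn: sqlite3.Connection, max_distance: float):
    """Return (number of rows, elapsed seconds) for a distance cut-off query."""
    start = time.perf_counter()
    rows = conn.execute(
        "SELECT id, element, distance FROM partner_point WHERE distance <= ?",
        (max_distance,),
    ).fetchall()
    return len(rows), time.perf_counter() - start


conn = build_demo_db(40_000)
narrow_count, narrow_time = timed_fetch(conn, 2.005)  # selective query, few rows
broad_count, broad_time = timed_fetch(conn, 6.0)      # cut-off above all distances
print(f"narrow: {narrow_count} rows in {narrow_time:.4f} s")
print(f"broad:  {broad_count} rows in {broad_time:.4f} s")
```

The measured times depend on the machine and are therefore not quoted here; the sketch only makes the two query types comparable on the same data.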

Moreover, the SQLite databases used are not optimized for multi-user access. Hence, if databases for Pelikan and NAOMInova are to be shared among several users, they have to be copied. Especially for databases containing large numbers of PDB files, this can be inconvenient, since updates have to be performed for each copy of a database.
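The consequence of working with copies can be illustrated with Python's built-in sqlite3 module, whose `Connection.backup` method copies one database into another. This is only a sketch: the table `structure` and the identifiers in it are placeholders, not the schema or content of a real Pelikan or NAOMInova database.

```python
import sqlite3

# "Master" database with a hypothetical table of structures.
master = sqlite3.connect(":memory:")
master.execute("CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, resolution REAL)")
master.executemany(
    "INSERT INTO structure VALUES (?, ?)",
    [("s001", 1.8), ("s002", 2.4)],  # placeholder identifiers
)
master.commit()

# Each additional user receives a full copy of the database.
user_copy = sqlite3.connect(":memory:")
master.backup(user_copy)

# An update of the master is not visible in copies made earlier.
master.execute("INSERT INTO structure VALUES ('s003', 2.0)")
master.commit()

count_query = "SELECT COUNT(*) FROM structure"
master_count = master.execute(count_query).fetchone()[0]
copy_count = user_copy.execute(count_query).fetchone()[0]
print(master_count, copy_count)
```

After the update, the copy still holds the old state, so the update (or a new copy) has to be repeated for every user.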

NAOMI library

In principle, the NAOMI library is a very good basis for the development of computer-based approaches in the field of drug design. However, the data structures used also limit the possible functionality. First of all, a complex in NAOMI is based on a distinction between proteins and small molecules. In some cases, for example if peptides are present in a structure, this classification can be disadvantageous depending on the desired application.

Once a molecule has been classified as a small molecule during the complex initialization process, it cannot easily be turned into a protein. Since the classification into small molecules and proteins is used in both tools, a more flexible classification procedure would be desirable. Otherwise, unintended results might be reported or important results might never be found.

Moreover, the NAOMI complex initialization process is not able to handle flexible structures. PDB files resulting from molecular dynamics simulations usually contain several sets of coordinates for the same atom, which encode the flexibility of the structure. As seen in the comparison between Pelikan and Relibase, this can lead to false negative results since only the first annotated structure is used in the NAOMI library. For structures derived from X-ray crystallography, alternate locations of atoms might likewise be annotated in the PDB file, and these are not handled in the NAOMI library either.
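To make the effect of alternate locations concrete, the following sketch reads the alternate location indicator (column 17 of an ATOM record in the PDB format) and keeps only the first location of each atom. It is not the NAOMI implementation; the helper names and the serine residue with two side-chain conformations are synthetic and purely illustrative.

```python
def make_atom_line(serial, name, altloc, resname, chain, resseq, x, y, z, occ):
    """Format a synthetic ATOM record using the fixed PDB column layout."""
    return (
        f"ATOM  {serial:5d} {name:<4}{altloc:1}{resname:>3} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}"
    )


def parse_atom_line(line):
    """Read the fields needed here from a PDB ATOM/HETATM record."""
    return {
        "name": line[12:16].strip(),
        "altloc": line[16].strip(),  # column 17: alternate location indicator
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def keep_first_location(atoms):
    """Keep one entry per atom; later alternate locations are set aside."""
    seen, kept, dropped = set(), [], []
    for atom in atoms:
        key = (atom["chain"], atom["resseq"], atom["resname"], atom["name"])
        if key in seen:
            dropped.append(atom)
        else:
            seen.add(key)
            kept.append(atom)
    return kept, dropped


# Synthetic serine whose side-chain oxygen OG has two alternate locations.
lines = [
    make_atom_line(1, "N", " ", "SER", "A", 10, 11.0, 5.0, 3.0, 1.00),
    make_atom_line(2, "CA", " ", "SER", "A", 10, 12.0, 5.5, 3.5, 1.00),
    make_atom_line(3, "CB", " ", "SER", "A", 10, 12.2, 6.9, 3.9, 1.00),
    make_atom_line(4, "OG", "A", "SER", "A", 10, 12.5, 7.1, 4.2, 0.60),
    make_atom_line(5, "OG", "B", "SER", "A", 10, 13.4, 7.3, 3.1, 0.40),
]
atoms = [parse_atom_line(line) for line in lines]
kept, dropped = keep_first_location(atoms)
print(len(kept), "atoms kept;", len(dropped), "alternate location(s) ignored")
```

Any interaction formed only by the ignored conformation (here location B of OG) cannot be found by a subsequent search, which is one way such false negatives can arise.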

Data source

The main data source for both tools developed here is the PDB. The number of deposited structures is ever increasing, and the diversity of the proteins has also increased in recent years. However, there are still proteins whose structures have been elucidated very often and which are therefore overrepresented in the PDB. At the same time, there are proteins which are difficult to crystallize or which are not typical targets of drug design projects. Structures of these proteins are often underrepresented in the PDB. This might lead to a bias in the analyses performed and might influence the conclusions drawn.

Concretely, if Pelikan is used to find specific interaction patterns, it is not immediately clear whether the results derive from different structures of the same protein or from different proteins. Even inspecting all results might not settle this question, since the naming of proteins in PDB files is very inconsistent. In a similar way, such a bias might lead to wrong conclusions if a specific patch of atoms around a substructure is detected in NAOMInova.
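One simple way to check a hit list for such a bias is to count hits per protein. The sketch below assumes that each hit carries a reliable protein identifier; this is a hypothetical input, since neither tool as described provides one, and the structure and protein labels are invented.

```python
from collections import Counter

# Hypothetical hit list: (structure identifier, protein identifier).
hits = [
    ("struct_01", "protein_1"),
    ("struct_02", "protein_1"),
    ("struct_03", "protein_1"),
    ("struct_04", "protein_1"),
    ("struct_05", "protein_2"),
    ("struct_06", "protein_3"),
]


def hits_per_protein(hit_list):
    """Count how many hits fall on each protein."""
    return Counter(protein for _, protein in hit_list)


counts = hits_per_protein(hits)
dominant_protein, dominant_hits = counts.most_common(1)[0]
dominant_share = dominant_hits / len(hits)
print(f"{len(counts)} proteins; {dominant_protein} contributes {dominant_share:.0%} of hits")
```

In this invented example, six hits stem from only three proteins, and a single protein accounts for two thirds of them, so the pattern would appear more general than the data support.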

Specific limitations

In both tools, the structures of proteins and their ligands are treated as rigid. However, under physiological conditions, these molecules are flexible, and the binding between two molecules is not a rigid system either. This flexibility is in part reflected by the possibility to define ranges for geometric constraints. However, the user has to know the magnitude of the movements in the particular region.
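How a range on a geometric constraint tolerates small movements can be sketched as follows. The function name, the coordinates, and the distance limits are illustrative assumptions and are not taken from either tool.

```python
import math


def distance_in_range(point_a, point_b, lower, upper):
    """Check a distance constraint with a tolerance range (all values in angstrom)."""
    return lower <= math.dist(point_a, point_b) <= upper


donor = (0.0, 0.0, 0.0)
acceptor_observed = (3.0, 0.0, 0.0)  # position in the rigid crystal structure
acceptor_shifted = (3.8, 0.0, 0.0)   # same atom after a hypothetical 0.8 angstrom movement

fits_observed = distance_in_range(donor, acceptor_observed, 2.5, 3.5)
fits_shifted_narrow = distance_in_range(donor, acceptor_shifted, 2.5, 3.5)
fits_shifted_wide = distance_in_range(donor, acceptor_shifted, 2.5, 4.0)
print(fits_observed, fits_shifted_narrow, fits_shifted_wide)
```

The shifted position is only matched once the upper limit is widened, which is why the range has to be chosen with some knowledge of the expected movement.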

The positions of hydrogens and the exact tautomeric states of all molecules are determined with the tool Protoss in this work. In addition, the optimal mesomeric forms of the molecules are determined by the NAOMI library during the initialization process. Afterwards, both tools treat these states as fixed and unchangeable. Hence, if a specific substructure is searched for in NAOMInova, delocalized bonds and hydrogen positions have to be handled with care. The SMARTS language provides the symbol '~', which matches any bond. However, expressions for substructures can become very complex if variable positions for bonds and hydrogens are considered. The same holds true if the environment of a search point is defined by a SMARTS pattern in Pelikan.

Pelikan uses only the structural information of protein-ligand interfaces. This limits the use cases of the tool to projects dealing with the binding between a small molecule and a protein. Other interfaces, such as protein-protein binding or intramolecular binding within a protein, cannot be investigated with the tools as presented here. Moreover, in Pelikan only the resolution of a structure can be used as a quality criterion.

In both tools, the logical combination of filters is currently limited. All filter components are combined with a logical 'AND'. In some cases, a filter component can be negated, and in other cases an 'OR' combination is possible. The possibility to combine all filter components freely with the logical operators 'AND', 'OR', and 'NOT' would further increase the flexibility and precision of the search process in both tools.
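A free combination of filter components could, for instance, be modelled with composable predicates. This is a sketch of the idea only and not a description of how filters are implemented in Pelikan or NAOMInova; the record fields and the filter names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Filter = Callable[[Record], bool]


def all_of(*filters: Filter) -> Filter:
    """Logical AND over any number of filter components."""
    return lambda record: all(f(record) for f in filters)


def any_of(*filters: Filter) -> Filter:
    """Logical OR over any number of filter components."""
    return lambda record: any(f(record) for f in filters)


def negate(component: Filter) -> Filter:
    """Logical NOT of a single filter component."""
    return lambda record: not component(record)


def max_resolution(limit: float) -> Filter:
    """Accept structures whose resolution value does not exceed the limit."""
    return lambda record: record["resolution"] <= limit


def contains_element(symbol: str) -> Filter:
    """Accept structures that contain the given element."""
    return lambda record: symbol in record["elements"]


# Hypothetical structure records.
records: List[Record] = [
    {"id": "s1", "resolution": 1.5, "elements": {"C", "N", "Zn"}},
    {"id": "s2", "resolution": 2.8, "elements": {"C", "Zn"}},
    {"id": "s3", "resolution": 1.9, "elements": {"C", "O", "Fe"}},
    {"id": "s4", "resolution": 1.2, "elements": {"C", "O"}},
]

# resolution <= 2.0 AND (contains Zn OR NOT contains Fe)
query = all_of(
    max_resolution(2.0),
    any_of(contains_element("Zn"), negate(contains_element("Fe"))),
)
selected = [record["id"] for record in records if query(record)]
print(selected)
```

Because every combinator returns a filter of the same type, expressions can be nested to arbitrary depth.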
