Automated Structure Preparation and Its Influences on Protein-Ligand Docking and Virtual Screening


Dissertation

submitted for the academic degree of

Doctor of Natural Sciences (Dr. rer. nat.)

presented by

Dipl.-Chem. Tim ten Brink

at the

Mathematics and Natural Sciences Section, Department of Chemistry

Konstanz, 2011

Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-158832


Dissertation of the Universität Konstanz

Date of the oral examination: 09.09.2011

Referee: Dr. Thomas E. Exner

Referee: Prof. Dr. Karin Hauser

All rights reserved by Tim ten Brink and Dr. Thomas E. Exner


Acknowledgments

My thanks go to

• Dr. Thomas Exner for the interesting topic, for his supervision and support during the preparation of this thesis, and for the opportunity to attend international conferences and, in particular, summer schools in order to learn methods beyond the topic of this thesis.

• Prof. Dr. Karin Hauser for acting as second referee.

• Prof. Dr. Valentin Wittmann for chairing the examination committee.

• Dr. Oliver Korb for PLANTS, which made the docking studies of this thesis possible in the first place, for a great deal of help at the beginning of this work, many fruitful discussions along the way, and thorough proofreading at its end.

• Dr. Nicola Zonta for the implementation of the first INPHARMA-PLANTS interface.

• Dr. Marcel Reese and Dr. Adam Mazur for the INPHARMA libraries and for their support in making them usable in PLANTS.

• My research interns Falk Hildebrandt and Matthias Trautwein for the work they contributed.

• My colleagues in the research group, in chronological order Jens, Simon, Andrea, Fredrick, Ionut and Matthias, for the good atmosphere, the many stimulating discussions about the many different topics we worked on, and the time spent together outside the university.

• My friends for the time spent together and for their general support during my studies and doctorate.

• My whole family for their tremendous support throughout my studies up to the successful completion of this thesis, and in particular, of course, Felix, also for his thorough proofreading.


1. Introduction

1.1. Drug Design

The development of new therapeutics against diseases is commonly referred to as drug design. Drug development or design is a large field of research that spans various disciplines. It ranges from vaccines, which can consist of whole killed or merely attenuated viruses or other microorganisms and require a clearly biological approach, to the classical chemical drugs.

Chemical drugs have a long history in medicine that can be traced back to the use of herbal medicine in antiquity. But it was not until the development of organic chemistry that drugs could be synthesized. The first synthetic drugs were reproduced natural products like the famous salicylate drugs. These drugs were often used without exact knowledge of the mechanism behind their effects. With advances in medical, biological and chemical research, the targeted development of drugs against specific diseases began. At this point the term drug design came into use. Modern laboratory and research techniques then led to the establishment of generally applicable workflows for drug design.

1.1.1. The Drug Discovery Process

The process of rational drug design normally starts with the identification of a druggable target. In most cases this is a protein, but ribozymes and other structurally well-defined RNAs or DNAs are gaining more and more importance [1–3]. In past years G-protein-coupled receptors were the most prominent drug targets, followed by ion channels, nuclear receptors and transporters [4, 5]. The methods applied in this stage range from genomic to biochemical assays [6–8]. Target identification is a complicated task and often a limiting factor at the beginning of the drug design process [9]. The problems lie in the requirements for a good, druggable target. It must not only be responsible for the disease but also be reachable by the drug molecules, and its inhibition should not cause side effects, which can be especially complicated if the intended target is part of a larger family of similar proteins. When the target is identified, the search for a drug molecule starts. If the structure of the target is solved, the search for potential binding sites for drug molecules is usually the next step. These binding sites are often the ones to which the natural complex partner binds, but they can also be different sites if binding there blocks the natural binding site, for example by inducing conformational changes in the target. In most cases the intended drug molecule is an inhibitor of the target, which is meant to prevent the malfunction or overactivity of the target that actually causes the disease. For experimental testing the target protein usually has to be isolated, either directly from cell tissue or by overexpression in another organism followed by purification.

The search for a potent drug molecule is an iterative, multi-step process. A general approach is to screen small molecules against the target in the hope of identifying molecules or molecular fragments that bind to it. From these screening hits a lead structure is developed. This structure is usually a small molecule that has drug-like properties and shows a strong affinity to the target. While affinity is generally considered to be the most important feature of a drug, the choice of a lead candidate should also consider other important factors, which together describe the drug-likeness of the lead and should be rated very highly at this stage of the development process. Different sets of physicochemical properties are used to measure the drug-likeness of a molecule. A very common set is the Lipinski rules [10], which were developed in 1997 by Christopher Lipinski and state that a drug-like molecule should have a logP of not more than 5, a molecular weight of not more than 500 Da, not more than 5 hydrogen bond donors and not more than 10 hydrogen bond acceptors.

However, for lead molecules stricter rules should be applied [11], because drug molecules tend to grow during the subsequent steps when other factors such as bio-availability or suitability for oral administration are improved [12]. During the following lead optimization, one tries to optimize the balance between affinity to the target, selectivity, bio-availability, administrative and metabolic properties, as well as side effects and toxicity of the drug itself and of its metabolites. During the whole process the complexity of the developing drug has to be monitored to avoid problems with chemical synthesis later during production.
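The Lipinski criteria quoted above translate directly into a simple filter. The following is a minimal sketch, not part of the original work: the class and function names are hypothetical, the descriptor values are assumed to be supplied by some external property calculator, and the aspirin numbers are approximate values used purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Physicochemical descriptors of one molecule (hypothetical container)."""
    log_p: float           # octanol/water partition coefficient
    mol_weight: float      # molecular weight in Da
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the names of all Lipinski criteria that the molecule violates."""
    violations = []
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    return violations


def is_drug_like(d: Descriptors, max_violations: int = 0) -> bool:
    """Accept a molecule if it violates at most `max_violations` criteria."""
    return len(lipinski_violations(d)) <= max_violations


# Approximate, illustrative values for aspirin.
aspirin = Descriptors(log_p=1.2, mol_weight=180.16,
                      h_bond_donors=1, h_bond_acceptors=4)
print(is_drug_like(aspirin), lipinski_violations(aspirin))
```

The `max_violations` parameter is an assumption added for flexibility; with the default of 0 the filter applies the four limits exactly as stated in the text. Stricter lead-likeness rules could be expressed the same way with lower thresholds.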

Once a satisfactory molecule has been developed the pre-clinical phase starts. Here the molecular properties are tested experimentally with respect to bio-availability, metabolic properties and toxicity. These tests are often conducted under standardized conditions on animal models but efforts are being made to reduce the number of animal tests. If the new drug succeeds in these tests the three phases of clinical studies can be entered.

In these, the molecule is first tested on a small group of healthy human volunteers to register possible side effects. If these tests are successful, the drug can be tested for its actual effectiveness against the disease it was designed for. But even if it succeeds in the clinical phases, the drug can still fail due to marketing or production issues. Overall, the whole drug design process can take more than 15 years and cost several hundred million dollars [13, 14]. Figure 1.1 gives an overview of the process from disease to drug candidate molecule, together with the expected time and costs of the whole process.


Figure 1.1.: Left: Overview of the drug discovery process from its start to a drug candidate molecule, with the methods that are applied to reach the various stages. Orange: stages of the process, green: areas where computer-aided methods are applied, blue: purely experimental methods. Right: The different stages of the drug discovery process with the estimated costs and duration of each step. Numbers reproduced from [14]. Costs are given in million $.

1.1.2. Computer-Aided Drug Design

Computer-based methods have been used in rational drug design for the past decades and have been developed to save time and money in this process [15, 16]. The classical rational drug design approach made use of chemical and biochemical tests to identify molecules with affinity to the target (hits). Lead optimization then required a huge amount of expert knowledge, while issues like bio-availability, metabolites and toxicity were hardly accessible and, if they were addressed at all, required animal experiments. Developments in robotics and progress with cell tissues as a substitute for animal experiments led to some major breakthroughs in chemical drug design. High-throughput screening (HTS) greatly increased the number of molecules that could be tested on a target while simultaneously reducing the amount of target substance needed for each single test. Despite being a breakthrough, HTS has some major drawbacks. For each new target a new test setup has to be developed, and the compound libraries used for the screening have to be bought or synthesized and then stored and maintained. Another point is the accuracy of the screening. The test has to be machine-readable, and therefore fluorescence assays are often used. Problems occur when the tested compounds react with the test assay and not with the target, leading to false positive or false negative results. Even the benefit of HTS for the drug discovery process in general has been debated [17, 18].

To help avoid these drawbacks, different computer-based methods have been developed, primarily not to replace but to assist experimental methods and to save time and money in the drug design process. For target identification mainly statistical tools are used, for example to identify irregularities in gene expression. Computer-based methods are of greatest use in the lead identification phase [15]. These methods are divided into ligand-based and structure-based methods. Ligand-based methods only use information about one or more molecules that are known to bind to the target. 2D or 3D alignments of the molecules, with physicochemical properties as additional descriptors, are normally used in ligand-based methods to describe the similarity between molecules. The assumption is then that molecules with similar structural and physicochemical properties will show similar binding and selectivity properties. This approach is one of the oldest in computer-aided molecular design and is generally referred to as quantitative structure-activity relationship (QSAR) [19].

Structure-based methods, on the other hand, need a 3D structure of the target molecule. This structure has to be solved by X-ray or NMR methods, which are described in more detail in Section 1.2 since these structures form the input of the work described here. When the structure is solved, the binding sites, where interactions with other molecules are possible, have to be identified. Cocrystallization with a known binding molecule can be of great help here. If cocrystallization is not successful or if the protein is expected to have more than one binding site, computer-based methods can be used [20]. After the binding site is identified, different methods can be applied to find a lead molecule. The most obvious one is de novo design [16]. Here the binding site is examined for patterns of hydrogen bond donors and acceptors or lipophilic regions, and the lead molecule is then designed to be complementary to them. Normally molecular fragments are used to fill the binding site and are connected by linking fragments or atoms. During optimization the rigidity of the molecule is then often increased.

A main problem of de novo design is that the resulting molecules are often hard to synthesize. Approaches that add so-called chemical intelligence to the algorithms, in order to avoid problems with the synthesis of the proposed molecules, have been suggested but are far from being available [16]. The second, more popular method is virtual screening (VS) with molecular docking tools. In this process large databases of molecules are tested virtually by a computer program for their affinity to the target molecule. This approach resembles the classical HTS approach but does not need physical compound libraries, and the target molecule is only needed once to determine its structure. Another advantage is that molecules which have not yet been synthesized can easily be tested as well. The accuracy of VS, at least with the programs currently available, is a general point of debate [21], but for several targets results at least comparable to HTS were reported [22, 23]. Another point of view is that HTS and VS are not competitors in drug design but give different insights, from which the process can benefit even more if they are combined [24]. The ligand-based approaches mentioned above can be used for VS, too. Fragment-based approaches can be seen as a combination of VS and de novo design. In these studies not whole drugs or at least lead-like molecules are screened but much smaller fragments. The goal is to find the optimal binding fragment for each part of the binding site; these fragments are then combined by linker groups or atoms [25].

In more advanced stages of the drug discovery process, computer-based methods can be used to improve the lead molecule. Scaffold hopping [26] can be used to change the main frame of the molecule when unfavorable properties or synthetic problems occur but the peripheral functional groups should be kept for affinity and bio-availability reasons.

Another application of computer-based methods is the prediction of molecular properties from the molecules' structure, as in the QSAR approach. Molecular features are used to predict not only affinity but also toxicity [27, 28], logP, and derived properties like bio-availability. While increased hydrophilicity can be beneficial for stabilizing solvent-exposed parts of the drug molecule when it is bound to the target, it often hinders membrane permeability and therefore lowers bio-availability.

1.2. 3D Structure Determination

For structure-based methods in drug design the 3D structure of the target protein needs to be solved. The available 3D coordinates and structures of molecules are generated by experimental and computational methods. The generally used experimental methods, X-ray crystallography and NMR, are well established and their basic principles are the subject of various textbooks [29–32]. Therefore only some particularities of these methods that arise when dealing with proteins and protein complexes are briefly outlined here.

One of the most important methods is X-ray crystallography, which makes it possible to solve the 3D structures of both small molecules and macromolecules. This method is based on the scattering of X-rays by the electrons of the analyzed crystal. The resulting diffraction pattern is detected for all orientations of the crystal, but the phase information is lost during the measurement. For macromolecules the intensities of the diffraction spots are much lower than for small molecules. This leads to a general reduction of the obtainable resolution, for several reasons. First, the high-resolution information is contained in the beams of higher diffraction order, whose intensities are already low [29]. Combined with the generally lower intensity, this often prevents the detection of the higher-order beams. An increased measurement time is often not possible because longer exposure causes more radiation damage to the crystal. The phase problem is also more complicated for macromolecules than for small molecules. Direct methods are not usable for macromolecules. Isomorphous replacement is the most common method for solving the phases in protein crystallography [30]. With the phases solved, an electron density map can be calculated. The atoms then have to be fitted into the electron density map, which is a nontrivial task, especially for large proteins. The 3D structure of the molecule obtained in this way is therefore only a model, although one with a strong experimental basis if the process from the X-ray measurement to the final structure is carefully executed. Despite being a standard technique today, protein crystallography still faces many challenges and unsolved problems. A short review of today's challenges in protein crystallography and approaches to solving them was given, among others, by Petrova [33]. One specific problem with X-ray structures used as input for computational methods is that protein structures usually lack hydrogen atom coordinates, because hydrogen atoms contribute little to the diffraction and thus to the electron density to which the structure is fitted.
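The statement above that the high-resolution information sits in the higher-order, high-angle reflections can be related to Bragg's law. The following is a brief sketch in standard textbook notation; the symbols λ (X-ray wavelength), θ (scattering angle), d (lattice plane spacing) and n (diffraction order, set to 1 in the second expression) are not used in the original text.

```latex
n\lambda = 2 d \sin\theta
\qquad\Longrightarrow\qquad
d_{\min} = \frac{\lambda}{2\,\sin\theta_{\max}}
```

The smallest resolvable spacing d_min therefore only shrinks if reflections at larger angles θ_max can still be measured, which is where the intensities of macromolecular crystals tend to become too weak to detect.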

The second most widely used experimental method for determining 3D structures of molecules is NMR spectroscopy. Here the interaction of the nuclear spins with an external magnetic field is measured. In organic chemistry NMR spectroscopy is a standard technique for the characterization of molecules. Large proteins in their natural H₂O/buffer environment are, however, much more problematic. The sheer number of peaks and other problems that result from the size of the proteins, like line broadening, make the spectra difficult to interpret. Modern 2D, 3D or even higher-dimensional methods have greatly improved the usability of NMR techniques for solving the structures of larger proteins [31, 32]. The final results of NMR experiments on proteins are Nuclear Overhauser Effect (NOE) based distance constraints between hydrogen atoms and torsion angle constraints. Because of overlapping peaks and groups of magnetically similar hydrogen atoms, these constraints are often ambiguous. To obtain the protein structure, this network of constraints has to be solved with only the protein sequence as additional information.

This is often done in an iterative approach of peak assignment and structure solving. NMR structures usually contain the hydrogen atoms and consist of an ensemble of structures, which results from the force field minimization used at the end of each cycle of assignment and structure solving.

1.3. Scope of this Work

As described above, docking in virtual screening applications is an integral part of the process of computer-aided drug design. The structures needed for docking are often taken from structure databases like the PDB [34] or the ZINC [35] database. But due to the limitations of the experimental methods used for their determination, these structures have to be preprocessed to be usable for computer-aided drug design. In this work a program for the preparation of these structures for methods applied in computer-aided drug design was developed. The features of this program, called Structure PrOtonation and REcognition System (SPORES), were then tested on several standard docking benchmark datasets. Additionally, structures generated with SPORES were used in docking and virtual screening studies. These studies included tests of modified ligand structures such as different stereoisomers, protonation states, tautomers and ring conformers. A workflow for the combination of SPORES with the MARVIN software from ChemAxon [36] was introduced to utilize pK_a values calculated with MARVIN in the structure preparation process. In addition to the structure-based virtual screening studies, the work was extended to test ligand-based methods on the same dataset. The alignment method used for the ligand-based studies was also used to build CoMFA [37] models to predict small molecule affinities to odor receptors.


2. Structure Recognition

2.1. Molecule Representation

For all applications of computational chemistry the representation of molecules as computer models is a crucial step. The kind of representation has to be adapted to the kind of application. Databases, in which information about a large number of molecules is stored, need space-efficient representations that are also suited for substructure searches. In these applications the representations also need to be canonical to avoid problems with duplicates in the database. An example of a database-type representation is the one-dimensional SMILES string [38] representation. SMILES strings represent molecules as simple strings containing element symbols for atoms, special characters for several bond types, parentheses for branches and numbers that represent rings by marking the two atoms that form the ring closure. Shortly after their introduction, an algorithm that creates a unique SMILES notation for a given molecule was presented [39].
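To illustrate how ring closures are encoded by paired numbers in a SMILES string, the following minimal sketch counts ring-closure bonds. It is not a SMILES parser: the function name is hypothetical, digits inside square-bracket atoms (isotopes, charges, hydrogen counts) are skipped, and the two-digit `%nn` ring-closure notation is deliberately rejected rather than handled.

```python
def count_ring_closures(smiles: str) -> int:
    """Count ring-closure bonds in a SMILES string using single-digit labels.

    Each ring closure is written as the same digit on the two atoms that
    close the ring, so the number of closures is half the number of digits
    found outside square-bracket atoms.
    """
    digits = 0
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # isotope, charge or H-count digits are not ring labels
        elif ch == "%":
            raise ValueError("two-digit ring closures (%nn) are not supported")
        elif ch in "0123456789":
            digits += 1
    if digits % 2 != 0:
        raise ValueError("unpaired ring-closure digit")
    return digits // 2


for smiles in ["CCO", "CC(=O)O", "c1ccccc1", "C1CCC2CCCCC2C1", "[13CH4]"]:
    print(smiles, count_ring_closures(smiles))
```

The examples are ethanol, acetic acid (with a branch in parentheses), benzene, decalin (two fused rings) and isotopically labeled methane (a bracket atom whose digits are not ring labels).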

The next more complex representation method, which is obvious for chemists, is the 2D representation, comparable to structural formulas. In computational chemistry these representations are used for human-machine interaction in drawing-board-like applications to submit queries for database searches or to create a representation of the molecule, which is then converted by the program into something better suited for the computer, like a 1D representation. Examples are the ZINC database [35] and ReliBase [40–42], which are both described in more detail below. Both use sketch programs that allow the user to submit queries based on a drawn 2D structure of the query molecule but internally convert them to SMILES format for query processing. Besides these applications, 2D structures are used in 2D QSAR methods. When structural features of molecules are of interest, or for energy calculations such as quantum chemical or molecular dynamics applications as well as the calculations performed here, 3D structures of the molecules are needed. 3D structures can be generated from the 1D or 2D representation by the use of bond length, bond angle and torsion angle tables. CORINA [43, 44] from Molecular Networks and MARVIN from ChemAxon [36] are only two of the numerous programs developed for this task. MARVIN was developed by ChemAxon for conversion between molecular file formats, structure drawing and property prediction. In its graphical user interface it allows the user to draw 2D structures and convert them into 3D structures, to add hydrogen atoms to structures and to predict pK_a values, stable tautomers and the micro-species distribution for given pH values. The MARVIN package also provides other features like drawing and checking chemical reactions. Most features of MARVIN are accessible via the command line, which makes MARVIN a well-suited tool for scripting and programming. The other way to obtain 3D structures of molecules is to generate them from experimental data as described in Section 1.2.

For 3D representations different coordinate systems can be used. Many quantum chemical programs use internal coordinate systems like the Z-matrix. For molecular dynamics (MD) applications internal 3D coordinates based on the torsion angles are widely used. For docking, external coordinate systems like those directly offered in the Protein Data Bank (PDB) [34] are more common. For the 3D visualization of molecules different model representations are used. While some show the molecules just as a framework of their bonds, other representations have a more physical background, like the CPK [45] model, in which the atoms are represented by spheres with diameters proportional to their van der Waals (vdW) radii.

For larger molecules, especially proteins, more simplified representations are often used to give a clearer view of the molecule. Ribbon representations [46], in which the backbone of the protein is used to give an overview of the structural features, are an example of these simplified representations. While backbone-focused representations are useful for comparing protein structures, they are unable to give information about binding pockets or substrate channels in the protein. For the visualization of these, molecular surfaces like the Connolly surface [47–49] are useful. To generate the Connolly or solvent-accessible surface of a given molecule, a probe sphere representing the solvent molecule is defined. For the water-accessible surface the probe sphere radius is set to 1.4 Å. This sphere is then rolled over the CPK model of the molecule. Where the sphere touches the CPK surface at only one point, the CPK surface becomes the Connolly surface. Where the sphere touches the CPK surface at more than one point, the part of the sphere surface between these points becomes the Connolly surface. The Connolly surface is useful for displaying molecular properties like local lipophilicity or shape-dependent properties like the shape index [50].

This combination of topographical and property visualizations of proteins is of great value for studying interactions and interaction sites of proteins with small molecules.
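The two cases of the probe-sphere construction described above (touching one atom versus touching several) can be sketched as a small classification routine. This is only an illustration under stated assumptions, not a surface generator: the function name, the return labels and the numerical tolerance are hypothetical, atoms are given as (center, vdW radius) pairs, and the carbon radius of 1.7 Å in the example is a commonly tabulated value rather than one taken from the text.

```python
import math


def classify_probe_position(probe_center, atoms, probe_radius=1.4, tol=1e-3):
    """Classify one probe-sphere position relative to a CPK model.

    atoms: iterable of (center, vdw_radius) with center = (x, y, z) in Angstrom.
    Returns "clash" if the probe overlaps an atom, "free" if it touches none,
    "contact" if it touches exactly one atom (the CPK surface itself becomes
    part of the surface) and "reentrant" if it touches several atoms (the
    probe surface between the touching points becomes part of the surface).
    """
    touching = 0
    for center, vdw_radius in atoms:
        gap = math.dist(probe_center, center) - (vdw_radius + probe_radius)
        if gap < -tol:
            return "clash"
        if abs(gap) <= tol:
            touching += 1
    if touching == 0:
        return "free"
    return "contact" if touching == 1 else "reentrant"


# Two carbon-like spheres 4 Angstrom apart, water-sized probe.
atoms = [((0.0, 0.0, 0.0), 1.7), ((4.0, 0.0, 0.0), 1.7)]
height = math.sqrt(3.1 ** 2 - 2.0 ** 2)   # probe sits in the groove between both
print(classify_probe_position((-3.1, 0.0, 0.0), atoms))    # touches first atom only
print(classify_probe_position((2.0, height, 0.0), atoms))  # touches both atoms
```

A real implementation would generate the probe positions analytically instead of testing arbitrary points; the sketch only shows how the "one point" and "more than one point" cases are distinguished.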

2.2. Molecular Databases

The PDB is the largest free database of protein and protein complex structures. It contains mostly crystal structures, with a growing number of NMR structures and some structures obtained by other methods like neutron scattering. Structures built by homology modeling are not accepted. The PDB offers its own file format, which has become a standard format for protein structures. At the moment (June 2011) the PDB contains over 73000 structures. The free ZINC database contains purchasable chemical compounds for virtual screening applications. At the moment (April 2011) it contains over 13 million compounds, which are available in different formats. In addition to the structures, ZINC offers information about vendors as well as database tools, e.g. to directly access drug-like subsets of the database. ReliBase is offered by the Cambridge Crystallographic Data Centre (CCDC). It provides access to the structures of the PDB with additional tools, such as searching for structures that contain the same or similar ligands, downloading only the binding site of a protein, aligning the binding sites of different proteins and directly displaying the interaction pattern between the protein and the ligand. These additional features make ReliBase especially useful for compiling datasets for docking and VS applications.
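The PDB file format mentioned above stores each atom in a fixed-width ATOM or HETATM record. As a minimal sketch of how such a record can be read, the following function slices the standard column ranges of the format. The names `PdbAtom` and `parse_atom_line` are hypothetical, only a subset of the fields is extracted, and the sample record is an illustrative line assembled field by field rather than data from a specific PDB entry.

```python
from typing import NamedTuple, Optional


class PdbAtom(NamedTuple):
    record: str     # "ATOM" (standard residues) or "HETATM" (ligands, water, ...)
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str    # may be empty in older files


def parse_atom_line(line: str) -> Optional[PdbAtom]:
    """Parse one ATOM/HETATM record; return None for any other record type."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return PdbAtom(
        record=record,
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21:22].strip(),  # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        element=line[76:78].strip(),   # columns 77-78
    )


# Illustrative record, written as adjacent fields to keep the columns exact.
SAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" " " "   "
    "  27.340" "  24.430" "   2.614" "  1.00" "  9.67" "          " " N"
)
print(parse_atom_line(SAMPLE_LINE))
```

Because ligands appear as HETATM records and hydrogen atoms are usually absent in crystal structures, a preparation tool has to treat both record types and cannot rely on explicit hydrogen positions.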

2.3. The Structure Recognition Problem

The problem of translating experimentally obtained 3D atom coordinates and element information into a chemically correct structure model of the molecule is known as structure recognition. During this process the connectivity of the molecule has to be established and a hybridization has to be assigned to each atom. The hybridization and the element information are combined in a so-called atom type. For the bonds a bond type is defined, which normally refers to the bond order but can also contain additional information. For these atom and bond types various conventions exist, which were mainly developed for different molecular mechanics force fields. These type conventions usually utilize a large number of different types, which depend not only on the atom or bond under consideration but also on the neighbors of the atom or bond and on the presence of certain functional groups in the neighborhood [51–57]. Other conventions use only a limited number of different types for each element and take only the atom itself and its direct neighborhood into account [58]. For example, in the first case the carbonyl carbon atoms of an aldehyde and a ketone would be assigned different types, and a distinction can even be made between ketones in different environments, while in the second case no distinction is made between the carbonyl carbon atom of an aldehyde and that of a ketone group. The structure recognition problem itself is a nontrivial task. Experimental errors and missing hydrogen atoms in protein crystal structures are two of the reasons why automated structure recognition has long been worked on and is still not completely solved [59, 60].
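The first step named above, establishing connectivity from bare coordinates and element symbols, is commonly approached with a distance criterion. The sketch below is one generic way to do this and is not taken from any of the programs discussed in this work: two atoms are considered bonded if their distance does not exceed the sum of their covalent radii plus a tolerance. The radii are commonly tabulated values, and the tolerance of 0.4 Å is an assumption chosen for illustration.

```python
import math
from itertools import combinations

# Commonly tabulated covalent radii in Angstrom (small illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def detect_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) of atoms that are close enough to be bonded.

    atoms: list of (element_symbol, (x, y, z)) with coordinates in Angstrom.
    """
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Heavy atoms of an acetaldehyde-like fragment: C-C(=O), hydrogens missing.
fragment = [
    ("C", (0.00, 0.00, 0.00)),
    ("C", (1.50, 0.00, 0.00)),
    ("O", (2.10, 1.05, 0.00)),
]
print(detect_bonds(fragment))
```

Note that the result only says which atoms are connected. Whether the C–O contact is a single or a double bond, and which hybridization the atoms have, still has to be decided in later steps, which is where the missing hydrogen atoms and coordinate errors become a problem.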

For a chemist, who is used to working with structural formulas and is normally able to decide on the correct hybridization and bond orders at a glance, at least in the case of 2D structures, the great difficulty of automated structure recognition may seem like a paradox. The problems become more evident if one tries to assign sp² or sp³ hybridization and single or double bonds to a chain of carbon atoms without any information about hydrogen atoms. If the structure is then only slightly distorted, the correct placement of double bonds and the correct assignment of the hybridization become difficult. The whole process is even more problematic for a computer, which, although much faster than a human in angle and bond length calculations, is much inferior in terms of visual perception. The importance of the problem also became evident in a docking and VS session at the ACS Fall Meeting 2010 in Boston. The results presented there will be published in a special edition of the Journal of Computer-Aided Molecular Design during this year, but the general consensus was that the quality of docking results depends directly on the quality of the structure preparation (Oliver Korb and Chris Williams, personal communication). In structures taken from the PDB, distorted bond lengths and angles are a common problem. Therefore programs that deal with structure recognition of those protein and protein complex structures need to be able to cope with experimental errors in the input data. The ligand molecules in particular are often affected, because the molecular force fields used in the refinement process of the structure determination are often poorly parametrized for organic molecules other than proteins. The problems occurring with ligands from the PDB have also been addressed for some time with the program BALI [61]. In the following, some other programs used for structure recognition are briefly described.
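To make the bond-length sensitivity discussed in this section concrete, the following sketch assigns a carbon–carbon bond type purely from the bond length by choosing the nearest reference value. This naive nearest-reference rule is an assumption made for illustration and is not the method of any program described here; the reference lengths are typical textbook values.

```python
# Typical carbon-carbon bond lengths in Angstrom (textbook values).
CC_REFERENCE_LENGTHS = {
    "single": 1.54,
    "aromatic": 1.40,
    "double": 1.34,
    "triple": 1.20,
}


def guess_cc_bond_type(length: float) -> str:
    """Return the bond type whose reference length is closest to `length`."""
    return min(CC_REFERENCE_LENGTHS,
               key=lambda name: abs(CC_REFERENCE_LENGTHS[name] - length))


# An ideal double bond, and the same bond distorted by only 0.04 Angstrom.
for length in (1.34, 1.38, 1.45, 1.54):
    print(f"{length:.2f} A -> {guess_cc_bond_type(length)}")
```

A coordinate error of a few hundredths of an ångström is enough to turn a double bond into an apparently aromatic one, which is why bond lengths alone are not a reliable criterion and the environment of the atoms has to be taken into account as well.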

2.4. State of the Art

Various programs have been developed to cope with the structure recognition problem. On the one hand, standard modeling tools like SYBYL from TRIPOS [62] are often used. SYBYL is a commercial multi-functional program suite that provides a wide range of tools, from different molecular force fields through QSAR methods and docking tools to statistical tools like partial least squares (PLS) and multi-linear regression methods. Additionally, SYBYL provides easy ways to modify molecular structures or to build them by "clicking atoms and fragments together". These programs often have the problem that they are designed to process a specific group of biomolecules, proteins in the case of SYBYL, and not small organic molecules, which contain elements and functional groups not present in amino acids. This often leads to wrong atom and bond types and a wrong protonation for small molecules.

On the other hand, more specialized programs are available. The LIGPREP and PROTPREP software from SCHRÖDINGER was designed to be used with the GLIDE [63–65] docking program. These are examples of programs specifically designed to prepare structures for docking. PROTPREP can be used to define the binding site for docking with GLIDE and adds hydrogen atoms to the protein structure. It tries to find the optimal protonation for the binding site. Side chains which are not part of the binding site and which do not participate in salt bridges are kept neutral where possible. The LIGPREP routine can be used to prepare the ligand structure. In this case the ligand preparation can be done directly inside the binding site, which ensures a protonation that is adapted to the conditions of the binding site. While this feature can help to greatly improve docking results if the binding pose is known, it should be avoided in the setup of virtual screening datasets to avoid preferential treatment of known binding molecules. One of the most sophisticated programs for the protonation of crystal structures is the PROTONATE3D software developed by the CHEMICAL COMPUTING GROUP [66]. Here, after a standard protonation is assigned, the hydrogen bond network is optimized by free energy calculations to achieve an optimal placement of the hydrogen atoms. However, the program is limited to proteins and protein complexes and cannot be used for small molecules alone.

The programs OPENBABEL [67] and i-INTERPRET [68] were originally designed for file format conversion, a problem which is related to structure recognition since some file formats provide less information about the molecules than others. i-INTERPRET uses a database of functional groups for faster and more accurate structure recognition. Once these groups have been identified and their properties have been assigned, only the linker parts of the molecule between them are left to be treated. The REDUCE software [69] is a program that can add hydrogen atoms to PDB structures and optimizes the protonation of histidine residues. REDUCE can also correct amide groups in asparagine and glutamine residues whose oxygen and nitrogen atoms were wrongly assigned and therefore disturb the hydrogen bond network of the protein. PDB2PQR [70, 71] is a program that was originally developed to translate pdb files into pqr files but has evolved into a tool for the automated preparation of protein structures. For hydrogen addition it calculates the pK_a values of side chains using the PROPKA routine [72]. Another feature of PDB2PQR is the ability to assign the atom types, parameters and charges of different molecular force fields to the protein, using the PEOE PB routine [73].

The general methodology of all these programs is quite similar. The hybridization of the atoms is determined by looking at their surroundings, and the bond types are set according to the hybridization of the atoms as well as the bond lengths [59, 60]. The positions of the hydrogen atoms are calculated from the coordinates of the heavy atoms and parameters for angles and bond lengths.

2.5. SPORES

The Structure PrOtonation and REcognition System (SPORES) [74] is the main product of this thesis. SPORES was developed to generate input structures for the docking tool PLANTS [75, 76]. Protein and ligand structures taken directly from the PDB are in the pdb file format and in most cases lack hydrogen atom coordinates. As PLANTS needs input structures in the mol2 file format, the molecules have to be translated from the pdb format to the mol2 format and the hydrogen atoms have to be added. A key step in SPORES is the structure recognition, which is needed for all other parts of the program. The structure recognition consists of five main parts. First, the connectivity of the molecule is established in the bond generation phase. In the second step, rings in the molecular structure are detected and stored. With this information the atom-types, which in the case of SPORES mainly stand for the hybridization of the atoms, are set.

Afterward the rings are classified for planarity and aromaticity. With the assigned atom-types the molecule is then protonated if hydrogen atoms are missing in the structure. This initial protonation of the molecule, the SPORES standard protonation, is designed to be the correct protonation for organic compounds at physiological pH. In the last step bond-types are set. In addition to this standard structure recognition, SPORES can be used to create variations of ligand structures, which can then be used, for example, in docking. These modifications of the ligands include protonation states, stereoisomers, tautomers, and ring conformers. The goal of including variations of the same ligand structure in the docking process is to improve the coverage of chemical space, which is especially important for virtual screening [77–79]. In the following, the different steps and routines of SPORES are described in detail, while the application of the generated structures and their influence on docking and virtual screening are discussed in Chapter 3.

Figure 2.1.: A protein (small CPK model) in a grid (red lines) with a resolution of 5.0 Å used for bond generation. The picture shows the resolution of the grid compared to the length of the bonds and the size of the whole molecule.

2.5.1. Bond Generation

The bond generation is the first step of the structure recognition. Based on the information provided in a usual crystal structure pdb file, the connectivity of the molecule is established. SPORES uses only the element information and the coordinates from the pdb file. A grid-based method is used to generate bonds between atoms. For this purpose a grid with a resolution of 5.0 Å, which extends two layers of cells beyond the molecule in every direction, is laid over the molecule. All atoms are assigned to the grid cell in which they are located. To find potentially bonded atoms only 27 cells of this grid (the cell containing the atom and its 26 neighbors) have to be searched. Due to the overhead of the grid generation this method is slower for small molecules than the straightforward approach, in which the distances between one atom and all others are checked to find bonded atoms. For bigger molecules the method scales nearly linearly instead of showing the quadratic complexity of the straightforward approach, since for every added atom only the 27 cells have to be checked. Figure 2.1 shows an example of a protein in the 5 Å grid. The bond generation of SPORES is based on atomic radii, which were taken from the MOLCAD software [80–82]. Two atoms are considered bonded if the distance between them is smaller than 1.2 times the sum of their atomic radii. Figure 2.2 illustrates the 1.2 times criterion for a ligand. For some elements, especially metalloids and halogens, this criterion is reduced to 1.1 times the sum of the radii. This was necessary because in some structures the 1.2 times criterion produced wrong bonds between some of these atoms and groups close by. With the reduction to 1.1 times the sum of the radii these problems were avoided and all correct bonds were still created. For metals a sphere of 4.88 Å is considered for coordinating hetero atoms. The radius of the coordination sphere of metal ions was defined after several test proteins with different ions in their structure had been examined.

Figure 2.2.: Two carbon atoms are bonded when their distance is shorter than 1.2 times the sum of their covalent radii (dotted turquoise spheres).

Figure 2.3.: Some examples for the ring detection in SPORES. a) The terminal bond mechanism in SPORES. From the upper left structure: Starting with the whole molecule the first set of terminal bonds is searched (red). After the elimination of the first terminal bonds the new ones are marked and eliminated, leaving only the ring. b) The principle of the BFS: Starting at one bond in the ring the neighboring bonds to both sides are followed until the ring closure is found. c) The first nontrivial example, the naphthalene molecule. The BFS finds two ring closures, first the left single ring and one step later another. d) The two paths from the first ring closure are followed back to the root bond. In this step the branches are followed one after the other and in the last step the root bond is added to the ring.
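The grid-based neighbor search described above can be illustrated with a minimal sketch. It is not the SPORES implementation: the function name `generate_bonds`, the data layout and the radii in `RADII` are assumptions made for illustration (the radii are rough covalent radii, not the MOLCAD values), and the reduced 1.1 factor and the metal coordination sphere are left out.

```python
import math
from collections import defaultdict
from itertools import product

CELL = 5.0  # grid resolution in Angstrom, as stated in the text

# Illustrative radii in Angstrom (assumption; NOT the MOLCAD radii used by SPORES).
RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def cell_of(pos):
    """Integer grid cell that contains the position."""
    return tuple(int(math.floor(c / CELL)) for c in pos)


def generate_bonds(atoms, factor=1.2):
    """atoms: list of (element, (x, y, z)). Returns a sorted list of index pairs."""
    grid = defaultdict(list)
    for idx, (_, pos) in enumerate(atoms):
        grid[cell_of(pos)].append(idx)

    bonds = set()
    for i, (elem_i, pos_i) in enumerate(atoms):
        home = cell_of(pos_i)
        # Only the home cell and its 26 neighbors are searched.
        for offset in product((-1, 0, 1), repeat=3):
            cell = tuple(h + o for h, o in zip(home, offset))
            for j in grid.get(cell, ()):
                if j <= i:
                    continue
                elem_j, pos_j = atoms[j]
                limit = factor * (RADII[elem_i] + RADII[elem_j])
                if math.dist(pos_i, pos_j) < limit:
                    bonds.add((i, j))
    return sorted(bonds)
```

Because every bond is much shorter than the cell edge of 5.0 Å, two bonded atoms always lie in the same or in adjacent cells, so restricting the search to 27 cells does not miss any bond.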

2.5.2. Ring Detection

Ring detection is one of the challenging parts of structure recognition. The main problem lies not in finding a ring closure but in finding the relevant set of rings for a molecule. For complex ring systems the smallest set of smallest rings (SSSR) is the minimal set of rings needed to describe the whole system [83]. The naphthalene molecule in Figure 2.3, for example, has three rings: the two benzene rings and the complete 10-membered ring. The next molecule in the series of polyaromatic systems, anthracene with its three benzene rings, already has a total of six rings, of which the two naphthalene systems are not really important for describing the molecule. The SSSR in this case would consist of the three benzene rings, but the 14-membered cycle may also be helpful to describe the molecule.

Since the 1960s this graph-theoretical problem and approaches to solve it have been published in chemical journals as methods to find the cyclic subsystems in molecular graphs. In 1989 Downs et al. gave an overview of the methods for ring perception published until then [84]. In 1996 Figueras presented his method of using a modified breadth-first search (BFS) [85] to find the SSSR, which gave significant improvements in calculation time over previous methods. In SPORES the ring detection problem is also solved with a breadth-first search. This method finds all basic rings, which is enough information for the further structure recognition process. Before the ring detection is started, all bonds that belong to acyclic parts of the molecule are iteratively identified. In the first cycle all terminal bonds, i.e. bonds to atoms that are connected to only one other atom, are searched. In the second step all bonds that lead to atoms whose other bonds are all terminal are marked as terminal. This is repeated until no more terminal bonds are found, which reduces the problem size significantly. Afterward the actual ring detection is started once on all non-terminal bonds. The BFS stops when no more bonds can be reached while walking through the molecule graph along neighboring bonds. Every time a ring closure is detected, the two paths are followed back until the starting bond (the root bond) is reached again to verify the found ring. Figure 2.3 shows the mechanism of the terminal bond search (a) along with two examples of a BFS run (b and c). Panel (d) illustrates how the paths are followed back to the root bond.
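The two stages described above, the elimination of terminal bonds and the breadth-first search for ring closures, can be sketched as follows. This is a simplified illustration and not the SPORES routine: instead of walking from a root bond in both directions, the hypothetical helper `smallest_ring_through` removes the root bond and searches the shortest remaining path between its two atoms, which yields the smallest ring containing that bond. All function names are assumptions.

```python
from collections import deque


def prune_terminal_bonds(bonds):
    """Iteratively remove bonds that lead to atoms with only one remaining neighbor."""
    remaining = {frozenset(b) for b in bonds}
    while True:
        degree = {}
        for bond in remaining:
            for atom in bond:
                degree[atom] = degree.get(atom, 0) + 1
        terminal = {b for b in remaining if any(degree[a] == 1 for a in b)}
        if not terminal:
            return remaining
        remaining -= terminal


def smallest_ring_through(root, bonds):
    """Smallest ring that contains the root bond, or None if the bond is acyclic."""
    start, goal = tuple(root)
    neighbors = {}
    for bond in bonds:
        if bond == root:
            continue  # the ring has to close without using the root bond itself
        a, b = tuple(bond)
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            ring = []
            while node is not None:  # follow the path back to the root bond
                ring.append(node)
                node = parent[node]
            return frozenset(ring)
        for nxt in neighbors.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def find_rings(bonds):
    """Set of rings (as frozensets of atom indices) found on all non-terminal bonds."""
    cyclic = prune_terminal_bonds(bonds)
    rings = {smallest_ring_through(bond, cyclic) for bond in cyclic}
    rings.discard(None)  # linker bonds between ring systems close no ring
    return rings
```

For a naphthalene skeleton with an attached two-atom chain, the pruning removes the chain and the search returns the two six-membered rings but not the 10-membered envelope ring.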

2.5.3. Known Residues

The atom and bond typing in proteins is easier than for other organic molecules, since proteins consist of known amino acid residues. Prior to the atom and bond typing routines, the residues of the treated molecule are checked for a match with one of the known amino acids. In these cases the atom- and bond-types are applied directly to the substructure of the molecule from the residue tables implemented in SPORES. This residue check depends on correctly named residues and atoms in the processed pdb file. Atoms that cannot be assigned to one of the known residues are treated with the normal atom typing routine. At the moment the residue tables of SPORES contain only the twenty standard amino acids and the heme group; the latter was included to avoid problems with this complicated structure.

2.5.4. Atom-types

Atom-types are descriptors for atomic properties beyond the element information. In most cases they represent information about hybridization and the direct neighborhood of the atom. Some molecular force fields like AMBER [51, 52] or MMFF94 [53–57] use more detailed atom-types which reflect the functional group of which the atom is part or even the groups that are neighboring the considered atom. In SPORES the simpler TRIPOS convention [58] is used. Table B.1 in the appendix gives an overview of the atom-types used by SPORES. In general, this convention provides different atom-types only for the elements which are important for organic or bioorganic compounds.

Table 2.1.: The maximum length up to which a bond is considered to be a double (DB) or a triple (TB) bond for the elements with different atom-types. The angle criteria are an additional method to verify the hybridization for atoms with two neighbor atoms.

Element   max DB length [Å]   max TB length [Å]   sp² angles [°]   sp angles [°]
C         1.45                1.30                111 < χ < 165    165 < χ < 195
N         1.44                1.30                -                165 < χ < 195
O         1.32                -                   -                -
S         1.74                -                   -                -

For most metals and other elements only one generic type exists. Determining the correct atom-type for a given atom means finding the correct hybridization, which has to be determined from geometric properties. In SPORES the same routine for atom-type determination is used whether hydrogen atoms are present in the structure or not. When no hydrogen atoms are present, the determination of the correct atom-type and therefore of the hybridization is much more difficult than when all hydrogen atoms are present, since in the latter case the hybridization can essentially be determined from the number of direct neighbors. The atom-type setting is conducted by a multi-step approach beginning with the number of neighboring atoms. As most atoms in organic compounds follow the octet rule [86, 87], the maximum number of neighbors for each element is known. If this maximum number of neighboring atoms is present, the atom-type can be set accordingly. If the number of neighbors is lower, additional information has to be taken into account. In the case of terminal atoms, when only one neighbor atom is available, the length of the bond is the only way to discriminate between different hybridizations. This is especially needed for oxygen atoms, to distinguish between hydroxyl and carbonyl or aldehyde groups, and for the thio equivalents in the case of sulfur atoms. If the atom has two neighbors, the lengths of the bonds and the angle between them are calculated to get the correct hybridization. For the bond angles different intervals are used depending on whether the atoms are in rings or not. For example, carbon atoms which are not part of a ring are considered to be sp² hybridized if they have two neighbors and a bond angle between 111° and 165°. Below 111° they are considered to be sp³ atoms and above 165° a distorted sp hybridization is assumed. The lower angle limit is changed to 102.5° if the atom is in a five-membered ring and raised to 114° for other rings. If the lengths of all attached bonds exceed 1.45 Å, which is set as the maximum length of a double bond from a carbon to any other atom, the sp³ hybridization is accepted. For the sp hybridization of carbon atoms a maximum length of 1.3 Å is used for the triple bond. Other elements are treated the same way, but with different parameters for the bond lengths and in some cases different angle parameters. Table 2.1 gives an overview for the most common elements.
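The angle and length criteria for a carbon atom with two neighbors can be summarized in a short sketch. It uses only the thresholds quoted in the text; the function names and the order of the checks (bond length test first, angle test second) are assumptions, and the additional length check for triple bonds is omitted.

```python
import math

MAX_DOUBLE_BOND_C = 1.45  # Angstrom, maximum double bond length for carbon (Table 2.1)


def angle_deg(center, a, b):
    """Angle a-center-b in degrees."""
    va = [p - c for p, c in zip(a, center)]
    vb = [p - c for p, c in zip(b, center)]
    cos_angle = sum(x * y for x, y in zip(va, vb)) / (math.hypot(*va) * math.hypot(*vb))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def carbon_hybridization(center, a, b, ring_size=None):
    """Classify a carbon atom with exactly two heavy-atom neighbors a and b.

    ring_size is None for atoms outside rings, otherwise the size of the ring.
    """
    # Assumed ordering: if all attached bonds are too long for a double bond,
    # sp3 is accepted regardless of the angle.
    if min(math.dist(center, a), math.dist(center, b)) > MAX_DOUBLE_BOND_C:
        return "sp3"

    if ring_size is None:
        lower = 111.0   # chain atoms
    elif ring_size == 5:
        lower = 102.5   # five-membered rings
    else:
        lower = 114.0   # all other rings

    angle = angle_deg(center, a, b)
    if angle < lower:
        return "sp3"
    if angle > 165.0:
        return "sp"
    return "sp2"
```

The same geometry can therefore be typed differently inside and outside a ring: a carbon with a 108° angle and short bonds is sp³ in a chain but sp² in a five-membered ring.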

If the atom has more than two neighbors, improper torsion angles are calculated, which allows for a better discrimination between sp² and sp³ hybridized atoms. Improper torsions are used instead of real torsion angles because they only need the first sphere of neighbor atoms. They can therefore be calculated up to one atom closer to the end of a chain, and chain ends are commonly the most problematic regions for the structure recognition. For each atom with three direct neighbors all permutations of improper torsions are calculated. For carbon atoms an improper torsion value between 145° and 185° is regarded as evidence for an sp² center, while lower or higher values are counted as evidence for an sp³ center. After the atom-types are set according to the hybridization, the neighborhood of the atoms is considered to set the more specialized atom-types like N.am for nitrogen in amide groups or C.cat for the carbon cation in the guanidine group. In some cases SPORES extends or changes the original TRIPOS convention. For example, the atom-type N.am is reserved in the original convention for the nitrogen atoms in amide groups of proteins [58] and carboxylic acid amides, but in SPORES it is also used for analogous groups like sulfonamides and phosphoramides. For other cases like nitro groups no strict convention exists and different possibilities are used in different tools. The main issue with the use of different atom-types for the same groups is that different atom-types imply different bond-types.

The GOLD [88–90] docking program, for example, expects a nitro group to be formed of two sp² hybridized oxygen atoms of the O.2 type and one sp² hybridized planar nitrogen of the N.pl3 type connected by double bonds. With this convention the partial charges are not correctly represented. Therefore SPORES uses two O.co2 types for the oxygen atoms, a type used for deprotonated carboxylic groups in the original convention. For the nitrogen an N.pl3 type is used as well. The bonds of the nitro group are of the aromatic type in SPORES, which corresponds to a bond order of 1.5. With these atom- and bond-types the neutrality of the nitro group is represented well, with two negatively charged oxygen atoms with a formal charge of -0.5 each and the central positively charged nitrogen with a formal charge of +1. Different conventions also exist for other functional groups. In most of these, different atom- and bond-types are used to represent the same group. Examples are phosphate, sulfate and guanidine groups.


Figure 2.4.: The ring normal criterion shown for two planar rings (left), of which one is not aromatic due to the keto group, and for a slightly distorted part of a molecule where one ring is recognized as aromatic and the other is not (right). The vectors from the ring atoms to the center of mass are shown in green; the normals calculated from these vectors are shown in cyan.

2.5.5. Aromaticity

For aromaticity only rings with fewer than nine and more than four members are considered. The main criterion for aromaticity in SPORES is the planarity of the ring system. The Hückel rule is not applied, because an exact definition of aromaticity is not of interest in the case of SPORES. The main reason for aromaticity detection is to apply the C.ar and N.ar atom-types to carbon and nitrogen atoms in six-membered aromatic rings. In all other planar rings the normal C.2 type for sp² carbon atoms and the N.2 and N.pl3 types for sp² nitrogen atoms with two and three neighbors, respectively, are used according to the specifications of the TRIPOS force field. The additional expense of an extra check for sp² atoms in rings is justified, as the normal atom typing has problems with ring systems because they tend to have somewhat different bond angles and lengths compared to the same atom-types in free chains. First, all atoms of the ring are checked for having three or fewer neighbor atoms. Afterward the planarity check is done. In a first check, for each ring atom the vectors to its ring neighbors and the cross product of these vectors are calculated. Then the scalar product of the resulting normal vector and the vector from the atom to the ring center is calculated. If the absolute value of this scalar product exceeds 0.2, the atom is rejected for planarity. In a second check the vectors from the ring center to all atoms of the ring are calculated. Then the normal vectors of all pairs of these vectors are averaged. Next, the scalar product of this averaged normal vector with each vector between a ring atom and the center is calculated. If the absolute values of all but one of these scalar products do not exceed 0.07, the ring is considered planar. The normal vectors for an aromatic ring and a non-aromatic ring are shown in Figure 2.4. Parallel to this, the presence of quinone-like structures is checked based on the bond lengths to neighboring atoms which are not part of the ring. For this check carbon, nitrogen, oxygen and sulfur atoms are considered, with different thresholds below which the ring is set to a quinone-type ring: 1.30 Å for oxygen and nitrogen, 1.35 Å for carbon and 1.74 Å for sulfur.
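The second planarity check can be sketched as follows. Several details are not specified in the text and are assumptions here: the ring center is taken as the unweighted centroid, all vectors are normalized to unit length before the scalar products are compared with 0.07, and the normals are built from pairs of consecutive ring atoms only. The function name `ring_is_planar` is hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    length = math.sqrt(_dot(v, v))
    return tuple(x / length for x in v)


def ring_is_planar(coords, tolerance=0.07):
    """coords: ring atom positions in ring order. Returns True for planar rings."""
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    # Unit vectors from the ring center to every ring atom.
    spokes = [_unit(_sub(c, center)) for c in coords]
    # Normals of consecutive spoke pairs, averaged to one ring normal.
    normals = [_unit(_cross(spokes[i], spokes[(i + 1) % n])) for i in range(n)]
    mean_normal = _unit(tuple(sum(v[k] for v in normals) for k in range(3)))
    # A planar ring has spokes perpendicular to the ring normal.
    deviations = [abs(_dot(mean_normal, s)) for s in spokes]
    # "All but one" of the scalar products have to stay below the tolerance.
    return sum(d > tolerance for d in deviations) <= 1
```

With these assumptions an ideal hexagon passes the check, while a chair-like six-membered ring whose atoms alternate 0.25 Å above and below the mean plane fails it.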

2.5.6. Standard Protonation

With the information about connectivity and hybridization the standard protonation is assigned to the molecule. The SPORES standard protonation is a model for the protonation of organic molecules at physiological pH. The protonation is created with a rule-based method, and the hydrogen atom coordinates are calculated from the heavy atom coordinates according to the hybridization and the expected geometry. The neighbors of the considered heavy atom are used to generate a near optimal placement of the hydrogen atoms. As bond lengths for the hydrogen atoms the equilibrium distances of the TRIPOS force field [58, 62] are used. The protonation of basic and acidic groups is generated purely rule-based in this routine. Amines are protonated to their quaternary (ammonium) form, while hydroxylamines and nitrogen atoms in six-membered aromatic rings are deprotonated, as are all acidic groups like phosphate, sulfate and carboxylic groups. A special problem for protonation are heterocycles containing more than one hetero atom. When nitrogen atoms are present, it has to be checked whether they are protonated or not. For six-membered aromatic rings the nitrogen atoms are considered deprotonated. When the aromatic ring system is broken up by quinone-like substructures, the nitrogen atoms are checked for being of the amide type, in which case they need to be protonated. Nucleobases are common examples of this structural feature. In five-membered rings the situation is more difficult. Often it is impossible to decide whether a nitrogen atom should be protonated or not when only the lengths of the attached bonds are considered. SPORES therefore also checks the difference between the lengths of the two bonds and uses other factors like the presence of an already saturated hetero atom, which would mean that the current atom is needed for the two double bonds of the ring. If the first nitrogen atom was excluded from protonation for geometric reasons and the second is inconclusive, the second is protonated. For larger rings SPORES uses the same methods as for non-ring atoms.

2.5.7. Bond-types

The assignment of bond-types is the most challenging part of the structure recognition. Here SPORES again uses the TRIPOS convention, which uses the obvious single, double and triple bond notation but also has special types for aromatic and amide bonds. Besides this, a "no bond" type is available, which is used by SPORES internally to describe coordinative bonds between hetero atoms and metal ions. Especially in large conjugated systems the correct placement of double and single bonds is a nontrivial problem. SPORES uses bond lengths as the primary measure for the bond-types. The problem of conjugated systems is solved by an iterative method that tries to find the optimal placement of double and single bonds. Again a multi-step approach is used. At first all bonds to hydrogen atoms are excluded, because they are always single bonds. Then all bonds between sp² atoms are checked and the first set of double bonds is placed. The checks performed for each bond depend on the element types of the atoms that form the bond (see Table 2.2). Nitrogen and carbon atoms in sp² hybridization can form one or two single bonds in addition to a double bond. Therefore, for such sp² centers it has to be checked which of the bonds are single and which are double bonds. The length of the bond is used to decide between the single and the double bond-type in these cases; the maximum lengths for double bonds are given in Table 2.1. For oxygen and sulfur atoms in sp² hybridization the length check is omitted, because in normal cases they will not have an additional single bond if they form a double bond to another atom. At the same time amide bonds are set to their bond-type. Delocalized bonds which are not part of an aromatic ring, like the carbon-oxygen bonds in deprotonated carboxylic groups, are given the aromatic type, and triple bonds are assigned between sp hybridized atoms. The aromatic bond-type is used for carboxylic, nitro, benzamidine, and guanidine groups. With the first set of double bonds the iterative part is started, which only changes single and double bonds between heavy atoms but does not change any other bond-type. In this process atoms which do not fulfill the octet rule are identified by counting the bonds according to their order. Amide bonds are counted as single bonds and for aromatic bonds a bond order of 1.5 is used. If the bond count of an atom is too low, a double bond is placed to a neighbor, provided that the bond was not changed during this or the previous iteration. If the bond count is too high, a double bond of the atom is reduced to a single bond, again only if it was not changed in the current or the previous iteration. This leads to a new set of double bonds to which the same method is applied again. The algorithm terminates if no more bond count violations are detected or if a maximum number of iterations is reached. In the last step of the bond-type assignment SPORES tries to resolve the remaining violations, and atom-types are adjusted to the generated set of double and single bonds. The violations resolved in this step are mainly erroneous atom-types like sp³ centers between sp² centers in substructures which were thought to be conjugated systems but for which no solution for the bond placement could be found. In this case the atom is set to its sp³ atom-type and the protonation is adjusted accordingly.
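The bond counting that drives the iterative part can be illustrated with a small sketch. Only the detection of violations is shown, not the iterative repair. The bond-type labels follow the mol2 notation ("1", "2", "3", "ar", "am"); the function name, the data layout and the table `EXPECTED` of bond counts for neutral atoms are assumptions, so charged groups such as carboxylates are not handled.

```python
# Bond orders used for counting: amide bonds count as single, aromatic as 1.5.
BOND_ORDER = {"1": 1.0, "2": 2.0, "3": 3.0, "ar": 1.5, "am": 1.0}

# Expected bond count of neutral atoms (assumption for this sketch).
EXPECTED = {"H": 1, "C": 4, "N": 3, "O": 2}


def bond_count_violations(elements, bonds):
    """elements: list of element symbols; bonds: list of (i, j, bond_type).

    Returns a dict mapping atom index to "too low" or "too high".
    """
    counts = [0.0] * len(elements)
    for i, j, bond_type in bonds:
        counts[i] += BOND_ORDER[bond_type]
        counts[j] += BOND_ORDER[bond_type]

    violations = {}
    for idx, (element, count) in enumerate(zip(elements, counts)):
        difference = count - EXPECTED[element]
        if abs(difference) > 1e-9:
            violations[idx] = "too low" if difference < 0 else "too high"
    return violations
```

For formaldehyde with a C-O single bond both heavy atoms are reported as "too low", which is the situation in which the iterative method would place a double bond between them.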

2.5.8. Protonation states

2.5.8.1. Small Molecules

Table 2.2.: Matrix for the decision about the initial set of double bonds. In some cases no length check is done because the specific atom cannot have more than one double bond in the sp² hybridization. An element symbol in parentheses after the hybridization indicator means that the sp² geometry of this atom is regarded as sufficient to set the bond between the two atoms to a double bond.

     C                        N                        O                   S
C    hybridization / length   hybridization / length   hybridization (O)   hybridization (S)
N    hybridization / length   hybridization / length   hybridization       hybridization
O    hybridization (O)        hybridization            hybridization       hybridization (O)
S    hybridization (S)        hybridization            hybridization (O)   hybridization

In SPORES four different methods to generate protonation states for small molecules are implemented. One is a purely combinatorial approach, while the others are based on pK_a values calculated with the MARVIN [36] software. The first method, the combinatorial approach, starts from the SPORES standard protonation and uses predefined functional groups. It generates protonation states by adding and removing single hydrogen atoms to and from these groups, thus generating all combinatorially possible protonation states. The protonation states resulting from this method are referred to as combinatorial states. Table 2.3 gives an overview of the functional groups used for the generation of protonation states. The three methods in which SPORES uses MARVIN's ability of fast pK_a calculations are the filtered states, the MARVIN states and the micro-species states. Since MARVIN uses other bond- and atom-type conventions than SPORES for some functional groups, the first step of the combined workflow is a special atom and bond typer in SPORES that prepares the structure for the processing with MARVIN. To generate the filtered states SPORES uses the calculated pK_a values to filter the combinatorial states. The filter classifies the atoms into five groups, very acidic, acidic, neutral, basic and very basic, according to the group's pK_a value. A protonation state is rejected if a very acidic atom is protonated, if a very basic atom is deprotonated, or if at the same time an acidic atom is protonated and a basic atom is deprotonated. The pK_a thresholds for the classification of the groups were varied during the test of the method (see Chapter 3.2.3).

Since this method still uses the predefined groups for the protonation states, while MARVIN in some cases calculates pK_a values close to the physiological pH for other functional groups, the second method, the MARVIN states, was implemented. In this method all atoms for which a pK_a value was calculated are used for the combinatorial method. Here a simpler filter system with only two threshold values is applied. The two threshold values define the borders between which an atom with a given pK_a value is considered variable in its protonation. Atoms with a pK_a value below the acidic threshold are considered always deprotonated, while atoms with a pK_a value above the basic threshold are considered always protonated. Again the input structures specially generated for MARVIN were used to calculate the pK_a values. To test their influence on the protonation state distribution of docking datasets and on the docking results, the threshold values of the filter systems were varied.

Besides the pK_a calculation, MARVIN can directly generate micro-species distributions for a molecule at given pH values. To utilize these distributions as the micro-species states, the charge information given by MARVIN for the ligand atoms was used by SPORES to generate structures whose protonation corresponds to this charge information. In this case the only variable used in testing the generated set was the pH value for which the distributions were generated. Micro-species states which had a contribution of less than 1% to the total population of a molecule at the given pH value were not regenerated with SPORES.

Table 2.3.: The functional groups used for the generation of the combinatorial states and the filtered states. The groups are divided into four categories: basic groups protonated in the standard protonation, weakly acidic groups protonated in the standard protonation, weakly basic groups deprotonated in the standard protonation, and acidic groups deprotonated in the standard protonation.
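The rejection rule of the filtered states can be written down compactly. The following sketch enumerates all combinations of protonated and deprotonated sites and applies the three rejection criteria from the text. The function names are hypothetical, each site is simplified to a single protonated/deprotonated switch, and the threshold values are placeholders only, since the text states that the actual thresholds were varied during testing.

```python
from itertools import product

# Placeholder pK_a borders between the five classes (assumption, not the tested values).
THRESHOLDS = (2.0, 6.0, 8.0, 12.0)


def classify(pka, thresholds=THRESHOLDS):
    """Assign one of the five classes to a site according to its pK_a value."""
    very_acidic, acidic, basic, very_basic = thresholds
    if pka < very_acidic:
        return "very acidic"
    if pka < acidic:
        return "acidic"
    if pka <= basic:
        return "neutral"
    if pka <= very_basic:
        return "basic"
    return "very basic"


def filtered_states(pkas, thresholds=THRESHOLDS):
    """Return all accepted states as tuples of booleans (True = protonated)."""
    classes = [classify(pka, thresholds) for pka in pkas]
    accepted = []
    for state in product((True, False), repeat=len(pkas)):
        sites = list(zip(classes, state))
        if any(c == "very acidic" and protonated for c, protonated in sites):
            continue  # a very acidic site must not be protonated
        if any(c == "very basic" and not protonated for c, protonated in sites):
            continue  # a very basic site must not be deprotonated
        acid_protonated = any(c == "acidic" and protonated for c, protonated in sites)
        base_deprotonated = any(c == "basic" and not protonated for c, protonated in sites)
        if acid_protonated and base_deprotonated:
            continue  # protonated acid together with deprotonated base is rejected
        accepted.append(state)
    return accepted
```

With the placeholder thresholds, a molecule with one acidic site (pK_a 4) and one basic site (pK_a 10) keeps three of its four combinatorial states; only the state with the protonated acid and the deprotonated base is removed.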

2.5.8.2. Protein Protonation States

In SPORES, different protonation states for proteins are restricted to the surface atoms of defined binding sites. The surface atoms are identified with the same algorithm that PLANTS uses [75, 76, 91]. This method follows the early approach of Lee and Richards [92], on which the Connolly surfaces are based. A grid-based approach to simulate the rolling of the probe sphere along the molecular surface was proposed shortly after the method of Lee and Richards by Shrake and Rupley [93]. In SPORES the atoms of the protein are again stored in the cells of a grid with 5 Å resolution. For the placement of the probe spheres a second grid with a resolution of 0.4 Å is used. For each position of the probe sphere, which has a fixed radius of 1.4 Å, the neighboring atoms from the larger grid are checked for their distance to the probe sphere. If they are reasonably close to the sphere's position (closer than 2.5 Å plus the atom's vdW radius, but not closer than the atom's vdW radius plus the sphere's radius), they are regarded as solvent accessible from that position. As in the bond generation, the atom grid ensures that for every probe sphere position only a limited number of atoms around this position has to be checked. After the identification of the surface atoms, all of them that lie within the binding site are used to generate protonation states with the same combinatorial method that is used for small molecules. The binding site has to be predefined by PLANTS or GOLD. The restriction to the binding site limits the number of generated states and in particular excludes protonation changes in parts of the protein that do not participate in the binding of the ligand. The groups whose protonation can be changed are basically the same as those used for the combinatorial protonation states of small molecules. In terms of amino acids these are aspartic acid and glutamic acid with their carboxylic groups, lysine with its amine group, the two aromatic nitrogen atoms in histidine, the nitrogen atom in tryptophan and the phenolic hydroxyl group in tyrosine.
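The distance criterion for solvent accessibility can be sketched as follows. The two grids are omitted and the probe positions are simply passed in as a list. The function name `surface_atoms` is hypothetical, and it is an assumption of this sketch that a probe position overlapping any atom is discarded as a whole.

```python
import math

PROBE_RADIUS = 1.4   # Angstrom, fixed radius of the probe sphere
SHELL = 2.5          # Angstrom, maximum distance beyond the vdW radius


def surface_atoms(atoms, probe_positions):
    """atoms: list of (position, vdw_radius). Returns the set of accessible atom indices."""
    accessible = set()
    for probe in probe_positions:
        distances = [math.dist(probe, position) for position, _ in atoms]
        # Assumption: a probe that overlaps any atom is not a valid position.
        if any(d < radius + PROBE_RADIUS
               for d, (_, radius) in zip(distances, atoms)):
            continue
        # Atoms within 2.5 Angstrom plus their vdW radius are accessible from here.
        for idx, (d, (_, radius)) in enumerate(zip(distances, atoms)):
            if d < radius + SHELL:
                accessible.add(idx)
    return accessible
```

In SPORES the probe positions would come from the 0.4 Å grid and the atoms to be tested from the neighboring cells of the 5 Å atom grid, so that only a few distances have to be evaluated per probe position.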

2.5.9. Stereoisomers

Stereoisomers are generated in a two step method by first identifying all possible stereocenters and then combinatorial generating all possible stereoisomers. In the nontrivial case, where not all four first neighbor atoms are different, stereocenters are identified by pairwise comparing the neighbor chains via tree searches. If a branch in both chains is discovered, new tree searches are started. A branch in only one chain would mean a dissimilarity and therefore the searches would be finished. The criteria for identity of two atoms are element, atom-type and number of neighbor atoms. The inversion of the stereocenters is then done by a reflection. The two smallest neighbor chains are identified.

If these do not contain other stereocenters and are not part of the same ring system as the stereocenter, they are chosen for the reflection. The first atoms of the two neighbor chains not chosen for reflection are used along with the stereocenter to define the reflection plane. If it is not possible to use neighbor chains without stereocenters, these centers are inverted after the reflection to restore their configuration. As the reflection process often leads to clashes or other unfavorable conformations, the structures need to be energy minimized before they are used for other applications like docking. In this work all energy minimizations were executed with SYBYL and the TRIPOS force field. Figure 2.5 shows a ligand with its two stereoisomers and the plane used for the reflection that converts the two isomers into each other.

Figure 2.5.: The method for stereoisomer generation used in SPORES. When a possible stereocenter is identified the isomers are generated by a reflection of two neighbor chains at the plane formed by the stereocenter and the first atoms of the other two neighbor chains.
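The geometric core of the inversion, the reflection of two neighbor chains at the plane through the stereocenter and the first atoms of the other two chains, can be sketched in a few lines of Python. The function and parameter names are hypothetical and the sketch covers only the coordinate transformation; the selection of the chains, the handling of further stereocenters and the subsequent energy minimization are not included.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def reflect_chains(moving, center, anchor_a, anchor_b):
    """Mirror the coordinates in `moving` at the plane through three points.

    moving:   coordinates of all atoms in the two chains chosen for reflection
    center:   coordinates of the stereocenter
    anchor_a, anchor_b: first atoms of the two chains that stay in place
    """
    normal = _cross(_sub(anchor_a, center), _sub(anchor_b, center))
    length = math.sqrt(_dot(normal, normal))
    if length < 1e-8:
        raise ValueError("center and anchors are collinear; no unique plane")
    n = tuple(c / length for c in normal)

    reflected = []
    for p in moving:
        dist = _dot(_sub(p, center), n)  # signed distance of p to the plane
        reflected.append(tuple(c - 2.0 * dist * nc for c, nc in zip(p, n)))
    return reflected
```

For an ideal tetrahedral center the reflection simply exchanges the positions of the two moved substituents, which is why it inverts the configuration while leaving the stereocenter and the two anchor atoms untouched.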

2.5.10. Tautomers

2.5.10.1. Keto-Enol Tautomerism

Keto-enol tautomerism is handled by a simple search for oxygen atoms followed by an examination of their neighborhood. If either a ketone with an acidic hydrogen next to it or an enol is found, the other tautomer is generated by adjusting the atom and bond types, followed by an adjustment of the protonation. Afterward an energy minimization is required to adapt the molecular geometry to the new hybridization. The different patterns used for keto-enol tautomerism in SPORES are shown in Figure 2.6.


Figure 2.6.: The patterns used for keto-enol tautomerism in SPORES.

2.5.10.2. Tautomerism in Heterocycles

Heterocyclic aromatic ring systems which contain more than one nitrogen atom and have neighboring oxygen atoms, which are not part of the ring system, can have multiple different tautomers. This tautomerism is handled in SPORES separately from the normal protonation states. First, patterns required for this tautomerism are identified.

These patterns consist of ring systems containing proton donors and proton acceptors either directly in the ring system or in its neighborhood. The hydrogen donor can be an sp² nitrogen atom with at least one hydrogen neighbor or a hydroxyl group. The sp² hybridization of the nitrogen atoms is required to rearrange the double bonds after the hydrogen atom is shifted. The hydrogen acceptors can be either sp² oxygen atoms in carbonyl groups or sp² nitrogen atoms which are part of a double bond. The possible tautomers are then generated with a combinatorial method that includes all donor-acceptor pairs, by moving the hydrogen atom from the donor to the acceptor and adjusting the bond orders between them. Figure 2.7 shows an example of a guanine molecule with all tautomers generated by the algorithm.
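The combinatorial treatment of donor-acceptor pairs can be illustrated with a strongly simplified Python sketch. It only tracks which sites carry a mobile hydrogen and repeatedly shifts one hydrogen from a protonated site to an unprotonated one until no new pattern appears. The site labels and the function name are hypothetical, and the adjustment of bond orders as well as any check for chemical validity are deliberately left out, so the sketch over-generates compared with a real tautomer enumeration.

```python
from collections import deque


def enumerate_shift_states(sites, protonated):
    """Collect all protonation patterns reachable by single hydrogen shifts.

    sites:      labels of all donor/acceptor sites of one pattern (hypothetical)
    protonated: labels of the sites that carry a mobile hydrogen initially
    """
    sites = list(sites)
    start = frozenset(protonated)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for donor in state:              # a site that currently holds a hydrogen
            for acceptor in sites:       # a site that could receive it
                if acceptor in state:
                    continue
                shifted = (state - {donor}) | {acceptor}
                if shifted not in seen:
                    seen.add(shifted)
                    queue.append(shifted)
    return seen
```

A call such as `enumerate_shift_states(["a", "b", "c"], ["a"])` returns one pattern per possible position of the single hydrogen. In a full implementation each pattern would additionally need consistent bond orders before it is accepted as a tautomer.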

2.5.11. Ring Conformers

SPORES is able to generate different conformers of saturated 5- and 6-membered rings.

For 6-membered rings, only the different chair and boat conformations are generated by SPORES. The ring is searched for sp² centers, which cannot be flipped, and for neighboring rings, which prevent corner flips. If neither sp² atoms nor neighboring rings are present, all possible ring conformers are generated by repeatedly flipping single corners. The flipping itself is done by two rotations. Unlike other published tools [94], the method used in SPORES does not require bond breaking or the introduction of dummy atoms. The first part of the corner flip is similar to the method presented by Goto et al. [95], in which an axis is laid through the two ring atoms neighboring the corner that is to be flipped. A 90° rotation around this axis is applied to the corner atom and its neighbor substructures that are not part of the ring. These structures are identified by tree searches similar to the ones used in the stereocenter inversion. Unlike Goto's method, which stops at this point of the corner flip, SPORES then adjusts neighbor
