Tools for Generating a Torsion Library

A.2 Tools and Libraries

A.2.2 Tools for Generating a Torsion Library

StructureProfiler tests that take electron density into account were introduced to SIENA by extending the ProteinFlexibilityLib. Additionally, a Python framework around SIENA and its result database was added to NAOMI. It processes the output created by the PDBDataExtractor to identify clusters and name them after their enzyme function.

python3 run_siena_analysis.py -e ECINFORMATION -i DIROFSIENAOUTPUT -d IDFILE -o OUTPUTDIR

It also allows, for example, the analysis of the interconnectivity of ensembles. The resulting graph can be stored in an SQLite database for future use by the Python framework of the GeohydeEvaluator. For this thesis, further output can also be created, e.g. in LaTeX, listing the number of unique PDB IDs and ligands per ensemble (Chapter 3).
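Storing such an ensemble graph in SQLite can be sketched as follows. This is a minimal illustration only: the table name `ensemble_edges`, its columns, the edge weight, and both function names are hypothetical and do not describe the schema actually used by the framework.

```python
import sqlite3

def store_ensemble_graph(db_path, edges):
    """Store undirected links between ensembles (hypothetical schema).

    edges: list of (ensemble_a, ensemble_b, weight) tuples.
    """
    con = sqlite3.connect(db_path)
    with con:  # commits the transaction on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS ensemble_edges ("
            "ensemble_a TEXT NOT NULL, ensemble_b TEXT NOT NULL, "
            "weight INTEGER NOT NULL, PRIMARY KEY (ensemble_a, ensemble_b))"
        )
        # Normalise the order so that each undirected edge is stored once.
        rows = [(min(a, b), max(a, b), w) for a, b, w in edges]
        con.executemany(
            "INSERT OR REPLACE INTO ensemble_edges VALUES (?, ?, ?)", rows
        )
    return con

def neighbours(con, ensemble):
    """Return all ensembles directly linked to the given one, sorted."""
    cur = con.execute(
        "SELECT ensemble_b FROM ensemble_edges WHERE ensemble_a = ? "
        "UNION SELECT ensemble_a FROM ensemble_edges WHERE ensemble_b = ?",
        (ensemble, ensemble),
    )
    return sorted(row[0] for row in cur)
```

Passing `":memory:"` as `db_path` keeps the graph in memory for quick experiments, while a file path makes it available to later runs.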

a specific molecule file and creating the output necessary for the validation strategy developed by Guba et al. The reimplementation is now based on the up-to-date NAOMI C++ code and uses, e.g., the recently published SMARTScompare algorithm for SMARTS matching [63]. The tool is intended to be published once the changes to the new torsion library are finished. All tool options are listed in the following, and examples are provided below.

• --outdir Location to store the output (required)

• --molfile File path to mols in sdf, will be stored in given database

• --database File path to (new) molecule database

• --initialtorsionlib Torsion lib to be analyzed

• --selectivematching (=false) Match only the most selective SMARTS pattern - default mode in NAOMI

• --useonlysinglebonds (=false) Only allow single bonds for matching

• --donotuseterminalbonds (=true) Do not use bonds to a terminal heavy atom

• --storeincsdhistograms (=true) True: store in CSD histograms, false: store in PDB histograms

• --sequential (=false) Switch to sequential calculation

• --startfrommol Start evaluation from specific mol in database

• --matchpatternwithatleastXhits (=0) Default: 0

• --extractmol Extract specific mol id from database

Update of the TorLib Statistics and Peaks

TorsionPatternMiner can update the statistics of a specified torsion library (--initialtorsionlib <TorsionLib>) with all data present in a multi-molecule SDF file (--molfile <multi mol sdf file>). All molecules are first stored in the given database file (--database <Database File>) and subsequently processed. The molecule database can then be reused in future runs. A run starting from a specific molecule is also possible (--startfrommol <FilePosition>).

TorsionPatternMiner uses the Intel Threading Building Blocks [24] for automatic parallelization across all available threads of the machine. If this behavior is undesired, the sequential mode should be activated (--sequential true).

To update the statistics according to [62], multimatching needs to be active (--selectivematching false). This means that, per bond, every matching torsion rule receives an increment in its statistics. If a pattern matches the torsion bond multiple times, e.g. because the element of the substituent at position 1 or 4 is left undefined, each match is added to the statistics.
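The difference between the two matching modes can be illustrated with a small counting sketch. The input layout (one list of matches per bond, ordered from most to least selective rule) and the function name `update_statistics` are assumptions made for illustration; they are not the tool's internal data structures.

```python
from collections import Counter

def update_statistics(bond_matches, selective=False):
    """Count hits per torsion rule.

    bond_matches: list with one entry per bond; each entry is a list of
    (rule_id, angle) matches, assumed to be ordered from the most
    selective to the least selective rule. A rule that matches a bond
    several times appears several times in that list.
    """
    counts = Counter()
    for matches in bond_matches:
        if selective:
            # Selective matching: only the most selective match counts.
            matches = matches[:1]
        for rule_id, _angle in matches:
            counts[rule_id] += 1
    return counts
```

With multimatching (`selective=False`), a generic rule that matches the same bond twice is counted twice, which corresponds to the behavior described above.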

TorsionPatternMiner also updates all peak records and adjusts their tolerances automatically if needed.

Since TorsionPatternMiner matches all available torsion rules to every bond in all given molecules, one may want to limit the types of bonds used for matching. It is possible to explicitly exclude all non-single bonds (--useonlysinglebonds true) as well as all bonds connected to a terminal heavy atom (--donotuseterminalbonds true).

The torsion library stores the statistics from the CSD in the histogram and histogram shifted XML tags. A second statistic per pattern can be stored in histogram2 and histogram2 shifted with --storeincsdhistograms false. TorsionPatternMiner always updates peaks based on the data in the histogram shifted tag of each pattern.
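The two bond restrictions can be pictured as a simple filter over the heavy-atom bond list of a molecule. The sketch below is an illustration under stated assumptions: the tuple representation of bonds and the function name `candidate_bonds` are hypothetical, and a terminal heavy atom is taken to be a heavy atom with exactly one heavy-atom neighbor.

```python
def candidate_bonds(bonds, only_single=False, skip_terminal=True):
    """Select bonds that are eligible for torsion rule matching.

    bonds: list of (atom_a, atom_b, bond_order) tuples between heavy atoms.
    The defaults mirror the documented option defaults
    (--useonlysinglebonds false, --donotuseterminalbonds true).
    """
    # Number of heavy-atom neighbors per atom.
    degree = {}
    for a, b, _order in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    kept = []
    for a, b, order in bonds:
        if only_single and order != 1:
            continue  # exclude non-single bonds
        if skip_terminal and (degree[a] == 1 or degree[b] == 1):
            continue  # exclude bonds to a terminal heavy atom
        kept.append((a, b, order))
    return kept
```

For a four-atom chain with one terminal double bond, only the central bond remains under the default settings.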

TorLib Statistics Analysis

Among the aforementioned command line options, two are particularly relevant for the quality analysis of the derived torsion library. It may be desirable to leave out sparsely populated patterns in single matching mode (--matchpatternwithatleastXhits 50).
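Conceptually, the population threshold acts as a filter on the hit count of each pattern. A minimal sketch, assuming the counts are available as a dictionary (the function name `patterns_with_min_hits` is hypothetical):

```python
def patterns_with_min_hits(hit_counts, min_hits=0):
    """Keep only patterns with at least min_hits hits.

    hit_counts: dict mapping a pattern identifier to its number of hits.
    The default of 0 keeps every pattern, mirroring the option default.
    """
    return {pattern: n for pattern, n in hit_counts.items() if n >= min_hits}
```

A threshold of 50 would keep a pattern with exactly 50 hits and drop one with 49.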

As each bond in the output is annotated with the molecule ID, this ID can be used to extract the specific molecule from the database (--extractmol <Molecule ID>) for a single-case analysis with the TorsionAnalyzer.

Torsion Rule Visualization

TorsionPatternMiner uses the parameter --visualizetorlib in combination with a torsion library and an output directory to convert each torsion rule into a text format. This output can then be converted into graphics to understand how the peaks correspond to the underlying histogram data. The conversion code is available in the Python package attached to the tool.

Examples

Create a new torsion library based on a given multi mol file and a torsion library hierarchy:

TorsionPatternMiner --out DIR --initialtorlib TOR_LIB --molfile MULTIMOLFILE --database mols.db --selectivematching false --storeincsdhistograms true

From the resulting output files, the new torsion library should then be used to check the quality of the peak determination by running the tool in single matching mode on the molecule set.

TorsionPatternMiner --out DIR --initialtorlib DIR/newtorlib.xml --database mols.db --donotuseterminalbonds true

The resulting bondanglesmatching.csv can then be analyzed with our Python script createpaperplots.py to generate the plot of torsion rules versus red flags (in per cent). It is advisable to compare the bondanglesmatching.csv file of the initial torsion library with the one generated by the new torsion library. Likewise, a different molecule set such as the Ligand Expo can be employed to test its agreement with the presented torsion library.

The Python script sortandcomparetorsionpatterns.py takes two such files as input and compares each bond, angle data triplet in terms of the matching torsion rule and the determined angle quality. If a data triplet matches a different torsion rule and/or receives a differing quality assessment, this is quantified in the output files and annotated with examples. This analysis was applied to the re-sorted TorLib to check for unwanted sorting effects until only reasonable switches were found.
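The kind of comparison described here can be sketched as follows. This is not the code of sortandcomparetorsionpatterns.py: the CSV column names (`mol_id`, `bond`, `rule`, `quality`) and both function names are assumptions chosen for illustration, since the actual file layout is not specified in this section.

```python
import csv
import io

def load_records(handle):
    """Map (mol_id, bond) -> (rule, quality) from a CSV with a header row.

    The column names are hypothetical.
    """
    return {
        (row["mol_id"], row["bond"]): (row["rule"], row["quality"])
        for row in csv.DictReader(handle)
    }

def compare_runs(old, new):
    """Quantify bonds whose matched rule and/or quality label changed."""
    summary = {"rule_changed": 0, "quality_changed": 0, "unchanged": 0}
    examples = []
    # Only bonds present in both runs can be compared.
    for key in sorted(old.keys() & new.keys()):
        rule_differs = old[key][0] != new[key][0]
        quality_differs = old[key][1] != new[key][1]
        if rule_differs:
            summary["rule_changed"] += 1
        if quality_differs:
            summary["quality_changed"] += 1
        if rule_differs or quality_differs:
            examples.append((key, old[key], new[key]))
        else:
            summary["unchanged"] += 1
    return summary, examples
```

The records of two runs could be loaded with `load_records(open(path, newline=""))` and passed to `compare_runs`; the returned examples list the affected bonds together with their old and new assignments.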

visualizeContTorScoreFromPatMiner.py takes as input the directory with the extracted data and visualizes the given patterns.

In the document Structure Profiling and Geometric Optimization of Protein-Ligand Complexes for the Scoring Function HYDE (pages 131-134)