Datasets - Graph-Based Approaches to Protein Structure Comparison

Abbr.     Algorithm
BFPH      Bin-FingerPrints using the Hamming distance
BFPJ      Bin-FingerPrints using the Jaccard coefficient
BK        Bron-Kerbosch algorithm
CB        CavBase approach
GAVEO     Graph Alignment Via Evolutionary Optimization
GAVEO*    GAVEO + original similarity measure
GAVEOc    GAVEO + preserved clique
GH        Greedy Heuristic
FPH       -FingerPrints using the Hamming distance
FPJ       -Fingerprints using the Jaccard coefficient
FFP       Fuzzy Fingerprints
RW        Random Walk kernel
SA        Sequence Alignment (Smith-Waterman)
SEGA      SEmi-global Graph Alignment
SEGAHA    SEGA using the Hungarian Algorithm
SP        Shortest Path kernel
SPSA      Shortest Path kernel with Sequence Alignment

Table 5.1: Algorithms used during the experiments.

This is followed by an assessment of the algorithmic performance of the different approaches when confronted with different levels of structural and mutational distortion. Section 5.7 presents a number of classification experiments on different datasets used to assess the performance and suitability of the presented methods for classification tasks, as the main goal of the graph comparison algorithms presented in this thesis is to discriminate between different classes of protein binding sites. Section 5.6 presents results for another typical application of protein structure comparison tools, the retrieval of similar structures from a reference dataset.

In Section 5.8, the suitability of the presented algorithms for comparison tasks beyond experimentally derived structures and protein binding sites is addressed.

5.1 Datasets

The protein binding sites used in the experiments were retrieved from the CavBase database (Schmitt et al., 2002), which is part of the ReliBase+ database hosted by the Cambridge Crystallographic Data Centre (CCDC) (Hendlich et al., 2003). As of June 2011, CavBase contains 308,141 cavities extracted from 70,850 PDB entries.

From this database, several smaller datasets were constructed, each with a different objective in mind, to assess the performance of the different algorithms. This was necessary, since an all-against-all comparison of all these structures is infeasible without the use of high-performance computing facilities.

As one major application for the developed approaches is the classification and comparison of protein binding sites with respect to the accommodated ligand, an initial benchmark dataset was constructed by drawing from the two most highly populated groups of binding sites in the CavBase database. Thus, a two-class classification dataset was constructed, which contained protein binding sites known to host either adenosine-5’-triphosphate (ATP) or nicotinamide adenine dinucleotide (NADH). Both molecules act as cofactors for a plethora of functionally and phylogenetically diverse proteins and can bind to the proteins in different conformations.

Hence, two randomly drawn cavities, even if hosting the same ligand, do not necessarily share a common geometric architecture. Therefore, a subset of these two groups was selected.

The main purpose of this initial dataset was to assess the classification performance of the previously introduced approaches, especially the global methods. On the one hand, this demands a dataset in which binding pockets of the same class show some structural resemblance, which is obviously not the case for the complete set of ATP or NADH binding pockets. On the other hand, since all approaches were developed to tolerate structural variance to a certain extent, the binding pockets should not be too similar either. Hence, instead of simply selecting proteins based on sequence similarity, cavities were selected by drawing on ligand information.

More precisely, ligands were superimposed using Kabsch’s algorithm (Kabsch, 1976) and then clustered according to the root mean squared deviation (RMSD) of the superimposed molecules. Subsequently, subsets were selected for which the difference in RMSD ranged up to 0.4 Å. This ensures that the ligands are bound in at least a similar conformation, although not necessarily in a similar orientation. The RMSD difference threshold can be regarded as a compromise between conformational similarity of the ligand and dataset size. At the given threshold, a sufficiently large dataset (ATP/NADH dataset) for an all-against-all comparison was obtained, containing 355 protein binding sites in total: 214 NADH binding sites and 141 ATP binding sites. To keep the runtime requirements low, no further cavities were included, although the threshold is well below the accepted RMSD difference of a successful docking solution, which would be 2 Å (Verdonk et al., 2008).
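The superposition step described above can be illustrated with a minimal NumPy sketch. It assumes that both ligands are given as coordinate arrays with identical atom ordering; the function name kabsch_rmsd is purely illustrative and not taken from the thesis, and the clustering procedure itself is not shown.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumption: row i of P and row i of Q describe the same atom.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove the translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Covariance matrix of the centered coordinates and its SVD.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate P onto Q and measure the remaining deviation.
    diff = (R @ Pc.T).T - Qc
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

The pairwise values returned by such a function could then serve as the distance input to any standard clustering routine.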

While this first dataset ensures a certain similarity of the ligand conformations by enriching structurally similar binding sites, it also favors binding sites belonging to proteins that are related on the sequence and/or fold level. Indeed, the dataset contained only 15 different folds. A main motivation for the development of the presented approaches was to uncover non-trivial similarities, i.e., similarities not apparent on the sequence or fold level.

Hence, a more challenging ATP-NADH dataset (subsequently termed 1-fold ATP/NADH dataset) was created, including only remotely similar binding sites that belong to different folds according to the SCOP (Structural Classification Of Proteins) database (Murzin et al., 1995). Proteins taken from the complete set of ATP and NADH binding proteins stored in CavBase were filtered to include only one protein per SCOP fold, creating a non-redundant dataset. The resulting dataset contained only binding sites of proteins that do not exhibit similarity on the fold level, although a structural similarity between the binding sites themselves might still exist, based on the notion that proteins with different folds can still accommodate similar ligands. The dataset was constructed to assess whether a comparison of the protein binding sites alone can still retrieve similarities, even if none are apparent on the fold level and the global protein structure is therefore different.
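The one-protein-per-fold filter can be sketched as follows. This is only an illustration: the text does not state how the representative of each fold was chosen, so the sketch simply keeps the first protein encountered per fold, and the identifiers and fold labels in the usage example are hypothetical placeholders.

```python
def one_protein_per_fold(proteins):
    """Keep a single protein per fold label.

    `proteins` is an iterable of (protein_id, fold_label) pairs.
    Assumption: the first protein seen for a fold is kept as its
    representative; the thesis does not specify the selection rule.
    """
    representatives = {}
    for protein_id, fold in proteins:
        # setdefault only stores the value if the fold is not present yet.
        representatives.setdefault(fold, protein_id)
    # Dictionaries preserve insertion order, so folds appear in the
    # order in which they were first encountered.
    return [(pid, fold) for fold, pid in representatives.items()]


# Hypothetical input: placeholder identifiers and fold labels.
example = [("P1", "fold_A"), ("P2", "fold_A"), ("P3", "fold_B")]
non_redundant = one_protein_per_fold(example)
```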

To obtain a suitable parameter setting for the subsequent experiments, another four-class dataset was constructed by randomly drawing 50 cavities per class from all cavities containing either ATP, NADH or FAD (flavin adenine dinucleotide), three of the most abundant ligands in CavBase. To also include a more rigid group of structures, a fourth class was added, corresponding to cavities containing a porphyrin ring as a ligand.
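Drawing a fixed number of cavities per ligand class amounts to stratified random sampling without replacement. The following sketch uses only the Python standard library; the function name, the seed and the cavity identifiers are assumptions made for illustration and do not reflect the tooling actually used.

```python
import random


def draw_per_class(cavities_by_class, n_per_class=50, seed=0):
    """Randomly draw `n_per_class` cavities from every class.

    `cavities_by_class` maps a class label (e.g. a ligand name) to a
    collection of cavity identifiers. A fixed seed makes the draw
    reproducible; the seed value itself is an arbitrary assumption.
    """
    rng = random.Random(seed)
    sample = {}
    for label, cavities in cavities_by_class.items():
        # Sorting gives a deterministic starting order even for sets.
        # random.sample raises ValueError if a class is too small.
        sample[label] = rng.sample(sorted(cavities), n_per_class)
    return sample


# Hypothetical pools of cavity identifiers for two classes.
pools = {
    "ATP": {f"atp_{i}" for i in range(200)},
    "NADH": {f"nadh_{i}" for i in range(300)},
}
selected = draw_per_class(pools)
```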

In the retrieval experiments, a high-resolution subset of the CavBase database was used to keep the runtime requirements at a manageable level. This was necessary due to the comparatively high runtime requirements of some of the algorithms. By including only cavities derived from protein complexes with a resolution of 2.5 Å or better in the high-resolution subset, the number of necessary comparisons could be reduced by one-third compared to using the complete set of cavities in CavBase. The final dataset (HiRes) contained 186,507 cavities but still constituted a representative subset of the complete CavBase.

As a further external benchmark set, a set of representative proteins constructed for the evaluation of SiteEngine was included in the experiments. SiteEngine, as mentioned in Chapter 2, is another surface-based protein comparison approach which operates on a concept of binding pockets similar to CavBase. This dataset was originally compiled to include several structurally different classes of proteins, among them fatty acid-binding proteins, serine proteases, proteins binding adenine-containing ligands and others (for a detailed description, see Shulman-Peleg et al., 2004). A summary of the dataset, including the classes defined by Shulman-Peleg et al. and the corresponding PDB codes, can be found in Table A.1. Since the SiteEngine model differs from the CavBase model in some aspects, especially the modeling and extraction of the binding sites, the PDB codes were used to extract the associated cavities from CavBase and thus construct a set of corresponding cavities.

Another externally compiled benchmark dataset used in the retrieval and classification experiments is the Astex Non-native Set (Verdonk et al., 2008), initially constructed to assess the performance of docking algorithms. The dataset is based on the Astex Diverse Set (Hartshorn et al., 2007), another docking benchmark dataset containing 85 different high-quality protein-ligand complexes, in which each ligand is represented once. The Astex Non-native Set contains different conformations of the same protein targets that are addressed by the ligand (i.e., either apo structures or structures complexed with different ligands), making it a more realistic and challenging docking benchmark dataset. The resulting benchmark set makes it possible to assess the performance of the different methods when confronted with structural variation that is mainly due to ligand-induced changes in protein conformation. Only structures with a resolution of 2.5 Å or better were included. After filtering and visual inspection, the Astex Non-native Set contained 1112 non-native structures for 65 of the original 85 ligands.

Additionally, to apply the approaches to a problem beyond the realm of binding sites, an HIV mutant sequence dataset was used. The objective here was to distinguish different HIV mutants. The sequence dataset was compiled from the HIV sequence database at Los Alamos National Laboratory in a previous study by Sander et al. (2007) and contained 1100 sequences of the HIV glycoprotein 120 (gp120) derived from clonal samples of 332 patients. By discarding duplicate sequences, a set of 514 mutant variants of the V3 loop of gp120 was derived. This region of the protein plays an important role in the cellular adhesion process and the infiltration of the host cell. Each of the mutant strains is annotated as being of the X4, R5/X4 or R5 phenotype, which indicates its capability of interacting with the two chemokine receptors CCR5 and CXCR4 and thus plays a dominant role in the progression of AIDS. To obtain a structural representation of the mutants, the protein structure of the HIV-1 JR-FL gp120 from the PDB (PDB code: 2b4c (Huang et al., 2005)) was used as a template to obtain a structure prediction with RCSB, a structure prediction algorithm using threading (Canutescu et al., 2003).

Sequences were aligned with MUSCLE (Edgar, 2004), and the V3 loop, ranging from residues 296 to 331 of PDB structure 2b4c, was used as a backbone template for modeling the mutant sequences. Of the 514 mutants, 82 contained insertions or deletions compared to the template sequence, which is likely to reduce the quality of the structure prediction. Hence, they were excluded from the dataset, yielding 432 mutually distinct sequences. Subsequently, the predicted structures were transformed to a pseudocenter representation according to the CavBase rules as introduced by Kuhn et al. (2006) and Schmitt et al. (2002).
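The two filtering steps, discarding duplicate sequences and excluding mutants with insertions or deletions relative to the template, can be sketched on rows of a multiple sequence alignment. The sketch assumes that gaps are written as "-" and that all rows have equal length; the function names and the toy sequences are hypothetical and are not real V3 loop data.

```python
GAP = "-"


def has_indel(template_row, mutant_row):
    """True if the mutant has an insertion or deletion relative to the template.

    An insertion is a column with a gap in the template but a residue in
    the mutant; a deletion is the opposite case.
    """
    if len(template_row) != len(mutant_row):
        raise ValueError("alignment rows must have equal length")
    return any((t == GAP) != (m == GAP)
               for t, m in zip(template_row, mutant_row))


def select_mutants(template_row, mutant_rows):
    """Drop duplicate sequences, then drop mutants containing indels.

    `mutant_rows` is a list of (name, aligned_sequence) pairs.
    """
    unique = {}
    for name, row in mutant_rows:
        # Duplicates are detected on the ungapped sequence; the first
        # occurrence is kept.
        unique.setdefault(row.replace(GAP, ""), (name, row))
    return [name for name, row in unique.values()
            if not has_indel(template_row, row)]


# Toy alignment (not real data): m2 duplicates m1, m3 has an insertion,
# m4 has a deletion, m5 differs only by a substitution.
template = "AC-GT"
rows = [("m1", "AC-GT"), ("m2", "AC-GT"), ("m3", "ACTGT"),
        ("m4", "A--GT"), ("m5", "AG-GT")]
kept = select_mutants(template, rows)
```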

With the sole exception of the ATP/NADH dataset, none of these datasets were used as test datasets during the development of the approaches.
