Web Server Development - Evolutionary coupling methods in de novo protein structure prediction

All visualization components described above were combined with a simple bioinformatics workflow system to create a CCMpred webserver using the Django Python web application framework [92] and the Celery task worker system [93].

Starting from a simple submission page, the user can provide a single protein sequence or a multiple sequence alignment, which is enriched with homologous protein sequences using HHblits [22] before residue-residue contacts are predicted using CCMpred. Once the prediction is complete, the user is redirected to the results page shown in Figure 10.2, where they can interactively explore the results of the contact prediction and optionally upload a 3D structure to compare predicted contacts to the structural data.


Figure 10.1: Interactive Visualization Components for the CCMpred webserver. a. The configurable contact map viewer can show summed score matrices, APC-corrected score matrices, physical contact maps and physical distance maps. Shown here is the APC-corrected predicted contact map and a distance map from the PDB structure. b. WebGLMol protein structure viewer with a custom addition of showing selected contacts as cylinders. c. Coupling matrix viewer showing a salt bridge interaction. d. Coupling matrix viewer showing a hydrophobic interaction.


Figure 10.2: CCMpred webserver user interface. a. Job submission page. The user can paste or upload a protein sequence or multiple sequence alignment, and the specified sequence will automatically be enriched with homologous sequences before the actual contact prediction begins. b. An example prediction page. When a user clicks on an (i, j) pair in the contact map viewer, the corresponding w_i,j(a, b) coupling matrix is loaded into the coupling matrix viewer and the contact is highlighted on a user-supplied 3D structure (if available). The log viewer shows the CCMpred run log, which records the pseudo-likelihood optimization and its convergence.


Chapter 11 Conclusion

Modern interactive visualization techniques, together with minimizing the time between a user's request and the response, allow structural biologists and protein engineering researchers who want to apply evolutionary coupling methods to a protein family to focus on their biological questions rather than on how to operate a tool correctly. By following these principles, the CCMpred webserver provides an intuitive and user-friendly tool to explore evolutionary coupling and its application in protein folding, and it has been successfully used in collaborations with structural biology groups. Furthermore, the visualizations convey an intuitive understanding of the coupling potentials learned from multiple sequence alignments and are useful in debugging and further improving contact prediction methods.

The visualization components developed in this work could be useful for developers of other websites, so a valuable further step would be to package the individual visualization components as BioJS components.


Part IV Generating Protein Sequences from Couplings

Chapter 12 Introduction to Computational Protein Design

Computational protein design is the part of protein engineering that uses computational methods to predict protein sequences that will adopt a desired fold and exhibit a desired activity [94]. Depending on the goal of the study, protein design can be employed to optimize or change ligand specificity [95], catalyze reactions [96], prefer one of several alternative conformations [97], or, most typically, to maximize thermostability [98].

The task of finding an optimal protein sequence can be seen as a complex optimization problem where the search space is made up of the space of possible amino acid sequences plus the possible conformations of their side chains. While the search space is already enormous when considering only these two factors, flexible-backbone models have been used more recently to further expand the dimensionality of the search space.

By using empirical and knowledge-based force field models that were further optimized for protein design tasks, a free energy “score” can be attached to a given configuration of sequence and conformation. While the search space cannot be fully enumerated for reasonably-sized proteins, discretizing side-chain conformations [99] and traversing the search space efficiently with Monte Carlo-based [100] or genetic algorithms [101] makes it possible to stochastically explore meaningful parts of the search space and at least find local optima. For small proteins, Dead End Elimination (DEE) [102] can be used to prune impossible areas of the search space to arrive at a global optimum.
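The stochastic search described above can be illustrated with a minimal Metropolis Monte Carlo sketch in Python. It is not the protocol of any particular design package: the function names, the temperature, and the toy energy (the number of mismatches to a hypothetical target sequence, standing in for a real force-field score) are all assumptions made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence, target):
    """Hypothetical stand-in for a force-field score: mismatches to a target."""
    return sum(1 for a, b in zip(sequence, target) if a != b)


def metropolis_design(target, steps=5000, temperature=0.5, seed=0):
    """Search sequence space with single-site mutations and Metropolis acceptance."""
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in target]
    current_e = toy_energy(current, target)
    best, best_e = list(current), current_e
    for _ in range(steps):
        # Propose a random amino acid at a random position.
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        new_e = toy_energy(current, target)
        delta = new_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[pos] = old  # reject: restore the previous residue
    return "".join(best), best_e


if __name__ == "__main__":
    sequence, energy = metropolis_design("MKTAYIAKQR")
    print(sequence, energy)
```

Accepting some uphill moves is what lets the search escape local optima. In a real design setting the energy would also depend on discretized side-chain conformations, which this sketch leaves out.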

Through Monte Carlo optimizations of a library of 108 protein structures, Kuhlman and Baker showed that 51% of the residues in the protein core of the lowest free energy models were identical to the residues appearing at these positions in nature [103], indicating that the stochastic exploration of sequence space by nature has already done a remarkable job at optimizing protein stability. However, by introducing artificial selection pressures that would not be observed in nature, protein design can help optimize proteins to even better fit the needs of humanity.


12.1 Covariation in Protein Design

Covariation has previously been used in protein design applications in the framework of protein sectors by Rama Ranganathan and colleagues. Using covariance statistics as a clustering similarity measure, proteins can be dissected into allosterically interacting groups of residues (sectors) that are mostly biochemically independent of one another [104, 66]. An artificial protein MSA was sampled with a Monte Carlo algorithm to match the covariance properties of an MSA of natural WW domains, and proteins from the artificial MSA were expressed. Mean sequence identity between artificial and natural sequences was low at 36%, as expected when sampling from per-column amino acid frequencies. Nevertheless, artificial sequences that took covariation into account folded into the native structure in 28% of experiments, compared with 67% for natural sequences, whereas sequences that took only single-column frequencies into account did not fold into the native structure [105].
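To make the single-column baseline concrete, the following Python sketch samples sequences from per-column amino acid frequencies only. The four-sequence alignment is a hypothetical toy example, not the WW domain data, and the function names are assumptions. In the toy alignment, columns 2 and 3 always pair a lysine with a glutamate, and site-independent sampling breaks that pairing in roughly half of the samples.

```python
import random
from collections import Counter


def column_frequencies(msa):
    """Per-column amino acid frequencies of an equal-length alignment."""
    freqs = []
    for col in range(len(msa[0])):
        counts = Counter(seq[col] for seq in msa)
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()})
    return freqs


def sample_independent(freqs, rng):
    """Draw one sequence, choosing each column independently of all others."""
    return "".join(
        rng.choices(list(col.keys()), weights=list(col.values()))[0]
        for col in freqs
    )


def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical toy MSA: columns 2 and 3 covary (K-E or E-K, never K-K or E-E).
    toy_msa = ["AKE", "AKE", "AEK", "AEK"]
    rng = random.Random(0)
    freqs = column_frequencies(toy_msa)
    samples = [sample_independent(freqs, rng) for _ in range(10)]
    for s in samples:
        best = max(identity(s, natural) for natural in toy_msa)
        print(s, "pair preserved" if s in toy_msa else "pair broken", best)
```

A sampler that also matches pairwise statistics would avoid the broken pairs, which is the intuition behind sampling an artificial MSA that matches the covariance properties of the natural one.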

Another application of covariation in a protein context is the work of Ollikainen and Kortemme [106]. Using 40 diverse protein domain MSAs and corresponding structures, ensembles of alternative backbone and side-chain conformations were generated, and low-energy sequences were then sampled using the Monte Carlo-based Rosetta toolkit. APC-corrected mutual information covariance scores were calculated for both the natural MSAs and the low-energy sequences obtained from protein design. Comparing the covariance scores of natural and designed sequences shows a significant overlap between covariances that is highest when using intermediate levels of backbone flexibility in the protein design process. On the amino acid level, similar amino acid pairs are found to be covarying in natural and designed sequences, except for interactions not modelled by the energy function underlying the protein design, such as cation-pi interactions.
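As an illustration of the covariance score mentioned above, the following Python sketch computes mutual information between alignment columns and applies the average product correction (APC), which subtracts from each pair score the product of the two column means divided by the overall mean. It is a minimal sketch under simplifying assumptions: no sequence weighting, no pseudocounts, no gap handling, and a hypothetical toy alignment. It is not the pipeline used in [106].

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    fi = Counter(seq[i] for seq in msa)
    fj = Counter(seq[j] for seq in msa)
    fij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in fij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((fi[a] / n) * (fj[b] / n)))
    return mi


def apc_corrected_mi(msa):
    """Return the raw MI matrix and the APC-corrected MI matrix."""
    length = len(msa[0])
    mi = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1, length):
            mi[i][j] = mi[j][i] = mutual_information(msa, i, j)
    # Mean MI of each column with all other columns, and the overall mean.
    col_mean = [sum(row) / (length - 1) for row in mi]
    overall = sum(col_mean) / length
    apc = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            if i != j and overall > 0:
                apc[i][j] = mi[i][j] - col_mean[i] * col_mean[j] / overall
    return mi, apc


if __name__ == "__main__":
    # Hypothetical toy MSA: columns 1 and 2 (0-based) covary, the others do not.
    toy_msa = ["AKEL", "AKEV", "AEKL", "AEKV"]
    raw, corrected = apc_corrected_mi(toy_msa)
    print(raw[1][2], corrected[1][2])
```

The correction is intended to remove background signal shared by whole columns, for example from phylogeny or column entropy, so that the remaining score reflects pair-specific covariation.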
