
1.4.1 Database

During the analysis of a protein family, different types of data accrue. In our case, information about the species, genomes, sequencing projects, and publications, as well as the sequences themselves with different statistical analyses, is collected. To store this data without losing the relations between the entries, it is necessary to use a relational database, such as PostgreSQL (73). This database system is open source and free to use.

For each of the data types named above, a database table with specific columns was designed to store the corresponding information.

There are two ways to store the relations between tables. If the relationship is 1:n, a foreign key is used. For instance, one sequence belongs to exactly one species, but one species has n sequences; therefore, the corresponding species ID is stored in the sequence table. If the relationship is n:m, an additional join table is designed to store the two foreign keys. For instance, one species is listed in many different publications, and a publication lists different species; therefore, the species ID and the publication ID are stored in the additional table.
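As a rough sketch, both kinds of relations could be expressed as a Rails migration (the Ruby on Rails framework is introduced in section 1.4.2); the table and column names below are illustrative and do not reflect the actual CyMoBase schema:

    class CreateExampleTables < ActiveRecord::Migration
      def self.up
        create_table :sequences do |t|
          t.text    :sequence
          t.integer :species_id       # 1:n - each sequence stores its species ID
        end

        # n:m - an additional join table stores the two foreign keys
        create_table :publications_species, :id => false do |t|
          t.integer :publication_id
          t.integer :species_id
        end
      end

      def self.down
        drop_table :sequences
        drop_table :publications_species
      end
    end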

Much information found on our web pages is generated automatically by different tools running in the background. If a sequence is changed or added, the molecular weight, the amino acid statistics, and the domain structure (using hmmpfam (74) and Pfam (34)) are calculated. If a new species is added, the NCBI taxonomy is imported, and it is refreshed once a day. Furthermore, the link to the detailed species page is added to NCBI LinkOut (75) and to the Encyclopedia of Life (76). Additionally, the Blast (77) search page of CyMoBase has to be updated, and therefore the underlying Blast database has to be refreshed as well.
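In a Rails application, such recalculations can be triggered with ActiveRecord callbacks. The following is a minimal sketch under this assumption; all helper methods are hypothetical and do not correspond to the actual CyMoBase code:

    class Sequence < ActiveRecord::Base
      before_save :calculate_statistics       # derived values stored with the record
      after_save  :update_external_resources  # external analyses and indices

      private

      def calculate_statistics
        self.molecular_weight = compute_molecular_weight  # hypothetical helper
        self.amino_acid_stats = compute_amino_acid_stats  # hypothetical helper
      end

      def update_external_resources
        run_domain_prediction            # hypothetical wrapper around hmmpfam/Pfam
        Sequence.refresh_blast_database  # hypothetical; keeps the Blast page current
      end
    end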

To offer other scientists the possibility to benefit from the sequence data and the corresponding statistics, CyMoBase (35) was developed. In CyMoBase, all published data and additional analyses of our group can be found.

1.4.2 Web application

Storing the incurred information and the corresponding metadata in a database is, however, not sufficient. For instance, it is not user-friendly to access the data using a terminal and the SQL language. Furthermore, it would be quite dangerous to allow everybody direct access to the database, because stored entries could easily be changed or even deleted.

One elegant way to share the data with colleagues is to create a web application. This kind of application has the advantage that the user does not have to install a tool and all its dependencies on the local computer. Only a web browser and an internet connection are necessary.

To deploy such a service, the first step is to decide which programming language fits the given task best. In our case, the programming language Ruby (78) was selected for nearly every project. This language has the advantage of being object-oriented. Furthermore, source code written in Ruby is easy to read and to understand, even without knowledge of the language. But the main reason for using Ruby in our group is the web framework Ruby on Rails (79).

The Ruby on Rails framework enables agile and rapid application development using generators, engines, and gems. Furthermore, the Ruby on Rails community is large, and code already exists for many tasks. The framework uses the Model/View/Controller concept, which allows the software engineer to structure the different parts of an application. A Model is used for creating and handling the data and is often associated with a database table. The benefit is that the programmer does not have to write SQL queries by hand, like “SELECT sequence FROM sequence WHERE sequence_id = 123;”, to get the sequence of interest, but can simply write “Sequence.find(123).sequence”. Everything is handled by ActiveRecord, which is part of the Ruby on Rails framework: the security checks, the validations, building the SQL query, and returning the result as an object. The Controller collects and prepares all necessary information for the View, which presents the data.
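To illustrate the Model/View/Controller split, the following is a minimal, hypothetical Rails sketch; the class, column, and file names are illustrative:

    # Model: app/models/sequence.rb
    class Sequence < ActiveRecord::Base
      belongs_to :species
      validates_presence_of :sequence   # validation handled by ActiveRecord
    end

    # Controller: app/controllers/sequences_controller.rb
    class SequencesController < ApplicationController
      def show
        @sequence = Sequence.find(params[:id])  # ActiveRecord builds the SQL
        # the corresponding View (show.html.erb) renders @sequence
      end
    end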

One additional reason for using Ruby on Rails is its ability to handle large amounts of data. In our internal database, we have more than 29,000 manually annotated sequences, 50 proteins, and 1,200 species. One reason for this scalability is the framework's support for different caches. This means that, for example, the statistics page of diArk (Chapter 2.3, page 90) does not have to be recalculated every time a user visits the page. Only if new data was added to the database behind diArk are the graphs recalculated, which takes about 30 seconds. Using the cached version, the user sees the statistics page in less than 1 second.
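As a sketch of how such a page can be cached in Rails (here using action caching; the controller and method names are hypothetical, not diArk's actual code):

    class StatisticsController < ApplicationController
      caches_action :index              # serve the rendered page from the cache

      def index
        @graphs = Statistic.generate_graphs  # hypothetical, expensive (~30 s)
      end
    end

    # after new data has been imported, the cached page is invalidated, e.g.:
    # expire_action :controller => 'statistics', :action => 'index'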

1.4.3 Development

To keep track of the source code changes made by the different members of our group, we use the source code versioning and revision control software Subversion (80) and git (81). In principle, this means that the software developer first has to check out the latest version of the source code from the repository. After adding new features or bug fixes, the developer commits the changes, together with comments, back to the repository.

Furthermore, an editor for the source code is essential. Of course, the default editor of the operating system can be used. But normally, these editors do not offer syntax highlighting, cannot manage all files of a project, and cannot highlight the changes in the source code compared to the repository. Therefore, a more powerful editor, like NetBeans, TextMate, or Sublime Text 2, should be used.

Normally, only one or two browsers are installed on the local computer. But to be sure that every user of the web application sees the same design and gets the same functionality, the application has to be tested with different web browsers and even with different versions of them. Therefore, virtual machines with different operating systems and different browsers can be created.

In our group, different web applications are developed, and hardly any two of them use the same Ruby version and set of gems. To avoid conflicts, we use the Ruby Version Manager (82) to set up a separate environment for each project.

1.4.4 Deployment

Using the same server for development and for public access is risky. If a change in the source code disturbs or even breaks the application, it is directly passed on to the users. Furthermore, if there is a security hole in one of the running applications or servers, the main server could get hacked.

Therefore, we are using two different servers: one for development and one for the public.

But with this setting, it can be tricky to get the latest data and source code onto the public server. The source code of the application has to be deployed, the genomes and the corresponding images have to be copied, the users’ rights have to be set correctly, and the database and caching servers have to be restarted. Furthermore, the database has to be copied and cleaned up, because not all data in our internal database is published yet and must therefore not be publicly available. Doing all these steps by hand takes about 30 minutes. Therefore, we use Capistrano (83) for deployment. Different ‘recipes’ were developed for each of the steps mentioned above. Now it takes only one command in the terminal to perform every step in the background and to deploy the application.
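A minimal sketch of what such a Capistrano recipe might look like (assuming Capistrano 2 syntax; the task names, database names, and commands are illustrative, not our actual deployment scripts):

    namespace :db do
      desc "Copy the internal database to the public server and remove unpublished entries"
      task :publish, :roles => :db do
        run "pg_dump internal_db | psql public_db"    # hypothetical copy step
        run "psql public_db -f /path/to/cleanup.sql"  # hypothetical cleanup of unpublished data
      end
    end

    # hook the recipe into the standard deployment flow
    after "deploy:update_code", "db:publish"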

1.4.5 Set up a new database

Internally, we do not have only one database to store information about cytoskeletal and motor proteins. Setting up a completely new database does not only mean creating a new one with the existing database schema. Because all information about species, genome files, and sequencing projects is the same, it would be time-consuming to apply the same changes to each database separately. Therefore, we use the replication system Slony (84) for PostgreSQL. This system uses one master and many slave databases. Each change in a replicated table of the master database is immediately forwarded to the slaves.

Furthermore, our web applications are designed to work with different databases. Only a few minor changes have to be made in the configuration files, and the complete web interface and all analyses become available for the new database.
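For orientation, switching a Rails application to another PostgreSQL database essentially boils down to a connection setting like the following (the database name and credentials are hypothetical; in practice, these settings live in config/database.yml):

    require 'active_record'

    # point the application at a different PostgreSQL database
    ActiveRecord::Base.establish_connection(
      :adapter  => 'postgresql',
      :host     => 'localhost',
      :database => 'new_protein_db',   # hypothetical database name
      :username => 'rails_app'         # hypothetical role
    )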

1.4.6 NMR

To solve the structure of proteins, the nuclear magnetic resonance (NMR) technique can be used. Today, there are two major techniques: liquid-state and solid-state NMR.

Whereas the resonances in a liquid-state NMR spectrum are usually well separated, so that an assignment to the corresponding amino acids and atoms is quite easy, the spectra produced with solid-state NMR are difficult to interpret. In solid-state NMR, the resonances of the atoms overlap, and the assignment becomes harder. One solution for this issue is to predict the spectrum of the protein of interest and to overlay the predicted peaks with the experimental spectrum.

Nowadays, different tools exist to predict the chemical shifts of amino acid atoms (e.g. 81,82). But no software was available to predict the corresponding spectra based on different experimental settings. Therefore, Peakr and the corresponding web application Webpeakr were developed (chapter 3.1, page 163). Like the other web applications mentioned, Webpeakr only requires a modern web browser.

One feature of NMR is the possibility to study the dynamics of a protein, for example conformational exchange processes, and such studies help to understand the protein's function. One technique for this goal is the Carr-Purcell-Meiboom-Gill (CPMG) experiment. But the analysis of such an experiment is not easy. To avoid issues during the analysis, the web application ShereKhan was developed (chapter 3.2, page 176).
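For orientation, in a CPMG relaxation dispersion experiment the effective transverse relaxation rate is commonly derived from peak intensities as

    R_{2,\mathrm{eff}}(\nu_{\mathrm{CPMG}}) = -\frac{1}{T_{\mathrm{relax}}} \ln\!\left(\frac{I(\nu_{\mathrm{CPMG}})}{I_0}\right)

where T_relax is the constant relaxation delay, I(ν_CPMG) the peak intensity measured at CPMG frequency ν_CPMG, and I_0 the intensity of a reference spectrum recorded without the CPMG period. This general relation is stated here for illustration only and is not specific to ShereKhan.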

The publications are ordered chronologically, beginning with the newest.

2 Publications

2.1 Evolution of the eukaryotic dynactin complex, the