DIPLOMARBEIT. GraHMMer: A graphical user interface for biological sequence analysis using profile hidden Markov models

matrices are used, for example BLOSUM50 (see Figure 6). [Durbin, et al., 1998/2006]

For gaps, resulting from insertions or deletions, penalties are added to the score.

This penalty depends on the length of the gap. A further distinction is made between a gap-open penalty and a gap-extension penalty; the latter is usually smaller than the former. [Durbin, et al., 1998/2006]
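The relationship between gap length and penalty can be illustrated with a small sketch. It assumes the common affine form, in which the first gap position costs the gap-open penalty and every further position costs the gap-extension penalty. The function name and the default values are illustrative assumptions, not values taken from this work.

```python
def affine_gap_penalty(length, gap_open=12.0, gap_extend=2.0):
    """Return the (negative) score contribution of a gap of the given length.

    Assumed affine form: -(gap_open + (length - 1) * gap_extend).
    The default penalty values are hypothetical example values.
    """
    if length < 1:
        return 0.0  # no gap, no penalty
    return -(gap_open + (length - 1) * gap_extend)


for gap_length in (1, 2, 5):
    print(gap_length, affine_gap_penalty(gap_length))
```

With these example values a single long gap is penalized less than several short gaps of the same total length, which is the usual motivation for separating the two penalties.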

A pairwise alignment can be done either over two full sequences, called a global alignment (including allowed gaps), or with two subsequences, called a local alignment. [Durbin, et al., 1998/2006]

For the global alignment the Needleman-Wunsch algorithm is used.

For this algorithm a matrix F with the indices i and j is built. F(i,j) contains the score of the best alignment of the first i residues of one sequence with the first j residues of the other. The matrix is filled with values according to the algorithm in Formula 1. An example of such a matrix is shown in Figure 7.

Formula 1: Needleman-Wunsch matrix algorithm [Durbin, et al., 1998/2006]
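The matrix fill and the traceback can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: a fixed match/mismatch score replaces a substitution matrix such as BLOSUM50, and a linear gap penalty is used. The function name and all parameter values are hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=2):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def score(i, j):
        # Assumed simple scoring instead of a substitution matrix.
        return match if x[i - 1] == y[j - 1] else mismatch

    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap
    for j in range(1, m + 1):
        F[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(i, j),  # align x_i with y_j
                          F[i - 1][j] - gap,              # x_i against a gap
                          F[i][j - 1] - gap)              # y_j against a gap

    # Traceback from the bottom right corner to the top left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))
```

When several alignments share the best score, this sketch returns only one of them, preferring the diagonal step during traceback.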

Figure 6: BLOSUM50 substitution matrix for sequence alignments [Durbin, et al., 1998/2006]

Figure 7: A sequence alignment matrix derived from the Needleman-Wunsch algorithm [Durbin, et al., 1998/2006, Fig. 2.5]

The local alignment can be done with the Smith-Waterman algorithm (Formula 2).

This type of alignment is useful for determining whether two protein sequences share a domain or for comparing extended sections of DNA sequences. It is also a very sensitive method for finding similarities between highly diverged sequences.

Formula 2: Smith-Waterman matrix algorithm [Durbin, et al., 1998/2006]

The best local alignment of two subsequences is the one with the highest score.

The Smith-Waterman algorithm is very similar to the Needleman-Wunsch algorithm, with only two differences. First, a matrix entry is set to 0 if all other options would give a negative value. Second, the alignment does not have to end in the bottom right corner of the matrix; it can end anywhere (see Figure 8).

For this reason, the highest value in the matrix is taken as the best score, and the traceback starts from that cell. [Durbin, et al., 1998/2006]
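The two differences to the global algorithm can be seen in the following minimal sketch. As an assumption for illustration, it again uses a fixed match/mismatch score and a linear gap penalty instead of a substitution matrix; the function name and parameter values are hypothetical.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=2):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def score(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Difference 1: the value 0 is always an option.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(i, j),
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: traceback starts at the highest value and stops at 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTTACGTGGG", "CCACGTCC"))
```

In the example call only the shared segment ACGT is reported; the unrelated flanking residues do not lower the score, as they would in a global alignment.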

Figure 8: A sequence alignment matrix derived from the Smith-Waterman algorithm [Durbin, et al., 1998/2006]

1.3.2. Multiple Alignment

A multiple alignment means that more than two sequences are aligned to each other.

More precisely, the homologous residues of those sequences are aligned in columns. In a multiple alignment, homology is meant in both the structural and the evolutionary sense. [Durbin, et al., 1998/2006]

Multiple alignments can be used to identify members of gene or protein families. Membership in such a family suggests that the members have similar functions and structures. [Pevsner, 2003]

“Aligned residues tend to occupy corresponding positions in the three-dimensional structure of each aligned protein”

[Pevsner, 2003, p. 320]

Creating a multiple alignment by hand requires an expert with considerable experience in protein sequence evolution [Durbin, Eddy, Krogh, & Mitchison, 1998/2006], and if the sequences have diverged, a multiple alignment is very difficult to produce. In addition, it cannot be expected that only one correct multiple alignment of a protein family can be identified. [Pevsner, 2003]

For automated multiple alignments a good scoring scheme is necessary to judge the quality and correctness of the results. [Durbin, et al., 1998/2006]

The most frequently used algorithms for multiple alignments are based on the progressive alignment method of Da-Fei Feng and Russell Doolittle. This method consists of three stages.

First, a global pairwise alignment is computed for each pair of sequences. Second, a guide tree is created with the help of distance matrices derived from these alignments.

In the third stage the sequences are added to the multiple alignment in the order given by the guide tree, the most closely related ones first.

ClustalW is a very popular program for creating multiple alignments that implements such an algorithm. [Pevsner, 2003]

Alternatively, multiple alignments can be queried and fetched from the Pfam database. This is an online database which contains multiple alignments and HMM-profiles (hidden Markov model-profiles) of complete protein domains. [Sonnhammer, et al., 1997]

The database is divided into two parts, Pfam-A and Pfam-B. Pfam-A contains only manually curated HMM-profiles and protein families, whereas the data in Pfam-B are not of the same high quality as in Pfam-A. Additionally, the data in Pfam-B are not completely annotated. [Pevsner, 2003]

1.4. Hidden Markov Models and Profile Hidden Markov Models

A hidden Markov Model (HMM) is a probabilistic model for sequences of symbols.

HMMs can be used to solve different problems, for example to decide whether a sequence belongs to a certain family. Another problem that can be addressed, if the family is known, is deriving the internal structure of the sequence. [Durbin, Eddy, Krogh, & Mitchison, 1998/2006]

1.4.1. Markov Chains

A Markov chain is a probabilistic model for describing sequences in which the appearance of a symbol (nucleotide or amino acid) depends on the previous symbol.

Markov chains can be displayed graphically and consist of states and transition probabilities. Each state stands for a particular residue, and the states are connected by arrows, which represent the transition probabilities. Additionally, begin and end states can be added to a Markov chain to model the start and end of a sequence (see Figure 9). [Durbin, et al., 1998/2006]

Figure 9: a modified graphic of the simple DNA Markov chain, containing additionally begin- and end-states [Durbin, et al., 1998/2006, Fig. 3.1]

Figure 10 gives an example of a Markov chain for a DNA sequence. The circles contain the states, in this case the four possible nucleotides A, T, C and G of the DNA alphabet.

The arrows represent the transition probabilities, that is, the probability that a particular nucleotide follows the current one in the sequence. [Durbin, et al., 1998/2006]
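How such a chain assigns a probability to a sequence can be sketched as follows: the probability of the first symbol is multiplied by the transition probability of every consecutive pair. The uniform probabilities used here are placeholder assumptions, not estimates from real DNA, and the function name is hypothetical. Logarithms are used to avoid numerical underflow on long sequences.

```python
import math

NUCLEOTIDES = "ACGT"

# Placeholder model (assumption): every nucleotide is equally likely to
# start the sequence and to follow any other nucleotide.
INITIAL = {a: 0.25 for a in NUCLEOTIDES}
TRANSITIONS = {a: {b: 0.25 for b in NUCLEOTIDES} for a in NUCLEOTIDES}


def sequence_log_probability(seq, initial, transitions):
    """Log-probability of seq under a first-order Markov chain."""
    if not seq:
        return 0.0
    logp = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][cur])
    return logp


print(math.exp(sequence_log_probability("ACGT", INITIAL, TRANSITIONS)))
```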

Figure 10: graphic of a simple Markov chain, containing 4 states, one for each nucleotide in the DNA-alphabet [Durbin, et al., 1998/2006]

1.4.2. Hidden Markov Models

The important distinction between a Markov chain and a hidden Markov model (HMM) is that in an HMM there is no longer a one-to-one correspondence between symbols and states. It is therefore not possible to tell which state the model was in when a symbol was generated just by looking at that symbol. [Durbin, et al., 1998/2006]

“The name ‘hidden Markov model’ comes from the fact that the state sequence is a first-order Markov chain, but only the symbol sequence is directly observed”

[Eddy, 1998, p.756]

The most interesting information about a sequence that can be derived from an HMM is the sequence of underlying states. For this reason the sequence must be decoded. Several algorithms are available for this purpose, the most common of which is the Viterbi algorithm. Like the Needleman-Wunsch and Smith-Waterman algorithms described above, it is a dynamic programming algorithm. [Durbin, et al., 1998/2006]
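The decoding step can be sketched with a small two-state example. The model below (a fair and a loaded die, with the listed probabilities) is an assumed toy model chosen only for illustration; the state names, parameter values and function name are not taken from this work. The computation is done in log space.

```python
import math

# Assumed toy model: state "F" is a fair die, state "L" a loaded die.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},
    "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for a non-empty observation string."""
    # v[t][s] = log-probability of the best path ending in state s at time t
    v = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = []  # back[t][s] = best predecessor of state s at time t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (v[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        v.append(scores)
        back.append(pointers)

    # Traceback from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


print("".join(viterbi("1234566666", STATES, START, TRANS, EMIT)))
```

The structure mirrors the alignment algorithms: a table is filled column by column with the best score so far, and the result is recovered by following back-pointers.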

Defining an HMM may be one of the most difficult parts of using HMMs. The structure must be designed, that is, the possible states and the connections between them, and parameter values must be assigned to the transition and emission probabilities. The emission probability is the probability that a state emits a particular symbol. The probability that a state changes into a certain other state is called the transition probability. In sequence analysis the parameters can be estimated from so-called training sequences. These are a set of independent example sequences that are all fit well by the desired model. [Durbin, et al., 1998/2006]
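For the simplest case, in which the state of every symbol is known (as in a Markov chain, where state and symbol coincide), estimating transition probabilities from training sequences reduces to counting. The following minimal sketch assumes that case and adds a pseudocount so that unseen transitions do not get probability zero; the function name, the pseudocount value and the example sequences are hypothetical.

```python
from collections import Counter


def estimate_transitions(training_sequences, alphabet="ACGT", pseudocount=1.0):
    """Estimate transition probabilities by counting adjacent symbol pairs."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(zip(seq, seq[1:]))

    probabilities = {}
    for a in alphabet:
        total = sum(counts[(a, b)] + pseudocount for b in alphabet)
        probabilities[a] = {b: (counts[(a, b)] + pseudocount) / total
                            for b in alphabet}
    return probabilities


model = estimate_transitions(["AAC", "AAG"])
print(model["A"])
```

When the state path is not known, the counts cannot be read off directly and an iterative procedure is needed instead; that case is not covered by this sketch.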

1.4.3. Profile Hidden Markov Models

To build a profile HMM (pHMM) an existing multiple alignment is used as input. [Eddy S. R., 1998] A graphical representation of such a pHMM can be seen in Figure 11.

A pHMM describes a protein family by specifying position-specific letter emission distributions and position-specific insertion and deletion probabilities. [Schuster-Böckler, et al., 2004]

Using a profile HMM to search against a database helps to find distantly related homologs. [Pevsner, 2003]
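The idea of position-specific emission distributions can be illustrated with a simplified sketch that derives one distribution per consensus column of a multiple alignment. It is not the procedure used by HMMER: the rule for choosing consensus columns (fewer than half gaps), the pseudocount and the function name are assumptions made for illustration, and transition probabilities are left out entirely.

```python
from collections import Counter


def match_state_emissions(alignment, alphabet, pseudocount=1.0,
                          max_gap_fraction=0.5):
    """Return one emission distribution per consensus (match) column.

    alignment: list of equal-length strings, '-' marks a gap.
    A column counts as a consensus column if its gap fraction is below
    max_gap_fraction (a simple assumed heuristic).
    """
    emissions = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / len(column)
        if gap_fraction >= max_gap_fraction:
            continue  # treated as an insert column, no match state
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total
                          for a in alphabet})
    return emissions


toy_alignment = ["A-C", "A-C", "AGC", "A-C"]
for position, distribution in enumerate(match_state_emissions(toy_alignment, "ACGT"), 1):
    print(position, distribution)
```

In the toy alignment the middle column consists mostly of gaps, so only two match positions result, each with its own emission distribution.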

Figure 11: “A small profile HMM (right) representing a short multiple alignment of five sequences (left) with three consensus columns. The three columns are modeled by three match states (squares labeled m1, m2 and m3), each of which has 20 residue emission probabilities, shown with black bars. Insert states (diamonds labeled i0 – i3) also have 20 emission probabilities each. Delete states (circles labeled d1-d3) are ‘mute’ states that have no emission probabilities. A begin and end state are included (b,e). State transition probabilities are shown as arrows.” [Eddy, 1998 p. 757]

1.5. Visualization in Bioinformatics

Human perception and cognition are limited, so the amount of information that can be processed and analyzed in a short period is limited, too. For information visualization (IV) the data must be selected, transformed and represented. The visualization of biological data can help with processing the mass of data by using the bandwidth of human vision, which is able to distinguish trends and patterns. [Tao, et al., 2004]

Information visualization techniques have two main aims. The first is to visualize large amounts of data and the second is to reveal patterns and trends so that the data can be analyzed more easily. [Tao, et al., 2004]

To implement IV, data must be obtained, mapped and rendered. In bioinformatics, IV has been implemented for different purposes. For biomolecular structures, 2D tools visualize the secondary structures of nucleic acids and proteins, and 3D tools visualize tertiary and higher structures. Sequence analysis can be visualized simply by aligning the text of two sequences against each other or by using dot plots and percent identity plots.

IV techniques have also been implemented for expression profiles, genome and sequence annotation, molecular pathways and ontology, taxonomy and phylogeny.

Because of the limited knowledge about and the complexity of biological phenomena IV techniques are increasingly important for the analysis of biological data. [Tao, et al., 2004]

For position-specific score matrices (PSSMs), also called sequence profiles, and for ungapped multiple alignments, Sequence Logos are used for visualization. Such a Sequence Logo consists of stacks of letters which stand for the amino acids or nucleotides in a multiple alignment and describes the conservation of the columns or positions. The height of a letter displays its frequency at this position. Additionally, colors are used for other properties. Schuster-Böckler, Schultz, & Rahmann adapted the Sequence Logos and implemented HMM Logos in 2004 (see Figure 12). They took into account that positions can be deleted, inserted or conserved. [Schuster-Böckler, et al., 2004]
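How letter heights in a Sequence Logo can be computed for one alignment column is sketched below. The sketch assumes the commonly used information-content definition, in which the total stack height is the maximum possible entropy minus the observed entropy of the column (in bits) and each letter receives a share of that height proportional to its frequency. A small-sample correction is omitted, and the function name is hypothetical.

```python
import math
from collections import Counter


def logo_letter_heights(column, alphabet_size=4):
    """Return {letter: height in bits} for one ungapped alignment column."""
    if not column:
        return {}
    counts = Counter(column)
    total = len(column)
    frequencies = {letter: n / total for letter, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in frequencies.values())
    information = math.log2(alphabet_size) - entropy  # total stack height
    return {letter: f * information for letter, f in frequencies.items()}


for column in ("AAAA", "AACC", "ACGT"):
    print(column, logo_letter_heights(column))
```

A fully conserved DNA column reaches the maximum height of 2 bits, whereas a column in which all four nucleotides are equally frequent has height 0. For protein alignments alphabet_size would be 20.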

Figure 12: Example of an HMM Logo [Schuster-Böckler, et al., 2004] - “Comparison of the HMM Logos of the small GTPases Ras and Rab from SMART [3]. The Ras logo is based on an alignment of 35 sequences; the Rab logo on 48 sequences. The height of the entire vertical axis is 5 bits for both logos. Subfamily specific sites RabF2 to RabF5[13] are indicated by arrows” [Schuster-Böckler, et al., 2004, p. 7]

1.6. Graphical User Interfaces

A graphical user interface (GUI) is software that gives the user an interface to an application through selectable menus, clickable buttons and similar controls. The user does not need to know which commands would be needed to run the same application on the command line of an operating system, if the application is available for command line use at all. The options of the application are offered via menus or similar controls. A GUI can be the interface to an operating system, for example Microsoft Windows XP™, or it can be an application which itself runs inside a graphical operating system, for example Microsoft Word™.

In bioinformatics many tools are developed for use on the command line. GUIs are already available for some of those tools. A few examples of GUIs for bioinformatics tools:

• DNAAlignEditor⁸: a desktop tool for the alignment of nucleotide sequences with a user-friendly interface. The user can edit multiple sequence alignments manually. [Sanchez-Villeda, et al., 2008]

• GEDI⁹: a desktop tool for the analysis of gene expression data; extensible, open source, freely available and user-friendly. [Fujita, et al., 2007]

• MPI Bioinformatics Toolkit¹⁰: an online tool for protein sequence analysis. [Biegert, et al., 2006]

• PFAAT 2.0¹¹: a desktop tool for annotating, editing and analyzing multiple sequence alignments. [Caffrey, et al., 2007]

8 http://maize.agron.missouri.edu/~hsanchez/DNAAlignment_Tool.html

9 http://www.iq.usp.br/wwwdocentes/mcsoga/gedi/

10 http://toolkit.tuebingen.mpg.de/

11 http://pfaat.sourceforge.net/

1.7. Available Tools and Databases

Besides HMMER, a number of tools which use profile hidden Markov models or position-specific matrix models are already available. [Eddy, 2003]

1.7.1. Web Based Tools

• PSI-BLAST: Position Specific Iterative BLAST¹²

• FISH: Family Identification with Structure anchored HMMs¹³

• Meta-MEME: uses motif-based hidden Markov modeling of biological sequences¹⁴

• BLOCKS: an online service for biological sequence analysis¹⁵

1.7.2. Command Line Based

Some tools which implement profile hidden Markov models run on the command line. Those tools are SAM, HMMER, PFTOOLS and HMMpro, all of which are based more or less on the HMMs of Krogh et al. [Eddy, 1998]

HMMER will be described in detail in the Material and Methods chapter.

HMMpro no longer seems to be available: its website is not accessible and a web search yields no comparable hit.

PFTools¹⁶ is a package of programs for the construction of profiles. Sequences or sequence libraries can also be searched against profiles and profile libraries. It is used for the PROSITE database. [Sigrist, et al. 2002]

SAM (Sequence Alignment and Modeling System)¹⁷ is only available for Unix systems; it implements HMMs in a way very similar to HMMER and includes a conversion function between the SAM and HMMER formats. [Wistrand, et al., 2005]

1.7.3. Protein and Profile HMM Databases

Two annotated profile HMM databases are available: the Pfam database and the PROSITE¹⁸ profiles database, which is a supplement to the PROSITE motifs database.

Two other web-based profile HMM databases are SMART and TIGRFAMs. [Pevsner, 2003] All of these databases are available online.

The Pfam¹⁹ database belongs to the Sanger Institute; the current version is 22.0 from July 2007. [Sanger]

It consists of a large number of protein families. Those families are represented by hidden Markov Models and multiple sequence alignments. [Finn, et al., 2005]

The Pfam database can be divided into Pfam-A and Pfam-B. Pfam-A is manually curated and contains high quality families. Pfam-B supplements Pfam-A with families of lower quality. [Sanger]

Additionally Pfam provides HMM-Logos, graphical representations of an HMM, visualizing the distinguishing features. [Finn, et al., 2005]

PROSITE²⁰ is hosted by the Swiss-Prot group of the Swiss Institute of Bioinformatics (SIB). It also contains protein families and domains. [Swiss-Prot, 2006]

SMART²¹ (Simple Modular Architecture Research Tool) is an online database for protein domain identification and can also be used for the analysis of protein domain architectures. [Letunic, et al. 2006]

TIGRFAMs²² is also manually curated and contains protein families which consist of hidden Markov models, multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, InterPro and Pfam models. [Haft, et al., 2003]

It belongs to the J. Craig Venter Institute (see http://www.tigr.org/db.shtml).

Superfamily 1.69 is an HMM library which represents all proteins of known structure. [Gough, et al., 2001]

2. Materials and Methods

The following components are necessary to develop a graphical user interface for HMMER 2 and to display its results in a way that is easier to interpret.

2.1. HMMER 2

HMMER 2 is a command line based tool for biological sequence analysis. It implements profile hidden Markov models. [Eddy, 2003]

The latest version is 2.3.2 and the package is available for a number of Operating Systems (Table 2) and architectures in precompiled binaries, including Microsoft Windows™. The source code is also freely available for compilation. [HHMI Janelia farm research campus]

HMMER 2 contains nine subprograms which have different aims. [Eddy, 2003]

All of those subprograms (Table 3) need to be mapped in the graphical user interface.

To build a profile HMM with HMMER the user needs a multiple sequence alignment of the protein domain or sequence family she or he wants a profile for. As input many formats are possible: CLUSTAL, GCG, PHYLIP, FASTA and Stockholm²³. Stockholm is the format of the Pfam-database and the native format of HMMER.

HMMER provides E-values for its search results. To get more sensitivity the created HMMs can be calibrated using hmmcalibrate.

With hmmsearch the user can search against single sequence files and major databases like Swissprot²⁴ and NCBI²⁵. HMMER can also be used to search against HMM databases like Pfam or self-constructed HMM databases; this is done with hmmpfam.

hmmalign is used to create multiple sequence alignments of large numbers of sequences, using an existing profile HMM. [Eddy, 2003]
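The typical order of these steps can be summarized as a sequence of command lines. The sketch below only assembles the commands and prints them; it runs nothing unless execute=True is passed and the HMMER 2 binaries are on the PATH. All file names are hypothetical, and only the basic argument order of each subprogram plus the -o output option of hmmalign are used; further options are described in the HMMER user guide.

```python
import subprocess


def hmmer2_workflow(alignment, hmm_file, seq_db, aligned_out):
    """Build the HMMER 2 command lines for a build/calibrate/search/align run."""
    return [
        ["hmmbuild", hmm_file, alignment],                  # alignment -> profile HMM
        ["hmmcalibrate", hmm_file],                         # calibrate E-value statistics
        ["hmmsearch", hmm_file, seq_db],                    # search sequences with the HMM
        ["hmmalign", "-o", aligned_out, hmm_file, seq_db],  # align sequences to the HMM
    ]


def run_workflow(commands, execute=False):
    """Print each command; execute it only if explicitly requested."""
    for command in commands:
        print(" ".join(command))
        if execute:
            subprocess.run(command, check=True)


# Hypothetical file names, for illustration only.
run_workflow(hmmer2_workflow("family.sto", "family.hmm",
                             "sequences.fa", "family_aligned.ali"))
```

A graphical front end has to collect exactly this kind of information (input files, output files and options) before it calls the corresponding subprogram.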

23 Description of the formats CLUSTAL, GCG, PHYLIP, FASTA and Stockholm can be found in the Appendix chapter 6.7.12

24 http://www.expasy.ch/sprot/

25 http://www.ncbi.nlm.nih.gov/sites/gquery

Architectures and Operating Systems

• AMD Opteron/Linux
• AMD Opteron/Solaris
• Apple Macintosh Power PC OS/X
• Compaq Alpha Tru64
• Compaq Alpha Linux
• Debian Linux
• Hewlett/Packard IA64 (Itanium2), Linux
• Hewlett/Packard IA64 (Itanium2), HP/UX
• Intel FreeBSD
• Intel GNU/Linux
• IBM Power4, Linux
• IBM Power4, AIX
• IBM Power5, AIX
• IBM Power6, AIX
• Intel OpenBSD
• Intel Solaris
• Microsoft Windows™
• Silicon Graphics IA64 (Itanium2), Linux
• Silicon Graphics MIPS IRIX
• Sun Sparc Solaris

Table 2: List of architectures and operating systems for which HMMER 2 is available [HHMI Janelia farm research campus]

Program        Application
hmmbuild       Build a model from a multiple sequence alignment
hmmalign       Align sequences to an existing model
hmmcalibrate   Calibrates an HMM, makes searches more sensitive by the calculation of better E-value scores
hmmconvert     Converts a model file into other formats, e.g. HMMER 2 binary or GCG profiles
hmmemit        Emits sequences probabilistically from a profile HMM
hmmfetch       Fetch a single model from an HMM database
hmmindex       Index an HMM database
hmmpfam        Search an HMM database for matches to a query sequence
hmmsearch      Search a sequence database for matches to an HMM

Table 3: HMMER 2 Programs [Eddy, 2003]

2.2. Development Environment

2.2.1. Operating Systems

The software was developed and tested on Microsoft Windows XP™ Home Edition Version 2002, Service Pack 3, v.3264 and Microsoft Windows Vista Home Premium™ on the one hand, and on Ubuntu Linux 7.10 (Gutsy Gibbon) and 8.04 (Hardy Heron) on the other.

2.2.2. Hardware

Microsoft Windows XP™ was installed on a notebook with an AMD Athlon™ 64 processor 3700+ with 1.66 GHz and 2.00 GB RAM. Microsoft Windows Vista™ was installed on a personal computer (PC) with an AMD Athlon™ 64 3000+ with 2.01 GHz and 2 GB RAM.

The installations of Ubuntu Linux ran on a notebook with a mobile AMD Athlon™ XP2500+ with 1.66 GHz and 1 GB RAM.

2.2.3. Software

ActivePerl for Windows™ was installed in order to run Perl scripts. This Perl distribution is limited compared with Perl for Linux/Unix, because only specially compiled Perl packages, with the extension ppm, can be run on it. In particular, compiling Perl packages that contain C code turned out to be very difficult.

For the realization of the UML-Model²⁶ (Unified Modelling Language-Model) objectiF Visual Studio .NET Personal Edition™ ²⁷ was used.

HMMER 2 for Windows, developed by the Computational Biology Service Unit (CBSU) at Cornell University [HHMI Janelia farm research campus], was installed as well. It is a port of the original HMMER 2 to MS Windows™ systems and contains several *.exe files, one for each HMMER subprogram, for example hmmsearch.exe. The source code and compiled executables can be downloaded from http://www.tc.cornell.edu/WBA/, last verified on May 11, 2008. It is free software and can be redistributed freely. [University, Cornell Theory Center – Cornell]

26 Details in chapter 2.7

27 http://www.microtool.de/objectif/de/objectif_vs_net_personal_edition.asp

2.3. .NET-Framework

Before .NET was introduced, a few languages were mainly used for application development. These were C++, Java and Visual Basic 6.0. With the .NET framework, C# joined this group. [Kühnel, 2006]

Microsoft™ published the development framework .NET 1.0 as well as an Integrated Development Environment (IDE) for it in 2002. Version 3.0 of the .NET framework has since been published. [Kühnel, 2006]

.NET has some similarities to Java and VB 6.0, because it produces a byte code which has to be compiled at runtime on the PC. This byte code is called Microsoft™ Intermediate Language (MSIL) or just Intermediate Language (IL). The IL is compiled by the Just-In-Time compiler (JIT compiler), so that it becomes executable (Figure 13). [Kühnel, 2006]

Figure 13: Lifecycle of .NET -Code [Kühnel, 2006]

The framework contains a number of components, which will be described now.

The Common Language Specification (CLS) describes the requirements a programming language has to meet to be compatible with .NET. [Kühnel, 2006]

The Common Type System (CTS) provides the basis for language-independent development.

There are two categories of Types in .NET - value types and reference types. One specialty of the .NET CTS is the fact that it is not integrated into a specific language, as it is done elsewhere. Because of this objects, written in different languages, can
