DIPLOMA THESIS. GraHMMer: A graphical user interface for biological sequence analysis using profile hidden Markov models

University of Applied Sciences diploma degree programme in Bioengineering

by

Ing. Nadine Elpida Tatto

c0410231041

supervised by

Dr. Markus Jaritz

Vienna, May 2008

Acknowledgements

First of all, I would like to thank Dr. Markus Jaritz for his support and help in preparing this diploma thesis. As my main supervisor at the university of applied sciences, he was always available with expert and professional advice.

Because this degree programme was completed alongside my job, I am also deeply indebted to my family and my friends. For these four years they willingly accepted that I did not have much time for them.

Above all, I would like to thank my husband Ralf. Without his help and full support, neither this degree nor this diploma thesis would have been possible for me in this way.

Ing. Nadine Elpida Tatto, Vienna, May 2008

Abstract

Sequence analysis is one of the important areas in microbiology and genomics. With the increasing amount of sequence data from sequencing projects, it becomes even more important to have algorithms and applications for the analysis of such data. [Pevsner, 2003]

One of those useful tools is HMMER 2, a powerful, command line based sequence analysis application which uses hidden Markov models to search for related proteins and protein domains and to generate multiple sequence alignments. [S. Eddy, 2003]

Most of the available tools are command line based and are therefore often not easy to use. For experts in biology and the field of sequence analysis it is important to have access to easy-to-use applications. This requires a small learning effort, which a graphical interface can provide.

This leads to the question whether HMMER 2 would be much easier to use with a graphical user interface which also provides a graphical interpretation of the text-based output for a better and easier understanding of the results.

In the course of this thesis a graphical application called GraHMMer was developed. This application uses HMMER for Windows™ in the backend.

The GUI¹ provides a practical overview of the available features of HMMER for Windows. Furthermore, all available options are displayed.

Additionally, the text output of hmmsearch and hmmpfam, two of the HMMER subprograms, is split and the interesting parts of the results are displayed separately, which provides a better overview.

Input and output files can be edited and exported with the application.

GraHMMer is project-based, which means that the files used for the sequence analysis and the result files are saved in project folders.

1 Graphical User Interface

The graphical representation of the pHMMs² was realized by integrating an existing web portal of the Sanger Institute³, which creates HMM Logos.

For the TNF protein family the same processing steps were executed with HMMER for Windows on the command line and with GraHMMer, and the usage of both methods was compared.

The comparison showed that the graphical application is more comfortable to use than the command line solution. One reason for this is the clear overview of available subprograms and options, which the command line tool does not offer. The user no longer needs to know the shortcuts for all available options. Another reason is the better, split display of the text output, as well as the opportunity to edit and view the input and output files inside the application.

Furthermore, the integration of the Sanger HMM Logo online platform makes it possible to create graphical output.

By writing this thesis I gained much experience regarding the difficulties and opportunities which can arise in the implementation of graphical interfaces for existing computer tools for biological tasks. The knowledge obtained will be an important and essential basis for my professional future.

2 profile hidden Markov model

3 http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi [Schuster-Böckler, et al., 2004]


Summary

Sequence analysis is an important area in microbiology and genetics. With the growing amount of data from sequencing projects, it becomes increasingly important to also have algorithms and applications for the analysis of such data. [Pevsner, 2003]

Most of the available programs are command line based and are therefore often not easy to use.

One of these applications is HMMER2, a powerful, command line based sequence analysis program which uses hidden Markov models to search for related proteins and protein domains and to create multiple alignments. [S. Eddy, 2003]

For biologists and experts in sequence analysis it is important to have quick access to user-friendly analysis tools. Among other things, this requires a small learning effort, which a graphical interface helps to achieve.

This leads to the question whether the use of HMMER2 would be simplified by a graphical interface which also provides a graphical interpretation of the text-based results, and whether this leads to a better understanding of the results.

In the course of this diploma thesis an application called GraHMMer was developed. This program uses HMMER for Windows™ in the background.

The GUI⁴ provides a practical overview of the available functionality of HMMER for Windows. Furthermore, all available options are displayed. Additionally, the text-based results of hmmsearch and hmmpfam, two of the subprograms of HMMER, are split and the interesting parts are displayed separately. This gives a better overview of the results. Input and output files can be edited and exported within the application.

GraHMMer is project-based, which means that the files used for the sequence analysis and the corresponding result files are stored in project folders.

The graphical representation of the pHMMs⁵ was realized by integrating an existing online platform of the Sanger Institute⁶. This platform creates HMM Logos from pHMMs.

The same processing steps were carried out for the TNF protein family both with HMMER for Windows™ on the command line and with GraHMMer, and the two procedures were compared.

The comparison showed that the graphical application is considerably more comfortable than the command line based version. One of the reasons for this is the clear overview of the available subprograms and options, which the command line version does not provide. The user no longer needs to know the individual parameter names. A further advantage is the better presentation of the text results, as well as the possibility to edit and export the files. In addition, the integration of the online platform of the Sanger Institute offers the option of generating a graphical result.

While writing this diploma thesis I was able to gain much experience regarding the difficulties and opportunities in the implementation of computer programs for biological analyses. The knowledge gained is an important and indispensable basis for my professional future.

4 Graphical User Interface

5 profile hidden Markov model

6 http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi [Schuster-Böckler, et al., 2004]

List of Figures

Figure 1: DNA double helix: a - antiparallel strands, b - dimensions of the helix, c - calotte model [Knippers, 2006]
Figure 2: The genetic code [Knippers, 2006]
Figure 3: CAP protein with two domains [Knippers, 2006]
Figure 4: a - α-helix; b - β-sheet [Knippers, 2006]
Figure 5: Example of a pairwise alignment of HBA_HUMAN with HBB_HUMAN, LGB2_LUBLU and F11G11.2 [Durbin, et al., 1998/2006, Fig. 2.1]
Figure 6: BLOSUM50 substitution matrix for sequence alignments [Durbin, et al., 1998/2006]
Figure 7: A sequence alignment matrix derived from Needleman-Wunsch algorithm [Durbin, et al., 1998/2006, Fig. 2.5]
Figure 8: a sequence alignment matrix derived from Smith-Waterman algorithm [Durbin, et al., 1998/2006]
Figure 9: a modified graphic of the simple DNA Markov chain, containing additionally begin- and end-states [Durbin, et al., 1998/2006, Fig. 3.1]
Figure 10: graphic of a simple Markov chain, containing 4 states, one for each nucleotide in the DNA-alphabet [Durbin, et al., 1998/2006]
Figure 11: “A small profile HMM [..].” [Eddy, 1998 p. 757]
Figure 12: Example of an HMM Logo [Schuster-Böckler, et al., 2004]
Figure 13: Lifecycle of .NET-Code [Kühnel, 2006]
Figure 14: Screenshot of GraHMMer, selected project is marked green in the treeview, selected HMM subprogram is hmmbuild
Figure 15: Screenshot - the GraHMMer Search menu contains HMMER subprograms with search functionality
Figure 16: Screenshot - GraHMMer Build menu contains HMMER subprograms for building profile HMMs and profile HMM databases
Figure 17: Screenshot - dialog box for the creation of profile HMM databases
Figure 18: Screenshot - display a file by double clicking it in the treeview, displayed is the file nucleic.null from the HMMER2 tutorial
Figure 19: Screenshot - the input- and selection options for hmmsearch, expert options are displayed
Figure 20: Screenshot - hmmsearch information box, opens after clicking the blue 'I'-button
Figure 21: Screenshot - Result tab page, detail view on the domain alignments from the raw result
Figure 22: extract from the GraHMMer-sourcecode, part of the SetInitialOptions-method of subclass HmmSearch
Figure 23: extract from the GraHMMer-sourcecode, SetUserOptions-method of subclass HmmSearch
Figure 24: extract from GraHMMer-sourcecode, method GetUserOptions
Figure 25: UML Class Diagram of GraHMMer
Figure 26: Screenshot - Search result from Pfam-A; search keywords: tumor necrosis factor tnfa; May 18, 2008
Figure 27: Screenshot - Download of PF00229 alignments from Pfam-A in MSF-format; May 18, 2008
Figure 28: Excerpt from PF00229_full.msf
Figure 29: Screenshot - before the execution of hmmbuild.exe on PF00229.hmm
Figure 30: Screenshot - creation of PF00229.hmm with hmmbuild in GraHMMer and PF00229.msf
Figure 31: Screenshot - PF00229.hmm viewed with the FileViewer-feature of GraHMMer
Figure 32: Screenshot - Search with NP_001009835.fa against PF00229.hmm; Sequence Scores
Figure 33: Screenshot - Search with NP_001009835.fa against PF00229.hmm; Domain Scores
Figure 34: Screenshot - Search with NP_001009835.fa against PF00229.hmm; Alignments
Figure 35: Screenshot - Search with NP_001009835.fa against PF00229.hmm; Histogram
Figure 36: Text output of hmmcalibrate on PF00229.hmm
Figure 37: Screenshot - hmmconvert on PF00229.hmm into GCG Profile
Figure 38: Screenshot - result of hmmemit on PF00229.hmm
Figure 39: HMM-Logo of PF00229.hmm; edited to fit on an A4 page; [Schuster-Böckler, et al., 2004]
Figure 40: Result-text of the search with NP_001009835.fa against PF00229.hmm
Figure 41: ClustalW output format [Pevsner, 2003]
Figure 42: GCG (MSF-) format [Pevsner, 2003]
Figure 43: simple FASTA file [S. Eddy, 2003]
Figure 44: Minimal Stockholm format [S. Eddy, 2003]
Figure 45: Poster about GraHMMer for a poster session on February 2, 2008, at the University of Applied Life Sciences Vienna

1. Introduction

Sequence analysis is one of the important areas in microbiology and genomics. With the increasing amount of sequence data from gene and protein sequencing projects, it becomes even more important to have algorithms, and applications that implement them, for the analysis of such data.

A number of tools for sequence analysis are already available. Most of them are command line based and are therefore not easy to use for the average user who is not a computer specialist. [Pevsner, 2003]

Some of those tools are available on online platforms, on websites with input forms. Even those forms are often unclear and overwhelm the user with too many options.

For experts in biology and the field of sequence analysis it is important to have access to easy-to-use applications. Learning to use those tools must not require a high effort. Normally a scientist is not an expert in both biology and programs such as those mentioned above.

One of those useful tools is HMMER 2, a powerful, command line based sequence analysis application which uses hidden Markov models to search for related proteins and protein domains and to generate multiple sequence alignments. Primarily it is intended for protein sequence analysis, but it can be used for RNA and DNA analysis as well.

This thesis addresses the following question: would HMMER 2 be much easier to use with a graphical user interface (GUI) which also provides a graphical interpretation of the text-based output for a better and easier understanding of the results⁷?

With this thesis I want to show, using HMMER 2 as an example, that it is possible to develop and implement such user-friendly applications, so that it is no longer absolutely essential to be a specialist in command line based sequence analysis tools. As a result, such a GUI was implemented for the easy use of HMMER 2.

7 See also the corresponding poster in chapter 6.7.13

1.1. DNA

DNA (deoxyribonucleic acid) carries the complete genetic information of an organism. The subunits of this macromolecule are called nucleotides.

A nucleotide is made of three components: a purine or pyrimidine base, a deoxyribose and a phosphate group. There are two purine bases, adenine and guanine, and three pyrimidine bases, cytosine, thymine (DNA) and uracil (RNA).

Thousands of nucleotides are connected via phosphate bridges into a long, unbranched DNA strand. DNA forms a double helix (Figure 1), in which adenine pairs with thymine and guanine pairs with cytosine in a complementary manner. The two strands of a double helix are oriented antiparallel to each other.

The main function of DNA is to provide the code for proteins (Figure 2): the linear order of nucleotides in DNA determines the linear order of amino acids in a protein.

Figure 1: DNA double helix: a - antiparallel strands, b - dimensions of the helix, c - calotte model [Knippers, 2006]

Three nucleotides code for one amino acid. This means that a section of DNA which codes for a protein typically contains about 600 to 1500 nucleotides. Such a section is called a “gene”. [Knippers, 2006]

Figure 2: The genetic code [Knippers, 2006]
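The triplet principle can be illustrated with a few lines of Python. This is only a minimal sketch: the table below lists a handful of codons of the standard genetic code rather than all 64, and the names `CODON_TABLE` and `translate` are chosen for illustration. Codons missing from the partial table are reported as "X".

```python
# Partial codon table (standard genetic code). Only a few codons are listed
# for illustration; codons that are not in the table are reported as "X".
CODON_TABLE = {
    "ATG": "M",                                      # methionine, usual start codon
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "TGG": "W",                                      # tryptophan
    "TAA": "*", "TAG": "*", "TGA": "*",              # stop codons
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    protein = []
    # Step through the sequence in triplets and ignore an incomplete last codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTAAATGGTAA"))  # five codons: M, A, K, W, stop
```

The 15 nucleotides of the example yield four amino acids plus a stop signal, which shows why a gene is roughly three times as long as the protein sequence it encodes.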

1.2. Proteins and Protein Sequences

Proteins are macromolecules which are made of single amino acids. Most proteins are built from the same 20 amino acids, in varying combination and number. The amino acids are linked into long chains without branches.

The order of amino acids in a protein is called its sequence and forms the primary structure. A protein usually contains 100 to 800 amino acids, though shorter and longer protein sequences exist.

On their central carbon atom all amino acids carry a carboxyl group (acid group), an amino group, a hydrogen atom and a side chain. The 20 amino acids (Table 1) can be distinguished by the size, shape and charge of their side chains. [Knippers, 2006]

A protein has a three-dimensional structure (Figure 3) which can be determined mainly by two methods.

Figure 3: CAP protein with two domains [Knippers 2006]

The first method applies X-ray crystallography to crystallized proteins; the second uses NMR (nuclear magnetic resonance) on proteins in solution. The 3D structure is essential for the proper function of the protein. [Knippers, 2006]

Beyond the primary structure, protein structure is described at three further levels. The next one after the primary structure is the secondary structure. It is subdivided into two types, the α-helix and the β-sheet (Figure 4).

The α-helix is right-handed and is held together by hydrogen bonds between the CO group of one peptide bond and the NH group of the peptide bond four residues further along the chain. One turn comprises 3.6 amino acids, and the rise along the helix axis is 0.15 nm per amino acid.

A β-sheet is formed by hydrogen bonds between different regions of the chain. It consists of β-strands, each made of 5 to 10 amino acids. Multiple β-strands form a pleated sheet. The strands can be oriented parallel or antiparallel to each other. A strand is connected to the next one by a turn. Such a turn contains 4 to 8 amino acids and is often charged or polar. The turns are mostly found at the surface of the protein.

Figure 4: a - α-helix; b - β-sheet [Knippers, 2006]

Abbreviation   Name            Characteristics
Gly (G)        Glycine         Hydrophobic, without any side chain
Asp (D)        Aspartic acid   Negative charge
Glu (E)        Glutamic acid   Negative charge
Arg (R)        Arginine        Positive charge
Lys (K)        Lysine          Positive charge
Asn (N)        Asparagine      Polar
Gln (Q)        Glutamine       Polar
Ser (S)        Serine          Polar
Thr (T)        Threonine       Polar
Ala (A)        Alanine         Hydrophobic
Val (V)        Valine          Hydrophobic
Leu (L)        Leucine         Hydrophobic
Ile (I)        Isoleucine      Hydrophobic
Phe (F)        Phenylalanine   Hydrophobic
Tyr (Y)        Tyrosine        Polar
Trp (W)        Tryptophan      Rarest amino acid in proteins
His (H)        Histidine       Polar
Met (M)        Methionine      Hydrophobic
Cys (C)        Cysteine        Polar
Pro (P)        Proline         Cyclic imino acid

Table 1: Overview of amino acids, with three-character and one-character (in brackets) abbreviation [Knippers, 2006]

The next-to-last level is called the tertiary structure. A tertiary structure is the combination of a number of α-helices and β-sheets (pleated sheets). The different secondary structure elements of a protein are connected by intervening stretches of amino acids. Through these connections different subdomains can be distinguished. The combination of subdomains is the quaternary structure. A domain is the smallest protein unit with a defined and independent structure and is composed of 50 to 150 amino acids. Such domains (Figure 3) are often responsible for specific reactions, and the interaction between them is essential for the function of the whole protein. [Knippers, 2006]

1.3. Sequence Analysis

The question that the analysis of DNA and protein sequences is most often meant to answer is: are two or more sequences related to one another? Such analyses can be done by aligning sequences. When DNA or protein sequences are related through a common ancestor, they are homologous. The relatedness of sequences also allows conclusions about analogous functions. Furthermore, analyzing a larger number of sequences can lead to the identification of domains and motifs which may be shared among a group of them. [Pevsner, 2003]

Sequences can be homologous, similar or identical. Homology is a qualitative inference: two sequences either are homologous or are not. Similarity and identity, in contrast, are quantities. All of these terms describe the relatedness of sequences.

Homologous proteins can be divided into orthologs and paralogs. Orthologs are homologous proteins found in different species which share an ancestor. Such orthologs often have similar biological functions, but this is not necessarily the case. They are identified via database searches that yield significant alignment scores.

Paralogs are homologous proteins or sequences within the same species which arose, for example, by gene duplication. Paralogs can often be found in diverse locations in an organism and frequently have different but related functions. The relatedness of two sequences can be detected by performing a pairwise alignment. [Pevsner, 2003]

1.3.1. Pairwise Alignment

In a pairwise alignment two sequences (see Figure 5) or parts of two sequences are written one above the other, using the single-letter code for proteins. A computer algorithm is needed to compute such an alignment, because this is not easy to do manually. The difficulties arise from the possibility of gaps, deletions, insertions and substitutions of one or more amino acids or nucleotides. For this reason the number of potential alignments grows exponentially with the sequence length. [Pevsner, 2003]

The relatedness of sequences results from mutation and selection, which both act on those sequences. Mutational processes are substitutions, insertions and deletions. In a substitution residues are exchanged for other ones. A deletion removes residues and an insertion adds residues. Insertions and deletions lead to gaps. Due to selection some mutations are more successful than others, and how often those mutations are observed depends on it. The relative likelihood that two sequences are related is expressed by the total score of an alignment. This score is a sum of values, one for each aligned pair of residues. Those values depend on the identities, substitutions and gaps which occur in the alignment. Identities and conservative substitutions are expected to be more likely than non-conservative changes.

Figure 5: Example of a pairwise alignment of HBA_HUMAN with HBB_HUMAN, LGB2_LUBLU and F11G11.2 [Durbin, et al., 1998/2006, Fig. 2.1]

For this reason conservative substitutions and identities get positive values, whereas non-conservative changes get negative values. These values are taken from substitution matrices, for example BLOSUM50 (see Figure 6). [Durbin, et al., 1998/2006]

For gaps, resulting from insertions or deletions, a penalty is applied to the score. This penalty depends on the length of the gap. In addition, a distinction is made between a gap-open penalty and a gap-extension penalty, the latter usually being smaller than the former. [Durbin, et al., 1998/2006]
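The scoring scheme described above can be sketched as follows. The substitution values in `TOY_MATRIX` and the penalties `gap_open=12` and `gap_extend=2` are assumptions made for this illustration, not values taken from this thesis; a real analysis would use a complete substitution matrix such as BLOSUM50.

```python
# Toy substitution scores for a few residue pairs (illustrative values only).
TOY_MATRIX = {
    ("A", "A"): 5, ("S", "S"): 5, ("W", "W"): 15,
    ("A", "S"): 1, ("A", "W"): -3, ("S", "W"): -4,
}


def score_alignment(row_a, row_b, substitution, gap_open=12, gap_extend=2):
    """Sum substitution scores and affine gap penalties over aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    previous_gap_row = None  # "a" or "b" while inside a gap, otherwise None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Opening a gap costs more than extending a gap that is already open.
            total -= gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            # The matrix is symmetric, so look the pair up in either order.
            pair = (a, b) if (a, b) in substitution else (b, a)
            total += substitution[pair]
            previous_gap_row = None
    return total


# Two identities, a gap of length two (open + extend), one more identity.
print(score_alignment("AWSSA", "AW--A", TOY_MATRIX))
```

In this example the three identities contribute 5 + 15 + 5 and the gap of length two costs 12 + 2, which gives a total score of 11.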

A pairwise alignment can be done either over two full sequences, called a global alignment (including allowed gaps), or with two subsequences, called a local alignment. [Durbin, et al., 1998/2006]

For the global alignment the Needleman-Wunsch algorithm is used. For this algorithm a matrix F with the indices i and j is built. F(i,j) contains the best score for the alignment of the two sequences up to positions i and j. The matrix is filled with values according to the recurrence in Formula 1. An example of such a matrix is shown in Figure 7.

Formula 1: Needleman-Wunsch matrix algorithm [Durbin, et al., 1998/2006]
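As a minimal sketch of how such a matrix is filled, the following Python function applies the recurrence commonly given for Needleman-Wunsch with a linear gap penalty d, F(i,j) = max(F(i-1,j-1) + s(x_i,y_j), F(i-1,j) - d, F(i,j-1) - d). The scoring function `simple_score` (match +2, mismatch -1) and the penalty d=2 in the example are assumptions for illustration; the figures in this chapter are based on BLOSUM50 instead. The traceback that recovers the alignment itself is omitted.

```python
def needleman_wunsch(x, y, score, d=8):
    """Fill and return the global alignment matrix F.

    F[i][j] is the best score for aligning x[:i] with y[:j]:
    F(i,j) = max(F(i-1,j-1) + s(x_i,y_j), F(i-1,j) - d, F(i,j-1) - d)
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                              # x_i against a gap
                F[i][j - 1] - d,                              # y_j against a gap
            )
    return F


def simple_score(a, b):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


F = needleman_wunsch("GATTACA", "GCATGCT", simple_score, d=2)
print(F[-1][-1])  # score of the best global alignment (bottom right cell)
```

The score of the best global alignment is always found in the bottom right cell, because a global alignment has to cover both sequences completely.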

Figure 6: BLOSUM50 substitution matrix for sequence alignments [Durbin, et al., 1998/2006]

Figure 7: A sequence alignment matrix derived from Needleman-Wunsch algorithm [Durbin, et al., 1998/2006, Fig. 2.5]

The local alignment can be done with the Smith-Waterman algorithm (Formula 2). This type of alignment is useful to find out whether two protein sequences share a domain or to compare extended sections of DNA sequences. It is also a very sensitive method for finding similarities between highly diverged sequences.

Formula 2: Smith-Waterman matrix algorithm [Durbin, et al., 1998/2006]

The best local alignment of two subsequences is the one with the highest score. The Smith-Waterman algorithm is very similar to the Needleman-Wunsch algorithm, with only two differences. First, a cell takes the value 0 if all other options would be negative, which corresponds to starting a new alignment. Second, the alignment does not have to end in the bottom right corner of the matrix; it can end anywhere (see Figure 8). The highest value in the matrix is therefore the best score. [Durbin, et al., 1998/2006]
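The two differences can be seen in a minimal Python sketch of the matrix fill. As before, the scoring function `simple_score` and the linear gap penalty d=2 are assumptions for illustration only, and the traceback is omitted.

```python
def smith_waterman(x, y, score, d=8):
    """Return (best score, end cell) of the best local alignment of x and y."""
    n, m = len(x), len(y)
    # The first row and column stay 0: a local alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                            # start a new alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - d,                              # x_i against a gap
                F[i][j - 1] - d,                              # y_j against a gap
            )
            # The alignment may end anywhere, so track the maximum over all cells.
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    return best, best_cell


def simple_score(a, b):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


# Both sequences share the segment "ACG" but differ everywhere else.
print(smith_waterman("TTACGTT", "GGACGGG", simple_score, d=2))
```

For the two example sequences the best local alignment is the shared segment "ACG" with a score of 6, ending in cell (5, 5). A traceback would start at that cell and stop at the first cell with value 0.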

Figure 8: a sequence alignment matrix derived from Smith-Waterman algorithm [Durbin, et al., 1998/2006]

1.3.2. Multiple Alignment

In a multiple alignment more than two sequences are aligned to each other. More precisely, the homologous residues of those sequences are aligned in columns. In a multiple alignment, homologous residues are meant to be homologous in both a structural and an evolutionary sense. [Durbin, et al., 1998/2006]

With the help of multiple alignments, members of gene or protein families can be identified. Membership in such a family can lead to the prediction that those members have similar functions and structures. [Pevsner, 2003]

“Aligned residues tend to occupy corresponding positions in the three-dimensional structure of each aligned protein”

[Pevsner, 2003, p. 320]

To create a multiple alignment by hand, an expert needs much experience in protein sequence evolution [Durbin, Eddy, Krogh, & Mitchison, 1998/2006], and once the sequences have diverged, a multiple alignment is very difficult to construct. Additionally, it cannot be expected that a single correct multiple alignment of a protein family exists. [Pevsner, 2003]

For automated multiple alignments a good scoring scheme is necessary to judge the quality and correctness of the results. [Durbin, et al., 1998/2006]

The most frequently used algorithms for multiple alignments are based on the progressive alignment method of Da-Fei Feng and Russell Doolittle. This method consists of three stages.

First, a global pairwise alignment is computed for every pair of sequences. Second, a guide tree is created with the help of a distance matrix. In the third stage the sequences are added to the multiple alignment according to the guide tree, the most closely related ones first.

ClustalW is a very popular program for creating multiple alignments and implements such an algorithm. [Pevsner, 2003]

Alternatively, multiple alignments can be queried and fetched from the Pfam database. This is an online database which contains multiple alignments and HMM profiles (hidden Markov model profiles) of complete protein domains. [Sonnhammer, et al., 1997]

The database is divided into two parts, Pfam-A and Pfam-B. Pfam-A contains only manually curated HMM profiles and protein families, whereas the data in Pfam-B are not of the same high quality and are not completely annotated. [Pevsner, 2003]

1.4. Hidden Markov Models and Profile Hidden Markov Models

A hidden Markov model (HMM) is a probabilistic model for sequences of symbols. With HMMs different problems can be addressed, for example whether a sequence belongs to a certain family. Another question is, given that the family is known, what can be said about the internal structure of the sequence. [Durbin, Eddy, Krogh, & Mitchison, 1998/2006]

1.4.1. Markov Chains

Markov chains are probabilistic models which describe sequences in which the appearance of a symbol (nucleotide or amino acid) depends on the previous symbol. They can be displayed graphically and consist of states and transition probabilities. Each state stands for a particular residue, and the states are connected by arrows, which represent the transition probabilities. Additionally, begin and end states can be added to a Markov chain to model the start and the end of a sequence (see Figure 9). [Durbin, et al., 1998/2006]

Figure 9: a modified graphic of the simple DNA Markov chain, containing additionally begin- and end-states [Durbin, et al., 1998/2006, Fig. 3.1]

Figure 10 shows an example of a Markov chain for a DNA sequence. The circles are the states, in this case the four possible nucleotides A, T, C and G of the DNA alphabet. The arrows represent the transition probabilities, i.e. the probability with which each nucleotide follows the current one in the sequence. [Durbin, et al., 1998/2006]
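How such a chain assigns a probability to a sequence can be sketched in a few lines of Python. The transition and start probabilities below are made-up values for illustration (a nucleotide is assumed to be followed by itself with probability 0.4 and by each other nucleotide with probability 0.2); real values would be estimated from sequence data.

```python
import math

NUCLEOTIDES = "ACGT"

# Hypothetical transition probabilities P(next | current); each row sums to 1.
TRANSITIONS = {
    a: {b: (0.4 if a == b else 0.2) for b in NUCLEOTIDES} for a in NUCLEOTIDES
}
# Hypothetical probabilities for the first symbol (transitions from a begin state).
START = {a: 0.25 for a in NUCLEOTIDES}


def log_probability(seq):
    """Return the log probability of seq under the first-order Markov chain."""
    if not seq:
        return 0.0
    logp = math.log(START[seq[0]])
    # Each symbol depends only on the symbol directly before it.
    for previous, current in zip(seq, seq[1:]):
        logp += math.log(TRANSITIONS[previous][current])
    return logp


print(math.exp(log_probability("AAC")))  # P(A) * P(A|A) * P(C|A)
```

Logarithms are used because the product of many small probabilities quickly becomes too small to represent as a floating point number.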

Figure 10: graphic of a simple Markov chain, containing 4 states, one for each nucleotide in the DNA-alphabet [Durbin, et al., 1998/2006]

1.4.2. Hidden Markov Models

The important distinction between a Markov chain and a hidden Markov model (HMM) is that in an HMM there is no longer a one-to-one correspondence between symbol and state. The state that emitted a symbol can no longer be determined just by looking at that symbol. [Durbin, et al., 1998/2006]

“The name ‘hidden Markov model’ comes from the fact that the state sequence is a first-order Markov chain, but only the symbol sequence is directly observed”

[Eddy, 1998, p.756]

The most interesting information about a sequence that can be derived from an HMM is the sequence of underlying states. For this reason the sequence must be decoded. Several algorithms are available for this; the most common one is the Viterbi algorithm. Like Needleman-Wunsch and Smith-Waterman, both described above, it is a dynamic programming algorithm. [Durbin, et al., 1998/2006]
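The decoding step can be illustrated with a small, generic Viterbi implementation. The two-state model below (a "GC" state that prefers G and C, and an "AT" state that prefers A and T) and all of its probabilities are hypothetical and serve only as an example; it is not the model used by HMMER.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed symbols.

    Works in log space to avoid underflow; all probabilities that are used
    must be greater than zero.
    """
    if not observations:
        return []
    # best[t][s]: log probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + math.log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state to recover the whole path.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]


# Hypothetical toy model with two hidden states.
STATES = ("GC", "AT")
START_P = {"GC": 0.5, "AT": 0.5}
TRANS_P = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
EMIT_P = {
    "GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print(viterbi("GGGGAAAA", STATES, START_P, TRANS_P, EMIT_P))
```

For the example sequence the decoded path stays in the "GC" state for the four G symbols and then switches once to the "AT" state, although neither state is observed directly.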

Defining an HMM may be one of the most difficult parts of using HMMs. First the structure has to be designed, i.e. the possible states and the connections between them. Then values have to be assigned to the parameters, the transition and emission probabilities. The emission probability is the probability that a state emits a particular symbol. The probability that a state changes into a certain other state is called the transition probability. In sequence analysis the parameters can be estimated from so-called training sequences. These are a set of independent example sequences which are all expected to fit the desired model well. [Durbin, et al., 1998/2006]

1.4.3. Profile Hidden Markov Models

To build a profile HMM (pHMM) an existing multiple alignment is used as input. [Eddy S. R., 1998] A graphical representation of such a pHMM can be seen in Figure 11.

A pHMM describes a protein family by specifying position-specific letter emission distributions and position-specific insertion and deletion probabilities. [Schuster-Böckler, et al., 2004]
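The idea of position-specific emission distributions can be sketched with a simplified estimate from alignment columns. This is not the procedure used by hmmbuild: the rule that a column with more than 50% gaps is treated as an insert column, the uniform pseudocount of 1 and the toy alignment are all assumptions made for this illustration.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids in one-letter code


def match_emissions(alignment, max_gap_fraction=0.5, pseudocount=1.0):
    """Estimate emission probabilities for the match states of a toy profile.

    Columns with more than max_gap_fraction gaps are treated as insert
    columns and skipped; every other column becomes one match state.
    """
    n_seqs = len(alignment)
    emissions = []
    for column in zip(*alignment):
        if column.count("-") / n_seqs > max_gap_fraction:
            continue  # insert column, no match state
        counts = Counter(c for c in column if c != "-")
        # Pseudocounts keep residues that were never observed above zero.
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        emissions.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return emissions


# Hypothetical toy alignment of four sequences with five columns.
toy_alignment = ["ACD-E", "ACD-E", "AC-WE", "GCD-E"]
profile = match_emissions(toy_alignment)
print(len(profile))        # number of match states
print(profile[0]["A"])     # emission probability of A in the first match state
```

The fourth column contains three gaps in four sequences and is therefore skipped, so the five alignment columns yield four match states, each with its own distribution over the 20 amino acids.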

Using a profile HMM to search against a database helps to find distantly related homologs. [Pevsner, 2003]

Figure 11: “A small profile HMM (right) representing a short multiple alignment of five sequences (left) with three consensus columns. The three columns are modeled by three match states (squares labeled m1, m2 and m3), each of which has 20 residue emission probabilities, shown with black bars. Insert states (diamonds labeled i0 – i3) also have 20 emission probabilities each. Delete states (circles labeled d1-d3) are ‘mute’ states that have no emission probabilities. A begin and end state are included (b,e). State transition probabilities are shown as arrows.” [Eddy, 1998 p. 757]

1.5. Visualization in Bioinformatics

Human perception and cognition are limited, and so is the amount of information which can be processed and analyzed in a short period. For information visualization (IV) the data must be selected, transformed and represented. The visualization of biological data can help to cope with the mass of data by using the bandwidth of human vision, which is able to distinguish trends and patterns. [Tao, et al., 2004]

Information visualization techniques have two main aims. The first is to visualize large amounts of data, and the second is to reveal patterns and trends so that the data are easier to analyze. [Tao, et al., 2004]

To implement IV, data must be obtained, mapped and rendered. In Bioinformatics IV has been implemented for different purposes. For biomolecular structures, 2D tools visualize secondary structures of nucleic acids and proteins, and 3D tools visualize tertiary and higher structures. Sequence analysis can be visualized simply by aligning the text of two sequences against each other or by using dot plots and percent identity plots. Several IV techniques have also been implemented for expression profiles, genome and sequence annotation, molecular pathways and ontology, taxonomy and phylogeny. Because biological phenomena are complex and knowledge about them is limited, IV techniques are increasingly important for the analysis of biological data. [Tao, et al., 2004]

For position-specific score matrices (PSSMs), also called sequence profiles, and ungapped multiple alignments, Sequence Logos are used for visualization. Such a Sequence Logo consists of a stack of letters which stand for amino acids or nucleotides in a multiple alignment and describes the conservation of the columns or positions. The height of a letter displays its frequency at this position. Additionally, colors are used for other properties. Schuster-Böckler, Schultz, & Rahmann adapted the Sequence Logos and implemented HMM Logos in 2004 (see Figure 12). They included the fact that positions can be deleted, inserted or conserved. [Schuster-Böckler, et al., 2004]
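How letter heights in a Sequence Logo can be derived from an alignment column may be sketched as follows. This assumes the classic sequence-logo convention, in which the height of the whole stack is the information content of the column in bits and each letter takes a share proportional to its frequency; small-sample corrections are omitted, and the additional insert and delete information of HMM Logos is not modeled. The function name `logo_column` is hypothetical.

```python
import math
from collections import Counter


def logo_column(column, alphabet_size=20):
    """Return the letter heights (in bits) for one alignment column.

    Stack height = log2(alphabet size) - Shannon entropy of the column;
    each letter receives a share proportional to its frequency.
    """
    counts = Counter(column)
    n = len(column)
    freqs = {sym: c / n for sym, c in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {sym: f * information for sym, f in freqs.items()}


# Toy nucleotide columns (alphabet of size 4), chosen for illustration.
conserved = logo_column("AAAA", alphabet_size=4)
uniform = logo_column("ACGT", alphabet_size=4)
```

A fully conserved column yields one tall letter, whereas a column in which all symbols are equally frequent carries no information and yields a stack of height zero.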


Figure 12: Example of an HMM Logo [Schuster-Böckler, et al., 2004] - “Comparison of the HMM Logos of the small GTPases Ras and Rab from SMART [3]. The Ras logo is based on an alignment of 35 sequences; the Rab logo on 48 sequences. The height of the entire vertical axis is 5 bits for both logos. Subfamily specific sites RabF2 to RabF5 [13] are indicated by arrows” [Schuster-Böckler, et al., 2004, p. 7]


1.6. Graphical User Interfaces

A graphical user interface (GUI) is software that provides the user with an interface to an application through selectable menus, clickable buttons and similar controls. The user does not need to know which commands she or he would have to use for the same application on the command line of an operating system, if the application is available for command line use at all. The options for the application are set via menus or similar controls. A GUI can be the interface to an operating system, for example Microsoft Windows XP™, or it can be an application which itself runs inside a graphical operating system, for example Microsoft Word™.

In bioinformatics many tools are developed for use on the command line. For some of those tools GUIs are already available. A few examples of GUIs for bioinformatic tools:

• DNAAlignEditor⁸: a desktop tool for the alignment of nucleotide sequences with a user-friendly interface. The user can edit multiple sequence alignments manually. [Sanchez-Villeda, et al., 2008]

• GEDI⁹: a desktop tool for the analysis of gene expression data; it is extensible, open source, freely available and user-friendly. [Fujita, et al., 2007]

• MPI Bioinformatics Toolkit¹⁰: an online tool for protein sequence analysis. [Biegert, et al., 2006]

• PFAAT 2.0¹¹: a desktop tool for annotating, editing and analyzing multiple sequence alignments. [Caffrey, et al., 2007]

8 http://maize.agron.missouri.edu/~hsanchez/DNAAlignment_Tool.html

9 http://www.iq.usp.br/wwwdocentes/mcsoga/gedi/

10 http://toolkit.tuebingen.mpg.de/

11 http://pfaat.sourceforge.net/


1.7. Available Tools and Databases

Besides HMMER, a number of other tools that use profile hidden Markov models or position-specific matrix models are available. [Eddy, 2003]

1.7.1. Web Based Tools

• PSI-BLAST: Position Specific Iterative BLAST¹²

• FISH: Family Identification with Structure anchored HMMs¹³

• Meta-MEME: uses motif-based hidden Markov modeling of biological sequences¹⁴

• BLOCKS: an online service for biological sequence analysis¹⁵

1.7.2. Command Line Based

Some tools that implement profile hidden Markov models run on the command line. Those tools are SAM, HMMER, PFTOOLS and HMMpro, all of which are based, more or less, on the HMMs of Krogh et al. [Eddy, 1998]

HMMER will be described in detail in the Material and Methods chapter.

HMMpro appears to be no longer available: its website is not accessible and a web search returns no matching hit.

PFTools¹⁶ is a package of programs for the construction of profiles. Sequences or sequence libraries can also be searched against profiles and profile libraries. It is used for the PROSITE database. [Sigrist, et al. 2002]

SAM (Sequence Alignment and Modeling System)¹⁷ is only available for Unix systems; it implements HMMs very similar to those of HMMER and includes a conversion function between the SAM and HMMER formats. [Wistrand, et al., 2005]

12 http://www.ebi.ac.uk/blastpgp

13 http://max.ucmp.umu.se/sahmm/about.php

14 http://metameme.sdsc.edu/

15 http://blocks.fhcrc.org/

16 ftp://ftp.expasy.org/databases/prosite/tools/ or ftp://ftp.isrec.isb-sib.ch/sib-isrec/pftools/

17 http://www.cse.ucsc.edu/research/compbio/sam.html


1.7.3. Protein and Profile HMM Databases

Two annotated profile HMM databases are available: first the Pfam database and second the PROSITE¹⁸ profiles database, which is a supplement to the PROSITE motifs database.

Two other profile-HMM web-based databases are SMART and TIGRFAMs. [Pevsner, 2003] All of these databases are available online.

The Pfam¹⁹ database is maintained by the Sanger Institute; the current version is 22.0 from July 2007. [Sanger]

It consists of a large number of protein families. Those families are represented by hidden Markov Models and multiple sequence alignments. [Finn, et al., 2005]

The Pfam database can be divided into Pfam-A and Pfam-B. Pfam-A is manually curated and contains high quality families. Pfam-B supplements Pfam-A with families of lower quality. [Sanger]

Additionally Pfam provides HMM-Logos, graphical representations of an HMM, visualizing the distinguishing features. [Finn, et al., 2005]

PROSITE²⁰ is hosted by the Swiss-Prot group of the Swiss Institute of Bioinformatics (SIB). It also contains protein families and domains. [Swiss-Prot, 2006]

SMART²¹ (Simple Modular Architecture Research Tool) is an online database for protein domain identification and can also be used for the analysis of protein domain architectures. [Letunic, et al. 2006]

TIGRFAMs²² is also manually curated and contains protein families which consist of hidden Markov models, multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, InterPro and Pfam models. [Haft, et al., 2003]

It belongs to the J. Craig Venter Institute (see http://www.tigr.org/db.shtml).

Superfamily 1.69 is an HMM library which contains all proteins of known structure. [Gough, et al., 2001]

18 http://www.expasy.ch/prosite/

19 http://pfam.sanger.ac.uk/

20 http://www.expasy.ch/prosite/

21 http://smart.embl-heidelberg.de/

22 http://www.tigr.org/TIGRFAMs/


2. Materials and Methods

To develop a graphical user interface for HMMER 2 and to display its results for easier interpretation, the following parts are necessary.

2.1. HMMER 2

HMMER 2 is a command line based tool for biological sequence analysis. It implements profile hidden Markov models. [Eddy, 2003]

The latest version is 2.3.2 and the package is available for a number of Operating Systems (Table 2) and architectures in precompiled binaries, including Microsoft Windows™. The source code is also freely available for compilation. [HHMI Janelia farm research campus]

HMMER 2 contains nine subprograms which have different aims. [Eddy, 2003]

All of those subprograms (Table 3) need to be mapped in the graphical user interface.

To build a profile HMM with HMMER, the user needs a multiple sequence alignment of the protein domain or sequence family she or he wants a profile for. Many input formats are possible: CLUSTAL, GCG, PHYLIP, FASTA and Stockholm²³. Stockholm is the format of the Pfam database and the native format of HMMER.

HMMER provides E-values for its search results. To get more sensitivity the created HMMs can be calibrated using hmmcalibrate.

With hmmsearch the user can search against single sequence files and major databases like Swissprot²⁴ and NCBI²⁵. HMMER can also search a query sequence against HMM databases like Pfam or self-constructed HMM databases; this is done with hmmpfam.

hmmalign is used to create multiple sequence alignments of large numbers of sequences, using an existing profile HMM. [Eddy, 2003]
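The workflow described above (build, calibrate, search) can be sketched as a sequence of command lines. The Python sketch below only assembles the argument lists and does not execute anything; the file names are hypothetical, no optional switches are shown, and the argument order follows the usage of the HMMER 2 programs as the author of this sketch understands it, so it should be checked against the HMMER user's guide before use.

```python
def build_and_search(alignment, hmm_file, sequence_db):
    """Assemble the HMMER 2 calls for building, calibrating and
    searching with a profile HMM (argument lists only, nothing is run)."""
    return [
        ["hmmbuild", hmm_file, alignment],       # model from alignment
        ["hmmcalibrate", hmm_file],              # calibrate E-values
        ["hmmsearch", hmm_file, sequence_db],    # search sequence database
    ]


def scan_hmm_database(hmm_db, query_sequence):
    """Assemble the hmmpfam call: query sequence against an HMM database."""
    return ["hmmpfam", hmm_db, query_sequence]


# Hypothetical file names, used for illustration only.
steps = build_and_search("family.sto", "family.hmm", "proteins.fasta")
scan = scan_hmm_database("Pfam_ls", "query.fasta")
command_lines = [" ".join(step) for step in steps + [scan]]
```

A list such as `steps[0]` could be handed to a process launcher (for example Python's `subprocess.run`) on a system where the HMMER 2 executables are on the search path.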

23 Description of the formats CLUSTAL, GCG, PHYLIP, FASTA and Stockholm can be found in the Appendix chapter 6.7.12

24 http://www.expasy.ch/sprot/

25 http://www.ncbi.nlm.nih.gov/sites/gquery


Architectures and Operating Systems

AMD Opteron/Linux
AMD Opteron/Solaris
Apple Macintosh Power PC OS/X
Compaq Alpha True64
Compaq Alpha Linux
Debian Linux
Hewlett/Packard IA64 (Itanium2), Linux
Hewlett/Packard IA64 (Itanium2), HP/UX
Intel FreeBSD
Intel GNU/Linux
IBM Power4, Linux
IBM Power4, AIX
IBM Power5, AIX
IBM Power6, AIX
Intel OpenBSD
Intel Solaris
Microsoft Windows™
Silicon Graphics IA64 (Itanium2), Linux
Silicon Graphics MIPS IRIX
Sun Sparc Solaris

Table 2: List of architectures and operating systems for which HMMER 2 is available [HHMI Janelia farm research campus]


Program        Application

hmmbuild       Build a model from a multiple sequence alignment
hmmalign       Align sequences to an existing model
hmmcalibrate   Calibrates an HMM, makes searches more sensitive, by the calculation of better E-value scores
hmmconvert     Converts a model file into other formats, e.g. HMMER 2 binary or GCG profiles
hmmemit        Emits sequences probabilistically from a profile HMM
hmmfetch       Fetch a single model from an HMM database
hmmindex       Index an HMM database
hmmpfam        Search an HMM database for matches to a query sequence
hmmsearch      Search a sequence database for matches to an HMM

Table 3: HMMER 2 Programs [Eddy, 2003]


2.2. Development Environment

2.2.1. Operating Systems

The software was developed and tested on Microsoft Windows XP™ Home Edition Version 2002, Service Pack 3, v.3264 and Microsoft Windows Vista Home Premium™ on the one hand, and on Ubuntu Linux 7.10 (Gutsy Gibbon) and 8.04 (Hardy Heron) on the other hand.

2.2.2. Hardware

Microsoft Windows XP™ was installed on a notebook with an AMD Athlon™ 64 processor 3700+ with 1.66 GHz and 2.00 GB RAM. Microsoft Windows Vista™ was installed on a personal computer (PC) with an AMD Athlon™ 64 3000+ with 2.01 GHz and 2 GB RAM.

The installations of Ubuntu Linux ran on a notebook with a mobile AMD Athlon™ XP2500+, 1 GB RAM and 1.66 GHz.

2.2.3. Software

Active Perl for Windows™ was installed to be able to run Perl scripts. This Perl distribution is limited compared with Perl for Linux/Unix, because only specially compiled Perl packages, with the extension ppm, can be used with it. In particular, the compilation of Perl packages containing C code turned out to be very difficult.

For the realization of the UML-Model²⁶ (Unified Modelling Language-Model) objectiF Visual Studio .NET Personal Edition™ ²⁷ was used.

HMMER 2 for Windows, developed by the Computational Biology Service Unit (CBSU) at Cornell University [HHMI Janelia farm research campus], was installed as well. It is a port of the original HMMER 2 to MS Windows™ systems and contains several *.exe files, one for each HMMER subprogram, for example hmmsearch.exe. The source code and compiled executables can be downloaded from http://www.tc.cornell.edu/WBA/, last verified on May 11, 2008. It is free software and can be redistributed freely. [University, Cornell Theory Center – Cornell]

26 Details in chapter 2.7

27 http://www.microtool.de/objectif/de/objectif_vs_net_personal_edition.asp


2.3. .NET-Framework

Before .NET was introduced, a few languages dominated application development: C++, Java and Visual Basic 6.0. With the .NET framework, C# joined this group. [Kühnel, 2006]

Microsoft™ published version 1.0 of the .NET development framework, together with an Integrated Development Environment (IDE) for it, in 2002. Version 3.0 of the .NET framework has since been published. [Kühnel, 2006]

.NET has some similarities to Java and VB 6.0, because it produces a byte code which has to be compiled at runtime on the PC. This byte code is called Microsoft™ Intermediate Language (MSIL) or just Intermediate Language (IL). The IL is compiled by the Just-In-Time compiler (JIT compiler), so that it becomes executable (Figure 13). [Kühnel, 2006]

Figure 13: Lifecycle of .NET -Code [Kühnel, 2006]

The framework contains a number of components, which will be described now.

The Common Language Specification (CLS) describes the attributes a programming language has to meet to be compatible with .NET. [Kühnel, 2006]

The Common Type System (CTS) specifies the type system that makes language-independent development possible.

There are two categories of types in .NET: value types and reference types. One special feature of the .NET CTS is that it is not integrated into a specific language, as is done elsewhere. Because of this, objects written in different languages can communicate with each other without any conversion of data types. Two further essential components of the .NET framework are the Common Language Runtime (CLR) and the .NET class library. [Kühnel, 2006]

The CLR is the runtime environment in which all .NET applications are executed as managed code. This environment consists of several subcomponents: the Class Loader, the Type Checker, the JIT-Compiler, the Exception Manager, the Garbage Collector, the Code Manager, the Security Engine, the Debug Machine, the Thread Service and the COM²⁸ (Component Object Model) Marshaller. [Kühnel, 2006]

The .NET class library is built as a tree and provides the programmer with different predefined classes, methods and types. The base of this tree is the Object class, from which all other classes inherit. [Kühnel, 2006]

Another important term in the .NET framework is “Namespaces”. Namespaces are organized in a tree as well and structure class hierarchies logically. They also make it possible to use the same class name more than once, provided that each occurrence lies in a different namespace. The base namespace is called System. It contains basic types, for example integer. [Kühnel, 2006]

During compilation the *.cs files and resource files (e.g. images) are combined into the assembly. This file has the extension .exe or .dll. An assembly is composed of three parts: the code section, the resource section and the manifest with the metadata.

The metadata consists of information about all types contained in the source code, information about the assembly, the version number and other meta information.

28 The predecessor of .NET, implemented the concept of reusable code [Schäpers, Huttary und Bremes 2002]

The main attributes of the .NET framework are [Kühnel, 2006]:

• Object orientation: All elements in .NET can be derived from an object, even simple data types such as integer. Access to the operating system is also packaged into classes. .NET is therefore 100% object oriented and represents a consistent layer for application development.

• WinAPI-32 substitution: The complicated Win32 API is to be replaced entirely by the .NET framework.

• Platform independence: Comparable with Java, .NET runs in an environment in which the code is compiled at runtime, the CLR. This CLR is open. For this reason .NET can be ported to other operating systems, which has already been done, for example, by the Mono project for UNIX/Linux.

• Language independence: The .NET framework can be used with a number of different programming languages. Code written in one of these languages can be used by another one without the necessity to rewrite it. Such languages for example are C#, VB.NET and J#.

• Memory management: In languages like C or C++ the programmer has to keep track of the available memory and therefore has to release objects manually. In the .NET framework the garbage collector does this for the programmer. It identifies objects which are no longer used and removes them from memory automatically.

• Distribution: With the .NET framework it is no longer necessary to create installation programs. The .exe or .dll file can simply be copied to a PC and runs there without any further work. The only precondition is an installed .NET environment on MS Windows™ PCs or the Mono environment on UNIX or Linux PCs.


2.4. Mono-Framework

Mono is an open source, cross-platform implementation of the .NET framework; the project was started in 2001. It is currently available in version 1.9 [de Icaza]. Just like .NET itself, it consists of a compiler, a virtual machine and API classes. The language focus lies on C#, although other languages, like Java, JavaScript, Basic, C and COBOL, are also supported.

Cross-platform means that Mono, unlike Microsoft’s .NET, can be run under Linux, Windows™ and Mac OS X™. It also supports the x86 architecture as well as PowerPC and SPARC processors. The Mono-specific libraries like Gtk# are available for all of these platforms, too. Mono produces bytecode (CIL), which is executed as managed code. [Dumbill, et al., 2004]

The implementation of Mono is based on the ECMA²⁹ standards for C# and the Common Language Infrastructure (CLI) and contains an ECMA compatible runtime engine (Common Language Runtime - CLR). The project is community-driven, but also sponsored by Novell. The ambition of the project is to become the leading programming framework on and for Linux. To be compatible with Windows .NET, compatibility libraries are integrated. Additionally, it contains Gtk#, a third party library for the development of Gnome³⁰ applications. [Dumbill, et al., 2004]

2.5. C# and Gtk#

C# was developed by Microsoft™ as the core language for the .NET framework. It is an object oriented language based on C/C++ that also combines the characteristics of many other languages. The aim was to bundle the achievements of the following languages: C++, Visual Basic, Delphi and Java. [Schäpers, et al., 2002]

Gtk# is a wrapper for the GTK+ user interface toolkit, which was developed to provide the user interface for GIMP (GNU Image Manipulation Program). Today it is an important part of the GNOME desktop platform. Gtk# is the Mono API to the GTK+ toolkit. [Dumbill, et al., 2004]

29 Ecma International is an industry association founded in 1961 and dedicated to the standardization of Information and Communication Technology (ICT) and Consumer Electronics (CE) [ECMA]

30 GNU Network Object Model Environment, a free Desktop environment, published under GPL and LGPL by the GNU Project, for UNIX-Systems [GNOME]


2.6. Visual C# Express™ and MonoDevelop

For the development of .NET applications no IDE is necessary, but working with one is much easier.

Microsoft™ provides a freely downloadable version of its professional integrated development environment (IDE) Visual Studio, called Visual Studio Express Edition.

This IDE is available for Visual Basic, Visual C# and Visual C++. It enables the programmer to create graphical user interfaces easily, without having to deal with the problems that arise when such interfaces are coded entirely by hand. Visual C# Express³¹ does not have all the features of the professional version. [Microsoft Corporation]

On Linux-based systems a similar IDE is available. Its name is MonoDevelop and it is a port of SharpDevelop. SharpDevelop is a free, open source IDE for .NET development on Microsoft Windows™. MonoDevelop uses the Gtk# user interface toolkit and it is free and open source, too. [Dumbill, et al., 2004]

31 http://www.microsoft.com/express/vcsharp/, last verified on May 18, 2008


2.7. UML and Class Model

UML provides the ability to display the implementation and development of software in a graphical way. It is a language with clear usage rules and was defined by the Object Management Group (OMG). It is divided into four sections [Pilone, 2006]:

• Diagram Interchange

• UML Infrastructure

• UML Superstructure

• Object Constraint Language

Five types of diagrams are used for static modeling. Static modeling deals with the defined relationships at the code level. The five types are [Pilone, 2006]:

• Class Diagram

• Component Diagram

• Composite Structure Diagram

• Package Diagram

• Deployment Diagram

A Class Diagram is used for the modeling of static relationships of system components. [Pilone, 2006] For the development of the application only the Class Diagram was used.


2.8. Documentation

The implemented code is documented in line, which means that comments are inserted between the source code lines.

The inline comments are processed with a command line based documentation tool, which creates an HTML-output (Hypertext Markup Language). The open source tool is called NaturalDocs³² and can be downloaded for free. The HTML-output can be viewed in every web browser. [Valure]

32 http://www.naturaldocs.org/, last verified on May 12, 2008


3. Results

In this chapter the results are presented. They are split into the description of the implemented application, given in the sections ‘GraHMMer’ and ‘Class Model’, and the illustration of the differences in usage and output between the application and the original command line tool, given in ‘Experiments’.

3.1. GraHMMer

The developed and implemented application is called GraHMMer. The name derives from the words “graphic” and “HMMER”. The programming language used is C#, with Visual C# Express as IDE.

GraHMMer has some essential prerequisites to run properly. First, HMMER for Windows must be installed; second, the .NET runtime is necessary. So far GraHMMer runs on Microsoft Windows™ systems. It was tested successfully on Windows XP and Windows Vista Home Edition.

The intended cross-platform functionality is not implemented completely, because of some incompatibilities between .NET framework 3 and the current implementation of the Mono framework. The transfer of the graphical components caused some difficulties. Some of the controls used are not available for Mono or show a slightly changed behavior.

A graphical user interface was implemented which provides a practical overview of the features that HMMER for Windows itself provides. Besides the HMMER subprograms, HMMER for Windows delivers some additional helpful tools. One of these tools, afetch, is included in GraHMMer, too. The other additional programs are prepared for later implementation.

The results of HMMER should be processed to obtain a graphical presentation. To implement this feature the usage of HMM-Logos was intended, but it turned out that the compilation of the HMM-Logo Perl package on a Microsoft Windows™ system is very difficult. This problem comes from the fact that the HMM-Logo package itself and some of its prerequisite packages contain C code. This C code turned out to be impossible to compile without errors on a Windows™ system. Furthermore, the Active Perl software needs a certain format for added Perl packages, which is not available for the HMM-Logo package either. To substitute the planned embedded functionality of a graphical output, a web interface was integrated into GraHMMer. This interface provides access to the online platform of the Sanger Institute, where HMM-Logos can be created³³. Using this online platform inside GraHMMer requires internet access.

Additionally, the text output of hmmsearch and hmmpfam is processed in such a way that the parts it contains are displayed separately for a better overview. Those parts are ‘Sequence Scores’, ‘Domain Scores’, ‘Alignments’ and the produced ‘Histogram’. The full output is displayed as ‘raw Result’.
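How a text report could be split into these parts is sketched below in Python. This is not GraHMMer's own C# code: the function `split_report` is hypothetical, the section header strings are assumptions modeled on the headings of an hmmsearch report and should be checked against real output, and the sample report is an abridged placeholder with made-up values.

```python
# Assumed header prefixes of an hmmsearch report, mapped to the part
# names used in the text. Verify them against real HMMER 2 output.
SECTION_HEADERS = {
    "Scores for complete sequences": "Sequence Scores",
    "Parsed for domains": "Domain Scores",
    "Alignments of top-scoring domains": "Alignments",
    "Histogram of all scores": "Histogram",
}


def split_report(text):
    """Split a report into named parts; the full text is kept as 'raw Result'."""
    parts = {}
    current = None
    for line in text.splitlines():
        for prefix, name in SECTION_HEADERS.items():
            if line.startswith(prefix):
                current = name
                parts[current] = []
                break
        else:
            # Not a header line: collect it under the current part, if any.
            if current is not None:
                parts[current].append(line)
    result = {name: "\n".join(lines).strip() for name, lines in parts.items()}
    result["raw Result"] = text
    return result


# Abridged placeholder report with made-up values.
sample = "\n".join([
    "hmmsearch - search a sequence database with a profile HMM",
    "Scores for complete sequences (score includes all domains):",
    "seq1  100.0  1e-30  1",
    "",
    "Parsed for domains:",
    "seq1  1/1  5  80  100.0  1e-30",
    "",
    "Alignments of top-scoring domains:",
    "seq1: domain 1 of 1",
    "",
    "Histogram of all scores:",
    "100 1 |=",
])
report = split_report(sample)
```

Each entry of `report` could then be shown in its own view, while `report["raw Result"]` holds the unmodified output.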

Test results can be edited inside GraHMMer and can also be exported to other files. In principle, GraHMMer is project-based: a tree view on the left side of the application shows folders, which represent projects (see Figure 14).

Figure 14: Screenshot of GraHMMer, selected project is marked green in the treeview, selected HMM subprogram is hmmbuild

33 http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi

All files produced by using GraHMMer are stored in the currently selected project folder. The files in this folder are screened by extension, e.g. *.hmm, which activates certain switches in the call of the appropriate subprogram.
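Screening a project folder by file extension can be sketched as follows. The Python functions and file names are hypothetical and only illustrate the idea; which switches GraHMMer actually activates for which extension is not specified here, so the rule shown (a search needs at least one *.hmm file) is an assumption.

```python
from pathlib import PurePath


def group_by_extension(file_names):
    """Group file names by their lower-cased extension."""
    groups = {}
    for name in file_names:
        ext = PurePath(name).suffix.lower()
        groups.setdefault(ext, []).append(name)
    return groups


def profile_hmms_available(groups):
    """Assumed rule: a search can only be offered if a *.hmm file exists."""
    return bool(groups.get(".hmm"))


# Hypothetical content of a project folder.
files = ["globin.hmm", "globin.sto", "notes.txt", "kinase.HMM"]
groups = group_by_extension(files)
can_search = profile_hmms_available(groups)
```

In a real project folder the list of names would come from a directory listing instead of a literal list.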

The menu of GraHMMer is divided into four sections: ‘GraHMMer Project’, ‘HMMer Subprograms’, ‘Options’ and ‘Help’.

The ‘GraHMMer Project’ menu provides submenus for the handling of project folders. New projects, in the form of folders, can be added to the main project path, which will be explained later. An existing project can be selected as the current working project, files can be added to the working project and the application can be closed from here.

The HMMER subprograms are grouped into two submenu categories. They can be found under the ‘HMMer Subprograms’ menu and are called ‘Search’ and ‘Build’. The ‘Search’ menu (see Figure 15) contains all subprograms which can be used for searching or ‘reading’ with, in or from the profile HMMs.

Figure 15: Screenshot - the GraHMMer Search menu contains HMMER subprograms with search functionality

Those subprograms are hmmsearch, hmmpfam, hmmalign, hmmemit and hmmfetch as can be seen in Figure 15.

Figure 16: Screenshot - GraHMMer Build menu contains HMMER subprograms for building profile HMMs and profile HMM databases

The ‘Build’ menu (see Figure 16) contains all HMMER subprograms which are used for creating or optimizing a profile HMM or for building a profile HMM database. These programs are hmmbuild, hmmconvert and, in a further submenu called ‘Optimize’, hmmcalibrate and hmmindex. Additionally, the menu contains another entry, ‘create profile hmm database’. This entry opens a dialog box which lists all available profile HMMs in the working project folder. The option to add already existing profile HMM databases can be selected, too.

The third submenu of the ‘HMMer Subprograms’-menu is ‘Tools’. This menu is designed to carry the additional HMMER for Windows programs. In this version only one of these programs, afetch, is implemented in GraHMMer.

As can be seen in Figure 17, the profile HMMs can be selected and a name for the new pHMM database has to be entered. The created database is saved into the working project folder.
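One way such a database could be assembled is sketched below in Python. It relies on the HMMER 2 convention that an HMM database is a plain concatenation of profile HMM files; whether GraHMMer builds its databases exactly this way is an assumption, and the function name, file names and file contents are placeholders.

```python
import tempfile
from pathlib import Path


def create_hmm_database(hmm_files, db_path):
    """Concatenate the selected profile HMM files into one database file."""
    db_path = Path(db_path)
    with db_path.open("w") as db:
        for hmm_file in hmm_files:
            text = Path(hmm_file).read_text()
            # Make sure each model ends with a newline before the next starts.
            db.write(text if text.endswith("\n") else text + "\n")
    return db_path


# Demonstration in a temporary "project folder" with placeholder models.
with tempfile.TemporaryDirectory() as project:
    project = Path(project)
    (project / "a.hmm").write_text("HMMER2.0\nNAME  a\n//\n")
    (project / "b.hmm").write_text("HMMER2.0\nNAME  b\n//")
    selected = sorted(project.glob("*.hmm"))
    db_file = create_hmm_database(selected, project / "mydb")
    db_text = db_file.read_text()
```

The resulting file could then be indexed with hmmindex or searched with hmmpfam, as described for the ‘Optimize’ and ‘Search’ menus.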