• Keine Ergebnisse gefunden

Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enrico Bonatesta ; Mag. Dipl.-Ing. Christoph Horejš-Kainrath

N/A
N/A
Protected

Academic year: 2021

Aktie "Multiple Sequence Alignment with R / submitted by Dipl.-Inf. Enrico Bonatesta ; Mag. Dipl.-Ing. Christoph Horejš-Kainrath"

Copied!
328
0
0

Wird geladen.... (Jetzt Volltext ansehen)

Volltext

(1)

Submitted by

Dipl.-Inf. Enrico Bonatesta Mag. Dipl.-Ing. Christoph Horej ˇs-Kainrath

Submitted at

Institute for Bioinformatics

Supervisor Assoz.Univ.-Prof. Dipl.-Ing. Dr. Ulrich Bodenhofer July 2016 JOHANNES KEPLER UNIVERSITY LINZ Altenbergerstraße 69 4040 Linz, ¨Osterreich

Multiple Sequence

Alignment with R

Master Thesis

to obtain the academic degree of

Master of Science

in the Master’s Program

(2)
(3)

Eidesstattliche Erkl ¨arung

Wir erkl ¨aren an Eides statt, dass wir die vorliegende Masterarbeit selbstst ¨andig und ohne fremde Hilfe verfasst, andere als die angegebenen Quellen und Hilfsmittel nicht benutzt bzw. die w ¨ortlich oder sinngem ¨aß entnommenen Stellen als solche kenntlich gemacht haben. Die vorliegende Masterarbeit ist mit dem elektronisch ¨ubermittelten Textdokument identisch.

Enrico Bonatesta, Juli 2016

(4)
(5)

Danksagung

An erster Stelle bedanken wir uns bei unserem Aufgabensteller, Betreuer und Men-tor, Herrn Assoz.Univ.-Prof. Dipl.-Ing. Dr. Ulrich Bodenhofer, der uns w ¨ahrend des ganzen Entwicklungszeitraums immer hilfreich und meist kurzfristig zur Seite stand und zahlreiche L ¨osungsvorschl ¨age bzw. -ans ¨atze gab. Zus ¨atzlich hat er uns kurz vor der Ver ¨offentlichung des Softwarepaketes auch durch den Aufnahmeprozess gef ¨uhrt.

Außerdem m ¨ochten wir unserem Kommilitonen Johannes Palme danken, der – kurzfristig vor der Ver ¨offentlichung des Paketes – seine Hardware bzw. Betriebssys-tem zur Verf ¨ugung gestellt hatte, durch die einige Tests ohne Zeitverzug erst m ¨oglich wurden.

Zus ¨atzlich bedanken wir uns bei unseren Familien und unseren Partnerinnen, die w ¨ahrend des gesamten Studiums nie den Glauben an uns verloren und uns immer unterst ¨utzt haben.

Ferner sollen an dieser Stelle auch die indirekt Mitbeteiligten ein dankendes Wort erhalten, die mittels Email-Korrespondenz ebenfalls sehr zum Gelingen des Paketes beitrugen. Teilweise wurden innerhalb eines Tages Fehler in den Originalpaketen be-hoben oder hilfreiche L ¨osungsvorschl ¨age geliefert.

Schlussendlich m ¨ochten wir uns auch bei den beteiligten Autoren der integrierten Pakete bedanken, deren Entwicklungsarbeit die Basis f ¨ur unsere Arbeit war.

(6)
(7)

Kurzfassung

Zielsetzung der Masterarbeit war ein komplett dokumentiertes Softwarepaket, welches eine Schnittstelle zwischen der Programmiersprache R und mehreren Mulitple Se-quence Alignment Algorithmen implementiert, um so eine bis dato vorhandene L ¨ucke im Fachbereich Bioinformatik zu schließen und das Arbeiten zu vereinfachen. Dieses Paket sollte abschließend in dem f ¨ur Bioinformatiker wichtigen Softwareprojekt Bio-conductor ver ¨offentlicht werden.

Dieser Masterarbeit geht eine Studienarbeit voraus, bei der ein Prototyp entwickelt wurde, um zu zeigen, dass dies realisierbar ist. In dieser Machbarkeitsstudie wurden die zwei Algorithmen ClustalW und MUSCLE rudiment ¨ar als R-Wrapper verwirklicht, wobei dieser Wrapper nur auf einer Linux-Plattform implementiert und getestet wurde. Ausgehend von diesem Entwicklungsstand wurde im Zuge dieser Masterarbeit der Prototyp zu einem vollwertigen Programm erweitert, in dem die restlichen Pa-rameter der beiden Algorithmen ClustalW und MUSCLE implementiert und die notwendige Dokumentation zus ¨atzlich geschrieben wurden. Ferner fand eine plattfor-munabh ¨angige Realisierung statt und der Algorithmus ClustalΩ wurde hinzugef ¨ugt. Ein weiterer Algorithmus – T-Coffee – war geplant und wurde teilweise implementiert. Nach zahlreichen Problemen und nicht mehr abzusch ¨atzenden Risiken wurde dieser Algorithmus schließlich aus dem Paket genommen, um dessen Realisierung nicht zu gef ¨ahrden.

Das fertiggestellte Paket msa wurde im M ¨arz 2015 an das Bioconductor-Gremium gesendet, dort ¨uberpr ¨uft, mehrfach minimal ge ¨andert und im April 2015 im Rah-men der Ver ¨offentlichung von Bioconductor 3.1 freigeschaltet. Zum Zeitpunkt der Einreichung der Masterarbeit ist das Paket in der Version 1.4.3 verf ¨ugbar, nachdem zus ¨atzlich noch mehrere hilfreiche Funktionen und eine benutzerfreundlichere Bedi-enung hinzugef ¨ugt und kleinere Fehler ausgebessert wurden.

(8)

The objective of this master thesis was the development of a software package, which implements an interface between the programming language R and multiple sequence alignment algorithms. Afterwards, the package was released in the software project Bioconductor, which is important for bioinformaticians.

In the context of a student research project, a prototype has been implemented as a feasibility study. In this project, the two MSA algorithms ClustalW and MUSCLE were realized rudimentary as an R wrapper, constrained to the Linux platform.

Starting from this stage of development, the prototype has been extended to a fully-fledged program in this master thesis. Nearly all parameters of both algorithms ClustalW and MUSCLE have been implemented, additionally all the necessary docu-mentation – like the help pages, manual and vignette – have been elaborated. Further-more, the whole architecture has been modified to reach a platform independence and the MSA algorithm ClustalΩ has been added. T-Coffee has been considered as part of the solution and has been implemented partially, however, after numerous problems and incalculable risk this algorithm was removed from the package.

In March 2015, the package msa was sent to the Bioconductor board, and after some review processes and modifications, the package was part of the Bioconductor release 3.1 in April 2015. The current version is 1.4.3, after some helpful functions and a few bugfixes have been added.

(9)

Contents

Contents

Eidesstattliche Erkl ¨arung i

Danksagung iii

Kurzfassung v

Abstract vi

Contents vii

1 Introduction 1

1.1 Design of the Thesis . . . 2

1.2 Preliminary Work to the Thesis . . . 4

1.3 Definition of the Thesis . . . 5

1.3.1 Implementation . . . 5

1.3.2 Testing . . . 5

1.3.3 Documentation . . . 5

1.4 Functional Requirements . . . 6

1.5 Non Functional Requirements . . . 6

1.6 Boundaries of the Thesis . . . 6

1.7 Resources . . . 7

(10)

2 Multiple Sequence Alignment 13

2.1 Pairwise Alignment . . . 14

2.1.1 Pairwise Alignment Strategies . . . 14

2.1.2 Pairwise Alignment Algorithms . . . 15

2.1.3 Conclusion . . . 17

2.2 Objectives of MSA . . . 17

2.3 Problems of MSA . . . 19

2.4 Approaches of MSA Algorithms . . . 20

2.4.1 Exact Approach . . . 20

2.4.2 Progressive Approach . . . 21

2.4.3 Iterative Approach . . . 23

2.4.4 Consistency-Based Approach . . . 24

2.4.5 Structure-Based Approach . . . 26

2.5 Known Issues of MSA . . . 28

3 MSA Algorithms 29 3.1 ClustalW . . . 30

3.2 ClustalOmega . . . 32

3.3 MUSCLE . . . 34

3.3.1 Stage 1 – Draft Progressive . . . 35

3.3.2 Stage 2 – Improved Progressive . . . 36

3.3.3 Stage 3 – Refinement . . . 37 3.4 T-Coffee . . . 38 3.4.1 M-Coffee . . . 40 3.4.2 Expresso/ 3D-Coffee . . . 40 3.4.3 R-Coffee . . . 41 3.4.4 MOCCA . . . 41 3.4.5 PSI-Coffee . . . 42 3.5 Further Algorithms . . . 42 3.5.1 DIALIGN . . . 43

(11)

Contents

3.5.2 Kalign . . . 44

3.5.3 MAFFT . . . 45

3.5.4 Praline . . . 45

3.5.5 Assessment of the MSA Algorithms . . . 46

4 Usage of msa 49 4.1 Installation . . . 50

4.2 msa for the Impatient . . . 50

4.3 Functions for Multiple Sequence Alignment in More Detail . . . 58

4.3.1 ClustalW Specific Parameters . . . 61

4.3.2 ClustalOmega Specific Parameters . . . 61

4.3.3 MUSCLE-Specific Parameters . . . 62

4.4 Printing Multiple Sequence Alignments . . . 63

4.5 Processing Multiple Alignments . . . 65

4.6 Pretty-Printing Multiple Sequence Alignments . . . 68

4.6.1 Consensus Sequence and Sequence Logo . . . 69

4.6.2 Color Shading Modes . . . 71

4.6.3 Subsetting . . . 72

4.6.4 Additional Customizations . . . 72

4.6.5 Sweave or knitr Integration . . . 73

5 Implementation 75 5.1 Architecture . . . 76

5.2 Implementation . . . 78

5.2.1 msa.R . . . 78

5.2.2 R helper files . . . 82

5.2.3 msaMuscle.R, msaClustalW.R, and msaClustalOmega.R . . . . 83

5.2.4 Rto C/C++ Interface (Rcpp) . . . 87

5.2.5 MUSCLE C-Code . . . 88

(12)

5.2.7 ClustalΩ C++-Code . . . 96 5.3 Cross-Platform Requirements . . . 99 5.3.1 Linux . . . 99 5.3.2 Windows . . . 99 5.3.3 Mac OS X . . . 100 5.4 Package Build . . . 100 5.4.1 gc Library . . . 102 5.4.2 MUSCLE Library . . . 103 5.4.3 ClustalW Library . . . 105 5.4.4 ClustalΩ Library . . . 106

5.5 Bioconductor Conventions and Reported Bugs . . . 108

5.5.1 Shared Arguments in msa-function . . . 108

5.5.2 Output Order . . . 109

5.5.3 gapOpeningand gapExtension . . . 109

5.5.4 TeXshade problem . . . 110

5.5.5 Compilation Issues on Windows and Mac . . . 110

6 Testing 113 6.1 RUnit . . . 114 6.1.1 msaMuscle() . . . 114 6.1.2 msaClustalW() . . . 117 6.1.3 msaClustalOmega() . . . 119 6.2 Performance . . . 120

6.2.1 Benchmark Tests in General . . . 121

6.2.2 BAliBASE 3.0 . . . 121

6.2.3 PREFAB 4.0 . . . 122

6.2.4 BAliBASE and PREFAB for Testing msa . . . 123

6.2.5 Side Note on Technical Issues . . . 124

6.2.6 Discovered Issues . . . 127

(13)

Contents

6.2.8 Real Condition Comparison . . . 134

6.2.9 Assessment of the Performance . . . 135

7 Evaluation 139 7.1 Future Extensions . . . 140

7.2 Known Issues . . . 141

7.2.1 Memory Leaks . . . 141

7.2.2 ClustalΩ vs. Older GCC Versions on Linux/Unix . . . 141

7.2.3 ClustalΩ: OpenMP Support on Mac OS X . . . 142

7.3 Outlook on Future Prospectives . . . 142

A Substitution Matrices 143 A.1 Default Values . . . 143

A.2 Other Matrices . . . 144

A.3 Layout & Design . . . 144

A.4 Crosscheck of Same Matrices in Different Algorithms . . . 145

A.4.1 Appearance in ClustalW, ClustalOmega and TCoffee . . . 145

A.4.2 Appearance in ClustalOmega and TCoffee . . . 145

A.4.3 Appearance in ClustalW and TCoffee . . . 145

A.4.4 Conclusion . . . 145

A.5 Crosscheck with gapOpening and gapExtension . . . 145

A.5.1 ClustalOmega . . . 145

A.5.2 ClustalW . . . 145

A.5.3 MUSCLE . . . 145

A.5.4 TCoffee . . . 146

A.6 Scaling of the Matrices . . . 146

A.6.1 ClustalOmega . . . 146

A.6.2 ClustalW . . . 146

A.6.3 MUSCLE . . . 146

(14)

B Implemented Parameters 147 B.1 msaMuscle . . . 147 B.1.1 Common Parameters . . . 147 B.1.2 Value Options . . . 149 B.1.3 Flag Options . . . 151 B.2 msaClustalW . . . 154 B.2.1 Common Parameters . . . 154 B.2.2 Value Options . . . 155 B.2.3 Flag Options . . . 157 B.3 msaClustalOmega . . . 160 B.3.1 Common Parameters . . . 160 B.3.2 Value Options . . . 160 B.3.3 Flag Options . . . 162 C Code Changes 163 C.1 MUSCLE . . . 163 C.1.1 ../msa/src/Muscle/aligngivenpath.cpp . . . 163 C.1.2 ../msa/src/Muscle/aln.cpp . . . 164 C.1.3 ../msa/src/Muscle/aln.cpp . . . 165 C.1.4 ../msa/src/Muscle/diaglist.h . . . 165 C.1.5 ../msa/src/Muscle/domuscle.cpp . . . 165 C.1.6 ../msa/src/Muscle/domuscle.cpp . . . 168 C.1.7 ../msa/src/Muscle/estring.cpp . . . 168 C.1.8 ../msa/src/Muscle/fastdistjones.cpp . . . 168 C.1.9 ../msa/src/Muscle/fastdistmafft.cpp . . . 169 C.1.10 ../msa/src/Muscle/glbalignle.cpp . . . 169 C.1.11 ../msa/src/Muscle/globals.cpp . . . 169 C.1.12 ../msa/src/Muscle/globalsosx.cpp . . . 169 C.1.13 ../msa/src/Muscle/globalswin32.cpp . . . 169 C.1.14 ../msa/src/Muscle/main.cpp . . . 170

(15)

Contents C.1.15 ../msa/src/Muscle/main.cpp . . . 170 C.1.16 ../msa/src/Muscle/msa.cpp . . . 170 C.1.17 ../msa/src/Muscle/msa.h . . . 171 C.1.18 ../msa/src/Muscle/muscle.h . . . 171 C.1.19 ../msa/src/Muscle/muscle.h . . . 172 C.1.20 ../msa/src/Muscle/muscleout.cpp . . . 172 C.1.21 ../msa/src/Muscle/muscleout.cpp . . . 173 C.1.22 ../msa/src/Muscle/objscore.cpp . . . 173 C.1.23 ../msa/src/Muscle/onexception.cpp . . . 173 C.1.24 ../msa/src/Muscle/options.cpp . . . 173 C.1.25 ../msa/src/Muscle/phyfromclust.cpp . . . 174 C.1.26 ../msa/src/Muscle/phyfromclust.cpp . . . 174 C.1.27 ../msa/src/Muscle/profile.h . . . 174 C.1.28 ../msa/src/Muscle/profile.cpp . . . 175 C.1.29 ../msa/src/Muscle/readmx.cpp . . . 175 C.1.30 ../msa/src/Muscle/realigndiffse.cpp . . . 175 C.1.31 ../msa/src/Muscle/refine.cpp . . . 175 C.1.32 ../msa/src/Muscle/refinehoriz.cpp . . . 176 C.1.33 ../msa/src/Muscle/refinetreee.cpp . . . 176 C.1.34 ../msa/src/Muscle/refinew.cpp . . . 176 C.1.35 ../msa/src/Muscle/refinew.cpp . . . 176 C.1.36 ../msa/src/Muscle/savebest.cpp . . . 176 C.1.37 ../msa/src/Muscle/scorepp.cpp . . . 177 C.1.38 ../msa/src/Muscle/seq.cpp . . . 177 C.1.39 ../msa/src/Muscle/seq.h . . . 177 C.1.40 ../msa/src/Muscle/setnewhandler.cpp . . . 177 C.1.41 ../msa/src/Muscle/timing.h . . . 177 C.1.42 ../msa/src/Muscle/timing.h . . . 178 C.1.43 ../msa/src/Muscle/tree.h . . . 178

(16)

C.1.44 ../msa/src/Muscle/tree.h . . . 178 C.1.45 ../msa/src/Muscle/aln.cpp . . . 178 C.1.46 ../msa/src/Muscle/domuscle.cpp . . . 179 C.1.47 ../msa/src/Muscle/main.cpp . . . 179 C.1.48 ../msa/src/Muscle/muscle.h . . . 179 C.1.49 ../msa/src/Muscle/muscleout.cpp . . . 180 C.1.50 ../msa/src/Muscle/phyfromclust.cpp . . . 180 C.1.51 ../msa/src/Muscle/profile.cpp . . . 180 C.1.52 ../msa/src/Muscle/readmx.cpp . . . 180 C.1.53 ../msa/src/Muscle/refine.cpp . . . 180 C.1.54 ../msa/src/Muscle/refinew.cpp . . . 181 C.1.55 ../msa/src/Muscle/timing.h . . . 181 C.1.56 ../msa/src/Muscle/tree.h . . . 181 C.2 ClustalW . . . 181 C.2.1 ../msa/src/ClustalW/src/Clustal.cpp . . . 181 C.2.2 ../msa/src/ClustalW/src/Clustal.h . . . 184 C.2.3 ../msa/src/ClustalW/src/alignment/Alignment.cpp . . . 184 C.2.4 ../msa/src/ClustalW/src/alignment/AlignmentOutput.cpp . . . 185 C.2.5 ../msa/src/ClustalW/src/alignment/AlignmentOutput.h . . . 191 C.2.6 ../msa/src/ClustalW/src/alignment/Sequence.cpp . . . 191 C.2.7 ../msa/src/ClustalW/src/fileInput/ClustalFileParser.cpp . . . 192 C.2.8 ../msa/src/ClustalW/src/fileInput/EMBLFileParser.cpp . . . 192 C.2.9 ../msa/src/ClustalW/src/fileInput/FileParser.cpp . . . 193 C.2.10 ../msa/src/ClustalW/src/fileInput/FileParser.h . . . 193 C.2.11 ../msa/src/ClustalW/src/fileInput/FileReader.cpp . . . 193 C.2.12 ../msa/src/ClustalW/src/fileInput/FileReader.h . . . 195 C.2.13 ../msa/src/ClustalW/src/fileInput/GDEFileParser.cpp . . . 195 C.2.14 ../msa/src/ClustalW/src/fileInput/MSFFileParser.cpp . . . 196 C.2.15 ../msa/src/ClustalW/src/fileInput/PIRFileParser.cpp . . . 196

(17)

Contents C.2.16 ../msa/src/ClustalW/src/fileInput/PearsonFileParser.cpp . . . 196 C.2.17 ../msa/src/ClustalW/src/fileInput/RSFFileParser.cpp . . . 197 C.2.18 ../msa/src/ClustalW/src/general/OutputFile.cpp . . . 197 C.2.19 ../msa/src/ClustalW/src/general/Stats.cpp . . . 197 C.2.20 ../msa/src/ClustalW/src/general/VectorOutOfRange.h . . . 197 C.2.21 ../msa/src/ClustalW/src/general/clustalw.h . . . 197 C.2.22 ../msa/src/ClustalW/src/interface/CommandLineParser.cpp . . . 198 C.2.23 ../msa/src/ClustalW/src/interface/CommandLineParser.h . . . 202 C.2.24 ../msa/src/ClustalW/src/interface/InteractiveMenu.cpp . . . 203 C.2.25 ../msa/src/ClustalW/src/main.cpp . . . 203 C.2.26 ../msa/src/ClustalW/src/multipleAlign/Iteration.cpp . . . 203 C.2.27 ../msa/src/ClustalW/src/multipleAlign/MSA.cpp . . . 204 C.2.28 ../msa/src/ClustalW/src/pairwise/FastPairwiseAlign.cpp . . . 204 C.2.29 ../msa/src/ClustalW/src/pairwise/FullPairwiseAlign.cpp . . . 204 C.2.30 ../msa/src/ClustalW/src/substitutionMatrix/SubMatrix.cpp . . . . 204 C.2.31 ../msa/src/ClustalW/src/substitutionMatrix/SubMatrix.h . . . 210 C.2.32 ../msa/src/ClustalW/src/tree/Tree.cpp . . . 210 C.2.33 ../msa/src/ClustalW/src/tree/TreeInterface.cpp . . . 211 C.2.34 ../msa/src/ClustalW/src/tree/UPGMA/RootedClusterTree.cpp . . 211 C.2.35 ../msa/src/ClustalW/src/tree/UPGMA/UPGMAAlgorithm.cpp . . . 212 C.2.36 ../msa/src/ClustalW/src/tree/UPGMA/upgmadata.h . . . 212 C.2.37 ../msa/src/ClustalW/src/tree/UnRootedClusterTree.cpp . . . 212 C.3 ClustalOmega . . . 213 C.3.1 ../msa/src/ClustalOmega/src/clustal/hhalign wrapper.c . . . 213 C.3.2 ../msa/src/ClustalOmega/src/clustal/ktuple pair.c . . . 215 C.3.3 ../msa/src/ClustalOmega/src/clustal/list.c . . . 215 C.3.4 ../msa/src/ClustalOmega/src/clustal/log.c . . . 215 C.3.5 ../msa/src/ClustalOmega/src/clustal/mbed.c . . . 216 C.3.6 ../msa/src/ClustalOmega/src/clustal/mbed.h . . . 219

(18)

C.3.7 ../msa/src/ClustalOmega/src/clustal/pair dist.c . . . 219 C.3.8 ../msa/src/ClustalOmega/src/clustal/pair dist.h . . . 220 C.3.9 ../msa/src/ClustalOmega/src/clustal/progress.h . . . 220 C.3.10 ../msa/src/ClustalOmega/src/clustal/seq.c . . . 220 C.3.11 ../msa/src/ClustalOmega/src/clustal/seq.h . . . 226 C.3.12 ../msa/src/ClustalOmega/src/clustal/symmatrix.c . . . 227 C.3.13 ../msa/src/ClustalOmega/src/clustal/symmatrix.h . . . 228 C.3.14 ../msa/src/ClustalOmega/src/clustal/tree.c . . . 228 C.3.15 ../msa/src/ClustalOmega/src/clustal/util.c . . . 228 C.3.16 ../msa/src/ClustalOmega/src/clustal-omega.c . . . 229 C.3.17 ../msa/src/ClustalOmega/src/clustal-omega.h . . . 231 C.3.18 ../msa/src/ClustalOmega/src/clustalo-api-test.c . . . 232 C.3.19 ../msa/src/ClustalOmega/src/hhalign/general.h . . . 232 C.3.20 ../msa/src/ClustalOmega/src/hhalign/hhalign.cpp . . . 233 C.3.21 ../msa/src/ClustalOmega/src/hhalign/hhalign.h . . . 236 C.3.22 ../msa/src/ClustalOmega/src/hhalign/hhalignment-C.h . . . 237 C.3.23 ../msa/src/ClustalOmega/src/hhalign/hhdecl-C.h . . . 242 C.3.24 ../msa/src/ClustalOmega/src/hhalign/hhfullalignment-C.h . . . . 242 C.3.25 ../msa/src/ClustalOmega/src/hhalign/hhfullalignment.h . . . 243 C.3.26 ../msa/src/ClustalOmega/src/hhalign/hhfunc-C.h . . . 243 C.3.27 ../msa/src/ClustalOmega/src/hhalign/hhhalfalignment-C.h . . . . 244 C.3.28 ../msa/src/ClustalOmega/src/hhalign/hhhit-C.h . . . 244 C.3.29 ../msa/src/ClustalOmega/src/hhalign/hhhitlist-C.h . . . 248 C.3.30 ../msa/src/ClustalOmega/src/hhalign/hhhitlist.h . . . 254 C.3.31 ../msa/src/ClustalOmega/src/hhalign/hhhmm-C.h . . . 254 C.3.32 ../msa/src/ClustalOmega/src/hhalign/hhmatrices-C.h . . . 258 C.3.33 ../msa/src/ClustalOmega/src/hhalign/hhutil-C.h . . . 258 C.3.34 ../msa/src/ClustalOmega/src/hhalign/list-C.h . . . 261 C.3.35 ../msa/src/ClustalOmega/src/hhalign/list.h . . . 261

(19)

Contents C.3.36 ../msa/src/ClustalOmega/src/hhalign/util-C.h . . . 261 C.3.37 ../msa/src/ClustalOmega/src/kmpp/KMeans.cpp . . . 261 C.3.38 ../msa/src/ClustalOmega/src/kmpp/KMeans.h . . . 264 C.3.39 ../msa/src/ClustalOmega/src/kmpp/KmTree.h . . . 265 C.3.40 ../msa/src/ClustalOmega/src/kmpp/KmUtils.cpp . . . 266 C.3.41 ../msa/src/ClustalOmega/src/kmpp/KmUtils.h . . . 267 C.3.42 ../msa/src/ClustalOmega/src/main.cpp . . . 268 C.3.43 ../msa/src/ClustalOmega/src/mymain.c . . . 268 C.3.44 ../msa/src/ClustalOmega/src/mymain.h . . . 276 C.3.45 ../msa/src/ClustalOmega/src/squid/a2m.c . . . 276 C.3.46 ../msa/src/ClustalOmega/src/squid/clustal.c . . . 277 C.3.47 ../msa/src/ClustalOmega/src/squid/hsregex.c . . . 279 C.3.48 ../msa/src/ClustalOmega/src/squid/msa.c . . . 279 C.3.49 ../msa/src/ClustalOmega/src/squid/msa.h . . . 280 C.3.50 ../msa/src/ClustalOmega/src/squid/msf.c . . . 280 C.3.51 ../msa/src/ClustalOmega/src/squid/phylip.c . . . 281 C.3.52 ../msa/src/ClustalOmega/src/squid/revcomp.c . . . 281 C.3.53 ../msa/src/ClustalOmega/src/squid/sqerror.c . . . 281 C.3.54 ../msa/src/ClustalOmega/src/squid/squid.h . . . 281 C.3.55 ../msa/src/ClustalOmega/src/squid/sre string.c . . . 282 C.3.56 ../msa/src/ClustalOmega/src/squid/ssi.c . . . 282 C.3.57 ../msa/src/ClustalOmega/src/squid/stockholm.c . . . 282 C.3.58 ../msa/src/ClustalOmega/src/squid/stopwatch.c . . . 283 C.3.59 ../msa/src/ClustalOmega/src/squid/stopwatch.h . . . 283 D Declaration of Authorship 285 D.1 Authors of the Master Thesis - Documentation . . . 285

D.2 Authors of the Master Thesis - Implementation . . . 287

D.2.1 Implementation of the functionality . . . 287

(20)

D.2.3 Implementation of the package build . . . 287

E List of Figures 289

F List of Tables 291

G List of Listings 293

(21)

1

Introduction

Bioinformatic scientists are faced with Multiple Sequence Alignment (MSA) analysis, which is an up-to-date problem. Many solutions and algorithms, like ClustalW, ClustalΩ and MUSCLE, were implemented in C or C++. One of the most common programming languages used in bioinformatics is R, an enhanced version of the statistical program-ming language S. Unfortunately, there is no interface for accessing those MSA algo-rithms via R.

Previously, users needed to handle multiple sequence alignment and further data processing with different software tools. To bridge the gap between those two worlds, an easy installable R package, which could be used on any operating system (except mobile devices) like Linux, Microsoft Windows or Mac OS X would have been prefer-able and was missing at this moment. These were the essential motives of this thesis.

(22)

1.1

Design of the Thesis

This subsection sums up the content of the given thesis.

In this chapter 1, all essential theoretical and technical preconditions of the thesis are presented. The preliminary work to this thesis and the definition of the thesis are given. Functional as well as non functional requirements are listed and a summary of the boundaries of this thesis is depicted. Subsequent to this overview, an abstract of all used resources is shown, furthermore, the line of action is presented.

Chapter 2 explains multiple sequence alignment. The first section is a short intro-duction to pairwise alignment. Global-, free-shift- and local-methods are explained and the according basic methods (like Needleman-Wunsch-, Smith-Waterman-, Gotoh-, and Hirschberg-algorithm) are addressed. The focus will be on computational com-plexity in order to understand the problematic computation of multiple sequence align-ments. The subsequent section is an introduction to multiple sequence alignment, followed by a section explaining the core problems of MSA. To hook up to those prob-lems, the next section shows approaches to MSA. In this context the emphasis is, at first on the exact approach, followed by the progressive and iterative approach. Af-terwards, the consistency-based approach is explained. Finally, the structure-based approaches are explained. To conclude this chapter, all known issues of these ap-proaches are presented in the last section.

Chapter 3 refers to chapter 2 and itemizes the original MSA algorithms ClustalW, ClustalΩ, MUSCLE and T-Coffee as well as other algorithms. Those algorithms are illustrated and explained in own sections and the last section presents further algo-rithms. Of course, the latter is not all-embracing as it is not the objective of this paper. The next subject is the usage of the implemented package msa, which is explained in chapter 4. First of all, the installation is considered, followed by a paragraph show-ing an illustration of the usage of the package with a few practical examples. Af-terwards, the functions for multiple sequence alignment are explained in more detail in a subsequent section. An idea of how to post process the results is depicted in three successive subsections, whereas the focus is on the integrated printing function

(23)

1.1 Design of the Thesis

msaPrettyPrint()and how to integrate those results with knitr or Sweave.

In the beginning of chapter 5 the underlying architecture of the package is de-scribed. Afterwards, all implementation aspects are considered, starting with the R-files. Associated with those files, the R to C/C++ interfaces via Rcpp are presented, and based on that, the most important changes compared to the original algorithms are highlighted. Immediately followed by the functional accomplishment. Changes for cross platform requirements are shown in an extra section. Additionally, the code changes for building the package are also listed and last but not least, the change requests of the Bioconductor board are specified.

The package tests are shown in unit 6. This chapter is divided into two blocks, starting with the RUnit-tests for all three algorithms (ClustalW, ClustalΩ and MUSCLE). Afterwards, the performance tests are shown in the second section. This section starts with benchmark tests in general, followed by two subsections where the used bench-marks (BAliBASE 3.0 and PREFAB 4.0) are described. Subsequently, the results of the direct comparison and the real condition comparison are listed. Each comparison is explained in a subsection, and they are illustrated for all three algorithms (as the results are mostly very similar for all three algorithms; the supplementary results for the modified algorithms, like variances, are attached in the appendix).

The evaluation of this project can be found in chapter 7. All appeared problems, all known unsolved issues and possible extensions and improvements of the package are highlighted. An outlook on the future prospectives is provided and finally, a conclusion of the thesis is given.

Appendix A gives an overview of the used substitution matrices. It starts with the default values, sums up all possible other matrices, depicts the layout of the matrices, shows a cross-check of some matrices in other algorithms and with other gap penalties and evaluates a possible usage in other algorithms.

Appendix B lists all parameters of the implemented algorithms MUSCLE, ClustalW, ClustalΩ and the not implemented T-Coffee. In this connection, the parameters are illustrated whether they are realized or not. In case of the latter, reasons are given,

(24)

why the implementation was not realized. Furthermore, this summary is granulated into the implementation in R, RCPP, and the corresponding C/C++ of the integrated algorithms. Additionally, a comment column for further aspects is added.

The necessary changes of the original source code are listed in appendix C. As the previous appendix, this chapter is also divided into three sections, one for each algorithm.

As a result of a collaborative work of two students, an overview on workload is given. This summary is available in appendix D.

1.2

Preliminary Work to the Thesis

In the curriculum of bioinformatics at the Johannes Kepler University Linz, a seminar on a closely related topic to the field of bioinformatics is mandatory. Among other seminar topics, MSA algorithms were available. Christoph Horejˇs-Kainrath presented the MUSCLE algorithm referring to ”MUSCLE: multiple sequence alignment with high accuracy and high throughput” (Robert C. Edgar, 2004a) and ”MUSCLE: a multiple se-quence alignment method with reduced time and space complexity” (Robert C. Edgar, 2004b). Enrico Bonatesta gave a lecture on the ClustalΩ algorithm referring to ”Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega” (Fabian Sievers et al., 2011).

Subsequent to the seminar, both students decided to get engaged in further re-search on a related topic for their project. They decided to work together and imple-mented a prototype of an R package, where they have shown a feasibility study with two multiple sequence alignment algorithms (ClustalW and MUSCLE) on a single plat-form (Linux). In this proof of concept, which was a complete success, all previous designs and projections were verified and for now, there should not be any other ob-stacles to the implementations of more multiple sequence alignment algorithms. The implementation of the prototype has shown, that an interface between MSA algorithms and the programming language R is feasible with intermediate expenditure.

(25)

1.3 Definition of the Thesis

1.3

Definition of the Thesis

A lot of preliminary work has been done for this thesis. This package is not a stand-alone implementation at all, but rather an enhancement which is based on the former project of both authors and the resulting prototype, which they have been created.

1.3.1

Implementation

The thesis was defined with the following requirements:

First, the full functionality of MUSCLE and ClustalW (e.g. guided tree) needs to be implemented. Afterwards, the implementation of T-Coffee and ClustalΩ has to be adapted. Including a check of all parameters in both algorithms, the code modifications in the programs and of course the adaption of the make-files, analogous to MUSCLE and ClustalW. Furthermore, all four algorithms should be implemented platform inde-pendent. Finally, all algorithms need to be implemented error-prone.

1.3.2

Testing

All changes need be tested on all platforms, like Linux, Microsoft Windows and Mac OS X. A test on operating systems for mobile devices is neither necessary nor useful, because of the unlikely use of algorithms on such devices. Furthermore, performance loss has to be checked.

1.3.3

Documentation

An accurate documentation has to be written. Everything needs to be documented within the package according to Bioconductor. In addition, the documentation should be designed in such way, that subsequent versions of the algorithms can easily be implemented.

(26)

1.4

Functional Requirements

The usage of R objects for in- and output sequences should be possible. Aside from this, the same parameters in all algorithms should be available in a common inter-face. A handover of external substitution matrices should be possible. Finally, a highly customizable plot of multiple sequence alignments within the package is desired. In addition to these tasks, there are a few more non functional requirements.

1.5

Non Functional Requirements

The intended result is to use only one meta package for all algorithms. Furthermore, all algorithms should be built modular and the interface needs to be designed extensible to implement additional algorithms. Updates of the implemented algorithms need to be incorporated easily. Loss of performance, due to the overhead of the new interface should be minimal. The software needs to meet the Bioconductor guidelines and in-teract with the Biostrings package. Moreover, all algorithms need to use the same naming conventions like in Biostrings package and it needs to be accepted by the Bioconductor board. All change requests from the Bioconductor board are required to be taken into account.

1.6

Boundaries of the Thesis

This thesis, the documentation of the R package and the included entire implemen-tation have clear specified boundaries. This thesis will not describe the differences between pairwise alignment and multiple sequence alignment in detail or give an ex-haustive list of pros and cons or a comparison of the existing alignment algorithms. The program code of the original algorithms is not explained either. The thesis is not an introduction to R or any of the Bioconductor packages used in this document. The implementation work ends after the package is released on the Bioconductor board

(27)

1.7 Resources and all further implementations will be done voluntarily or by other students or by members of the research staff.

1.7

Resources

Because of the development in a team of two, two different hardware systems were used, which can be seen in table 1.1.

Table 1.1: Overview Hard-/Software

System Hardware Acer AspireOne 721 Lenovo ThinkPad T500

Processor AMD Athlon II Intel Core®2 Duo CPU Neo K145 @ 1.80 GHz P8600 @ 2.40GHz× 2

Operating System 64bit Linux 32bit Linux

Ubuntu 14.04 LTS Ubuntu 12.04

Development Platform Eclipse Luna Eclipse Helios

4.4.0 Service Release 2

Data Exchange Platform Subversion Server

Project Tracking Atlassian JIRA 7.1

Used Algorithms ClustalW 2.1 MUSCLE 3.8.31

Licence GNU Lesser GPL Public Domain Software

Used Algorithms ClustalΩ 1.2.0 T-Coffee 9.03.r1318

Licence GNU General Public Licence GNU General Public Licence

Software Requirements R 3.1

Required Packages Biostrings BiocGenerics IRanges methods Rcpp S4Vectors tools Biobase knitr

Eclipse served as a common development platform for implementing R, C, C++

code, and even for writing this LATEX document. To share the code, a Subversion server

was in use. To track the progress and bugs in the project, Atlassian JIRA 7.1.0 was used. For this thesis and the resulting package, the algorithm MUSCLE is used in

(28)

ver-sion 3.8.31, the algorithm ClustalΩ is used in verver-sion 1.2.0 and the algorithm ClustalW has version 2.1. The T-Coffee version is 9.03.r1318. ClustalW is released under the GNU lesser GPL and MUSCLE is provided as public domain software. ClustalΩ and T-Coffee both have GNU General Public Licence.

The package was developed under R 3.1. It uses the R packages Biostrings

(>= 2.30.0) and methods, and imports BiocGenerics, tools, IRanges(>=

1.20.0), Rcpp (>= 0.11.1) and S4Vectors. The suggestions are Biobase

and knitr. If the user also wants to apply the integrated printing function

msaPrettyPrint(), LATEX and TEXshadeare additionally necessary.

1.8

Line of Action

In this subsection, a summary of the problem solving approach is described.

The starting point of this thesis was an existing prototype (resulting from the pre-vious project), where the two MSA algorithms ClustalW and MUSCLE were rudimen-tary implemented in an R package. It was launched by integrating the MSA algorithm ClustalΩ in the existing R package and making the existing prototype cross platform available.

At this point, a few problems occurred, especially the complete failure of the whole program, if one of the algorithms failed to compute, and it became apparent, that a refactoring is necessary. After having refactored the code, a compilation failure of one single algorithm had no more effects on the package itself and it can be built nevertheless.

As a next step, the unit tests for ClustalΩ and the two algorithms of the prototype were written and minor bugs were fixed. Because of the fact, that the tests had many repetitions with different values, we encountered another new problem, because of several periodically crashes. That is why we performed memory management checks using Valgrind (Nethercote & Seward, 2007) and discovered multiple memory leaks.

(29)

1.8 Line of Action After eliminating most of those memory leaks, which was interminable and time consuming, a big progress was made by the fact, that a breakthrough regarding the cross platform availability was managed. The prototype was running on all different platforms, including Windows, which was the most challenging platform.

Simultaneous to the later work described above, the implementation of all 150 T-Coffee-parameters in R began, but in the course of time, an issue occurred and it figured out, that this issue was a rather big problem. T-Coffee uses fork-commands, and R can not deal with such multicore commands. Furthermore, the cross platform availability was in jeopardy, because of the fact that the Bioconductor’s building envi-ronment uses MinGW as compilation for Windows, whereas T-Coffee assumes to be built with CygWin under Windows (Cedric Notredame, 2016b).

Because of an unpredictable realization effort, the code for T-Coffee was frozen and the package consists of three algorithms now.

In a next step, the argtable2-dependency of ClustalΩ was resolved and an

inte-grated TEXshade solution in R was implemented. Additionally, for not exceeding the

Bioconductor boundary, the package size was reduced. Supplementary, the make files for MUSCLE, ClustalW and ClustalΩ were adopted and simplified.

At this particular time, the last major problems were taken into account, a Windows specific adoption of the make file for ClustalΩ, the removal of cygwin and rtools depen-dencies for the Windows environment and the bugfix of a bash-bug for the Windows environment.

Solving those issues, the comparatively minor issues like the removal of all exit-calls in source files to avoid crashes of the R session were solved along the way.

To finalize the package, the implemented code was cleaned, refactored, and the necessary additional documentations like the vignette and the manual pages were added. After all this preliminary work, the release routine was started and nearly all issues were adjusted. Finally, a few loops later, the first official release as part of Bioconductor 3.1 was launched.

(30)

them were done by our supervisor Ulrich Bodenhofer, the maintainer of the software package. Those are:

• Version 1.1.1:

– fix of msa() function • Version 1.1.2:

– new print() function for multiple alignments that also allows for displaying alignments in their entirety (plus additional customizations)

– strongly improved handling of custom substitution matrices by

msaClustalW(): now custom matrices can also be supplied for nucleotide sequences which can also be passed via the ”substitutionMatrix” argument. The ”dnamatrix” argument is still available for the sake of backwards compatibility.

– strongly improved handling of custom substitution matrices by

msaMuscle()

– fix of improperly aligned sequence logos produced by msaPrettyPrint() – updated citation information

• Version 1.1.3:

– bug fix related to custom substitution matrices in the MUSCLE interface – correction and updates of documentation

• Version 1.2.0:

– Bioconductor 3.2 release • Version 1.3.0:

(31)

1.8 Line of Action • Version 1.3.1:

– fixes in Makefiles and Makevars files to account for changes in build system • Version 1.3.2:

– further fixes in Makefiles and Makevars files to account for changes in build system

– update of citation information • Version 1.3.3:

– added function for converting multiple sequence alignments for use with other sequence alignment packages

– corresponding changes in documentation • Version 1.3.4:

– added function for checking and fixing sequence names for

possi-bly problematic characters that could lead to LATEX errors when using

msaPrettyPrint()

– corresponding changes in documentation – minor namespace fix

• Version 1.3.5:

– adaptation of displaying help text by msa() function • Version 1.3.6:

– msaPrettyPrint() now also accepts dashes in file names

– added section about pretty-printing wide alignments to package vignette • Version 1.3.7:

(32)

• Version 1.4.0:

– Bioconductor 3.3 release • Versions 1.4.1 / 1.4.2:

– version number bumps for technical reasons related to Bioconductor build servers

• Version 1.4.3:

(33)

2

Multiple Sequence Alignment

Before explaining the implemented program code of the msa package, the global mul-tiple sequence alignment (MSA) is explained. This will make it possible to understand the basic ideas behind those algorithms and the included issues. In this chapter, sev-eral questions – which are affiliated with global MSA – will be clarified, like what is MSA in fact, what is the objective of MSA, which problems appear with MSA, what are the main differences of the algorithms, in particular how are those algorithms realized, or what is the subtle distinction to pairwise alignment. Local MSA and local MSA meth-ods, such as Gibbs sampler (S. Geman & D. Geman, 1984), Match-Box (Depiereux et al., 1997) or Macaw (Schuler, Altschul, & Lipman, 1991) as well as free-shift alignment will not be explained in this thesis.

(34)

2.1

Pairwise Alignment

For a better understanding of the next sections and chapters, which will describe MSA (section 2.2 - 2.5) and MSA algorithms (chapter 3), a really short excursion to pairwise alignment is necessary. The focus of pairwise alignment in this thesis lies on mem-ory requirements and computational complexity. It is not a complete introduction to pairwise alignment and the according algorithms.

As the name implies, a pairwise (pw) alignment is a pairwise comparison between two sequences (e.g. DNA, RNA, or protein sequences) and the calculation of the sim-ilarity of those sequences. The computation of the simsim-ilarity is biologically interesting, because similar sequences often have similar structures and thus similar functional-ity and consequently, there is a functional or evolutionary relationship (homology). In general, there are three strategies of doing such an alignment, which will be presented subsequently, conferring to Gusfield (1997) and Michael S. Waterman (1995).

2.1.1

Pairwise Alignment Strategies

Global Alignment

In a global alignment, all symbols of both sequences are incorporated. If both sequences consist of the same length and strong sequence homologies are ob-tained, the use of a global alignment is meaningful.

Free-Shift Alignment

The free-shift alignment (or also end-gap-free alignment) is a derivation of the global alignment, but compared to the global method, insertions and deletions (Indels) at the beginning or the end are not taken into consideration for the calcu-lation of the similarity score. If both sequences differ a lot in length, the compu-tation of a free-shift alignment is often much more reasonable as a compucompu-tation of a global alignment.

(35)

2.1 Pairwise Alignment

Local Alignment

A local alignment of two sequences does not incorporate all symbols of both sequences. For the computation of an ideal local alignment, only those parts of the sequences for which the similarity score is maximized need to be located. Another use case could be newly sequenced genomes. Genes are searched for sequences similar to well-known sequences of other organisms.

Furthermore, it should be mentioned, that local alignment methods tend to not align all the right way (under-align), whereas global alignment methods align too much by trend (”over-align”). In other words: local alignment sometimes do not recognize homologous residues or structural similarity, whereas global alignments sometimes allocate regions, which are not homologous or which are structural dissimilar (cf. R. C. Edgar, 2010).

2.1.2

Pairwise Alignment Algorithms

There are four algorithms for pairwise alignment, which will be described subsequently. In principle, all algorithms, which are described as global alignment methods, can also be used as free-shift-alignment algorithms. For this purpose, a small modification of the particular algorithm is necessary. The first row respectively the first column and the last row respectively the last column need to be initialized as zero (cf. Needleman & Wunsch, 1970).

All algorithms use dynamic programming and all methods calculate a similarity score. This score represents a measure of similarity. More specifically, the higher the score, the more similar the sequences. With two different sequences, a lot of different

alignments can be produced. An optimal alignment is achieved, if the maximum

similarity score is reached. Consequently, the optimization of those algorithms is the maximization of the similarity scores. Subsequent, the algorithms will be presented. The focus will be on runtime and memory requirements. To compute a pairwise

(36)

Smith-Waterman-Algorithmus

The Smith-Waterman-algorithm computes the optimal local alignment between two sequences. With the help of backtracking, those single alignments can be reproduced. Basically, in this case, there can exist several optimal alignments of the two sequences. The computational complexity is quadratic O(nm), and the memory space, which is necessary, is O(nm). If only the score of the optimal local alignment is computed, the requirements for memory are decreased to O(n) or rather O(m). The Smith-Waterman-algorithm is a variant of the Needleman-Wunsch-algorithm (cf. T. F. Smith & M. S. Waterman, 1981).

Needleman-Wunsch-Algorithm

The Needleman-Wunsch-algorithm produces the global alignment of two se-quences. This alignment can be read out with the help of backtracking. Starting with the element (n, m) of the matrix M , the path can be retraced and the opti-mal global alignment can be reconstructed. Analogous to the Smith-Waterman-algorithm, there can also be more than one optimal path, delivering the same optimal similarity scores. Because of the cubic runtime, with arbitrary gap costs

and computational complexity O(nm2+ n2m), the Needleman-Wunsch-algorithm

is not used in practice. In case the costs are limited to standardized costs, the

optimal alignment can be computed in quadratic runtime O(n2) (cf. Needleman

& Wunsch, 1970).

Gotoh-Algorithm

The Gotoh-algorithm is a global alignment strategy as well, but it is a further development of the Needleman-Wunsch-algorithm. The advancement is the in-troduction of affine gap costs (gap open penalty and gap extension penalties). The computational complexity of this method is O(mn) and the requirements for

memory capacity is O(n∗ m). If only the score of the optimal alignment is

(37)

2.2 Objectives of MSA

Hirschberg-Algorithm

The Hirschberg-algorithm is not a new algorithm, but in fact a clever improvement of the Needleman-Wunsch- and Gotoh-algorithm. The memory requirements for this global alignment algorithm have been linearised, whereby the consumption

of space has decreased to O(min{m, n}). As well, this method has a runtime of

O(mn)(cf. Hirschberg, 1975).

2.1.3

Conclusion

It becomes apparent, that all presented pairwise strategies and algorithms have an immense demand for memory and are really complex in computing. In case of really long sequences, all algorithms reach their limits, even if there are only two sequences to be aligned. It is obvious, that an increase of the quantity of sequences will compli-cate the problems mentioned above. For this reason, the next chapter will describe the objectives of MSA, the main problems, the approaches to solve these problems, and the known issues of still unsolved approaches.

2.2

Objectives of MSA

First of all, for a multiple sequence alignment (MSA), three or more sequences of proteins or nucleic acid (DNA or RNA) are aligned and a consensus sequence is com-puted. Figure 2.1 shows a simple alignment of sequences from different species.

Figure 2.1: Simple alignment of sequences from different species, computed with the online version of MUSCLE (EMBL-EBI, 2016c)

(38)

MSA is easy, if all the sequences are very similar, but it also can be very difficult, if the sequences are very different. If the latter occurs, many gaps are necessary. Some-times, a manual supervision or fine tuning can lead to an improvement (cf. Dandekar & Koenig, 1997).

One of the main objectives of MSA is to find homologous sequences, which are aligned in the same columns of the sequence. In other words, some genes remain conserved across different species. In most cases, it is assumed that the analysed sequences have an evolutionary relationship, and therefore share the same origin and have the same ancestors. The results will be used for the production of phylogenetic trees, which again may give hints concerning the relational structure of the sequences (cf. Dandekar & Koenig, 1997). Generally, homologous regions can be seen with bare eye, if we compare some sequences. Problems arise, in case of substitutions or gaps (insertions or deletions) in these regions, meaning that genes are mutated and not completely equal to genes from other sequences. To overcome the problem of gaps, we need a mathematical model to quantify the quality of an alignment, i.e. the sum-of-pairs scoring model is needed. Additionally, the mathematical model may need some parameters like gap opening or gap extension penalties (cf. Altschul & Erickson, 1986). The consensus and also the mismatches in a multiple sequence alignment can be really meaningful. Matching symbols signify that there must be conserved regions in those locations, responsible for the secondary structure. In contrast a mismatch is mostly located in loops. Loops are very useful to derive the evolutionary history (cf. Castresana, 2000). In the end, with the reckoning of the tertiary structure, you can assess the alignment, which was not verifiable (in the strict sense) until now (I. Elias, 2006).

The output of an MSA analysis is not limited to only one problem in bioinformatics. Multiple alignments are central to most bioinformatic techniques (cf. C. Notredame, 2002) and the usages for MSA are manifold (Pevsner, 2009):

• Homologs from distantly related members can be detected, because MSA is more sensitive than pairwise alignment.

(39)

2.3 Problems of MSA • The output of a database search (such as a BLAST search) can be better

anal-ysed with MSA to reveal conserved residues or motifs.

• MSA can provide insight into many biological questions like evolution, structure or function.

• Phylogenetic trees are built upon MSA.

• Define primers for polymerase chain reaction (PCR).

• Assembling short DNA sequences to consensus sequences for genome projects. • SNP analysis.

• Structural prediction.

2.3

Problems of MSA

To take the comparison with pairwise alignment up, for a multiple sequence alignment (MSA), three or more sequences of proteins or nucleic acid are aligned. Applying pairwise alignment – as the name suggests already – only two sequences are taken into account. Multiple sequence alignment could be considered as a generalization of the pairwise alignment problem. It is much harder (conceptually and computationally) to compute a multiple alignment than a pairwise alignment.

While the optimal alignment of two sequences can be computed exactly in poly-nomial time (O(nm), n and m are the length of the sequences), the required compu-tational time for an exact approach of a multiple alignment for N sequences with the

average sequence length L is O(2NLN) (cf. Pevsner, 2009). Other sources specify

L as the length of the longest sequence in the alignment (cf. Michael S. Waterman,

1995), but because both notations show an exponential growth, this detail is of minor interest. Isaac Elias (2003) has proved, that this optimization problem is NP-hard with the most commonly used Sum-of-pair (SP) score, and in a later sophisticated execu-tion, he has even shown, that it is NP-complete (cf. Michael L ¨assig, 2002).

(40)

Furthermore, ”a set of homologous biological sequences has been produced through an evolutionary process, and when considering more than three sequences the multiple alignment should pay regard to the phylogenetic tree connecting the se-quences. The tree alignment problem can be solved in exponential time using dynamic programming, but this solution is inadequate for most biological problems” (L ¨oytynoja & Goldman, 2005).

There are many approaches to solve those problems crystallized above, and the first theoretical concept of such a solution was formulated in 1975 by the canadian mathematician Sankoff (1975). A lot of scientific work followed and will be presented in detail in the next section (2.4). The corresponding algorithms are presented in chapter 3.

2.4

Approaches of MSA Algorithms

According to Pevsner (2009), five different approaches to multiple sequence alignment exist. These approaches will be discussed briefly. Other authors like Chuong B. Do and Katoh (2008) divide the approaches into different groups. Sometimes, the algo-rithms can be assigned to more than one group, therefore, a distinct separation from each other is nearly impossible. Additionally, many of these algorithms are variants. This means that the basic ideas are equal, but the realization of them differs.

2.4.1

Exact Approach

In principle, the optimal alignment for any sequences could be computed with dynamic programming. In reality, however, such an exact approach is not feasible regarding time or space for more complex problems (cf. Michael S. Waterman, 1995). As stated

already, the required computational time is O(2NLN). Therefore, these approaches

can only be used for small number of sequences. An exact computation of smaller problems is possible, but to align a biological and evolutionary meaningful alignment, longer sequences are necessary. Only if these preconditions are complied, real

(41)

sim-2.4 Approaches of MSA Algorithms ilarities or differences in sequence, structure or function can be derived (cf. Gusfield, 1997). Carrillo and Lipman (1988) described this problem and introduced the so called Carillo-Lipman-bound. That’s why heuristics are used, for example in the so-called progressive strategies, but these heuristic approaches can not guarantee to find an optimal solution (cf. Michael S. Waterman, 1995).

2.4.2

Progressive Approach

One of the first, oldest and most widely used methods for MSA is the so called pro-gressive approach. This basic concept was first introduced by Hogeweg and Hesper (1984) and refined by Feng and Doolittle (1987). It is an extension of the pairwise alignment algorithm and all progressive solutions have a few basic steps in common. The first step is to calculate pairwise sequence alignment scores between all the se-quences which will be saved in a distance matrix. From this point, a first phylogenetic tree (the so-called ”guide tree”) or phylogenetic star (the so-called ”guide star”) is cal-culated by cluster analysis (for example with the Neighbor-Joining-algorithm or UP-GMA). Depending on the chosen grouping method, it is referred to a star alignment or tree alignment. After receiving such a guide tree or star, the algorithm starts with the most similar sequence pair and as the names imply, those algorithms progressively align one sequence per step with the existing alignment. The new sequence, which is added to the alignment, is chosen by computing means out of the guide tree respec-tively guide star. The algorithms stop after adding the last sequence to the alignment (cf Thompson, Plewniak, and Poch, 1999a and Robert C. Edgar, 2004b).

An itemized overview of those algorithms could be described as following (Mount, 2004):

1. Compute the pairwise alignments for all sequences with dynamic programming (more exact) or heuristics (faster)

(42)

3. Use the guide tree or star to identify the next sequence (take the most closest sequence)

4. Align the new sequence to each of the previous sequences 5. Return to step 3 or stop

This one-by-one technique has the advantage of being very fast (even faster than the adaptation of pair-wise alignments to multiple sequences, which is not scalable for a higher number of sequences and becomes very slow), combined with reason-able sensitivity. Another advantage of such an approach is the (relative) simplicity of implementing such algorithms. And last but not least, gaps are kept within a limit, be-cause of the fact, that the first alignment starts with the most similar sequences. That is a positive and necessary fact, because gaps can not be vanished by this method (Robert C. Edgar, 2004b, cp). In general, the progressive alignments methods are compromises between needed time and accuracy (cf. Thompson et al., 1999a).

There are also many disadvantages of progressive algorithms. A major disadvan-tage of such algorithms is the reliance on a good alignment of the first two sequences. If there are errors in this first alignment, those errors are propagated throughout the rest of the MSA (cf. Thompson et al., 1999a). Complementary, if a gap is included, it will be penalized in step one and four, which results in an over-alignment (in other words: the output are groups without gaps and blocks with many gaps) (cf. Robert C. Edgar, 2004b). The next disadvantage is that these algorithms are greedy and will not guarantee an optimal solution to provide the most accurate alignments. Additionally, the whole alignment changes if the order of the sequences are changed. The results strongly depend on the initial order. Another major disadvantage is the fact, that the guided tree is only an approximation of the real phylogenetic tree, because this tree is often unknown (cf. Thompson et al., 1999a).

(43)

2.4 Approaches of MSA Algorithms Attempts to solve these disadvantages are semi-progressive approaches (like e.g. in in PSAlign (cf. Sze, Lu, & Yang, 2006)) or iterative algorithms. A few further exten-sions of the progressive approach like a weight function (e.g. in ClustalW) to counteract those problems exist (cf. Thompson, Higgins, & Gibson, 1994).

2.4.3

Iterative Approach

Barton and Sternberg (1987) presented the first iterative concept. The basic idea behind iterative algorithms is the assumption, that an already existing suboptimal so-lution can be modified in some way so that the resulting soso-lution is optimal. Iterative approaches compute an alignment with a progressive alignment strategy (which is a non optimal solution as described in the previous section). Afterwards, the alignment is modified until a solution converges and no further enhancement are possible. It corrects one of the disadvantages of progressive alignment strategies, more precisely the problem of initializing the alignment and the propagation of errors throughout the rest of the MSA (cf. Lassmann & Sonnhammer, 2005a).

The reiteration of the MSA starts with a pairwise realignment of sequences within subgroups and leads to a realignment of the subgroups. The subgroups are randomly selected or with further information, for example the sequence relations known from the guide tree. Basically, the iterative approach is an optimization method, which can be combined with other (machine learning) approaches, such as genetic algorithms and Hidden Markov Models. Furthermore, the iterative algorithms are often subdivided into deterministic (e.g. Round-Robin, or single-type-, double-type-, or tree-dependant-partitioning) or stochastic methods (e.g. simulated annealing, or evolutionary algo-rithms) (cf. Thompson et al., 1999a). The disadvantage of iterative approaches are computational time and additionally, paired with the required memory complexity. Fur-thermore, a disadvantage is inherited from optimization methods: the process can get trapped in local minima. And last but not least, the parallelization of iterative algorithms remains difficult (cf. Mount, 2004). For those reasons, other algorithms are considered, such as consistency-based or structure-based approaches.

(44)

2.4.4

Consistency-Based Approach

The goal of consistency based algorithms is to find the multiple alignment which coin-cide the most with all possible optimal pairwise alignments. This principle works well with biological observations as well. It is solvable with exact methods or heuristics. At

bottom, the consistency-based approach uses the law of implication (A → B ∧ B →

C ⇒ A → C) and turn it around.

For better comprehension an example is given: Three sequences, called x, y and

z are given. A pairwise alignment x − z and a further pairwise alignment z − y is

computed. A match in the position i of x and the position k of z occurs at the same

time as a match in the position k of z with the position j of y (xi = zk ∧ zk = yj). A

pairwise alignment with x and y, the position i in x has to match with the position j in y(xi = yj).

Consistency-based methods use this principle backwards. An alignment of two sequences x and y take care of the alignment with z (as indication). To guide the pairwise alignment between x and y in the right direction, similar to the single steps in

a progressive alignment. They adjust the score for a xi = yj and thus incorporate a

sequence information in the pairwise alignment. Consistency-based alignments follow the pattern prevention is better than cure (cf. C. B. Do, Mahabhashyam, Brudno, & Batzoglou, 2005).

The consistency-based approach also has a disadvantage. As long as all pairwise alignments are done the proper way, the theoretical concept works well. Figure 2.2 shows an example of a good alignment on the left site. But as soon as one of the pairwise alignments exhibits an inconsistency, the theoretical root idea does not work any more. Figure 2.2 shows this problem on the right side.

Kececioglu (1993) recognized this problem and therefore mapped the MSA prob-lem to a graph probprob-lem (which is unfortunately also NP-complete) and called it the ”Maximum Weight Trace Problem”. It is based on an alignment graph, where the nodes represent the symbols of sequences and the edges illustrates pairwise align-ments. These edges are weighted by quality. To find a fitting alignment, a subset of

(45)

2.4 Approaches of MSA Algorithms

Pairwise Alignment 1 (Seq1, Seq2)

Seq1 A

Seq2 B

Pairwise Alignment 2 (Seq1, Seq3)

Seq1 A

Seq3 D C

Pairwise Alignment 3 (Seq2, Seq3)

Seq2 B

Seq3 D C

Pairwise Alignment 1 (Seq1, Seq2)

Seq1 A

Seq2 B

Pairwise Alignment 2 (Seq1, Seq3)

Seq1 A

Seq3 D C

Pairwise Alignment 3 (Seq2, Seq3)

Seq2 B

Seq3 D C

Multiple Alignment Seq1, Seq2, Seq3

Seq1 A

Seq2 B

Seq3 D C

Multiple Alignment Seq1, Seq2, Seq3

Conflict with Pairwise Alignment 3!

Seq1 A

Seq2 B

Seq3 D C

Multiple Alignment Seq1, Seq2, Seq3

Conflict with Pairwise Alignment 2!

Seq1 A Seq2 B Seq3 D C

Fully Consistent

⇐⇒

More Reliable

Partly Consistent

⇐⇒

Less Reliable

Figure 2.2: Problems with inconsistencies in consistency-based algorithms (C ´edric Notredame, Higgins, & Heringa, 2000)

(46)

edges is necessary (with maximum weight), which is still a meaningful trace (which means it can be reproduced without conflicts).

2.4.5

Structure-Based Approach

Structures diverge slower than sequences (cf. Chothia & Lesk, 1986). This biological effect is used in structure-based approaches. Those alignments are usually specific to proteins, but RNA sequences can be computed with this approach as well. They enrich the sequences with information about the secondary and tertiary structure of proteins (or RNAs). With this structural information, these algorithms have a larger amount of position-specific information, which is used for computing specific positions in an alignment and which ends in a more improved result (cf. Simossis & Heringa, 2005). R. F. Smith and T. F. Smith (1992) was one of the first, who recognized this approach.

The inclusion of structural information is de facto the only similarity which all state-of-the-art algorithms have in common. All other aspects depend on the realizations and implementations. For example, the moment of accumulation differs a lot among the particular algorithms. Some programs add the information in a first step (for exam-ple 3D-Coffee), whereas others enrich the sequences after the first step of the align-ment (like CE) (cf. Shindyalov & Bourne, 1998). Some algorithms apply the structural information during the computation of the alignment (such as 3D-Coffee/Expresso or PRALINE, Simossis and Heringa, 2005, see sections 3.4.2 and 3.5), whereas other algorithms compute the alignment without this structural information and use the infor-mation afterwards to optimize the alignment (e.g. CE, Shindyalov and Bourne, 1998). Others again (like MAFFT, Katoh, Misawa, Kuma, and Miyata, 2002, see section 3.5) perform a hybrid approach, the additional information is not incorporated in the orig-inal sequences. Instead, enriched homologue sequences are added to the origorig-inal set, further alignment steps are done with the additional information (to order those sequences the right way) and at last, additional sequences are removed again and the alignment is computed in a final step (cf. Simossis & Heringa, 2005).

(47)

2.4 Approaches of MSA Algorithms In such approaches, the measure methods itself are different. They vary from the root mean square distance deviation (RMSD), modified RMSDs like iRMSD (intra-molecular RMSD (Armougom, Moretti, Keduas, & Notredame, 2006), like in the special modes 3D-Coffee and Expresso (Cedric Notredame, 2016b)) of T-Coffee, tRMSD (”a structure based clustering method using the iRMSD to drive the clustering” (Cedric Notredame, 2016b)), methods which use APDB (O’Sullivan et al., 2003), or a TM-score rotation matrix (used in TM-align, Zhang and Skolnick, 2005).

Last but not least, it should be mentioned, that structural approaches can be divided by the performed and computed alignment steps (horizontal-first methods, vertical-first methods, or consensus-first methods) (cf. Ma & Wang, 2014).

As it is obvious, this field of research is large, and that is why this subsection can not give an exhaustive overview. A really thoroughly and up-to-date compendium (and comparison) of structural alignment algorithms is given in Ma and Wang (2014), 22 structural aligners are listed and explained.

Alignments which are constructed with structural information have one advantage in common: They are able to detect more distant relationships between sequences, wheras standard procedures are unable to do so. ”As a result, the cases that benefit the most are those that evolution has changed so extensively (< 30% identity) that the homology (common ancestry) between them is almost undetectable when compared directly.” (Simossis & Heringa, 2005)

The biggest disadvantage of this approach is the necessity of an availability of the corresponding structural information (gained with X-ray crystallography, NMR spec-troscopy, or dual polarisation interferometry (cf. Cross et al., 2003)). The costs for extracting such structures are expensive, thus this disadvantage is amplified. For this reason, the structures are often not available and thus, the use of secondary or ter-tiary structure information is partially impossible (hence the complete algorithms). In addition, the structure-based alignment divides the scientific community: One the one side, those, who consider this approach as the non plus ultra, like Y. M. Huang and Bystroff (2006): ”So far, the most satisfying way of aligning remote homologues has

(48)

been to use structural information whenever possible”. On the other side, the opposite is more critical about this evolution, as the statement of Armougom, Moretti, Keduas, and Notredame (2006) shows: ”The use of structural information, however, carries its own peril, and while the sequence analysis community tends to consider structure based alignments as unambiguous and unquestionable gold standards, a closer look reveals a much less clear cut situation.”.

2.5

Known Issues of MSA

There are issues, which remain unsolved. Because of the fact that computing a MSA is NP-complete, heuristics are used. However, especially if the protein sequences are very long, the possibility to align complete protein sequences is often not given. Be-sides, those heuristics can not guarantee an optimal solution, and furthermore, the alignments are hard to evaluate in matters of quality. A possible improvement can be derived with the structure-based approach, but in contrast to sequences, which are cheap and common, structures are rare and expensive (C. Notredame, 2007). An-other problem is that ”multidomain protein families evolve not only through mutation of individual amino acids, but also through operations such as domain duplications and domain recombinations” (Doolittle, 1995). To date, there are only sporadic ap-proaches for alignments of rearranged sequences, e.g. MDAT (Kemena, Bitard-Feildel, & Bornberg-Bauer, 2015) for multi-domain proteins with shuffled domains. All these problems with Multiple Sequence Alignment illustrate that MSA can still be consid-ered as one of the most challenging problems in bioinformatics, even 40 years after Sankoff’s first identification.

(49)

3

MSA Algorithms

This chapter provides a survey of the original MSA algorithms ClustalΩ, ClustalW, MUSCLE, and T-Coffee, which were intended to be implemented in the R package. Further algorithms are also mentioned, but the focus will be on the integrated ones.

Attention is drawn to the fact that the presented algorithms in this chapters are pure multiple sequence alignment algorithms. Pre- or post-processing algorithms are not represented in this context, as well as additional structural sequence aligners like FUGUE (Shi, Blundell, & Mizuguchi, 2001) or DALI (Holm & Sander, 1995) are not represented either.

(50)

3.1

ClustalW

The ClustalW algorithm was first published by Thompson et al. (1994) as an enhance-ment of ClustalV and developed since then. The current version 2.1 was released on 2010-11-17 and is one of the most accepted and used MSA algorithms (cf. Larkin et al., 2007). The programming language used is C++, because the command-line ver-sion of ClustalW was extended by a GUI-based derivative. The whole project is now called Clustal2 or ClustalW2. The name of the graphical version is ClustalX (Higgins, 2014). It contains just as much functionalities as ClustalW. Therefore all addressed features of ClustalW in this chapter also apply to ClustalX.

The most imported improvement of ClustalW (in comparison to ClustalV) was the introduction of individual weights. The ’W’ in ClustalW can be seen as increment of ’V’ as well as an abbreviation for ’weights’ (cf. Larkin et al., 2007). These ”weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones” (Thompson et al., 1994). For a better understanding of this approach, a few examples are given of how this weighted scheme is realized. All examples are taken from Thompson et al., 1994:

• It has been observed, that gaps tend not to be closer than roughly eight residues on average. Knowing this fact, the gap opening penalty in ClustalW increase within eight residues of existing gaps.

• Loops or random coil regions are usually indicated by short stretches of hy-drophilic residues (e.g. 5 or more). That is why the gap opening penalties are locally reduced in these stretches.

• In addition, the gap penalties in existing gaps are lowered as well.

These example rules are not exhaustive, but they roughly illustrate, how weighting works. The complete summary of the hierarchical gap penalty modification rules are listed in Thompson et al. (1994).

Referenzen

ÄHNLICHE DOKUMENTE

– Semantische Elemente (header, section, summary, details, etc.). – Neue Formulare (number, range,

Beim Anklicken soll der Text der Eingabe ausgegeben werden. Falls leer, .Equals, soll eine Meldung erscheinen

■ Show(Text, Caption, MessageBoxButtons, MessageBoxIcon, MessageBoxDefaultButton, MessageBoxOptions,

Excel durch das Binary Interchange FileFormat (BIFF). • Das BIFF ist ein Format, welches Excel als

Aufrufprogramm erstellen: gemeinsamer Ordner Projektname: testdll.. DLL Projekt erstellen:

■ Xubuntu - mit dem schlanken Xfce als grafischer Oberfläche - besonders für ältere Rechner geeignet. ■ Edubuntu - eine speziell angepasste Version für Schulen, mehr im

ESC ddp Übernimmt eine Textzeile in den Standardpuffer und löscht diese und fügt diese nach der nächsten Zeile wieder ein. ESC p Wenn der Standardpuffer eine Zeile enthält, wird

• Die zweite Bedingung if (instance == null) würde damit zu true evaluieren, ohne dass bisher der Konstruktor (Zeile 6) aufgerufen worden wäre.. Das double-checked locking ist