(3.1) In SILVA, BLAST (31) is used to compare each sequence to a database of
3.1.3 Design and Implementation
Tools and Pipeline
The SILVA sources are divided into three libraries: the database abstraction library, the IO / tool library, and the aligner library. The database abstraction library provides an in-memory representation of the data and an interface class that defines the infrastructure for storing the data persistently on disk and loading it from disk. The IO / tool library implements this interface and uses the MySQL relational database management system to store the data. It also implements the importers, the ARB exporter, the sequence check, and the chimera check modules. The aligner library provides the implementation of the aligner. It overlaps with the two other libraries in certain parts because it is designed to also work independently of SILVA.
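The split between an abstract persistence interface and a concrete backend can be illustrated with a minimal sketch. All names here (`Storage`, `DictStorage`, the reduced `SequenceEntry`, the accession `X00001`) are hypothetical and chosen for illustration only; the dictionary backend merely stands in for the MySQL implementation and is not the SILVA code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SequenceEntry:
    # Hypothetical, heavily reduced stand-in for the in-memory representation.
    accession: str
    version: int


class Storage(ABC):
    """Hypothetical persistence interface of the database abstraction library."""

    @abstractmethod
    def store(self, entry: SequenceEntry) -> None:
        """Persist one entry."""

    @abstractmethod
    def load(self, accession: str) -> SequenceEntry:
        """Load one entry by its primary accession number."""


class DictStorage(Storage):
    """Toy backend; an assumption standing in for the MySQL-based implementation."""

    def __init__(self) -> None:
        self._rows = {}

    def store(self, entry: SequenceEntry) -> None:
        # Store plain values, not the object, to mimic a round trip to disk.
        self._rows[entry.accession] = (entry.accession, entry.version)

    def load(self, accession: str) -> SequenceEntry:
        acc, version = self._rows[accession]
        return SequenceEntry(acc, version)


storage = DictStorage()
storage.store(SequenceEntry("X00001", 2))
restored = storage.load("X00001")
```

Code written against `Storage` does not need to know which backend is in use, which is presumably what allows the aligner library to work independently of the rest of SILVA.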
Database / Data model
The database as well as its in-memory representation is closely modeled on the EMBL file format⁵. The central class and table is the SequenceEntry. It holds most of the meta data about an entry that is found in the header section of the EMBL file format: its primary INSDC accession number, a list of secondary accession numbers, the sequence version as specified in the entry, and the dates on which the entry was submitted, imported into EMBL, and last modified. Additionally, selected feature qualifiers from the source feature of the feature table section of an EMBL entry are imported, as well as meta data provided by third parties. See Table B.2 in Appendix B for a complete list of data imported into SILVA databases. Publications, which are also part of the header, are represented by their own table and class (Publication). rRNA sequences described in the feature table section of the EMBL format are stored in the Region table / class. A region may belong to more than one multiple sequence alignment defined in table / class MSA. The alignment of the same sequence may differ between different MSAs; therefore, the table / class AlignedRegion was introduced to hold the aligned sequence, information about the alignment reported by the aligner, and a link to the MSA to which a region belongs. External references found in the header, publications, and regions are stored in the Reference table / class.

⁵ http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

Figure 3.3: The design of the SILVA database. The table SequenceEntry (yellow) is the central data object which connects the taxonomic information (blue), sequence information (green), and meta data (purple) stored in the SILVA databases. The meta data tables are dynamically created when the associated information is imported and may not exist in all databases. The information contained in these tables is also added to the content of the associated fields in the SequenceEntry table. Therefore, these tables are only used to document the changes made to entries in the SequenceEntry table. Tables depicted in gray are organisational tables. Their names and the names of the meta data tables are in lower case letters, to further indicate their ‘temporary’ nature.
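The relationships between the classes described above can be sketched as follows. This is an illustrative reduction, not the SILVA class definitions: the class names follow the text, but every field name (`start`, `stop`, `aligner_report`, and so on), the helper `alignments_of`, and the example MSA names and values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MSA:
    name: str


@dataclass
class Region:
    # rRNA region from the feature table; start/stop are assumed field names.
    start: int
    stop: int


@dataclass
class AlignedRegion:
    # Links one region to one MSA and holds the alignment made for that MSA.
    region: Region
    msa: MSA
    aligned_sequence: str
    aligner_report: str = ""


@dataclass
class SequenceEntry:
    primary_accession: str
    sequence_version: int
    secondary_accessions: List[str] = field(default_factory=list)
    regions: List[Region] = field(default_factory=list)


def alignments_of(region: Region, aligned: List[AlignedRegion]) -> Dict[str, str]:
    """Return the aligned sequence of one region for every MSA it belongs to."""
    return {a.msa.name: a.aligned_sequence for a in aligned if a.region is region}


region = Region(start=1, stop=5)
entry = SequenceEntry("X00001", 1, regions=[region])
aligned = [
    AlignedRegion(region, MSA("run-a"), "AC-GU"),
    AlignedRegion(region, MSA("run-b"), "ACG-U"),
]
by_msa = alignments_of(region, aligned)
```

Keeping the aligned sequence in `AlignedRegion` rather than in `Region` is what lets the same region carry a different alignment in each MSA.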
The one-to-many relation between Region and MSA was chosen to make it easy to compare the alignments created by multiple aligner runs with different parameter sets. A second reason was to be able to store the alignments curated by different experts. It was initially planned to store the SEED, which is used to align new sequences, in the SILVA databases and to provide an interface to extend and enhance the alignment of the SEED. The one-to-many relation was changed into a one-to-one relation because this interface has never been realised and the idea of storing the SEED in the database has been dropped. In the current SILVA pipeline, the different MSAs are used to differentiate between the possible states of a region in the database.
When a sequence entry is first imported, all its regions are assigned to the MSA imported. The quality check module then assigns the regions to different MSAs based on their sequence quality: ambiguous, bad length, homopolymer, or vector. If a region is eligible for alignment, it is assigned to the MSA unaligned, or to the MSA unaligned rnammer if the region was predicted by RNAmmer (69). The aligner will assign a region to the MSA auto-aligned if the region could be aligned. Otherwise, it is assigned to auto-aligned-rejected. If a sequence could be aligned but the number of aligned bases is below the chosen threshold, it is still assigned to the MSA auto-aligned. Those sequences are excluded by the exporter when the data is exported into the different formats. Further MSAs used to mark ‘unwanted’ sequences are blacklist, ignore, and overlaps. Regions are assigned to the MSA blacklist based on a list of primary accession numbers manually curated by Dr. Wolfgang Ludwig and Prof. Dr. Frank Oliver Glöckner. It also contains accession numbers provided by EMBL. Sequences predicted by RNAmmer that overlap with sequences contained in EMBL are assigned to the MSA overlaps and are, therefore, ignored.
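The state assignment described above can be summarised in a short sketch. The MSA names are taken from the text; everything else is an assumption made for illustration: the `RegionFlags` class, its boolean fields, the two function names, and in particular the order in which the quality checks are evaluated. The actual checks and thresholds are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class RegionFlags:
    # Hypothetical outcomes of the quality check module for one region.
    too_ambiguous: bool = False
    bad_length: bool = False
    has_homopolymer: bool = False
    has_vector: bool = False
    predicted_by_rnammer: bool = False


def msa_after_quality_check(flags: RegionFlags) -> str:
    """Map quality check results to an MSA name (check order is an assumption)."""
    if flags.too_ambiguous:
        return "ambiguous"
    if flags.bad_length:
        return "bad length"
    if flags.has_homopolymer:
        return "homopolymer"
    if flags.has_vector:
        return "vector"
    # The region is eligible for alignment.
    return "unaligned rnammer" if flags.predicted_by_rnammer else "unaligned"


def msa_after_alignment(could_be_aligned: bool) -> str:
    """Regions below the aligned-bases threshold still count as auto-aligned;
    according to the text, the exporter filters them out later."""
    return "auto-aligned" if could_be_aligned else "auto-aligned-rejected"


state = msa_after_quality_check(RegionFlags(predicted_by_rnammer=True))
```

Modelling each state as membership in a named MSA means that a region's processing status can be queried with the same mechanism that is used to look up alignments.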
Taxonomies associated with each entry are stored in table taxonomy. A mapping between the taxonomic paths stored in table taxonomy and the entries stored in table SequenceEntry is provided in table taxmap. The concept behind these tables is an adapted version of the path enumeration model described by Celko in (72). In addition to the taxonomic information, each entry in tables taxonomy and taxmap also holds the name of the taxonomy. Therefore, multiple taxonomies can be stored in the same table. Currently, each entry in the table SequenceEntry is associated with the taxonomies of EMBL, Greengenes, and RDP.
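In a path enumeration model, every node stores its full path from the root, so that all entries below a given taxon can be found with a simple prefix match. The following sketch shows the idea using SQLite from the Python standard library instead of MySQL. The table names follow the text, but the column names, the `;` separator, the example paths, the placeholder accessions, and the function `entries_below` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxonomy (taxonomy_name TEXT, path TEXT);
CREATE TABLE taxmap   (taxonomy_name TEXT, path TEXT, accession TEXT);
""")

# The taxonomy name is stored in every row, so several taxonomies share one table.
conn.executemany("INSERT INTO taxonomy VALUES (?, ?)", [
    ("EMBL", "Bacteria;"),
    ("EMBL", "Bacteria;Proteobacteria;"),
    ("EMBL", "Bacteria;Firmicutes;"),
    ("EMBL", "Archaea;"),
    ("RDP", "Bacteria;"),
    ("RDP", "Bacteria;Proteobacteria;"),
])
conn.executemany("INSERT INTO taxmap VALUES (?, ?, ?)", [
    ("EMBL", "Bacteria;Proteobacteria;", "ACC1"),
    ("EMBL", "Bacteria;Firmicutes;", "ACC2"),
    ("EMBL", "Archaea;", "ACC3"),
    ("RDP", "Bacteria;Proteobacteria;", "ACC1"),
])


def entries_below(conn, taxonomy_name, path_prefix):
    """Return the accessions of all entries at or below a path in one taxonomy."""
    rows = conn.execute(
        "SELECT accession FROM taxmap "
        "WHERE taxonomy_name = ? AND path LIKE ? || '%' "
        "ORDER BY accession",
        (taxonomy_name, path_prefix),
    )
    return [row[0] for row in rows]


bacteria_embl = entries_below(conn, "EMBL", "Bacteria;")
```

The subtree query needs neither recursion nor self-joins, which suits a read-heavy workload; the cost is that renaming or moving a taxon requires rewriting every path beneath it.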
The design of the database is depicted in Figure 3.3.
Website
The web site is implemented using the programming languages HTML, JavaScript, and PHP. It uses the TYPO3 content management system⁶. A content management system allows content providers to easily modify web pages without needing to know the details of web programming. For programmers who work on the server side of a web site, it offers a framework for web site development (TypoScript). The taxonomy browser, the search page, the cart, the list, and parts of the download page are implemented using this framework.
The web site uses a denormalised version of the SILVA database and merges information from the meta data tables with fields in table SequenceEntry that are not used for querying. It only contains sequences that were automatically aligned and meet the quality standards for the Parc databases. Its content is identical to that of the Parc databases. Therefore, the distinction between regions and aligned regions is not necessary and the two tables are merged. The table MSA is also no longer needed and has been dropped. All meta data tables are currently purged from the database. These modifications to the database design were made to improve performance for a read-only query pattern.
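The merging of the region and aligned region tables can be illustrated with a small sketch, again using SQLite from the Python standard library rather than MySQL. All column names, the table name `web_region`, and the sample rows are hypothetical; the sketch only filters on the MSA name and leaves out the additional Parc quality criteria.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Region (id INTEGER PRIMARY KEY, accession TEXT, start INTEGER, stop INTEGER);
CREATE TABLE MSA (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE AlignedRegion (region_id INTEGER, msa_id INTEGER, aligned_sequence TEXT);

INSERT INTO Region VALUES (1, 'ACC1', 1, 1500), (2, 'ACC2', 10, 900);
INSERT INTO MSA VALUES (1, 'auto-aligned'), (2, 'vector');
INSERT INTO AlignedRegion VALUES (1, 1, 'AC-GU'), (2, 2, '');

-- Denormalise: one flat table holding only the automatically aligned regions.
CREATE TABLE web_region AS
  SELECT r.accession, r.start, r.stop, a.aligned_sequence
  FROM Region r
  JOIN AlignedRegion a ON a.region_id = r.id
  JOIN MSA m ON m.id = a.msa_id
  WHERE m.name = 'auto-aligned';

-- The normalised tables are no longer needed in the read-only copy.
DROP TABLE AlignedRegion;
DROP TABLE MSA;
DROP TABLE Region;
""")

web_rows = conn.execute(
    "SELECT accession, aligned_sequence FROM web_region"
).fetchall()
remaining_tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Queries against the flat table need no joins, which is the usual trade-off made by denormalisation: faster reads in exchange for a schema that is awkward to update.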
Programming Languages & Build Dependencies
The SILVA build system is based on the GNU Autotools collection: Autoconf⁷, Automake⁸, and libtool⁹. Hence, it follows the classical ./configure && make && make install approach that numerous open source UNIX projects use. The tool binaries are imple-
⁶ http://www.typo3.org
⁷ http://www.gnu.org/software/autoconf/
⁸ http://www.gnu.org/software/automake/
⁹ http://www.gnu.org/software/libtool/