
NM_mature_precursor table, the mature sequences are connected to the table_miRNA_precursor. This N:M table accounts for the fact that a mature microRNA can occur in more than one precursor, and that one precursor naturally has more than one mature sequence. Directly connected to the precursor table, I placed the table_species, which holds the 3-letter code and the complete name of each species.
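The many-to-many relationship described above can be sketched with an in-memory SQLite database. This is only an illustration: the table names come from the text, whereas all column names, the placeholder species code `xxx` and the example rows are assumptions and do not reflect the actual schema.

```python
import sqlite3

# Table names follow the text; every column name below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_species (
    species_code TEXT PRIMARY KEY,      -- 3-letter code
    species_name TEXT NOT NULL          -- complete name
);
CREATE TABLE table_miRNA_precursor (
    precursor_id INTEGER PRIMARY KEY,
    species_code TEXT NOT NULL REFERENCES table_species(species_code),
    name         TEXT NOT NULL
);
CREATE TABLE table_miRNA (
    mirna_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Junction table: one row per (mature, precursor) pair.
CREATE TABLE NM_mature_precursor (
    mirna_id     INTEGER NOT NULL REFERENCES table_miRNA(mirna_id),
    precursor_id INTEGER NOT NULL REFERENCES table_miRNA_precursor(precursor_id),
    PRIMARY KEY (mirna_id, precursor_id)
);
""")

# Made-up example rows.
conn.execute("INSERT INTO table_species VALUES ('xxx', 'Example species')")
conn.executemany(
    "INSERT INTO table_miRNA_precursor VALUES (?, 'xxx', ?)",
    [(1, "precursor-1"), (2, "precursor-2")],
)
conn.executemany(
    "INSERT INTO table_miRNA VALUES (?, ?)",
    [(1, "mature-1"), (2, "mature-2"), (3, "mature-3")],
)
# mature-1 occurs in both precursors; each precursor has two matures.
conn.executemany(
    "INSERT INTO NM_mature_precursor VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (3, 2)],
)


def precursors_of(mirna_id):
    """Return the IDs of all precursors that contain the given mature miRNA."""
    rows = conn.execute(
        "SELECT precursor_id FROM NM_mature_precursor "
        "WHERE mirna_id = ? ORDER BY precursor_id",
        (mirna_id,),
    )
    return [row[0] for row in rows]


def matures_of(precursor_id):
    """Return the IDs of all mature miRNAs found in the given precursor."""
    rows = conn.execute(
        "SELECT mirna_id FROM NM_mature_precursor "
        "WHERE precursor_id = ? ORDER BY mirna_id",
        (precursor_id,),
    )
    return [row[0] for row in rows]
```

The composite primary key on the junction table prevents the same mature/precursor pair from being stored twice, while both directions of the relationship remain queryable.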

The table_miRNA_expression is attached to the table_miRNA. It contains the previously computed expression value and is directly connected to the table_condition. This enables the storage of several expression values of the same microRNA for different conditions. Similarly, the table_isomiR is connected to the table_miRNA and has an additional table_isomiR_expression that holds the expression values of the microRNA isoforms and is also connected to the table_condition. This again enables the storage of microRNA isoforms in different conditions. Also connected to the table_miRNA is the table_targetprediction. This table contains the information derived from the miRanda target prediction run in the microPIECE pipeline. It is further connected to the table_clip, which holds the information from the CLIP transfer. The table_targetprediction is also connected to the table_mRNA, which holds the mRNA IDs from NCBI and, more importantly, the coding sequence start and stop positions as well as the strand information. This also allows an SQL query to determine whether the miRNA binds in the 3' UTR, in the 5' UTR or in the coding sequence. Through the NM_mRNA_annotation table, the table_mRNA is connected to table_annotation and table_annotation_type. Those tables can hold further information on the mRNA, like GO classes or annotations from other sources.
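The kind of query mentioned above, which classifies a predicted target site as lying in the 5' UTR, the coding sequence or the 3' UTR, might look as follows. It is a minimal sketch and rests on several assumptions: the column names are hypothetical, all positions are taken to be 1-based, inclusive and relative to the transcript in 5' to 3' orientation, and a site that overlaps the coding sequence at all is counted as a coding sequence hit.

```python
import sqlite3

# Hypothetical, reduced versions of the two tables involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_mRNA (
    mrna_id   INTEGER PRIMARY KEY,
    cds_start INTEGER NOT NULL,   -- first CDS base, transcript-relative
    cds_stop  INTEGER NOT NULL    -- last CDS base, transcript-relative
);
CREATE TABLE table_targetprediction (
    target_id    INTEGER PRIMARY KEY,
    mirna_id     INTEGER NOT NULL,
    mrna_id      INTEGER NOT NULL REFERENCES table_mRNA(mrna_id),
    target_start INTEGER NOT NULL,
    target_stop  INTEGER NOT NULL
);
""")

# One made-up mRNA with a CDS from position 101 to 400 and three target sites.
conn.execute("INSERT INTO table_mRNA VALUES (1, 101, 400)")
conn.executemany(
    "INSERT INTO table_targetprediction VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 1, 20, 41),    # entirely upstream of the CDS
        (2, 10, 1, 150, 171),  # inside the CDS
        (3, 11, 1, 450, 471),  # entirely downstream of the CDS
    ],
)

REGION_QUERY = """
SELECT t.target_id,
       CASE
           WHEN t.target_stop  < m.cds_start THEN 'five_prime_UTR'
           WHEN t.target_start > m.cds_stop  THEN 'three_prime_UTR'
           ELSE 'CDS'
       END AS region
FROM table_targetprediction AS t
JOIN table_mRNA AS m ON m.mrna_id = t.mrna_id
ORDER BY t.target_id
"""

regions = conn.execute(REGION_QUERY).fetchall()
```

Sites that span a UTR/CDS boundary fall into the `ELSE` branch here; a stricter query could report them as a separate category.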

Figure 17: Database schema. The central table is the table_miRNA, shown in dark green. It is connected mainly to other tables that hold information about the miRNAs derived from the microPIECE pipeline, shown in light green. Orange indicates tables that add further information, like species, further mRNA annotations or GO classes. Blue marks the mRNA information and yellow the information from the CLIP experiment. The tables with dotted borders (table_mRNA_expression, table_genome, table_chromosome) are included, but are not yet officially considered in the microPIECE pipeline.

The database is filled by a custom Perl script (see supplemental material chapter 13.7 for pseudocode) that takes the data from the microPIECE pipeline and further sources. The script transforms this data into SQL statements and produces a final output that can be pushed into the previously described database skeleton. It takes a config file as input, which includes the species name and its 3-letter code, as well as the genome name and download source. The config file also lists the microRNA arm types and the conditions. From this, the SQL statements for table_genome, table_species, table_miRNA_type and table_condition are created. With table_genome, versioning could be established if an analysis is run again on another genome version or source. The table_chromosome stores the chromosome names and lengths and could be used for further studies on miRNA location statistics.
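The step from config file to SQL statements can be illustrated with a short sketch. It is written in Python rather than Perl and is not the original script: the config layout, the column names and the example values are all assumptions made for illustration.

```python
def sql_quote(value):
    """Quote a value as an SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def insert_statement(table, columns, values):
    """Build one INSERT statement for the given table."""
    column_list = ", ".join(columns)
    value_list = ", ".join(sql_quote(v) for v in values)
    return f"INSERT INTO {table} ({column_list}) VALUES ({value_list});"


def statements_from_config(config):
    """Create the statements for the four tables that depend only on the config."""
    statements = [
        insert_statement(
            "table_species",
            ["species_code", "species_name"],
            [config["species_code"], config["species_name"]],
        ),
        insert_statement(
            "table_genome",
            ["genome_name", "genome_source"],
            [config["genome_name"], config["genome_source"]],
        ),
    ]
    for arm_type in config["mirna_types"]:
        statements.append(
            insert_statement("table_miRNA_type", ["type_name"], [arm_type])
        )
    for condition in config["conditions"]:
        statements.append(
            insert_statement("table_condition", ["condition_name"], [condition])
        )
    return statements


# Hypothetical config content; the real file format is not shown in the text.
example_config = {
    "species_code": "xxx",
    "species_name": "Example species",
    "genome_name": "example_genome_v1",
    "genome_source": "example download source",
    "mirna_types": ["5p", "3p"],
    "conditions": ["condition_A", "condition_B"],
}

statements = statements_from_config(example_config)
```

Generating plain statements keeps the output reviewable before it is loaded. If the values were written to a live database connection instead, parameterized queries would be the safer choice over manual quoting.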

The script also parses the genome file to generate tuples of chromosome name and chromosome length and derives the foreign key from the genome name in table_genome. The next part creates the SQL statements for the microRNA tables and their connecting NM_mature_precursor table. First, the table_miRNA_precursor is filled, taking the species 3-letter code from table_species as foreign key. Then the table_miRNA statements are created by filling the sequence and the miRBase.org information of each microRNA into the SQL statements. The microRNA type IDs are derived from table_miRNA_type. For the NM_mature_precursor table, the IDs of table_miRNA and table_miRNA_precursor are assigned pairwise as foreign keys. The table_miRNA_position statements are created by parsing the position file and deriving the genome ID from table_genome as foreign key. The SQL statements for table_miRNA_expression are generated by parsing the expression file and deriving the miRNA ID from table_miRNA and the condition ID from table_condition as foreign keys. In the case of table_isomiR and table_isomiR_expression, the microRNA isoforms file is parsed; the microRNA ID from table_miRNA is used as foreign key in table_isomiR, whereas the condition ID from table_condition is used as foreign key in table_isomiR_expression.
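The genome parsing step at the start of this paragraph can be sketched as follows. The sketch assumes that the genome file is in FASTA format and that the chromosome name is the first whitespace-delimited token of each header line; neither detail is stated in the text, and the function name is hypothetical.

```python
import io


def chromosome_lengths(handle):
    """Return (name, length) tuples for every record of a FASTA-formatted handle."""
    lengths = []
    name = None
    length = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                lengths.append((name, length))
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            length = 0
        else:
            length += len(line)
    # Close the last record of the file.
    if name is not None:
        lengths.append((name, length))
    return lengths


# Tiny made-up genome with two sequences.
example_fasta = io.StringIO(">chr1 example description\nACGT\nAC\n>chr2\nGGG\n")
chromosomes = chromosome_lengths(example_fasta)
```

Each tuple can then be combined with the genome ID looked up in table_genome to form one table_chromosome row. With a real file, the handle would come from `open(path)` instead of `io.StringIO`.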

The table_homologs is filled with the information from the homologs output file. It takes the miRNA ID from table_miRNA as foreign key. The table_mRNA data is taken from the .gff file; the relative coding sequence start and stop positions are calculated from the parsed exon-CDS relationship. The target prediction file is parsed and, for the table_targetprediction, the miRNA and mRNA IDs are taken from table_miRNA and table_mRNA as foreign keys. The table_clip foreign key of table_targetprediction is created dynamically during statement creation. The comma-separated annotation file (protein ID, annotation ID, annotation detail, annotation source, annotation source ID), prepared beforehand by the user with the desired mRNA annotations, is parsed at the end and converted into SQL statements. The NM_mRNA_annotation table is filled with foreign keys from table_mRNA and table_annotation.
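One way to derive relative coding sequence positions from the exon-CDS relationship is shown below. This is a sketch under stated assumptions and not necessarily how the original script proceeds: genomic intervals are 1-based and inclusive as in GFF, exons do not overlap, both CDS ends lie inside exons, and the result is 1-based relative to the spliced transcript in 5' to 3' orientation. The function names are hypothetical.

```python
def transcript_position(exons, strand, genomic_pos):
    """Convert a genomic position into a 1-based position on the spliced transcript."""
    # Walk the exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of transcript bases in the exons already passed
    for start, end in ordered:
        if start <= genomic_pos <= end:
            if strand == "+":
                return offset + genomic_pos - start + 1
            return offset + end - genomic_pos + 1
        offset += end - start + 1
    raise ValueError(f"position {genomic_pos} is not inside any exon")


def relative_cds(exons, cds_segments, strand):
    """Return (cds_start, cds_stop) relative to the spliced transcript."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    lowest = min(start for start, _ in cds_segments)
    highest = max(end for _, end in cds_segments)
    # On the minus strand the CDS begins at its highest genomic coordinate.
    if strand == "+":
        first, last = lowest, highest
    else:
        first, last = highest, lowest
    return (
        transcript_position(exons, strand, first),
        transcript_position(exons, strand, last),
    )


# Made-up transcript with two exons of 100 bases each.
example_exons = [(100, 199), (300, 399)]
plus_result = relative_cds(example_exons, [(150, 199), (300, 349)], "+")
minus_result = relative_cds(example_exons, [(120, 199), (300, 389)], "-")
```

In the plus-strand example the CDS covers 50 bases in each exon, so the relative interval spans 100 bases. In the minus-strand example it covers 90 and 80 bases, giving a span of 170 bases.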

Building on the previously described scripted workflow, the script collection was transferred into the actual microPIECE pipeline, with software testing, example datasets and a dockerized environment, in line with state-of-the-art software publishing. An overview is given in my following manuscript, published in the Journal of Open Source Software (JOSS). It is followed by a detailed description of the individual steps of the current pipeline version. The pseudocode scripts are available in supplemental material 13.8.

Daniel Amsel, André Billion, Andreas Vilcinskas and Frank Förster.

“microPIECE – microRNA pipeline enhanced by CLIP experiments.”

JOSS - Journal of Open Source Software, 3(24), 616.


For this publication, I created the basic scripts as a scripted workflow and participated in rewriting the code into the actual pipeline. Furthermore, I created test cases for program modules and wrote the code documentation and the GitHub repository information. I also wrote the manuscript and drew the figure.

The pipeline is available via GitHub: