Discussion - Methods for explaining biological systems and high-throughput data

5.5 Discussion

from the simulation and it could be possible to also model a time-dependent mechanism to explain the effects observed at 42 °C.


applied to              training set                        test set
                        37 °C             42 °C             37 °C       42 °C
fit data                t10   t30   all   t10   t30   all   t30   t30   t30   t30

equilibrium
  normal                0.82  0.80  0.67  0.58  0.49  0.34  NA    NA    NA    NA
  stalled               0.83  0.83  0.69  0.60  0.53  0.36  NA    NA    NA    NA
  fract. soluble        0.78  0.85  0.77  0.75  0.76  0.69  NA    NA    NA    NA
  fract. total          0.78  0.83  0.73  0.64  0.71  0.59  NA    NA    NA    NA
  fract. total+stalled  0.79  0.84  0.75  0.66  0.65  0.55  NA    NA    NA    NA

degr. groups
  normal                0.88  0.86  0.76  0.69  0.69  0.54  0.79  0.82  0.73  0.63
  stalled               0.88  0.89  0.79  0.68  0.70  0.53  0.80  0.82  0.71  0.64
  fract. soluble        0.85  0.85  0.80  0.76  0.78  0.72  NA    0.83  NA    0.64
  fract. total          0.83  0.87  0.79  0.67  0.73  0.61  NA    0.81  NA    0.63
  fract. total+stalled  0.84  0.86  0.78  0.66  0.72  0.60  NA    0.81  NA    0.64

degrad. fit
  normal                0.83  0.99  0.81  0.74  0.98  0.67  0.77  0.82  0.72  0.65
  stalled               0.82  0.99  0.80  0.73  0.98  0.67  0.77  0.81  0.73  0.66
  fract. soluble        0.89  0.98  0.89  0.89  0.97  0.86  NA    0.73  NA    0.53
  fract. total          0.88  0.99  0.87  0.82  0.97  0.78  NA    0.71  NA    0.50
  fract. total+stalled  0.89  0.99  0.88  0.83  0.97  0.79  NA    0.71  NA    0.51

synthesis fit
  normal                0.85  1.00  0.84  0.72  0.91  0.70  0.79  0.81  0.72  0.62
  stalled               0.84  1.00  0.84  0.70  0.91  0.68  0.78  0.81  0.71  0.62
  fract. soluble        0.94  0.99  0.94  0.94  0.99  0.94  NA    0.62  NA    0.47
  fract. total          0.91  1.00  0.91  0.86  1.00  0.86  NA    0.68  NA    0.45
  fract. total+stalled  0.91  0.99  0.91  0.85  1.00  0.85  NA    0.67  NA    0.47

Table 5.1: Overview of simulation results for both 37 °C and 42 °C. Each cell gives the fraction of proteins whose predicted interval given noise overlaps with the observed interval of the replicates for the two time points and those proteins with overlapping intervals for both time points. The last four columns show the results when the parameters that were fitted for the corresponding dataset are applied to the two independent test sets. As the fractionated measurements are one of the two test sets these results are omitted. Note that different subsets of proteins are available for the different proteome datasets (normal/stalled and fractionated) which makes them incomparable and that the fractionated measurements are only available for 30 min and had to be inferred for 10 min.
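The measure reported in each cell of Table 5.1 can be illustrated with a minimal Python sketch. It assumes that predicted and observed intervals are available as (lower, upper) pairs per protein and time point. The function names, the data layout and the example values are hypothetical and are not taken from the simulation code.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]          # (lower, upper), assumed closed
Profile = Dict[str, Interval]           # time point -> interval

TIME_POINTS = ("t10", "t30")            # the two time points of Table 5.1


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_fractions(predicted: Dict[str, Profile],
                      observed: Dict[str, Profile]) -> Dict[str, float]:
    """Fraction of proteins with overlapping intervals per time point,
    plus the fraction that overlaps at both time points ("all")."""
    proteins = sorted(set(predicted) & set(observed))
    if not proteins:
        raise ValueError("no proteins shared between prediction and observation")
    hits = {tp: 0 for tp in TIME_POINTS}
    hits["all"] = 0
    for protein in proteins:
        per_tp = [overlaps(predicted[protein][tp], observed[protein][tp])
                  for tp in TIME_POINTS]
        for tp, ok in zip(TIME_POINTS, per_tp):
            hits[tp] += ok
        hits["all"] += all(per_tp)
    return {key: count / len(proteins) for key, count in hits.items()}


# Invented example values for two hypothetical proteins.
predicted = {"P1": {"t10": (0.9, 1.1), "t30": (1.0, 1.4)},
             "P2": {"t10": (0.5, 0.7), "t30": (2.0, 2.5)}}
observed = {"P1": {"t10": (1.0, 1.2), "t30": (1.5, 1.8)},
            "P2": {"t10": (0.6, 0.8), "t30": (2.4, 3.0)}}

print(overlap_fractions(predicted, observed))
```

In this toy example both proteins overlap at t10, but only P2 overlaps at t30, so the "all" fraction can never exceed the smaller of the two per-time-point fractions. The same pattern is visible in the table.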

Chapter 6 YESdb: Interactive Integrated Analysis of Stress Datasets

Motivation

In the last chapter we described a very specific approach to integrated analysis, which models one part of the central dogma of biology in detail using a specific set of measurements. Here, a much more general approach is used to integrate multiple datasets, so that various research questions can be analyzed using different kinds of measurements.

We describe a Petri-net based workflow system that uses fundamental operations to define, combine and characterize sets of interesting genes from multiple datasets. This makes it possible to tackle various research questions, such as how the responses to different stimuli differ or which technical biases exist on different experimental platforms.
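The three kinds of operations (define, combine, characterize) can be illustrated with plain Python sets. This is only a sketch of the idea and does not model the Petri net or the actual YESdb implementation. The threshold, the function names and the fold-change values are invented for illustration.

```python
from typing import Dict, Set


def define_set(log2fc: Dict[str, float], threshold: float = 1.0) -> Set[str]:
    """Define: genes whose log2 fold change is at least `threshold`."""
    return {gene for gene, fc in log2fc.items() if fc >= threshold}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Characterize: similarity of two gene sets (0 = disjoint, 1 = equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented log2 fold changes for two hypothetical differential datasets.
heat = {"HSP104": 3.2, "SSA4": 2.5, "CTT1": 1.4, "ACT1": 0.1}
osmotic = {"GPD1": 2.8, "CTT1": 1.9, "HSP104": 0.7, "ACT1": -0.2}

up_heat = define_set(heat)
up_osmotic = define_set(osmotic)

# Combine: standard set operations on the defined sets.
shared = up_heat & up_osmotic        # up-regulated under both stimuli
heat_only = up_heat - up_osmotic     # up-regulated under heat only

print(sorted(shared), sorted(heat_only), jaccard(up_heat, up_osmotic))
```

Comparing two stimuli then reduces to an intersection and a difference of the defined sets, while a measure such as the Jaccard index gives one possible way to characterize how similar the two responses are.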

While several databases such as GEO, SRA or PRIDE contain large collections of publicly available high-throughput datasets, the direct use of such integrative approaches on these large-scale databases is hindered by the need to find and preprocess the datasets relevant to the given research question. These huge repositories often provide both raw and processed data, but not the differential data that is most suitable for an integrative analysis.

YESdb is a database that contains preprocessed differential datasets of the yeast stress response. The datasets are annotated with the kind and strength of the applied stress, the strain and experimental technique that were used, the time at which the measurement was taken, and the publication date. A web interface allows users to quickly find relevant datasets that match a given combination of these annotations and to analyze them using the workflow system.
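Selecting datasets by a combination of annotations can be sketched as a simple filter over annotated records. The record type, its field names and the example entries below are assumptions made for illustration and do not reflect the actual YESdb schema or content.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DatasetAnnotation:
    """Hypothetical annotation record mirroring the annotations listed above."""
    name: str
    stress: str        # kind of stress
    strength: str      # strength of the stress
    strain: str
    technique: str     # experimental technique
    time_min: int      # time of the measurement in minutes
    published: int     # publication year


def find_datasets(datasets: List[DatasetAnnotation],
                  **criteria) -> List[DatasetAnnotation]:
    """Return all datasets whose annotations equal every given criterion."""
    known = {f.name for f in fields(DatasetAnnotation)}
    unknown = set(criteria) - known
    if unknown:
        raise KeyError(f"unknown annotation(s): {sorted(unknown)}")
    return [d for d in datasets
            if all(getattr(d, key) == value for key, value in criteria.items())]


# Invented example records.
DATASETS = [
    DatasetAnnotation("example_1", "heat", "37 °C", "BY4741", "microarray", 30, 2008),
    DatasetAnnotation("example_2", "heat", "42 °C", "BY4741", "RNAseq", 30, 2014),
    DatasetAnnotation("example_3", "osmotic", "0.4 M NaCl", "W303", "microarray", 10, 2011),
]

matches = find_datasets(DATASETS, stress="heat", technique="microarray")
print([d.name for d in matches])
```

Rejecting unknown annotation names early avoids silently returning an empty result for a misspelled criterion.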

The results of each step in such a workflow can be visualized in an interactive report, which can also contain workflow-independent visualizations that, e.g., characterize the selected datasets. This way, comprehensive reports can be created, saved and shared.

Publication

The content of this chapter has been submitted to Database [10]. The manuscript is reformatted here.

Author Contributions

Evi Berchtold designed and implemented the database, performed the analysis and wrote the paper. Gergely Csaba searched the meta-data of GEO and SRA to find the relevant datasets and preprocessed the RNAseq data. Ralf Zimmer supervised the project and edited the paper.

Availability

YESdb is available at https://services.bio.ifi.lmu.de/YESdb

6.1 Introduction

More and more high-throughput data is made publicly available in databases like GEO [6], SRA [62] or PRIDE [98]. Published data can be used to complement newly measured data in various ways. Meta-analyses integrate diverse datasets from different studies, tissues or species to draw unbiased conclusions. While meta-analyses usually focus on data from the same or similar platforms, another way to benefit from published data is to integrate datasets from the same or a similar condition measured on different platforms (e.g. RNAseq and microarray data). Systematic biases of one platform can thus be identified and corrected for. Similarly, datasets that measure different levels (e.g. expression and protein levels) of the same condition can be combined to obtain a more complete picture of the changes in the cell.

Even though the integration of multiple datasets can improve the analysis, many studies ignore published data that could be integrated into their analysis. The first hurdle for integrative analyses is of course to find data that fits, which often involves reading detailed experimental descriptions to uncover how similar the conditions are. Furthermore, integrative analyses are often hindered by the need to preprocess the raw data that is stored in the public databases. Especially when the published data was measured on a different platform, a different preprocessing workflow has to be used.

To facilitate the use of published data, some databases offer analysis options directly. GEO introduced the GEO2R tool, which allows GEO datasets to be used directly in R analyses. This is a very powerful tool, but it is limited to users who are familiar with the R programming language. Other databases such as MEM [2] and SPELL [51] also allow the user to do some analyses directly on their website, but they focus mainly on co-expression studies.

