
Most proteomic experiments aim to find differences between two or more conditions. The goal is to identify proteins (or peptides in the case of PTM studies) which show a significant up- or down-regulation between two or more classes, such as treated vs. untreated or normal vs. disease. These proteins can act as biomarkers for subsequent sample classification or as targets for potential therapies. However, due to the diversity of experimental designs and analysis steps, most of the necessary tasks cannot be performed in a fully automated fashion33,98, and describing all possible methods and tools is beyond the scope of this thesis. Instead, this section focuses on the most basic tasks needed to analyze and interpret the results of a proteomic experiment (Figure 1.16).

Figure 1.16 | Computation and statistics. After the identification and quantification of peptides and proteins, typical steps for data analysis include: peptide and protein significance analysis (left panel), class discovery and prediction (middle panel), and data integration, for instance pathway and signaling analysis (right panel). The last step in particular is affected by uncertainty in protein identities and the incomplete sampling of the proteome. Figure from33.

Protein intensities are subject to technical (e.g. different sample loading, MS performance) and biological (e.g. cell culture, amount of sample) variation. To account for this, different normalization and batch effect removal techniques are available, ranging from simple median centering or quantile and total sum normalization to more sophisticated probabilistic and regression models33,204-207. While some methods are designed for intensities only, others are also applicable to ratios (e.g. median centering). A general assumption underlying normalization is that most peptides and proteins do not exhibit a significant change, so that on average no difference between two samples is expected. While this holds true for most experiments, the applicability and choice of normalization method depend on the experiment and its design.
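To make this concrete, the following minimal Python sketch illustrates median centering on a log-transformed protein intensity matrix (rows = proteins, columns = samples). The function and variable names are hypothetical, and the sketch assumes the normalization premise stated above, namely that most proteins do not change between samples:

import pandas as pd

def median_center(log_intensities: pd.DataFrame) -> pd.DataFrame:
    """Median-center each sample (column) of a log-transformed
    intensity matrix with proteins as rows and samples as columns."""
    # Per-sample medians are comparable if most proteins are unchanged.
    sample_medians = log_intensities.median(axis=0, skipna=True)
    grand_median = sample_medians.median()
    # Subtract each sample's median, then re-add the grand median
    # so that the values stay on the original log scale.
    return log_intensities - sample_medians + grand_median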

After normalization, a typical next step is the identification of differentially regulated peptides and proteins. The ability to identify differential features depends on the reproducible identification and quantification of peptides and proteins across samples. However, the lack of a quantitative value does not imply the absence of the peptide or protein in a sample. Low-abundance features in particular often yield poor fragmentation spectra due to ion counting statistics. This is further aggravated by the incomplete sampling of the proteome, which results from the semi-stochastic nature of the DDA approach in combination with the complexity of the analyte (even in cases where purification steps were performed). Processing pipelines try to mitigate this by matching unidentified features across samples, and missing value imputation can reduce these issues even further. Nevertheless, missing values may introduce additional biases, because it is typically unclear which distribution of intensity values best represents a missing feature: the feature may be truly absent, or at least below the limit of detection, or it may be present but not identified or matched. As for most post-processing steps, the applicability of missing value imputation strongly depends on the experimental design, the underlying hypothesis and the acquisition method208,209.
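As an illustration of the left-censored case, the sketch below replaces missing values in each sample by draws from a Gaussian that is down-shifted relative to the observed intensity distribution, mimicking features below the limit of detection. The shift and width parameters are commonly used defaults for this style of imputation, and all names are hypothetical:

import numpy as np
import pandas as pd

def impute_downshifted_normal(log_intensities: pd.DataFrame,
                              shift: float = 1.8,
                              width: float = 0.3,
                              seed: int = 0) -> pd.DataFrame:
    """Replace missing values per sample with draws from a Gaussian
    down-shifted relative to the observed intensity distribution."""
    rng = np.random.default_rng(seed)
    imputed = log_intensities.copy()
    for col in imputed.columns:
        observed = imputed[col].dropna()
        # Center the imputation distribution below the observed mean
        # and narrow it, approximating signals near the detection limit.
        mu = observed.mean() - shift * observed.std()
        sigma = width * observed.std()
        missing = imputed[col].isna()
        imputed.loc[missing, col] = rng.normal(mu, sigma, missing.sum())
    return imputed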

Even without these issues, the identification of significantly changing peptides and proteins is a challenging task and relies on a carefully designed experiment. After log-transformation to approximate a Gaussian distribution of intensities and ratios, common statistical methods such as t-tests, ANOVA and F-tests can be applied to assign p-values to proteins. Since thousands of (largely) independent tests are performed in parallel, multiple testing correction is essential to control the type I error210,211. After assigning q-values to peptides and proteins, differentially changing features are selected by filtering for both practical (fold change) and statistical (p- and q-values) significance (Figure 1.16 left panel).
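The following sketch illustrates this workflow on log2-transformed intensities: a per-protein Welch t-test, Benjamini-Hochberg correction to obtain q-values, and a combined fold-change and q-value filter. The cutoffs and names are illustrative assumptions, not prescribed values:

import numpy as np
from scipy import stats

def differential_test(group_a: np.ndarray, group_b: np.ndarray,
                      fc_cutoff: float = 1.0, q_cutoff: float = 0.05):
    """Row-wise Welch t-test on two log2 intensity matrices
    (rows = proteins, columns = replicates) with BH correction."""
    t, p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    log2_fc = group_a.mean(axis=1) - group_b.mean(axis=1)
    # Benjamini-Hochberg: scale sorted p-values by m / rank and
    # enforce monotonicity from the largest rank downwards.
    order = np.argsort(p)
    m = len(p)
    raw = p[order] * m / np.arange(1, m + 1)
    q = np.empty(m)
    q[order] = np.minimum.accumulate(raw[::-1])[::-1]
    q = np.minimum(q, 1.0)
    # Require both practical and statistical significance.
    significant = (np.abs(log2_fc) >= fc_cutoff) & (q <= q_cutoff)
    return log2_fc, q, significant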

After identifying proteins which show a significant regulation across samples, class discovery refers to the process of identifying features which show a similar trend and reproducibly divide samples into classes (e.g. normal and disease), using unsupervised or supervised methods such as clustering or machine learning (Figure 1.16 middle panel)32,35,212. This process aims to identify peptides and proteins which can be used for subsequent class prediction213.
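As a minimal example of unsupervised class discovery, the sketch below clusters samples hierarchically based on the correlation distance between their protein profiles. It assumes a complete (e.g. imputed) log-intensity matrix and a known number of expected classes; all names are hypothetical:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def discover_classes(log_intensities: np.ndarray,
                     n_classes: int = 2) -> np.ndarray:
    """Assign samples (columns of a proteins-by-samples matrix)
    to classes by average-linkage hierarchical clustering."""
    # 1 - Pearson correlation as the distance between sample profiles.
    distances = pdist(log_intensities.T, metric="correlation")
    tree = linkage(distances, method="average")
    # Cut the dendrogram into the requested number of classes.
    return fcluster(tree, t=n_classes, criterion="maxclust")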

Furthermore, sets of proteins which show a similar quantitative profile across biological samples can also be used to generate new functional and biological insight. For example, proteins which show a similar trend upon drug treatment are likely part of a larger functional network or pathway. Molecular, functional and biological enrichment analysis of significantly differentially regulated proteins214-216 enables the determination of common features, which can be used to annotate groups of proteins. Briefly, the fraction of proteins assigned to a specific molecular, functional or biological process is compared to the expected fraction. If, for example, kinases show a significant enrichment among the differentially regulated proteins in comparison to the entire proteome, kinases might be a key factor in the response.
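A minimal sketch of such an enrichment test, assuming protein accessions are given as Python sets: a one-sided Fisher's exact test compares the fraction of annotated proteins (e.g. kinases) among the regulated proteins to that in the background proteome:

from scipy.stats import fisher_exact

def annotation_enrichment(regulated: set, annotated: set, background: set):
    """Test whether an annotation (e.g. 'kinase') is over-represented
    among regulated proteins relative to the background proteome."""
    a = len(regulated & annotated)                  # regulated, annotated
    b = len(regulated - annotated)                  # regulated, not annotated
    c = len((background - regulated) & annotated)   # not regulated, annotated
    d = len((background - regulated) - annotated)   # neither
    odds_ratio, p_value = fisher_exact([[a, b], [c, d]],
                                       alternative="greater")
    return odds_ratio, p_value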

Integrative analysis with other ‘omics’ data217,218, such as transcriptomics, genomics and metabolomics, and correlation with previously conducted studies219 can provide additional insights into the underlying biology (Figure 1.16 right panel) and broaden our understanding of molecular processes on a system-wide level. Even though most underlying mechanisms are still not fully understood, each omics technology offers specific advantages and disadvantages, enabling scientists to retrieve confident information about e.g. mutation status, gene expression and the activation status of proteins. However, these steps require thoroughly and extensively curated resources and knowledge bases, which are in turn dependent on well-performed, freely available and well-described data.

4 Proteomic and annotation resources

Our ability to analyze and interpret the results of an MS-based proteomic experiment is heavily dependent on prior knowledge. During the analysis, protein sequence databases are used to transform raw MS data into peptide and protein result lists. Subsequently, known functions and annotations of proteins220,221, their interaction or contribution to complexes and metabolic or signaling pathways222-227 allow us to draw functional and biological conclusions.

The integration of different experiments enables a continuous increase in knowledge of how biological systems work and act upon different stimuli. Thus, there is great potential in storing and providing access to as many well-annotated experiments as possible219. For this, both raw and result files, containing identification and quantification data, have to be archived and made available also to non-expert researchers. This will help other disciplines to build on and validate their own findings and further advance their hypotheses. Furthermore, the integration of multiple types of data could lead to novel findings which could not have been uncovered by a single lab or experiment alone.

It has become good practice to share experimental data to support novel findings38,219. However, due to the ever-increasing amount of data generated, organizing and storing raw data has become a challenge, especially in the field of proteomics. How best to store these data was long debated in the scientific community228 and gave rise to many data repositories and databases38. While there are many challenges associated with the storage of data, especially in terms of annotating the experimental factors and conditions used to study biological systems, even the integration of comparatively simple studies can broaden our knowledge. For example, the integration of multiple isolated studies measuring the expression of proteins (full proteomes) in model systems (e.g. cell lines) can help researchers to design better experiments by providing an expression map of proteins.

This section aims to introduce some of the aspects discussed here by describing state-of-the-art resources available to analyze and interpret MS-based proteomics data. Last but not least, a brief overview of the databases and repositories used in the field of proteomics to share both raw and result files is given.