Post-processing of LC-MS2 data after database analysis

4. Methods

4.11 Post-processing of LC-MS2 data after database analysis

4.11.1 Optimization of LysN digestion with a synthetic peptide (section 5.2.1.2)

LC-MS2 data were analyzed by the Proteome Discoverer software (version 2.2.0.338) with the settings described in section 4.10.1. Oxidation at methionine was set as a dynamic modification.

The chromatographic peak areas of high-confidence peptide groups were used to calculate the relative cleavage efficiency. Prior to these calculations, the abundances of peptides differing only in their oxidation status were summed. The relative cleavage efficiency was then calculated as the ratio of the abundances of the cleaved and the non-cleaved precursor peptide, multiplied by a factor of 1000.
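The calculation can be illustrated with a minimal Python sketch. It is not the procedure used for the thesis data. The peptide sequences, the peak areas and the function name are hypothetical and serve only to show the order of the two steps (summing oxidation variants, then forming the ratio).

```python
from collections import defaultdict


def relative_cleavage_efficiency(peptide_groups, cleaved_seq, uncleaved_seq, factor=1000):
    """Sum peak areas over oxidation variants, then return cleaved / non-cleaved * factor.

    peptide_groups: iterable of (bare_sequence, is_oxidized, peak_area) tuples.
    """
    totals = defaultdict(float)
    for sequence, _is_oxidized, area in peptide_groups:
        # Oxidized and non-oxidized forms of the same sequence are pooled.
        totals[sequence] += area

    uncleaved = totals.get(uncleaved_seq, 0.0)
    if uncleaved == 0:
        raise ValueError("non-cleaved precursor was not quantified")
    return totals.get(cleaved_seq, 0.0) / uncleaved * factor


# Hypothetical peak areas for a synthetic peptide and its cleavage product.
example_groups = [
    ("KVTNLLM", False, 3.0e6),
    ("KVTNLLM", True, 1.0e6),
    ("AGSKVTNLLM", False, 1.5e6),
    ("AGSKVTNLLM", True, 0.5e6),
]
efficiency = relative_cleavage_efficiency(example_groups, "KVTNLLM", "AGSKVTNLLM")
```

With these invented values the summed areas are 4.0e6 (cleaved) and 2.0e6 (non-cleaved), so the sketch returns 2000.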

4.11.2 LysN and LysArginase digestion of BSA and complex protein mixtures (section 5.2.1.3)

After processing the LC-MS2 data of the BSA samples digested with LysN or LysArginase, the MaxQuant “Evidence” files were filtered for entries with a BSA protein accession and a PEP smaller than 0.01. For each raw file, the filtered entries were copied into a new Excel sheet.

Redundant entries were removed based on the “Modified Sequence” column, thus generating the peptide table. Peptides with the same sequence that differed only in their carbamylation and/or oxidation status were considered a single entry. In contrast, peptides differing in their biotinylation or acetylation (protein N-terminus) sites were kept as separate entries. Subsequently, the modified peptide table was used to determine the number of biotinylated and non-biotinylated peptides as well as the proportion of missed cleavage sites (MCS) in each sample.

For evaluation of the biotinylation position, the information given in the “Modified Sequence” and “Biotinylation [K] Probabilities” columns was used.

In case of the complex HEK protein sample, only evidences (retrieved from the MaxQuant “Evidence” file) with a quantification value in all three replicates/SILAC channels and a PEP smaller than 0.01 were considered for further data analysis. For each sample group, the filtered entries were copied to a new Excel sheet. Information about carbamylation and oxidation sites was removed from the sequence entries of the “Modified Sequence” column (e.g. KVTNLLM(ox)L was modified to KVTNLLML). In order to generate the peptide table, redundant entries were removed based on this column. Afterwards, this peptide table was used for data analysis similar to the procedure mentioned above. The boxplots depicting the median protein sequence coverage were generated in Excel by using the “Protein Groups” file. For each enzyme, only proteins identified with more than one unique peptide and with a quantification value in all three replicates of the NHS-biotin treated and non-biotinylated samples were used for data visualization.
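The generation of the peptide table was carried out in Excel. The following Python sketch only illustrates the logic of removing selected modification annotations and then discarding redundant entries. The tag “(ox)” is taken from the example above. The tags “(ca)” for carbamylation and “(bi)” for biotinylation, as well as the function names, are assumptions made for illustration.

```python
import re

# "(ox)" follows the example in the text; "(ca)" is an assumed tag for carbamylation.
REMOVABLE_TAGS = ("ox", "ca")


def strip_tags(modified_sequence, tags=REMOVABLE_TAGS):
    """Remove the listed modification tags, e.g. KVTNLLM(ox)L -> KVTNLLML."""
    pattern = r"\((?:" + "|".join(re.escape(tag) for tag in tags) + r")\)"
    return re.sub(pattern, "", modified_sequence)


def peptide_table(modified_sequences, tags=REMOVABLE_TAGS):
    """Return the non-redundant entries, keeping the order of first appearance."""
    seen = set()
    table = []
    for sequence in modified_sequences:
        stripped = strip_tags(sequence, tags)
        if stripped not in seen:
            seen.add(stripped)
            table.append(stripped)
    return table


# Hypothetical entries; "(bi)" is an assumed tag for biotinylation and is kept.
entries = ["KVTNLLM(ox)L", "KVTNLLML", "K(bi)VTNLLML", "K(ca)VTNLLML"]
table = peptide_table(entries)
```

In this example the oxidized, carbamylated and unmodified forms collapse into one entry, whereas the biotinylated form remains a separate entry.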

4.11.3 Evaluation of different derivatization methods regarding their reactivity, reproducibility and selectivity (section 5.2.2.2)

Raw data were processed by using the Proteome Discoverer software (version 2.3.0.523) (see section 4.10.1). For each derivatization technique, the list of standard dynamic modifications was extended by single and double propionylation, acetylation or reductive alkylation (at lysine and peptide N-terminus). Peptides with a Percolator q-value smaller than 0.01 were used for further analysis. Post-processing was performed based on the PD “Peptide Isoforms” table by using a self-written R-script (R-Studio, R-packages: “dplyr” (version 0.8.3), “stringr” (version 1.4.0)) (Hadley et al. 2019; Hadley 2019). Peptides featuring an “sp” entry in the “Master Protein Accession” column (contaminants) were removed. Afterwards, information about the derivatization site was fused to the peptide sequence, thus yielding the “Modified Sequence” column. For each replicate, redundant entries of the “Modified Sequence” column were removed. Subsequently, the non-redundant peptide lists were used to determine the reactivity, reproducibility and selectivity of the different derivatization methods.

4.11.4 Evaluation of the serial digestion and the N^α-selective derivatization workflow (sections 5.2.1.4 and 5.2.2.3)

Data processing was performed with the Proteome Discoverer software (version 2.2.0.338) according to the procedure described in section 4.10.1. Raw files belonging to the same step within the workflow were processed together. In addition to the standard settings, methylation at lysine and arginine were set as dynamic modifications. After NHS-biotin treatment, the list of variable modifications searched during data processing was extended by biotinylation (at lysine and peptide N-terminus). For N^α-selective derivatized samples, single and double alkylation (at lysine and peptide N-terminus) were searched as dynamic modifications.

4.11.4.1 Investigation of the biotinylation efficiency

PSMs (retrieved from the PD “PSM” file) featuring a PEP smaller than 0.01 were used for evaluation of the biotinylation efficiency. The percentages of biotinylated PSMs and of PSMs with at least one unmodified primary amine were determined. In case the samples were derivatized by reductive alkylation prior to the incubation with NHS-biotin, the proportion of alkylated PSMs was also investigated.
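The two percentages can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the evaluation used for the thesis data: the number of primary amines of a peptide is taken as one (peptide N-terminus) plus the number of lysine residues, and the field names “biotin_sites” and “blocked_amines” are hypothetical.

```python
def biotinylation_summary(psms, pep_cutoff=0.01):
    """Return the percentage of biotinylated PSMs and of PSMs with a free primary amine.

    psms: list of dicts with the hypothetical keys
      "sequence"       bare amino acid sequence
      "biotin_sites"   number of biotinylated positions
      "blocked_amines" number of modified primary amines (any amine modification)
      "pep"            posterior error probability
    """
    kept = [psm for psm in psms if psm["pep"] < pep_cutoff]
    if not kept:
        raise ValueError("no PSM passed the PEP filter")

    biotinylated = sum(1 for psm in kept if psm["biotin_sites"] > 0)
    # Assumption: primary amines = peptide N-terminus + lysine side chains.
    free_amine = sum(
        1 for psm in kept
        if psm["blocked_amines"] < 1 + psm["sequence"].count("K")
    )
    return {
        "biotinylated_pct": 100.0 * biotinylated / len(kept),
        "free_amine_pct": 100.0 * free_amine / len(kept),
    }


# Hypothetical PSMs; the last one fails the PEP filter.
example_psms = [
    {"sequence": "KAAAR", "biotin_sites": 2, "blocked_amines": 2, "pep": 0.001},
    {"sequence": "KAAAR", "biotin_sites": 1, "blocked_amines": 1, "pep": 0.002},
    {"sequence": "AAAAR", "biotin_sites": 0, "blocked_amines": 0, "pep": 0.003},
    {"sequence": "AAAAR", "biotin_sites": 1, "blocked_amines": 1, "pep": 0.004},
    {"sequence": "KAAAK", "biotin_sites": 0, "blocked_amines": 0, "pep": 0.5},
]
summary = biotinylation_summary(example_psms)
```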

4.11.4.2 Investigation of enhanced lysine and arginine PTM detection

Peptide information was retrieved from the PD “Peptide Group” table. Data were filtered for peptides featuring a PEP smaller than 0.01, and information about the methylation sites was fused to the amino acid sequence (modified sequence column) by using a self-written R-script.

Peptide species (entries of the modified sequence column) detected in the single samples/replicates were copied to separate Excel sheets. Non-assignable methylation sites were considered as ambiguous modifications. Based on the modified sequence column, redundant peptide entries were removed. Subsequent to this, the non-redundant entries of the modified sequence column were used to analyze the number of arginine and lysine methylation sites in each replicate and sample.

Evaluation of the number of methylated PSMs was performed by using the PD “PSM” table. For this purpose, the position of the methylation site was deleted in the “Modifications” column.

The table was filtered for PSMs featuring a PEP smaller than 0.01. Afterwards, the modified entries of the “Modifications” column were used for determining the number of arginine and lysine methylation sites in each raw file.
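A possible way to remove the position information and to count methylated PSMs per raw file is sketched below in Python. The layout of the “Modifications” entries (e.g. “K5(Methyl); C7(Carbamidomethyl)”), the modification names and the dictionary keys are assumptions made for illustration and may differ from the actual PD export.

```python
import re
from collections import Counter

# Assumed entry layout: residue letter, position, modification name in brackets.
POSITION = re.compile(r"\b([A-Z])\d+(?=\()")
# Assumed names of the methylation states searched.
METHYL_NAMES = ("Methyl", "Dimethyl", "Trimethyl")


def drop_positions(modifications):
    """'K5(Methyl); R12(Dimethyl)' -> 'K(Methyl); R(Dimethyl)'."""
    return POSITION.sub(r"\1", modifications)


def count_methylated_psms(psms, pep_cutoff=0.01):
    """Count PSMs carrying lysine or arginine methylation for each raw file."""
    names = "|".join(METHYL_NAMES)
    counts = Counter()
    for psm in psms:
        if psm["pep"] >= pep_cutoff:
            continue
        modifications = drop_positions(psm["modifications"])
        for residue in ("K", "R"):
            if re.search(rf"\b{residue}\((?:{names})\)", modifications):
                counts[(psm["raw_file"], residue)] += 1
    return counts


# Hypothetical PSM entries.
example_psms = [
    {"raw_file": "run1", "modifications": "K5(Methyl); C7(Carbamidomethyl)", "pep": 0.001},
    {"raw_file": "run1", "modifications": "R3(Dimethyl)", "pep": 0.002},
    {"raw_file": "run1", "modifications": "K2(Methyl)", "pep": 0.2},
    {"raw_file": "run2", "modifications": "K1(Methyl); R9(Methyl)", "pep": 0.005},
]
methyl_counts = count_methylated_psms(example_psms)
```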

4.11.4.3 Investigation of the presence of C-terminal methylation sites after tryptic digestion (serial digestion workflow)

The presence of methylation sites at the C-terminus after tryptic digestion was investigated based on the non-redundant peptide list used for determining the number of arginine and lysine methylation sites in samples of the two workflows (see section 4.11.4.2).

4.11.4.4 Investigation of the number of lysine and arginine containing peptides in samples of the N^α-selective derivatization workflow

The number of arginine and lysine containing peptides was determined by using the non-redundant peptide lists generated for investigating the number of methylation sites in the different samples/replicates of the two workflows (see section 4.11.4.2).

4.11.4.5 Generation of Venn diagrams for sample comparison

Proportional Venn diagrams were generated with R-studio by using the “eulerr” package (version 6.0.0) (Larsson 2019). The same software was used to create non-proportional Venn diagrams with the “VennDiagram” package (version 1.6.20) (Hanbo 2018).

4.11.5 Analysis of phospho-data sets (section 5.1.2)

4.11.5.1 Post-processing of MaxQuant phospho-data

Post-processing of the MaxQuant data was performed with a self-written R-script. Evidences featuring a PEP smaller than 0.01 were retrieved from the MaxQuant “Evidence” file. For singly phosphorylated evidences, the phosphorylation site with the highest probability (rounded to the first decimal) was fused to the amino acid sequence. If the evidence was identified as doubly phosphorylated, the two highest phospho-site probabilities (rounded to the first decimal) were considered. If the same probability was assigned to two or more phospho-sites, the information about all positions was kept. For each replicate, the evidences were split into separate tables. Afterwards, redundant entries were merged based on the previously generated sequence column containing the phospho-site and probability information. During this procedure, the intensity values of redundant entries were summed. Subsequently, the intensities of each channel were normalized to the median. Finally, the data in the single replicate tables were merged into one file and analyzed with the “qbaR” R-package (see section 4.11.5.3). Highly regulated phosphopeptides were determined by filtering the normalized data for peptides found in at least two out of three DMSO replicates without any identification in the stimulated samples and vice versa.
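The merging of redundant entries and the subsequent normalization can be sketched as follows. The original analysis was done with an R-script that is not reproduced here, so this Python version is only an illustration. It assumes that normalization means dividing each intensity by the median of its channel; the annotated sequences and the channel names “L” and “H” are hypothetical.

```python
from collections import defaultdict
from statistics import median


def merge_redundant(evidences):
    """Sum the channel intensities of entries sharing the same annotated sequence.

    evidences: iterable of (annotated_sequence, {channel: intensity}) pairs.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for key, intensities in evidences:
        for channel, value in intensities.items():
            merged[key][channel] += value
    return {key: dict(channels) for key, channels in merged.items()}


def median_normalize(table):
    """Divide every intensity by the median of its channel (assumed scheme)."""
    channels = {channel for row in table.values() for channel in row}
    medians = {
        channel: median(
            row[channel] for row in table.values() if row.get(channel, 0) > 0
        )
        for channel in channels
    }
    return {
        key: {channel: value / medians[channel] for channel, value in row.items()}
        for key, row in table.items()
    }


# Hypothetical evidences of one replicate with two SILAC channels.
example_evidences = [
    ("AS(1.0)PEK", {"L": 100.0, "H": 50.0}),
    ("AS(1.0)PEK", {"L": 100.0, "H": 150.0}),
    ("T(0.8)LLK", {"L": 400.0, "H": 100.0}),
    ("GS(0.9)VR", {"L": 300.0, "H": 400.0}),
]
merged = merge_redundant(example_evidences)
normalized = median_normalize(merged)
```

In this example the two “AS(1.0)PEK” entries are merged to 200 in both channels before the channel medians (300 for “L”, 200 for “H”) are applied.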

4.11.5.2 Post-processing of Proteome Discoverer phospho-data

Phospho-data processed with PD were analyzed based on the “Peptide Isoform” table (maximum q-value 1%). Peptides featuring an “sp” entry in the “Master Protein Accession” column were removed. A self-written R-script was used in order to fuse the identified phosphorylation site to the amino acid sequence. Peptide entries without abundance value in all conditions/SILAC channels were removed. Afterwards, the data set was analyzed by the “qbaR” R-package (see section 4.11.5.3). Highly regulated phosphopeptides were determined according to the procedure described in section 4.11.5.1.
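The filter for highly regulated phosphopeptides described in section 4.11.5.1 (found in at least two out of three replicates of one condition and not identified in the other) can be illustrated with the following Python sketch. The column names, the peptide identifiers and the use of None for a missing value are assumptions made for illustration.

```python
def exclusive_peptides(table, present_in, absent_in, min_hits=2):
    """Peptides quantified in >= min_hits columns of present_in and in none of absent_in."""

    def hits(row, columns):
        return sum(1 for c in columns if row.get(c) is not None and row[c] > 0)

    return sorted(
        peptide
        for peptide, row in table.items()
        if hits(row, present_in) >= min_hits and hits(row, absent_in) == 0
    )


# Hypothetical column names and normalized abundances (None = not identified).
DMSO = ("DMSO_1", "DMSO_2", "DMSO_3")
STIM = ("stim_1", "stim_2", "stim_3")
example_table = {
    "pepA": {"DMSO_1": 1.2, "DMSO_2": 0.9, "DMSO_3": None,
             "stim_1": None, "stim_2": None, "stim_3": None},
    "pepB": {"DMSO_1": None, "DMSO_2": None, "DMSO_3": None,
             "stim_1": 1.0, "stim_2": 2.0, "stim_3": 1.5},
    "pepC": {"DMSO_1": 1.0, "DMSO_2": 1.0, "DMSO_3": 1.0,
             "stim_1": 0.8, "stim_2": None, "stim_3": None},
    "pepD": {"DMSO_1": 1.0, "DMSO_2": None, "DMSO_3": None,
             "stim_1": None, "stim_2": None, "stim_3": None},
}
only_dmso = exclusive_peptides(example_table, DMSO, STIM)
only_stim = exclusive_peptides(example_table, STIM, DMSO)
```

Calling the function twice with swapped column groups covers the “vice versa” case.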

4.11.5.3 Statistical analysis of phospho-data

For phosphoproteome comparison, statistical analysis was performed after filtering the data for phosphopeptides found in at least two out of three replicates of the treated samples and the control. In addition to that, the complete data set without further filtering was analyzed (see Appendix section 7.1). Statistics were performed with an R-script provided by Dr. Farhad Shakeri from the Core Unit for Bioinformatics Data Analysis of the University Hospital of Bonn (R-packages: “qbaR” (version 1.2.3), “FactoMineR” (version 1.42), “ggplot2” (version 3.2.1), “data.table” (version 1.12.6), “tidyverse” (version 1.2.1)) (Lê et al. 2008; Wickham 2016, 2017; Dowle and Srinivasan 2019). In this workflow, the limma-package (version 3.40.6) was used in order to correct p-values for multiple testing (Ritchie et al. 2015).

4.11.6 Statistical analysis of proteome comparison data (section 5.1.5)

Raw data from proteome comparison approaches were analyzed with PD (version 2.3.0.523) (see section 4.10.1). Grouped protein abundances were used to calculate protein ratios.

Statistical analysis was performed by ANOVA (individual proteins) implemented in the PD workflow. Proteins identified with at least two unique peptides (maximum q-value 1 %) in more than one replicate of the wt and SNAPIN KO samples were used for data visualization. The volcano plot was generated with the R-software (R-packages: ggplot2, data.table). The PCA plot (see Appendix section 7.4) was retrieved from PD.

4.11.7 Analysis of Co-IP data (section 5.1.10)

LC-MS2 data from the Co-IP experiment were processed with PD (version 2.3.0.523) according to the procedure mentioned in section 4.10.1. In addition to the standard databases, data were also searched against a protein list containing sequence information about the SNAPIN wt, S133E and S133A isoforms. In this additional database, the SNAPIN isoforms were listed without sequence information of the Myc-Tag and the linking region between the Tag and the protein.

Non-normalized abundances of proteins found with at least two peptides (max. q-value 0.01) were used for post-processing data analysis.

For each protein, the mean and standard deviation of the non-normalized abundances were calculated for the three replicates of the SNAPIN KO, SNAPIN S133E and SNAPIN S133A samples.

Subsequent to this, the mean values were used to calculate the S133E/KO and S133A/KO ratios.

In case the ratio could not be calculated due to missing values in the KO samples, a value of 200 was set. Each protein with a ratio higher than 10 and found in at least two S133A or S133E replicates was assumed to be a potential interaction partner of the particular isoform. Data visualization was performed with the R-software (R-package: ggplot2). Here, the coefficient of variation (CV) (calculated as the standard deviation of the abundance divided by the mean abundance) of each protein identified in more than one S133E or S133A replicate was plotted against its log2 sample/KO ratio.
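The ratio, CV and candidate criteria described above can be summarized in a short Python sketch. It is illustrative only. The fallback ratio of 200 and the cutoff of 10 are taken from the text; the use of the sample standard deviation (n - 1), the treatment of missing values as None and the function name are assumptions.

```python
from math import log2
from statistics import mean, stdev

MISSING_KO_RATIO = 200.0  # value set when no KO abundance is available (from the text)
RATIO_CUTOFF = 10.0       # ratio threshold for potential interaction partners (from the text)


def summarize_protein(sample, ko):
    """Summarize one protein from replicate abundances (None = missing value)."""
    sample_values = [x for x in sample if x is not None and x > 0]
    ko_values = [x for x in ko if x is not None and x > 0]
    if not sample_values:
        return None

    sample_mean = mean(sample_values)
    ratio = sample_mean / mean(ko_values) if ko_values else MISSING_KO_RATIO
    # Assumption: sample standard deviation (n - 1); needs at least two replicates.
    cv = stdev(sample_values) / sample_mean if len(sample_values) > 1 else None
    return {
        "ratio": ratio,
        "log2_ratio": log2(ratio),
        "cv": cv,
        "candidate": ratio > RATIO_CUTOFF and len(sample_values) >= 2,
    }


# Hypothetical abundances for three proteins.
enriched = summarize_protein([100.0, 120.0, 80.0], [5.0, 5.0, 5.0])
absent_in_ko = summarize_protein([50.0, None, 60.0], [None, None, None])
single_hit = summarize_protein([50.0, None, None], [1.0, 1.0, 1.0])
```

In this sketch a protein found in only one replicate is never flagged as a candidate, even if its ratio exceeds the cutoff, and no CV is reported for it.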

In order to compare the abundances of the potential interaction partners shared between both isoforms, the protein levels of each channel and sample group were normalized to the corresponding SNAPIN quantities. Afterwards, the mean protein abundances of the different replicates were calculated and used to determine the S133A/S133E ratio. Data visualization was performed with R-studio (R-package: ggplot2). The ratios of the mean values were depicted as a bar chart. Abundance ratios determined for single replicates were indicated as dots.

In the document: Proteomic identification of posttranslational modifications: cAMP-induced changes of phosphorylation and investigation of novel approaches detecting posttranslational modifications at lysine and arginine residues (pages 77-83)