Introduction

The turnoveR package has been developed to analyze isotopically labeled proteomics datasets, with the goal of quantifying and visualizing protein turnover and protein degradation in an accessible, efficient, and reproducible manner, and to easily contextualize the data with direct links to the UniProt and KEGG online databases. To install the package, see the Installation Instructions.

Experimental datasets intended for use with this package are generated from a series of samples from a steady-state bacterial culture, collected at different timepoints following an isotopic label “pulse”.

Functions in the turnoveR package accept mass spectrometry datasets that have been pre-processed to estimate areas of the “heavy” and “light” isotopic labeled fractions for each protein (by programs such as Massacre).

This vignette illustrates the use of the turnoveR package for processing an example dataset. The following flow chart illustrates the overall structure and workflow of the whole package. For more details and a higher resolution version, see the Package Structure Vignette.

Input Files

protein_sums.csv or psms.csv:

  • a file containing counts of how many times each protein was identified in each sample of the dataset
  • Preferred file type: .csv
  • Important column names: the protein IDs and their per-sample counts (as read by tor_read_protein_counts_data)

svm_results.csv:

  • typically a Massacre output file curated to remove poorly fitted curves
  • Preferred file type: .csv
  • Important column names: svm_pred (the spectral fit quality probability) and the light and heavy signal columns

metadata.xlsx:

  • an Excel sheet with sample timepoints and other sampling information
  • Preferred file type: .xlsx
  • Important column names: sample (for joining to the mass spec data) and a time column such as hours

Getting started

Load the turnoveR package to get access to all functions. The exported functions of the package all start with the prefix tor_ to make auto-completion easy. To get a list of all available turnoveR functions, simply start typing tor_ in the RStudio console after the package is loaded. Additional packages that are helpful for data processing are the core set of packages loaded by the tidyverse (e.g. dplyr, tidyr, ggplot2) and the non-core readxl for reading Excel files. Lastly, the plotly library makes it easy to make plots interactive, which can be very useful for data exploration.

library(turnoveR)
library(tidyverse)
library(readxl)
library(plotly)

# base path to the vignette data
path <- file.path("vignettes", "vignette_data")

Reading in SVM files

One way to get started is to load an SVM data file. This pre-processed proteomics data (support vector machine (SVM) evaluated spectral output from Massacre) is read in through the function tor_read_svm_data_file. Note that some of the column names are standardized upon data loading for compatibility with downstream processing. Also note that most functions provide a quick summary message of what is happening. If desired, this can be turned off by specifying quiet = TRUE in each function call.

svm_data <- 
  # read svm data file
  tor_read_svm_data_file(
    filepath = file.path(path, "svm_pred_results_0.03gr.csv")) 
#> Info: successfully read 27220 records from SVM file 'svm_pred_results_0.03gr.csv'
# show first 100 records
svm_data %>% head(100)

Reading in Massacre output

Alternatively, Massacre output files can be read and evaluated directly (see the flow chart for details). Functionality and description coming soon…

Data Preparation

Spectral Quality filtering

Typically the first step is to remove data with poor quality spectral fits. In the case of SVM pre-processed data, the quality probability column is svm_pred and should be filtered with a sensible cut-off (here we require at least a 75% SVM-estimated probability of a good fit).

data_filtered <- 
  svm_data %>% 
  # spectral quality filtering
  tor_filter_peptides_by_spectral_fit_quality(svm_pred > 0.75)
#> Info: kept 6961 of 27220 (25.6%) peptide measurements during spectral fit quality filtering (condition 'svm_pred > 0.75')

Recoding protein IDs

Sometimes not all protein IDs are up to date, and they need to be recoded to make sure they can be matched to database records.

data_recoded <-
  data_filtered %>% 
  tor_recode_protein_ids(file.path(path, "rename_prot.xlsx"))
#> Info: renamed 5 protein entries for 1 different proteins.

Adding metadata

Often it is necessary to add additional metadata to the data set (e.g. the times of the individual samples for later time course fitting). If this is the case, the metadata should include information on the sampling timepoints and is accepted in Excel format.

# read metadata
metadata <- read_excel(file.path(path, "metadata_CMW.xlsx"))

# show metadata
metadata
# add to data
data_w_metadata <-
  data_recoded %>% 
  tor_add_metadata(metadata, join_by = "sample")
#> Info: adding metadata to mass spec data, joining by 'sample'...8 metadata entries successfully added to 6961 data records, 0 could not be matched to metadata

Calculations

Labeling Rate, Degradation and Dissipation

The calculation of the labeling rate happens in two steps so that the intermediate data can be exported and examined. The first step is the calculation of the labeling fractions from the light and heavy signals using tor_calculate_labeled_fraction. This function returns the fraction of labeled and unlabeled peptides for each sample.
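
For orientation, here is a minimal sketch of the presumed calculation (the exact implementation is internal to tor_calculate_labeled_fraction; frac_labeled is a hypothetical helper, not a package function):

# presumed definition: labeled fraction = heavy signal over total signal
frac_labeled <- function(heavy, light) heavy / (heavy + light)
frac_labeled(heavy = 30, light = 70) # 0.3, i.e. 30% of the peptide pool is labeled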

With the labeled and unlabeled fractions known for each peptide at each timepoint, it is possible to fit an exponential to either individual peptides or all peptides in a protein to calculate a labeling rate. The function tor_calculate_label_rate thus uses the calculated labeled fractions to fit a curve describing isotope labeling over time. Peptides can be combined to return a labeling rate for the entire protein by choosing the option combine_peptides = TRUE in the function arguments.
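
As a rough illustration of the fitted curve, the sketch below assumes standard first-order label incorporation (the exact model used by tor_calculate_label_rate may differ):

# hypothetical first-order labeling model: the labeled fraction rises
# exponentially towards 1 with rate constant label_rate
label_model <- function(time, label_rate) 1 - exp(-label_rate * time)
# visualize the curve shape for a hypothetical rate of 0.05/hour
curve(label_model(x, label_rate = 0.05), from = 0, to = 100,
      xlab = "hours", ylab = "fraction labeled")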

The calculated labeling rates can then be used to calculate degradation rates and dissipation rates using the function tor_calculate_degradation_dissipation. Protein degradation is calculated as the labeling rate minus the growth rate (deg_rate = label_rate - growth_rate). Protein dissipation is calculated as the degradation rate divided by the labeling rate, and is displayed as a percent (dissipation = (deg_rate / label_rate)*100).
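
A quick numerical check of these two formulas, using hypothetical rates (not package output):

# hypothetical example: a protein labeled faster than the culture grows
label_rate  <- 0.05 # [1/hour]
growth_rate <- 0.03 # [1/hour]
label_rate - growth_rate                      # deg_rate = 0.02 [1/hour]
(label_rate - growth_rate) / label_rate * 100 # dissipation = 40%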

For this last step, it is critical to know the growth rate of the experiment. For a steady-state (chemostat) culture, the growth rate equals the dilution rate, which can be calculated from the culture volume and the media flow rate; turnoveR provides the tor_calculate_growth_params function to simplify this step.
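
As a sanity check on the numbers used below (assuming tor_calculate_growth_params implements the dilution-rate calculation, with error propagation):

# chemostat dilution rate = flow rate / culture volume
(0.46 * 60) / 894.4 # ~0.031 [1/hour], consistent with the "0.03gr" file name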

# calculate growth parameters
growth_params <- 
  tor_calculate_growth_params(
    flow_rate = 0.46 * 60, # [g/hour]
    flow_rate_se = 0.02 * 60, # estimated error [g/hour]
    volume = 894.4, # [g]
    volume_se = 20 # estimated error [g]
  )
growth_params %>% rmarkdown::paged_table()
# data calculations
data_w_calcs <- 
  data_w_metadata %>% 
  # calculate labeled fraction
  tor_calculate_labeled_fraction() %>% 
  # calculate the label rate
  tor_calculate_label_rate(
    time_col = "hours", 
    min_num_timepoints = 3,
    combine_peptides = TRUE) %>% 
  # degradation rate and dissipation
  tor_calculate_degradation_dissipation(
    growth_rate = growth_params$growth_rate,
    growth_rate_se = growth_params$growth_rate_se) 
#> Info: calculated labeled/unlabeled fraction for 6961 peptides
#> Info: processing data for 421 proteins, this may take a few seconds... 336 of the proteins could be fit to a labeling curve, 85 did not have enough time points
#> Info: calculated degradation rate and dissipation for 336 records
# show results (remove complex data column for printout)
data_w_calcs 

Visualization

The data quality can be visualized using a few basic graphing functions. To see a histogram of the labeling rates of all the proteins in an experiment (or in multiple data sets combined), use tor_plot_label_rate_hist. Similar plots could be generated for deg_rate and dissipation (see the sketch after the next code chunk).

data_w_calcs %>% tor_plot_label_rate_hist()
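
For example, a comparable histogram of the degradation rates can be built with ggplot2 directly (this assumes the deg_rate column produced by tor_calculate_degradation_dissipation; a dedicated tor_ plotting function for it is not shown in this vignette):

# histogram of degradation rates across all fitted proteins
data_w_calcs %>% 
  ggplot(aes(x = deg_rate)) + 
  geom_histogram(bins = 30)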

The function tor_plot_label_rate_error displays the residual standard error for each calculated label rate to easily evaluate what quality cutoffs might be useful.

data_w_calcs %>% tor_plot_label_rate_error()

To visualize what the labeling curves and least squares fits actually look like, use tor_plot_labeling_curves to display the curves for one or more randomly selected proteins. The argument plot_number allows the user to select the desired number of curves to display.

data_w_calcs %>% tor_plot_labeling_curves(plot_number = 4, random_seed = 123)

Data Quality

The output from the calculation function should be filtered to remove curves with missing fits (not enough data) and poor curve fits. It is always helpful to also look at the data that is getting discarded to see if something is amiss. Any filtered dataset can easily be passed to any of the plotting functions (e.g. tor_plot_labeling_curves) for a closer look.

Not enough data

Let’s take a look at some of the proteins that did not have enough data for a fit (which we defined earlier to be at least 3 separate time points).

# show all records that don't have enough data
data_w_calcs %>% 
  tor_filter_label_rate_fits(!enough_data) 
#> Info: fetching 85 of 421 entries based on filter condition '!enough_data'
# visualize some of these problematic records
data_w_calcs %>% 
  tor_filter_label_rate_fits(!enough_data) %>% 
  tor_plot_labeling_curves(random_seed = 123)
#> Info: fetching 85 of 421 entries based on filter condition '!enough_data'

Low quality fits

Let’s take a closer look at the label rate estimates of low quality fits and use plotly to make the plot interactive. Mouse over individual data points to see the protein ID.

data_w_calcs %>% 
  tor_filter_label_rate_fits(fit_rse > 0.05) %>% 
  tor_plot_label_rate_error() %>% 
  ggplotly()
#> Info: fetching 66 of 421 entries based on filter condition 'fit_rse > 0.05'

This could likewise be used with tor_plot_labeling_curves, which reveals a little more detail about why the quality is bad (and potentially allows reconsidering the exclusion of, or examining more closely, peptides that clearly behave differently from the rest of the protein). Use the mouseover to see the peptide sequences for each data point.

data_w_calcs %>% 
  tor_filter_label_rate_fits(fit_rse > 0.05) %>% 
  tor_plot_labeling_curves(random_seed = 11) %>% 
  ggplotly()
#> Info: fetching 66 of 421 entries based on filter condition 'fit_rse > 0.05'

High quality fits

Lastly, it is useful to continue with just the high quality fits to analyze the most robust part of the data set. Here we keep everything that had enough_data AND a residual standard error of the fit smaller than 5% (written as two separate statements for clarity, but they could be combined into one). We also narrow the table to just the key data columns going forward.

data_hq <- 
  data_w_calcs %>% 
  tor_filter_label_rate_fits(enough_data) %>% 
  tor_filter_label_rate_fits(
    fit_rse <= 0.05,
    select = c(matches("prot"), matches("rate"), matches("dissipation"))
  )
#> Info: fetching 336 of 421 entries based on filter condition 'enough_data'
#> Info: fetching 270 of 336 entries based on filter condition 'fit_rse <= 0.05', keeping 10 of 18 columns.

Adding Information

Most steps in this information section can be performed in arbitrary order; however, it is generally advisable to follow the flow chart to make sure the few pieces of information that build on one another are available when each function is called (if anything is missing, turnoveR will complain and point to the missing information).

Adding uniprot information

In order to get the most up-to-date protein information, all proteins for the experimental organism (here E. coli) are queried directly from the UniProt online database and matched to the mass spec data. This adds useful information including the recommended gene and protein names and the molecular weight of the protein. If the organism's UniProt taxon ID is not known yet, the tor_fetch_uniprot_species function can be helpful, as shown here.

# look for taxon ID for K12 strain
uniprot_species <- tor_fetch_uniprot_species("strain K12")
#> Info: querying uniprot database for taxa with 'strain K12' in the name... retrieved 5 records
uniprot_species
# retrieve uniprot info for the K12 strain
uniprot_data <- tor_fetch_uniprot_proteins(taxon = 83333)
#> Info: reading uniprot proteins of taxon 83333 from cached file (use read_cache = FALSE to disable)... retrieved 4497 records
uniprot_data %>% head(20)
# add the uniprot information
data_w_uniprot <-
  data_hq %>% 
  tor_add_uniprot_info(uniprot_data)
#> Info: adding uniprot information to mass spec data, joining by 'uniprot_id'...266 records joined successfully, 4 could not be matched with uniprot records
# check on the ones with missing uniprot info
data_w_uniprot %>% filter(missing_uniprot)

Adding protein counts information

The protein_sums.csv or psms.csv file should contain the protein IDs and corresponding counts for all proteins identified in each sample. It is read by tor_read_protein_counts_data in csv format and then added to the data set using tor_add_protein_counts_info, which also calculates the relative protein counts and the relative protein mass (based on the molecular weight of each protein retrieved from UniProt in the previous step). The relative protein abundance information can then be combined with the dissipation and degradation values to calculate a weighted degradation and dissipation rate for each protein. This weighted calculation uses the relative mass of each protein identified in the samples to better describe the overall cellular investment in protein turnover for each protein.

# read protein count info
protein_count <- tor_read_protein_counts_data(file.path(path, "psms.csv"))
#> Info: reading protein counts data from 'psms.csv'... read 913 records
# add to dataset
data_w_counts <- data_w_uniprot %>% 
  tor_add_protein_counts_info(protein_count)
#> Info: protein counts added and weighted rates calculated for 270/270 datasets
# look at some of the data
data_w_counts %>% 
  select(gene, prot_name, prot_rel_mass, starts_with("deg"), starts_with("diss")) %>% 
  head(10) %>% 
  rmarkdown::paged_table()
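
To make the weighting more concrete, the sketch below compares the reported weighted rate against a simple relative-mass scaling (an assumption; the exact formula is internal to tor_add_protein_counts_info, and deg_rate_weighted_check is a hypothetical column added only for this comparison):

# presumed weighting: degradation rate scaled by relative protein mass
data_w_counts %>% 
  mutate(deg_rate_weighted_check = deg_rate * prot_rel_mass) %>% 
  select(prot_name, deg_rate, prot_rel_mass, deg_rate_weighted, deg_rate_weighted_check) %>% 
  head(5)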

Adding KEGG pathway info

Coming soon…

1-step processing

The entire set of operations listed section-by-section above can also be easily performed in a single step by piping (%>%) from one operation to the next.

# base path to the vignette data
path <- file.path("vignettes", "vignette_data")

# growth params
growth_params <- 
  tor_calculate_growth_params(
    flow_rate = 0.46 * 60, # [g/hour]
    flow_rate_se = 0.02 * 60, # estimated error [g/hour]
    volume = 894.4, # [g]
    volume_se = 20 # estimated error [g]
  )

# data processing in one pipeline
data_one_pipe <-
  # read SVM file
  tor_read_svm_data_file(
    filepath = file.path(path, "svm_pred_results_0.03gr.csv")
  ) %>% 
  # spectral quality filtering
  tor_filter_peptides_by_spectral_fit_quality(svm_pred > 0.75) %>% 
  # recode protein ids
  tor_recode_protein_ids(file.path(path, "rename_prot.xlsx")) %>% 
  # add metadata
  tor_add_metadata(
    read_excel(file.path(path, "metadata_CMW.xlsx")), 
    join_by = "sample"
  ) %>% 
  # calculate labeled fraction
  tor_calculate_labeled_fraction() %>% 
  # calculate the label rate
  tor_calculate_label_rate(
    time_col = "hours", 
    min_num_timepoints = 3,
    combine_peptides = TRUE
  ) %>% 
  # degradation rate and dissipation
  tor_calculate_degradation_dissipation(
    growth_rate = growth_params$growth_rate,
    growth_rate_se = growth_params$growth_rate_se
  ) %>% 
  # focus on high quality fits
  tor_filter_label_rate_fits(enough_data & fit_rse <= 0.05) %>% 
  # add uniprot data
  tor_add_uniprot_info(
    tor_fetch_uniprot_proteins(taxon = 83333)
  ) %>% 
  # add protein count info
  tor_add_protein_counts_info(
    tor_read_protein_counts_data(file.path(path, "psms.csv"))
  )
#> Info: successfully read 27220 records from SVM file 'svm_pred_results_0.03gr.csv'
#> Info: kept 6961 of 27220 (25.6%) peptide measurements during spectral fit quality filtering (condition 'svm_pred > 0.75')
#> Info: renamed 5 protein entries for 1 different proteins.
#> Info: adding metadata to mass spec data, joining by 'sample'...8 metadata entries successfully added to 6961 data records, 0 could not be matched to metadata
#> Info: calculated labeled/unlabeled fraction for 6961 peptides
#> Info: processing data for 421 proteins, this may take a few seconds... 336 of the proteins could be fit to a labeling curve, 85 did not have enough time points
#> Info: calculated degradation rate and dissipation for 336 records
#> Info: fetching 270 of 421 entries based on filter condition 'enough_data & fit_rse <= 0.05'
#> Info: reading uniprot proteins of taxon 83333 from cached file (use read_cache = FALSE to disable)... retrieved 4497 records
#> Info: adding uniprot information to mass spec data, joining by 'uniprot_id'...266 records joined successfully, 4 could not be matched with uniprot records
#> Info: reading protein counts data from 'psms.csv'... read 913 records
#> Info: protein counts added and weighted rates calculated for 270/270 datasets

# then continue with plotting and analysis
data_one_pipe %>% tor_plot_label_rate_error() %>% ggplotly()

Analysis

Example: looking at the degradation data

To take a quick first look at the important proteins contributing to protein turnover, it is helpful to list the proteins with the highest weighted degradation rates / dissipation in the experiment. For a list of the most turned-over proteins (regardless of pool size), one would look at the highest dissipation instead.

# get the total degradation and dissipation per generation
data_w_counts %>% 
  summarize(
    total_deg_rate = sum(deg_rate_weighted, na.rm = TRUE),
    total_dissipation = sum(dissipation_weighted, na.rm = TRUE)
  ) %>% 
  rmarkdown::paged_table()
# list top proteins by dissipation
data_w_counts %>% 
  arrange(desc(dissipation)) %>% 
  select(prot_name, everything()) %>% 
  head(20)
# list top proteins by relative mass
data_w_counts %>% 
  arrange(desc(prot_rel_mass)) %>% 
  select(prot_name, everything()) %>% 
  head(20)
# list top proteins by weighted dissipation
data_w_counts %>% 
  arrange(desc(dissipation_weighted)) %>% 
  select(prot_name, everything()) %>% 
  head(20)