The search for cancer-specific protein biomarkers and drug targets can be a daunting task. Like other high-throughput technologies, mass spectrometry generates enormous datasets, and extracting critical information from them requires a great deal of mathematics, statistics, and computing power. Alexey I. Nesvizhskii, Ph.D., an assistant professor of pathology at the University of Michigan Health System, is at the forefront of these efforts, working to close the critical gap between the development of high-throughput proteomics methods and the ability to manage the resulting data and convert it into new biological knowledge.
Mass spectrometry is the technology used by scientists to identify protein biomarkers in clinical specimens. The proteins are broken down into smaller pieces, called peptides, and these peptides are run through the instrument, which measures the mass-to-charge (m/z) ratio of each peptide and its fragments, visualized as a mass spectrum. The resulting mass spectra are then compared against sequence databases to find the best matching peptides, allowing scientists to infer which proteins are present in the specimen based on the peptide identifications.
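As a rough illustration of the mass-to-charge ratio described above, the sketch below computes the theoretical monoisotopic mass of a peptide and its m/z at several charge states. The residue masses are standard monoisotopic values; the function names and the example sequence are hypothetical and are not taken from any tool mentioned in this article.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # one water is added for the free N- and C-termini
PROTON = 1.007276   # mass of the proton that carries each positive charge


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """Theoretical m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


# Hypothetical example sequence, used only to show the calculation.
for z in (1, 2, 3):
    print(f"PEPTIDE at charge {z}+: m/z = {mz('PEPTIDE', z):.4f}")
```

A search engine performs this kind of calculation for every candidate peptide in the sequence database and compares the predicted values with the observed spectrum.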
"In a typical experiment, there are often hundreds of thousands, if not millions, of mass spectra collected, but often only 20% to 50% of those collected spectra can be identified. In most cases, I would say it's closer to 20%," said Nesvizhskii.
Poor quality spectra, or those with a low signal-to-noise ratio, are one of the biggest reasons that the majority of spectra cannot be interpreted. More importantly, there are also many high quality spectra that cannot be identified, meaning they do not have a confident match to a peptide in a sequence database. According to Nesvizhskii, this discrepancy is very important because these spectra may represent the very cancer-specific novel proteins that scientists are looking for, including those generated by alternative forms of a protein (known as alternative isoforms) or unanticipated posttranslational modifications (chemically modified proteins).
Novel protein isoforms are not annotated in existing protein sequence databases precisely because they are novel. "There are certainly a lot of alternative splicing events taking place [in cancer samples], so there is a lot of interest in correcting the existing genome annotations and identifying novel peptides using mass spectrometry data," said Nesvizhskii.
Proteins with unanticipated posttranslational modifications cannot be identified using existing spectral library search tools because these chemical modifications (e.g., phosphorylation) shift the mass of the modified peptide, producing a different mass spectrum. If mass shifts are not accounted for in the computational analysis, then the resulting high quality spectra cannot be identified.
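The effect of an unaccounted-for mass shift can be sketched in a few lines. In the toy example below, an observed precursor mass is compared against candidate peptide masses within a tolerance, first with no modifications allowed and then with phosphorylation (a monoisotopic shift of about +79.966 Da) included. The candidate names, masses, and tolerance are hypothetical values chosen for illustration; real search engines also score fragment ions, which this sketch omits.

```python
PHOSPHO_SHIFT = 79.96633  # monoisotopic mass added by phosphorylation (HPO3), Da


def match_precursor(observed_mass, candidates, allowed_shifts=(0.0,), tol_da=0.02):
    """Return (peptide, shift) pairs whose mass plus shift fits the observation.

    candidates: dict mapping a peptide label to its unmodified mass in Da.
    allowed_shifts: mass shifts the search is told to consider.
    """
    hits = []
    for peptide, mass in candidates.items():
        for shift in allowed_shifts:
            if abs(observed_mass - (mass + shift)) <= tol_da:
                hits.append((peptide, shift))
    return hits


# Hypothetical candidate list and a spectrum from a phosphorylated peptide.
candidates = {"peptide_A": 1000.5000, "peptide_B": 1200.6500}
observed = candidates["peptide_A"] + PHOSPHO_SHIFT

print(match_precursor(observed, candidates))
print(match_precursor(observed, candidates, allowed_shifts=(0.0, PHOSPHO_SHIFT)))
```

The first call finds no match even though the spectrum is of high quality; the second call, which allows the shift, recovers the modified peptide.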
An Iterative Approach
Nesvizhskii and colleagues have developed an iterative computational approach that more aggressively interrogates mass spectrometry data, designed to help identify the unassigned high quality spectra that could represent our next cancer biomarker or drug target. Using this approach, the unassigned spectra are subjected to multiple stages of database searching with different search parameters, spectral library searching, blind searching for posttranslationally-modified peptides, and genomic database searching for novel isoforms. Several studies have been conducted to demonstrate that such a strategy can significantly increase the number of spectra that are assigned a peptide and that biologically interesting new insights can be gained from existing datasets [PMID: 16352522; 20455209].
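The control flow of such a multi-stage strategy can be sketched as follows: each stage sees only the spectra that earlier stages left unassigned. This is a minimal illustration and not the lab's software; the stage names mirror the description above, while the lookup tables standing in for real search engines, the spectrum identifiers, and the peptide labels are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

SearchFn = Callable[[str], Optional[str]]


def iterative_search(
    spectrum_ids: List[str], stages: List[Tuple[str, SearchFn]]
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Run stages in order, passing only unassigned spectra to the next stage."""
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = list(spectrum_ids)
    for stage_name, search in stages:
        still_unassigned = []
        for sid in remaining:
            peptide = search(sid)  # None means "no confident match"
            if peptide is None:
                still_unassigned.append(sid)
            else:
                assigned[sid] = (stage_name, peptide)
        remaining = still_unassigned
    return assigned, remaining


# Toy stand-ins for search engines: a dict lookup per stage (hypothetical data).
stages = [
    ("database search", {"s1": "PEPTIDEA"}.get),
    ("spectral library search", {"s2": "PEPTIDEB"}.get),
    ("blind PTM search", {"s3": "PEPTIDEC+80"}.get),
    ("genomic database search", {"s4": "NOVELPEPTIDE"}.get),
]

assigned, unassigned = iterative_search(["s1", "s2", "s3", "s4", "s5"], stages)
print(assigned)
print("still unassigned:", unassigned)
```

Recording which stage produced each assignment makes it easy to see how many identifications each additional search contributes.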
"This work also has implications for targeted, or quantitative, proteomics," said Nesvizhskii. Targeted proteomics is the method used to target and measure the absolute quantity of specific proteins/peptides in complex samples, and is frequently used to verify cancer biomarkers. A protein is considered a biomarker only if its concentration reflects the presence of disease or disease severity (e.g., prostate-specific antigen). For this reason, biomarker verification is a critical decision point (go/no go) in the protein biomarker pipeline.
Multiple reaction monitoring is an analytical technique used for targeted proteomics. Typically, researchers select unmodified peptides to monitor during these experiments in order to gauge the quantity of proteins present in the sample. However, if the protein carries many posttranslational modifications, or artificial modifications introduced through sample handling, and these modifications are not accounted for in the computational analysis, then protein abundance will appear lower than it really is in the sample. "So if we only base our quantification on unmodified peptides and we do not understand what other kinds of [modified] peptides we can expect in a sample, then we are reducing the accuracy of our protein level analysis from the start," noted Nesvizhskii.
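A small arithmetic sketch shows why quantifying only the unmodified peptide underestimates abundance. The amounts below are invented for illustration and are in arbitrary units; the function name is hypothetical.

```python
def apparent_vs_true(forms, monitored="unmodified"):
    """Compare the monitored form's amount with the total across all forms.

    forms: dict mapping a peptide form to its amount (arbitrary units).
    Returns (apparent amount, true total, fraction of the total observed).
    """
    apparent = forms[monitored]
    total = sum(forms.values())
    return apparent, total, apparent / total


# Hypothetical sample: 40% of the peptide copies carry some modification.
forms = {"unmodified": 60.0, "phosphorylated": 30.0, "oxidized": 10.0}
apparent, total, fraction = apparent_vs_true(forms)
print(f"apparent = {apparent}, true total = {total}, observed fraction = {fraction:.0%}")
```

In this made-up case, monitoring only the unmodified peptide would report 60 units when 100 are present, so the protein would appear 40% less abundant than it is.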
"I should say that even though we are able to identify a good number of those previously unassigned high quality spectra, we still cannot identify all spectra because there is a lot in the data that we don't understand," said Nesvizhskii. His team continues to improve the rigor and accuracy of their approach through analysis of very large raw datasets that have been made available to the public (e.g., Tranche).
Statistical Analysis of Interactomes (SAINT)
In addition to finding cancer-related proteins 'hidden' in large datasets, Nesvizhskii is also using a computational approach to reconstruct protein-protein interaction networks. Cancer proteins rarely work in isolation; they typically have partners in crime through direct interactions with other key proteins in a network. It is important to understand how these protein complexes and interaction networks differ in cancer cells. Critical protein-protein interactions could serve as the next drug target for the development of anticancer agents.
Using this approach, the protein of interest, called the bait, is affinity purified from the sample. The affinity purification process preserves protein-protein interactions so the bait's interaction partners, called the prey, are also purified. The purified sample, containing the bait and prey, is analyzed using mass spectrometry, the proteins in the complex are identified, and the protein interaction networks are then reconstructed.
"We developed a computational tool called SAINT, which stands for Statistical Analysis of Interactomes, that can analyze this type of affinity purification mass spectrometry data and assign a confidence score to each bait/prey pair," said Nesvizhskii. SAINT is able to identify which protein interactions are false positives due to non-specific binding.
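The general idea of assigning a confidence score to a bait/prey pair can be illustrated with a deliberately simplified two-component model. The sketch below is not the SAINT implementation. It assumes, for illustration only, that the evidence for each pair is a spectral count, that counts for true and false interactions follow Poisson distributions with known means, and that a fixed prior applies; all parameter values and function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def interaction_probability(count, lam_true, lam_false, prior_true=0.1):
    """Posterior probability that a bait/prey pair is a true interaction.

    lam_true / lam_false: assumed mean counts for true interactors and for
    non-specific binders (the latter could be estimated from control runs).
    """
    p_true = prior_true * poisson_pmf(count, lam_true)
    p_false = (1.0 - prior_true) * poisson_pmf(count, lam_false)
    return p_true / (p_true + p_false)


# Hypothetical spectral counts for three preys purified with one bait.
for prey, count in [("prey_1", 20), ("prey_2", 6), ("prey_3", 1)]:
    score = interaction_probability(count, lam_true=15.0, lam_false=2.0)
    print(f"{prey}: count={count}, P(true interaction)={score:.3f}")
```

Preys observed at levels typical of background binding receive scores near zero, while preys observed far above background receive scores near one, which is the kind of ranking that lets likely false positives be filtered out.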
SAINT was first applied to the systematic identification and reconstruction of the interaction network involving protein kinases and phosphatases in budding yeast [PMID: 20489023]. The network reconstructed provided a wealth of information for subsequent biological analysis. "The application to the yeast dataset was published in Science, but SAINT is organism-independent. We currently have a lot of ongoing collaborations involving human cancer datasets. This analysis of protein-protein interaction networks using mass spectrometry-based proteomics is a very powerful and successful application, and a very large effort in our lab is to develop tools for processing this type of data."
A number of open-source mass spectrometry software tools developed by the Nesvizhskii lab, including SAINT, can be found at: http://www.nesvilab.org/software.html.