Advances in high-throughput proteomic protocols, which analyze tens of thousands of protein fragments in a single experiment, generate an enormous amount of information about the proteins present in the cell at a given time. Yet the volume of data presents new challenges. Nathan Edwards, Ph.D., assistant professor at Georgetown University, is working to meet them by applying tools from mathematics, computer science, and statistics to massive proteomics datasets to extract as much biological insight as possible.
Protein scientists use mass spectrometry to identify the proteins present in biological samples. The so-called bottom-up protein identification workflows analyze protein fragments, called peptides, using tandem mass spectrometry, generating hundreds of thousands of mass spectra per study. Computer programs then analyze the resulting peptide fragmentation spectra to identify the peptide sequences and ultimately the proteins present in the biological sample.
In particular, researchers are interested in proteins that are over- or under-expressed in disease samples, contain unexpected mutations, or undergo alternative splicing, in which a gene produces multiple protein variants (isoforms), because these findings could lead to new disease-specific biomarkers and drug targets. "Protein isoforms are incredibly important because they represent a change in the cellular machinery. They can provide evidence for the action of a disease in general and cancer in particular," noted Edwards. Although evidence of these novel protein variants, in the form of peptide fragmentation spectra, is often captured by bottom-up protein identification workflows, software tools often miss these important peptides.
Edwards and his colleagues develop software to analyze mass spectrometry data that can "find" these critical peptides and proteins lost in the data. "I'm trying to push our informatics tools to extract more information from these datasets, so that you can ask, 'What version of each protein is there?' Although the mass spectra contain evidence of protein isoforms, our tools are often not good enough to find it," said Edwards.

Losing Evidence of Peptides
According to Edwards, scientists lose evidence of novel peptides for a variety of reasons. Four major barriers to effective identification include:
Maximizing Peptide Identifications
Edwards is overcoming these barriers to protein identification by developing three freely available data resources and software tools: the Peptide Sequence Database, the PepArML Meta-Search Engine, and the PeptideMapper Web-Service.
The Peptide Sequence Database, available for human, mouse, rat, and zebrafish, provides an inclusive set of peptide sequences derived from protein, mRNA, and expressed sequence tag (EST) sequence resources. Edwards uses a compression strategy to eliminate peptide sequence duplication, reducing database size and search times by a factor of 40. Unlike conventional protein sequence databases, which include only high-quality, well-understood protein sequences, Edwards' database also includes peptides supported by only a few ESTs. ESTs, which represent portions of expressed genes, account for the majority of experimental evidence for alternative splicing in humans. To date, over 8 million ESTs have been generated from human samples and their sequences deposited in NCBI's GenBank repository. Now scientists have the means to discover novel peptides in their datasets.
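The article does not describe how the compression strategy works, so the following is only a minimal sketch of the underlying idea: when peptides from several sequence sources are pooled, each distinct peptide can be stored once along with the sources that support it. The function names, the simplified trypsin cleavage rule, and the toy sequences are all assumptions made for illustration and are not taken from Edwards' software.

```python
from collections import defaultdict


def tryptic_peptides(sequence, min_length=5):
    """Split a protein sequence with a simplified trypsin rule:
    cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing piece without K/R
    return [p for p in peptides if len(p) >= min_length]


def build_peptide_index(sources):
    """Store each distinct peptide once, with the set of sources supporting it.

    `sources` maps a source name to a list of amino-acid sequences."""
    index = defaultdict(set)
    for source_name, sequences in sources.items():
        for sequence in sequences:
            for peptide in tryptic_peptides(sequence):
                index[peptide].add(source_name)
    return dict(index)


# Hypothetical toy sequences, invented for this example.
example_sources = {
    "protein": ["MSTAYIAKQISFVKSHFSR"],
    "est": ["QISFVKSHFSRGGLDEVK"],  # imagined translation of an EST
}

peptide_index = build_peptide_index(example_sources)
for peptide, supporting in sorted(peptide_index.items()):
    print(peptide, sorted(supporting))
```

In this toy case the two sequences yield six peptides in total but only four distinct ones, and the peptide seen only in the EST-derived sequence is kept rather than discarded.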
The second software tool developed by Edwards is the Peptide Arbiter by Machine Learning (PepArML) Meta-Search Engine, pronounced "peppermill." PepArML allows researchers to carry out peptide identification searches with seven different search engines at the same time, and combines the results to boost peptide identification confidence. "This tool uses machine learning to figure out how to balance the contributions from each of the search engines. It is able to determine which search engine should be weighted most highly for a particular dataset," said Edwards. "This approach of taking many different tools and combining the predictions is exactly the approach that won the Netflix Prize."
PepArML is designed to be able to use computational resources from wherever researchers can get them. Users who want to search their spectra can use the shared resources provided by the Edwards lab cluster, but they can also contribute their own in-house resources or utilize Amazon cloud computing. "We're able to do much more thorough searches because we can recruit a much wider array of computational resources to the task of getting the searches done."
The real strength of PepArML comes from taking advantage of search engine agreement. When the scores from each search engine are poor, we don't expect them to agree, but if and when they do, it is more likely that the result is real and not random noise. "We get two to three times more peptide identifications than we would from any one search engine alone without giving up any specificity," said Edwards.
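The agreement idea can be illustrated with a simple vote count. This is not PepArML's algorithm, which according to Edwards uses machine learning to weight the engines; the sketch below swaps in plain majority voting to show why agreement carries information. The engine names, spectrum identifiers, and peptides are hypothetical.

```python
from collections import Counter


def consensus_identifications(results, min_agreement=2):
    """Keep, for each spectrum, the peptide reported by the most engines,
    provided at least `min_agreement` engines report it.

    `results` maps an engine name to {spectrum_id: top-ranked peptide}."""
    votes = {}
    for hits in results.values():
        for spectrum_id, peptide in hits.items():
            votes.setdefault(spectrum_id, Counter())[peptide] += 1

    consensus = {}
    for spectrum_id, counter in votes.items():
        peptide, count = counter.most_common(1)[0]
        if count >= min_agreement:
            consensus[spectrum_id] = (peptide, count)
    return consensus


# Hypothetical top hits from three imaginary search engines.
example_results = {
    "engine_a": {"s1": "QISFVK", "s2": "SHFSR", "s3": "MSTAYIAK"},
    "engine_b": {"s1": "QISFVK", "s2": "GGLDEVK", "s3": "GGLDEVK"},
    "engine_c": {"s1": "QISFVK", "s2": "SHFSR"},
}

for spectrum_id, (peptide, count) in consensus_identifications(example_results).items():
    print(spectrum_id, peptide, f"{count} engines agree")
```

Spectrum `s3`, where the two reporting engines disagree, is dropped. A learned combiner goes further than this by also using each engine's scores and by weighting engines differently for each dataset.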
The third tool, PeptideMapper, allows scientists to take any peptide sequence from human, mouse, or rat and project it back onto its protein, transcript, and genomic evidence. "It's really important to be able to map these peptides back to their sequence evidence and a genome browser so that you can see the peptides and their sequence evidence in the context of gene models, single nucleotide polymorphisms (SNPs) and predicted full-length transcripts," said Edwards.
This is especially important for novel peptides, because apparently good, significant peptide identifications of variant sequences are often false positives. "There are a lot of ways in which masses and search engines can collude to make you think that you're about to win a Nobel Prize when in fact the masses are adding up in just the right way, and it isn't a real peptide," noted Edwards. PeptideMapper is a quality assurance tool, helping scientists evaluate the strength of the sequence evidence for novel peptide sequences.
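The simplest form of projecting a peptide back onto its sequence evidence is an exact-match lookup against a set of protein sequences. The sketch below covers only that protein-level step, under the assumption of a small in-memory collection; it does not reproduce PeptideMapper, which also reports transcript and genomic evidence. The function name, accessions, and sequences are hypothetical.

```python
def map_peptide(peptide, proteins):
    """Return (accession, start, end) for every exact occurrence of
    `peptide`, using 1-based inclusive coordinates.

    `proteins` maps an accession to its amino-acid sequence."""
    hits = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((accession, start + 1, start + len(peptide)))
            start = sequence.find(peptide, start + 1)  # allow overlapping hits
    return hits


# Hypothetical sequences, invented for this example.
example_proteins = {
    "PROT_A": "MSTAYIAKQISFVKSHFSR",
    "PROT_B": "QISFVKSHFSRGGLDEVK",
}

for peptide in ("QISFVK", "GGLDEVK", "NOTFOUND"):
    print(peptide, map_peptide(peptide, example_proteins))
```

A peptide that maps to several well-characterized sequences has stronger support than one that maps to a single sequence or to none, and that is the kind of evidence check described above.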
All three tools are becoming increasingly popular among members of the proteomics community. "New tools always take some time to get mindshare in the community," said Edwards. "That's something that's been changing steadily for the better over the last year, and I'm starting to hit a critical mass." The data, and the field, will be better for it.