Dysregulated protein phosphorylation is recognized as a key feature of several diseases, especially cancer. In recent years, phosphoproteomic research has revealed novel, effective biomarkers and drug targets for disease prognosis and treatment. Tandem mass spectrometry (MS/MS)-based phosphoproteomics provides a high-throughput method to study protein phosphorylation in complex biological samples. However, translating phosphoproteomic data into relevant biological and clinical insights relies on effective data analysis. A variety of computational tools enable researchers to identify peptide-spectrum matches and score phosphorylation sites, but applying multiple computational pipelines to the same dataset often produces inconsistent results. It is therefore necessary to compare the pipelines objectively in terms of the number and quality of the identifications they provide. In a recent CPTAC study published in Molecular & Cellular Proteomics, researchers introduced three features derived from deep learning (a powerful subset of machine learning) as potential metrics for benchmarking computational pipelines for phosphopeptide identification.
While efforts have been made to compare the performance of computational pipelines using synthetic phosphopeptide datasets, evaluations involving real-world patient datasets have been limited. To assess the performance of computational pipelines, CPTAC investigators first needed to determine which features to benchmark. Deep learning is a promising way to learn patterns from large datasets curated from the massive amounts of MS/MS proteomic data available. The CPTAC team investigated three deep-learning-derived features as potential evaluation metrics: phosphosite probability, Delta RT, and spectral similarity. The first feature is predicted phosphosite probability, which can be computed by MusiteDeep with high accuracy. The second feature is Delta RT, defined as the absolute difference between the observed retention time (RT) and the RT predicted by AutoRT. The third feature is spectral similarity, defined as the Pearson’s correlation coefficient (PCC) between the observed spectrum and the spectrum predicted by pDeep2. The team used a synthetic peptide dataset to customize the AutoRT and pDeep2 models and to evaluate how well Delta RT and spectral similarity discriminate between correct and incorrect peptide-spectrum matches (PSMs). Phosphosite probability predicted by MusiteDeep is independent of experimental conditions and could be used directly without customization. In essence, the team used deep-learning tools to analyze synthetic peptide data that had been modified to include known “negative” results. Delta RT (the output of the experiment-specific AutoRT model) and spectral similarity (the PCC output of the experiment-specific pDeep2 model) provided excellent discrimination between positives and several different types of negatives.
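The two experiment-dependent metrics defined above can be illustrated with a minimal sketch. It assumes the predicted RT and predicted fragment intensities have already been produced by some model; it does not call AutoRT or pDeep2, and the function names and toy numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt


def delta_rt(observed_rt, predicted_rt):
    """Absolute difference between observed and predicted retention time."""
    return abs(observed_rt - predicted_rt)


def spectral_similarity(observed, predicted):
    """Pearson's correlation coefficient between two intensity vectors.

    Both vectors are assumed to hold intensities for the same fragment
    ions, in the same order.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        # PCC is undefined for a constant vector; this sketch reports 0.0.
        return 0.0
    return cov / sqrt(var_obs * var_pred)


# Hypothetical values for a single PSM (not taken from the study).
observed_intensities = [0.0, 0.2, 1.0, 0.5, 0.1]
predicted_intensities = [0.0, 0.4, 2.0, 1.0, 0.2]

rt_error = delta_rt(observed_rt=42.5, predicted_rt=40.0)
similarity = spectral_similarity(observed_intensities, predicted_intensities)
```

Under this reading, a correct PSM would be expected to show a small Delta RT and a spectral similarity close to 1, while an incorrect one would tend toward a larger Delta RT and a lower correlation.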
All three evaluation metrics were then employed to benchmark four computational pipelines of interest on a tandem mass tag (TMT) dataset from the CPTAC human Uterine Corpus Endometrial Carcinoma (UCEC) study. The CPTAC UCEC phosphoproteomic data were searched against the RefSeq human protein database using the four pipelines (MS-GF+/Ascore, CDAP, MaxQuant, and FragPipe). To assess the quality of the pipelines, the experiment-specific AutoRT model, the experiment-specific pDeep2 model, and MusiteDeep were applied to PSMs uniquely identified by each pipeline. Delta RTs, spectral similarities, and predicted phosphosite probabilities for these groups of PSMs were then compared across pipelines. The results confirmed the known variability in pipeline performance: among localized phosphopeptides and PSMs, only 22.3% and 11.4%, respectively, were reported by all four pipelines. FragPipe identified the most localized phosphopeptides and PSMs. CDAP identified the most phosphopeptides, but after extremely conservative localization probability filtering, it reported the fewest localized phosphopeptides. The proportion of unlocalized phosphopeptides from CDAP was 56%, much higher than the proportions from the other three pipelines, which ranged from 22% to 31%. These results encourage broader adoption of the relatively new FragPipe in future phosphoproteomic studies, but they also highlight some positive features of the other pipelines. For example, MaxQuant identified fewer PSMs than FragPipe, but it outperformed MS-GF+/Ascore and CDAP and features a user-friendly interface.
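The overlap comparison described above can be sketched with plain set operations. This is an illustration, not the study's code: the identifiers are toy placeholders, the helper names are hypothetical, and the sketch assumes that the percentage reported by all pipelines is taken relative to the union of identifications across pipelines.

```python
def shared_fraction(ids_by_pipeline):
    """Fraction of all identifications that every pipeline reported.

    Assumption: the denominator is the union across pipelines.
    """
    id_sets = [set(ids) for ids in ids_by_pipeline.values()]
    if not id_sets:
        return 0.0
    all_ids = set().union(*id_sets)
    if not all_ids:
        return 0.0
    shared = set.intersection(*id_sets)
    return len(shared) / len(all_ids)


def unique_ids(ids_by_pipeline):
    """Identifications reported by exactly one pipeline, keyed by pipeline."""
    unique = {}
    for name, ids in ids_by_pipeline.items():
        others = set()
        for other_name, other_ids in ids_by_pipeline.items():
            if other_name != name:
                others |= set(other_ids)
        unique[name] = set(ids) - others
    return unique


# Toy identifiers standing in for localized phosphopeptides or PSMs.
toy_results = {
    "MS-GF+/Ascore": {"p1", "p2", "p3"},
    "CDAP": {"p1", "p2", "p4"},
    "MaxQuant": {"p1", "p5"},
    "FragPipe": {"p1", "p2", "p6", "p7"},
}

fraction_shared = shared_fraction(toy_results)
only_in = unique_ids(toy_results)
```

The per-pipeline sets in `only_in` correspond to the uniquely identified PSMs that would then be scored with Delta RT, spectral similarity, and predicted phosphosite probability.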
Perhaps most importantly, this study demonstrated that Delta RT and spectral similarity are effective metrics for systematic benchmarking of computational pipelines for phosphoproteomic data analysis. The team specifically notes that the primary goal of the study was not to identify the best pipeline or the best parameter setting, but to demonstrate a benchmarking method that enables researchers to perform similar comparisons in their own studies. When asked about the significance of this research, study leader Dr. Bing Zhang said, “The benchmark metrics demonstrated in this study will enable users to select computational pipelines and parameters for the analysis of phosphoproteomics data and will offer guidance for developers to improve computational methods.” Beyond serving as benchmarking metrics, the team expects that these deep-learning-derived features may also be used directly to improve phosphopeptide identification and site localization algorithms in the future.