Computational Tools

To advance precision medicine by understanding aspects of the molecular complexity of cancer, the CPTAC program develops novel approaches to process large-scale proteogenomic data sets.

An important part of the CPTAC mission is to make data and tools available and accessible to the greater research community. Here, OCCPR has curated a collection of computational tools developed and/or utilized by CPTAC for processing and analysis of proteogenomic data. Although NCI does not endorse any specific tool, this list gives researchers a gateway to access bioinformatic tools that are useful for analyzing and/or visualizing large-scale proteomic and proteogenomic datasets generated through high-throughput screens and other approaches.

Data Access

CPTAC (Python package)
Accessing and interacting with CPTAC data in Python.

GDC
Harmonizing DNA sequences from CPTAC whole genome sequencing, whole exome sequencing, and RNA-seq using GDC pipelines.

PDC
Making cancer-related proteomic datasets easily accessible to the public and facilitating multi-omic integration through interoperability with other resources.

TCIA
Hosting both the radiology and pathology imaging data generated by CPTAC samples.

Data Processing and QC

CDAP
The CPTAC Common Data Analysis Platform for LC-MS/MS data.

COSMO
Identifying and correcting sample mislabeling in multi-omics data.

DREAM AI
An ensemble-based imputation algorithm for labelled proteomics data that resulted from the NCI-CPTAC DREAM Proteogenomics Challenge (2016) and the post-Challenge community effort.

MassQC
MassQC is an online quality control tool that diagnoses liquid chromatography-mass spectrometry instrument hardware to ensure the instrument is running in a reproducible manner. Using data from CPTAC inter-lab studies, the National Institute of Standards and Technology developed a number of metrics to assess instrument performance, and ProteomeSoftware subsequently built a graphical user interface to commercialize this tool.

MS-PyCloud
MS-PyCloud is a cloud computing-based pipeline for proteomic and glycoproteomic data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignments to peptide sequences, false discovery rate estimation, protein inference, quantitation of global protein levels, and specific glycan-modified glycopeptides as well as other modification-specific peptides such as phosphorylation, acetylation, and ubiquitination.
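
The false discovery rate estimation step mentioned above is commonly done with a target-decoy strategy. The following is a minimal sketch of that general technique, not MS-PyCloud's own implementation; the function name `target_decoy_fdr` and the `(score, is_decoy)` input layout are assumptions made for illustration.

```python
def target_decoy_fdr(psms):
    """Estimate q-values for peptide-spectrum matches (PSMs).

    psms: list of (score, is_decoy) tuples, where a higher score is better.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    Hypothetical helper for illustration only.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR at each score threshold: decoys accepted / targets accepted.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, q_values)]


if __name__ == "__main__":
    example = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    for row in target_decoy_fdr(example):
        print(row)
```

Filtering the returned list at a chosen q-value cutoff (for example 0.01) gives the set of accepted identifications.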

MSInspector
MSInspector is a Python program for quality evaluation of the five assay characterization experiments outlined in the CPTAC Assay Portal guidance document. MSInspector enables researchers to test their Skyline files for statistical calculation and data visualization through the built-in R scripts. The report file describes the details of any errors.

OmicsEV
Comparing and evaluating data matrices generated from the same omics dataset using different tools, algorithms, or parameter settings.

Panorama
Panorama is a web application for storing, sharing, analyzing, and reusing targeted assays created and refined with Skyline. Panorama allows laboratories to store and organize curated results contained in Skyline documents with fine-grained permissions, which facilitates secure sharing of published and unpublished data via a web-browser interface.

Philosopher
A complete toolkit for shotgun proteomics data analysis.

Skyline
Skyline is a freely available, open-source Windows client application for building Selected Reaction Monitoring (SRM) / Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM), DIA/SWATH and targeted DDA quantitative methods and analyzing the resulting mass spectrometer data. Its flexible configuration supports All Molecules.

Data Analysis

Appyters
Appyters turn Jupyter Notebooks into fully functional standalone web-based bioinformatics applications. Appyters present to users an entry form enabling them to upload their data and set various parameters for a multitude of data analysis workflows.

ARHT
The ARHT R package performs the Adaptable Regularized Hotelling's T^2 test (ARHT) for pathway analysis. Both one-sample and two-sample mean tests are available with various probabilistic alternative prior models.

AutoRT
A deep learning-based peptide retention time prediction tool.

BayesDeBulk
BayesDeBulk is a new method for estimating the immune/stromal cell composition based on bulk proteomic and gene expression data. BayesDeBulk utilizes the information of known cell-type-specific markers without requiring their absolute abundance levels as prior knowledge.

Black Sheep
Outlier analysis of proteogenomic datasets to identify samples with aberrantly high or low levels of genes, proteins or PTM sites.
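
As a rough illustration of this kind of outlier analysis, the sketch below flags samples whose value for a single gene, protein, or PTM site lies far from the cohort median. The median plus or minus 1.5 times the interquartile range rule, the function name `flag_outliers`, and the dictionary input are assumptions for illustration and may differ from what Black Sheep actually does.

```python
from statistics import median, quantiles


def flag_outliers(values_by_sample, iqr_multiplier=1.5):
    """Flag samples with aberrantly high or low values for one feature.

    values_by_sample: dict mapping sample ID -> abundance for one feature.
    A sample is 'high' if its value exceeds median + multiplier * IQR and
    'low' if it falls below median - multiplier * IQR (assumed rule).
    """
    values = list(values_by_sample.values())
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    spread = q3 - q1
    upper = center + iqr_multiplier * spread
    lower = center - iqr_multiplier * spread
    return {
        "high": [s for s, v in values_by_sample.items() if v > upper],
        "low": [s for s, v in values_by_sample.items() if v < lower],
    }


if __name__ == "__main__":
    abundances = {f"S{i}": v for i, v in enumerate([1, 2, 3, 4, 5, 6, 7, 8, 100], 1)}
    print(flag_outliers(abundances))
```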

ChEA3
ChIP-X Enrichment Analysis 3 (ChEA3) is a transcription factor enrichment analysis tool that ranks TFs associated with user-submitted gene sets. The ChEA3 background database contains a collection of gene set libraries generated from multiple sources including TF-gene co-expression from RNA-seq studies, TF-target associations from ChIP-seq experiments, and TF-gene co-occurrence computed from crowd-submitted gene lists.

CoPheeMap and CoPheeKSA
CoPheeMap performs supervised machine learning on phosphoproteomics data to construct a co-regulated phosphosite network, while CoPheeKSA leverages CoPheeMap and other features to predict kinase-substrate associations.

CPTAC-BRCA2016
Interactive visualization of retrospective CPTAC-BRCA multi-omics data.

CPTAC-BRCA2020
Interactive visualization of prospective CPTAC-BRCA multi-omics data.

CPTAC-LUAD2020
Interactive visualization of CPTAC-LUAD multi-omics data.

CustomProDB
Generating customized databases from DNA and RNA sequencing data for proteomics search.

DAGBagM
DAGBagM is a tool for learning a directed acyclic graph (DAG) from -omics data to detect causal biomarkers (continuous variables) of clinical outcomes (binary variables). It uses a score-based approach coupled with bootstrap aggregation to jointly model the continuous and binary nodes using appropriate distributions. DAGBagM can also incorporate prior information on edge directions.

DeepRescore
An immunopeptidomics data analysis tool that leverages deep learning prediction to improve peptide identification.

DeepRescore2
A phosphoproteomics data analysis tool that leverages deep learning prediction to improve phosphopeptide identification and phosphosite localization.

diaTracer
diaTracer is a new DIA (data-independent acquisition) computational tool designed to process the three-dimensional diaPASEF spectra by directly detecting precursor and fragment ion features without dependency on spectral libraries. Seamlessly integrated into the widely used FragPipe computational platform, diaTracer offers a spectrum-centric solution that enhances the processing of diaPASEF data and facilitates non-specific and unrestricted post-translational modification (PTM) searches.

FragPipe
A suite of computational tools enabling comprehensive and flexible analysis of mass spectrometry-based proteomics data.

FragPipe-Analyst
FragPipe-Analyst is an easy-to-use, interactive web application developed to perform various computational analyses and to visualize quantitative mass spectrometry-based proteomic datasets processed using the FragPipe computational pipeline. It is compatible with the data-dependent acquisition label-free quantification, TMT (tandem mass tag), and DIA (data-independent acquisition) quantification workflows in FragPipe.

FunMap
A machine learning tool that leverages multi-omics datasets, such as proteomics and RNASeq, to construct functional networks through XGBoost.

HLAProphet
In stark contrast to the critical roles of human leukocyte antigen (HLA) proteins in health and disease, the lack of techniques for HLA protein quantification represents a significant impediment to basic, translational, and clinical research. We present HLAProphet, an algorithm that provides personalized allele-level quantification of class I and class II proteins from standard mass spectrometry-based (MS) proteomics data. We show that HLAProphet triples the number of peptide identifications and produces highly quantitative measurements of gene-level and allele-level HLA protein abundances. HLAProphet demonstrates excellent concordance with RNA expression and enables detection of HLA loss of heterozygosity (LOH) at the protein level.

HotPho
Identifying spatially interacting phosphorylation sites and mutations by co-clustering somatic mutations and phosphosites.

iJRF
The iJRFNet R package includes functions for estimating co-expression networks from global proteomic, RNA-seq, and post-translational modification data. It allows the joint learning of multiple related networks while borrowing information from existing databases such as protein-protein interactions and knock-down experiments.

IonQuant
IonQuant is a label-free quantification tool for shotgun proteomics. It supports timsTOF PASEF and non-timsTOF (e.g. Orbitrap) data, as well as matching-between-runs (MBR) and light/heavy chemical labeling.

iProFun
A statistical method to characterize functional consequences of DNA alterations in tumors by jointly modeling >5 types of omics data.

iProFun Portal
iProFun has been performed on several CPTAC data sets, including clear cell renal cell carcinoma, lung adenocarcinoma, glioblastoma, and pediatric brain tumors. The CPTAC iProFun results database can be explored using the CPTAC iProFun portal, a web application that can be used with any modern Internet browser. A user specifies a cohort of interest and a list of genes, then they receive an interactive table for investigating the presence of CNA, methylation, and mutation perturbations among the input genes.

iProMix
This tool aims at analyzing cell-type-specific gene-gene associations using bulk proteogenomics profiles. It decomposes proteogenomics data for the estimation and inference of cell-specific gene-gene correlation through a mixture model at the gene and pathway levels. An application to the CPTAC LUAD proteomic data of adjacent normal lung samples identified interferon α/γ response pathways associated with ACE2 protein abundances in epithelial cells.

iRafNet
The iRafNet R package implements a random forest-based algorithm that integrates prior information from existing databases, such as protein-protein interaction databases and knock-out experiments, when estimating large co-expression networks based on protein/gene expression profiles.

JRF
A Random Forest based approach for constructing multiple co-expression gene/protein networks based on high-dimensional proteogenomic data sets.

KEA3
Kinase Enrichment Analysis 3 (KEA3) is a webserver application that infers overrepresentation of upstream kinases whose putative substrates are in a user-inputted list of proteins. KEA3 can be applied to analyze data from phosphoproteomics and proteomics studies to predict the upstream kinases responsible for observed differential phosphorylations. The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data.
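
Overrepresentation of a kinase's substrates in a query list can be scored with a one-sided hypergeometric (Fisher's exact) test. The sketch below shows that general calculation under stated assumptions; the function name and arguments are hypothetical and it is not presented as KEA3's own code or API.

```python
from math import comb


def overrepresentation_p(universe_size, set_size, query_size, overlap):
    """One-sided hypergeometric p-value for gene set overrepresentation.

    universe_size: number of proteins in the background.
    set_size: number of putative substrates of one kinase in the background.
    query_size: number of proteins in the user's list.
    overlap: number of query proteins that are substrates of the kinase.
    Returns P(X >= overlap) when the query is drawn at random.
    """
    max_overlap = min(set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 3 of 3 query proteins hit a 3-substrate set in a universe of 10.
    print(overrepresentation_p(10, 3, 3, 3))
```

Ranking kinases by this p-value, from smallest to largest, yields a simple enrichment ordering for one substrate library.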

L1000FWD
L1000 fireworks display (L1000FWD) is a web application that provides interactive visualization of over 16,000 drug and small-molecule induced gene expression signatures. L1000FWD enables coloring of signatures by different attributes such as cell type, time point, concentration, as well as drug attributes such as MOA and clinical phase. Signature similarity search is implemented to enable the search for mimicking or opposing signatures given as input of up and down gene sets.

MixEMM
The mixEMM R package contains functions for estimating a mixed-effects model for clustered data (or batch-processed data) with cluster-level (or batch-level) missing values in the outcome, i.e., the outcomes of some clusters are either all observed or missing altogether. The model is developed for analyzing incomplete data from labeling-based quantitative proteomics experiments but is not limited to this type of data.

MSBooster
MSBooster queries deep-learning models to generate peptide property predictions (e.g. MS2 spectra, retention time, and ion mobility) for MSFragger’s peptide candidates and calculates similarity metrics based on agreement between predicted and observed values. These newly generated features are then used by PSM rescoring methods to better differentiate true and false positives, yielding increased peptide identifications.

MSFragger
MSFragger is an ultrafast database search tool for peptide identification in mass spectrometry-based proteomics across a wide range of datasets and applications. The speed of MSFragger makes it particularly suitable for the analysis of large datasets, for enzyme unconstrained searches, for ‘open’ database searches for identification of modified peptides, and for glycopeptide identification (N-linked and O-linked) with MSFragger Glyco mode.

Multiomics2Targets
Multiomics2Targets can be used to analyze transcriptomics, proteomics, and phosphoproteomics data collected from cohorts of cancer patients. Applied to the CPTAC3 pan-cancer cohort, Multiomics2Targets produces a report that prioritizes proteins, genes, and transcripts as potential targets for each CPTAC3 cancer subtype.

MultiOmicsViz
Plotting the effect of one omics data on other omics data along the chromosome.

mvMISE
The mvMISE R package offers a general framework of multivariate mixed-effects models for the joint analysis of multiple correlated features with clustered data structures and potential missingness. mvMISE is motivated by the multivariate analysis of data with multiplex structures from labelled proteomic experiments.

NeoFlow
A proteogenomics pipeline for neoantigen prioritization.

NetGestalt
Visualizing and analyzing multi-omics data based on biological networks.

NetSAM
Co-expression network construction and network module analysis.

OmicsOne
OmicsOne is a web framework for quick phenotype association analysis of multi-omic data. With one click, it integrates quality control, statistics, and visualization. It uses six modules: phenotype profiling, data preprocessing, knowledge annotation, feature discovery, individual feature correlation, and enrichment analysis for associated features.

PANOPLY
A cloud-based platform for automated and reproducible proteogenomic data analysis.

Panoptes
Analyzing pathological images using a multi-resolution convolutional neural network architecture.

PepQuery
A universal targeted peptide search engine for identifying or validating known and novel peptides of interest.

Pollock
We introduce Pollock, an algorithm for cell type identification, that provides a suite of pretrained models, prediction interpretability modules, and compatibility with popular single cell libraries. Pollock performs commensurately with currently existing classification methods, while easily deployable pretrained classification modules generalize well across tissue types on a variety of different data types. Additionally, it demonstrates utility in an immune pan-cancer analysis.

ProMAP
ProMAP is an R package for modeling the global trans-regulatory network between DNA alterations and RNA/protein expression. ProMAP uses a penalized multivariate linear mixed-effects model to handle the batch effects and non-ignorable missingness in proteomics data directly. It also utilizes the MAP penalty to facilitate the detection of trans-hubs (i.e. DNA alterations influencing a large number of RNAs/proteins).

ProMS
Protein marker selection using proteomics or multi-omics data.

ProNetView
Visualizing networks inferred based on proteogenomic data from CPTAC CCRCC project.

ProTrack-ccRCC
Interactive visualization of CPTAC CCRCC multi-omics data.

ProTrack-PBT
Interactive visualization of CPTAC-CBTTC pediatric brain tumor multi-omics data.

PTMcosmos
A database that stores post-translational modifications (PTMs) and cancer mutations in humans.

PTM-SEA
A modified version of ssGSEA to perform site-specific signature analysis by scoring PTMsigDB's bi-directional signature-sets.

PTM-Shepherd
PTM-Shepherd automates characterization of PTM profiles detected in open (mass-tolerant) proteomics searches based on attributes such as amino acid localization, fragmentation spectra similarity, retention time shifts, and relative modification rates. PTM-Shepherd can also perform multi-experiment comparisons for studying changes in modification profiles, e.g. in data generated in different laboratories or under different conditions.

PTMsigDB
A collection of modification site-specific signatures of perturbations, kinase activities and signaling pathways curated from literature.

QUILTS
Creating sample specific protein sequence databases using genomic and transcriptomic data.

RHT
The RHT R package offers functions to perform the regularized Hotelling's T-square test for pathway or gene set analysis. The package is tailored for, but not limited to, proteomics data, in which sample sizes are often small, a large proportion of the data are missing, and/or correlations may be present.

Rummagene
By crawling over 6 million PubMed Central articles, the Rummagene server provides access to hundreds of thousands of human and mouse gene sets extracted from supporting materials of research publications.

RummaGEO
RummaGEO is a webserver application that enables gene expression signature search of a large collection of human and mouse RNA-seq studies deposited into GEO.

SC-ION
SC-ION infers regulatory networks from multi-omics data. It uses the expression of regulators (e.g. transcription factors) in one dataset (e.g. proteome/PTM) to predict the expression of targets (e.g. genes) in another dataset (e.g. transcriptome). SC-ION can be used with different types of large-scale omics data to infer multiple networks. It is available as an RShiny application.

SEPepQuant
Improving protein isoform characterization in shotgun proteomics through a graph theory-based approach.

SETPath
The SETPath R package provides functions to test gene/protein expression data from a biological pathway for biologically meaningful differences in the eigenstructure between two classes.

Skyline
Skyline is an application for building Selected Reaction Monitoring (SRM) / Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM - Targeted MS/MS and DIA/SWATH) and targeted DDA with MS1 quantitative methods and analyzing the resulting mass spectrometer data.

spaceMap
The spaceMap R package constructs de novo networks from multiple data types in a high-dimensional context by applying a novel conditional graphical model. In addition to learning network structure, an accompanying network analysis toolkit is also provided. The toolkit has been developed with genomics applications in mind (but may be adapted for other applications) and maps scientific domain knowledge onto networks.

TAAPrediction
Bayesian algorithm for inferring gene expression states in individual samples, and its application to identifying candidate tumor-associated antigens (TAAs).

TMT-Integrator
TMT-Integrator normalizes and combines channel abundances from multiple TMT or iTRAQ-labeled proteomics samples, generating quantification reports as specified by the user. TMT-Integrator currently provides four quantification options: gene, protein, peptide, and modified site levels.
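
One simple way to normalize channel abundances is median centering on the log2 scale. The sketch below illustrates that idea only; the function name `median_center`, the input layout, and the choice of median centering are assumptions for illustration, not a description of TMT-Integrator's internals or options.

```python
from statistics import median


def median_center(log2_by_channel):
    """Subtract each channel's median from its log2 abundances.

    log2_by_channel: dict mapping channel label -> list of log2 abundances,
    with None marking missing values. Missing values are preserved.
    """
    centered = {}
    for channel, values in log2_by_channel.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # Nothing to center in an entirely missing channel.
            centered[channel] = list(values)
            continue
        shift = median(observed)
        centered[channel] = [None if v is None else v - shift for v in values]
    return centered


if __name__ == "__main__":
    channels = {"126": [1.0, 2.0, 3.0], "127N": [2.0, None, 4.0]}
    print(median_center(channels))
```

After centering, every channel has a median of zero, so systematic loading differences between channels no longer dominate comparisons across samples.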

TSNet
The TSNet R package implements a new method that constructs tumor-cell-specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expression from different cell types and explicitly models the tumor purity percentage in each tumor sample.

WebGestalt
A gene set analysis toolkit. The 2024 update introduces faster gene set analysis and new support for metabolomics and multi-omics.

WebGestaltR
An R package for gene set analysis.

Databases and Web Portals

FunMap Portal
A web portal for exploring functional networks generated by FunMap.

IDPpub
Illuminating the dark phosphoproteome through PubMed mining.

LinkedOmics
Analyzing multi-omics data from TCGA and CPTAC projects using association and pathway analysis, both within and across cancer types.

LinkedOmicsKB
LinkedOmicsKB makes consistently processed and systematically precomputed CPTAC pan-cancer proteogenomics data easily accessible to the public through a web portal. With approximately 40,000 gene-, protein-, mutation-, and phenotype-centric web pages, it enables anyone with internet access to conduct meaningful inquiries into CPTAC data, facilitating data-driven scientific discoveries.

LinkedOmics Targets
We integrate the CPTAC pan-cancer dataset with other public datasets to provide insights into existing cancer drug targets and to systematically identify candidate new targets for drug repurposing or development. The analyses include overexpressed and hyperactivated protein dependencies, protein dependencies associated with the loss of tumor suppressor genes, and putative neoantigens and tumor-associated antigens.

Pathway Figure OCR
Pathway Figure OCR is dedicated to extracting pathway information from the published literature where ~1000 pathway figures are published each month.

ProKAP
ProTrack Kinase Activity Portal (ProKAP) is a web portal for querying, visualizing, and downloading the derived pan-cancer kinase activity database for the CPTAC PanCancer cohort. The Kinase Activity database, which was built using KEA3 (https://maayanlab.cloud/kea3/), characterizes putative differences in kinase state between tumor and normal tissues within and across 10 cancer types.

WikiPathways
WikiPathways is a database of molecular pathway diagrams contributed and refined by the research community. A dedicated portal was established for CPTAC pathways, including the hallmarks of cancer and phosphorylation state information. Browse, download, analyze or draw your own pathways of interest.
