To advance precision medicine by understanding aspects of the molecular complexity of cancer, the CPTAC program develops novel approaches to process large-scale proteogenomic data sets.
An important part of the CPTAC mission is to make data and tools available and accessible to the greater research community. Here, OCCPR has curated a collection of computational tools developed and/or utilized by CPTAC for processing and analysis of proteogenomic data. Although NCI does not endorse any specific tool, this list gives researchers a gateway to access bioinformatic tools that are useful for analyzing and/or visualizing large-scale proteomic and proteogenomic datasets generated through high-throughput screens and other approaches.
CPTAC (Python package)
Accessing and interacting with CPTAC data in Python.
Harmonizing DNA sequences from CPTAC whole genome sequencing, whole exome sequencing, and RNAseq using GDC pipelines.
Making cancer-related proteomic datasets easily accessible to the public and facilitating multi-omic integration through interoperability with other resources.
Hosting both the radiology and pathology imaging data generated from CPTAC samples.
Data Processing and QC
The CPTAC Common Data Analysis Platform for LC-MS/MS data.
Identifying and correcting sample mislabeling in multi-omics data.
An ensemble-based imputation algorithm for labelled proteomics data that resulted from the NCI-CPTAC DREAM Proteogenomics Challenge (2016) and the post-Challenge community effort.
MassQC is an online Quality Control Tool that serves to diagnose liquid chromatography-mass spectrometry instrument hardware to ensure the instrument is running in a reproducible manner. Using data from CPTAC inter-lab studies, the National Institute of Standards and Technology developed a number of metrics to assess instrument performance and ProteomeSoftware subsequently built a graphical user interface to commercialize this tool.
MSInspector is a Python program for quality evaluation of the five assay characterization experiments outlined in the CPTAC Assay Portal guidance document. MSInspector enables researchers to test their Skyline files for statistical calculation and data visualization through the built-in R scripts. The report file describes the details of any errors.
Comparing and evaluating data matrices generated from the same omics dataset using different tools, algorithms, or parameter settings.
Panorama is a web application for storing, sharing, analyzing, and reusing targeted assays created and refined with Skyline. Panorama allows laboratories to store and organize curated results contained in Skyline documents with fine-grained permissions, which facilitates secure sharing of published and unpublished data via a web-browser interface.
A complete toolkit for shotgun proteomics data analysis.
Skyline is a freely available, open-source Windows client application for building Selected Reaction Monitoring (SRM) / Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM), DIA/SWATH and targeted DDA quantitative methods and analyzing the resulting mass spectrometer data. Its flexible configuration supports All Molecules.
Appyters turn Jupyter Notebooks into fully functional standalone web-based bioinformatics applications. Appyters present to users an entry form enabling them to upload their data and set various parameters for a multitude of data analysis workflows.
The ARHT R package performs the Adaptable Regularized Hotelling's T^2 test (ARHT) for pathway analysis. Both one-sample and two-sample mean tests are available with various probabilistic alternative prior models.
A deep learning-based peptide retention time prediction tool.
Outlier analysis of proteogenomic datasets to identify samples with aberrantly high or low levels of genes, proteins or PTM sites.
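As a rough illustration of what calling samples with aberrantly high or low values can look like, the sketch below flags values that fall outside the median plus or minus a multiple of the interquartile range. The median/IQR rule, the `k=1.5` threshold, the function name `call_outliers`, and the sample values are all assumptions made for this example; they are not taken from the tool listed above.

```python
from statistics import median, quantiles


def call_outliers(abundances, k=1.5):
    """Label each sample 'high', 'low', or 'none' for one gene/protein/PTM site.

    abundances: dict mapping sample ID -> measured value.
    k: hypothetical threshold multiplier on the interquartile range (IQR).
    """
    values = list(abundances.values())
    # Quartiles of the observed values (inclusive method treats data as the full population).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    upper = center + k * iqr
    lower = center - k * iqr

    labels = {}
    for sample, value in abundances.items():
        if value > upper:
            labels[sample] = "high"
        elif value < lower:
            labels[sample] = "low"
        else:
            labels[sample] = "none"
    return labels


# Toy values for a single protein across seven samples (made up for illustration).
toy = {"S1": 10.1, "S2": 9.8, "S3": 10.3, "S4": 10.0, "S5": 14.9, "S6": 9.9, "S7": 5.2}
outlier_labels = call_outliers(toy)
flagged = {s: lab for s, lab in outlier_labels.items() if lab != "none"}
print(flagged)
```

In practice such a rule would be applied row by row across a full data matrix, and the resulting outlier counts compared between sample groups.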
ChIP-X Enrichment Analysis 3 (ChEA3) is a transcription factor enrichment analysis tool that ranks TFs associated with user-submitted gene sets. The ChEA3 background database contains a collection of gene set libraries generated from multiple sources including TF-gene co-expression from RNA-seq studies, TF-target associations from ChIP-seq experiments, and TF-gene co-occurrence computed from crowd-submitted gene lists.
Interactive visualization of retrospective CPTAC-BRCA multi-omics data.
Interactive visualization of prospective CPTAC-BRCA multi-omics data.
Interactive visualization of CPTAC-LUAD multi-omics data.
Generating customized databases from DNA and RNA sequencing data for proteomics search.
DagBagM R package contains functions for learning directed acyclic graphs for mixtures of continuous and binary variables. It utilizes an efficient implementation of the hill climbing algorithm as well as structural Hamming distance to aggregate DAGs.
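For readers unfamiliar with structural Hamming distance, the following minimal sketch counts the edge insertions, deletions, and reversals that separate two DAGs. It only illustrates the distance itself; how DagBagM uses it during aggregation is not shown here. The function name, the edge-set representation, and the convention that a reversed edge counts once are assumptions for this example.

```python
def structural_hamming_distance(edges_a, edges_b):
    """Structural Hamming distance between two DAGs given as sets of (parent, child) edges.

    Assumed convention: a missing edge, an extra edge, and a reversed edge each count as 1.
    """
    a, b = set(edges_a), set(edges_b)
    seen_pairs = set()
    distance = 0
    # The symmetric difference holds every directed edge present in exactly one graph.
    for parent, child in a ^ b:
        pair = frozenset((parent, child))
        # A reversal contributes two directed edges to the symmetric difference
        # (u->v and v->u); count the underlying node pair only once.
        if pair in seen_pairs:
            continue
        seen_pairs.add(pair)
        distance += 1
    return distance


# Toy graphs (hypothetical nodes): B->C is reversed, C->D is removed, A->D is added.
g1 = {("A", "B"), ("B", "C"), ("C", "D")}
g2 = {("A", "B"), ("C", "B"), ("A", "D")}
shd = structural_hamming_distance(g1, g2)
print(shd)
```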
An immunopeptidomics data analysis tool that leverages deep learning prediction to improve peptide identification.
A phosphoproteomics data analysis tool that leverages deep learning prediction to improve phosphopeptide identification and phosphosite localization.
A suite of computational tools enabling comprehensive and flexible analysis of mass spectrometry-based proteomics data.
Identify spatially interacting phosphorylation sites and mutations by co-clustering somatic mutations and phosphosites.
iJRFNet R package includes different functions for the estimation of co-expression networks from global proteomic, RNAseq and post translational modification data. It allows the joint learning of multiple related networks while borrowing information from existing databases such as protein-protein interactions and knock-down experiments.
IonQuant is a label-free quantification tool for shotgun proteomics. It supports timsTOF PASEF and non-timsTOF (e.g. Orbitrap) data, as well as matching-between-runs (MBR) and light/heavy chemical labeling.
A statistical method to characterize functional consequences of DNA alterations in tumors by jointly modeling >5 types of omics data.
iProFun has been applied to several CPTAC data sets, including clear cell renal cell carcinoma, lung adenocarcinoma, glioblastoma, and pediatric brain tumors. The CPTAC iProFun results database can be explored using the CPTAC iProFun portal, a web application that can be used with any modern Internet browser. A user specifies a cohort of interest and a list of genes, then receives an interactive table for investigating the presence of CNA, methylation, and mutation perturbations among the input genes.
iRafNet R package implements a random forest-based algorithm which integrates prior information from existing databases, such as protein-protein interaction databases and knock-out experiments, when estimating large co-expression networks based on protein/gene expression profiles.
A random forest-based approach for constructing multiple co-expression gene/protein networks based on high-dimensional proteogenomic data sets.
Kinase Enrichment Analysis 3 (KEA3) is a webserver application that infers overrepresentation of upstream kinases whose putative substrates are in a user-inputted list of proteins. KEA3 can be applied to analyze data from phosphoproteomics and proteomics studies to predict the upstream kinases responsible for observed differential phosphorylations. The KEA3 background database contains measured and predicted kinase-substrate interactions (KSI), kinase-protein interactions (KPI), and interactions supported by co-expression and co-occurrence data.
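One common way to score overrepresentation of a kinase's substrates in a user-supplied protein list is an upper-tail hypergeometric test. The sketch below shows that general idea on toy data. It is not a description of KEA3's actual scoring, and the function names, the kinase labels `KINASE_A` and `KINASE_B`, the protein IDs, and the background size are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, set_size, query_size, background_size):
    """P(X >= overlap) when query_size proteins are drawn from a background
    of background_size proteins, set_size of which are substrates of the kinase."""
    max_overlap = min(set_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_kinases(query, kinase_substrates, background_size):
    """Return (kinase, overlap, p_value) tuples sorted by ascending p-value."""
    query = set(query)
    results = []
    for kinase, substrates in kinase_substrates.items():
        overlap = len(query & set(substrates))
        p_value = hypergeom_upper_tail(overlap, len(substrates), len(query), background_size)
        results.append((kinase, overlap, p_value))
    return sorted(results, key=lambda row: row[2])


# Toy kinase-substrate library and query list (all identifiers are made up).
library = {
    "KINASE_A": {"P1", "P2", "P3", "P9"},
    "KINASE_B": {"P4", "P10", "P11", "P12", "P13"},
}
ranked = rank_kinases({"P1", "P2", "P3", "P4"}, library, background_size=100)
for kinase, overlap, p_value in ranked:
    print(f"{kinase}\toverlap={overlap}\tp={p_value:.3g}")
```

A real analysis would also correct the p-values for the number of kinases tested.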
L1000 fireworks display (L1000FWD) is a web application that provides interactive visualization of over 16,000 drug and small-molecule induced gene expression signatures. L1000FWD enables coloring of signatures by different attributes such as cell type, time point, concentration, as well as drug attributes such as MOA and clinical phase. Signature similarity search is implemented to enable the search for mimicking or opposing signatures given as input of up and down gene sets.
mixEMM R package contains functions for estimating a mixed-effects model for clustered data (or batch-processed data) with cluster-level (or batch-level) missing values in the outcome, i.e., the outcomes of some clusters are either all observed or missing altogether. The model is developed for analyzing incomplete data from labeling-based quantitative proteomics experiments but is not limited to this type of data.
MSFragger is an ultrafast database search tool for peptide identification in mass spectrometry-based proteomics across a wide range of datasets and applications. The speed of MSFragger makes it particularly suitable for the analysis of large datasets, for enzyme unconstrained searches, for ‘open’ database searches for identification of modified peptides, and for glycopeptide identification (N-linked and O-linked) with MSFragger Glyco mode.
Plotting the effect of one omics data on other omics data along the chromosome.
mvMISE R package offers a general framework of multivariate mixed-effects models for the joint analysis of multiple correlated features with clustered data structures and potential missingness. mvMISE is motivated by the multivariate data analysis on data with multiplex structures from labelled proteomic experiments.
A proteogenomics pipeline for neoantigen prioritization.
Visualizing and analyzing multi-omics data based on biological networks.
Co-expression network construction and network module analysis.
A cloud-based platform for automated and reproducible proteogenomic data analysis.
Analyzing pathological images using a multi-resolution convolutional neural network architecture.
A universal targeted peptide search engine for identifying or validating known and novel peptides of interest.
Pollock is an algorithm for cell type identification that provides a suite of pretrained models, prediction interpretability modules, and compatibility with popular single-cell libraries. Pollock performs comparably to existing classification methods, and its easily deployable pretrained classification modules generalize well across tissue types and a variety of data types. It has also demonstrated utility in an immune pan-cancer analysis.
ProMAP is an R package for modeling the global trans-regulatory network between DNA alterations and RNA/protein expression. ProMAP uses a penalized multivariate linear mixed-effects model to handle batch effects and non-ignorable missingness in proteomics data directly. It also utilizes the MAP penalty to facilitate the detection of trans-hubs (i.e. DNA alterations influencing a large number of RNAs/proteins).
Protein marker selection using proteomics or multi-omics data.
Visualizing networks inferred based on proteogenomic data from CPTAC CCRCC project.
Interactive visualization of CPTAC CCRCC multi-omics data.
Interactive visualization of CPTAC-CBTTC pediatric brain tumor multi-omics data.
A database that stores post-translational modifications (PTMs) and cancer mutations in humans.
A modified version of ssGSEA to perform site-specific signature analysis by scoring PTMsigDB's bi-directional signature-sets.
PTM-Shepherd automates characterization of PTM profiles detected in open (mass-tolerant) proteomics searches based on attributes such as amino acid localization, fragmentation spectra similarity, retention time shifts, and relative modification rates. PTM-Shepherd can also perform multi-experiment comparisons for studying changes in modification profiles, e.g. in data generated in different laboratories or under different conditions.
A collection of modification site-specific signatures of perturbations, kinase activities and signaling pathways curated from literature.
Creating sample specific protein sequence databases using genomic and transcriptomic data.
RHT R package offers functions to perform regularized Hotelling's T-square test for pathway or gene set analysis. The package is tailored for but not limited to proteomics data, in which sample sizes are often small, a large proportion of the data are missing and/or correlations may be present.
SC-ION infers regulatory networks from multi-omics data. It uses the expression of regulators (e.g. transcription factors) in one dataset (e.g. proteome/PTM) to predict the expression of targets (e.g. genes) in another dataset (e.g. transcriptome). SC-ION can be used with different types of large-scale omics data to infer multiple networks. It is available as an RShiny application.
Improving protein isoform characterization in shotgun proteomics through a graph theory-based approach.
SETPath R package provides functions to test gene/protein expression data from a biological pathway for biologically meaningful differences in the eigenstructure between two classes.
Skyline is an application for building Selected Reaction Monitoring (SRM) / Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM - Targeted MS/MS and DIA/SWATH) and targeted DDA with MS1 quantitative methods and analyzing the resulting mass spectrometer data.
The spaceMap R package constructs de novo networks from multiple data types in a high-dimensional context by applying a novel conditional graphical model. In addition to learning network structure, an accompanying network analysis toolkit is also provided. The toolkit has been developed with genomics applications in mind (though it may be adapted for other applications) and maps scientific domain knowledge onto networks.
TMT-Integrator normalizes and combines channel abundances from multiple TMT or iTRAQ-labeled proteomics samples, generating quantification reports as specified by the user. TMT-Integrator currently provides four quantification options: gene, protein, peptide, and modified site levels.
TSNet R package implements a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models tumor purity percentage in each tumor sample.
A gene set analysis toolkit with new support for phosphosite enrichment analysis.
An R package for gene set analysis.
Databases and Web Portals
Illuminating the dark phosphoproteome through PubMed mining.
Analyzing multi-omics data from TCGA and CPTAC projects using association and pathway analysis, both within and across cancer types.
LinkedOmicsKB makes consistently processed and systematically precomputed CPTAC pan-cancer proteogenomics data easily accessible to the public through a web portal. With approximately 40,000 gene-, protein-, mutation-, and phenotype-centric web pages, it enables anyone with internet access to conduct meaningful inquiries into CPTAC data, facilitating data-driven scientific discoveries.
Pathway Figure OCR
Pathway Figure OCR is dedicated to extracting pathway information from the published literature where ~1000 pathway figures are published each month.
WikiPathways is a database of molecular pathway diagrams contributed and refined by the research community. A dedicated portal was established for CPTAC pathways, including the hallmarks of cancer and phosphorylation state information. Browse, download, analyze or draw your own pathways of interest.