CPTAC provides public analyses of the raw mass spectrometry data (mapping spectra to peptide sequences and identifying proteins) using a Common Data Analysis Pipeline (CDAP). The data types available on the public portal are described below. A general overview of this pipeline can be downloaded here.
Mass Spectrometry Data Formats
RAW (Vendor) Format
The Proteome Characterization Centers (PCCs) upload mass spectrometry data as RAW (vendor format) files corresponding to the mass spectrometers used to acquire the spectra. These files are usually very large and can be read only with the mass spectrometer vendor’s libraries, typically on Windows-based operating systems. Alternatively, they can be read using open-source projects that integrate these vendor libraries, such as the ProteoWizard project. The spectral data in RAW files are considered unprocessed, although in some cases the acquisition software of the mass spectrometer may process the data in real time before recording them.
mzML Format
The RAW format spectra are converted to the HUPO Proteomics Standards Initiative (PSI) compliant mzML format at CPTAC’s Data Coordinating Center (DCC). This standardized XML format for mass spectrometry data is generated using MSConvert from the ProteoWizard project. In this process, each spectrum is transformed to a peak list using the vendor’s peak-picking algorithms. The resulting files are smaller than the RAW format files and are independent of operating system and programming language. They can be viewed using the ProteoWizard SeeMS tool and converted with MSConvert to other peak list formats suitable for analysis by tandem mass spectrometry search engines. A list of commercial and open-source tools supporting the mzML format can be found at the PSI site.
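To illustrate why mzML peak lists can be read from any language, here is a minimal sketch of decoding the binary arrays of one spectrum using only the Python standard library. It assumes a simplified, namespace-free fragment built inside the example itself; real mzML files declare an XML namespace and carry much more metadata, and the helper names (`encode_array`, `decode_binary_data_array`, `read_peak_list`) are illustrative, not part of any CPTAC or ProteoWizard tool.

```python
import base64
import struct
import xml.etree.ElementTree as ET
import zlib

# PSI-MS controlled vocabulary accessions used on mzML binary arrays.
MZ_ARRAY = "MS:1000514"
INTENSITY_ARRAY = "MS:1000515"
FLOAT32 = "MS:1000521"
FLOAT64 = "MS:1000523"
ZLIB = "MS:1000574"
NO_COMPRESSION = "MS:1000576"


def encode_array(values, zlib_compress=False):
    """Pack floats as little-endian 64-bit values and base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if zlib_compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_data_array(element):
    """Return (kind, values) for one binaryDataArray element."""
    accessions = {cv.get("accession") for cv in element.findall("cvParam")}
    raw = base64.b64decode(element.findtext("binary"))
    if ZLIB in accessions:
        raw = zlib.decompress(raw)
    code, width = ("f", 4) if FLOAT32 in accessions else ("d", 8)
    values = list(struct.unpack(f"<{len(raw) // width}{code}", raw))
    kind = "mz" if MZ_ARRAY in accessions else "intensity"
    return kind, values


def read_peak_list(xml_text):
    """Pair m/z and intensity values from a binaryDataArrayList fragment."""
    root = ET.fromstring(xml_text.strip())
    arrays = dict(
        decode_binary_data_array(e) for e in root.findall("binaryDataArray")
    )
    return list(zip(arrays["mz"], arrays["intensity"]))


# Simplified fragment for illustration only (no namespace, made-up peaks).
FRAGMENT = f"""
<binaryDataArrayList count="2">
  <binaryDataArray>
    <cvParam accession="{FLOAT64}" name="64-bit float"/>
    <cvParam accession="{ZLIB}" name="zlib compression"/>
    <cvParam accession="{MZ_ARRAY}" name="m/z array"/>
    <binary>{encode_array([100.5, 200.25, 300.125], zlib_compress=True)}</binary>
  </binaryDataArray>
  <binaryDataArray>
    <cvParam accession="{FLOAT64}" name="64-bit float"/>
    <cvParam accession="{NO_COMPRESSION}" name="no compression"/>
    <cvParam accession="{INTENSITY_ARRAY}" name="intensity array"/>
    <binary>{encode_array([10.0, 250.0, 40.0])}</binary>
  </binaryDataArray>
</binaryDataArrayList>
"""

PEAKS = read_peak_list(FRAGMENT)
```

For real files, an established mzML reader is usually the better choice, since it also handles the namespace, the spectrum index, and the remaining controlled vocabulary terms.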
Peptide-Spectrum Match Data
The first-level analysis of the spectra uploaded by the PCCs is the matching of tandem mass spectra to peptide sequences. Tandem mass spectrometry search engines match the spectra to peptide sequences from protein sequence databases, score the matches, and output the best peptide-spectrum matches (PSMs) for each spectrum. PSMs are then filtered by score and statistical significance so that only the most reliable PSMs are retained. Each PSM links an identifier for the spectrum, the peptide sequence, any post-translational modifications (PTMs) on the peptide, and a list of identifiers for the protein sequences found to contain the peptide sequence. Depending on the analysis pipeline, PSMs may also be annotated with further information, such as iTRAQ reporter ion intensities and PTM localization scores.
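The structure of a PSM and the filtering step can be sketched as follows. This is a minimal illustration, not the CDAP implementation: the field names, the "higher score is better" convention, the use of a q-value as the significance measure, and the 1% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """One peptide-spectrum match (field names are hypothetical)."""
    spectrum_id: str
    peptide: str
    score: float            # assumed: higher is better
    q_value: float          # assumed significance measure
    modifications: tuple = ()
    proteins: tuple = ()


def best_psm_per_spectrum(psms):
    """Keep only the highest-scoring PSM for each spectrum."""
    best = {}
    for psm in psms:
        current = best.get(psm.spectrum_id)
        if current is None or psm.score > current.score:
            best[psm.spectrum_id] = psm
    return list(best.values())


def filter_psms(psms, max_q_value=0.01):
    """Retain the best PSM per spectrum if it passes the q-value threshold."""
    return [p for p in best_psm_per_spectrum(psms) if p.q_value <= max_q_value]


# Made-up example data.
EXAMPLE_PSMS = [
    PSM("scan=101", "AGGK", score=42.0, q_value=0.001, proteins=("protein_1",)),
    PSM("scan=101", "GAGK", score=17.5, q_value=0.200, proteins=("protein_2",)),
    PSM("scan=102", "LSAGGKR", score=12.0, q_value=0.080, proteins=("protein_3",)),
    PSM("scan=103", "PEPTIDEK", score=55.0, q_value=0.000,
        modifications=("Phospho@T5",), proteins=("protein_1", "protein_4")),
]

RETAINED = filter_psms(EXAMPLE_PSMS)
```

Here the second match for `scan=101` loses to a better-scoring one, and the match for `scan=102` is dropped because it fails the significance threshold.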
RAW PSM Format
The CDAP implemented for CPTAC by NIST produces tab-separated value (TSV) files containing the PSMs generated by MS-GF+ for each CPTAC study. The current reference protein database used for human-in-mouse xenograft tumor pooled samples is a concatenation of RefSeq H. sapiens (build 37), RefSeq M. musculus (build 37), and the sequence for S. scrofa (porcine) trypsinogen. The FASTA file used for analysis of human The Cancer Genome Atlas (TCGA) samples and ovarian cancer tumors includes RefSeq H. sapiens (build 37) and the sequence for S. scrofa (porcine) trypsinogen.
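Because the raw PSM files are plain tab-separated text, they can be loaded with standard tooling. The sketch below assumes hypothetical column names (`SpectrumID`, `Peptide`, `Proteins`, `QValue`), a semicolon-separated protein list, and made-up accessions; the actual CDAP column layout should be taken from the pipeline documentation.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions for illustration.
SAMPLE_TSV = (
    "SpectrumID\tPeptide\tProteins\tQValue\n"
    "scan=101\tAGGK\tNP_000001.1;NP_000002.1\t0.002\n"
    "scan=102\tLSAGGKR\tNP_000003.1\t0.015\n"
)


def read_psm_table(handle):
    """Parse a tab-separated PSM table into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "spectrum_id": row["SpectrumID"],
            "peptide": row["Peptide"],
            "proteins": row["Proteins"].split(";"),
            "q_value": float(row["QValue"]),
        })
    return rows


PSM_ROWS = read_psm_table(io.StringIO(SAMPLE_TSV))
```

With a real file, the `io.StringIO` object would be replaced by a file opened in text mode with `newline=""`, as the `csv` module recommends.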
Reference mass spectral peptide libraries may be downloaded freely from the NIST Peptide Library.
PCCs may also analyze the spectral data and provide PSMs in other formats, including IDPicker3 database and MS-GF+ mzIdentML. Separate documents will describe the details of these analysis pipelines and document PSM formats.
mzIdentML PSM Format
Raw PSMs from the CDAP or the PCCs are converted to the PSI-compliant mzIdentML format at the DCC. This standardized XML format for PSMs is generated using a tool developed at the DCC with support from the ProteoWizard project. In this process, the PSMs are standardized and normalized for consumption by third-party data processing pipelines. PSM normalization includes realignment of peptide sequences to current RefSeq/UniProt protein sequence databases to obtain peptide start and end positions, a consistent accession format, and human-readable descriptions; normalization of all PTMs with UNIMOD accessions and PSI conventions for N-terminal modifications; recomputation of all theoretical masses from elemental composition; extraction of precursor m/z and retention time data from spectral data files; and verification and population of mzML native IDs as spectral identifiers. PSI-MS controlled vocabulary terms are used wherever possible. A list of commercial and open-source tools supporting the mzIdentML format can be found at the PSI site.
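Two of the normalization steps above, realigning a peptide to a protein sequence and recomputing a theoretical mass from elemental composition, can be sketched as follows. This is an illustration under simplifying assumptions, not the DCC tool: it uses exact substring matching, reports 1-based inclusive positions, computes the neutral monoisotopic mass of an unmodified peptide, and uses a made-up protein sequence. Function names are hypothetical.

```python
# Monoisotopic masses of the elements, in unified atomic mass units.
ELEMENT_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}

# Elemental composition of each amino acid residue as (C, H, N, O, S).
RESIDUE_COMPOSITION = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "I": (6, 11, 1, 1, 0),
    "N": (4, 6, 2, 2, 0), "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0),
    "K": (6, 12, 2, 1, 0), "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1),
    "H": (6, 7, 3, 1, 0), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}
ELEMENT_ORDER = ("C", "H", "N", "O", "S")


def peptide_composition(peptide):
    """Sum residue compositions and add one water (H2O) for the termini."""
    totals = {"C": 0, "H": 2, "N": 0, "O": 1, "S": 0}
    for residue in peptide:
        for element, count in zip(ELEMENT_ORDER, RESIDUE_COMPOSITION[residue]):
            totals[element] += count
    return totals


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    composition = peptide_composition(peptide)
    return sum(ELEMENT_MASS[el] * n for el, n in composition.items())


def locate_peptide(peptide, protein_sequence):
    """Return 1-based inclusive (start, end) for every exact occurrence."""
    positions = []
    index = protein_sequence.find(peptide)
    while index != -1:
        positions.append((index + 1, index + len(peptide)))
        index = protein_sequence.find(peptide, index + 1)
    return positions


# Made-up protein sequence containing the peptide twice.
POSITIONS = locate_peptide("AGGK", "MAGGKLSAGGKR")
MASS_GG = monoisotopic_mass("GG")
```

A fuller treatment would add the mass deltas of any PTMs (for example, those listed by UNIMOD) and would decide how to treat isobaric residues such as leucine and isoleucine during realignment.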
Protein Reports
The protein reports are based on the PSMs obtained from the CDAP and provide protein identification and quantitation for both label-free workflows and multiplexed iTRAQ/TMT workflows with a common reference sample. These results are based on a conservative gene-based generalized parsimony analysis developed by the Edwards lab. Peptides are associated with genes rather than with protein identifiers, and only genes with at least two unshared peptide identifications are inferred. The resulting gene list is estimated to have a false-discovery rate of at most 1%. A summary of the gene-based generalized parsimony analysis is provided in the protein identification summary report.
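The "at least two unshared peptides" rule can be illustrated with a deliberately simplified sketch. It is not the generalized parsimony analysis itself: it assumes that "unshared" means a peptide maps to exactly one gene, it does no parsimony optimization or false-discovery rate estimation, and the gene names and peptides are made up.

```python
from collections import defaultdict


def infer_genes(peptide_to_genes, min_unshared=2):
    """Return genes supported by at least `min_unshared` unshared peptides.

    Simplifying assumption: a peptide is unshared when it maps to exactly
    one gene in `peptide_to_genes`.
    """
    unshared = defaultdict(set)
    for peptide, genes in peptide_to_genes.items():
        if len(genes) == 1:
            (gene,) = genes
            unshared[gene].add(peptide)
    return sorted(g for g, peps in unshared.items() if len(peps) >= min_unshared)


# Made-up peptide-to-gene associations.
PEPTIDE_TO_GENES = {
    "AGGK": {"GENE_A"},
    "LSAGGKR": {"GENE_A"},
    "PEPTIDEK": {"GENE_A", "GENE_B"},   # shared, so it supports neither gene
    "VLDER": {"GENE_B"},
    "TTSSK": {"GENE_C"},
    "GGAAK": {"GENE_C"},
}

INFERRED = infer_genes(PEPTIDE_TO_GENES)
```

In this example `GENE_B` is not inferred because it has only one unshared peptide, even though a second, shared peptide also maps to it.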