What data is hosted by the CPTAC Data Portal?
The Data Portal hosts the mass spectrometry data from the CPTAC program. A key component is the proteogenomic profiling of patient tumors, such as those from the breast, colorectal, and ovarian cancer programs in The Cancer Genome Atlas (TCGA). The portal also hosts data from completed programs and external studies.
What research groups generate these data?
The CPTAC program is the main data generator of the hosted data. Additional datasets from other programs such as APOLLO and ICPC are anticipated in the future.
What are the data use policies for files downloaded from the CPTAC Data Portal?
The CPTAC program abides by the Amsterdam principles and has established the following policy to clarify the freedom of CPTAC and non-CPTAC users to publish findings using CPTAC data (Data Use Agreement).
How do I cite my work in publications?
The CPTAC program requests that publications using data from this program include the following statement:
“Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).”
The following manuscripts may also be cited:
CPTAC Program Overview: Ellis et al. (2013)
CPTAC Data Portal: Edwards et al. (2015).
Do I need Aspera Connect Client Plug-in for file transfer?
Yes, the Aspera Connect Client Plug-in enables high-speed file transfer. You will not be able to "Download" files from the links on the study page without it. You can download without Aspera using the HTTPS protocol from here, but this is significantly slower than data transfer with Aspera.
Where can I get Aspera Connect Client Plug-in?
Aspera Connect Client Plug-in can be downloaded from http://downloads.asperasoft.com/connect2. The Aspera download site automatically recognizes your operating system and will recommend the correct client plug-in for your machine.
Where can I get documentation for Aspera Connect Client that I installed on my computer?
Information on Aspera Connect Web Browser Plug-in is found at:
I received an error message that Aspera Client Plug-in was unable to authenticate using Port 33001. What does this mean?
The Aspera Connect Server at the CPTAC DCC uses nonstandard ports for security: UDP 33001 for file transfer and TCP 33001 for user authentication (via SSH). Users working at a university or research institute behind an institutional firewall will need to ask their IT security staff to open both ports, UDP 33001 and TCP 33001.
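Before contacting IT staff, it can help to check whether the TCP port is reachable at all. The following is a minimal sketch using only the Python standard library; the function name tcp_port_open is made up for illustration, and the host name is the one used in the ascp example elsewhere in this FAQ. It tests TCP 33001 only. UDP is connectionless, so a simple probe like this cannot confirm that UDP 33001 is open.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example call (not executed here because it needs network access):
#   tcp_port_open("cptc-xfer.uis.georgetown.edu", 33001)
```

A result of False from inside an institutional network, together with True from an outside network, suggests that a firewall is blocking the port.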
Is there a way I can set my transfers in the Aspera Connect Client to resume automatically when my internet connection is interrupted?
Go to Aspera Connect "Preferences" on your machine. In the Transfers tab, enable the auto-retry function by checking the Automatically retry failed transfers box and entering the number of times to retry that suits your situation. You can also manually click the retry icon to restart the download.
Can I use Aspera Command Line to download data?
Yes, there are two ways to use the Aspera Command Line:
- Direct Access from a Linux system
- a. Install Aspera Connect Client on your Linux system (http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in)
- b. The default install location will be the user home directory. Modify the path in the below command line example if the Aspera Connect Client is installed in a different location.
- c. Run the following command to test
- ~/.aspera/connect/bin/ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty -P 33001 -O 33001 -l 50M -T -Q --user public --host cptc-xfer.uis.georgetown.edu --mode recv /Phase_II_Data/CompRef/CompRef_Proteome_BI/CompRef_Proteome_BI_mzML.cksum .
- d. If (c) is successful, simply replace the '/Phase_II_Data/CompRef/CompRef_Proteome_BI/CompRef_Proteome_BI_mzML.cksum' with the desired file or folder name.
- e. Example folder names
- f. Additional folder names and individual dataset names can be obtained by browsing the web portal.
- A Python executable script allows direct file transfer from the CPTAC Data Portal and is subject to the CPTAC Data Use Agreement. The script can be run from the command prompt to perform whole directory or single file transfers. The script can be obtained here.
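For users who want to script the direct-access command themselves, the sketch below wraps the ascp invocation from step (c) in Python. It is not the DCC's script. The function names are hypothetical, the options are copied from the example command, and passing an explicit destination directory as the final argument is an assumption based on ascp's usual source/destination syntax.

```python
import os
import subprocess

# Default install locations from the example command; adjust if installed elsewhere.
ASCP = os.path.expanduser("~/.aspera/connect/bin/ascp")
KEY = os.path.expanduser("~/.aspera/connect/etc/asperaweb_id_dsa.putty")


def build_ascp_command(remote_path, dest_dir=".", ascp=ASCP, key=KEY):
    """Build the argument list for one download, mirroring the FAQ example."""
    return [
        ascp, "-v", "-i", key,
        "-P", "33001", "-O", "33001",   # nonstandard TCP and UDP ports
        "-l", "50M", "-T", "-Q",
        "--user", "public",
        "--host", "cptc-xfer.uis.georgetown.edu",
        "--mode", "recv",
        remote_path, dest_dir,
    ]


def download(remote_path, dest_dir="."):
    """Run ascp and return its exit code (0 normally indicates success)."""
    return subprocess.run(build_ascp_command(remote_path, dest_dir)).returncode
```

Passing the arguments as a list avoids shell quoting problems when a file or folder name contains spaces.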
Can I download data without using Aspera?
Yes, the DCC offers access to CPTAC data using the HTTPS protocol. Look for the “Http Data Access” link on each study page, or access https://cptc-xfer.uis.georgetown.edu/publicData directly.
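HTTPS downloads can also be scripted. The following is a minimal sketch using Python's standard urllib module; the function name fetch is made up for illustration, and the layout of files beneath the publicData URL is not described in this FAQ, so the file URL must be copied from the study page or the directory listing.

```python
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path, chunk_size: int = 1 << 20) -> int:
    """Stream url to dest in fixed-size chunks; return the number of bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:          # empty read means end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written


# Example call (hypothetical URL, not executed here):
#   fetch("https://cptc-xfer.uis.georgetown.edu/publicData/<path-to-file>", Path("data/file"))
```

Streaming in chunks keeps memory use constant, which matters for multi-gigabyte spectral files. This sketch does not resume interrupted transfers.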
How do I access data in compressed files with a .tar.gz file extension?
On Linux and OSX systems, the system tar and gzip command-line tools should be used. On Windows, the 7z suite of file-compression tools has been tested and successfully uncompresses even the very large compressed files.
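Where a scripted, cross-platform alternative is wanted, Python's standard tarfile module can read .tar.gz archives. This is a sketch only: it is not one of the tools the DCC reports testing, the function name is hypothetical, and its behavior on the very large CPTAC archives has not been verified here.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive: Path, dest: Path) -> list:
    """Extract a gzip-compressed tar archive into dest; return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python releases: refuse members that would escape dest.
            tar.extractall(path=dest, filter="data")
        else:
            tar.extractall(path=dest)
        # Members are cached after extraction, so this does not re-read the archive.
        return tar.getnames()
```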
What are .cksum files for?
Checksum files are used to verify file integrity, providing a way to check that the local copy of a file is the same as the DCC's copy of the file. Checksum files are computed on a per-folder basis, providing the filename, md5, sha1, and file size for all files in a folder. For folder <folder-name>, the checksum file is called <folder-name>.cksum. Checksum files can be renamed with their folders, if necessary, without compromising the checksum file contents.
The checksum files list one file per line, with the values on each line separated by tabs. Each line provides the md5 hash, the sha1 hash, the file size in bytes, and the filename, in that order. The files are listed in (ascii) sorted order. This format makes it particularly easy to compare two checksum files using standard Unix/Linux tools like grep, awk, and diff.
The DCC provides a program (cksum) for computing and verifying checksum files as part of its CPTAC-DCC Tools package (Windows, Linux, Python versions are available). The checksum files computed by this script are platform-independent so that they can be computed on one platform and verified on another.
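The tab-separated layout described above (md5, sha1, size in bytes, filename) is simple enough to parse directly. The following sketch is independent of the DCC's cksum program; the names CksumEntry, parse_cksum, and compare are made up for illustration. It reads two checksum listings and reports which files are missing, extra, or different.

```python
from typing import NamedTuple


class CksumEntry(NamedTuple):
    md5: str
    sha1: str
    size: int
    filename: str


def parse_cksum(text: str) -> dict:
    """Parse .cksum content into {filename: CksumEntry}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first three tabs only, so the filename is kept whole.
        md5, sha1, size, filename = line.split("\t", 3)
        entries[filename] = CksumEntry(md5, sha1, int(size), filename)
    return entries


def compare(local: dict, remote: dict) -> dict:
    """Report files missing locally, extra locally, or differing in hash or size."""
    shared = set(local) & set(remote)
    return {
        "missing": sorted(set(remote) - set(local)),
        "extra": sorted(set(local) - set(remote)),
        "changed": sorted(name for name in shared if local[name] != remote[name]),
    }
```

Here "local" would be a checksum file computed on your own copy of a folder, and "remote" the .cksum file downloaded from the DCC.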
How can I verify checksums?
On Linux and OSX systems, the traditional ls, md5sum, and sha1sum programs compute the same file sizes and hashes as those contained in the .cksum files. In addition, the DCC offers a command-line program, cksum, for generating and checking .cksum files.
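On systems without md5sum and sha1sum, the same values can be computed with Python's hashlib. The sketch below builds a line in the .cksum layout for one local file and checks whether that exact line appears in a checksum file. It assumes lowercase hexadecimal hashes and bare filenames without a directory prefix, neither of which is stated in this FAQ, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def cksum_line(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return 'md5<TAB>sha1<TAB>size<TAB>filename' for a local file."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large spectral files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            size += len(chunk)
    return "\t".join([md5.hexdigest(), sha1.hexdigest(), str(size), path.name])


def verify(path: Path, cksum_file: Path) -> bool:
    """True if the line computed for path appears verbatim in cksum_file."""
    expected = set(cksum_file.read_text().splitlines())
    return cksum_line(path) in expected
```

Both hashes are computed in a single pass over the file, which halves the read time compared with running two separate tools.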
How can the Aspera infrastructure help ensure file integrity?
The DCC has configured the Aspera Connect Server to use integrity verification for each transmitted data block. Furthermore, the Aspera client will only download files that are missing or different from the files on the server, using file size and sparse checksums to determine whether files on the local file system differ from those on the server. The command-line program cptacpublic (see above), for headless execution of Aspera downloads, can also be configured to require that the Aspera client compute full file checksums. Finally, checksum files (see above) can be used to provide an orthogonal check of downloaded file integrity.
Experimental Design and Data Formats
Where can I find protocols for the preparation of tumor samples and methods for mass spectrometry?
Each laboratory reports details of their experimental protocols in their publications. Links to the CPTAC publications can be found on the Available Studies tab in the third column. Prior to publication, metadata files are provided with details of sample file naming, instruments and instrumental parameters. These files are available for download from each study page under the dataset column "meta."
Where can I find the assignment of biospecimens to iTRAQ labels?
In studies using iTRAQ labels, there is a file for iTRAQ Sample Mapping available for download from each study page under the dataset column "meta." In the TCGA ovarian and breast cancer Studies, this file is also provided under the section "Biospecimens and Metadata Files."
What data formats are available?
Raw (Vendor) format
RAW or vendor format files correspond to mass spectrometers used to acquire the spectra.
RAW format spectra are converted to HUPO Proteome Standards Initiative (PSI) compliant mzML format at the DCC.
Raw PSM format
The Common Data Analysis Pipeline (CDAP) implemented by NIST produces tab-separated-value format files containing PSMs generated by MS-GF+ for each CPTAC study.
mzIdentML PSM Format
Raw PSMs from CDAP or PCCs are converted to HUPO Proteome Standards Initiative (PSI) compliant mzIdentML format at the DCC.
Is original instrument data retrievable from the CPTAC Data Portal?
Yes, on the data download pages, specify ‘raw’ as the data type desired.
Where can I find spectral data format information?
Spectral data is available in vendor RAW format, and in HUPO PSI format mzML files from the study pages. Select datatypes “raw” or “mzML.”
Where can I find details on PSM data formats? For example, what do iTRAQ flags signify?
Data format details begin on Page 8 in Software Programs and Output Files of CDAP. There are three flags defined on p. 10 (I, M, and D) that signify iTRAQ signal purity and abundance.
Where can I find details on XML format PSMs?
XML format PSMs are in HUPO PSI format mzIdentML files. The document on mzIdentML Format Peptide-Spectrum-Matches describes the transformation of CDAP format PSM data to mzIdentML.
Where is there detailed description of Protein reports?
Refer to CDAP Protein Report Description
Common Data Analysis Pipeline
What data is available from a CDAP?
The CPTAC program supports analyses of the mass spectrometry raw data (mapping of spectra to peptide sequences and protein identification) for the public using a CDAP.
Why is a CDAP used?
While each laboratory thoroughly analyzes and publishes on its own data, there is considerable interest in cross-study analyses. To facilitate cross-study comparisons, all spectral data is processed by a CDAP to ensure uniformly formatted results with consistent identification acceptance thresholds. Refer to CDAP Results Overview for more information.
How and why would published protein reports differ from the CDAP results?
Each PCC selects search engines, reference databases, other data analysis programs, and parameters to generate the most informative and comprehensive analysis for each study. While the CPTAC Steering Committee has agreed on publicly accessible and well-documented tools and methods for the common pipeline, the same scientists are free to select different software and sequence databases for their own analyses. The different strategies for peptide assignment are summarized in the CDAP Results Overview.
What types of analyses were performed on each tumor type using CDAP? Are they directly comparable?
All data were processed using the CDAP described in the CDAP Results Overview document. In addition, each PCC analyzed its own data. The specific methods used are described in the publications posted on the CPTAC Overview page.
Were any normal samples analyzed in the TCGA colorectal cancer study?
Normal colon tissue was analyzed using the same protocols as for the TCGA samples, and is found in Normal Colon Epithelium Samples. Note that the normal colon samples analyzed are not matched normals from the TCGA or CPTAC tumor sample donors.
Were any normal samples analyzed in the TCGA breast or ovarian cancer studies?
How can I get relative protein abundance for my genes from the TCGA breast cancer study?
Download the TCGA_Breast_BI_Proteome_CDAP_Protein_Report.r1 dataset using the “Prot” datatype selector. The tab-separated-values format protein report TCGA_Breast_BI_Proteome_CDAP.r1.itraq.tsv provides relative protein abundance by sample. Rows correspond to proteins, while columns correspond to TCGA samples. The “XXXX Log Ratio” columns contain the relative abundance of sample XXXX, with respect to the pooled reference sample, as log ratios (base 2). The “XXXX Unshared Log Ratio” columns contain the relative abundance of sample XXXX computed using only those peptide ions whose peptide sequences are associated with a single inferred protein.
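Once the report is downloaded, the log ratios for a gene of interest can be pulled out with Python's csv module. This is a sketch under stated assumptions: the name of the column that identifies each protein's gene ("Gene" below) is a guess and must be checked against the header of the actual file, and blank or non-numeric cells are simply skipped. The function names are hypothetical. Because the ratios are base-2 logarithms, 2 raised to the log ratio gives the fold change relative to the pooled reference.

```python
import csv

SUFFIX = " Log Ratio"
UNSHARED_SUFFIX = " Unshared Log Ratio"


def log_ratios_for_gene(handle, gene, gene_column="Gene"):
    """Return {sample: log2 ratio} for the first row whose gene_column equals gene."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row.get(gene_column) != gene:
            continue
        ratios = {}
        for column, value in row.items():
            # Keep "XXXX Log Ratio" columns, skip "XXXX Unshared Log Ratio" ones.
            if column is None or not column.endswith(SUFFIX):
                continue
            if column.endswith(UNSHARED_SUFFIX):
                continue
            try:
                ratios[column[: -len(SUFFIX)]] = float(value)
            except (TypeError, ValueError):
                pass  # blank or non-numeric cell: no value for this sample
        return ratios
    return {}


def fold_change(log2_ratio):
    """Convert a base-2 log ratio into a linear fold change."""
    return 2.0 ** log2_ratio
```

Typical use would be to open the .tsv file in text mode and pass the file handle as the first argument.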
How can I get relative protein abundance for my genes from the TCGA colorectal cancer study?
Download the TCGA_Colon_VU_Proteome_CDAP_Protein_Report.r1 dataset using the “Prot” datatype selector. The tab-separated-values format protein report TCGA_Colon_Proteome_CDAP.r1.spectral_counts.tsv provides spectral count protein abundance by sample. Rows correspond to proteins, while columns correspond to TCGA samples. The “XXXX Spectral Count” columns contain the spectral count values for sample XXXX. The “XXXX Unshared Spectral Count” columns contain the spectral count values for sample XXXX computed using only those peptide ions whose peptide sequences are associated with a single inferred protein. Similarly, protein abundance based on integration of precursor peaks is available in the protein report TCGA_Colon_Proteome_CDAP.r1.precursor_area.tsv.
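The paired column naming makes it straightforward to collect both count types per sample. The sketch below, with a hypothetical function name, takes one row of the spectral count report (as produced by csv.DictReader with a tab delimiter) and returns the total and unshared counts side by side. It assumes the cells hold plain numbers and treats empty cells as absent.

```python
import csv

TOTAL = " Spectral Count"
UNSHARED = " Unshared Spectral Count"


def counts_by_sample(row):
    """Map sample -> [spectral count, unshared spectral count] for one report row."""
    counts = {}
    for column, value in row.items():
        if column is None or not value:
            continue
        # Test the longer suffix first: unshared columns also end with TOTAL.
        if column.endswith(UNSHARED):
            sample, slot = column[: -len(UNSHARED)], 1
        elif column.endswith(TOTAL):
            sample, slot = column[: -len(TOTAL)], 0
        else:
            continue
        counts.setdefault(sample, [0, 0])[slot] = int(float(value))
    return counts
```

A large gap between the two numbers for a protein indicates that much of its evidence comes from peptides shared with other inferred proteins.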
How is the consistency and reproducibility of CPTAC spectral data assessed?
NIST performed quality assessment using parameters derived from each of the output files from quantitation and isotope analysis. Each PCC pre-tested its experimental protocol in the system suitability studies using a human-in-mouse xenograft breast cancer tumor, called Comparative Reference Material (CompRef), that was distributed to all groups for lab-to-lab and within-laboratory performance checks. The same CompRef materials are run between TCGA samples for quality control, and the resulting ‘interstitial’ CompRef analyses are made available for download. Refer to CDAP Results Overview for additional description. Subsequent studies performed by the PCCs on additional tumor samples also use these quality control procedures.
Will mass spectral library spectra result from these data?
Yes. Mass spectral files accumulated by CPTAC currently represent more than 100 million mass spectra. The mass spectrum of each unique peptide sequence exhibits a characteristic reproducible pattern of mass/charge vs. intensity, much like an individual’s fingerprint. Consequently, mass spectral libraries of previously characterized components permit very rapid peptide identification. The NIST Mass Spectrometry Data Center established repositories of compound-specific mass spectral data useful for rapid recognition of simple chemical structures such as drugs, pesticides, steroids, amino acids, etc. More recently, libraries of tandem mass spectra of peptides, recorded by the CPTAC labs using liquid chromatographic separation/electrospray ionization mass spectrometers, have been made freely available to the public for download from the NIST Peptide Library.
How should I cite the CDAP?
A publication describing CDAP authored by Rudnick et al. (2016) is available.
Who should I contact if I need assistance?
Having problems with CPTAC Data Portal - contact firstname.lastname@example.org.
Suggestions for new features - contact email@example.com.
How can I request new features for the CPTAC Data Portal?
Feature requests, suggestions, and comments are welcome. Please send an e-mail to: firstname.lastname@example.org