Riding the wave of the future requires scientists to move away from siloed thinking toward an inclusive and collaborative mindset. By leveraging the power of crowdsourcing, precisionFDA and NCI-CPTAC teamed up to launch the Multi-omics Enabled Sample Mislabeling Correction Big Data Challenge. Over 500 participants from 20 countries answered the call to develop computational algorithms that identify multi-omics samples that were switched prior to or during data processing and analysis. Working with genomic, transcriptomic and proteomic datasets drawn from several single-subject cases, participants developed algorithms that not only identified mislabeled samples, but also matched them to their correct case.
The Big Data Challenge was organized into two independent Subchallenges. In Subchallenge 1, participants developed a computational model to identify samples in a training data set whose clinical genomics and protein profiling data were known not to match. In Subchallenge 2, participants were given RNA profiling data in addition to the training data set, and developed a computational model that would both identify and correct any mislabeled data.
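To make the idea of matching samples across data types concrete, here is a minimal sketch in Python. It is not any participant's method, which this article does not describe. It assumes that RNA and protein measurements for the same genes tend to correlate within a sample, and it flags any sample whose RNA profile correlates best with a different sample's protein profile. The function names (`row_correlations`, `standardize_genes`, `flag_mislabeled`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def row_correlations(a, b):
    """Pearson correlation between every row of a and every row of b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def standardize_genes(m):
    """Z-score each gene (column) across samples (rows)."""
    return (m - m.mean(axis=0)) / m.std(axis=0)


def flag_mislabeled(rna, protein):
    """Map each RNA row to its best-matching protein row when they differ.

    Both inputs are (samples x genes) arrays whose rows are assumed to be
    in the same labeled order. An empty result means every label agrees.
    """
    corr = row_correlations(standardize_genes(rna), standardize_genes(protein))
    best = corr.argmax(axis=1)
    return {i: int(j) for i, j in enumerate(best) if i != j}


# Synthetic illustration (assumed data, not challenge data):
# six samples, 200 genes, with a shared signal plus measurement noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(6, 200))
rna = signal + 0.1 * rng.normal(size=signal.shape)
protein = signal + 0.1 * rng.normal(size=signal.shape)

# Simulate a labeling error: the protein profiles of samples 1 and 4 are swapped.
protein[[1, 4]] = protein[[4, 1]]

print(flag_mislabeled(rna, protein))
```

On this synthetic data the sketch reports that RNA samples 1 and 4 best match protein rows 4 and 1, which both detects the swap and suggests the correction. Real data would be far noisier, with weaker RNA-protein agreement and missing values, so a simple best-match rule like this one should be read as an illustration of the problem rather than a solution to it.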
During this year’s RECOMB2019 DREAM Satellite Conference, held at George Washington University on May 4th, 2019, Dr. Henry Rodriguez, Director of the Office of Cancer Clinical Proteomics Research at NCI, Dr. Emily Boja, Program Director in the Office of Cancer Clinical Proteomics Research, and Dr. Bing Zhang, Professor at Baylor College of Medicine, presented awards to two of the best performers in our Big Data Challenge. Ms. Renke Pan (Sentieon; pictured top panel) ranked among the three best performers for both Subchallenge 1 and Subchallenge 2, while Dr. Anders Carlsson (Bionamic; pictured bottom panel) ranked in the top three for Subchallenge 2. Both presented their approaches to the challenge and the computational strategies behind their successful models. The challenge also highlighted another reason to gather and assess multi-omic data for patient pre-diagnosis: combining genomic data with proteomic analysis can give clinicians more information to assist in diagnosis and treatment.
What happens next? As mentioned in our recent article (Nature Medicine, PMID: 30194412), these algorithms will be aggregated and refined into a final open-source product that will become part of the pre-assessment step of our multi-omics data analysis pipeline. This will help ensure that reliable and accurate information is added to our proteomic database, and help propel the translation of multi-omics technologies and datasets to the clinic. This project also gave us a wonderful opportunity to expand our collaborations beyond the proteomic community and find inventive ways to improve our resources.