In biomedical research, sample mislabeling, or incorrect annotation, has been a long-standing problem that contributes to irreproducible results and invalid conclusions. These problems are particularly prevalent in large-scale multi-omics studies, where human errors can arise during sample transfer, sample tracking, large-scale data generation, and data sharing/management.
Thus, there is a pressing need to identify and correct sample and data mislabeling events to ensure the right data for the right patient (Nature Medicine, PMID: 30194412). In that regard, multi-omics data collected on the same patient offer an unprecedented opportunity to pinpoint and correct mislabeling mistakes, because the same sample is measured multiple times on different platforms, as in programs such as The Cancer Genome Atlas and the Clinical Proteomic Tumor Analysis Consortium (CPTAC).
To address this critical roadblock that can occur in translational and clinical research, CPTAC and the Food and Drug Administration (FDA), in coordination with the DREAM Challenges, launched the first-ever Crowdsourced Multi-omics Sample Mislabeling Big Data Challenge.
The joint Challenge comprised two subchallenges designed to capitalize on multi-omics data and to encourage the development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples, in the hope of accelerating the translation of omics technologies and datasets into the clinic. Subchallenge 1 focused on the development of next-gen computational models for detecting mislabeled samples using proteomic and clinical data generated by CPTAC, while Subchallenge 2 went a step further, aiming to identify and correct mismatched samples in which one of three data types (clinical, transcriptomic, and proteomic data) was mislabeled. In this case, correction is feasible because only one data type among the three is mislabeled, so the two data types that still agree with each other single out the third.
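The reasoning behind Subchallenge 2 can be illustrated with a minimal sketch. It assumes that, for a given sample, some upstream model has already decided whether each pair of data types appears to come from the same patient. The function name and its boolean inputs are hypothetical and are used only for illustration; this is not any participant's method or the Challenge's scoring procedure.

```python
from typing import Optional


def find_mislabeled_type(clin_prot: bool, clin_rna: bool, prot_rna: bool) -> Optional[str]:
    """Pick the data type that disagrees with the other two.

    Each argument says whether a (hypothetical) upstream matcher judged
    the two named data types to belong to the same patient:
      clin_prot: clinical vs. proteomic
      clin_rna:  clinical vs. transcriptomic
      prot_rna:  proteomic vs. transcriptomic
    Returns None when all three agree, the name of the odd one out when
    exactly one pair agrees, and "ambiguous" otherwise.
    """
    if clin_prot and clin_rna and prot_rna:
        return None  # all three data types are mutually consistent
    if prot_rna and not clin_prot and not clin_rna:
        return "clinical"  # proteomic and transcriptomic agree with each other
    if clin_rna and not clin_prot and not prot_rna:
        return "proteomic"  # clinical and transcriptomic agree with each other
    if clin_prot and not clin_rna and not prot_rna:
        return "transcriptomic"  # clinical and proteomic agree with each other
    # Inconsistent evidence, e.g. more than one data type may be mislabeled.
    return "ambiguous"
```

With only two data types, as in Subchallenge 1, a mismatch can be detected but there is no third measurement to indicate which side is wrong, which is why that subchallenge was limited to detection.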
Challenge by the Numbers and Best Performers
A total of 230 submissions from 81 participants around the globe were received and evaluated. Subchallenge 1 ran from September 24, 2018 to November 4, 2018 and received 148 submissions from 51 participants, and Subchallenge 2 ran from November 5, 2018 to December 18, 2018 and received 82 submissions from 30 participants.
Participants were not required to take part in Subchallenge 1 in order to enter Subchallenge 2. Because the Challenge permitted each participant to submit multiple submissions/algorithms, the rankings of the best performing methods and the best performing participants differ slightly. For additional details regarding the Subchallenge results, evaluation metrics, and ranking processes, see https://precision.fda.gov/challenges/4/view/results for Subchallenge 1 and https://precision.fda.gov/challenges/5/view/results for Subchallenge 2.
Subchallenge 1 aimed at detecting samples whose clinical annotation (e.g., tumor grade, sex) did not match their proteomic data. Its top three participants were Dr. Daniel Schlauch, a principal biostatistician at Genospace; Mr. Eric (Peng) Li, a staff engineer at Alibaba Cloud; and Ms. Renke Pan, a data scientist at Sentieon.
Subchallenge 2 aimed at identifying and correcting mislabeled data using transcriptomic, proteomic, and clinical data. Its top three performing participants, who also developed the top three performing methods, were Dr. Anders Carlsson of Bionamic, who is interested in developing analytical tools and strategies for the life sciences; Ms. Renke Pan, one of the top performers in Subchallenge 1; and Mr. Soon Jye Kho, a graduate research assistant at Wright State University.
The Challenge’s best performers will move into the community phase of the project, where they will collaborate with each other to assess their methods and devise a better solution to the problem. In addition, a few participants will be actively involved in preparing an overview manuscript, whose submission Nature Medicine supports, contingent on a standard evaluation and peer review process. Furthermore, select top performers will present their findings at the RECOMB2019 DREAM Challenge Satellite Conference.
This Challenge represents the first collaborative endeavor between precisionFDA, the National Cancer Institute (NCI), and DREAM, in the hope of spurring future collaboration in the data science community and further efforts to improve “Rigor and Reproducibility” in biomedical studies in this exciting era of “Big Data.”