In biomedical research, sample mislabeling or incorrect annotation has been a long-standing issue contributing to irreproducible results and invalid conclusions. These issues are particularly prevalent in large scale multi-omics studies, in which multiple different omics experiments are carried out at different time periods and/or in different labs and human errors can arise during sample transferring, sample tracking, large-scale data generation, and data sharing/management. On the other hand, simultaneous use of multiple types of omics platforms to characterize a large set of biological samples, as utilized in The Cancer Genome Atlas and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) projects, is becoming a powerful and popular approach to precision medicine.
While there is value in multi-omics technologies and datasets to help reach a deeper understanding of a disease and ultimately help a physician and patient determine the most appropriate treatment option, sample mislabeling presents a roadblock that can occur in data production and analysis pipelines involving data-rich, large-scale omics studies. As a result, there is a pressing need to develop computational algorithms that can accurately detect and correct mislabeled data in large-scale multi-omics studies.
In a newly published community (crowdsourced) study led by the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) and precisionFDA programs, participants showed a wide range of accuracy, underscoring the importance of the benchmarking effort. Post-challenge collaboration between the top-performing teams and the challenge organizers has resulted on the development of an open-source software, COSMO, with demonstrated high accuracy and robustness in mislabeling identification and correction in simulated and real multi-omic datasets.
“Using COSMO, we have successfully detected and fixed sample label errors in multiple projects. I hope the broader research community will find this tool useful, as checking sample label errors is a highly recommended quality control step when handling multi-omics data sets from large-scale proteogenomic studies,” said co-corresponding author Pei Wang, Ph.D., a professor of genetics and genomic sciences at the Icahn School of Medicine at Mount Sinai.
COSMO and its source code are openly available at the GitHub, thus allowing broad usage and continuous development by the global scientific community.
“It is highly rewarding to see collaborative efforts from teams with different backgrounds and locations eventually lead to an open-source software product addressing a critical need in multi-omics research,” said co-corresponding author Bing Zhang, Ph.D., a professor of molecular and human genetics at the Baylor College of Medicine.
Source: A community effort to identify and correct mislabeled samples in proteogenomic studies. Patterns, Volume 2, Issue 5, 14 May 2021, 100245.