Modern computational tools for data collection and analysis have revolutionized how scientists research and respond to problems. While the growing number of large, public data sets is a great boon for researchers, the accessibility of these resources can introduce quality and security risks. Data in the biological sciences is particularly vulnerable to fraud because its massive size makes manipulation easier to hide. Inspired in part by techniques used to detect fraud in the financial sector, Dr. Samuel Payne and PhD student Michael Bradshaw have demonstrated that machine learning can be used to detect fraud in large-scale omics databases. Published in December 2021 in PLOS ONE, their results show that models using digit frequencies as inputs can detect fabricated data with high accuracy.
The two researchers take time to distinguish fraud from honest human error; fraud is defined by the intent to cheat or deceive. They clarify that an honest error might be forgetting to include a few samples, while intentionally excluding samples would be fraud. Data fabrication is a specific type of fraud defined as “making up data or results and recording or reporting them.” Because this type of fraud is always purposeful, it leaves no moral ambiguity and is always negative. In an effort to “get into the mind” of someone who might fabricate data, Dr. Payne and Mr. Bradshaw explored three methods of data fabrication and then set out to determine whether machine learning models could detect those methods effectively.
Simulated fraudulent data used in this study was generated using three different methods: random number generation, resampling with replacement, and imputation. The three methods the researchers used represent three potential ways that an actual scientist might fabricate data. Using each method, the researchers created 50 fake samples which were combined with the 100 real samples to form a mixed dataset of 150 samples. The real data used in this study comes from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) cohort for endometrial carcinoma, specifically the copy number alteration data.
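The three fabrication strategies can be illustrated with a short Python sketch. This is a hedged illustration, not the authors' code: the function names, the uniform range used for random generation, the mean-fill stand-in for imputation, and the `mask_frac` parameter are all assumptions made for the example, and the "real" matrix here is synthetic rather than CPTAC data.

```python
import numpy as np

def fabricate_random(real, n_fake, rng):
    """Assumed variant: draw each value uniformly between the per-gene min and max."""
    lo, hi = real.min(axis=0), real.max(axis=0)
    return rng.uniform(lo, hi, size=(n_fake, real.shape[1]))

def fabricate_resample(real, n_fake, rng):
    """For every gene, sample with replacement from the values observed in real samples."""
    n_real, n_genes = real.shape
    rows = rng.integers(0, n_real, size=(n_fake, n_genes))
    return real[rows, np.arange(n_genes)]

def fabricate_impute(real, n_fake, rng, mask_frac=0.5):
    """Assumed stand-in for imputation: copy real samples, then overwrite a
    random fraction of their values with the per-gene mean."""
    base = real[rng.integers(0, real.shape[0], size=n_fake)].copy()
    mask = rng.random(base.shape) < mask_frac
    gene_means = np.broadcast_to(real.mean(axis=0), base.shape)
    base[mask] = gene_means[mask]
    return base

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 20))           # synthetic placeholder for 100 real samples
fake = fabricate_resample(real, 50, rng)    # 50 fabricated samples
mixed = np.vstack([real, fake])             # mixed dataset of 150 samples
labels = np.r_[np.zeros(100), np.ones(50)]  # 0 = real, 1 = fabricated
```

Each function returns a matrix with one row per fabricated sample, so any of the three can be swapped in when building the mixed dataset.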
Dr. Payne and Mr. Bradshaw tested five different machine learning models which, after training, were asked to classify testing data containing 75 samples, 50 real and 25 fake. This process of model training and evaluation was performed 50 times; each time, a completely new set of 25 fabricated samples was made. When the five models were first trained on gene copy-number alteration data as input, they correctly detected fraud with 58–100% accuracy. Unsatisfied with these results, the researchers used digit frequencies as input instead, resulting in a drastic improvement. In contrast to the first models, those that used digit frequencies were highly accurate (82–100%, average 98%) and showed less variation over the 50 trials.
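To make the idea of digit-frequency inputs concrete, here is a minimal Python sketch of how one sample's values could be turned into such features. It assumes the features are the relative frequencies of the digits 0 through 9 at the first few decimal places; the function name `digit_frequencies` and the `n_positions` parameter are hypothetical, and the exact positions used in the publication may differ.

```python
import numpy as np

def digit_frequencies(sample, n_positions=2):
    """For each of the first n_positions decimal places, return the fraction
    of values in the sample whose digit at that place is 0, 1, ..., 9."""
    counts = np.zeros((n_positions, 10))
    for value in sample:
        # Fixed-point formatting avoids floating-point artifacts such as
        # 0.29 * 100 evaluating to 28.999999999999996.
        decimals = f"{abs(float(value)):.10f}".split(".")[1]
        for pos in range(n_positions):
            counts[pos, int(decimals[pos])] += 1
    return (counts / len(sample)).ravel()

features = digit_frequencies([0.12, 1.15, -2.32, 0.10])
# features has 20 entries: 10 frequencies for the first decimal place,
# followed by 10 for the second.
```

A feature vector like this, computed once per sample, replaces the raw copy-number values as classifier input. Whatever the number of genes, every sample is reduced to the same short vector.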
In their publication, Dr. Payne and Mr. Bradshaw present a proof-of-concept method for detecting fabrication in biomedical datasets. The team believes that their fraud detection methods could be refined and generalized for broad use. In a thought-provoking discussion section, they acknowledge that techniques like this could also serve the opposite purpose, aiding those attempting to commit fraud by giving them a means of evaluating the quality of their fabricated data. Despite that concern, this publication represents a victory in the battle against fraud in large-scale datasets and is an exciting foundation for further research.