Leakage and the Reproducibility Crisis in ML-based Science
arxiv.org

DJB, June 2016, "Malaise of (Cyber-security) Science":

"Robust science has a few characteristics, among them reproducibility, repeatability, and internal as well as external validity. Due to misaligned incentive structures and poor methodology, much of the scientific literature is suffused with ‘results’ that cannot be repeated by the same team, let alone be reproduced by others. [..] Richard Horton, Editor-in-Chief of The Lancet (a leading medical journal), put it thus: ‘The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.’"

2022: Leakage and the Reproducibility Crisis in ML-based Science

A survey of 20 papers that identify pitfalls in the adoption of ML methods across 17 fields, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions.

Data leakage is indeed a widespread problem and has led to severe reproducibility failures.
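To make the leakage pitfall concrete, the sketch below is my own minimal illustration (not code from the paper), using Python and scikit-learn: feature selection is fit on the full dataset before the train/test split, so information from the held-out rows leaks into the model and inflates test accuracy even on pure-noise data; splitting first and fitting every preprocessing step only on the training fold brings performance back to chance.

# Illustration of train/test leakage via preprocessing (noise data, labels are random, so true accuracy is ~50%).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))      # pure noise features
y = rng.integers(0, 2, size=200)      # random labels

# Leaky protocol: select "informative" features using ALL rows, then split.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky test accuracy:", leaky.score(X_te, y_te))   # well above chance: overoptimistic

# Sound protocol: split first, fit the whole pipeline on the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("clean test accuracy:", clean.score(X_te, y_te))   # roughly chance, as expected

Wrapping the preprocessing and the model in a single pipeline, then fitting it only on training data, is the standard way to keep held-out data out of every step of model construction.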

The attractiveness of adopting ML methods in scientific research is in part due to the widespread availability of off-the-shelf tools to create models without expertise in ML methods (Hutson, 2019). However, this laissez-faire approach leads to common pitfalls spreading to all scientific fields that use ML. So far, each research community has independently rediscovered these pitfalls. Without fundamental changes to research and reporting practices, we risk losing public trust owing to the severity and prevalence of the reproducibility crisis across disciplines.

Further reading on this topic