Leakage and the reproducibility crisis in machine-learning-based science
Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields th...
Main Authors: | Kapoor, Sayash, Narayanan, Arvind |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10499856/ https://www.ncbi.nlm.nih.gov/pubmed/37720327 http://dx.doi.org/10.1016/j.patter.2023.100804 |
_version_ | 1785105799977107456 |
---|---|
author | Kapoor, Sayash Narayanan, Arvind |
collection | PubMed |
description | Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models. |
format | Online Article Text |
id | pubmed-10499856 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-104998562023-09-15 Leakage and the reproducibility crisis in machine-learning-based science Kapoor, Sayash Narayanan, Arvind Patterns (N Y) Article Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models. Elsevier 2023-08-04 /pmc/articles/PMC10499856/ /pubmed/37720327 http://dx.doi.org/10.1016/j.patter.2023.100804 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Leakage and the reproducibility crisis in machine-learning-based science |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10499856/ https://www.ncbi.nlm.nih.gov/pubmed/37720327 http://dx.doi.org/10.1016/j.patter.2023.100804 |