Leakage and the reproducibility crisis in machine-learning-based science

Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields th...

Bibliographic Details
Main authors: Kapoor, Sayash; Narayanan, Arvind
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10499856/
https://www.ncbi.nlm.nih.gov/pubmed/37720327
http://dx.doi.org/10.1016/j.patter.2023.100804
author Kapoor, Sayash
Narayanan, Arvind
collection PubMed
description Machine-learning (ML) methods have gained prominence in the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. We systematically investigate reproducibility issues in ML-based science. Through a survey of literature in fields that have adopted ML methods, we find 17 fields where leakage has been found, collectively affecting 294 papers and, in some cases, leading to wildly overoptimistic conclusions. Based on our survey, we introduce a detailed taxonomy of eight types of leakage, ranging from textbook errors to open research problems. We propose that researchers test for each type of leakage by filling out model info sheets, which we introduce. Finally, we conduct a reproducibility study of civil war prediction, where complex ML models are believed to vastly outperform traditional statistical models such as logistic regression (LR). When the errors are corrected, complex ML models do not perform substantively better than decades-old LR models.
format Online
Article
Text
id pubmed-10499856
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10499856 2023-09-15 Leakage and the reproducibility crisis in machine-learning-based science Kapoor, Sayash Narayanan, Arvind Patterns (N Y) Article Elsevier 2023-08-04 /pmc/articles/PMC10499856/ /pubmed/37720327 http://dx.doi.org/10.1016/j.patter.2023.100804 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Leakage and the reproducibility crisis in machine-learning-based science
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10499856/
https://www.ncbi.nlm.nih.gov/pubmed/37720327
http://dx.doi.org/10.1016/j.patter.2023.100804