Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls
PURPOSE: To investigate the impact of the following three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. MATERIALS AND METHODS: The authors used retrospective CT, histopathologic analysis, and radiography datasets to develop machine learning models with and without the three methodological pitfalls to quantitatively illustrate their effect on model performance and generalizability.
Main Authors: Maleki, Farhad; Ovens, Katie; Gupta, Rajiv; Reinhold, Caroline; Spatz, Alan; Forghani, Reza
Format: Online Article Text
Language: English
Published: Radiological Society of North America, 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9885377/ https://www.ncbi.nlm.nih.gov/pubmed/36721408 http://dx.doi.org/10.1148/ryai.220028
_version_ | 1784879918935441408 |
author | Maleki, Farhad Ovens, Katie Gupta, Rajiv Reinhold, Caroline Spatz, Alan Forghani, Reza |
author_facet | Maleki, Farhad Ovens, Katie Gupta, Rajiv Reinhold, Caroline Spatz, Alan Forghani, Reza |
author_sort | Maleki, Farhad |
collection | PubMed |
description | PURPOSE: To investigate the impact of the following three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. MATERIALS AND METHODS: The authors used retrospective CT, histopathologic analysis, and radiography datasets to develop machine learning models with and without the three methodological pitfalls to quantitatively illustrate their effect on model performance and generalizability. F1 score was used to measure performance, and differences in performance between models developed with and without errors were assessed using the Wilcoxon rank sum test when applicable. RESULTS: Violation of the independence assumption by applying oversampling, feature selection, and data augmentation before splitting data into training, validation, and test sets seemingly improved model F1 scores by 71.2% for predicting local recurrence and 5.0% for predicting 3-year overall survival in head and neck cancer and by 46.0% for distinguishing histopathologic patterns in lung cancer. Randomly distributing data points for a patient across datasets superficially improved the F1 score by 21.8%. High model performance metrics did not indicate high-quality lung segmentation. In the presence of a batch effect, a model built for pneumonia detection had an F1 score of 98.7% but correctly classified only 3.86% of samples from a new dataset of healthy patients. CONCLUSION: Machine learning models developed with these methodological pitfalls, which are undetectable during internal evaluation, produce inaccurate predictions; thus, understanding and avoiding these pitfalls is necessary for developing generalizable models. 
Keywords: Random Forest, Diagnosis, Prognosis, Convolutional Neural Network (CNN), Medical Image Analysis, Generalizability, Machine Learning, Deep Learning, Model Evaluation. Supplemental material is available for this article. Published under a CC BY 4.0 license. |
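The leakage pitfall described in the abstract, violating the independence assumption by oversampling before splitting data into training and test sets, can be illustrated with a minimal sketch. This is not the authors' code; it uses hypothetical patient IDs and plain-Python resampling to show how duplicated minority-class samples leak across the split:

```python
# Minimal sketch (hypothetical data) of pitfall (a): oversampling before
# the train/test split lets copies of test samples leak into training.
import random

random.seed(0)

# 100 unique "patients"; 10 belong to the minority class (label 1).
samples = [("pt%03d" % i, int(i < 10)) for i in range(100)]

def oversample(data):
    # Duplicate minority-class samples until the classes are balanced.
    minority = [s for s in data if s[1] == 1]
    majority = [s for s in data if s[1] == 0]
    while len(minority) < len(majority):
        minority.append(random.choice(minority))
    return majority + minority

def split(data, test_frac=0.3):
    # Random shuffle-and-cut split into (train, test).
    data = data[:]
    random.shuffle(data)
    n_test = int(len(data) * test_frac)
    return data[n_test:], data[:n_test]

# Wrong order: oversample first, then split -> duplicated patients
# end up on both sides of the split.
train_bad, test_bad = split(oversample(samples))
leak = {s[0] for s in train_bad} & {s[0] for s in test_bad}

# Correct order: split first, then oversample only the training set.
train, test = split(samples)
train_good = oversample(train)
leak_free = {s[0] for s in train_good} & {s[0] for s in test}

print(len(leak))       # many patient IDs appear in both sets
print(len(leak_free))  # 0: no overlap
```

The same ordering argument applies to feature selection and data augmentation: any step that uses or duplicates information from the whole dataset must happen after the split, inside the training fold only.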
format | Online Article Text |
id | pubmed-9885377 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Radiological Society of North America |
record_format | MEDLINE/PubMed |
spelling | pubmed-98853772023-01-30 Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls Maleki, Farhad Ovens, Katie Gupta, Rajiv Reinhold, Caroline Spatz, Alan Forghani, Reza Radiol Artif Intell Special Report PURPOSE: To investigate the impact of the following three methodological pitfalls on model generalizability: (a) violation of the independence assumption, (b) model evaluation with an inappropriate performance indicator or baseline for comparison, and (c) batch effect. MATERIALS AND METHODS: The authors used retrospective CT, histopathologic analysis, and radiography datasets to develop machine learning models with and without the three methodological pitfalls to quantitatively illustrate their effect on model performance and generalizability. F1 score was used to measure performance, and differences in performance between models developed with and without errors were assessed using the Wilcoxon rank sum test when applicable. RESULTS: Violation of the independence assumption by applying oversampling, feature selection, and data augmentation before splitting data into training, validation, and test sets seemingly improved model F1 scores by 71.2% for predicting local recurrence and 5.0% for predicting 3-year overall survival in head and neck cancer and by 46.0% for distinguishing histopathologic patterns in lung cancer. Randomly distributing data points for a patient across datasets superficially improved the F1 score by 21.8%. High model performance metrics did not indicate high-quality lung segmentation. In the presence of a batch effect, a model built for pneumonia detection had an F1 score of 98.7% but correctly classified only 3.86% of samples from a new dataset of healthy patients. 
CONCLUSION: Machine learning models developed with these methodological pitfalls, which are undetectable during internal evaluation, produce inaccurate predictions; thus, understanding and avoiding these pitfalls is necessary for developing generalizable models. Keywords: Random Forest, Diagnosis, Prognosis, Convolutional Neural Network (CNN), Medical Image Analysis, Generalizability, Machine Learning, Deep Learning, Model Evaluation. Supplemental material is available for this article. Published under a CC BY 4.0 license. Radiological Society of North America 2022-11-16 /pmc/articles/PMC9885377/ /pubmed/36721408 http://dx.doi.org/10.1148/ryai.220028 Text en © 2022 by the Radiological Society of North America, Inc. Published under a CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/). |
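Pitfall (c), the batch effect, can also be sketched with toy numbers. The setup below is hypothetical (it is not the study's pneumonia model): all diseased cases come from one acquisition site whose scanner adds a brightness offset, so a trivial threshold "model" scores near-perfectly internally by learning the site signature, then collapses on healthy patients from that same site:

```python
# Hypothetical sketch of a batch effect: class membership is confounded
# with acquisition site, so the model learns the site, not the disease.
import random

random.seed(1)

def image_mean(label, site):
    # Site A scanners add a constant brightness offset of +50;
    # the true disease signal is only +5.
    offset = 50 if site == "A" else 0
    disease = 5 if label == 1 else 0
    return 100 + offset + disease + random.gauss(0, 2)

# Internal data: every pneumonia case (label 1) was acquired at site A,
# every healthy case (label 0) at site B.
internal = [(image_mean(1, "A"), 1) for _ in range(50)] + \
           [(image_mean(0, "B"), 0) for _ in range(50)]

# "Model": classify by thresholding mean brightness at the internal mean.
threshold = sum(x for x, _ in internal) / len(internal)

def predict(x):
    return int(x > threshold)

internal_acc = sum(predict(x) == y for x, y in internal) / len(internal)

# External data: healthy patients scanned at site A.
external = [(image_mean(0, "A"), 0) for _ in range(50)]
external_acc = sum(predict(x) == y for x, y in external) / len(external)

print(internal_acc)   # near-perfect: the threshold separates sites
print(external_acc)   # collapses: healthy site-A patients look "diseased"
```

The internal evaluation cannot expose this failure because train and test data share the confound; only data from a new site, as in the abstract's external dataset of healthy patients, reveals it.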
spellingShingle | Special Report Maleki, Farhad Ovens, Katie Gupta, Rajiv Reinhold, Caroline Spatz, Alan Forghani, Reza Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title | Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title_full | Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title_fullStr | Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title_full_unstemmed | Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title_short | Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls |
title_sort | generalizability of machine learning models: quantitative evaluation of three methodological pitfalls |
topic | Special Report |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9885377/ https://www.ncbi.nlm.nih.gov/pubmed/36721408 http://dx.doi.org/10.1148/ryai.220028 |
work_keys_str_mv | AT malekifarhad generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls AT ovenskatie generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls AT guptarajiv generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls AT reinholdcaroline generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls AT spatzalan generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls AT forghanireza generalizabilityofmachinelearningmodelsquantitativeevaluationofthreemethodologicalpitfalls |