Cargando…

Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies

BACKGROUND: Low birthweight (LBW) is a leading cause of neonatal mortality in the United States and a major causative factor of adverse health effects in newborns. Identifying high-risk patients early in prenatal care is crucial to preventing adverse outcomes. Previous studies have proposed various...

Descripción completa

Detalles Bibliográficos
Autores principales:	Ren, Yang, Wu, Dezhi, Tong, Yan, López-DeFede, Ana, Gareau, Sarah
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	JMIR Publications 2023
Materias:	Original Paper
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267797/ https://www.ncbi.nlm.nih.gov/pubmed/37256674 http://dx.doi.org/10.2196/44081

_version_	1785059002396180480
author	Ren, Yang Wu, Dezhi Tong, Yan López-DeFede, Ana Gareau, Sarah
author_facet	Ren, Yang Wu, Dezhi Tong, Yan López-DeFede, Ana Gareau, Sarah
author_sort	Ren, Yang
collection	PubMed
description	BACKGROUND: Low birthweight (LBW) is a leading cause of neonatal mortality in the United States and a major causative factor of adverse health effects in newborns. Identifying high-risk patients early in prenatal care is crucial to preventing adverse outcomes. Previous studies have proposed various machine learning (ML) models for LBW prediction task, but they were limited by small and imbalanced data sets. Some authors attempted to address this through different data rebalancing methods. However, most of their reported performances did not reflect the models’ actual performance in real-life scenarios. To date, few studies have successfully benchmarked the performance of ML models in maternal health; thus, it is critical to establish benchmarks to advance ML use to subsequently improve birth outcomes. OBJECTIVE: This study aimed to establish several key benchmarking ML models to predict LBW and systematically apply different rebalancing optimization methods to a large-scale and extremely imbalanced all-payer hospital record data set that connects mother and baby data at a state level in the United States. We also performed feature importance analysis to identify the most contributing features in the LBW classification task, which can aid in targeted intervention. METHODS: Our large data set consisted of 266,687 birth records across 6 years, and 8.63% (n=23,019) of records were labeled as LBW. To set up benchmarking ML models to predict LBW, we applied 7 classic ML models (ie, logistic regression, naive Bayes, random forest, extreme gradient boosting, adaptive boosting, multilayer perceptron, and sequential artificial neural network) while using 4 different data rebalancing methods: random undersampling, random oversampling, synthetic minority oversampling technique, and weight rebalancing. Owing to ethical considerations, in addition to ML evaluation metrics, we primarily used recall to evaluate model performance, indicating the number of correctly predicted LBW cases out of all actual LBW cases, as false negative health care outcomes could be fatal. We further analyzed feature importance to explore the degree to which each feature contributed to ML model prediction among our best-performing models. RESULTS: We found that extreme gradient boosting achieved the highest recall score—0.70—using the weight rebalancing method. Our results showed that various data rebalancing methods improved the prediction performance of the LBW group substantially. From the feature importance analysis, maternal race, age, payment source, sum of predelivery emergency department and inpatient hospitalizations, predelivery disease profile, and different social vulnerability index components were important risk factors associated with LBW. CONCLUSIONS: Our findings establish useful ML benchmarks to improve birth outcomes in the maternal health domain. They are informative to identify the minority class (ie, LBW) based on an extremely imbalanced data set, which may guide the development of personalized LBW early prevention, clinical interventions, and statewide maternal and infant health policy changes.
format	Online Article Text
id	pubmed-10267797
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-102677972023-06-15 Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies Ren, Yang Wu, Dezhi Tong, Yan López-DeFede, Ana Gareau, Sarah J Med Internet Res Original Paper BACKGROUND: Low birthweight (LBW) is a leading cause of neonatal mortality in the United States and a major causative factor of adverse health effects in newborns. Identifying high-risk patients early in prenatal care is crucial to preventing adverse outcomes. Previous studies have proposed various machine learning (ML) models for LBW prediction task, but they were limited by small and imbalanced data sets. Some authors attempted to address this through different data rebalancing methods. However, most of their reported performances did not reflect the models’ actual performance in real-life scenarios. To date, few studies have successfully benchmarked the performance of ML models in maternal health; thus, it is critical to establish benchmarks to advance ML use to subsequently improve birth outcomes. OBJECTIVE: This study aimed to establish several key benchmarking ML models to predict LBW and systematically apply different rebalancing optimization methods to a large-scale and extremely imbalanced all-payer hospital record data set that connects mother and baby data at a state level in the United States. We also performed feature importance analysis to identify the most contributing features in the LBW classification task, which can aid in targeted intervention. METHODS: Our large data set consisted of 266,687 birth records across 6 years, and 8.63% (n=23,019) of records were labeled as LBW. To set up benchmarking ML models to predict LBW, we applied 7 classic ML models (ie, logistic regression, naive Bayes, random forest, extreme gradient boosting, adaptive boosting, multilayer perceptron, and sequential artificial neural network) while using 4 different data rebalancing methods: random undersampling, random oversampling, synthetic minority oversampling technique, and weight rebalancing. Owing to ethical considerations, in addition to ML evaluation metrics, we primarily used recall to evaluate model performance, indicating the number of correctly predicted LBW cases out of all actual LBW cases, as false negative health care outcomes could be fatal. We further analyzed feature importance to explore the degree to which each feature contributed to ML model prediction among our best-performing models. RESULTS: We found that extreme gradient boosting achieved the highest recall score—0.70—using the weight rebalancing method. Our results showed that various data rebalancing methods improved the prediction performance of the LBW group substantially. From the feature importance analysis, maternal race, age, payment source, sum of predelivery emergency department and inpatient hospitalizations, predelivery disease profile, and different social vulnerability index components were important risk factors associated with LBW. CONCLUSIONS: Our findings establish useful ML benchmarks to improve birth outcomes in the maternal health domain. They are informative to identify the minority class (ie, LBW) based on an extremely imbalanced data set, which may guide the development of personalized LBW early prevention, clinical interventions, and statewide maternal and infant health policy changes. JMIR Publications 2023-05-31 /pmc/articles/PMC10267797/ /pubmed/37256674 http://dx.doi.org/10.2196/44081 Text en ©Yang Ren, Dezhi Wu, Yan Tong, Ana López-DeFede, Sarah Gareau. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 31.05.2023. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Original Paper Ren, Yang Wu, Dezhi Tong, Yan López-DeFede, Ana Gareau, Sarah Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title	Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title_full	Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title_fullStr	Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title_full_unstemmed	Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title_short	Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies
title_sort	issue of data imbalance on low birthweight baby outcomes prediction and associated risk factors identification: establishment of benchmarking key machine learning models with data rebalancing strategies
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267797/ https://www.ncbi.nlm.nih.gov/pubmed/37256674 http://dx.doi.org/10.2196/44081
work_keys_str_mv	AT renyang issueofdataimbalanceonlowbirthweightbabyoutcomespredictionandassociatedriskfactorsidentificationestablishmentofbenchmarkingkeymachinelearningmodelswithdatarebalancingstrategies AT wudezhi issueofdataimbalanceonlowbirthweightbabyoutcomespredictionandassociatedriskfactorsidentificationestablishmentofbenchmarkingkeymachinelearningmodelswithdatarebalancingstrategies AT tongyan issueofdataimbalanceonlowbirthweightbabyoutcomespredictionandassociatedriskfactorsidentificationestablishmentofbenchmarkingkeymachinelearningmodelswithdatarebalancingstrategies AT lopezdefedeana issueofdataimbalanceonlowbirthweightbabyoutcomespredictionandassociatedriskfactorsidentificationestablishmentofbenchmarkingkeymachinelearningmodelswithdatarebalancingstrategies AT gareausarah issueofdataimbalanceonlowbirthweightbabyoutcomespredictionandassociatedriskfactorsidentificationestablishmentofbenchmarkingkeymachinelearningmodelswithdatarebalancingstrategies

Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies

Ejemplares similares