An effective up-sampling approach for breast cancer prediction with imbalanced data: A machine learning model-based comparative analysis

Early detection of breast cancer plays a critical role in successful treatment, saving thousands of patients' lives every year. Although massive amounts of clinical data have been collected and stored by healthcare organizations, only a small portion of the data has been used to support decision-making for...

Bibliographic Details
Main authors: Tran, Tuan; Le, Uyen; Shi, Yihui
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9140301/
https://www.ncbi.nlm.nih.gov/pubmed/35622821
http://dx.doi.org/10.1371/journal.pone.0269135
author Tran, Tuan
Le, Uyen
Shi, Yihui
collection PubMed
description Early detection of breast cancer plays a critical role in successful treatment, saving thousands of patients' lives every year. Although massive amounts of clinical data have been collected and stored by healthcare organizations, only a small portion of the data has been used to support treatment decisions. In this study, we proposed an engineered up-sampling method (ENUS) for handling imbalanced data to improve the predictive performance of machine learning models. Our experimental results showed that when the ratio of the minority class to the majority class is less than 20%, training models with ENUS improved balanced accuracy by 3.74%, sensitivity by 8.36%, and F1 score by 3.83%. Our study also found that XGBoost Tree (XGBTree) using ENUS achieved the best performance on the validation dataset, with an average balanced accuracy of 97.47% (min = 93%, max = 100%), sensitivity of 97.88% (min = 89%, max = 100%), and F1 score of 96.20% (min = 89.5%, max = 100%). Furthermore, our ensemble algorithm identified Cell_Shape and Nuclei as the most important attributes in predicting breast cancer. This finding reaffirms, through a data-driven approach, the previously known relationship between Cell_Shape, Nuclei, and the grades of breast cancer. Finally, our experiment showed that the Random Forest and Neural Network models had the shortest training times. Our study provides a comprehensive comparison of a wide range of machine learning methods for predicting breast cancer risk. It can be used as a tool for healthcare practitioners to effectively detect and treat breast cancer.
format Online
Article
Text
id pubmed-9140301
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9140301 2022-05-28 An effective up-sampling approach for breast cancer prediction with imbalanced data: A machine learning model-based comparative analysis Tran, Tuan; Le, Uyen; Shi, Yihui PLoS One Research Article
Public Library of Science 2022-05-27 /pmc/articles/PMC9140301/ /pubmed/35622821 http://dx.doi.org/10.1371/journal.pone.0269135 Text en © 2022 Tran et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title An effective up-sampling approach for breast cancer prediction with imbalanced data: A machine learning model-based comparative analysis
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9140301/
https://www.ncbi.nlm.nih.gov/pubmed/35622821
http://dx.doi.org/10.1371/journal.pone.0269135
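note The description above reports results as balanced accuracy, sensitivity, and F1 score, and refers to the ratio of the minority class to the majority class. As a rough illustration only, the sketch below shows the standard textbook definitions of those quantities for a binary problem. It is not the authors' code and does not implement ENUS, which this record does not describe; the function names `minority_ratio` and `binary_metrics` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def minority_ratio(labels):
    """Return (minority class count) / (majority class count) for binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return min(counts.values()) / max(counts.values())


def binary_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, balanced accuracy, and F1 score.

    Uses the standard confusion-matrix definitions; any ratio with a zero
    denominator is reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on positives
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if precision + sensitivity
        else 0.0
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical toy labels: 2 positive (minority) and 10 negative cases.
    y_true = [1, 1] + [0] * 10
    y_pred = [1, 0] + [0] * 9 + [1]
    print("minority/majority ratio:", minority_ratio(y_true))
    for name, value in binary_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy averages the recall on each class, so it is less flattered by a large majority class than plain accuracy, which is presumably why it is reported alongside sensitivity and F1 for imbalanced data.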