Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records

BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machi...

Bibliographic Details
Main authors: Hussan, Hisham; Zhao, Jing; Badu-Tawiah, Abraham K.; Stanich, Peter; Tabung, Fred; Gray, Darrell; Ma, Qin; Kalady, Matthew; Clinton, Steven K.
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9064446/
https://www.ncbi.nlm.nih.gov/pubmed/35271664
http://dx.doi.org/10.1371/journal.pone.0265209
_version_ 1784699378585305088
author Hussan, Hisham
Zhao, Jing
Badu-Tawiah, Abraham K.
Stanich, Peter
Tabung, Fred
Gray, Darrell
Ma, Qin
Kalady, Matthew
Clinton, Steven K.
author_sort Hussan, Hisham
collection PubMed
description BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors. METHODS: We enrolled 3,116 adults aged 35–50 who were at average risk for CRC and underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (remaining 30% of participants), we measured predictive performance by comparing C-statistics to a reference model (logistic regression). RESULTS: The study sample was 55.1% female and 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95% CI); neural network: 0.75 (0.48–1.00) vs. reference: 0.43 (0.18–0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95% CI); regularized discriminant analysis: 0.64 (0.59–0.69) vs. reference: 0.55 (0.50–0.59); P < 0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
DISCUSSION: Machine learning can predict CRC risk in adults aged 35–50 using EHR data with improved discrimination. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application.
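The description compares models by their C-statistics on a 30% testing set. As a reading aid, here is a minimal sketch of those two steps, assuming hypothetical function names (`c_statistic`, `split_70_30`) and made-up toy scores. It is not the study's code or data, and it does not reproduce the study's models or confidence intervals. The C-statistic is computed as the fraction of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as half.

```python
import random


def c_statistic(labels, scores):
    """Concordance (C) statistic, equivalent to the area under the ROC curve.

    labels: 1 for a case, 0 for a non-case.
    scores: predicted risk for each individual, same order as labels.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0   # case ranked above non-case
            elif case_score == control_score:
                concordant += 0.5   # ties count as half
    return concordant / (len(cases) * len(controls))


def split_70_30(n, seed=0):
    """Randomly assign indices 0..n-1 to a 70% training and 30% testing set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(0.7 * n)
    return indices[:cut], indices[cut:]


# Toy example with made-up values (assumption, not study data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(toy_labels, toy_scores))  # 0.75

train_idx, test_idx = split_70_30(3116)
print(len(train_idx), len(test_idx))  # 2181 935
```

A C-statistic of 0.5 corresponds to ranking cases and non-cases no better than chance, which is why a reference value such as 0.55 indicates weak discrimination. With very few cases in a testing set, the number of (case, non-case) pairs is small and the estimate becomes unstable, consistent with the wide interval reported for the CRC-only outcome.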
format Online
Article
Text
id pubmed-9064446
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9064446 2022-05-04 PLoS One Research Article Public Library of Science 2022-03-10 /pmc/articles/PMC9064446/ /pubmed/35271664 http://dx.doi.org/10.1371/journal.pone.0265209 Text en © 2022 Hussan et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records
topic Research Article