Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records
BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machi...
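Main authors: | Hussan, Hisham; Zhao, Jing; Badu-Tawiah, Abraham K.; Stanich, Peter; Tabung, Fred; Gray, Darrell; Ma, Qin; Kalady, Matthew; Clinton, Steven K. |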
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9064446/ https://www.ncbi.nlm.nih.gov/pubmed/35271664 http://dx.doi.org/10.1371/journal.pone.0265209 |
_version_ | 1784699378585305088 |
---|---|
author | Hussan, Hisham; Zhao, Jing; Badu-Tawiah, Abraham K.; Stanich, Peter; Tabung, Fred; Gray, Darrell; Ma, Qin; Kalady, Matthew; Clinton, Steven K. |
author_facet | Hussan, Hisham; Zhao, Jing; Badu-Tawiah, Abraham K.; Stanich, Peter; Tabung, Fred; Gray, Darrell; Ma, Qin; Kalady, Matthew; Clinton, Steven K. |
author_sort | Hussan, Hisham |
collection | PubMed |
description | BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors. METHODS: We enrolled 3,116 adults aged 35–50 who were at average risk for CRC and underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (remaining 30% of participants), we measured predictive performance by comparing C-statistics to a reference model (logistic regression). RESULTS: The study sample was 55.1% female, 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95%CI); neural network: 0.75 (0.48–1.00) vs. reference: 0.43 (0.18–0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95%CI); regularized discriminant analysis: 0.64 (0.59–0.69) vs. reference: 0.55 (0.50–0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
DISCUSSION: Machine learning can predict CRC risk in adults aged 35–50 using EHR data with improved discrimination. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application. |
format | Online Article Text |
id | pubmed-9064446 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-9064446 2022-05-04 Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records Hussan, Hisham Zhao, Jing Badu-Tawiah, Abraham K. Stanich, Peter Tabung, Fred Gray, Darrell Ma, Qin Kalady, Matthew Clinton, Steven K. PLoS One Research Article BACKGROUND AND AIMS: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors. METHODS: We enrolled 3,116 adults aged 35–50 who were at average risk for CRC and underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (remaining 30% of participants), we measured predictive performance by comparing C-statistics to a reference model (logistic regression). RESULTS: The study sample was 55.1% female, 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95%CI); neural network: 0.75 (0.48–1.00) vs. reference: 0.43 (0.18–0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95%CI); regularized discriminant analysis: 0.64 (0.59–0.69) vs. reference: 0.55 (0.50–0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles. DISCUSSION: Machine learning can predict CRC risk in adults aged 35–50 using EHR data with improved discrimination. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application. Public Library of Science 2022-03-10 /pmc/articles/PMC9064446/ /pubmed/35271664 http://dx.doi.org/10.1371/journal.pone.0265209 Text en © 2022 Hussan et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Hussan, Hisham Zhao, Jing Badu-Tawiah, Abraham K. Stanich, Peter Tabung, Fred Gray, Darrell Ma, Qin Kalady, Matthew Clinton, Steven K. Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records |
title | Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9064446/ https://www.ncbi.nlm.nih.gov/pubmed/35271664 http://dx.doi.org/10.1371/journal.pone.0265209 |
work_keys_str_mv | AT hussanhisham utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT zhaojing utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT badutawiahabrahamk utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT stanichpeter utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT tabungfred utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT graydarrell utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT maqin utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT kaladymatthew utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords AT clintonstevenk utilityofmachinelearningindevelopingapredictivemodelforearlyageonsetcolorectalneoplasiausingelectronichealthrecords |
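
The abstract in this record describes a random 70%/30% split into training and testing sets and a comparison of models by C-statistic. The following is a minimal sketch of those two steps only, not the authors' code: the function names `split_70_30` and `c_statistic`, the seed, and the toy labels and scores are all hypothetical, and the record does not say how the split or the confidence intervals were actually computed. The C-statistic is computed here by its pairwise definition: the fraction of (case, non-case) pairs in which the case receives the higher predicted risk, with ties counted as one half.

```python
import random


def split_70_30(records, seed=0):
    """Randomly split records into a 70% training set and a 30% testing set.

    The seed is an illustrative assumption; the record does not give one.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def c_statistic(labels, scores):
    """Concordance (C) statistic for a binary outcome.

    labels: 1 for a case (e.g. CRC or high-risk polyp), 0 for a non-case.
    scores: predicted risk for each participant, in the same order.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    noncase_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not noncase_scores:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                concordant += 1.0   # case ranked above non-case
            elif c == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(case_scores) * len(noncase_scores))


# Toy data (hypothetical, not from the study): two non-cases, two cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(c_statistic(toy_labels, toy_scores))  # 0.75 (3 of 4 pairs concordant)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking. With very few cases in the testing set, only a small number of pairs drive the estimate, so wide confidence intervals such as those reported for the CRC-only outcome are to be expected.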