Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer

IMPORTANCE: Electronic health records (EHRs) provide a low-cost means of accessing detailed longitudinal clinical data for large populations. A lung cancer cohort assembled from EHR data would be a powerful platform for clinical outcome studies. OBJECTIVE: To investigate whether a clinical cohort as...

Bibliographic Details
Main Authors: Yuan, Qianyu, Cai, Tianrun, Hong, Chuan, Du, Mulong, Johnson, Bruce E., Lanuti, Michael, Cai, Tianxi, Christiani, David C.
Format: Online Article Text
Language: English
Published: American Medical Association 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8264641/
https://www.ncbi.nlm.nih.gov/pubmed/34232304
http://dx.doi.org/10.1001/jamanetworkopen.2021.14723
_version_ 1783719602920882176
author Yuan, Qianyu
Cai, Tianrun
Hong, Chuan
Du, Mulong
Johnson, Bruce E.
Lanuti, Michael
Cai, Tianxi
Christiani, David C.
collection PubMed
description IMPORTANCE: Electronic health records (EHRs) provide a low-cost means of accessing detailed longitudinal clinical data for large populations. A lung cancer cohort assembled from EHR data would be a powerful platform for clinical outcome studies.
OBJECTIVE: To investigate whether a clinical cohort assembled from EHRs could be used in a lung cancer prognosis study.
DESIGN, SETTING, AND PARTICIPANTS: In this cohort study, patients with lung cancer were identified among 76 643 patients with at least 1 lung cancer diagnostic code deposited in an EHR in Mass General Brigham health care system from July 1988 to October 2018. Patients were identified via a semisupervised machine learning algorithm, for which clinical information was extracted from structured and unstructured data via natural language processing tools. Data completeness and accuracy were assessed by comparing with the Boston Lung Cancer Study and against criterion standard EHR review results. A prognostic model for non–small cell lung cancer (NSCLC) overall survival was further developed for clinical application. Data were analyzed from March 2019 through July 2020.
EXPOSURES: Clinical data deposited in EHRs for cohort construction and variables of interest for the prognostic model were collected.
MAIN OUTCOMES AND MEASURES: The primary outcomes were the performance of the lung cancer classification model and the quality of the extracted variables; the secondary outcome was the performance of the prognostic model.
RESULTS: Among 76 643 patients with at least 1 lung cancer diagnostic code, 42 069 patients were identified as having lung cancer, with a positive predictive value of 94.4%. The study cohort consisted of 35 375 patients (16 613 men [47.0%] and 18 756 women [53.0%]; 30 140 White individuals [85.2%], 1040 Black individuals [2.9%], and 857 Asian individuals [2.4%]) after excluding patients with lung cancer history and less than 14 days of follow-up after initial diagnosis. The median (interquartile range) age at diagnosis was 66.7 (58.4-74.1) years. The area under the receiver operating characteristic curves of the prognostic model for overall survival with NSCLC were 0.828 (95% CI, 0.815-0.842) for 1-year prediction, 0.825 (95% CI, 0.812-0.836) for 2-year prediction, 0.814 (95% CI, 0.800-0.826) for 3-year prediction, 0.814 (95% CI, 0.799-0.828) for 4-year prediction, and 0.812 (95% CI, 0.798-0.825) for 5-year prediction.
CONCLUSIONS AND RELEVANCE: These findings suggest the feasibility of assembling a large-scale EHR-based lung cancer cohort with detailed longitudinal clinical measurements and that EHR data may be applied in cancer progression with a set of generalizable approaches.
format Online
Article
Text
id pubmed-8264641
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher American Medical Association
record_format MEDLINE/PubMed
spelling pubmed-8264641 2021-07-09 JAMA Netw Open Original Investigation American Medical Association 2021-07-07 /pmc/articles/PMC8264641/ /pubmed/34232304 http://dx.doi.org/10.1001/jamanetworkopen.2021.14723 Text en Copyright 2021 Yuan Q et al. JAMA Network Open. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the CC-BY License.
title Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer
topic Original Investigation