Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes

BACKGROUND: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language pro...

Bibliographic Details
Main authors:	Guo, Yuting; Al‐Garadi, Mohammed A.; Book, Wendy M.; Ivey, Lindsey C.; Rodriguez, Fred H.; Raskind‐Hood, Cheryl L.; Robichaux, Chad; Sarker, Abeed
Format:	Online Article Text
Language:	English
Published:	John Wiley and Sons Inc. 2023
Subjects:	Original Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10356083/ https://www.ncbi.nlm.nih.gov/pubmed/37345821 http://dx.doi.org/10.1161/JAHA.123.030046

_version_	1785075193828343808
author	Guo, Yuting; Al‐Garadi, Mohammed A.; Book, Wendy M.; Ivey, Lindsey C.; Rodriguez, Fred H.; Raskind‐Hood, Cheryl L.; Robichaux, Chad; Sarker, Abeed
author_sort	Guo, Yuting
collection	PubMed
description	BACKGROUND: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification. METHODS AND RESULTS: We included free‐text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non‐Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer‐based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held‐out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives. CONCLUSIONS: Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
format	Online Article Text
id	pubmed-10356083
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	John Wiley and Sons Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10356083 2023-07-20 J Am Heart Assoc Original Research John Wiley and Sons Inc. 2023-06-22 /pmc/articles/PMC10356083/ /pubmed/37345821 http://dx.doi.org/10.1161/JAHA.123.030046 Text en © 2023 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
title	Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes
topic	Original Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10356083/ https://www.ncbi.nlm.nih.gov/pubmed/37345821 http://dx.doi.org/10.1161/JAHA.123.030046
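
The description above mentions a sliding window strategy for handling notes longer than RoBERTa's input limit, and evaluation by F1 score on the positive (Fontan) class. The following is a minimal Python sketch of those two ideas, not the authors' implementation. The window size, the stride, the max-over-windows aggregation rule, and all function names are assumptions made for illustration; the record does not specify them.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping windows.

    `window` and `stride` are illustrative defaults, not values from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += stride
    return chunks


def patient_prediction(window_probs: Sequence[float],
                       threshold: float = 0.5) -> int:
    """Combine per-window positive-class probabilities into one label.

    Assumption: a patient is flagged if any window exceeds the threshold.
    """
    if not window_probs:
        return 0
    return int(max(window_probs) >= threshold)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, each window would be scored separately by a classifier (not shown), and `patient_prediction` would turn the per-window scores into a single patient-level label before `f1_score` is computed on the held-out patients.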

Similar Items