
Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study

BACKGROUND: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients...


Bibliographic Details
Main Authors:	Maarseveen, Tjardo D, Meinderink, Timo, Reinders, Marcel J T, Knitza, Johannes, Huizinga, Tom W J, Kleyer, Arnd, Simon, David, van den Akker, Erik B, Knevel, Rachel
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7735897/ https://www.ncbi.nlm.nih.gov/pubmed/33252349 http://dx.doi.org/10.2196/23930

_version_	1783622722376433664
author	Maarseveen, Tjardo D Meinderink, Timo Reinders, Marcel J T Knitza, Johannes Huizinga, Tom W J Kleyer, Arnd Simon, David van den Akker, Erik B Knevel, Rachel
author_facet	Maarseveen, Tjardo D Meinderink, Timo Reinders, Marcel J T Knitza, Johannes Huizinga, Tom W J Kleyer, Arnd Simon, David van den Akker, Erik B Knevel, Rachel
author_sort	Maarseveen, Tjardo D
collection	PubMed
description	BACKGROUND: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. OBJECTIVE: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. METHODS: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. RESULTS: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). 
Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97). CONCLUSIONS: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.
format	Online Article Text
id	pubmed-7735897
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-77358972020-12-18 Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study Maarseveen, Tjardo D Meinderink, Timo Reinders, Marcel J T Knitza, Johannes Huizinga, Tom W J Kleyer, Arnd Simon, David van den Akker, Erik B Knevel, Rachel JMIR Med Inform Original Paper BACKGROUND: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. OBJECTIVE: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. METHODS: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. 
RESULTS: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97). CONCLUSIONS: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems. JMIR Publications 2020-11-30 /pmc/articles/PMC7735897/ /pubmed/33252349 http://dx.doi.org/10.2196/23930 Text en ©Tjardo D Maarseveen, Timo Meinderink, Marcel J T Reinders, Johannes Knitza, Tom W J Huizinga, Arnd Kleyer, David Simon, Erik B van den Akker, Rachel Knevel. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 30.11.2020. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Original Paper Maarseveen, Tjardo D Meinderink, Timo Reinders, Marcel J T Knitza, Johannes Huizinga, Tom W J Kleyer, Arnd Simon, David van den Akker, Erik B Knevel, Rachel Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title	Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title_full	Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title_fullStr	Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title_full_unstemmed	Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title_short	Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
title_sort	machine learning electronic health record identification of patients with rheumatoid arthritis: algorithm pipeline development and validation study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7735897/ https://www.ncbi.nlm.nih.gov/pubmed/33252349 http://dx.doi.org/10.2196/23930
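
The METHODS part of the description field mentions a naïve word-matching baseline and the selection of a score threshold for a desired positive predictive value (PPV) on the training data. The following is a minimal sketch of those two ideas only, not the authors' pipeline: the keyword list, the toy notes, the labels, and all function names are hypothetical, and the machine learning classifiers compared in the study (for example support vector machines and gradient boosting) are not reproduced here.

```python
from typing import Optional, Sequence, Tuple


def word_match_score(note: str, keywords: Sequence[str]) -> float:
    """Fraction of the (hypothetical) keywords found in a free-text note."""
    text = note.lower()
    return sum(kw in text for kw in keywords) / len(keywords)


def precision_recall_f1(
    labels: Sequence[int], preds: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (PPV, recall, F1) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    return ppv, recall, f1


def lowest_threshold_for_ppv(
    labels: Sequence[int], scores: Sequence[float], target_ppv: float
) -> Optional[float]:
    """Lowest score cut-off whose PPV on the given data meets the target.

    Scanning cut-offs in ascending order keeps recall as high as possible
    among the cut-offs that reach the target PPV. Returns None if no
    cut-off reaches it.
    """
    for threshold in sorted(set(scores)):
        preds = [1 if s >= threshold else 0 for s in scores]
        ppv, _, _ = precision_recall_f1(labels, preds)
        if ppv >= target_ppv:
            return threshold
    return None


# Invented toy notes and labels (1 = rheumatoid arthritis), for illustration only.
notes = [
    "rheumatoid arthritis, started methotrexate",
    "osteoarthritis of the knee",
    "RA suspected, anti-CCP positive, methotrexate",
    "gout flare, no signs of rheumatoid arthritis",
]
labels = [1, 0, 1, 0]
keywords = ("rheumatoid arthritis", "methotrexate", "anti-ccp")  # assumed list

scores = [word_match_score(note, keywords) for note in notes]
threshold = lowest_threshold_for_ppv(labels, scores, target_ppv=0.9)
preds = [1 if s >= threshold else 0 for s in scores]
ppv, recall, f1 = precision_recall_f1(labels, preds)
```

In this toy example the fourth note matches a keyword despite the negation ("no signs of"), which illustrates why plain word matching tends to produce false positives and why a stricter cut-off is needed to reach a high PPV. In a real setting the scores would come from a trained classifier and the threshold would be chosen on training data only, then checked on held-out test data.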


Similar Items