
Cohort design and natural language processing to reduce bias in electronic health records research

Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Commu...


Bibliographic Details
Main Authors: Khurshid, Shaan, Reeder, Christopher, Harrington, Lia X., Singh, Pulkit, Sarma, Gopal, Friedman, Samuel F., Di Achille, Paolo, Diamant, Nathaniel, Cunningham, Jonathan W., Turner, Ashby C., Lau, Emily S., Haimovich, Julian S., Al-Alusi, Mostafa A., Wang, Xin, Klarqvist, Marcus D. R., Ashburner, Jeffrey M., Diedrich, Christian, Ghadessi, Mercedeh, Mielke, Johanna, Eilken, Hanna M., McElhinney, Alice, Derix, Andrea, Atlas, Steven J., Ellinor, Patrick T., Philippakis, Anthony A., Anderson, Christopher D., Ho, Jennifer E., Batra, Puneet, Lubitz, Steven A.
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993873/
https://www.ncbi.nlm.nih.gov/pubmed/35396454
http://dx.doi.org/10.1038/s41746-022-00590-0
_version_ 1784683995658715136
author Khurshid, Shaan
Reeder, Christopher
Harrington, Lia X.
Singh, Pulkit
Sarma, Gopal
Friedman, Samuel F.
Di Achille, Paolo
Diamant, Nathaniel
Cunningham, Jonathan W.
Turner, Ashby C.
Lau, Emily S.
Haimovich, Julian S.
Al-Alusi, Mostafa A.
Wang, Xin
Klarqvist, Marcus D. R.
Ashburner, Jeffrey M.
Diedrich, Christian
Ghadessi, Mercedeh
Mielke, Johanna
Eilken, Hanna M.
McElhinney, Alice
Derix, Andrea
Atlas, Steven J.
Ellinor, Patrick T.
Philippakis, Anthony A.
Anderson, Christopher D.
Ho, Jennifer E.
Batra, Puneet
Lubitz, Steven A.
author_sort Khurshid, Shaan
collection PubMed
description Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
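The description above reports that NLP recovered vital signs from unstructured notes and reduced their missingness. This record does not describe the extraction method, so the following is only a minimal sketch of the general idea, assuming a simple regular-expression rule for blood pressure. The pattern, the plausibility limits, and the function names (`extract_bp`, `recover_missing`, `missingness`) are hypothetical and are not taken from the article.

```python
import re
from typing import List, Optional, Tuple

BloodPressure = Tuple[int, int]  # (systolic, diastolic) in mmHg

# Hypothetical pattern: matches "BP 120/80", "blood pressure: 118 / 76", etc.
BP_PATTERN = re.compile(
    r"\b(?:bp|blood pressure)\b\s*(?:is|of|:)?\s*(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def extract_bp(note: str) -> Optional[BloodPressure]:
    """Return the first plausible blood pressure mentioned in a note, else None."""
    match = BP_PATTERN.search(note)
    if match is None:
        return None
    systolic, diastolic = int(match.group(1)), int(match.group(2))
    # Assumed plausibility limits; reject values that are likely extraction errors.
    if not (50 <= systolic <= 300 and 20 <= diastolic <= 200 and diastolic < systolic):
        return None
    return systolic, diastolic


def recover_missing(
    structured: List[Optional[BloodPressure]], notes: List[str]
) -> List[Optional[BloodPressure]]:
    """Keep structured values; fill missing ones from the matching note."""
    return [s if s is not None else extract_bp(n) for s, n in zip(structured, notes)]


def missingness(values: List[Optional[BloodPressure]]) -> float:
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)


if __name__ == "__main__":
    structured = [(120, 80), None, None, None]
    notes = [
        "",
        "BP 130/85 today",
        "Blood pressure: 118 / 76 mmHg",
        "Patient denies chest pain.",
    ]
    recovered = recover_missing(structured, notes)
    print(missingness(structured), missingness(recovered))
```

Comparing `missingness` before and after recovery gives the kind of reduction the description refers to. A real pipeline would also need to validate recovered values against structured fields, as the reported Pearson correlations suggest the authors did.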
format Online
Article
Text
id pubmed-8993873
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8993873 2022-04-27 Cohort design and natural language processing to reduce bias in electronic health records research Khurshid, Shaan Reeder, Christopher Harrington, Lia X. Singh, Pulkit Sarma, Gopal Friedman, Samuel F. Di Achille, Paolo Diamant, Nathaniel Cunningham, Jonathan W. Turner, Ashby C. Lau, Emily S. Haimovich, Julian S. Al-Alusi, Mostafa A. Wang, Xin Klarqvist, Marcus D. R. Ashburner, Jeffrey M. Diedrich, Christian Ghadessi, Mercedeh Mielke, Johanna Eilken, Hanna M. McElhinney, Alice Derix, Andrea Atlas, Steven J. Ellinor, Patrick T. Philippakis, Anthony A. Anderson, Christopher D. Ho, Jennifer E. Batra, Puneet Lubitz, Steven A. NPJ Digit Med Article Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research. Nature Publishing Group UK 2022-04-08 /pmc/articles/PMC8993873/ /pubmed/35396454 http://dx.doi.org/10.1038/s41746-022-00590-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Cohort design and natural language processing to reduce bias in electronic health records research
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993873/
https://www.ncbi.nlm.nih.gov/pubmed/35396454
http://dx.doi.org/10.1038/s41746-022-00590-0