Cohort design and natural language processing to reduce bias in electronic health records research
Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Commu...
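Main authors: | Khurshid, Shaan; Reeder, Christopher; Harrington, Lia X.; Singh, Pulkit; Sarma, Gopal; Friedman, Samuel F.; Di Achille, Paolo; Diamant, Nathaniel; Cunningham, Jonathan W.; Turner, Ashby C.; Lau, Emily S.; Haimovich, Julian S.; Al-Alusi, Mostafa A.; Wang, Xin; Klarqvist, Marcus D. R.; Ashburner, Jeffrey M.; Diedrich, Christian; Ghadessi, Mercedeh; Mielke, Johanna; Eilken, Hanna M.; McElhinney, Alice; Derix, Andrea; Atlas, Steven J.; Ellinor, Patrick T.; Philippakis, Anthony A.; Anderson, Christopher D.; Ho, Jennifer E.; Batra, Puneet; Lubitz, Steven A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993873/ https://www.ncbi.nlm.nih.gov/pubmed/35396454 http://dx.doi.org/10.1038/s41746-022-00590-0 |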
_version_ | 1784683995658715136 |
---|---|
author | Khurshid, Shaan Reeder, Christopher Harrington, Lia X. Singh, Pulkit Sarma, Gopal Friedman, Samuel F. Di Achille, Paolo Diamant, Nathaniel Cunningham, Jonathan W. Turner, Ashby C. Lau, Emily S. Haimovich, Julian S. Al-Alusi, Mostafa A. Wang, Xin Klarqvist, Marcus D. R. Ashburner, Jeffrey M. Diedrich, Christian Ghadessi, Mercedeh Mielke, Johanna Eilken, Hanna M. McElhinney, Alice Derix, Andrea Atlas, Steven J. Ellinor, Patrick T. Philippakis, Anthony A. Anderson, Christopher D. Ho, Jennifer E. Batra, Puneet Lubitz, Steven A. |
author_facet | Khurshid, Shaan Reeder, Christopher Harrington, Lia X. Singh, Pulkit Sarma, Gopal Friedman, Samuel F. Di Achille, Paolo Diamant, Nathaniel Cunningham, Jonathan W. Turner, Ashby C. Lau, Emily S. Haimovich, Julian S. Al-Alusi, Mostafa A. Wang, Xin Klarqvist, Marcus D. R. Ashburner, Jeffrey M. Diedrich, Christian Ghadessi, Mercedeh Mielke, Johanna Eilken, Hanna M. McElhinney, Alice Derix, Andrea Atlas, Steven J. Ellinor, Patrick T. Philippakis, Anthony A. Anderson, Christopher D. Ho, Jennifer E. Batra, Puneet Lubitz, Steven A. |
author_sort | Khurshid, Shaan |
collection | PubMed |
description | Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research. |
format | Online Article Text |
id | pubmed-8993873 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-89938732022-04-27 Cohort design and natural language processing to reduce bias in electronic health records research Khurshid, Shaan Reeder, Christopher Harrington, Lia X. Singh, Pulkit Sarma, Gopal Friedman, Samuel F. Di Achille, Paolo Diamant, Nathaniel Cunningham, Jonathan W. Turner, Ashby C. Lau, Emily S. Haimovich, Julian S. Al-Alusi, Mostafa A. Wang, Xin Klarqvist, Marcus D. R. Ashburner, Jeffrey M. Diedrich, Christian Ghadessi, Mercedeh Mielke, Johanna Eilken, Hanna M. McElhinney, Alice Derix, Andrea Atlas, Steven J. Ellinor, Patrick T. Philippakis, Anthony A. Anderson, Christopher D. Ho, Jennifer E. Batra, Puneet Lubitz, Steven A. NPJ Digit Med Article Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001-2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO as opposed to the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation 0.028 in C3PO vs. 0.036 in Convenience Samples). 
Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research. Nature Publishing Group UK 2022-04-08 /pmc/articles/PMC8993873/ /pubmed/35396454 http://dx.doi.org/10.1038/s41746-022-00590-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . |
spellingShingle | Article Khurshid, Shaan Reeder, Christopher Harrington, Lia X. Singh, Pulkit Sarma, Gopal Friedman, Samuel F. Di Achille, Paolo Diamant, Nathaniel Cunningham, Jonathan W. Turner, Ashby C. Lau, Emily S. Haimovich, Julian S. Al-Alusi, Mostafa A. Wang, Xin Klarqvist, Marcus D. R. Ashburner, Jeffrey M. Diedrich, Christian Ghadessi, Mercedeh Mielke, Johanna Eilken, Hanna M. McElhinney, Alice Derix, Andrea Atlas, Steven J. Ellinor, Patrick T. Philippakis, Anthony A. Anderson, Christopher D. Ho, Jennifer E. Batra, Puneet Lubitz, Steven A. Cohort design and natural language processing to reduce bias in electronic health records research |
title | Cohort design and natural language processing to reduce bias in electronic health records research |
title_full | Cohort design and natural language processing to reduce bias in electronic health records research |
title_fullStr | Cohort design and natural language processing to reduce bias in electronic health records research |
title_full_unstemmed | Cohort design and natural language processing to reduce bias in electronic health records research |
title_short | Cohort design and natural language processing to reduce bias in electronic health records research |
title_sort | cohort design and natural language processing to reduce bias in electronic health records research |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8993873/ https://www.ncbi.nlm.nih.gov/pubmed/35396454 http://dx.doi.org/10.1038/s41746-022-00590-0 |
work_keys_str_mv | AT khurshidshaan cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT reederchristopher cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT harringtonliax cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT singhpulkit cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT sarmagopal cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT friedmansamuelf cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT diachillepaolo cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT diamantnathaniel cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT cunninghamjonathanw cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT turnerashbyc cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT lauemilys cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT haimovichjulians cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT alalusimostafaa cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT wangxin cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT klarqvistmarcusdr cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT ashburnerjeffreym cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT diedrichchristian cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT ghadessimercedeh cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT mielkejohanna cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT 
eilkenhannam cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT mcelhinneyalice cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT derixandrea cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT atlasstevenj cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT ellinorpatrickt cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT philippakisanthonya cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT andersonchristopherd cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT hojennifere cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT batrapuneet cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch AT lubitzstevena cohortdesignandnaturallanguageprocessingtoreducebiasinelectronichealthrecordsresearch |