Identifying who has long COVID in the USA: a machine learning approach using N3C data

BACKGROUND: Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies...

Bibliographic Details
Main authors: Pfaff, Emily R; Girvin, Andrew T; Bennett, Tellen D; Bhatia, Abhishek; Brooks, Ian M; Deer, Rachel R; Dekermanjian, Jonathan P; Jolley, Sarah Elizabeth; Kahn, Michael G; Kostka, Kristin; McMurry, Julie A; Moffitt, Richard; Walden, Anita; Chute, Christopher G; Haendel, Melissa A
Format: Online Article Text
Language: English
Published: The Author(s). Published by Elsevier Ltd. 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9110014/
https://www.ncbi.nlm.nih.gov/pubmed/35589549
http://dx.doi.org/10.1016/S2589-7500(22)00048-6
author Pfaff, Emily R
Girvin, Andrew T
Bennett, Tellen D
Bhatia, Abhishek
Brooks, Ian M
Deer, Rachel R
Dekermanjian, Jonathan P
Jolley, Sarah Elizabeth
Kahn, Michael G
Kostka, Kristin
McMurry, Julie A
Moffitt, Richard
Walden, Anita
Chute, Christopher G
Haendel, Melissa A
collection PubMed
description BACKGROUND: Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies of electronic health records are a crucial element of the US National Institutes of Health's RECOVER Initiative, which is addressing the urgent need to understand long COVID, identify treatments, and accurately identify who has it; the last of these is the aim of this study.
METHODS: Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, we developed XGBoost machine learning models to identify potential patients with long COVID. We defined our base population (n=1 793 604) as any non-deceased adult patient (age ≥18 years) with either an International Classification of Diseases-10-Clinical Modification COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days have passed since COVID-19 index date. We examined demographics, health-care utilisation, diagnoses, and medications for 97 995 adults with COVID-19. We used data on these features and 597 patients from a long COVID clinic to train three machine learning models to identify potential long COVID among all patients with COVID-19, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised. Feature importance was determined via Shapley values. We further validated the models on data from a fourth site.
FINDINGS: Our models identified, with high accuracy, patients who potentially have long COVID, achieving areas under the receiver operator characteristic curve of 0·92 (all patients), 0·90 (hospitalised), and 0·85 (non-hospitalised). Important features, as defined by Shapley values, include rate of health-care utilisation, patient age, dyspnoea, and other diagnosis and medication information available within the electronic health record.
INTERPRETATION: Patients identified by our models as potentially having long COVID can be interpreted as patients warranting care at a specialty clinic for long COVID, which is an essential proxy for long COVID diagnosis as its definition continues to evolve. We also achieve the urgent goal of identifying potential long COVID in patients for clinical trials. As more data sources are identified, our models can be retrained and tuned based on the needs of individual studies.
FUNDING: US National Institutes of Health and National Center for Advancing Translational Sciences through the RECOVER Initiative.
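The base-population criteria stated in the abstract (non-deceased adult, a U07.1 diagnosis from an inpatient or emergency visit or a positive PCR or antigen test, and at least 90 days since the index date) can be read as a simple eligibility filter. The following is a minimal sketch of that reading only, not the authors' code: the `Patient` record layout, the field names, and the `in_base_population` helper are hypothetical and do not reflect the N3C data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Thresholds and codes taken from the abstract's cohort definition.
MIN_AGE_YEARS = 18
MIN_DAYS_SINCE_INDEX = 90
COVID_DX_CODE = "U07.1"
QUALIFYING_VISIT_TYPES = {"inpatient", "emergency"}


@dataclass
class Patient:
    """Hypothetical patient record; field names are illustrative assumptions."""
    age_years: int
    deceased: bool
    dx_code: Optional[str]        # diagnosis code, if any
    dx_visit_type: Optional[str]  # visit type on which the diagnosis was recorded
    positive_test: bool           # positive SARS-CoV-2 PCR or antigen test
    index_date: Optional[date]    # COVID-19 index date


def in_base_population(p: Patient, as_of: date) -> bool:
    """Return True if the patient meets the abstract's base-population criteria."""
    # Non-deceased adults with a known index date only.
    if p.deceased or p.age_years < MIN_AGE_YEARS or p.index_date is None:
        return False
    # Either a U07.1 diagnosis from an inpatient or emergency visit...
    has_dx = p.dx_code == COVID_DX_CODE and p.dx_visit_type in QUALIFYING_VISIT_TYPES
    # ...or a positive PCR or antigen test.
    if not (has_dx or p.positive_test):
        return False
    # At least 90 days must have passed since the index date.
    return (as_of - p.index_date).days >= MIN_DAYS_SINCE_INDEX
```

For example, under these assumptions a 45-year-old with a U07.1 code from an inpatient visit and an index date more than 90 days before `as_of` passes the filter, while the same code recorded only on an outpatient visit, with no positive test, does not.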
format Online
Article
Text
id pubmed-9110014
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher The Author(s). Published by Elsevier Ltd.
record_format MEDLINE/PubMed
spelling pubmed-9110014 2022-05-17 Identifying who has long COVID in the USA: a machine learning approach using N3C data. Lancet Digit Health. Articles. The Author(s). Published by Elsevier Ltd. 2022-07 2022-05-16 /pmc/articles/PMC9110014/ /pubmed/35589549 http://dx.doi.org/10.1016/S2589-7500(22)00048-6 Text en
© 2022 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY-NC-ND 4.0 license.
Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company's public news and information website. Elsevier hereby grants permission to make all its COVID-19-related research that is available on the COVID-19 resource centre, including this research content, immediately available in PubMed Central and other publicly funded repositories, such as the WHO COVID database, with rights for unrestricted research re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for free by Elsevier for as long as the COVID-19 resource centre remains active.
title Identifying who has long COVID in the USA: a machine learning approach using N3C data
topic Articles