Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

Synthetic electronic health records (EHRs) that are both realistic and preserve privacy can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity and granular electronic health record (EHR) data in its original, highly-dim...

Bibliographic Details
Main Authors: Theodorou, Brandon; Xiao, Cao; Sun, Jimeng
Format: Online Article Text
Language: English
Published: American Journal Experts 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10029081/
https://www.ncbi.nlm.nih.gov/pubmed/36945542
http://dx.doi.org/10.21203/rs.3.rs-2644725/v1
_version_ 1784910073291603968
author Theodorou, Brandon
Xiao, Cao
Sun, Jimeng
collection PubMed
description Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity, granular EHR data in its original, high-dimensional form is challenging for existing methods because of the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data that preserves the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO, designed as a hierarchical autoregressive model, generates a probability density function over medical codes, clinical visits, and patient records, allowing it to produce realistic EHR data in its original, unaggregated form without variable selection or aggregation. The model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conduct extensive experiments and demonstrate that HALO generates high-fidelity EHR data, achieving an R² correlation above 0.9 with real EHR data on high-dimensional disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000). Compared with the leading baseline, HALO improves predictive accuracy and perplexity by over 17% on a held-out test set of real EHR data. This performance enables downstream ML models trained on its synthetic data to achieve accuracy comparable to models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, combining real and synthetic data improves the accuracy of ML models beyond that achieved with real EHR data alone.
format Online
Article
Text
id pubmed-10029081
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Journal Experts
record_format MEDLINE/PubMed
spelling pubmed-10029081 2023-03-22 Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model Theodorou, Brandon Xiao, Cao Sun, Jimeng Res Sq Article American Journal Experts 2023-03-10 /pmc/articles/PMC10029081/ /pubmed/36945542 http://dx.doi.org/10.21203/rs.3.rs-2644725/v1 Text en
License: This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10029081/
https://www.ncbi.nlm.nih.gov/pubmed/36945542
http://dx.doi.org/10.21203/rs.3.rs-2644725/v1