EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records

Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising app...

Bibliographic Details
Main authors: Yoon, Jinsung; Mizrahi, Michel; Ghalaty, Nahid Farhady; Jarvinen, Thomas; Ravi, Ashwin S.; Brune, Peter; Kong, Fanyu; Anderson, Dave; Lee, George; Meir, Arie; Bandukwala, Farhana; Kanal, Elli; Arık, Sercan Ö.; Pfister, Tomas
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10421926/
https://www.ncbi.nlm.nih.gov/pubmed/37567968
http://dx.doi.org/10.1038/s41746-023-00888-7
author Yoon, Jinsung
Mizrahi, Michel
Ghalaty, Nahid Farhady
Jarvinen, Thomas
Ravi, Ashwin S.
Brune, Peter
Kong, Fanyu
Anderson, Dave
Lee, George
Meir, Arie
Bandukwala, Farhana
Kanal, Elli
Arık, Sercan Ö.
Pfister, Tomas
collection PubMed
description Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach for such privacy concerns is to instead use synthetic data. We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of numerical and categorical features with distinct characteristics, and time-varying features with highly-varying sequence lengths. Under numerous evaluations, we demonstrate that the fidelity of EHR-Safe is almost-identical with real data (<3% accuracy difference for the models trained on them) while yielding almost-ideal performance in practical privacy metrics.
format Online
Article
Text
id pubmed-10421926
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10421926 2023-08-13 NPJ Digit Med Article
Nature Publishing Group UK 2023-08-11 /pmc/articles/PMC10421926/ /pubmed/37567968 http://dx.doi.org/10.1038/s41746-023-00888-7 Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.
title EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records
topic Article