Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients’ trust. Herein, we present a new method to generate synthetic data of individual gra...
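Main authors: | Guillaudeux, Morgan; Rousseau, Olivia; Petot, Julien; Bennis, Zineb; Dein, Charles-Axel; Goronflot, Thomas; Vince, Nicolas; Limou, Sophie; Karakachoff, Matilde; Wargny, Matthieu; Gourraud, Pierre-Antoine |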
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10006164/ https://www.ncbi.nlm.nih.gov/pubmed/36899082 http://dx.doi.org/10.1038/s41746-023-00771-5 |
_version_ | 1784905253729075200 |
---|---|
author | Guillaudeux, Morgan Rousseau, Olivia Petot, Julien Bennis, Zineb Dein, Charles-Axel Goronflot, Thomas Vince, Nicolas Limou, Sophie Karakachoff, Matilde Wargny, Matthieu Gourraud, Pierre-Antoine |
author_sort | Guillaudeux, Morgan |
collection | PubMed |
description | While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients’ trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients’ privacy. Developed for sensitive biomedical data, the method is patient-centric: it uses a local model to generate random new synthetic data, called “avatar data”, for each initial sensitive individual. This method, compared with two other synthetic data generation techniques (Synthpop, CT-GAN), is applied to real health data from a clinical trial and a cancer observational study to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing the computation of additional privacy metrics. In light of distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment’s effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39–0.63] vs. avatar HR = 0.40 [95% CI, 0.31–0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach. |
format | Online Article Text |
id | pubmed-10006164 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10006164 2023-03-12 NPJ Digit Med Article Nature Publishing Group UK 2023-03-10 /pmc/articles/PMC10006164/ /pubmed/36899082 http://dx.doi.org/10.1038/s41746-023-00771-5 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
title | Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10006164/ https://www.ncbi.nlm.nih.gov/pubmed/36899082 http://dx.doi.org/10.1038/s41746-023-00771-5 |