Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis

While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients’ trust. Herein, we present a new method to generate synthetic data of individual gra...

Bibliographic Details
Main Authors: Guillaudeux, Morgan; Rousseau, Olivia; Petot, Julien; Bennis, Zineb; Dein, Charles-Axel; Goronflot, Thomas; Vince, Nicolas; Limou, Sophie; Karakachoff, Matilde; Wargny, Matthieu; Gourraud, Pierre-Antoine
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10006164/
https://www.ncbi.nlm.nih.gov/pubmed/36899082
http://dx.doi.org/10.1038/s41746-023-00771-5
author Guillaudeux, Morgan
Rousseau, Olivia
Petot, Julien
Bennis, Zineb
Dein, Charles-Axel
Goronflot, Thomas
Vince, Nicolas
Limou, Sophie
Karakachoff, Matilde
Wargny, Matthieu
Gourraud, Pierre-Antoine
collection PubMed
description While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients’ trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients’ privacy. Developed for sensitive biomedical data, the method is patient-centric, as it uses a local model to generate random new synthetic data, called “avatar data”, for each initial sensitive individual. This method, compared with two other synthetic data generation techniques (Synthpop, CT-GAN), is applied to real health data from a clinical trial and a cancer observational study to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while also allowing additional privacy metrics to be computed. In light of distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment’s effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39–0.63] vs. avatar HR = 0.40 [95% CI, 0.31–0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach.
format Online
Article
Text
id pubmed-10006164
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10006164 2023-03-12 Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
Guillaudeux, Morgan; Rousseau, Olivia; Petot, Julien; Bennis, Zineb; Dein, Charles-Axel; Goronflot, Thomas; Vince, Nicolas; Limou, Sophie; Karakachoff, Matilde; Wargny, Matthieu; Gourraud, Pierre-Antoine
NPJ Digit Med Article
Nature Publishing Group UK 2023-03-10 /pmc/articles/PMC10006164/ /pubmed/36899082 http://dx.doi.org/10.1038/s41746-023-00771-5 Text en
© The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10006164/
https://www.ncbi.nlm.nih.gov/pubmed/36899082
http://dx.doi.org/10.1038/s41746-023-00771-5