Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals fro...

Bibliographic Details
Main Authors:	El Emam, Khaled; Mosquera, Lucy; Bass, Jason
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7704280/ https://www.ncbi.nlm.nih.gov/pubmed/33196453 http://dx.doi.org/10.2196/23139

_version_	1783616788834025472
author	El Emam, Khaled; Mosquera, Lucy; Bass, Jason
author_facet	El Emam, Khaled; Mosquera, Lucy; Bass, Jason
author_sort	El Emam, Khaled
collection	PubMed
description	BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
format	Online Article Text
id	pubmed-7704280
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-77042802020-12-04 Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation El Emam, Khaled Mosquera, Lucy Bass, Jason J Med Internet Res Original Paper BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this “meaningful identity disclosure risk.” The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data. JMIR Publications 2020-11-16 /pmc/articles/PMC7704280/ /pubmed/33196453 http://dx.doi.org/10.2196/23139 Text en ©Khaled El Emam, Lucy Mosquera, Jason Bass. 
Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.11.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Original Paper El Emam, Khaled Mosquera, Lucy Bass, Jason Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title	Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title_full	Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title_fullStr	Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title_full_unstemmed	Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title_short	Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation
title_sort	evaluating identity disclosure risk in fully synthetic health data: model development and validation
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7704280/ https://www.ncbi.nlm.nih.gov/pubmed/33196453 http://dx.doi.org/10.2196/23139
work_keys_str_mv	AT elemamkhaled evaluatingidentitydisclosureriskinfullysynthetichealthdatamodeldevelopmentandvalidation AT mosqueralucy evaluatingidentitydisclosureriskinfullysynthetichealthdatamodeldevelopmentandvalidation AT bassjason evaluatingidentitydisclosureriskinfullysynthetichealthdatamodeldevelopmentandvalidation

Similar Items