Validating a membership disclosure metric for synthetic health data

BACKGROUND: One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy that an adversary would correctly ascertain that a target individual from the same population as the real data is in...

Bibliographic Details
Main authors:	El Emam, Khaled; Mosquera, Lucy; Fang, Xi
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Research and Applications
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9553223/ https://www.ncbi.nlm.nih.gov/pubmed/36238080 http://dx.doi.org/10.1093/jamiaopen/ooac083

author	El Emam, Khaled; Mosquera, Lucy; Fang, Xi
collection	PubMed
description	BACKGROUND: One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy that an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE: Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. RESULTS: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS: Our proposed parameterization, as well as interpretation and generative model training guidance provide a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
format	Online Article Text
id	pubmed-9553223
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9553223 2022-10-12 Validating a membership disclosure metric for synthetic health data El Emam, Khaled Mosquera, Lucy Fang, Xi JAMIA Open Research and Applications Oxford University Press 2022-10-11 /pmc/articles/PMC9553223/ /pubmed/36238080 http://dx.doi.org/10.1093/jamiaopen/ooac083 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of the American Medical Informatics Association. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Validating a membership disclosure metric for synthetic health data
topic	Research and Applications
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9553223/ https://www.ncbi.nlm.nih.gov/pubmed/36238080 http://dx.doi.org/10.1093/jamiaopen/ooac083

Similar Items