
Measuring re-identification risk using a synthetic estimator to enable data sharing

BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators m...


Bibliographic Details
Main authors: Jiang, Yangdi; Mosquera, Lucy; Jiang, Bei; Kong, Linglong; El Emam, Khaled
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9205507/
https://www.ncbi.nlm.nih.gov/pubmed/35714132
http://dx.doi.org/10.1371/journal.pone.0269097
author Jiang, Yangdi
Mosquera, Lucy
Jiang, Bei
Kong, Linglong
El Emam, Khaled
collection PubMed
description BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population. OBJECTIVES: Develop an accurate risk estimator for the sample-to-population attack. METHODS: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature. RESULTS: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset. CONCLUSIONS: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
format Online
Article
Text
id pubmed-9205507
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9205507 2022-06-18 PLoS One Research Article
Public Library of Science 2022-06-17 /pmc/articles/PMC9205507/ /pubmed/35714132 http://dx.doi.org/10.1371/journal.pone.0269097 Text en © 2022 Jiang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Measuring re-identification risk using a synthetic estimator to enable data sharing
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9205507/
https://www.ncbi.nlm.nih.gov/pubmed/35714132
http://dx.doi.org/10.1371/journal.pone.0269097
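
The sample-to-population attack named in the abstract can be illustrated with a minimal sketch. It assumes one common formulation of the risk (the average, over sample records, of one divided by the number of matching population records on the quasi-identifiers); the record does not give the authors' exact formula, and the copula-based synthesis of the population is not reproduced here. The function names, the `quasi_identifiers` parameter, and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean


def sample_to_population_risk(sample, population, quasi_identifiers):
    """Average over sample records of 1 / (matching population class size).

    `population` stands in for the synthetic population that the abstract
    says is generated; how it is generated is outside this sketch.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    # Size of each equivalence class in the (stand-in) population.
    class_sizes = Counter(key(record) for record in population)

    risks = []
    for record in sample:
        size = class_sizes.get(key(record), 0)
        # Assumption: a sample record with no population match is treated
        # as unique (risk 1.0), which is the conservative choice.
        risks.append(1.0 / size if size > 0 else 1.0)
    return mean(risks)


def averaged_estimate(estimate_a, estimate_b):
    """Average two risk estimates, as the abstract does for the two copulas."""
    return (estimate_a + estimate_b) / 2.0


# Toy data (hypothetical): two population classes of size 2 and 4.
population = (
    [{"age": 30, "sex": "F"}] * 2
    + [{"age": 40, "sex": "M"}] * 4
)
sample = [population[0], population[2]]

risk = sample_to_population_risk(sample, population, ["age", "sex"])
print(risk)  # (1/2 + 1/4) / 2 = 0.375
```

In this sketch the two estimates passed to `averaged_estimate` would come from running `sample_to_population_risk` against two differently synthesized populations; a single hard-coded population is used above only to keep the example self-contained.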