Measuring re-identification risk using a synthetic estimator to enable data sharing
BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators m...
Main Authors: | Jiang, Yangdi; Mosquera, Lucy; Jiang, Bei; Kong, Linglong; El Emam, Khaled |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9205507/ https://www.ncbi.nlm.nih.gov/pubmed/35714132 http://dx.doi.org/10.1371/journal.pone.0269097 |
_version_ | 1784729148048015360 |
---|---|
author | Jiang, Yangdi; Mosquera, Lucy; Jiang, Bei; Kong, Linglong; El Emam, Khaled |
author_sort | Jiang, Yangdi |
collection | PubMed |
description | BACKGROUND: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. Re-identification risk metrics are used to demonstrate that the risk of re-identification is acceptably low. There is a dearth of good risk estimators that model the attack scenario in which an adversary selects a record from the microdata sample and attempts to match it with individuals in the population. OBJECTIVES: Develop an accurate risk estimator for the sample-to-population attack. METHODS: An estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated, in terms of estimation error, through a simulation on four different datasets. Two estimators were considered, one using a Gaussian copula and one using a d-vine copula. They were compared against three other estimators proposed in the literature. RESULTS: The average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of how estimator accuracy varies with the accuracy of the input parameters provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset. CONCLUSIONS: The average of the two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared. |
format | Online Article Text |
id | pubmed-9205507 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-9205507 2022-06-18 Measuring re-identification risk using a synthetic estimator to enable data sharing Jiang, Yangdi; Mosquera, Lucy; Jiang, Bei; Kong, Linglong; El Emam, Khaled PLoS One Research Article Public Library of Science 2022-06-17 /pmc/articles/PMC9205507/ /pubmed/35714132 http://dx.doi.org/10.1371/journal.pone.0269097 Text en © 2022 Jiang et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Measuring re-identification risk using a synthetic estimator to enable data sharing |
title_sort | measuring re-identification risk using a synthetic estimator to enable data sharing |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9205507/ https://www.ncbi.nlm.nih.gov/pubmed/35714132 http://dx.doi.org/10.1371/journal.pone.0269097 |
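
The abstract in this record describes estimating re-identification risk for a sample-to-population attack by matching sample records against a synthetic version of the population, then averaging two such estimates. The record does not give the formula, so the following is only a minimal sketch under stated assumptions: risk is taken to be the mean of 1/F over sample records, where F is the number of population records sharing the same quasi-identifier values, and the synthetic population is assumed to be already generated (the copula fitting step is not shown). The function names, field names, and toy data are hypothetical.

```python
from collections import Counter


def sample_to_population_risk(sample, population, quasi_identifiers):
    """Mean of 1/F over sample records (assumed metric, not from the record).

    F is the size of the record's equivalence class in `population`,
    i.e. how many population records share its quasi-identifier values.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    class_sizes = Counter(key(rec) for rec in population)
    risks = []
    for rec in sample:
        # Assumption: if the synthetic population has no matching record,
        # treat the class size as 1 (the sampled person is in the real population).
        size = max(class_sizes.get(key(rec), 0), 1)
        risks.append(1.0 / size)
    return sum(risks) / len(risks)


def average_estimate(risk_a, risk_b):
    """Average two risk estimates, e.g. from two different synthetic populations."""
    return (risk_a + risk_b) / 2.0


# Toy data: hypothetical quasi-identifiers "age_band" and "sex".
population = [
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "F"},
    {"age_band": "30-39", "sex": "M"},
    {"age_band": "40-49", "sex": "F"},
    {"age_band": "40-49", "sex": "F"},
    {"age_band": "40-49", "sex": "F"},
]
sample = [
    {"age_band": "30-39", "sex": "F"},  # class size 2, risk 1/2
    {"age_band": "30-39", "sex": "M"},  # class size 1, risk 1
]

risk = sample_to_population_risk(sample, population, ["age_band", "sex"])
print(risk)
```

In the setting the abstract describes, `population` would be replaced by each of the two copula-based synthetic populations in turn, and the two resulting values passed to `average_estimate`.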