Estimating the success of re-identifications in incomplete datasets using generative models
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We her...
Main authors: | Rocher, Luc; Hendrickx, Julien M.; de Montjoye, Yves-Alexandre |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6650473/ https://www.ncbi.nlm.nih.gov/pubmed/31337762 http://dx.doi.org/10.1038/s41467-019-10933-3 |
_version_ | 1783438135325097984 |
---|---|
author | Rocher, Luc Hendrickx, Julien M. de Montjoye, Yves-Alexandre |
author_facet | Rocher, Luc Hendrickx, Julien M. de Montjoye, Yves-Alexandre |
author_sort | Rocher, Luc |
collection | PubMed |
description | While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model. |
format | Online Article Text |
id | pubmed-6650473 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-6650473 2019-07-25 Estimating the success of re-identifications in incomplete datasets using generative models Rocher, Luc Hendrickx, Julien M. de Montjoye, Yves-Alexandre Nat Commun Article While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model. Nature Publishing Group UK 2019-07-23 /pmc/articles/PMC6650473/ /pubmed/31337762 http://dx.doi.org/10.1038/s41467-019-10933-3 Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Rocher, Luc Hendrickx, Julien M. de Montjoye, Yves-Alexandre Estimating the success of re-identifications in incomplete datasets using generative models |
title | Estimating the success of re-identifications in incomplete datasets using generative models |
title_full | Estimating the success of re-identifications in incomplete datasets using generative models |
title_fullStr | Estimating the success of re-identifications in incomplete datasets using generative models |
title_full_unstemmed | Estimating the success of re-identifications in incomplete datasets using generative models |
title_short | Estimating the success of re-identifications in incomplete datasets using generative models |
title_sort | estimating the success of re-identifications in incomplete datasets using generative models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6650473/ https://www.ncbi.nlm.nih.gov/pubmed/31337762 http://dx.doi.org/10.1038/s41467-019-10933-3 |
work_keys_str_mv | AT rocherluc estimatingthesuccessofreidentificationsinincompletedatasetsusinggenerativemodels AT hendrickxjulienm estimatingthesuccessofreidentificationsinincompletedatasetsusinggenerativemodels AT demontjoyeyvesalexandre estimatingthesuccessofreidentificationsinincompletedatasetsusinggenerativemodels |
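
The description field in this record refers to estimating how likely a specific person is to be unique in a population. As a rough illustration of that idea only, the sketch below computes the chance that nobody else among `n` people shares a record whose attribute combination has probability `p`. It assumes attributes are independent, which is a simplification and not the copula-based model the record describes; the function names and the `marginals` structure are hypothetical.

```python
import math


def record_probability(record, marginals):
    """Probability of one attribute combination.

    Assumption (for illustration): attributes are independent, so the joint
    probability is the product of the marginal frequencies.
    """
    p = 1.0
    for attribute, value in record.items():
        p *= marginals[attribute][value]
    return p


def uniqueness_likelihood(p, n):
    """Probability that none of the other n - 1 people match a record
    that occurs with probability p, i.e. (1 - p) ** (n - 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    if p == 1.0:
        # Everyone shares the record, so it is unique only when n == 1.
        return 1.0 if n == 1 else 0.0
    # log1p keeps precision when p is tiny and n is very large.
    return math.exp((n - 1) * math.log1p(-p))


if __name__ == "__main__":
    # Hypothetical marginal frequencies, not taken from the article.
    marginals = {
        "gender": {"F": 0.5, "M": 0.5},
        "birth_month": {"Jan": 1 / 12},
        "postcode": {"A": 0.001},
    }
    record = {"gender": "F", "birth_month": "Jan", "postcode": "A"}
    p = record_probability(record, marginals)
    print(uniqueness_likelihood(p, n=10_000))
```

Adding attributes shrinks `p`, which pushes the uniqueness likelihood towards 1; that is the intuition behind the record's remark that a modest number of demographic attributes can single out most individuals.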