Estimating the success of re-identifications in incomplete datasets using generative models

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We her...

Bibliographic Details
Main Authors: Rocher, Luc; Hendrickx, Julien M.; de Montjoye, Yves-Alexandre
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6650473/
https://www.ncbi.nlm.nih.gov/pubmed/31337762
http://dx.doi.org/10.1038/s41467-019-10933-3
_version_ 1783438135325097984
author Rocher, Luc
Hendrickx, Julien M.
de Montjoye, Yves-Alexandre
collection PubMed
description While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
format Online
Article
Text
id pubmed-6650473
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-6650473 2019-07-25 Estimating the success of re-identifications in incomplete datasets using generative models Rocher, Luc Hendrickx, Julien M. de Montjoye, Yves-Alexandre Nat Commun Article While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model. Nature Publishing Group UK 2019-07-23 /pmc/articles/PMC6650473/ /pubmed/31337762 http://dx.doi.org/10.1038/s41467-019-10933-3 Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material.
If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Estimating the success of re-identifications in incomplete datasets using generative models
topic Article