Estimating the re-identification risk of clinical data sets

Bibliographic Details
Main authors: Dankar, Fida Kamal, El Emam, Khaled, Neisa, Angelica, Roffey, Tyson
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583146/
https://www.ncbi.nlm.nih.gov/pubmed/22776564
http://dx.doi.org/10.1186/1472-6947-12-66
_version_ 1782260671346376704
author Dankar, Fida Kamal
El Emam, Khaled
Neisa, Angelica
Roffey, Tyson
author_facet Dankar, Fida Kamal
El Emam, Khaled
Neisa, Angelica
Roffey, Tyson
author_sort Dankar, Fida Kamal
collection PubMed
description BACKGROUND: De-identification is a common way to protect patient privacy when disclosing clinical data for secondary purposes, such as research. One type of attack that de-identification protects against is linking the disclosed patient data with public and semi-public registries. Uniqueness is a commonly used measure of re-identification risk under this attack. If uniqueness can be measured accurately, then the risk from this kind of attack can be managed. In practice, it is often not possible to measure uniqueness directly, so it must be estimated. METHODS: We evaluated the accuracy of uniqueness estimators on clinically relevant data sets. Four candidate estimators were selected, either because earlier evaluations had found them to be accurate or because they were new and had not previously been compared: the Zayatz estimator, the slide negative binomial estimator, Pitman’s estimator, and mu-argus. A Monte Carlo simulation was performed to evaluate the uniqueness estimators on six clinically relevant data sets. We varied the sampling fraction and the uniqueness in the population (the value being estimated). The median relative error and inter-quartile range of the uniqueness estimates were measured across 1000 runs. RESULTS: No single estimator performed well across all of the conditions. We developed a decision rule that selects among the Pitman, slide negative binomial, and Zayatz estimators depending on the sampling fraction and the difference between estimates. This decision rule had the most consistently low median relative error across multiple conditions and data sets. CONCLUSION: This study identified an accurate decision rule that can be used by health privacy researchers and disclosure control professionals to estimate uniqueness in clinical data sets. The decision rule provides a reliable way to measure re-identification risk.
format Online
Article
Text
id pubmed-3583146
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3583146 2013-03-11 Estimating the re-identification risk of clinical data sets Dankar, Fida Kamal El Emam, Khaled Neisa, Angelica Roffey, Tyson BMC Med Inform Decis Mak Research Article BioMed Central 2012-07-09 /pmc/articles/PMC3583146/ /pubmed/22776564 http://dx.doi.org/10.1186/1472-6947-12-66 Text en Copyright © 2012 Dankar et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Dankar, Fida Kamal
El Emam, Khaled
Neisa, Angelica
Roffey, Tyson
Estimating the re-identification risk of clinical data sets
title Estimating the re-identification risk of clinical data sets
title_full Estimating the re-identification risk of clinical data sets
title_fullStr Estimating the re-identification risk of clinical data sets
title_full_unstemmed Estimating the re-identification risk of clinical data sets
title_short Estimating the re-identification risk of clinical data sets
title_sort estimating the re-identification risk of clinical data sets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583146/
https://www.ncbi.nlm.nih.gov/pubmed/22776564
http://dx.doi.org/10.1186/1472-6947-12-66
work_keys_str_mv AT dankarfidakamal estimatingthereidentificationriskofclinicaldatasets
AT elemamkhaled estimatingthereidentificationriskofclinicaldatasets
AT neisaangelica estimatingthereidentificationriskofclinicaldatasets
AT roffeytyson estimatingthereidentificationriskofclinicaldatasets
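
The Monte Carlo evaluation summarized in the description field (draw samples at a given sampling fraction, estimate population uniqueness, and report the median relative error and inter-quartile range across runs) can be sketched roughly as follows. This is an illustration only, not the authors' code: `naive_estimate` simply reuses sample uniqueness as a stand-in and is not the Zayatz, slide negative binomial, Pitman, or mu-argus estimator, and the function names, the synthetic population, and its quasi-identifiers are all hypothetical.

```python
import random
import statistics
from collections import Counter


def uniqueness(records):
    """Fraction of records whose quasi-identifier value occurs exactly once."""
    counts = Counter(records)
    uniques = sum(1 for c in counts.values() if c == 1)
    return uniques / len(records)


def naive_estimate(sample):
    """Placeholder estimator (assumption): sample uniqueness as a proxy
    for population uniqueness. It is NOT one of the record's estimators."""
    return uniqueness(sample)


def simulate(population, sampling_fraction, runs=1000, seed=0):
    """Return (median relative error, inter-quartile range) across runs."""
    rng = random.Random(seed)
    truth = uniqueness(population)
    if truth == 0:
        raise ValueError("population has no unique records")
    n = max(1, round(sampling_fraction * len(population)))
    errors = []
    for _ in range(runs):
        sample = rng.sample(population, n)  # sampling without replacement
        errors.append((naive_estimate(sample) - truth) / truth)
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return statistics.median(errors), q3 - q1


if __name__ == "__main__":
    # Hypothetical quasi-identifiers: (year of birth, sex, region code).
    rng = random.Random(42)
    population = [
        (rng.randint(1930, 2000), rng.choice("MF"), rng.randint(1, 50))
        for _ in range(2000)
    ]
    for fraction in (0.1, 0.3, 0.5):
        med, iqr = simulate(population, fraction, runs=200)
        print(f"fraction={fraction}: median rel. error={med:.3f}, IQR={iqr:.3f}")
```

To compare real estimators under this harness, one would replace `naive_estimate` with an implementation of the estimator in question and keep the rest of the loop unchanged.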