A method for managing re-identification risk from small geographic areas in Canada

BACKGROUND: A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion....

Bibliographic Details
Main authors:	El Emam, Khaled, Brown, Ann, AbdelMalik, Philip, Neisa, Angelica, Walker, Mark, Bottomley, Jim, Roffey, Tyson
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858714/ https://www.ncbi.nlm.nih.gov/pubmed/20361870 http://dx.doi.org/10.1186/1472-6947-10-18

_version_	1782180440330731520
author	El Emam, Khaled; Brown, Ann; AbdelMalik, Philip; Neisa, Angelica; Walker, Mark; Bottomley, Jim; Roffey, Tyson
author_sort	El Emam, Khaled
collection	PubMed
description	BACKGROUND: A common disclosure control practice for health datasets is to identify small geographic areas and either suppress records from these small areas or aggregate them into larger ones. A recent study provided a method for deciding when an area is too small based on the uniqueness criterion. The uniqueness criterion stipulates that an area is no longer too small when the proportion of unique individuals on the relevant variables (the quasi-identifiers) approaches zero. However, a uniqueness value of zero is a stringent threshold, suitable only when the risks from data disclosure are quite high. Other uniqueness thresholds that have been proposed for health data are 5% and 20%. METHODS: We estimated uniqueness for urban Forward Sortation Areas (FSAs) by using the 2001 long form Canadian census data representing 20% of the population. We then constructed two logistic regression models to predict when the uniqueness is greater than the 5% and 20% thresholds, and validated their predictive accuracy using 10-fold cross-validation. Predictor variables included the population size of the FSA and the maximum number of possible values on the quasi-identifiers (the number of equivalence classes). RESULTS: All model parameters were significant and the models had very high prediction accuracy, with specificity above 0.9, and sensitivity at 0.87 and 0.74 for the 5% and 20% threshold models respectively. The application of the models was illustrated with an analysis of the Ontario newborn registry and an emergency department dataset. At the higher thresholds, considerably fewer records than at the 0% threshold would be considered to be in small areas and therefore undergo disclosure control actions. 
We have also included concrete guidance for data custodians in deciding which one of the three uniqueness thresholds to use (0%, 5%, 20%), depending on the mitigating controls that the data recipients have in place, the potential invasion of privacy if the data is disclosed, and the motives and capacity of the data recipient to re-identify the data. CONCLUSION: The models we developed can be used to manage the re-identification risk from small geographic areas. Being able to choose among three possible thresholds, a data custodian can adjust the definition of "small geographic area" to the nature of the data and recipient.
format	Text
id	pubmed-2858714
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2858714 2010-04-23 A method for managing re-identification risk from small geographic areas in Canada El Emam, Khaled Brown, Ann AbdelMalik, Philip Neisa, Angelica Walker, Mark Bottomley, Jim Roffey, Tyson BMC Med Inform Decis Mak Research Article BioMed Central 2010-04-02 /pmc/articles/PMC2858714/ /pubmed/20361870 http://dx.doi.org/10.1186/1472-6947-10-18 Text en Copyright ©2010 El Emam et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article El Emam, Khaled Brown, Ann AbdelMalik, Philip Neisa, Angelica Walker, Mark Bottomley, Jim Roffey, Tyson A method for managing re-identification risk from small geographic areas in Canada
title	A method for managing re-identification risk from small geographic areas in Canada
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858714/ https://www.ncbi.nlm.nih.gov/pubmed/20361870 http://dx.doi.org/10.1186/1472-6947-10-18
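
The uniqueness criterion named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the paper estimates uniqueness from census data and predicts threshold exceedance with logistic regression, whereas this sketch only counts, within one toy dataset, the proportion of records whose quasi-identifier combination occurs exactly once, and compares it with the 0%, 5% and 20% thresholds. The function names, field names and records are hypothetical.

```python
from collections import Counter
from math import prod


def uniqueness(records, quasi_identifiers):
    """Proportion of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


def max_equivalence_classes(values_per_quasi_identifier):
    """Maximum number of equivalence classes: the product of the number of
    possible values of each quasi-identifier."""
    return prod(values_per_quasi_identifier)


# Hypothetical toy records for a single geographic area (assumption, not source data).
records = [
    {"sex": "F", "age_group": "30-39"},
    {"sex": "F", "age_group": "30-39"},
    {"sex": "M", "age_group": "30-39"},
    {"sex": "M", "age_group": "40-49"},
    {"sex": "F", "age_group": "40-49"},
]

u = uniqueness(records, ["sex", "age_group"])

# Compare against the three thresholds discussed in the abstract.
exceeds = {threshold: u > threshold for threshold in (0.0, 0.05, 0.20)}

# For example, 2 sexes and 5 age groups give at most 10 equivalence classes.
classes = max_equivalence_classes([2, 5])
```

In this toy area three of the five records are unique on sex and age group, so the area would count as small under all three thresholds. Real use would need population-level estimates rather than counts taken from the dataset itself.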
