
Efficient and effective pruning strategies for health data de-identification

BACKGROUND: Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, where datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically...


Bibliographic Details
Main authors: Prasser, Fabian; Kohlmayer, Florian; Kuhn, Klaus A.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4851781/
https://www.ncbi.nlm.nih.gov/pubmed/27130179
http://dx.doi.org/10.1186/s12911-016-0287-2
_version_ 1782429859992043520
author Prasser, Fabian
Kohlmayer, Florian
Kuhn, Klaus A.
author_facet Prasser, Fabian
Kohlmayer, Florian
Kuhn, Klaus A.
author_sort Prasser, Fabian
collection PubMed
description BACKGROUND: Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, in which datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically, de-identification methods search a solution space of possible data transformations to find a good solution to a given de-identification problem. In this process, parts of the search space must be excluded to maintain scalability.
OBJECTIVES: The set of transformations which are solution candidates is typically narrowed down by storing the results obtained during the search process and then using them to predict properties of the output of other transformations in terms of privacy (first objective) and data quality (second objective). However, due to the exponential growth of the size of the search space, previous implementations of this method are not well suited when datasets contain many attributes which need to be protected. As this is often the case with biomedical research data, e.g. as a result of longitudinal collection, we have developed a novel method.
METHODS: Our approach combines the mathematical concept of antichains with a data structure inspired by prefix trees to represent properties of a large number of data transformations while requiring only a minimal amount of information to be stored. To analyze the improvements which can be achieved by adopting our method, we have integrated it into an existing algorithm and we have also implemented a simple best-first branch and bound search (BFS) algorithm as a first step towards methods which fully exploit our approach. We have evaluated these implementations with several real-world datasets and the k-anonymity privacy model.
RESULTS: When integrated into existing de-identification algorithms for low-dimensional data, our approach reduced memory requirements by up to one order of magnitude and execution times by up to 25%. This allowed us to increase the size of solution spaces which could be processed by almost a factor of 10. When using the simple BFS method, we were able to further increase the size of the solution space by a factor of three. When used as a heuristic strategy for high-dimensional data, the BFS approach outperformed a state-of-the-art algorithm by up to 12% in terms of the quality of output data.
CONCLUSIONS: This work shows that implementing methods of data de-identification for real-world applications is a challenging task. Our approach solves a problem often faced by data custodians: a lack of scalability of de-identification software when used with datasets having realistic schemas and volumes. The method described in this article has been integrated into ARX, an open-source de-identification tool for biomedical data.
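The idea named in METHODS, antichains stored in a prefix-tree-like structure, can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the ARX API. It assumes that each transformation is a tuple of generalization levels (one per attribute) and that the privacy model is monotonic, so every transformation at or above an anonymous one is also anonymous, and every transformation at or below a non-anonymous one is also non-anonymous. All names (`AntichainTrie`, `TagStore`, `record`, `predict`) are hypothetical.

```python
import operator
from typing import Callable, Dict, Iterator, Optional, Tuple

Vector = Tuple[int, ...]  # assumed encoding: one generalization level per attribute


class AntichainTrie:
    """Antichain of level vectors kept in a prefix tree built from nested dicts.

    With keeps_minimal=True only minimal vectors are kept (lower bounds);
    with keeps_minimal=False only maximal vectors are kept (upper bounds).
    All vectors are assumed to have the same non-zero length.
    """

    def __init__(self, keeps_minimal: bool) -> None:
        self._root: Dict[int, dict] = {}
        # A stored level "covers" a query level if it is <= (minimal mode) or >= (maximal mode).
        self._covers: Callable[[int, int], bool] = operator.le if keeps_minimal else operator.ge
        # A stored level is "made redundant" by a new level under the opposite comparison.
        self._redundant: Callable[[int, int], bool] = operator.ge if keeps_minimal else operator.le

    def covers(self, query: Vector) -> bool:
        """True if some stored vector bounds `query` in every component."""
        return self._search(self._root, query, 0)

    def _search(self, node: Dict[int, dict], query: Vector, depth: int) -> bool:
        if depth == len(query):
            return True
        return any(
            self._covers(level, query[depth]) and self._search(child, query, depth + 1)
            for level, child in node.items()
        )

    def add(self, vector: Vector) -> bool:
        """Insert `vector` unless it is already implied; drop entries it implies."""
        if self.covers(vector):
            return False
        self._drop_redundant(self._root, vector, 0)
        node = self._root
        for level in vector:
            node = node.setdefault(level, {})
        return True

    def _drop_redundant(self, node: Dict[int, dict], vector: Vector, depth: int) -> None:
        for level in list(node):
            if not self._redundant(level, vector[depth]):
                continue
            if depth + 1 == len(vector):
                del node[level]
            else:
                self._drop_redundant(node[level], vector, depth + 1)
                if not node[level]:
                    del node[level]  # remove branches left without any stored vector

    def __iter__(self) -> Iterator[Vector]:
        stack = [((), self._root)]
        while stack:
            prefix, node = stack.pop()
            if not node:
                if prefix:
                    yield prefix
            else:
                stack.extend((prefix + (level,), child) for level, child in node.items())


class TagStore:
    """Predicts the privacy property of unchecked transformations from stored results."""

    def __init__(self) -> None:
        self.anonymous = AntichainTrie(keeps_minimal=True)
        self.non_anonymous = AntichainTrie(keeps_minimal=False)

    def record(self, vector: Vector, is_anonymous: bool) -> None:
        (self.anonymous if is_anonymous else self.non_anonymous).add(vector)

    def predict(self, vector: Vector) -> Optional[bool]:
        if self.anonymous.covers(vector):
            return True
        if self.non_anonymous.covers(vector):
            return False
        return None  # unknown: the transformation still has to be checked


store = TagStore()
store.record((1, 0, 2), True)
store.record((0, 0, 1), False)
print(store.predict((2, 1, 2)))  # True
print(store.predict((0, 0, 0)))  # False
print(store.predict((0, 1, 2)))  # None
```

In this sketch, recording `(1, 0, 1)` as anonymous would replace `(1, 0, 2)`, because the new entry implies the old one. That replacement is how the stored information stays small.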
format Online
Article
Text
id pubmed-4851781
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4851781 2016-05-01 Efficient and effective pruning strategies for health data de-identification Prasser, Fabian Kohlmayer, Florian Kuhn, Klaus A. BMC Med Inform Decis Mak Technical Advance BioMed Central 2016-04-30 /pmc/articles/PMC4851781/ /pubmed/27130179 http://dx.doi.org/10.1186/s12911-016-0287-2 Text en © Prasser et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Technical Advance
Prasser, Fabian
Kohlmayer, Florian
Kuhn, Klaus A.
Efficient and effective pruning strategies for health data de-identification
title Efficient and effective pruning strategies for health data de-identification
title_sort efficient and effective pruning strategies for health data de-identification
topic Technical Advance
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4851781/
https://www.ncbi.nlm.nih.gov/pubmed/27130179
http://dx.doi.org/10.1186/s12911-016-0287-2
work_keys_str_mv AT prasserfabian efficientandeffectivepruningstrategiesforhealthdatadeidentification
AT kohlmayerflorian efficientandeffectivepruningstrategiesforhealthdatadeidentification
AT kuhnklausa efficientandeffectivepruningstrategiesforhealthdatadeidentification