Efficient and effective pruning strategies for health data de-identification

BACKGROUND: Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, where datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically...

Bibliographic Details
Main authors:	Prasser, Fabian; Kohlmayer, Florian; Kuhn, Klaus A.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Technical Advance
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4851781/ https://www.ncbi.nlm.nih.gov/pubmed/27130179 http://dx.doi.org/10.1186/s12911-016-0287-2

_version_	1782429859992043520
author	Prasser, Fabian Kohlmayer, Florian Kuhn, Klaus A.
author_facet	Prasser, Fabian Kohlmayer, Florian Kuhn, Klaus A.
author_sort	Prasser, Fabian
collection	PubMed
description	BACKGROUND: Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, where datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically, de-identification methods search a solution space of possible data transformations to find a good solution to a given de-identification problem. In this process, parts of the search space must be excluded to maintain scalability. OBJECTIVES: The set of transformations which are solution candidates is typically narrowed down by storing the results obtained during the search process and then using them to predict properties of the output of other transformations in terms of privacy (first objective) and data quality (second objective). However, due to the exponential growth of the size of the search space, previous implementations of this method are not well-suited when datasets contain many attributes which need to be protected. As this is often the case with biomedical research data, e.g. as a result of longitudinal collection, we have developed a novel method. METHODS: Our approach combines the mathematical concept of antichains with a data structure inspired by prefix trees to represent properties of a large number of data transformations while requiring only a minimal amount of information to be stored. To analyze the improvements which can be achieved by adopting our method, we have integrated it into an existing algorithm and we have also implemented a simple best-first branch and bound search (BFS) algorithm as a first step towards methods which fully exploit our approach. We have evaluated these implementations with several real-world datasets and the k-anonymity privacy model. 
RESULTS: When integrated into existing de-identification algorithms for low-dimensional data, our approach reduced memory requirements by up to one order of magnitude and execution times by up to 25 %. This allowed us to increase the size of solution spaces which could be processed by almost a factor of 10. When using the simple BFS method, we were able to further increase the size of the solution space by a factor of three. When used as a heuristic strategy for high-dimensional data, the BFS approach outperformed a state-of-the-art algorithm by up to 12 % in terms of the quality of output data. CONCLUSIONS: This work shows that implementing methods of data de-identification for real-world applications is a challenging task. Our approach solves a problem often faced by data custodians: a lack of scalability of de-identification software when used with datasets having realistic schemas and volumes. The method described in this article has been implemented into ARX, an open source de-identification software for biomedical data.
format	Online Article Text
id	pubmed-4851781
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4851781 2016-05-01 Efficient and effective pruning strategies for health data de-identification Prasser, Fabian Kohlmayer, Florian Kuhn, Klaus A. BMC Med Inform Decis Mak Technical Advance BioMed Central 2016-04-30 /pmc/articles/PMC4851781/ /pubmed/27130179 http://dx.doi.org/10.1186/s12911-016-0287-2 Text en © Prasser et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Technical Advance Prasser, Fabian Kohlmayer, Florian Kuhn, Klaus A. Efficient and effective pruning strategies for health data de-identification
title	Efficient and effective pruning strategies for health data de-identification
title_full	Efficient and effective pruning strategies for health data de-identification
title_fullStr	Efficient and effective pruning strategies for health data de-identification
title_full_unstemmed	Efficient and effective pruning strategies for health data de-identification
title_short	Efficient and effective pruning strategies for health data de-identification
title_sort	efficient and effective pruning strategies for health data de-identification
topic	Technical Advance
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4851781/ https://www.ncbi.nlm.nih.gov/pubmed/27130179 http://dx.doi.org/10.1186/s12911-016-0287-2
work_keys_str_mv	AT prasserfabian efficientandeffectivepruningstrategiesforhealthdatadeidentification AT kohlmayerflorian efficientandeffectivepruningstrategiesforhealthdatadeidentification AT kuhnklausa efficientandeffectivepruningstrategiesforhealthdatadeidentification
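
The METHODS part of the abstract describes storing search results compactly by combining antichains with a prefix-tree-like structure, then using them to predict the privacy properties of other transformations. The Python sketch below illustrates that general idea only; it is not the authors' code and not the ARX implementation. It assumes that each transformation is a tuple of generalization levels (one per attribute) and that the privacy model is monotonic, i.e. every generalization of an anonymous transformation is also anonymous, as is commonly assumed for k-anonymity under full-domain generalization. The names `AntichainTrie` and `PruningState` are hypothetical.

```python
import operator


class AntichainTrie:
    """Prefix tree holding an antichain of level tuples (illustrative only).

    minimal=True  keeps minimal elements; covers(t) asks "is some stored s <= t?"
    minimal=False keeps maximal elements; covers(t) asks "is some stored s >= t?"
    """

    def __init__(self, dims, minimal=True):
        self.dims = dims
        self.root = {}  # nested dicts: level -> child node
        self._implies = operator.le if minimal else operator.ge
        self._superseded = operator.ge if minimal else operator.le

    def covers(self, t):
        """True if a stored element already implies the property for t."""
        def walk(node, depth):
            if depth == self.dims:
                return True
            return any(walk(child, depth + 1)
                       for level, child in node.items()
                       if self._implies(level, t[depth]))
        return walk(self.root, 0)

    def add(self, t):
        """Store t unless it is implied; drop elements that t makes redundant."""
        if self.covers(t):
            return False
        self._drop(self.root, t, 0)
        node = self.root
        for level in t:
            node = node.setdefault(level, {})
        return True

    def _drop(self, node, t, depth):
        for level in [l for l in node if self._superseded(l, t[depth])]:
            child = node[level]
            if depth < self.dims - 1:
                self._drop(child, t, depth + 1)
            if depth == self.dims - 1 or not child:
                del node[level]  # remove leaf or emptied branch

    def elements(self):
        out = []

        def walk(node, prefix):
            if len(prefix) == self.dims:
                out.append(tuple(prefix))
                return
            for level, child in node.items():
                walk(child, prefix + [level])
        walk(self.root, [])
        return sorted(out)


class PruningState:
    """Predicts anonymity of unchecked transformations from stored results."""

    def __init__(self, dims):
        self.anonymous = AntichainTrie(dims, minimal=True)
        self.not_anonymous = AntichainTrie(dims, minimal=False)

    def record(self, t, is_anonymous):
        (self.anonymous if is_anonymous else self.not_anonymous).add(t)

    def predict(self, t):
        """Return True/False if the result is implied, None if t must be checked."""
        if self.anonymous.covers(t):
            return True
        if self.not_anonymous.covers(t):
            return False
        return None


state = PruningState(dims=3)
state.record((1, 1, 0), True)    # checked: anonymous
state.record((0, 2, 0), False)   # checked: not anonymous
print(state.predict((2, 1, 1)))  # generalization of (1, 1, 0)
print(state.predict((0, 1, 0)))  # specialization of (0, 2, 0)
print(state.predict((0, 0, 1)))  # not implied by anything stored
```

Only the minimal anonymous and maximal non-anonymous transformations are kept, so memory grows with the size of the two antichains rather than with the number of transformations visited. A search strategy such as the best-first branch and bound mentioned in the abstract could call `predict` before evaluating a transformation and skip it whenever the result is not `None`.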
