
Estimating parameters for probabilistic linkage of privacy-preserved datasets

BACKGROUND: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimate...


Bibliographic Details
Main authors: Brown, Adrian P., Randall, Sean M., Ferrante, Anna M., Semmens, James B., Boyd, James H.
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504757/
https://www.ncbi.nlm.nih.gov/pubmed/28693507
http://dx.doi.org/10.1186/s12874-017-0370-0
author Brown, Adrian P.
Randall, Sean M.
Ferrante, Anna M.
Semmens, James B.
Boyd, James H.
collection PubMed
description BACKGROUND: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. METHODS: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. RESULTS: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. CONCLUSIONS: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
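The estimation steps named in the abstract (EM for match probabilities, then an estimated F-measure per threshold) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each field comparison has already been reduced to a binary agree/disagree flag (for example by thresholding a similarity score between two Bloom filters), that fields are conditionally independent, and that the expected F-measure can be computed from the EM posterior match probabilities. All function names, starting values, and the simulated data are hypothetical.

```python
import random
from collections import Counter


def em_match_probabilities(patterns, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate match prevalence p and per-field m/u probabilities by EM.

    patterns: list of tuples of 0/1 flags (1 = the field agrees for that pair).
    Returns (p, m, u, post), where post maps each distinct agreement pattern
    to its posterior match probability from the last E-step.
    """
    counts = Counter(patterns)          # EM only needs pattern frequencies
    n, k = len(patterns), len(patterns[0])
    m, u = [m_init] * k, [u_init] * k
    eps = 1e-6                          # keeps estimates away from 0 and 1

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    post = {}
    for _ in range(n_iter):
        # E-step: posterior probability that a pair with this pattern is a match.
        post = {}
        for gamma in counts:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(gamma):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            post[gamma] = like_m / (like_m + like_u)
        # M-step: re-estimate p, m and u from the expected match counts.
        total = max(sum(c * post[g] for g, c in counts.items()), eps)
        non_total = max(n - total, eps)
        p = clamp(total / n)
        for j in range(k):
            agree_m = sum(c * post[g] for g, c in counts.items() if g[j])
            agree_u = sum(c * (1.0 - post[g]) for g, c in counts.items() if g[j])
            m[j] = clamp(agree_m / total)
            u[j] = clamp(agree_u / non_total)
    return p, m, u, post


def estimated_f_measure(patterns, post, threshold):
    """Expected F-measure if pairs with posterior >= threshold are accepted."""
    tp = fp = fn = 0.0
    for gamma, c in Counter(patterns).items():
        g = post[gamma]
        if g >= threshold:
            tp += c * g                 # expected true matches accepted
            fp += c * (1.0 - g)         # expected non-matches accepted
        else:
            fn += c * g                 # expected true matches rejected
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def simulate_patterns(n, p, m, u, seed=0):
    """Draw synthetic agreement patterns with known parameters (illustration only)."""
    rng = random.Random(seed)
    patterns, truth = [], []
    for _ in range(n):
        is_match = rng.random() < p
        probs = m if is_match else u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
        truth.append(is_match)
    return patterns, truth
```

Calling `em_match_probabilities` on the output of `simulate_patterns` and then scanning `estimated_f_measure` over candidate thresholds gives a threshold choice that needs no access to the unencrypted values. The sketch thresholds on the posterior probability; for fixed estimates this is a monotone function of the usual summed match weight, so the resulting ordering of pairs is the same.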
format Online
Article
Text
id pubmed-5504757
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5504757 2017-07-12 Estimating parameters for probabilistic linkage of privacy-preserved datasets Brown, Adrian P. Randall, Sean M. Ferrante, Anna M. Semmens, James B. Boyd, James H. BMC Med Res Methodol Research Article BioMed Central 2017-07-10 /pmc/articles/PMC5504757/ /pubmed/28693507 http://dx.doi.org/10.1186/s12874-017-0370-0 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Estimating parameters for probabilistic linkage of privacy-preserved datasets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504757/
https://www.ncbi.nlm.nih.gov/pubmed/28693507
http://dx.doi.org/10.1186/s12874-017-0370-0