Estimating parameters for probabilistic linkage of privacy-preserved datasets
BACKGROUND: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimate...
Main authors: | Brown, Adrian P.; Randall, Sean M.; Ferrante, Anna M.; Semmens, James B.; Boyd, James H. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504757/ https://www.ncbi.nlm.nih.gov/pubmed/28693507 http://dx.doi.org/10.1186/s12874-017-0370-0 |
_version_ | 1783249340462006272 |
---|---|
author | Brown, Adrian P.; Randall, Sean M.; Ferrante, Anna M.; Semmens, James B.; Boyd, James H. |
author_sort | Brown, Adrian P. |
collection | PubMed |
description | BACKGROUND: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. METHODS: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. RESULTS: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. CONCLUSIONS: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets. |
format | Online Article Text |
id | pubmed-5504757 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5504757 2017-07-12 Estimating parameters for probabilistic linkage of privacy-preserved datasets Brown, Adrian P. Randall, Sean M. Ferrante, Anna M. Semmens, James B. Boyd, James H. BMC Med Res Methodol Research Article BioMed Central 2017-07-10 /pmc/articles/PMC5504757/ /pubmed/28693507 http://dx.doi.org/10.1186/s12874-017-0370-0 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Brown, Adrian P. Randall, Sean M. Ferrante, Anna M. Semmens, James B. Boyd, James H. Estimating parameters for probabilistic linkage of privacy-preserved datasets |
title | Estimating parameters for probabilistic linkage of privacy-preserved datasets |
title_sort | estimating parameters for probabilistic linkage of privacy-preserved datasets |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504757/ https://www.ncbi.nlm.nih.gov/pubmed/28693507 http://dx.doi.org/10.1186/s12874-017-0370-0 |
work_keys_str_mv | AT brownadrianp estimatingparametersforprobabilisticlinkageofprivacypreserveddatasets AT randallseanm estimatingparametersforprobabilisticlinkageofprivacypreserveddatasets AT ferranteannam estimatingparametersforprobabilisticlinkageofprivacypreserveddatasets AT semmensjamesb estimatingparametersforprobabilisticlinkageofprivacypreserveddatasets AT boydjamesh estimatingparametersforprobabilisticlinkageofprivacypreserveddatasets |
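
The abstract describes estimating match probabilities with the expectation-maximisation (EM) algorithm and then estimating linkage quality at each candidate threshold. The sketch below illustrates that general idea under stated assumptions; it is not the authors' implementation. It assumes each record pair has already been reduced to a 0/1 agreement vector per field (for example by applying a cut-off to a Bloom filter similarity score, a step not shown here), that fields are conditionally independent, and that the F-measure at a threshold can be approximated from the EM posteriors. All function names, starting values and the synthetic data are hypothetical.

```python
import math
import random


def simulate_patterns(n_pairs, match_rate, true_m, true_u, seed=0):
    """Draw synthetic 0/1 agreement vectors (hypothetical test data)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_pairs):
        probs = true_m if rng.random() < match_rate else true_u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
    return patterns


def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) so ratios and logs stay finite."""
    return min(max(x, eps), 1.0 - eps)


def em_fellegi_sunter(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate match proportion p and per-field m/u probabilities by EM."""
    n_fields = len(patterns[0])
    m, u = [m_init] * n_fields, [u_init] * n_fields
    g = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for k, agree in enumerate(gamma):
                pm *= m[k] if agree else 1.0 - m[k]
                pu *= u[k] if agree else 1.0 - u[k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the posteriors.
        total_match = sum(g)
        total_non = len(g) - total_match
        for k in range(n_fields):
            agree_match = sum(gj for gj, gamma in zip(g, patterns) if gamma[k])
            agree_non = sum(1.0 - gj for gj, gamma in zip(g, patterns) if gamma[k])
            m[k] = clamp(agree_match / total_match)
            u[k] = clamp(agree_non / total_non)
        p = clamp(total_match / len(g))
    # g holds the posteriors from the final E-step.
    return p, m, u, g


def match_weight(gamma, m, u):
    """Sum of log2 agreement/disagreement weights for one pair."""
    return sum(
        math.log2(m[k] / u[k]) if agree else math.log2((1.0 - m[k]) / (1.0 - u[k]))
        for k, agree in enumerate(gamma)
    )


def estimated_f_measure(weights, g, threshold):
    """Approximate F-measure at a threshold using posteriors as soft labels."""
    tp = sum(gj for w, gj in zip(weights, g) if w >= threshold)
    fp = sum(1.0 - gj for w, gj in zip(weights, g) if w >= threshold)
    fn = sum(gj for w, gj in zip(weights, g) if w < threshold)
    denom = 2.0 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


if __name__ == "__main__":
    pats = simulate_patterns(10000, 0.1, [0.95, 0.9, 0.92, 0.97], [0.02, 0.05, 0.1, 0.01])
    p, m, u, g = em_fellegi_sunter(pats)
    weights = [match_weight(gamma, m, u) for gamma in pats]
    best = max(set(weights), key=lambda t: estimated_f_measure(weights, g, t))
    print("p:", round(p, 3), "m:", [round(x, 3) for x in m], "u:", [round(x, 3) for x in u])
    print("threshold with highest estimated F-measure:", round(best, 2))
```

Because no true match status is needed at any step, a procedure of this shape can run entirely on privacy-preserved comparisons, which is the situation the abstract addresses. The initial values (`m_init` above `u_init`) steer EM towards labelling the high-agreement class as the matches.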