
Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage

INTRODUCTION: The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique using Bloom filters with cryptographic analyses, modifications, and hashing variations to optimise privacy has been the focus...


Bibliographic Details
Main Authors: Brown, AP, Randall, SM, Boyd, JH, Ferrante, AM
Format: Online Article Text
Language: English
Published: Swansea University 2019
Subjects: Population Data Science
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7482522/
https://www.ncbi.nlm.nih.gov/pubmed/32935029
http://dx.doi.org/10.23889/ijpds.v4i1.1095
author Brown, AP
Randall, SM
Boyd, JH
Ferrante, AM
author_facet Brown, AP
Randall, SM
Boyd, JH
Ferrante, AM
author_sort Brown, AP
collection PubMed
description INTRODUCTION: The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique using Bloom filters, with cryptographic analyses, modifications, and hashing variations to optimise privacy, has been the focus of much research in this area. With few applications of Bloom filters within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality. OBJECTIVES: In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of record linkage: Sørensen–Dice coefficient, Jaccard similarity and Hamming distance. METHODS: Using synthetic datasets with introduced errors to simulate a range of data quality, together with a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves. The resulting linkage quality was compared with that of linkages using simple cut-off similarity values and linkages using exact matching only. RESULTS: Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen–Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets. CONCLUSION: The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages. Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset.
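The three comparison methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bigram tokenisation, the filter length of 256 bits, the four hash functions and the SHA-1/SHA-256 double-hashing scheme are all assumptions chosen for illustration, since this record does not state the study's parameters. Each Bloom filter is represented as the set of bit positions that are set to 1.

```python
import hashlib


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(value: str, size: int = 256, num_hashes: int = 4) -> set[int]:
    """Return the positions of the bits set to 1 for a field value.

    Assumption: double hashing (h1 + i * h2) mod size with SHA-1 and
    SHA-256; the parameters are illustrative, not taken from the study.
    """
    positions: set[int] = set()
    for gram in bigrams(value):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def hamming_distance(a: set[int], b: set[int]) -> int:
    """Hamming distance: number of bit positions where the filters differ."""
    return len(a ^ b)


if __name__ == "__main__":
    f1 = bloom_filter("smith")
    f2 = bloom_filter("smyth")
    print("Dice:   ", round(dice(f1, f2), 3))
    print("Jaccard:", round(jaccard(f1, f2), 3))
    print("Hamming:", hamming_distance(f1, f2))
```

Dice and Jaccard are both ratios of shared set bits and are related by J = D / (2 - D), whereas the Hamming distance is an absolute count that also depends on the filter length, so it usually needs normalising before it can be compared across fields.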
format Online
Article
Text
id pubmed-7482522
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Swansea University
record_format MEDLINE/PubMed
spelling pubmed-7482522 2020-09-14 Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage Brown, AP Randall, SM Boyd, JH Ferrante, AM Int J Popul Data Sci Population Data Science Swansea University 2019-05-23 /pmc/articles/PMC7482522/ /pubmed/32935029 http://dx.doi.org/10.23889/ijpds.v4i1.1095 Text en https://creativecommons.org/licences/by/4.0/ This work is licenced under a Creative Commons Attribution 4.0 International License.
title Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage
topic Population Data Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7482522/
https://www.ncbi.nlm.nih.gov/pubmed/32935029
http://dx.doi.org/10.23889/ijpds.v4i1.1095