Efficient Record Linkage Algorithms Using Complete Linkage Clustering

Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone...

Bibliographic Details
Main authors: Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4849582/
https://www.ncbi.nlm.nih.gov/pubmed/27124604
http://dx.doi.org/10.1371/journal.pone.0154446
_version_ 1782429555184631808
author Mamun, Abdullah-Al
Aseltine, Robert
Rajasekaran, Sanguthevar
author_sort Mamun, Abdullah-Al
collection PubMed
description Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy while consuming reasonable run times.
format Online
Article
Text
id pubmed-4849582
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4849582 2016-05-07 Efficient Record Linkage Algorithms Using Complete Linkage Clustering Mamun, Abdullah-Al Aseltine, Robert Rajasekaran, Sanguthevar PLoS One Research Article Public Library of Science 2016-04-28 /pmc/articles/PMC4849582/ /pubmed/27124604 http://dx.doi.org/10.1371/journal.pone.0154446 Text en © 2016 Mamun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Efficient Record Linkage Algorithms Using Complete Linkage Clustering
topic Research Article