Theoretical limits of microclustering for record linkage

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large num...


Bibliographic Details
Main authors: Johndrow, J E, Lum, K, Dunson, D B
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963577/
https://www.ncbi.nlm.nih.gov/pubmed/29880978
http://dx.doi.org/10.1093/biomet/asy003
_version_ 1783325049686589440
author Johndrow, J E
Lum, K
Dunson, D B
collection PubMed
description There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.
format Online
Article
Text
id pubmed-5963577
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5963577 2019-06-01 Theoretical limits of microclustering for record linkage Johndrow, J E Lum, K Dunson, D B Biometrika Articles Oxford University Press 2018-06 2018-03-19 /pmc/articles/PMC5963577/ /pubmed/29880978 http://dx.doi.org/10.1093/biomet/asy003 Text en © 2018 Biometrika Trust http://academic.oup.com/journals/pages/about_us/legal/notices This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
title Theoretical limits of microclustering for record linkage
topic Articles