No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have deve...

Bibliographic Details
Main authors: Tahamont, Sarah; Jelveh, Zubin; McNeill, Melissa; Yan, Shi; Chalfin, Aaron; Hansen, Benjamin
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072450/
https://www.ncbi.nlm.nih.gov/pubmed/37014897
http://dx.doi.org/10.1371/journal.pone.0283811
author Tahamont, Sarah
Jelveh, Zubin
McNeill, Melissa
Yan, Shi
Chalfin, Aaron
Hansen, Benjamin
collection PubMed
description While linking records across large administrative datasets [“big data”] has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when an algorithm has access to “ground-truth” examples—matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground truth examples. But critically, in many real-world applications, only a relatively small number of tactically-selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground truth examples using a readily available off-the-shelf tool.
format Online
Article
Text
id pubmed-10072450
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10072450 2023-04-05 No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile Tahamont, Sarah Jelveh, Zubin McNeill, Melissa Yan, Shi Chalfin, Aaron Hansen, Benjamin PLoS One Research Article
Public Library of Science 2023-04-04 /pmc/articles/PMC10072450/ /pubmed/37014897 http://dx.doi.org/10.1371/journal.pone.0283811 Text en © 2023 Tahamont et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10072450/
https://www.ncbi.nlm.nih.gov/pubmed/37014897
http://dx.doi.org/10.1371/journal.pone.0283811