
Comparing record linkage software programs and algorithms using real-world data

Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods f...


Bibliographic Details
Main authors: Karr, Alan F., Taylor, Matthew T., West, Suzanne L., Setoguchi, Soko, Kou, Tzuyung D., Gerhard, Tobias, Horton, Daniel B.
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6759179/
https://www.ncbi.nlm.nih.gov/pubmed/31550255
http://dx.doi.org/10.1371/journal.pone.0221459
_version_ 1783453652529184768
author Karr, Alan F.
Taylor, Matthew T.
West, Suzanne L.
Setoguchi, Soko
Kou, Tzuyung D.
Gerhard, Tobias
Horton, Daniel B.
author_facet Karr, Alan F.
Taylor, Matthew T.
West, Suzanne L.
Setoguchi, Soko
Kou, Tzuyung D.
Gerhard, Tobias
Horton, Daniel B.
author_sort Karr, Alan F.
collection PubMed
description Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.
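The abstract mentions an ensemble built by averaging scaled matching weights and evaluated against a gold standard, but it does not say how the weights were scaled. The Python sketch below is one possible reading, not the authors' method: it assumes min-max scaling onto [0, 1], and the function names, record identifiers, toy weights, and the 0.25 threshold are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def min_max_scale(weights):
    """Rescale a {pair: weight} dict onto [0, 1] (assumed scaling scheme)."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights identical: nothing to rank, so map everything to 0.
        return {pair: 0.0 for pair in weights}
    return {pair: (w - lo) / (hi - lo) for pair, w in weights.items()}


def ensemble_weights(runs):
    """Average the scaled weights of pairs that every run scored."""
    scaled = [min_max_scale(run) for run in runs]
    common = set.intersection(*(set(s) for s in scaled))
    return {pair: mean(s[pair] for s in scaled) for pair in common}


def sensitivity_ppv(predicted, true_matches):
    """Sensitivity and PPV of predicted links against gold-standard links."""
    tp = len(predicted & true_matches)
    sensitivity = tp / len(true_matches) if true_matches else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical weights from two linkage runs on different scales.
run_a = {("i1", "o1"): 12.0, ("i2", "o2"): 8.0, ("i3", "o9"): 2.0}
run_b = {("i1", "o1"): 0.9, ("i2", "o2"): 0.3, ("i3", "o9"): 0.5}

combined = ensemble_weights([run_a, run_b])

# Hypothetical decision rule: accept pairs whose averaged weight is >= 0.25.
predicted = {pair for pair, w in combined.items() if w >= 0.25}

# Hypothetical gold standard (pairs sharing a medical record number).
gold = {("i1", "o1"), ("i2", "o2"), ("i4", "o4")}

sens, ppv = sensitivity_ppv(predicted, gold)
print(f"sensitivity={sens:.2f} ppv={ppv:.2f}")
```

Scaling before averaging matters because packages report weights on different ranges; without it, the run with the widest range would dominate the average. Restricting the ensemble to pairs scored by every run is a further simplification made here.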
format Online
Article
Text
id pubmed-6759179
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-67591792019-10-04 Comparing record linkage software programs and algorithms using real-world data Karr, Alan F. Taylor, Matthew T. West, Suzanne L. Setoguchi, Soko Kou, Tzuyung D. Gerhard, Tobias Horton, Daniel B. PLoS One Research Article Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. 
In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms. Public Library of Science 2019-09-24 /pmc/articles/PMC6759179/ /pubmed/31550255 http://dx.doi.org/10.1371/journal.pone.0221459 Text en © 2019 Karr et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Karr, Alan F.
Taylor, Matthew T.
West, Suzanne L.
Setoguchi, Soko
Kou, Tzuyung D.
Gerhard, Tobias
Horton, Daniel B.
Comparing record linkage software programs and algorithms using real-world data
title Comparing record linkage software programs and algorithms using real-world data
title_full Comparing record linkage software programs and algorithms using real-world data
title_fullStr Comparing record linkage software programs and algorithms using real-world data
title_full_unstemmed Comparing record linkage software programs and algorithms using real-world data
title_short Comparing record linkage software programs and algorithms using real-world data
title_sort comparing record linkage software programs and algorithms using real-world data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6759179/
https://www.ncbi.nlm.nih.gov/pubmed/31550255
http://dx.doi.org/10.1371/journal.pone.0221459