Comparing record linkage software programs and algorithms using real-world data
Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods f...
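Main Authors: | Karr, Alan F.; Taylor, Matthew T.; West, Suzanne L.; Setoguchi, Soko; Kou, Tzuyung D.; Gerhard, Tobias; Horton, Daniel B. |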
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6759179/ https://www.ncbi.nlm.nih.gov/pubmed/31550255 http://dx.doi.org/10.1371/journal.pone.0221459 |
_version_ | 1783453652529184768 |
---|---|
author | Karr, Alan F. Taylor, Matthew T. West, Suzanne L. Setoguchi, Soko Kou, Tzuyung D. Gerhard, Tobias Horton, Daniel B. |
author_facet | Karr, Alan F. Taylor, Matthew T. West, Suzanne L. Setoguchi, Soko Kou, Tzuyung D. Gerhard, Tobias Horton, Daniel B. |
author_sort | Karr, Alan F. |
collection | PubMed |
description | Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms. |
format | Online Article Text |
id | pubmed-6759179 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-6759179 2019-10-04 Comparing record linkage software programs and algorithms using real-world data Karr, Alan F. Taylor, Matthew T. West, Suzanne L. Setoguchi, Soko Kou, Tzuyung D. Gerhard, Tobias Horton, Daniel B. PLoS One Research Article Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records, respectively. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching. Linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms. Public Library of Science 2019-09-24 /pmc/articles/PMC6759179/ /pubmed/31550255 http://dx.doi.org/10.1371/journal.pone.0221459 Text en © 2019 Karr et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Karr, Alan F. Taylor, Matthew T. West, Suzanne L. Setoguchi, Soko Kou, Tzuyung D. Gerhard, Tobias Horton, Daniel B. Comparing record linkage software programs and algorithms using real-world data |
title | Comparing record linkage software programs and algorithms using real-world data |
title_full | Comparing record linkage software programs and algorithms using real-world data |
title_fullStr | Comparing record linkage software programs and algorithms using real-world data |
title_full_unstemmed | Comparing record linkage software programs and algorithms using real-world data |
title_short | Comparing record linkage software programs and algorithms using real-world data |
title_sort | comparing record linkage software programs and algorithms using real-world data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6759179/ https://www.ncbi.nlm.nih.gov/pubmed/31550255 http://dx.doi.org/10.1371/journal.pone.0221459 |
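
The description above mentions blocking, exact matching, evaluation against a gold standard (medical record number), and an ensemble that averages scaled matching weights. The following is a minimal sketch of those ideas on invented toy records, not the method of any of the 4 packages the study compared. The record layout, the agreement-count weight, the second "name-heavy" weight, and min-max scaling are all assumptions made for illustration; the abstract does not say how the weights were computed or scaled.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records: (record id, first name, last name, gender, DOB, MRN).
inpatient = [
    ("i1", "ANNA", "LEE", "F", "1980-02-01", "m1"),
    ("i2", "JOHN", "SMITH", "M", "1975-07-12", "m2"),
    ("i3", "MARIA", "LOPEZ", "F", "1980-02-01", "m3"),
]
outpatient = [
    ("o1", "ANNA", "LEE", "F", "1980-02-01", "m1"),
    ("o2", "JON", "SMITH", "M", "1975-07-12", "m2"),
    ("o3", "MARIA", "LOPEZ", "F", "1980-02-01", "m4"),
    ("o4", "JOHN", "SMITH", "M", "1990-01-01", "m5"),
]


def candidate_pairs(left, right, block_key):
    """Blocking: yield only the record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)
    for rec in left:
        for other in blocks.get(block_key(rec), []):
            yield rec, other


def exact_weight(a, b):
    """Assumed weight: number of exact agreements on name, gender, and DOB."""
    return sum(a[i] == b[i] for i in (1, 2, 3, 4))


def name_heavy_weight(a, b):
    """Assumed alternative weight that counts name agreements twice."""
    return 2 * (a[1] == b[1]) + 2 * (a[2] == b[2]) + (a[3] == b[3]) + (a[4] == b[4])


def evaluate(weights, truth, threshold):
    """Sensitivity and PPV of the rule 'link if weight >= threshold'."""
    linked = {pair for pair, w in weights.items() if w >= threshold}
    tp = len(linked & truth)
    sensitivity = tp / len(truth) if truth else 0.0
    ppv = tp / len(linked) if linked else 0.0
    return sensitivity, ppv


def min_max_scale(weights):
    """Rescale one set of weights to [0, 1] (an assumed scaling choice)."""
    lo, hi = min(weights.values()), max(weights.values())
    span = hi - lo
    return {k: ((w - lo) / span if span else 0.0) for k, w in weights.items()}


def ensemble(*weight_maps):
    """Average the scaled weights over pairs scored by every input map."""
    scaled = [min_max_scale(m) for m in weight_maps]
    keys = set.intersection(*(set(m) for m in scaled))
    return {k: sum(m[k] for m in scaled) / len(scaled) for k in keys}


# Gold standard: pairs whose medical record numbers agree.
truth = {(a[0], b[0]) for a, b in product(inpatient, outpatient) if a[5] == b[5]}

# Block on full DOB; use `lambda r: r[4][:4]` to block on year of birth instead.
pairs = list(candidate_pairs(inpatient, outpatient, lambda r: r[4]))
weights = {(a[0], b[0]): exact_weight(a, b) for a, b in pairs}
alt_weights = {(a[0], b[0]): name_heavy_weight(a, b) for a, b in pairs}
combined = ensemble(weights, alt_weights)

print(evaluate(weights, truth, threshold=4))
print(sorted(combined.items(), key=lambda kv: -kv[1]))
```

In this toy setup, lowering the threshold raises sensitivity at the cost of PPV, which is the trade-off the description reports across ranked groups of record pairs. Inexact string matching (for example, a similarity score in place of the `==` comparisons) is left out to keep the sketch short.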