Accuracy of Probabilistic Linkage Using the Enhanced Matching System for Public Health and Epidemiological Studies

BACKGROUND: The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health...

Bibliographic Details
Main Authors: Aldridge, Robert W.; Shaji, Kunju; Hayward, Andrew C.; Abubakar, Ibrahim
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547731/
https://www.ncbi.nlm.nih.gov/pubmed/26302242
http://dx.doi.org/10.1371/journal.pone.0136179
author Aldridge, Robert W.
Shaji, Kunju
Hayward, Andrew C.
Abubakar, Ibrahim
collection PubMed
description BACKGROUND: The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets. METHODS: EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters. RESULTS: Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review. CONCLUSION: With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. In scenarios where a small sample is being matched into a very large database (such as national records of hospital attendance) then, compared to results presented in this analysis, the positive predictive value or sensitivity may drop according to the prevalence of matches between databases. 
Despite this possible limitation, probabilistic linkage has great potential to be used where exact matching using a common identifier is not possible, including in low-income settings, and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality has historically hindered the ability to identify individuals across datasets.
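The accuracy measures quoted in the description follow from a 2x2 comparison of EMS links against the NHS-number gold standard. The sketch below is illustrative only: the four cell counts are assumptions back-calculated to be consistent with the reported totals (1071 gold-standard pairs, 1068 EMS pairs) and the reported percentages, not figures taken from the paper, and the names `LinkageCounts`, `accuracy_measures`, and `ppv_at_prevalence` are hypothetical. The second function applies Bayes' rule to show why the positive predictive value may fall when true matches are rare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageCounts:
    """Cells of a 2x2 table comparing linkage output with a gold standard."""
    tp: int  # linked by EMS and confirmed by the gold standard
    fp: int  # linked by EMS but not confirmed
    fn: int  # gold-standard pair that EMS missed
    tn: int  # correctly left unlinked


def accuracy_measures(c: LinkageCounts) -> dict:
    """Return the four standard diagnostic accuracy measures as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def ppv_at_prevalence(sensitivity: float, specificity: float,
                      prevalence: float) -> float:
    """Bayes' rule: PPV for a given proportion of true matches."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# ASSUMED counts, chosen so that tp + fn = 1071 and tp + fp = 1068.
# The tn value in particular is a guess; the abstract does not state it.
assumed = LinkageCounts(tp=1066, fp=2, fn=5, tn=4858)
measures = accuracy_measures(assumed)

if __name__ == "__main__":
    for name, value in measures.items():
        print(f"{name}: {value * 100:.1f}%")
    # Same sensitivity and specificity, progressively rarer true matches.
    for prevalence in (0.2, 0.01, 0.001):
        ppv = ppv_at_prevalence(measures["sensitivity"],
                                measures["specificity"], prevalence)
        print(f"prevalence {prevalence}: PPV {ppv:.3f}")
```

Holding sensitivity and specificity fixed while lowering the prevalence isolates the effect the description warns about when a small sample is matched into a very large database.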
format Online
Article
Text
id pubmed-4547731
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4547731 2015-09-01 PLoS One Research Article Public Library of Science 2015-08-24 /pmc/articles/PMC4547731/ /pubmed/26302242 http://dx.doi.org/10.1371/journal.pone.0136179 Text en © 2015 Aldridge et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Accuracy of Probabilistic Linkage Using the Enhanced Matching System for Public Health and Epidemiological Studies
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547731/
https://www.ncbi.nlm.nih.gov/pubmed/26302242
http://dx.doi.org/10.1371/journal.pone.0136179