
Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis

OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage to identify TB duplicate records, as well as the characteristics of discordant pairs. METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algor...


Bibliographic Details
Main authors: de Oliveira, Gisele Pinto; Bierrenbach, Ana Luiza de Souza; de Camargo, Kenneth Rochel; Coeli, Cláudia Medina; Pinheiro, Rejane Sobrino
Format: Online Article Text
Language: English
Published: Faculdade de Saúde Pública da Universidade de São Paulo 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988803/
https://www.ncbi.nlm.nih.gov/pubmed/27556963
http://dx.doi.org/10.1590/S1518-8787.2016050006327
collection PubMed
description OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate tuberculosis (TB) records, and to describe the characteristics of discordant pairs. METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, each combining three or more fragments of the key variables, with or without modification (Soundex or substring). The probabilistic approach required a score cutoff above which links would be automatically classified as belonging to the same individual. The cutoff was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated to assess accuracy. RESULTS: Sensitivity was 87.2% for probabilistic and 95.2% for deterministic record linkage; specificity was 99.8% and 99.9%, respectively. Missing values in the key variables and low similarity scores for name and date of birth were the main reasons the techniques failed to identify records of the same individual. CONCLUSIONS: The two techniques showed a high level of correlation in pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User needs and experience should be considered when choosing between the techniques.
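The rule-based matching and the accuracy measures described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not list the 70 rules or name the key variables, so the field names (`name`, `mother_name`, `birth_date`), the two rules, and all helper functions below are hypothetical. The sketch only shows the general idea of a rule as a set of three or more fragments (Soundex codes or substrings) that must all agree and be non-missing, and how sensitivity and specificity could be computed against a manual review.

```python
import unicodedata

# Standard American Soundex digit groups.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def normalize(text):
    """Strip accents and non-letters, then uppercase (e.g. 'João' -> 'JOAO')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha()).upper()


def soundex(word):
    """Return the 4-character Soundex code, or '' for an empty input."""
    word = normalize(word)
    if not word:
        return ""
    result = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":  # H and W do not separate letters with the same code
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (result + "000")[:4]


def _token(value, index):
    """Return the first (0) or last (-1) word of a field, '' if missing."""
    parts = value.split()
    return parts[index] if parts else ""


# Hypothetical fragments: each maps a record to a comparable string.
def first_name_sx(rec): return soundex(_token(rec["name"], 0))
def last_name_sx(rec): return soundex(_token(rec["name"], -1))
def mother_first_sx(rec): return soundex(_token(rec["mother_name"], 0))
def birth_date(rec): return rec["birth_date"]
def birth_year(rec): return rec["birth_date"][:4]  # substring fragment


# Hypothetical rules, each formed by three fragments.
RULES = [
    (first_name_sx, last_name_sx, birth_date),
    (first_name_sx, mother_first_sx, birth_year),
]


def is_duplicate(a, b):
    """True if any rule has all fragments present and equal in both records."""
    return any(
        all(frag(a) and frag(a) == frag(b) for frag in rule) for rule in RULES
    )


def sensitivity_specificity(predicted, truth):
    """Compare predicted links with manual review; assumes both classes occur."""
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)
    fn = sum((not p) and t for p, t in pairs)
    tn = sum((not p) and (not t) for p, t in pairs)
    fp = sum(p and (not t) for p, t in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```

In this sketch a missing value makes its fragment empty, so any rule that uses it cannot fire. That mirrors the abstract's observation that missing key variables were a main cause of missed duplicates, and it is why a rule set would normally include several alternative fragment combinations.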
format Online
Article
Text
id pubmed-4988803
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Faculdade de Saúde Pública da Universidade de São Paulo
record_format MEDLINE/PubMed
spelling pubmed-4988803 2016-08-29 Rev Saude Publica Original Articles
Faculdade de Saúde Pública da Universidade de São Paulo 2016-08-16 /pmc/articles/PMC4988803/ /pubmed/27556963 http://dx.doi.org/10.1590/S1518-8787.2016050006327 Text en http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Articles