
Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach

BACKGROUND: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge. METHODS:...


Bibliographic Details
Main authors: Araujo, José Deney, Santos-e-Silva, Juan Carlo, Costa-Martins, André Guilherme, Sampaio, Vanderson, de Castro, Daniel Barros, de Souza, Robson F., Giddaluru, Jeevan, Ramos, Pablo Ivan P., Pita, Robespierre, Barreto, Mauricio L., Barral-Netto, Manoel, Nakaya, Helder I.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9281601/
https://www.ncbi.nlm.nih.gov/pubmed/35846888
http://dx.doi.org/10.7717/peerj.13507
author Araujo, José Deney
Santos-e-Silva, Juan Carlo
Costa-Martins, André Guilherme
Sampaio, Vanderson
de Castro, Daniel Barros
de Souza, Robson F.
Giddaluru, Jeevan
Ramos, Pablo Ivan P.
Pita, Robespierre
Barreto, Mauricio L.
Barral-Netto, Manoel
Nakaya, Helder I.
author_sort Araujo, José Deney
collection PubMed
description BACKGROUND: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge. METHODS: We present Tucuxi-BLAST, a versatile tool for probabilistic RL that utilizes a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked it on a simulated database containing records for 300 million individuals and also on four large administrative databases containing real data on Brazilian patients. RESULTS: Our method was able to overcome misspellings and typographical errors in administrative databases. In processing the RL of the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST took only 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
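The encoding step described in the abstract can be illustrated with a minimal Python sketch. The codebook below (one fixed three-base codon per character), the helper names `encode_record`, `to_fasta` and `mismatches`, and the sample names are all hypothetical; the record does not specify the encoding Tucuxi-BLAST actually uses, so this only shows the general idea of turning identification fields into nucleotide sequences that an aligner such as BLASTn could compare.

```python
from itertools import product

# Hypothetical codebook: each character maps to one fixed 3-base codon.
# This is NOT the published Tucuxi-BLAST encoding, only an illustration.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 possible codons
CODEBOOK = dict(zip(ALPHABET, CODONS))  # uses the first 37 codons


def encode_record(fields):
    """Join identification fields and translate each character to a codon."""
    text = " ".join(fields).upper()
    # Characters outside the hypothetical alphabet (e.g. "-") are dropped.
    return "".join(CODEBOOK[ch] for ch in text if ch in CODEBOOK)


def to_fasta(records):
    """Render {record_id: fields} as FASTA text, one sequence per record."""
    return "".join(
        f">{rid}\n{encode_record(fields)}\n" for rid, fields in records.items()
    )


def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    clean = encode_record(["MARIA SILVA", "1980-01-02"])
    typo = encode_record(["MARIA SILVE", "1980-01-02"])  # one-letter typo
    print(to_fasta({"rec1": ["MARIA SILVA", "1980-01-02"]}), end="")
    print("mismatching bases:", mismatches(clean, typo))
```

With a fixed-width code, a single substituted character changes at most three bases, so the two sequences stay nearly identical and a local aligner can still pair them. Assuming the NCBI BLAST+ tools are installed, FASTA files produced this way (file names here are placeholders) could be indexed with `makeblastdb -in db.fasta -dbtype nucl -out db` and searched with `blastn -query query.fasta -db db -outfmt 6 -out hits.tsv`.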
format Online
Article
Text
id pubmed-9281601
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9281601 2022-07-15 PeerJ Bioinformatics PeerJ Inc. 2022-07-11 /pmc/articles/PMC9281601/ /pubmed/35846888 http://dx.doi.org/10.7717/peerj.13507 Text en © 2022 Araujo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach
topic Bioinformatics