
CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

Bibliographic Details
Main Authors: Barbosa, George C. G., Ali, M. Sanni, Araujo, Bruno, Reis, Sandra, Sena, Samila, Ichihara, Maria Y. T., Pescarini, Julia, Fiaccone, Rosemeire L., Amorim, Leila D., Pita, Robespierre, Barreto, Marcos E., Smeeth, Liam, Barreto, Mauricio L.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7654019/
https://www.ncbi.nlm.nih.gov/pubmed/33167998
http://dx.doi.org/10.1186/s12911-020-01285-w
author Barbosa, George C. G.
Ali, M. Sanni
Araujo, Bruno
Reis, Sandra
Sena, Samila
Ichihara, Maria Y. T.
Pescarini, Julia
Fiaccone, Rosemeire L.
Amorim, Leila D.
Pita, Robespierre
Barreto, Marcos E.
Smeeth, Liam
Barreto, Mauricio L.
author_sort Barbosa, George C. G.
collection PubMed
description BACKGROUND: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required. METHODS: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study and its scalability and execution time using a simulated cohort in serial (single-core) and multi-core (eight-core) computation settings. RESULTS: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. CONCLUSION: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors or distributed infrastructure.
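The accuracy and timing figures quoted in the abstract rest on standard definitions, which the following minimal Python sketch illustrates. It is not taken from the article or from the CIDACS-RL code: the function names and the confusion-matrix counts in the example are hypothetical, and only the 550-second and 150-second timings and the eight-core setting come from the abstract.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of true matches that the linkage found: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of reported links that are true matches: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """How many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(speedup_factor: float, cores: int) -> float:
    """Speedup divided by the number of cores (1.0 would be ideal scaling)."""
    return speedup_factor / cores


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the arithmetic.
    tp, fp, fn = 90, 5, 10
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"PPV         = {positive_predictive_value(tp, fp):.3f}")

    # Timings reported in the abstract for the 20 million record dataset.
    s = speedup(550, 150)
    print(f"speedup     = {s:.2f}x")
    print(f"efficiency  = {parallel_efficiency(s, 8):.2f} on 8 cores")
```

With the reported timings the ratio is 550 / 150, roughly 3.7, which the abstract rounds to "about four times faster".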
format Online
Article
Text
id pubmed-7654019
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7654019 2020-11-10 BMC Med Inform Decis Mak Research Article BioMed Central 2020-11-09 /pmc/articles/PMC7654019/ /pubmed/33167998 http://dx.doi.org/10.1186/s12911-020-01285-w Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
topic Research Article