CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
BACKGROUND: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challeng...
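Main authors: | Barbosa, George C. G.; Ali, M. Sanni; Araujo, Bruno; Reis, Sandra; Sena, Samila; Ichihara, Maria Y. T.; Pescarini, Julia; Fiaccone, Rosemeire L.; Amorim, Leila D.; Pita, Robespierre; Barreto, Marcos E.; Smeeth, Liam; Barreto, Mauricio L. |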
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7654019/ https://www.ncbi.nlm.nih.gov/pubmed/33167998 http://dx.doi.org/10.1186/s12911-020-01285-w |
_version_ | 1783607994693451776 |
---|---|
author | Barbosa, George C. G. Ali, M. Sanni Araujo, Bruno Reis, Sandra Sena, Samila Ichihara, Maria Y. T. Pescarini, Julia Fiaccone, Rosemeire L. Amorim, Leila D. Pita, Robespierre Barreto, Marcos E. Smeeth, Liam Barreto, Mauricio L. |
author_sort | Barbosa, George C. G. |
collection | PubMed |
description | BACKGROUND: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required. METHODS: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using gold standard dataset. We also evaluated its accuracy and scalability using a case-study and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight core) computation settings. RESULTS: Overall, CIDACS-RL algorithm had a superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. CONCLUSION: CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures. |
format | Online Article Text |
id | pubmed-7654019 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7654019 2020-11-10 BMC Med Inform Decis Mak Research Article BioMed Central 2020-11-09 /pmc/articles/PMC7654019/ /pubmed/33167998 http://dx.doi.org/10.1186/s12911-020-01285-w Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7654019/ https://www.ncbi.nlm.nih.gov/pubmed/33167998 http://dx.doi.org/10.1186/s12911-020-01285-w |
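
The abstract in this record describes linkage built on an indexing search followed by scoring, with a cut-off applied to the score and accuracy reported as sensitivity and positive predictive value. The sketch below illustrates that general idea only. It is not the authors' implementation, which relies on Apache Lucene; here a plain Python inverted index and `difflib.SequenceMatcher` stand in for the search and scoring steps. The field names (`name`, `dob`), the toy records, the helper function names, and the cut-off of 0.85 are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def tokens(record):
    """Index keys for a record: lower-cased name tokens plus the birth year."""
    return set(record["name"].lower().split()) | {record["dob"][:4]}


def build_index(records):
    """Inverted index mapping each token to the ids of records containing it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        for tok in tokens(rec):
            index[tok].add(rid)
    return index


def score(a, b):
    """Mean string similarity over name and date of birth, in [0, 1]."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dob = SequenceMatcher(None, a["dob"], b["dob"]).ratio()
    return (name + dob) / 2


def link(query, records, index, cutoff):
    """Search the index for candidates, score them, keep the best if >= cutoff."""
    candidates = set()
    for tok in tokens(query):
        candidates |= index.get(tok, set())
    best_id, best = None, 0.0
    for rid in candidates:
        s = score(query, records[rid])
        if s > best:
            best_id, best = rid, s
    return best_id if best >= cutoff else None


def sensitivity_ppv(predicted, truth):
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    tp = sum(1 for q, r in predicted.items() if r is not None and truth.get(q) == r)
    fp = sum(1 for q, r in predicted.items() if r is not None and truth.get(q) != r)
    fn = sum(1 for q, r in truth.items() if r is not None and predicted.get(q) != r)
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sens, ppv


# Hypothetical toy data: three indexed records and three incoming records.
records = {
    "r1": {"name": "Maria Souza Lima", "dob": "1984-03-12"},
    "r2": {"name": "Joao Pereira Santos", "dob": "1990-07-01"},
    "r3": {"name": "Ana Costa", "dob": "1975-11-30"},
}
queries = {
    "q1": {"name": "Maria Sousa Lima", "dob": "1984-03-12"},  # spelling variant
    "q2": {"name": "Joao P Santos", "dob": "1990-07-01"},     # abbreviated name
    "q3": {"name": "Carlos Oliveira", "dob": "2001-01-15"},   # no true match
}
truth = {"q1": "r1", "q2": "r2", "q3": None}  # hypothetical gold standard

index = build_index(records)
CUTOFF = 0.85  # illustrative value, not the 0.896 reported in the abstract
predicted = {q: link(rec, records, index, CUTOFF) for q, rec in queries.items()}
sens, ppv = sensitivity_ppv(predicted, truth)
print(predicted, sens, ppv)
```

The index restricts scoring to records that share at least one token with the incoming record, which is what keeps the number of pairwise comparisons small. A cut-off for a real dataset would need to be chosen against labelled data, as the abstract describes doing with a ROC curve.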