
Efficient algorithms for fast integration on large data sets from multiple sources

BACKGROUND: Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data in...


Bibliographic Details
Main Authors:	Mi, Tian; Rajasekaran, Sanguthevar; Aseltine, Robert
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439324/ https://www.ncbi.nlm.nih.gov/pubmed/22741525 http://dx.doi.org/10.1186/1472-6947-12-59

collection	PubMed
description	BACKGROUND: Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently. METHODS: Solutions based on hierarchical clustering are used to integrate multiple (in particular, more than two) datasets. Edit distance is used as the basic distance calculation, and the distance calculation for common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts whether the distance is within the threshold using upper bounds on the edit distance; and 4) a pre-processing blocking phase that limits dynamic computation to within each block. RESULTS: We have experimentally validated our algorithms on large simulated data as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach. CONCLUSIONS: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
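The METHODS summary names three ingredients: an edit-distance test against a threshold, a blocking phase, and clustering of records that fall within the threshold. The sketch below is one way these pieces could fit together and is not the authors' implementation. It assumes single-linkage grouping (the record does not state the linkage), uses a standard banded dynamic program with early exit in place of the paper's FCED upper-bound technique, and uses the first character of each record as a hypothetical blocking key. The names `within_threshold`, `link_records`, and `block_key` are illustrative only.

```python
from collections import defaultdict


def within_threshold(a: str, b: str, t: int) -> bool:
    """Return True if the edit (Levenshtein) distance of a and b is <= t.

    Banded dynamic program: only cells within t of the diagonal are
    filled, and the scan stops once a whole row exceeds t. This is a
    stand-in for the paper's FCED, not a reproduction of it.
    """
    if abs(len(a) - len(b)) > t:
        return False  # the length gap is a lower bound on the distance
    big = t + 1  # any value above t is treated as "too far"
    prev = [j if j <= t else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [big] * (len(b) + 1)
        if i <= t:
            cur[0] = i
        lo, hi = max(1, i - t), min(len(b), i + t)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
        if min(cur) > t:
            return False  # no alignment can come back under the threshold
        prev = cur
    return prev[len(b)] <= t


def link_records(records, t, block_key=lambda r: r[:1]):
    """Group record indices whose strings chain together within distance t.

    Comparisons happen only inside a block (hypothetical key: first
    character). Groups are the connected components of the "within t"
    graph, i.e. a single-linkage cut at t, tracked with union-find.
    """
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                if within_threshold(records[i], records[j], t):
                    parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    names = ["john smith", "jon smith", "john smyth", "mary jones", "marie jones"]
    print(link_records(names, 2))
```

Because only the within-threshold relation is needed, this sketch never builds a dendrogram, which is similar in spirit to the PCD and IDS ideas listed above. A first-character blocking key will miss pairs whose first character differs, so a real deployment would need a more forgiving key.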
id	pubmed-3439324
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-3439324 2012-09-17 BMC Med Inform Decis Mak. BioMed Central, 2012-06-28. Copyright ©2012 Mi et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.