
A proficient cost reduction framework for de-duplication of records in data integration

BACKGROUND: Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity an...


Bibliographic Details
Main authors: Sohail, Asif, Yousaf, Muhammad Murtaza
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4828843/
https://www.ncbi.nlm.nih.gov/pubmed/27067004
http://dx.doi.org/10.1186/s12911-016-0280-9
_version_ 1782426662213779456
author Sohail, Asif
Yousaf, Muhammad Murtaza
author_facet Sohail, Asif
Yousaf, Muhammad Murtaza
author_sort Sohail, Asif
collection PubMed
description BACKGROUND: Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets etc. METHODS: In this paper, we have proposed a framework that employs blocking and windowing techniques in succession, such that figuring out the parameters is not required. We have also evaluated the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage). RESULTS: The proposed framework is comprehensively evaluated using a variety of quality and complexity parameters such as reduction ratio, precision, recall etc. It is observed that the proposed framework significantly reduces the number of record comparisons. CONCLUSIONS: The selection of the linkage key is a critical performance factor for record linkage. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-016-0280-9) contains supplementary material, which is available to authorized users.
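As an illustration of the terms used in the description above, the following is a minimal sketch of blocking followed by windowing, together with the reduction ratio. It is not the authors' framework and does not use Febrl's API; the function names, the sample records, the blocking key (first letter of the surname), and the window size are all assumptions made for this example.

```python
from collections import defaultdict


def block_then_window(records, block_key, sort_key, window=3):
    """Hypothetical helper: group records by block_key, then slide a
    fixed-size window over each block sorted by sort_key.
    Returns a set of candidate (i, j) index pairs with i < j."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    pairs = set()
    for members in blocks.values():
        members.sort(key=lambda i: sort_key(records[i]))
        for start in range(len(members)):
            # Pair the record at `start` with the next (window - 1) records.
            for other in members[start + 1:start + window]:
                pairs.add(tuple(sorted((members[start], other))))
    return pairs


def reduction_ratio(n_records, n_candidates):
    """1 - (candidate pairs / all possible pairs)."""
    total = n_records * (n_records - 1) // 2
    return 1.0 - n_candidates / total if total else 0.0


# Made-up sample data, for illustration only.
records = [
    {"surname": "smith", "given": "john"},
    {"surname": "smith", "given": "jon"},
    {"surname": "smyth", "given": "john"},
    {"surname": "jones", "given": "mary"},
    {"surname": "jones", "given": "marie"},
    {"surname": "brown", "given": "alan"},
]

candidates = block_then_window(
    records,
    block_key=lambda r: r["surname"][0],  # assumed key: first letter of surname
    sort_key=lambda r: r["given"],
    window=2,
)
print(sorted(candidates))
print(reduction_ratio(len(records), len(candidates)))
```

Without any filtering, six records would require 15 pairwise comparisons; the sketch compares only the pairs that share a block and fall inside the same window.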
format Online
Article
Text
id pubmed-4828843
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4828843 2016-04-13 A proficient cost reduction framework for de-duplication of records in data integration Sohail, Asif Yousaf, Muhammad Murtaza BMC Med Inform Decis Mak Research Article BACKGROUND: Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets etc. METHODS: In this paper, we have proposed a framework that employs blocking and windowing techniques in succession, such that figuring out the parameters is not required. We have also evaluated the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage). RESULTS: The proposed framework is comprehensively evaluated using a variety of quality and complexity parameters such as reduction ratio, precision, recall etc. It is observed that the proposed framework significantly reduces the number of record comparisons. CONCLUSIONS: The selection of the linkage key is a critical performance factor for record linkage. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12911-016-0280-9) contains supplementary material, which is available to authorized users. BioMed Central 2016-04-12 /pmc/articles/PMC4828843/ /pubmed/27067004 http://dx.doi.org/10.1186/s12911-016-0280-9 Text en © Sohail and Yousaf. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Sohail, Asif
Yousaf, Muhammad Murtaza
A proficient cost reduction framework for de-duplication of records in data integration
title A proficient cost reduction framework for de-duplication of records in data integration
title_full A proficient cost reduction framework for de-duplication of records in data integration
title_fullStr A proficient cost reduction framework for de-duplication of records in data integration
title_full_unstemmed A proficient cost reduction framework for de-duplication of records in data integration
title_short A proficient cost reduction framework for de-duplication of records in data integration
title_sort proficient cost reduction framework for de-duplication of records in data integration
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4828843/
https://www.ncbi.nlm.nih.gov/pubmed/27067004
http://dx.doi.org/10.1186/s12911-016-0280-9
work_keys_str_mv AT sohailasif aproficientcostreductionframeworkfordeduplicationofrecordsindataintegration
AT yousafmuhammadmurtaza aproficientcostreductionframeworkfordeduplicationofrecordsindataintegration
AT sohailasif proficientcostreductionframeworkfordeduplicationofrecordsindataintegration
AT yousafmuhammadmurtaza proficientcostreductionframeworkfordeduplicationofrecordsindataintegration