An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data

A crucial step in improving data quality is to discover semantic relationships between data. Functional dependencies are rules that describe semantic relationships between data in relational databases and have been applied to improve data quality recently. However, traditional functional discovery a...

Bibliographic Details
Main authors:	Wu, Wanqing, Mao, Wenyu
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9142976/ https://www.ncbi.nlm.nih.gov/pubmed/35632261 http://dx.doi.org/10.3390/s22103856

_version_	1784715691916525568
author	Wu, Wanqing Mao, Wenyu
author_facet	Wu, Wanqing Mao, Wenyu
author_sort	Wu, Wanqing
collection	PubMed
description	A crucial step in improving data quality is to discover semantic relationships between data. Functional dependencies are rules that describe semantic relationships between data in relational databases and have been applied to improve data quality recently. However, traditional functional discovery algorithms applied to distributed data may lead to errors and the inability to scale to large-scale data. To solve the above problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark, which can effectively discover functional dependencies in large-scale data. The basic idea is to use data redistribution to discover functional dependencies in parallel on multiple nodes. In this algorithm, we take a sampling approach to quickly remove invalid functional dependencies and propose a greedy-based task assignment strategy to balance the load. In addition, the prefix tree is used to store intermediate computation results during the validation process to avoid repeated computation of equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm in this paper is more efficient than existing methods while ensuring accuracy.
format	Online Article Text
id	pubmed-9142976
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9142976 2022-05-29 An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data Wu, Wanqing Mao, Wenyu Sensors (Basel) Article MDPI 2022-05-19 /pmc/articles/PMC9142976/ /pubmed/35632261 http://dx.doi.org/10.3390/s22103856 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Wu, Wanqing Mao, Wenyu An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data
title	An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data
title_sort	efficient and scalable algorithm to mine functional dependencies from distributed big data
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9142976/ https://www.ncbi.nlm.nih.gov/pubmed/35632261 http://dx.doi.org/10.3390/s22103856

Similar items