Cargando…

Efficient string similarity join in multi-core and distributed systems

In big data area a significant challenge about string similarity join is to find all similar pairs more efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into some disjoint small subsets according to the joint frequ...

Descripción completa

Detalles Bibliográficos
Autores principales:	Yan, Cairong, Zhao, Xue, Zhang, Qinglong, Huang, Yongfeng
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2017
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5344375/ https://www.ncbi.nlm.nih.gov/pubmed/28278177 http://dx.doi.org/10.1371/journal.pone.0172526

_version_	1782513528472600576
author	Yan, Cairong Zhao, Xue Zhang, Qinglong Huang, Yongfeng
author_facet	Yan, Cairong Zhao, Xue Zhang, Qinglong Huang, Yongfeng
author_sort	Yan, Cairong
collection	PubMed
description	In big data area a significant challenge about string similarity join is to find all similar pairs more efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into some disjoint small subsets according to the joint frequency distribution and the interval distribution of strings. Then the filter-verification strategy is adopted in the computation of string similarity for each subset so that the number of candidate pairs is reduced before an effective pruning strategy is used to improve the performance. Finally, the operation of string join is executed in parallel. Para-Join algorithm based on the multi-threading technique is proposed to implement the framework in a multi-core system while Pada-Join algorithm based on Spark platform is proposed to implement the framework in a cluster system. We prove that Para-Join and Pada-Join cannot only avoid reduplicate computation but also ensure the completeness of the result. Experimental results show that Para-Join can achieve high efficiency and significantly outperform than state-of-the-art approaches, meanwhile, Pada-Join can work on large datasets.
format	Online Article Text
id	pubmed-5344375
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-53443752017-03-29 Efficient string similarity join in multi-core and distributed systems Yan, Cairong Zhao, Xue Zhang, Qinglong Huang, Yongfeng PLoS One Research Article In big data area a significant challenge about string similarity join is to find all similar pairs more efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into some disjoint small subsets according to the joint frequency distribution and the interval distribution of strings. Then the filter-verification strategy is adopted in the computation of string similarity for each subset so that the number of candidate pairs is reduced before an effective pruning strategy is used to improve the performance. Finally, the operation of string join is executed in parallel. Para-Join algorithm based on the multi-threading technique is proposed to implement the framework in a multi-core system while Pada-Join algorithm based on Spark platform is proposed to implement the framework in a cluster system. We prove that Para-Join and Pada-Join cannot only avoid reduplicate computation but also ensure the completeness of the result. Experimental results show that Para-Join can achieve high efficiency and significantly outperform than state-of-the-art approaches, meanwhile, Pada-Join can work on large datasets. Public Library of Science 2017-03-09 /pmc/articles/PMC5344375/ /pubmed/28278177 http://dx.doi.org/10.1371/journal.pone.0172526 Text en © 2017 Yan et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Yan, Cairong Zhao, Xue Zhang, Qinglong Huang, Yongfeng Efficient string similarity join in multi-core and distributed systems
title	Efficient string similarity join in multi-core and distributed systems
title_full	Efficient string similarity join in multi-core and distributed systems
title_fullStr	Efficient string similarity join in multi-core and distributed systems
title_full_unstemmed	Efficient string similarity join in multi-core and distributed systems
title_short	Efficient string similarity join in multi-core and distributed systems
title_sort	efficient string similarity join in multi-core and distributed systems
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5344375/ https://www.ncbi.nlm.nih.gov/pubmed/28278177 http://dx.doi.org/10.1371/journal.pone.0172526
work_keys_str_mv	AT yancairong efficientstringsimilarityjoininmulticoreanddistributedsystems AT zhaoxue efficientstringsimilarityjoininmulticoreanddistributedsystems AT zhangqinglong efficientstringsimilarityjoininmulticoreanddistributedsystems AT huangyongfeng efficientstringsimilarityjoininmulticoreanddistributedsystems

Efficient string similarity join in multi-core and distributed systems

Ejemplares similares