Cargando…

Efficient and Scalable Graph Similarity Joins in MapReduce

Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constrai...

Descripción completa

Detalles Bibliográficos
Autores principales: Chen, Yifan, Zhao, Xiang, Xiao, Chuan, Zhang, Weiming, Tang, Jiuyang
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Hindawi Publishing Corporation 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121100/
https://www.ncbi.nlm.nih.gov/pubmed/25121135
http://dx.doi.org/10.1155/2014/749028
_version_ 1782329173545582592
author Chen, Yifan
Zhao, Xiang
Xiao, Chuan
Zhang, Weiming
Tang, Jiuyang
author_facet Chen, Yifan
Zhao, Xiang
Xiao, Chuan
Zhang, Weiming
Tang, Jiuyang
author_sort Chen, Yifan
collection PubMed
description Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.
format Online
Article
Text
id pubmed-4121100
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-41211002014-08-12 Efficient and Scalable Graph Similarity Joins in MapReduce Chen, Yifan Zhao, Xiang Xiao, Chuan Zhang, Weiming Tang, Jiuyang ScientificWorldJournal Research Article Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results. Hindawi Publishing Corporation 2014 2014-07-08 /pmc/articles/PMC4121100/ /pubmed/25121135 http://dx.doi.org/10.1155/2014/749028 Text en Copyright © 2014 Yifan Chen et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Chen, Yifan
Zhao, Xiang
Xiao, Chuan
Zhang, Weiming
Tang, Jiuyang
Efficient and Scalable Graph Similarity Joins in MapReduce
title Efficient and Scalable Graph Similarity Joins in MapReduce
title_full Efficient and Scalable Graph Similarity Joins in MapReduce
title_fullStr Efficient and Scalable Graph Similarity Joins in MapReduce
title_full_unstemmed Efficient and Scalable Graph Similarity Joins in MapReduce
title_short Efficient and Scalable Graph Similarity Joins in MapReduce
title_sort efficient and scalable graph similarity joins in mapreduce
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121100/
https://www.ncbi.nlm.nih.gov/pubmed/25121135
http://dx.doi.org/10.1155/2014/749028
work_keys_str_mv AT chenyifan efficientandscalablegraphsimilarityjoinsinmapreduce
AT zhaoxiang efficientandscalablegraphsimilarityjoinsinmapreduce
AT xiaochuan efficientandscalablegraphsimilarityjoinsinmapreduce
AT zhangweiming efficientandscalablegraphsimilarityjoinsinmapreduce
AT tangjiuyang efficientandscalablegraphsimilarityjoinsinmapreduce