CASS: A distributed network clustering algorithm based on structure similarity for large-scale network

As the size of networks increases, it is becoming important to analyze large-scale network data. A network clustering algorithm is useful for analysis of network data. Conventional network clustering algorithms in a single machine environment rather than a parallel machine environment are actively b...

Bibliographic Details
Main authors:	Kim, Jungrim; Shin, Mincheol; Kim, Jeongwoo; Park, Chihyun; Lee, Sujin; Woo, Jaemin; Kim, Hyerim; Seo, Dongmin; Yu, Seokjong; Park, Sanghyun
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6179193/ https://www.ncbi.nlm.nih.gov/pubmed/30303961 http://dx.doi.org/10.1371/journal.pone.0203670

_version_	1783362058368057344
author	Kim, Jungrim Shin, Mincheol Kim, Jeongwoo Park, Chihyun Lee, Sujin Woo, Jaemin Kim, Hyerim Seo, Dongmin Yu, Seokjong Park, Sanghyun
author_facet	Kim, Jungrim Shin, Mincheol Kim, Jeongwoo Park, Chihyun Lee, Sujin Woo, Jaemin Kim, Hyerim Seo, Dongmin Yu, Seokjong Park, Sanghyun
author_sort	Kim, Jungrim
collection	PubMed
description	As the size of networks increases, it is becoming important to analyze large-scale network data. A network clustering algorithm is useful for analysis of network data. Conventional network clustering algorithms in a single machine environment rather than a parallel machine environment are actively being researched. However, these algorithms cannot analyze large-scale network data because of memory size issues. As a solution, we propose a network clustering algorithm for large-scale network data analysis using Apache Spark by changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as Bloom filter and shuffle selection to reduce memory usage and execution time. By evaluating our proposed algorithm based on an average normalized cut, we confirmed that the algorithm can analyze diverse large-scale network datasets such as biological, co-authorship, internet topology and social networks. Experimental results show that the proposed algorithm can develop more accurate clusters than comparative algorithms with less memory usage. Furthermore, we confirm the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that clusters found from the proposed algorithm can represent biologically meaningful functions.
format	Online Article Text
id	pubmed-6179193
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-6179193 2018-10-19 CASS: A distributed network clustering algorithm based on structure similarity for large-scale network Kim, Jungrim Shin, Mincheol Kim, Jeongwoo Park, Chihyun Lee, Sujin Woo, Jaemin Kim, Hyerim Seo, Dongmin Yu, Seokjong Park, Sanghyun PLoS One Research Article Public Library of Science 2018-10-10 /pmc/articles/PMC6179193/ /pubmed/30303961 http://dx.doi.org/10.1371/journal.pone.0203670 Text en © 2018 Kim et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	CASS: A distributed network clustering algorithm based on structure similarity for large-scale network
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6179193/ https://www.ncbi.nlm.nih.gov/pubmed/30303961 http://dx.doi.org/10.1371/journal.pone.0203670
