Cargando…

Parallel Clustering Algorithm for Large-Scale Biological Data Sets

BACKGROUNDS: Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many oth...

Descripción completa

Detalles Bibliográficos
Autores principales: Wang, Minchao, Zhang, Wu, Ding, Wang, Dai, Dongbo, Zhang, Huiran, Xie, Hao, Chen, Luonan, Guo, Yike, Xie, Jiang
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3976248/
https://www.ncbi.nlm.nih.gov/pubmed/24705246
http://dx.doi.org/10.1371/journal.pone.0091315
_version_ 1782310259328548864
author Wang, Minchao
Zhang, Wu
Ding, Wang
Dai, Dongbo
Zhang, Huiran
Xie, Hao
Chen, Luonan
Guo, Yike
Xie, Jiang
author_facet Wang, Minchao
Zhang, Wu
Ding, Wang
Dai, Dongbo
Zhang, Huiran
Xie, Hao
Chen, Luonan
Guo, Yike
Xie, Jiang
author_sort Wang, Minchao
collection PubMed
description BACKGROUNDS: Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. METHODS: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. RESULT: A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
format Online
Article
Text
id pubmed-3976248
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-39762482014-04-08 Parallel Clustering Algorithm for Large-Scale Biological Data Sets Wang, Minchao Zhang, Wu Ding, Wang Dai, Dongbo Zhang, Huiran Xie, Hao Chen, Luonan Guo, Yike Xie, Jiang PLoS One Research Article BACKGROUNDS: Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. METHODS: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. RESULT: A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. Public Library of Science 2014-04-04 /pmc/articles/PMC3976248/ /pubmed/24705246 http://dx.doi.org/10.1371/journal.pone.0091315 Text en © 2014 Wang et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Wang, Minchao
Zhang, Wu
Ding, Wang
Dai, Dongbo
Zhang, Huiran
Xie, Hao
Chen, Luonan
Guo, Yike
Xie, Jiang
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title_full Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title_fullStr Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title_full_unstemmed Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title_short Parallel Clustering Algorithm for Large-Scale Biological Data Sets
title_sort parallel clustering algorithm for large-scale biological data sets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3976248/
https://www.ncbi.nlm.nih.gov/pubmed/24705246
http://dx.doi.org/10.1371/journal.pone.0091315
work_keys_str_mv AT wangminchao parallelclusteringalgorithmforlargescalebiologicaldatasets
AT zhangwu parallelclusteringalgorithmforlargescalebiologicaldatasets
AT dingwang parallelclusteringalgorithmforlargescalebiologicaldatasets
AT daidongbo parallelclusteringalgorithmforlargescalebiologicaldatasets
AT zhanghuiran parallelclusteringalgorithmforlargescalebiologicaldatasets
AT xiehao parallelclusteringalgorithmforlargescalebiologicaldatasets
AT chenluonan parallelclusteringalgorithmforlargescalebiologicaldatasets
AT guoyike parallelclusteringalgorithmforlargescalebiologicaldatasets
AT xiejiang parallelclusteringalgorithmforlargescalebiologicaldatasets