Gclust: A Parallel Clustering Tool for Microbial Genomic Data
The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing cluste...
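Main Authors: | Li, Ruilin; He, Xiaoyu; Dai, Chuangchuang; Zhu, Haidong; Lang, Xianyu; Chen, Wei; Li, Xiaodong; Zhao, Dan; Zhang, Yu; Han, Xinyin; Niu, Tie; Zhao, Yi; Cao, Rongqiang; He, Rong; Lu, Zhonghua; Chi, Xuebin; Li, Weizhong; Niu, Beifang |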
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2019 |
Subjects: | Method |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7056916/ https://www.ncbi.nlm.nih.gov/pubmed/31917259 http://dx.doi.org/10.1016/j.gpb.2018.10.008 |
_version_ | 1783503559878246400 |
---|---|
author | Li, Ruilin; He, Xiaoyu; Dai, Chuangchuang; Zhu, Haidong; Lang, Xianyu; Chen, Wei; Li, Xiaodong; Zhao, Dan; Zhang, Yu; Han, Xinyin; Niu, Tie; Zhao, Yi; Cao, Rongqiang; He, Rong; Lu, Zhonghua; Chi, Xuebin; Li, Weizhong; Niu, Beifang |
author_facet | Li, Ruilin; He, Xiaoyu; Dai, Chuangchuang; Zhu, Haidong; Lang, Xianyu; Chen, Wei; Li, Xiaodong; Zhao, Dan; Zhang, Yu; Han, Xinyin; Niu, Tie; Zhao, Yi; Cao, Rongqiang; He, Rong; Lu, Zhonghua; Chi, Xuebin; Li, Weizhong; Niu, Beifang |
author_sort | Li, Ruilin |
collection | PubMed |
description | The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust. |
format | Online Article Text |
id | pubmed-7056916 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-7056916 2020-03-09 Gclust: A Parallel Clustering Tool for Microbial Genomic Data Li, Ruilin; He, Xiaoyu; Dai, Chuangchuang; Zhu, Haidong; Lang, Xianyu; Chen, Wei; Li, Xiaodong; Zhao, Dan; Zhang, Yu; Han, Xinyin; Niu, Tie; Zhao, Yi; Cao, Rongqiang; He, Rong; Lu, Zhonghua; Chi, Xuebin; Li, Weizhong; Niu, Beifang Genomics Proteomics Bioinformatics Method The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust. Elsevier 2019-10 2020-01-07 /pmc/articles/PMC7056916/ /pubmed/31917259 http://dx.doi.org/10.1016/j.gpb.2018.10.008 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Method Li, Ruilin; He, Xiaoyu; Dai, Chuangchuang; Zhu, Haidong; Lang, Xianyu; Chen, Wei; Li, Xiaodong; Zhao, Dan; Zhang, Yu; Han, Xinyin; Niu, Tie; Zhao, Yi; Cao, Rongqiang; He, Rong; Lu, Zhonghua; Chi, Xuebin; Li, Weizhong; Niu, Beifang Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title | Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title_full | Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title_fullStr | Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title_full_unstemmed | Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title_short | Gclust: A Parallel Clustering Tool for Microbial Genomic Data |
title_sort | gclust: a parallel clustering tool for microbial genomic data |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7056916/ https://www.ncbi.nlm.nih.gov/pubmed/31917259 http://dx.doi.org/10.1016/j.gpb.2018.10.008 |
work_keys_str_mv | AT liruilin gclustaparallelclusteringtoolformicrobialgenomicdata AT hexiaoyu gclustaparallelclusteringtoolformicrobialgenomicdata AT daichuangchuang gclustaparallelclusteringtoolformicrobialgenomicdata AT zhuhaidong gclustaparallelclusteringtoolformicrobialgenomicdata AT langxianyu gclustaparallelclusteringtoolformicrobialgenomicdata AT chenwei gclustaparallelclusteringtoolformicrobialgenomicdata AT lixiaodong gclustaparallelclusteringtoolformicrobialgenomicdata AT zhaodan gclustaparallelclusteringtoolformicrobialgenomicdata AT zhangyu gclustaparallelclusteringtoolformicrobialgenomicdata AT hanxinyin gclustaparallelclusteringtoolformicrobialgenomicdata AT niutie gclustaparallelclusteringtoolformicrobialgenomicdata AT zhaoyi gclustaparallelclusteringtoolformicrobialgenomicdata AT caorongqiang gclustaparallelclusteringtoolformicrobialgenomicdata AT herong gclustaparallelclusteringtoolformicrobialgenomicdata AT luzhonghua gclustaparallelclusteringtoolformicrobialgenomicdata AT chixuebin gclustaparallelclusteringtoolformicrobialgenomicdata AT liweizhong gclustaparallelclusteringtoolformicrobialgenomicdata AT niubeifang gclustaparallelclusteringtoolformicrobialgenomicdata |
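
The description above says that Gclust derives genome identity from maximal exact matches (MEMs) between two sequences. The record does not give the formula, so the following is only a minimal sketch of the general idea, under stated assumptions: it finds MEMs by brute-force dynamic programming (not with the sparse suffix arrays Gclust uses), and it defines identity as the fraction of the first sequence covered by at least one MEM, which is an illustrative choice and not necessarily Gclust's measure. The function names `maximal_exact_matches` and `mem_identity` and the `min_len` parameter are hypothetical.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_in_a, start_in_b, length) for each maximal exact match.

    Brute-force O(len(a) * len(b)) sketch; a real tool would use an index
    such as a sparse suffix array instead.
    """
    mems = []
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the longest common run ending at a[i-1], b[j-1];
                # this run cannot be extended to the left by construction.
                curr[j] = prev[j - 1] + 1
                at_end = i == len(a) or j == len(b)
                # Report the run only if it cannot be extended to the right.
                if (at_end or a[i] != b[j]) and curr[j] >= min_len:
                    length = curr[j]
                    mems.append((i - length, j - length, length))
        prev = curr
    return mems


def mem_identity(a, b, min_len=4):
    """Fraction of positions in `a` covered by at least one MEM with `b`.

    Assumed, illustrative definition of identity; not taken from the record.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for start_a, _start_b, length in maximal_exact_matches(a, b, min_len):
        for pos in range(start_a, start_a + length):
            covered[pos] = True
    return sum(covered) / len(a)


if __name__ == "__main__":
    query = "AAAAGCCCC"
    reference = "AAAATCCCC"
    print(maximal_exact_matches(query, reference, 4))
    print(mem_identity(query, reference, 4))
```

In a clustering setting, a score like this would be compared against an identity threshold to decide whether a sequence joins an existing cluster; the threshold and the clustering strategy themselves are not specified in this record.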