Gclust: A Parallel Clustering Tool for Microbial Genomic Data

The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing cluste...


Bibliographic Details
Main Authors: Li, Ruilin, He, Xiaoyu, Dai, Chuangchuang, Zhu, Haidong, Lang, Xianyu, Chen, Wei, Li, Xiaodong, Zhao, Dan, Zhang, Yu, Han, Xinyin, Niu, Tie, Zhao, Yi, Cao, Rongqiang, He, Rong, Lu, Zhonghua, Chi, Xuebin, Li, Weizhong, Niu, Beifang
Format: Online Article Text
Language: English
Published: Elsevier 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7056916/
https://www.ncbi.nlm.nih.gov/pubmed/31917259
http://dx.doi.org/10.1016/j.gpb.2018.10.008
_version_ 1783503559878246400
author Li, Ruilin
He, Xiaoyu
Dai, Chuangchuang
Zhu, Haidong
Lang, Xianyu
Chen, Wei
Li, Xiaodong
Zhao, Dan
Zhang, Yu
Han, Xinyin
Niu, Tie
Zhao, Yi
Cao, Rongqiang
He, Rong
Lu, Zhonghua
Chi, Xuebin
Li, Weizhong
Niu, Beifang
author_facet Li, Ruilin
He, Xiaoyu
Dai, Chuangchuang
Zhu, Haidong
Lang, Xianyu
Chen, Wei
Li, Xiaodong
Zhao, Dan
Zhang, Yu
Han, Xinyin
Niu, Tie
Zhao, Yi
Cao, Rongqiang
He, Rong
Lu, Zhonghua
Chi, Xuebin
Li, Weizhong
Niu, Beifang
author_sort Li, Ruilin
collection PubMed
description The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
format Online
Article
Text
id pubmed-7056916
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-70569162020-03-09 Gclust: A Parallel Clustering Tool for Microbial Genomic Data Li, Ruilin He, Xiaoyu Dai, Chuangchuang Zhu, Haidong Lang, Xianyu Chen, Wei Li, Xiaodong Zhao, Dan Zhang, Yu Han, Xinyin Niu, Tie Zhao, Yi Cao, Rongqiang He, Rong Lu, Zhonghua Chi, Xuebin Li, Weizhong Niu, Beifang Genomics Proteomics Bioinformatics Method The accelerating growth of the public microbial genomic data imposes substantial burden on the research community that uses such resources. Building databases for non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust. Elsevier 2019-10 2020-01-07 /pmc/articles/PMC7056916/ /pubmed/31917259 http://dx.doi.org/10.1016/j.gpb.2018.10.008 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Method
Li, Ruilin
He, Xiaoyu
Dai, Chuangchuang
Zhu, Haidong
Lang, Xianyu
Chen, Wei
Li, Xiaodong
Zhao, Dan
Zhang, Yu
Han, Xinyin
Niu, Tie
Zhao, Yi
Cao, Rongqiang
He, Rong
Lu, Zhonghua
Chi, Xuebin
Li, Weizhong
Niu, Beifang
Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title_full Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title_fullStr Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title_full_unstemmed Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title_short Gclust: A Parallel Clustering Tool for Microbial Genomic Data
title_sort gclust: a parallel clustering tool for microbial genomic data
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7056916/
https://www.ncbi.nlm.nih.gov/pubmed/31917259
http://dx.doi.org/10.1016/j.gpb.2018.10.008
work_keys_str_mv AT liruilin gclustaparallelclusteringtoolformicrobialgenomicdata
AT hexiaoyu gclustaparallelclusteringtoolformicrobialgenomicdata
AT daichuangchuang gclustaparallelclusteringtoolformicrobialgenomicdata
AT zhuhaidong gclustaparallelclusteringtoolformicrobialgenomicdata
AT langxianyu gclustaparallelclusteringtoolformicrobialgenomicdata
AT chenwei gclustaparallelclusteringtoolformicrobialgenomicdata
AT lixiaodong gclustaparallelclusteringtoolformicrobialgenomicdata
AT zhaodan gclustaparallelclusteringtoolformicrobialgenomicdata
AT zhangyu gclustaparallelclusteringtoolformicrobialgenomicdata
AT hanxinyin gclustaparallelclusteringtoolformicrobialgenomicdata
AT niutie gclustaparallelclusteringtoolformicrobialgenomicdata
AT zhaoyi gclustaparallelclusteringtoolformicrobialgenomicdata
AT caorongqiang gclustaparallelclusteringtoolformicrobialgenomicdata
AT herong gclustaparallelclusteringtoolformicrobialgenomicdata
AT luzhonghua gclustaparallelclusteringtoolformicrobialgenomicdata
AT chixuebin gclustaparallelclusteringtoolformicrobialgenomicdata
AT liweizhong gclustaparallelclusteringtoolformicrobialgenomicdata
AT niubeifang gclustaparallelclusteringtoolformicrobialgenomicdata