RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms....

Bibliographic Details
Main authors: Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10190105/
https://www.ncbi.nlm.nih.gov/pubmed/37198663
http://dx.doi.org/10.1186/s13059-023-02961-6
author Xu, Xiaoming
Yin, Zekun
Yan, Lifeng
Zhang, Hao
Xu, Borui
Wei, Yanjie
Niu, Beifang
Schmidt, Bertil
Liu, Weiguo
collection PubMed
description We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02961-6.
format Online
Article
Text
id pubmed-10190105
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10190105 2023-05-18 RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches Xu, Xiaoming Yin, Zekun Yan, Lifeng Zhang, Hao Xu, Borui Wei, Yanjie Niu, Beifang Schmidt, Bertil Liu, Weiguo Genome Biol Software We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02961-6. BioMed Central 2023-05-17 /pmc/articles/PMC10190105/ /pubmed/37198663 http://dx.doi.org/10.1186/s13059-023-02961-6 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches
topic Software