RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches
We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms....
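The idea of sketch-based distance estimation can be illustrated with a small, self-contained example. This is a minimal sketch under stated assumptions, not RabbitTClust's implementation: the function names, the tiny parameters (`k=5`, `size=16`), the BLAKE2 hash, the Mash distance formula, and the single-linkage grouping are all choices made for illustration, and this record does not say which of them the tool actually uses.

```python
import hashlib
import math
from itertools import combinations


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minhash_sketch(seq, k=5, size=16):
    """Bottom-s MinHash sketch: the `size` smallest 64-bit hashes of the distinct k-mers."""
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(a, b, size):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance, -1/k * ln(2j / (1 + j)), clamped to [0, 1] (assumed metric)."""
    if j <= 0.0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2.0 * j / (1.0 + j)) / k))


def cluster(seqs, threshold, k=5, size=16):
    """Illustrative single-linkage grouping: join sequences with distance <= threshold."""
    sketches = [minhash_sketch(s, k, size) for s in seqs]
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        d = mash_distance(jaccard_estimate(sketches[i], sketches[j], size), k)
        if d <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


if __name__ == "__main__":
    genomes = ["ACGTACGTTGCA" * 3, "ACGTACGTTGCA" * 3, "GGGGGGGGGGCCCCCCCCCC"]
    print(cluster(genomes, threshold=0.05))
```

Each sequence is reduced to a fixed-size sketch once, so every pairwise comparison touches at most `size` hashes regardless of genome length. The all-pairs loop here is quadratic in the number of sequences and is only suitable for toy inputs.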
Main authors: | Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2023 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10190105/ https://www.ncbi.nlm.nih.gov/pubmed/37198663 http://dx.doi.org/10.1186/s13059-023-02961-6 |
_version_ | 1785043219474546688 |
---|---|
author | Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo |
author_facet | Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo |
author_sort | Xu, Xiaoming |
collection | PubMed |
description | We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02961-6. |
format | Online Article Text |
id | pubmed-10190105 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10190105 2023-05-18 RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches Xu, Xiaoming Yin, Zekun Yan, Lifeng Zhang, Hao Xu, Borui Wei, Yanjie Niu, Beifang Schmidt, Bertil Liu, Weiguo Genome Biol Software We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02961-6. BioMed Central 2023-05-17 /pmc/articles/PMC10190105/ /pubmed/37198663 http://dx.doi.org/10.1186/s13059-023-02961-6 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. 
If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Xu, Xiaoming Yin, Zekun Yan, Lifeng Zhang, Hao Xu, Borui Wei, Yanjie Niu, Beifang Schmidt, Bertil Liu, Weiguo RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches |
title | RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches |
title_sort | rabbittclust: enabling fast clustering analysis of millions of bacteria genomes with minhash sketches |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10190105/ https://www.ncbi.nlm.nih.gov/pubmed/37198663 http://dx.doi.org/10.1186/s13059-023-02961-6 |