
RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches


Bibliographic Details
Main authors: Xu, Xiaoming; Yin, Zekun; Yan, Lifeng; Zhang, Hao; Xu, Borui; Wei, Yanjie; Niu, Beifang; Schmidt, Bertil; Liu, Weiguo
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10190105/
https://www.ncbi.nlm.nih.gov/pubmed/37198663
http://dx.doi.org/10.1186/s13059-023-02961-6
Description
Summary: We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered in less than 6 min, and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, in only 34 min on a 128-core workstation. Our results further identify 1,269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02961-6.
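
The summary mentions sketch-based distance estimation, and the title names MinHash sketches. As a rough illustration of that general idea, and not RabbitTClust's actual implementation, the following Python sketch builds a bottom-s MinHash sketch from k-mers and estimates the Jaccard index between two sequences. The function names, the hash function (BLAKE2b), the defaults `k=21` and `size=1000`, and the Mash-style conversion from Jaccard index to distance are all assumptions made for this example; the record itself does not specify them. Reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(a), set(b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate (assumed formula)."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

Because each genome is reduced to a fixed-size list of hashes, comparing two genomes costs time proportional to the sketch size rather than the genome length, which is the dimensionality reduction the summary refers to. A clustering tool would then feed such pairwise distances into a clustering algorithm.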