
MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

BACKGROUND: Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improve...


Bibliographic Details
Main author:	Girgis, Hani Z.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9171953/ https://www.ncbi.nlm.nih.gov/pubmed/35668366 http://dx.doi.org/10.1186/s12864-022-08619-0

_version_	1784721783824318464
author	Girgis, Hani Z.
author_facet	Girgis, Hani Z.
author_sort	Girgis, Hani Z.
collection	PubMed
description	BACKGROUND: Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is considerable room for improvement in terms of cluster quality. Motivated by this opportunity, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning. Its strong theoretical foundation guarantees convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adopting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0. RESULTS: We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0, and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected to test the tools in terms of scalability and different similarity levels among the sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from the human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%–300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; however, almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate, with high accuracy, an important parameter that controls cluster membership.
CONCLUSIONS: These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help with deriving accurate biological knowledge. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s12864-022-08619-0).
format	Online Article Text
id	pubmed-9171953
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9171953 2022-06-08 MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores Girgis, Hani Z. BMC Genomics Software BioMed Central 2022-06-06 /pmc/articles/PMC9171953/ /pubmed/35668366 http://dx.doi.org/10.1186/s12864-022-08619-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Software Girgis, Hani Z. MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores
title	MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9171953/ https://www.ncbi.nlm.nih.gov/pubmed/35668366 http://dx.doi.org/10.1186/s12864-022-08619-0
work_keys_str_mv	AT girgishaniz meshclustv30highqualityclusteringofdnasequencesusingthemeanshiftalgorithmandalignmentfreeidentityscores

