DNACLUST: accurate and efficient clustering of phylogenetic marker genes

BACKGROUND: Clustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods. This is part...

Bibliographic Details
Main authors:	Ghodsi, Mohammadreza; Liu, Bo; Pop, Mihai
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213679/ https://www.ncbi.nlm.nih.gov/pubmed/21718538 http://dx.doi.org/10.1186/1471-2105-12-271

_version_	1782216170765549568
author	Ghodsi, Mohammadreza; Liu, Bo; Pop, Mihai
author_facet	Ghodsi, Mohammadreza; Liu, Bo; Pop, Mihai
author_sort	Ghodsi, Mohammadreza
collection	PubMed
description	BACKGROUND: Clustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods. This is particularly true in the context of taxonomic profiling of microbial communities through direct sequencing of phylogenetic markers (e.g., 16S rRNA), the domain that motivated the work described in this paper. Many analysis approaches rely on an initial clustering step aimed at identifying sequences that belong to the same operational taxonomic unit (OTU). When defining OTUs (which have no universally accepted definition), scientists must balance a trade-off between computational efficiency and biological accuracy, as accurately estimating an environment's phylogenetic composition requires computationally intensive analyses. We propose that efficient and mathematically well-defined clustering methods can benefit existing taxonomic profiling approaches in two ways: (i) the resulting clusters can be substituted for OTUs in certain applications; and (ii) the clustering effectively reduces the size of the datasets that need to be analyzed by complex phylogenetic pipelines (e.g., only one sequence per cluster needs to be provided to downstream analyses). RESULTS: To address the challenges outlined above, we developed DNACLUST, a fast clustering tool specifically designed for clustering highly similar DNA sequences. Given a set of sequences and a sequence similarity threshold, DNACLUST creates clusters whose radius is guaranteed not to exceed the specified threshold. Underlying DNACLUST is a greedy clustering strategy that owes its performance to novel sequence alignment and k-mer-based filtering algorithms. DNACLUST can also produce multiple sequence alignments for every cluster, allowing users to manually inspect clustering results, and enabling more detailed analyses of the clustered data. CONCLUSIONS: We compare DNACLUST to two popular clustering tools: CD-HIT and UCLUST. We show that DNACLUST is about an order of magnitude faster than CD-HIT and UCLUST (exact mode) and comparable in speed to UCLUST (approximate mode). The performance of DNACLUST improves as the similarity threshold is increased (tight clusters), making it well suited for rapidly removing duplicates and near-duplicates from a dataset, thereby reducing the size of the data being analyzed through more elaborate approaches.
format	Online Article Text
id	pubmed-3213679
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3213679 2011-11-12 DNACLUST: accurate and efficient clustering of phylogenetic marker genes Ghodsi, Mohammadreza; Liu, Bo; Pop, Mihai BMC Bioinformatics Methodology Article BioMed Central 2011-06-30 /pmc/articles/PMC3213679/ /pubmed/21718538 http://dx.doi.org/10.1186/1471-2105-12-271 Text en Copyright ©2011 Ghodsi et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Ghodsi, Mohammadreza Liu, Bo Pop, Mihai DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title	DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title_full	DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title_fullStr	DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title_full_unstemmed	DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title_short	DNACLUST: accurate and efficient clustering of phylogenetic marker genes
title_sort	dnaclust: accurate and efficient clustering of phylogenetic marker genes
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213679/ https://www.ncbi.nlm.nih.gov/pubmed/21718538 http://dx.doi.org/10.1186/1471-2105-12-271
work_keys_str_mv	AT ghodsimohammadreza dnaclustaccurateandefficientclusteringofphylogeneticmarkergenes AT liubo dnaclustaccurateandefficientclusteringofphylogeneticmarkergenes AT popmihai dnaclustaccurateandefficientclusteringofphylogeneticmarkergenes
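
The abstract in this record describes a greedy strategy in which every cluster has a radius bounded by the similarity threshold, with a k-mer filter used to avoid unnecessary alignments. The following is a minimal sketch of that general idea, not the authors' implementation. Several choices here are assumptions made for illustration: plain Levenshtein edit distance stands in for DNACLUST's alignment algorithm, the k-mer filter uses the standard q-gram lemma bound, and the names `greedy_cluster`, `may_be_within`, `kmer_profile`, and `edit_distance` are hypothetical.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance (assumed stand-in for the tool's alignment)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def kmer_profile(s, k):
    """Multiset of all overlapping k-mers in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def may_be_within(prof_a, prof_b, len_a, len_b, k, max_edits):
    """q-gram lemma: two strings within max_edits edits share at least
    max(len) - k + 1 - k * max_edits k-mers. Returning False proves the
    pair is too distant; returning True means 'compute the distance'."""
    shared = sum((prof_a & prof_b).values())  # multiset intersection size
    return shared >= max(len_a, len_b) - k + 1 - k * max_edits


def greedy_cluster(seqs, similarity, k=3):
    """Return clusters as lists of indices into seqs; the first index in
    each list is the cluster center. Every member lies within the allowed
    number of edits of its center, so the radius bound holds by construction."""
    profiles = [kmer_profile(s, k) for s in seqs]
    # Assumption: longest sequences are tried as centers first.
    unassigned = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = []
    while unassigned:
        center = unassigned[0]
        c_len = len(seqs[center])
        # Assumption: the threshold maps to an edit budget on the center length.
        max_edits = math.floor((1.0 - similarity) * c_len + 1e-9)
        members, rest = [center], []
        for i in unassigned[1:]:
            if (may_be_within(profiles[center], profiles[i], c_len,
                              len(seqs[i]), k, max_edits)
                    and edit_distance(seqs[center], seqs[i]) <= max_edits):
                members.append(i)
            else:
                rest.append(i)
        clusters.append(members)
        unassigned = rest
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TTTTGGGG", "TTTTGGGC"]
    print(greedy_cluster(reads, similarity=0.75))
```

In this sketch the filter can only rule pairs out, so it never changes the clustering result; it only skips distance computations. The abstract's suggestion of passing one sequence per cluster to downstream analyses would correspond here to taking the first index of each returned list.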
