SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets

16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial com...

Bibliographic Details
Main authors:	Luan, Tu; Muralidharan, Harihara Subrahmaniam; Alshehri, Marwan; Mittra, Ipsa; Pop, Mihai
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Methods Online
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10164572/ https://www.ncbi.nlm.nih.gov/pubmed/36912074 http://dx.doi.org/10.1093/nar/gkad158

author	Luan, Tu; Muralidharan, Harihara Subrahmaniam; Alshehri, Marwan; Mittra, Ipsa; Pop, Mihai
collection	PubMed
description	16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
format	Online Article Text
id	pubmed-10164572
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10164572 2023-05-08 Nucleic Acids Res Methods Online Oxford University Press 2023-03-13 /pmc/articles/PMC10164572/ /pubmed/36912074 http://dx.doi.org/10.1093/nar/gkad158 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets
topic	Methods Online
