Two-Stage Clustering (TSC): A Pipeline for Selecting Operational Taxonomic Units for the High-Throughput Sequencing of PCR Amplicons

Clustering 16S/18S rRNA amplicon sequences into operational taxonomic units (OTUs) is a critical step for the bioinformatic analysis of microbial diversity. Here, we report a pipeline for selecting OTUs with a relatively low computational demand and a high degree of accuracy. This pipeline is referr...

Bibliographic Details
Main Authors: Jiang, Xiao-Tao, Zhang, Hai, Sheng, Hua-Fang, Wang, Yu, He, Yan, Zou, Fei, Zhou, Hong-Wei
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3256218/
https://www.ncbi.nlm.nih.gov/pubmed/22253923
http://dx.doi.org/10.1371/journal.pone.0030230
_version_ 1782221058062942208
author Jiang, Xiao-Tao
Zhang, Hai
Sheng, Hua-Fang
Wang, Yu
He, Yan
Zou, Fei
Zhou, Hong-Wei
author_facet Jiang, Xiao-Tao
Zhang, Hai
Sheng, Hua-Fang
Wang, Yu
He, Yan
Zou, Fei
Zhou, Hong-Wei
author_sort Jiang, Xiao-Tao
collection PubMed
description Clustering 16S/18S rRNA amplicon sequences into operational taxonomic units (OTUs) is a critical step for the bioinformatic analysis of microbial diversity. Here, we report a pipeline for selecting OTUs with a relatively low computational demand and a high degree of accuracy. This pipeline is referred to as two-stage clustering (TSC) because it divides tags into two groups according to their abundance and clusters them sequentially. The more abundant group is clustered using a hierarchical algorithm similar to that in ESPRIT, which has a high degree of accuracy but is computationally costly for large datasets. The rarer group, which includes the majority of tags, is then heuristically clustered to improve efficiency. To further improve the computational efficiency and accuracy, two preclustering steps are implemented. To maintain clustering accuracy, all tags are grouped into an OTU depending on their pairwise Needleman-Wunsch distance. This method not only improved the computational efficiency but also mitigated the spurious OTU estimation from ‘noise’ sequences. In addition, OTUs clustered using TSC showed comparable or improved performance in beta-diversity comparisons compared to existing OTU selection methods. This study suggests that the distribution of sequencing datasets is a useful property for improving the computational efficiency and increasing the clustering accuracy of the high-throughput sequencing of PCR amplicons. The software and user guide are freely available at http://hwzhoulab.smu.edu.cn/paperdata/.
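The two-stage idea in the description above can be sketched as follows. This is a minimal illustration, not the TSC software: the function names (`nw_distance`, `complete_linkage`, `two_stage_cluster`), the unit mismatch/gap costs, the normalization by the longer sequence, the `min_abundance` split, the `cutoff` default, and the greedy seed-based assignment of rare tags are all assumptions made for this example. The two preclustering steps mentioned in the description are not modeled, and complete linkage stands in for the "hierarchical algorithm similar to that in ESPRIT".

```python
from collections import Counter
from itertools import combinations


def nw_distance(a, b, mismatch=1, gap=1):
    """Needleman-Wunsch global alignment cost, normalized by the longer length.

    Assumed scoring: match = 0, mismatch = 1, gap = 1 (with these unit costs
    the alignment cost equals the edit distance).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb) * mismatch,  # (mis)match
                            prev[j] + gap,                        # gap in b
                            curr[j - 1] + gap))                   # gap in a
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)


def complete_linkage(seqs, cutoff):
    """Naive agglomerative clustering: merge the closest pair of clusters
    until the smallest complete-linkage distance exceeds the cutoff."""
    clusters = [[s] for s in seqs]

    def linkage(c1, c2):
        return max(nw_distance(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if d > cutoff:
            break
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters


def two_stage_cluster(tags, cutoff=0.03, min_abundance=2):
    """Stage 1: hierarchical clustering of abundant unique tags.
    Stage 2: greedy assignment of rare tags to the first OTU whose seed
    (its first, most abundant member) lies within the cutoff."""
    counts = Counter(tags)
    ordered = [s for s, _ in counts.most_common()]
    abundant = [s for s in ordered if counts[s] >= min_abundance]
    rare = [s for s in ordered if counts[s] < min_abundance]

    otus = complete_linkage(abundant, cutoff)
    for s in rare:
        for otu in otus:
            if nw_distance(s, otu[0]) <= cutoff:
                otu.append(s)
                break
        else:
            otus.append([s])  # no OTU close enough: start a new one
    return otus


tags = (["ACGTACGTAC"] * 5 + ["ACGTACGTAT"] * 3 + ["TTTTGGGGCC"] * 4
        + ["ACGTACGTAA", "TTTTGGGGCA", "GGGGGGGGGG"])
otus = two_stage_cluster(tags, cutoff=0.1)
```

The pairwise alignment is only computed exhaustively for the abundant unique tags, which are few; the rare tags, which make up most of the unique sequences, each need only one comparison per existing OTU seed. That is the source of the efficiency gain the description refers to, under the assumptions stated above.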
format Online
Article
Text
id pubmed-3256218
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3256218 2012-01-17 Two-Stage Clustering (TSC): A Pipeline for Selecting Operational Taxonomic Units for the High-Throughput Sequencing of PCR Amplicons Jiang, Xiao-Tao Zhang, Hai Sheng, Hua-Fang Wang, Yu He, Yan Zou, Fei Zhou, Hong-Wei PLoS One Research Article Clustering 16S/18S rRNA amplicon sequences into operational taxonomic units (OTUs) is a critical step for the bioinformatic analysis of microbial diversity. Here, we report a pipeline for selecting OTUs with a relatively low computational demand and a high degree of accuracy. This pipeline is referred to as two-stage clustering (TSC) because it divides tags into two groups according to their abundance and clusters them sequentially. The more abundant group is clustered using a hierarchical algorithm similar to that in ESPRIT, which has a high degree of accuracy but is computationally costly for large datasets. The rarer group, which includes the majority of tags, is then heuristically clustered to improve efficiency. To further improve the computational efficiency and accuracy, two preclustering steps are implemented. To maintain clustering accuracy, all tags are grouped into an OTU depending on their pairwise Needleman-Wunsch distance. This method not only improved the computational efficiency but also mitigated the spurious OTU estimation from ‘noise’ sequences. In addition, OTUs clustered using TSC showed comparable or improved performance in beta-diversity comparisons compared to existing OTU selection methods. This study suggests that the distribution of sequencing datasets is a useful property for improving the computational efficiency and increasing the clustering accuracy of the high-throughput sequencing of PCR amplicons. The software and user guide are freely available at http://hwzhoulab.smu.edu.cn/paperdata/. Public Library of Science 2012-01-11 /pmc/articles/PMC3256218/ /pubmed/22253923 http://dx.doi.org/10.1371/journal.pone.0030230 Text en Jiang et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Two-Stage Clustering (TSC): A Pipeline for Selecting Operational Taxonomic Units for the High-Throughput Sequencing of PCR Amplicons
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3256218/
https://www.ncbi.nlm.nih.gov/pubmed/22253923
http://dx.doi.org/10.1371/journal.pone.0030230