
CLUSTOM: A Novel Method for Clustering 16S rRNA Next Generation Sequences by Overlap Minimization

The recent nucleic acid sequencing revolution driven by shotgun and high-throughput technologies has led to a rapid increase in the number of sequences for microbial communities. The availability of 16S ribosomal RNA (rRNA) gene sequences from a multitude of natural environments now offers a unique...


Bibliographic Details
Main Authors: Hwang, Kyuin; Oh, Jeongsu; Kim, Tae-Kyung; Kim, Byung Kwon; Yu, Dong Su; Hou, Bo Kyeng; Caetano-Anollés, Gustavo; Hong, Soon Gyu; Kim, Kyung Mo
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3641076/
https://www.ncbi.nlm.nih.gov/pubmed/23650520
http://dx.doi.org/10.1371/journal.pone.0062623
author Hwang, Kyuin
Oh, Jeongsu
Kim, Tae-Kyung
Kim, Byung Kwon
Yu, Dong Su
Hou, Bo Kyeng
Caetano-Anollés, Gustavo
Hong, Soon Gyu
Kim, Kyung Mo
author_sort Hwang, Kyuin
collection PubMed
description The recent nucleic acid sequencing revolution, driven by shotgun and high-throughput technologies, has led to a rapid increase in the number of sequences available for microbial communities. The availability of 16S ribosomal RNA (rRNA) gene sequences from a multitude of natural environments now offers a unique opportunity to study microbial diversity and community structure. The large volume of sequencing data, however, makes it time-consuming to assign individual sequences to phylotypes by searching them against public databases. Since ribosomal sequences have diverged across prokaryotic species, they can be grouped into clusters that represent operational taxonomic units. However, available clustering programs suffer from overlap of sequence spaces in adjacent clusters. In natural environments, gene sequences are homogeneous within species but divergent between species. This evolutionary constraint results in an uneven distribution of genetic distances of genes in sequence space. To cluster 16S rRNA sequences more accurately, it is therefore essential to select core sequences that are located at the centers of the distributions represented by the genetic distance of sequences in taxonomic units. Based on this idea, we here describe a novel sequence clustering algorithm named CLUSTOM that minimizes the overlaps between adjacent clusters. The performance of this algorithm was evaluated in a comparative exercise with existing programs, using the reference sequences of the SILVA database as well as published pyrosequencing datasets. The test revealed that our algorithm achieves higher accuracy than ESPRIT-Tree and mothur, which are among the best clustering algorithms. The results indicate that the concept of an uneven distribution of sequence distances can be used to cluster 16S rRNA gene sequences effectively. CLUSTOM has been implemented both as a web application and as a standalone command-line application, both of which are available at http://clustom.kribb.re.kr.
format Online
Article
Text
id pubmed-3641076
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3641076 2013-05-06 CLUSTOM: A Novel Method for Clustering 16S rRNA Next Generation Sequences by Overlap Minimization PLoS One Research Article Public Library of Science 2013-05-01 /pmc/articles/PMC3641076/ /pubmed/23650520 http://dx.doi.org/10.1371/journal.pone.0062623 Text en © 2013 Hwang et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title CLUSTOM: A Novel Method for Clustering 16S rRNA Next Generation Sequences by Overlap Minimization
topic Research Article