
Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de...


Bibliographic Details
Main authors: Rideout, Jai Ram, He, Yan, Navas-Molina, Jose A., Walters, William A., Ursell, Luke K., Gibbons, Sean M., Chase, John, McDonald, Daniel, Gonzalez, Antonio, Robbins-Pianka, Adam, Clemente, Jose C., Gilbert, Jack A., Huse, Susan M., Zhou, Hong-Wei, Knight, Rob, Caporaso, J. Gregory
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2014
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4145071/
https://www.ncbi.nlm.nih.gov/pubmed/25177538
http://dx.doi.org/10.7717/peerj.545
author Rideout, Jai Ram
He, Yan
Navas-Molina, Jose A.
Walters, William A.
Ursell, Luke K.
Gibbons, Sean M.
Chase, John
McDonald, Daniel
Gonzalez, Antonio
Robbins-Pianka, Adam
Clemente, Jose C.
Gilbert, Jack A.
Huse, Susan M.
Zhou, Hong-Wei
Knight, Rob
Caporaso, J. Gregory
collection PubMed
description We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU picking is often faster). We illustrate this here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by “classic” open-reference OTU picking. Through comparisons on three well-studied data sets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
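The description above names the approach without showing its shape, so here is a minimal sketch of how a subsampled open-reference workflow can be organized. It assumes the commonly documented four-step structure (closed-reference against the reference set, de novo clustering of a random subsample of the failures, closed-reference of all failures against the new centroids, and a final de novo pass on whatever remains). It is not QIIME's implementation and does not call uclust: the position-wise `identity` function, the greedy clustering, the function names, the `New.`/`Cleanup.` prefixes, and the default parameters are all illustrative assumptions.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (stand-in for a real aligner)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b), 1)


def first_hit(read, centroids, threshold):
    """Return the id of the first centroid the read matches, or None."""
    for cid, seq in centroids.items():
        if identity(read, seq) >= threshold:
            return cid
    return None


def closed_reference(reads, centroids, threshold):
    """Assign reads to existing centroids only; unmatched reads are failures.
    Each read is handled independently, so this step parallelizes easily."""
    clusters, failures = {}, []
    for read in reads:
        hit = first_hit(read, centroids, threshold)
        if hit is None:
            failures.append(read)
        else:
            clusters.setdefault(hit, []).append(read)
    return clusters, failures


def de_novo(reads, threshold, prefix):
    """Greedy serial clustering: a read joins the first matching centroid
    or becomes a new centroid itself."""
    centroids, clusters = {}, {}
    for read in reads:
        hit = first_hit(read, centroids, threshold)
        if hit is None:
            hit = f"{prefix}{len(centroids)}"
            centroids[hit] = read
        clusters.setdefault(hit, []).append(read)
    return centroids, clusters


def subsampled_open_reference(reads, reference, threshold=0.97,
                              fraction=0.1, seed=0):
    # Step 1: closed-reference against the reference collection.
    otus, failures = closed_reference(reads, reference, threshold)
    # Step 2: de novo clustering of a random subsample of the failures.
    k = max(1, round(fraction * len(failures))) if failures else 0
    subsample = random.Random(seed).sample(failures, k)
    new_centroids, _ = de_novo(subsample, threshold, "New.")
    # Step 3: closed-reference of ALL failures against the new centroids.
    step3, remaining = closed_reference(failures, new_centroids, threshold)
    # Step 4: serial de novo clustering of the reads that still failed.
    _, step4 = de_novo(remaining, threshold, "Cleanup.")
    for part in (step3, step4):
        for cid, members in part.items():
            otus.setdefault(cid, []).extend(members)
    return otus


# Hypothetical toy input: one reference sequence and five short reads.
reference = {"ref0": "AAAAAAAAAA"}
reads = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "CCCCCCCCCC", "GGGGGGGGGG"]
otus = subsampled_open_reference(reads, reference, threshold=0.9, fraction=0.5)
```

In this layout only steps 2 and 4 are serial, and step 2 sees just a fraction of the failures, which is the intuition behind the runtime claim in the description: most reads pass through the two closed-reference steps, where each read can be processed independently.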
format Online
Article
Text
id pubmed-4145071
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4145071 2014-08-29 PeerJ Bioinformatics PeerJ Inc. 2014-08-21 /pmc/articles/PMC4145071/ /pubmed/25177538 http://dx.doi.org/10.7717/peerj.545 Text en © 2014 Rideout et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences
topic Bioinformatics