Open-Source Sequence Clustering Methods Improve the State Of the Art

Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art ope...

Bibliographic Details
Main authors: Kopylova, Evguenia, Navas-Molina, Jose A., Mercier, Céline, Xu, Zhenjiang Zech, Mahé, Frédéric, He, Yan, Zhou, Hong-Wei, Rognes, Torbjørn, Caporaso, J. Gregory, Knight, Rob
Format: Online Article Text
Language: English
Published: American Society of Microbiology 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5069751/
https://www.ncbi.nlm.nih.gov/pubmed/27822515
http://dx.doi.org/10.1128/mSystems.00003-15
_version_ 1782460994902032384
author Kopylova, Evguenia
Navas-Molina, Jose A.
Mercier, Céline
Xu, Zhenjiang Zech
Mahé, Frédéric
He, Yan
Zhou, Hong-Wei
Rognes, Torbjørn
Caporaso, J. Gregory
Knight, Rob
author_facet Kopylova, Evguenia
Navas-Molina, Jose A.
Mercier, Céline
Xu, Zhenjiang Zech
Mahé, Frédéric
He, Yan
Zhou, Hong-Wei
Rognes, Torbjørn
Caporaso, J. Gregory
Knight, Rob
author_sort Kopylova, Evguenia
collection PubMed
description Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
format Online
Article
Text
id pubmed-5069751
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher American Society of Microbiology
record_format MEDLINE/PubMed
spelling pubmed-50697512016-11-07 Open-Source Sequence Clustering Methods Improve the State Of the Art Kopylova, Evguenia Navas-Molina, Jose A. Mercier, Céline Xu, Zhenjiang Zech Mahé, Frédéric He, Yan Zhou, Hong-Wei Rognes, Torbjørn Caporaso, J. Gregory Knight, Rob mSystems Research Article Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. 
Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1). American Society of Microbiology 2016-02-09 /pmc/articles/PMC5069751/ /pubmed/27822515 http://dx.doi.org/10.1128/mSystems.00003-15 Text en Copyright © 2016 Kopylova et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/) .
spellingShingle Research Article
Kopylova, Evguenia
Navas-Molina, Jose A.
Mercier, Céline
Xu, Zhenjiang Zech
Mahé, Frédéric
He, Yan
Zhou, Hong-Wei
Rognes, Torbjørn
Caporaso, J. Gregory
Knight, Rob
Open-Source Sequence Clustering Methods Improve the State Of the Art
title Open-Source Sequence Clustering Methods Improve the State Of the Art
title_full Open-Source Sequence Clustering Methods Improve the State Of the Art
title_fullStr Open-Source Sequence Clustering Methods Improve the State Of the Art
title_full_unstemmed Open-Source Sequence Clustering Methods Improve the State Of the Art
title_short Open-Source Sequence Clustering Methods Improve the State Of the Art
title_sort open-source sequence clustering methods improve the state of the art
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5069751/
https://www.ncbi.nlm.nih.gov/pubmed/27822515
http://dx.doi.org/10.1128/mSystems.00003-15
work_keys_str_mv AT kopylovaevguenia opensourcesequenceclusteringmethodsimprovethestateoftheart
AT navasmolinajosea opensourcesequenceclusteringmethodsimprovethestateoftheart
AT mercierceline opensourcesequenceclusteringmethodsimprovethestateoftheart
AT xuzhenjiangzech opensourcesequenceclusteringmethodsimprovethestateoftheart
AT mahefrederic opensourcesequenceclusteringmethodsimprovethestateoftheart
AT heyan opensourcesequenceclusteringmethodsimprovethestateoftheart
AT zhouhongwei opensourcesequenceclusteringmethodsimprovethestateoftheart
AT rognestorbjørn opensourcesequenceclusteringmethodsimprovethestateoftheart
AT caporasojgregory opensourcesequenceclusteringmethodsimprovethestateoftheart
AT knightrob opensourcesequenceclusteringmethodsimprovethestateoftheart