
De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units

Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. A number of methods have been employed to carry out the assignment of 16S rRNA gene sequences to OTUs leading to confusion over which method i...


Bibliographic Details
Main Authors: Westcott, Sarah L., Schloss, Patrick D.
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2015
Subjects: Computational Biology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4675110/
https://www.ncbi.nlm.nih.gov/pubmed/26664811
http://dx.doi.org/10.7717/peerj.1487
author Westcott, Sarah L.
Schloss, Patrick D.
collection PubMed
description Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. Several methods have been used to assign 16S rRNA gene sequences to OTUs, leading to confusion over which method is optimal. A recent study suggested that a clustering method should be selected based on its ability to generate stable OTU assignments that do not change as additional sequences are added to the dataset. In contrast, we contend that the quality of the OTU assignments, that is, the ability of the method to properly represent the distances between the sequences, is more important. Methods. Our analysis implemented six de novo clustering algorithms (single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm) as well as the open- and closed-reference methods. Using two previously published datasets, we used the Matthews correlation coefficient (MCC) to assess the stability and quality of OTU assignments. Results. The stability of OTU assignments did not reflect the quality of the assignments. Depending on the dataset being analyzed, the average linkage and the distance- and abundance-based greedy clustering methods generated OTUs that were more likely to represent the actual distances between sequences than the open- and closed-reference methods. We also demonstrated that, for the greedy algorithms, VSEARCH produced assignments comparable to those produced by USEARCH, making VSEARCH a viable free and open-source alternative to USEARCH. Further interrogation of the reference-based methods indicated that when USEARCH or VSEARCH was used to identify the closest reference, the OTU assignments were sensitive to the order of the reference sequences because the reference sequences can be identical over the region being considered. More troubling was the observation that while both USEARCH and VSEARCH have a high level of sensitivity to detect reference sequences, the specificity of those matches was poor relative to the true best match. Discussion. Our analysis calls into question the quality and stability of OTU assignments generated by the open- and closed-reference methods as implemented in the current version of QIIME. This study demonstrates that de novo methods are the optimal approach for assigning sequences to OTUs and that the quality of these assignments needs to be assessed for multiple methods to identify the optimal clustering method for a particular dataset.
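As a rough illustration of how the MCC can score the quality of OTU assignments, the sketch below classifies every pair of sequences by whether the pair is within a distance threshold and whether the two sequences share an OTU, then computes the MCC from the resulting counts. The 0.03 threshold, the function name `pairwise_mcc`, the sequence names, and the toy distances are assumptions made for this example; they are not taken from the record and this is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt


def pairwise_mcc(distances, otu_of, threshold=0.03):
    """Matthews correlation coefficient over all sequence pairs.

    distances: dict mapping frozenset({a, b}) -> pairwise distance
    otu_of:    dict mapping sequence name -> OTU label
    threshold: assumed distance cutoff for "should share an OTU"
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = distances[frozenset((a, b))] <= threshold
        same = otu_of[a] == otu_of[b]
        if close and same:
            tp += 1      # similar sequences, same OTU
        elif not close and not same:
            tn += 1      # dissimilar sequences, different OTUs
        elif same:
            fp += 1      # dissimilar sequences lumped together
        else:
            fn += 1      # similar sequences split apart
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the MCC is undefined (a zero marginal count).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: s1/s2 are close, s3/s4 are close, all else distant.
names = ["s1", "s2", "s3", "s4"]
toy_distances = {frozenset(p): 0.10 for p in combinations(names, 2)}
toy_distances[frozenset(("s1", "s2"))] = 0.01
toy_distances[frozenset(("s3", "s4"))] = 0.02

good = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
split = {"s1": "A", "s2": "B", "s3": "C", "s4": "C"}
lumped = {name: "A" for name in names}

good_mcc = pairwise_mcc(toy_distances, good)
split_mcc = pairwise_mcc(toy_distances, split)
lumped_mcc = pairwise_mcc(toy_distances, lumped)
```

In this toy case the assignment that matches the distances scores highest, while splitting a close pair or lumping everything into one OTU lowers the score. Comparing such scores across clustering methods on the same dataset is the kind of quality assessment the abstract argues for.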
format Online
Article
Text
id pubmed-4675110
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-4675110 2015-12-10 De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units Westcott, Sarah L. Schloss, Patrick D. PeerJ Computational Biology PeerJ Inc. 2015-12-08 /pmc/articles/PMC4675110/ /pubmed/26664811 http://dx.doi.org/10.7717/peerj.1487 Text en © 2015 Westcott and Schloss http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
spellingShingle Computational Biology
Westcott, Sarah L.
Schloss, Patrick D.
De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units
title De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units
topic Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4675110/
https://www.ncbi.nlm.nih.gov/pubmed/26664811
http://dx.doi.org/10.7717/peerj.1487