Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling
Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of the generated sequencing data, greedy algorithms are preferred for the...
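
The similarity-cutoff clustering described above can be illustrated with a minimal sketch. It is not the implementation used by any particular tool: the function names `identity` and `greedy_cluster` are hypothetical, and the identity measure is a toy position-wise comparison of equal-length strings, whereas real greedy clustering tools compute identity from sequence alignments.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Toy measure for illustration only (assumption): real tools align
    sequences before computing identity.
    """
    if not a or len(a) != len(b):
        raise ValueError("this toy identity needs non-empty, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: list[str], cutoff: float) -> list[list[str]]:
    """Assign each sequence to the first centroid it matches at >= cutoff.

    The first member of every cluster acts as its centroid. A sequence
    that matches no existing centroid starts a new cluster.
    """
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= cutoff:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no centroid close enough
    return clusters


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "TTTTTTTTTT", "TTTTTTTTTA"]
    for members in greedy_cluster(reads, cutoff=0.9):
        print(members)
```

Because only pairwise similarity to a centroid is checked, the result depends on input order and carries no information about where in the sequence the differences occur.
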
Main authors: | Kioukis, Antonios; Pourjam, Mohsen; Neuhaus, Klaus; Lagkouvardos, Ilias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580952/ https://www.ncbi.nlm.nih.gov/pubmed/36304326 http://dx.doi.org/10.3389/fbinf.2022.864597 |
_version_ | 1784812508051144704 |
---|---|
author | Kioukis, Antonios Pourjam, Mohsen Neuhaus, Klaus Lagkouvardos, Ilias |
author_sort | Kioukis, Antonios |
collection | PubMed |
description | Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of sequencing data generated, greedy algorithms are preferred for their time efficiency. Such algorithms rely only on pairwise sequence similarities. Thus, sequences with diverse phylogenetic backgrounds are sometimes clustered together. In contrast, taxonomic classifiers use position-specific taxonomic information to assign a probable taxonomy to a given sequence. Here we introduce Taxonomy Informed Clustering (TIC), a novel approach that uses classifier-assigned taxonomy to restrict clustering to only those sequences that share the same taxonomic path. Based on this concept, we offer a complete and automated pipeline for processing 16S rRNA amplicon datasets in diversity analyses. First, raw reads are processed to form denoised amplicons. Next, the denoised amplicons are taxonomically classified. Finally, the TIC algorithm progressively assigns clusters at the molecular species, genus, and family levels. TIC outperforms greedy clustering algorithms such as USEARCH and VSEARCH in terms of cluster purity and entropy when using data from the Living Tree Project as test samples. Furthermore, we applied TIC to a dataset containing all Bifidobacteriaceae-classified sequences from the IMNGS database. Here, TIC identified evidence for thousands of novel molecular genera and species. These results highlight the straightforward application of the TIC pipeline and its superior results compared to former methods in diversity studies. The pipeline is freely available at: https://github.com/Lagkouvardos/TIC. |
format | Online Article Text |
id | pubmed-9580952 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9580952 2022-10-26 Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling Kioukis, Antonios Pourjam, Mohsen Neuhaus, Klaus Lagkouvardos, Ilias Front Bioinform Bioinformatics Frontiers Media S.A. 2022-04-27 /pmc/articles/PMC9580952/ /pubmed/36304326 http://dx.doi.org/10.3389/fbinf.2022.864597 Text en Copyright © 2022 Kioukis, Pourjam, Neuhaus and Lagkouvardos. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580952/ https://www.ncbi.nlm.nih.gov/pubmed/36304326 http://dx.doi.org/10.3389/fbinf.2022.864597 |
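
The core idea in the description, restricting clustering to sequences that share the same taxonomic path and then scoring clusters by purity and entropy, can be sketched as follows. This is a minimal illustration under stated assumptions, not the TIC pipeline itself: all function names are hypothetical, the identity measure is a toy position-wise comparison of equal-length strings, the progressive species/genus/family steps are not reproduced, and purity and entropy follow common textbook definitions that may differ from those used in the paper.

```python
import math
from collections import Counter, defaultdict


def identity(a: str, b: str) -> float:
    """Toy identity (assumption): fraction of matching positions."""
    if not a or len(a) != len(b):
        raise ValueError("needs non-empty, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, cutoff):
    """Greedy centroid clustering; the first member is the centroid."""
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= cutoff:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def taxonomy_restricted_cluster(records, cutoff):
    """Cluster only among sequences that share the same taxonomic path.

    records: list of (sequence, path) pairs, where path is a tuple such
    as ("FamilyX", "GenusA") assigned by some classifier (hypothetical).
    Returns a list of (path, members) pairs.
    """
    by_path = defaultdict(list)
    for seq, path in records:
        by_path[path].append(seq)
    return [
        (path, members)
        for path, seqs in by_path.items()
        for members in greedy_cluster(seqs, cutoff)
    ]


def purity(label_clusters):
    """Share of items that carry the majority label of their cluster."""
    total = sum(len(c) for c in label_clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in label_clusters)
    return majority / total


def entropy(label_clusters):
    """Size-weighted mean Shannon entropy (bits) of labels per cluster."""
    total = sum(len(c) for c in label_clusters)
    result = 0.0
    for c in label_clusters:
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in Counter(c).values())
        result += (len(c) / total) * h
    return result


if __name__ == "__main__":
    records = [
        ("AAAAAAAAAA", ("FamilyX", "GenusA")),
        ("AAAAAAAAAT", ("FamilyX", "GenusB")),
        ("AAAAAAAATA", ("FamilyX", "GenusA")),
    ]
    label_of = dict(records)

    plain = greedy_cluster([seq for seq, _ in records], cutoff=0.9)
    plain_labels = [[label_of[s] for s in members] for members in plain]
    print("greedy only:", purity(plain_labels), entropy(plain_labels))

    restricted = taxonomy_restricted_cluster(records, cutoff=0.9)
    restricted_labels = [[label_of[s] for s in members] for _, members in restricted]
    print("taxonomy-restricted:", purity(restricted_labels), entropy(restricted_labels))
```

In this toy example the assigned path doubles as the reference label, so the restricted clusters are pure by construction. An evaluation like the one described would instead compare clusters against an independent reference taxonomy.
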