
Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling

Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of the generated sequencing data, greedy algorithms are preferred for the...

Bibliographic Details
Main Authors: Kioukis, Antonios; Pourjam, Mohsen; Neuhaus, Klaus; Lagkouvardos, Ilias
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580952/
https://www.ncbi.nlm.nih.gov/pubmed/36304326
http://dx.doi.org/10.3389/fbinf.2022.864597
author Kioukis, Antonios
Pourjam, Mohsen
Neuhaus, Klaus
Lagkouvardos, Ilias
author_facet Kioukis, Antonios
Pourjam, Mohsen
Neuhaus, Klaus
Lagkouvardos, Ilias
author_sort Kioukis, Antonios
collection PubMed
description Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of sequencing data generated, greedy algorithms are preferred for their time efficiency. Such algorithms rely only on pairwise sequence similarities. Thus, sequences with diverse phylogenetic backgrounds are sometimes clustered together. In contrast, taxonomic classifiers use position-specific taxonomic information to assign a probable taxonomy to a given sequence. Here we introduce Taxonomy Informed Clustering (TIC), a novel approach that uses classifier-assigned taxonomy to restrict clustering to only those sequences that share the same taxonomic path. Based on this concept, we offer a complete and automated pipeline for processing 16S rRNA amplicon datasets in diversity analyses. First, raw reads are processed to form denoised amplicons. Next, the denoised amplicons are taxonomically classified. Finally, the TIC algorithm progressively assigns clusters at the molecular species, genus, and family levels. TIC outperforms greedy clustering algorithms like USEARCH and VSEARCH in terms of cluster purity and entropy when using data from the Living Tree Project as test samples. Furthermore, we applied TIC to a dataset containing all Bifidobacteriaceae-classified sequences from the IMNGS database. Here, TIC identified evidence for thousands of novel molecular genera and species. These results highlight the straightforward application of the TIC pipeline and its superior results compared to former methods in diversity studies. The pipeline is freely available at: https://github.com/Lagkouvardos/TIC.
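The core idea in the description above (cluster only within groups of sequences that share a classifier-assigned taxonomic path) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy position-wise `identity` measure (no alignment), the example sequences, and the 0.85 cutoff are all assumptions made for illustration, and the purity and entropy functions follow common textbook definitions that may differ from those used in the article.

```python
from collections import Counter, defaultdict
from math import log2


def identity(a, b):
    """Toy similarity (assumption): matching positions / longer length, no alignment."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def greedy_cluster(seqs, cutoff):
    """Greedy centroid clustering: a sequence joins the first centroid it matches."""
    centroids, clusters = [], []
    for name, seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= cutoff:
                clusters[i].append(name)
                break
        else:
            # No centroid was similar enough: this sequence seeds a new cluster.
            centroids.append(seq)
            clusters.append([name])
    return clusters


def taxonomy_informed_cluster(records, cutoff):
    """Run greedy clustering separately inside each taxonomic path."""
    by_path = defaultdict(list)
    for name, seq, path in records:
        by_path[path].append((name, seq))
    clusters = []
    for seqs in by_path.values():
        clusters.extend(greedy_cluster(seqs, cutoff))
    return clusters


def purity(clusters, labels):
    """Fraction of sequences that carry their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(labels[n] for n in c).values()) for c in clusters)
    return majority / total


def entropy(clusters, labels):
    """Size-weighted mean Shannon entropy of the labels inside each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[n] for n in c)
        h = -sum((k / len(c)) * log2(k / len(c)) for k in counts.values())
        result += (len(c) / total) * h
    return result


# Hypothetical toy data: (name, sequence, classifier-assigned taxonomic path).
records = [
    ("a1", "ACGTACGTAC", "Bacteria;Actinobacteria;Bifidobacteriaceae"),
    ("a2", "ACGTACGTAT", "Bacteria;Actinobacteria;Bifidobacteriaceae"),
    ("b1", "ACGTACGTAG", "Bacteria;Firmicutes;Lactobacillaceae"),
    ("b2", "TTGTACGAAG", "Bacteria;Firmicutes;Lactobacillaceae"),
]
labels = {name: path for name, _, path in records}

plain = greedy_cluster([(n, s) for n, s, _ in records], cutoff=0.85)
tic = taxonomy_informed_cluster(records, cutoff=0.85)

print("plain greedy:", plain, "purity", purity(plain, labels))
print("taxonomy-informed:", tic, "purity", purity(tic, labels))
```

In this toy data, the plain greedy pass merges `b1` into a cluster seeded by a Bifidobacteriaceae sequence because the two differ at a single position, whereas the taxonomy-restricted pass never compares sequences across different paths, so its clusters cannot mix labels.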
format Online
Article
Text
id pubmed-9580952
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9580952 2022-10-26 Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling Kioukis, Antonios Pourjam, Mohsen Neuhaus, Klaus Lagkouvardos, Ilias Front Bioinform Bioinformatics Bacterial diversity is often analyzed using 16S rRNA gene amplicon sequencing. Commonly, sequences are clustered based on similarity cutoffs to obtain groups reflecting molecular species, genera, or families. Due to the amount of sequencing data generated, greedy algorithms are preferred for their time efficiency. Such algorithms rely only on pairwise sequence similarities. Thus, sequences with diverse phylogenetic backgrounds are sometimes clustered together. In contrast, taxonomic classifiers use position-specific taxonomic information to assign a probable taxonomy to a given sequence. Here we introduce Taxonomy Informed Clustering (TIC), a novel approach that uses classifier-assigned taxonomy to restrict clustering to only those sequences that share the same taxonomic path. Based on this concept, we offer a complete and automated pipeline for processing 16S rRNA amplicon datasets in diversity analyses. First, raw reads are processed to form denoised amplicons. Next, the denoised amplicons are taxonomically classified. Finally, the TIC algorithm progressively assigns clusters at the molecular species, genus, and family levels. TIC outperforms greedy clustering algorithms like USEARCH and VSEARCH in terms of cluster purity and entropy when using data from the Living Tree Project as test samples. Furthermore, we applied TIC to a dataset containing all Bifidobacteriaceae-classified sequences from the IMNGS database. Here, TIC identified evidence for thousands of novel molecular genera and species. These results highlight the straightforward application of the TIC pipeline and its superior results compared to former methods in diversity studies.
The pipeline is freely available at: https://github.com/Lagkouvardos/TIC. Frontiers Media S.A. 2022-04-27 /pmc/articles/PMC9580952/ /pubmed/36304326 http://dx.doi.org/10.3389/fbinf.2022.864597 Text en Copyright © 2022 Kioukis, Pourjam, Neuhaus and Lagkouvardos. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Bioinformatics
Kioukis, Antonios
Pourjam, Mohsen
Neuhaus, Klaus
Lagkouvardos, Ilias
Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling
title Taxonomy Informed Clustering, an Optimized Method for Purer and More Informative Clusters in Diversity Analysis and Microbiome Profiling
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9580952/
https://www.ncbi.nlm.nih.gov/pubmed/36304326
http://dx.doi.org/10.3389/fbinf.2022.864597