
ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences

BACKGROUND: High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological...


Bibliographic Details
Main Authors: Catlett, Dylan; Son, Kevin; Liang, Connie
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Bioinformatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320524/
https://www.ncbi.nlm.nih.gov/pubmed/34395092
http://dx.doi.org/10.7717/peerj.11865
author Catlett, Dylan
Son, Kevin
Liang, Connie
collection PubMed
description BACKGROUND: High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological significance to these genetic data. To assign taxonomy to an ASV, a taxonomic assignment algorithm compares the ASV to a collection of reference sequences (a reference database) with known taxonomic affiliations. However, many taxonomic assignment algorithms and reference databases are available, and the optimal algorithm and database for a particular scientific question is often unclear. Here, we present the ensembleTax R package, which provides an efficient framework for integrating taxonomic assignments predicted with any number of taxonomic assignment algorithms and reference databases to determine ensemble taxonomic assignments for ASVs.
METHODS: The ensembleTax R package relies on two core algorithms: taxmapper and assign.ensembleTax. The taxmapper algorithm maps taxonomic assignments derived from one reference database onto the taxonomic nomenclature (a set of taxonomic naming and ranking conventions) of another reference database. The assign.ensembleTax algorithm computes ensemble taxonomic assignments for each ASV in a data set based on any number of taxonomic assignments determined with independent methods. Various parameters allow analysts to prioritize obtaining either more ASVs with more predicted clade names or more robust clade name predictions supported by multiple independent methods in ensemble taxonomic assignments.
RESULTS: The ensembleTax R package is used to compute two sets of ensemble taxonomic assignments for a collection of protistan ASVs sampled from the coastal ocean. Comparisons of taxonomic assignments predicted by individual methods with those predicted by ensemble methods show that conservative implementations of the ensembleTax package minimize disagreements between taxonomic assignments predicted by individual and ensemble methods, but result in ASVs with fewer ranks assigned taxonomy. Less conservative implementations of the ensembleTax package result in an increased fraction of ASVs classified at all taxonomic ranks, but increase the number of ASVs for which ensemble assignments disagree with those predicted by individual methods.
DISCUSSION: We discuss how implementation of the ensembleTax R package may be optimized to address specific scientific objectives based on the results of the application of the ensembleTax package to marine protist communities. While further work is required to evaluate the accuracy of ensemble taxonomic assignments relative to taxonomic assignments predicted by individual methods, we also discuss scenarios where ensemble methods are expected to improve the accuracy of taxonomy prediction for ASVs.
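The trade-off described in the METHODS section, more assigned ranks versus assignments supported by several methods, can be illustrated with a small sketch. This is not the ensembleTax implementation (the package is written in R, and its actual parameters are not given in this record). The function name `ensemble_assign`, the `min_votes` parameter, and the example lineages are all hypothetical, and the sketch assumes every lineage has already been mapped onto one shared nomenclature with the same ordered ranks.

```python
from collections import Counter
from typing import List, Optional, Sequence

# One lineage per method: a clade name (or None if unassigned) at each rank,
# ordered from the broadest rank to the finest. Hypothetical representation.
Lineage = Sequence[Optional[str]]


def ensemble_assign(lineages: List[Lineage], min_votes: int = 1) -> List[Optional[str]]:
    """Rank-by-rank plurality vote across methods (illustrative only).

    Assignment stops at the first rank where the top name is tied or is
    supported by fewer than `min_votes` methods; deeper ranks stay None.
    """
    if not lineages:
        return []
    n_ranks = len(lineages[0])
    result: List[Optional[str]] = [None] * n_ranks
    candidates = list(lineages)
    for rank in range(n_ranks):
        # Count only methods that actually assigned a name at this rank.
        votes = Counter(l[rank] for l in candidates if l[rank] is not None)
        if not votes:
            break
        ranked = votes.most_common()
        top_name, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if tied or top_count < min_votes:
            break
        result[rank] = top_name
        # Keep the lineage consistent: only methods that agree so far
        # may vote at deeper ranks.
        candidates = [l for l in candidates if l[rank] == top_name]
    return result


# Hypothetical assignments for one ASV from three independent methods.
method_a = ["Eukaryota", "Stramenopiles", "Bacillariophyta"]
method_b = ["Eukaryota", "Stramenopiles", None]
method_c = ["Eukaryota", "Alveolata", "Dinophyceae"]

permissive = ensemble_assign([method_a, method_b, method_c], min_votes=1)
conservative = ensemble_assign([method_a, method_b, method_c], min_votes=2)
```

With `min_votes=1` the sketch keeps the finest rank even though only one method supports it. With `min_votes=2` it leaves that rank unassigned, which mirrors the conservative behaviour described in the RESULTS section.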
format Online
Article
Text
id pubmed-8320524
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8320524 2021-08-13 ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences Catlett, Dylan; Son, Kevin; Liang, Connie PeerJ Bioinformatics PeerJ Inc. 2021-07-26 /pmc/articles/PMC8320524/ /pubmed/34395092 http://dx.doi.org/10.7717/peerj.11865 Text en ©2021 Catlett et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8320524/
https://www.ncbi.nlm.nih.gov/pubmed/34395092
http://dx.doi.org/10.7717/peerj.11865