
RESCRIPt: Reproducible sequence taxonomy reference database management

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleoti...


Bibliographic Details
Main authors:	Robeson, Michael S., O’Rourke, Devon R., Kaehler, Benjamin D., Ziemski, Michal, Dillon, Matthew R., Foster, Jeffrey T., Bokulich, Nicholas A.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8601625/ https://www.ncbi.nlm.nih.gov/pubmed/34748542 http://dx.doi.org/10.1371/journal.pcbi.1009581

author	Robeson, Michael S.; O’Rourke, Devon R.; Kaehler, Benjamin D.; Ziemski, Michal; Dillon, Matthew R.; Foster, Jeffrey T.; Bokulich, Nicholas A.
collection	PubMed
description	Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
format	Online Article Text
id	pubmed-8601625
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-8601625 2021-11-19 RESCRIPt: Reproducible sequence taxonomy reference database management Robeson, Michael S. O’Rourke, Devon R. Kaehler, Benjamin D. Ziemski, Michal Dillon, Matthew R. Foster, Jeffrey T. Bokulich, Nicholas A. PLoS Comput Biol Research Article Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt. Public Library of Science 2021-11-08 /pmc/articles/PMC8601625/ /pubmed/34748542 http://dx.doi.org/10.1371/journal.pcbi.1009581 Text en © 2021 Robeson, II et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	RESCRIPt: Reproducible sequence taxonomy reference database management
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8601625/ https://www.ncbi.nlm.nih.gov/pubmed/34748542 http://dx.doi.org/10.1371/journal.pcbi.1009581
