StrainSelect: A novel microbiome reference database that disambiguates all bacterial strains, genome assemblies and extant cultures worldwide

Motivation: Microbial metagenomic profiling software and databases are advancing rapidly for development of novel disease biomarkers and therapeutics yet three problems impede analyses: 1) the conflation of “genome assembly” and “strain” in reference databases; 2) difficulty connecting DNA biomarker...

Bibliographic Details
Main Authors: DeSantis, Todd Z.; Cardona, Cesar; Narayan, Nicole R.; Viswanatham, Satish; Ravichandar, Divya; Wee, Brendan; Chow, Cheryl-Emiliane; Iwai, Shoko
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9939595/
https://www.ncbi.nlm.nih.gov/pubmed/36814618
http://dx.doi.org/10.1016/j.heliyon.2023.e13314
author DeSantis, Todd Z.
Cardona, Cesar
Narayan, Nicole R.
Viswanatham, Satish
Ravichandar, Divya
Wee, Brendan
Chow, Cheryl-Emiliane
Iwai, Shoko
author_sort DeSantis, Todd Z.
collection PubMed
description Motivation: Microbial metagenomic profiling software and databases are advancing rapidly for development of novel disease biomarkers and therapeutics yet three problems impede analyses: 1) the conflation of “genome assembly” and “strain” in reference databases; 2) difficulty connecting DNA biomarkers to a procurable strain for laboratory experimentation; and 3) absence of a comprehensive and unified strain-resolved reference database for integrating both shotgun metagenomics and 16S rRNA gene data. Results: We demarcated 681,087 strains, the largest collection of its kind, by filtering public data into a knowledge graph of vertices representing contiguous DNA sequences, genome assemblies, strain monikers and bio-resource center (BRC) catalog numbers then adding inter-vertex edges only for synonyms or direct derivatives. Surprisingly, for 10,043 important strains, we found replicate RefSeq genome assemblies obstructing interpretation of database searches. We organized each strain into eight taxonomic ranks with bootstrap confidence inversely correlated with genome assembly contamination. The StrainSelect database is suited for applications where a taxonomic, functional or procurement reference is needed for shotgun or amplicon metagenomics since 636,568 strains have at least one 16S rRNA gene, 245,005 have at least one annotated genome assembly, and 36,671 are procurable from at least one BRC. The database overcomes all three aforementioned problems since it disambiguates strains from assemblies, locates strains at BRCs, and unifies a taxonomic reference for both 16S rRNA and shotgun metagenomics. Availability: The StrainSelect database is available in igraph and tabular vertex-edge formats compatible with Neo4J. Dereplicated MinHash and fasta databases are distributed for sourmash and usearch pipelines at http://strainselect.secondgenome.com. Contact: todd.desantis@gmail.com. Supplementary information: Supplementary data are available online.
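The description above outlines a knowledge graph whose vertices are DNA sequences, genome assemblies, strain monikers and BRC catalog numbers, joined by edges only for synonyms or direct derivatives. The following is a minimal sketch of one way such a graph could be grouped into strains. It assumes that each connected component corresponds to one strain, which the abstract does not confirm. All vertex identifiers and the function name `demarcate_strains` are hypothetical and are not part of the StrainSelect distribution.

```python
from collections import defaultdict


def demarcate_strains(vertices, edges):
    """Group vertices into connected components with a union-find structure.

    Assumption (not stated in the abstract): one component == one strain.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    return list(groups.values())


# Hypothetical toy data: two assemblies and one catalog number for "Strain X",
# one assembly for "Strain Y".
vertices = [
    "assembly:A1", "assembly:A2", "moniker:Strain X", "brc:CAT-001",
    "assembly:B1", "moniker:Strain Y",
]
edges = [
    ("assembly:A1", "moniker:Strain X"),
    ("assembly:A2", "moniker:Strain X"),
    ("moniker:Strain X", "brc:CAT-001"),
    ("assembly:B1", "moniker:Strain Y"),
]

strains = demarcate_strains(vertices, edges)

# Strains with more than one assembly vertex, loosely analogous to the
# replicate RefSeq assemblies mentioned in the description.
replicates = [
    s for s in strains
    if sum(v.startswith("assembly:") for v in s) > 1
]
```

In this toy graph the two assemblies linked to the same moniker fall into a single group together with the catalog number, so a search hit on either assembly would resolve to the same procurable strain.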
format Online
Article
Text
id pubmed-9939595
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9939595 2023-02-21 StrainSelect: A novel microbiome reference database that disambiguates all bacterial strains, genome assemblies and extant cultures worldwide DeSantis, Todd Z. Cardona, Cesar Narayan, Nicole R. Viswanatham, Satish Ravichandar, Divya Wee, Brendan Chow, Cheryl-Emiliane Iwai, Shoko Heliyon Research Article Motivation: Microbial metagenomic profiling software and databases are advancing rapidly for development of novel disease biomarkers and therapeutics yet three problems impede analyses: 1) the conflation of “genome assembly” and “strain” in reference databases; 2) difficulty connecting DNA biomarkers to a procurable strain for laboratory experimentation; and 3) absence of a comprehensive and unified strain-resolved reference database for integrating both shotgun metagenomics and 16S rRNA gene data. Results: We demarcated 681,087 strains, the largest collection of its kind, by filtering public data into a knowledge graph of vertices representing contiguous DNA sequences, genome assemblies, strain monikers and bio-resource center (BRC) catalog numbers then adding inter-vertex edges only for synonyms or direct derivatives. Surprisingly, for 10,043 important strains, we found replicate RefSeq genome assemblies obstructing interpretation of database searches. We organized each strain into eight taxonomic ranks with bootstrap confidence inversely correlated with genome assembly contamination. The StrainSelect database is suited for applications where a taxonomic, functional or procurement reference is needed for shotgun or amplicon metagenomics since 636,568 strains have at least one 16S rRNA gene, 245,005 have at least one annotated genome assembly, and 36,671 are procurable from at least one BRC. The database overcomes all three aforementioned problems since it disambiguates strains from assemblies, locates strains at BRCs, and unifies a taxonomic reference for both 16S rRNA and shotgun metagenomics.
Availability: The StrainSelect database is available in igraph and tabular vertex-edge formats compatible with Neo4J. Dereplicated MinHash and fasta databases are distributed for sourmash and usearch pipelines at http://strainselect.secondgenome.com. Contact: todd.desantis@gmail.com. Supplementary information: Supplementary data are available online. Elsevier 2023-02-04 /pmc/articles/PMC9939595/ /pubmed/36814618 http://dx.doi.org/10.1016/j.heliyon.2023.e13314 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title StrainSelect: A novel microbiome reference database that disambiguates all bacterial strains, genome assemblies and extant cultures worldwide
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9939595/
https://www.ncbi.nlm.nih.gov/pubmed/36814618
http://dx.doi.org/10.1016/j.heliyon.2023.e13314