Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking...

Bibliographic Details
Main authors: Blackwell, Grace A., Hunt, Martin, Malone, Kerri M., Lima, Leandro, Horesh, Gal, Alako, Blaise T. F., Thomson, Nicholas R., Iqbal, Zamin
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects: Methods and Resources
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8577725/
https://www.ncbi.nlm.nih.gov/pubmed/34752446
http://dx.doi.org/10.1371/journal.pbio.3001421
author Blackwell, Grace A.
Hunt, Martin
Malone, Kerri M.
Lima, Leandro
Horesh, Gal
Alako, Blaise T. F.
Thomson, Nicholas R.
Iqbal, Zamin
collection PubMed
description The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
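The description above mentions MinHash indices for estimating genomic distance. As a rough illustration of the general idea only, the following minimal sketch builds a bottom-s MinHash sketch from the k-mers of a sequence and estimates the Jaccard similarity between two sequences. The function names, the k-mer length, the sketch size, and the choice of BLAKE2b as hash function are all assumptions made for this example; they are not taken from the record or from the authors' pipeline, and production tools typically also use canonical k-mers and much longer k.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (hypothetical helper)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=5, size=64):
    """Bottom-s sketch: the `size` smallest distinct k-mer hash values, sorted."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    a = minhash_sketch("ACGTACGTAC")
    b = minhash_sketch("ACGTACGG")
    print(jaccard_estimate(a, b))
```

When both sequences contain fewer distinct k-mers than the sketch size, as in the toy call above, the estimate coincides with the exact Jaccard index of the two k-mer sets; for whole genomes the sketch is far smaller than the k-mer set and the value is an approximation.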
format Online
Article
Text
id pubmed-8577725
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8577725 2021-11-10 PLoS Biol Methods and Resources Public Library of Science 2021-11-09 /pmc/articles/PMC8577725/ /pubmed/34752446 http://dx.doi.org/10.1371/journal.pbio.3001421 Text en © 2021 Blackwell et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences
topic Methods and Resources