Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a techn...
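Main Authors: | Karel Břinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, Michael Baym |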
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10153118/ https://www.ncbi.nlm.nih.gov/pubmed/37131636 http://dx.doi.org/10.1101/2023.04.15.536996 |
_version_ | 1785035874714517504 |
---|---|
author | Břinda, Karel; Lima, Leandro; Pignotti, Simone; Quinones-Olvera, Natalia; Salikhov, Kamil; Chikhi, Rayan; Kucherov, Gregory; Iqbal, Zamin; Baym, Michael |
author_sort | Břinda, Karel |
collection | PubMed |
description | Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure. |
format | Online Article Text |
id | pubmed-10153118 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-10153118 2023-05-03 Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression Břinda, Karel Lima, Leandro Pignotti, Simone Quinones-Olvera, Natalia Salikhov, Kamil Chikhi, Rayan Kucherov, Gregory Iqbal, Zamin Baym, Michael bioRxiv Article Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure. Cold Spring Harbor Laboratory 2023-04-18 /pmc/articles/PMC10153118/ /pubmed/37131636 http://dx.doi.org/10.1101/2023.04.15.536996 Text en https://creativecommons.org/licenses/by-nc/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Břinda, Karel Lima, Leandro Pignotti, Simone Quinones-Olvera, Natalia Salikhov, Kamil Chikhi, Rayan Kucherov, Gregory Iqbal, Zamin Baym, Michael Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression |
title | Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression |
title_sort | efficient and robust search of microbial genomes via phylogenetic compression |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10153118/ https://www.ncbi.nlm.nih.gov/pubmed/37131636 http://dx.doi.org/10.1101/2023.04.15.536996 |
work_keys_str_mv | AT brindakarel efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT limaleandro efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT pignottisimone efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT quinonesolveranatalia efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT salikhovkamil efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT chikhirayan efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT kucherovgregory efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT iqbalzamin efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression AT baymmichael efficientandrobustsearchofmicrobialgenomesviaphylogeneticcompression |
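
The description above says that phylogenetic compression uses evolutionary history to guide compression. A minimal sketch of that general idea, not the authors' pipeline: it assumes synthetic random genomes, uses cluster labels as a stand-in for a phylogenetic ordering, and uses `zlib` as an example of a general-purpose compressor with a bounded window. All function names and parameters here are hypothetical.

```python
import random
import zlib

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Return a copy of seq in which each base is redrawn at random with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)

def make_collection(n_clusters=8, per_cluster=6, length=20000, rate=0.01, seed=0):
    """Build a toy collection: each cluster holds mutated copies of one random ancestor."""
    rng = random.Random(seed)
    genomes = []
    for cluster in range(n_clusters):
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        for _ in range(per_cluster):
            genomes.append((cluster, mutate(ancestor, rate, rng)))
    return genomes

def compressed_size(genomes):
    """Concatenate the genomes in the given order and return the zlib-compressed size in bytes."""
    data = "\n".join(seq for _, seq in genomes).encode("ascii")
    return len(zlib.compress(data, 9))

genomes = make_collection()

# Baseline: genomes in an arbitrary (shuffled) order.
shuffled = genomes[:]
random.Random(1).shuffle(shuffled)

# Assumption: sorting by cluster label stands in for ordering along a phylogeny,
# so that closely related genomes end up adjacent in the compressor's input.
ordered = sorted(genomes, key=lambda g: g[0])

size_shuffled = compressed_size(shuffled)
size_ordered = compressed_size(ordered)

print(f"shuffled order: {size_shuffled} bytes")
print(f"related genomes adjacent: {size_ordered} bytes")
```

In this toy setting each genome is longer than half of zlib's 32 KiB window, so the compressor can only reuse material from the immediately preceding genome; placing relatives next to each other is what lets it find long matches. The sizes it prints say nothing about the compression ratios reported in the record above.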