Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scal...

Bibliographic Details
Main Authors: Alanko, Jarno N; Vuohtoniemi, Jaakko; Mäklin, Tommi; Puglisi, Simon J
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects: Genome Sequence Analysis
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311346/
https://www.ncbi.nlm.nih.gov/pubmed/37387143
http://dx.doi.org/10.1093/bioinformatics/btad233
author Alanko, Jarno N
Vuohtoniemi, Jaakko
Mäklin, Tommi
Puglisi, Simon J
collection PubMed
description MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount. RESULTS: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools, Metagraph and Bifrost, were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
format Online
Article
Text
id pubmed-10311346
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10311346 2023-07-01
Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes
Alanko, Jarno N
Vuohtoniemi, Jaakko
Mäklin, Tommi
Puglisi, Simon J
Bioinformatics
Genome Sequence Analysis
Oxford University Press 2023-06-30
/pmc/articles/PMC10311346/
/pubmed/37387143
http://dx.doi.org/10.1093/bioinformatics/btad233
Text en
© The Author(s) 2023. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes
topic Genome Sequence Analysis
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311346/
https://www.ncbi.nlm.nih.gov/pubmed/37387143
http://dx.doi.org/10.1093/bioinformatics/btad233