Cargando…

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metag...

Descripción completa

Detalles Bibliográficos
Autores principales: Liu, Shaopeng, Koslicki, David
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/
https://www.ncbi.nlm.nih.gov/pubmed/35758788
http://dx.doi.org/10.1093/bioinformatics/btac237
_version_ 1784736316562341888
author Liu, Shaopeng
Koslicki, David
author_facet Liu, Shaopeng
Koslicki, David
author_sort Liu, Shaopeng
collection PubMed
description MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9235470
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-92354702022-06-29 CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices Liu, Shaopeng Koslicki, David Bioinformatics ISCB/Ismb 2022 MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-06-27 /pmc/articles/PMC9235470/ /pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle ISCB/Ismb 2022
Liu, Shaopeng
Koslicki, David
CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_full CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_fullStr CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_full_unstemmed CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_short CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_sort cmash: fast, multi-resolution estimation of k-mer-based jaccard and containment indices
topic ISCB/Ismb 2022
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/
https://www.ncbi.nlm.nih.gov/pubmed/35758788
http://dx.doi.org/10.1093/bioinformatics/btac237
work_keys_str_mv AT liushaopeng cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices
AT koslickidavid cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices