Cargando…
CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metag...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2022
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/ https://www.ncbi.nlm.nih.gov/pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237 |
_version_ | 1784736316562341888 |
---|---|
author | Liu, Shaopeng Koslicki, David |
author_facet | Liu, Shaopeng Koslicki, David |
author_sort | Liu, Shaopeng |
collection | PubMed |
description | MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9235470 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-92354702022-06-29 CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices Liu, Shaopeng Koslicki, David Bioinformatics ISCB/Ismb 2022 MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-06-27 /pmc/articles/PMC9235470/ /pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | ISCB/Ismb 2022 Liu, Shaopeng Koslicki, David CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title | CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title_full | CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title_fullStr | CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title_full_unstemmed | CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title_short | CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices |
title_sort | cmash: fast, multi-resolution estimation of k-mer-based jaccard and containment indices |
topic | ISCB/Ismb 2022 |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/ https://www.ncbi.nlm.nih.gov/pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237 |
work_keys_str_mv | AT liushaopeng cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices AT koslickidavid cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices |