Cargando…

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metag...

Descripción completa

Detalles Bibliográficos
Autores principales:	Liu, Shaopeng, Koslicki, David
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2022
Materias:	ISCB/Ismb 2022
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/ https://www.ncbi.nlm.nih.gov/pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237

_version_	1784736316562341888
author	Liu, Shaopeng Koslicki, David
author_facet	Liu, Shaopeng Koslicki, David
author_sort	Liu, Shaopeng
collection	PubMed
description	MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9235470
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-92354702022-06-29 CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices Liu, Shaopeng Koslicki, David Bioinformatics ISCB/Ismb 2022 MOTIVATION: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large [Formula: see text] value, we can simultaneously obtain k-mer-based estimates for all k values up to k(max). This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. RESULTS: We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. AVAILABILITY AND IMPLEMENTATION: A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2022-06-27 /pmc/articles/PMC9235470/ /pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	ISCB/Ismb 2022 Liu, Shaopeng Koslicki, David CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title	CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_full	CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_fullStr	CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_full_unstemmed	CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_short	CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices
title_sort	cmash: fast, multi-resolution estimation of k-mer-based jaccard and containment indices
topic	ISCB/Ismb 2022
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235470/ https://www.ncbi.nlm.nih.gov/pubmed/35758788 http://dx.doi.org/10.1093/bioinformatics/btac237
work_keys_str_mv	AT liushaopeng cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices AT koslickidavid cmashfastmultiresolutionestimationofkmerbasedjaccardandcontainmentindices

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices

Ejemplares similares