Cargando…

On weighted k-mer dictionaries

We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: se...

Descripción completa

Detalles Bibliográficos
Autor principal: Pibiri, Giulio Ermanno
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276434/
https://www.ncbi.nlm.nih.gov/pubmed/37328897
http://dx.doi.org/10.1186/s13015-023-00226-2
_version_ 1785060077134151680
author Pibiri, Giulio Ermanno
author_facet Pibiri, Giulio Ermanno
author_sort Pibiri, Giulio Ermanno
collection PubMed
description We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: see text] -mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text] -mers as a pre-processing step. In fact, [Formula: see text] -mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the [Formula: see text] -mers. From a technical perspective, we exploit the order of the [Formula: see text] -mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text] -mer dictionary that is exact, weighted, associative, fast, and small.
format Online
Article
Text
id pubmed-10276434
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-102764342023-06-18 On weighted k-mer dictionaries Pibiri, Giulio Ermanno Algorithms Mol Biol Research We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: see text] -mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text] -mers as a pre-processing step. In fact, [Formula: see text] -mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the [Formula: see text] -mers. From a technical perspective, we exploit the order of the [Formula: see text] -mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text] -mer dictionary that is exact, weighted, associative, fast, and small. BioMed Central 2023-06-17 /pmc/articles/PMC10276434/ /pubmed/37328897 http://dx.doi.org/10.1186/s13015-023-00226-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Pibiri, Giulio Ermanno
On weighted k-mer dictionaries
title On weighted k-mer dictionaries
title_full On weighted k-mer dictionaries
title_fullStr On weighted k-mer dictionaries
title_full_unstemmed On weighted k-mer dictionaries
title_short On weighted k-mer dictionaries
title_sort on weighted k-mer dictionaries
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276434/
https://www.ncbi.nlm.nih.gov/pubmed/37328897
http://dx.doi.org/10.1186/s13015-023-00226-2
work_keys_str_mv AT pibirigiulioermanno onweightedkmerdictionaries