Cargando…

On weighted k-mer dictionaries

We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: se...

Descripción completa

Detalles Bibliográficos
Autor principal:	Pibiri, Giulio Ermanno
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2023
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276434/ https://www.ncbi.nlm.nih.gov/pubmed/37328897 http://dx.doi.org/10.1186/s13015-023-00226-2

_version_	1785060077134151680
author	Pibiri, Giulio Ermanno
author_facet	Pibiri, Giulio Ermanno
author_sort	Pibiri, Giulio Ermanno
collection	PubMed
description	We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: see text] -mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text] -mers as a pre-processing step. In fact, [Formula: see text] -mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the [Formula: see text] -mers. From a technical perspective, we exploit the order of the [Formula: see text] -mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text] -mer dictionary that is exact, weighted, associative, fast, and small.
format	Online Article Text
id	pubmed-10276434
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-102764342023-06-18 On weighted k-mer dictionaries Pibiri, Giulio Ermanno Algorithms Mol Biol Research We consider the problem of representing a set of [Formula: see text] -mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text] -mer is efficient. The representation is called a weighted dictionary of [Formula: see text] -mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text] -mers as a pre-processing step. In fact, [Formula: see text] -mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the [Formula: see text] -mers. From a technical perspective, we exploit the order of the [Formula: see text] -mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text] -mer dictionary that is exact, weighted, associative, fast, and small. BioMed Central 2023-06-17 /pmc/articles/PMC10276434/ /pubmed/37328897 http://dx.doi.org/10.1186/s13015-023-00226-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Pibiri, Giulio Ermanno On weighted k-mer dictionaries
title	On weighted k-mer dictionaries
title_full	On weighted k-mer dictionaries
title_fullStr	On weighted k-mer dictionaries
title_full_unstemmed	On weighted k-mer dictionaries
title_short	On weighted k-mer dictionaries
title_sort	on weighted k-mer dictionaries
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276434/ https://www.ncbi.nlm.nih.gov/pubmed/37328897 http://dx.doi.org/10.1186/s13015-023-00226-2
work_keys_str_mv	AT pibirigiulioermanno onweightedkmerdictionaries

On weighted k-mer dictionaries

Ejemplares similares