Cargando…

Locality-preserving minimal perfect hashing of k-mers

MOTIVATION: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space [Formula: see text] bijectively. It is well-known that [Formula: see text] bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the...

Descripción completa

Detalles Bibliográficos
Autores principales:	Pibiri, Giulio Ermanno, Shibuya, Yoshihiro, Limasset, Antoine
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2023
Materias:	General Computational Biology
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311298/ https://www.ncbi.nlm.nih.gov/pubmed/37387137 http://dx.doi.org/10.1093/bioinformatics/btad219

_version_	1785066713196265472
author	Pibiri, Giulio Ermanno Shibuya, Yoshihiro Limasset, Antoine
author_facet	Pibiri, Giulio Ermanno Shibuya, Yoshihiro Limasset, Antoine
author_sort	Pibiri, Giulio Ermanno
collection	PubMed
description	MOTIVATION: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space [Formula: see text] bijectively. It is well-known that [Formula: see text] bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of [Formula: see text] symbols, it seems possible to beat the classic [Formula: see text] bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers. RESULTS: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
format	Online Article Text
id	pubmed-10311298
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-103112982023-07-01 Locality-preserving minimal perfect hashing of k-mers Pibiri, Giulio Ermanno Shibuya, Yoshihiro Limasset, Antoine Bioinformatics General Computational Biology MOTIVATION: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space [Formula: see text] bijectively. It is well-known that [Formula: see text] bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of [Formula: see text] symbols, it seems possible to beat the classic [Formula: see text] bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers. RESULTS: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature. Oxford University Press 2023-06-30 /pmc/articles/PMC10311298/ /pubmed/37387137 http://dx.doi.org/10.1093/bioinformatics/btad219 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	General Computational Biology Pibiri, Giulio Ermanno Shibuya, Yoshihiro Limasset, Antoine Locality-preserving minimal perfect hashing of k-mers
title	Locality-preserving minimal perfect hashing of k-mers
title_full	Locality-preserving minimal perfect hashing of k-mers
title_fullStr	Locality-preserving minimal perfect hashing of k-mers
title_full_unstemmed	Locality-preserving minimal perfect hashing of k-mers
title_short	Locality-preserving minimal perfect hashing of k-mers
title_sort	locality-preserving minimal perfect hashing of k-mers
topic	General Computational Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311298/ https://www.ncbi.nlm.nih.gov/pubmed/37387137 http://dx.doi.org/10.1093/bioinformatics/btad219
work_keys_str_mv	AT pibirigiulioermanno localitypreservingminimalperfecthashingofkmers AT shibuyayoshihiro localitypreservingminimalperfecthashingofkmers AT limassetantoine localitypreservingminimalperfecthashingofkmers

Locality-preserving minimal perfect hashing of k-mers

Ejemplares similares