Cargando…

fimpera: drastic improvement of Approximate Membership Query data-structures with counts

MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most metho...

Descripción completa

Detalles Bibliográficos
Autores principales:	Robidou, Lucas, Peterlongo, Pierre
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2023
Materias:	Original Paper
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/ https://www.ncbi.nlm.nih.gov/pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305

_version_	1785047436400525312
author	Robidou, Lucas Peterlongo, Pierre
author_facet	Robidou, Lucas Peterlongo, Pierre
author_sort	Robidou, Lucas
collection	PubMed
description	MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera.
format	Online Article Text
id	pubmed-10212535
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-102125352023-05-26 fimpera: drastic improvement of Approximate Membership Query data-structures with counts Robidou, Lucas Peterlongo, Pierre Bioinformatics Original Paper MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera. Oxford University Press 2023-05-17 /pmc/articles/PMC10212535/ /pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Paper Robidou, Lucas Peterlongo, Pierre fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title	fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_full	fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_fullStr	fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_full_unstemmed	fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_short	fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_sort	fimpera: drastic improvement of approximate membership query data-structures with counts
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/ https://www.ncbi.nlm.nih.gov/pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305
work_keys_str_mv	AT robidoulucas fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts AT peterlongopierre fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts

fimpera: drastic improvement of Approximate Membership Query data-structures with counts

Ejemplares similares