Cargando…

fimpera: drastic improvement of Approximate Membership Query data-structures with counts

MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most metho...

Descripción completa

Detalles Bibliográficos
Autores principales: Robidou, Lucas, Peterlongo, Pierre
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/
https://www.ncbi.nlm.nih.gov/pubmed/37195454
http://dx.doi.org/10.1093/bioinformatics/btad305
_version_ 1785047436400525312
author Robidou, Lucas
Peterlongo, Pierre
author_facet Robidou, Lucas
Peterlongo, Pierre
author_sort Robidou, Lucas
collection PubMed
description MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera.
format Online
Article
Text
id pubmed-10212535
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-102125352023-05-26 fimpera: drastic improvement of Approximate Membership Query data-structures with counts Robidou, Lucas Peterlongo, Pierre Bioinformatics Original Paper MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera. Oxford University Press 2023-05-17 /pmc/articles/PMC10212535/ /pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Paper
Robidou, Lucas
Peterlongo, Pierre
fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_full fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_fullStr fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_full_unstemmed fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_short fimpera: drastic improvement of Approximate Membership Query data-structures with counts
title_sort fimpera: drastic improvement of approximate membership query data-structures with counts
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/
https://www.ncbi.nlm.nih.gov/pubmed/37195454
http://dx.doi.org/10.1093/bioinformatics/btad305
work_keys_str_mv AT robidoulucas fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts
AT peterlongopierre fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts