Cargando…
fimpera: drastic improvement of Approximate Membership Query data-structures with counts
MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most metho...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2023
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/ https://www.ncbi.nlm.nih.gov/pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305 |
_version_ | 1785047436400525312 |
---|---|
author | Robidou, Lucas Peterlongo, Pierre |
author_facet | Robidou, Lucas Peterlongo, Pierre |
author_sort | Robidou, Lucas |
collection | PubMed |
description | MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera. |
format | Online Article Text |
id | pubmed-10212535 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-102125352023-05-26 fimpera: drastic improvement of Approximate Membership Query data-structures with counts Robidou, Lucas Peterlongo, Pierre Bioinformatics Original Paper MOTIVATION: High throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scaled datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Queries (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a sensible false positive rate. RESULTS: We propose a novel algorithm, called fimpera, that enables the improvement of any cAMQ performance. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and it improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduces the query time. AVAILABILITY AND IMPLEMENTATION: https://github.com/lrobidou/fimpera. Oxford University Press 2023-05-17 /pmc/articles/PMC10212535/ /pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Robidou, Lucas Peterlongo, Pierre fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title |
fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title_full |
fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title_fullStr |
fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title_full_unstemmed |
fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title_short |
fimpera: drastic improvement of Approximate Membership Query data-structures with counts |
title_sort | fimpera: drastic improvement of approximate membership query data-structures with counts |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10212535/ https://www.ncbi.nlm.nih.gov/pubmed/37195454 http://dx.doi.org/10.1093/bioinformatics/btad305 |
work_keys_str_mv | AT robidoulucas fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts AT peterlongopierre fimperadrasticimprovementofapproximatemembershipquerydatastructureswithcounts |