MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants
MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional...
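
The idea summarized above, indexing amino acid k-mers in a Bloom filter and querying the index with either protein or nucleotide sequences, can be illustrated with a small sketch. This is not MetaProFi's code: the class `BloomFilter`, the helpers `kmers`, `translate`, `six_frames`, `hit_fraction` and `query_nucleotide`, the BLAKE2b-based hashing, the filter size, and the choice of k = 4 are all assumptions made for illustration. Six-frame translation of nucleotide queries is likewise assumed here as one plausible way to bring a nucleotide query to the protein level.

```python
import hashlib

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class BloomFilter:
    """Toy Bloom filter (hypothetical; sizes and hashing are illustrative)."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def translate(dna):
    """Translate one reading frame; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward and three reverse-complement translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, rev) for offset in range(3)]


def hit_fraction(bf, protein, k):
    """Fraction of the query's amino acid k-mers reported present by the filter."""
    ks = kmers(protein, k)
    return sum(km in bf for km in ks) / len(ks) if ks else 0.0


def query_nucleotide(bf, dna, k):
    """Best hit fraction over the six translated frames of a nucleotide query."""
    return max(hit_fraction(bf, frame, k) for frame in six_frames(dna))


# Build an index from one protein, then query with protein and with DNA.
index = BloomFilter()
protein = "MKTAYIAKQR"
for km in kmers(protein, 4):
    index.add(km)

dna = "ATGAAAACTGCTTATATTGCTAAACAACGT"  # encodes the protein above in frame 0
print(hit_fraction(index, protein, 4))
print(query_nucleotide(index, dna, 4))
```

Because a Bloom filter never produces false negatives, every k-mer that was inserted is reported as present; unrelated k-mers may occasionally be reported as present too, at a rate that depends on the filter size and the number of hash functions.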
Main authors: | Srikakulam, Sanjay K; Keller, Sebastian; Dabbaghie, Fawaz; Bals, Robert; Kalinina, Olga V |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994790/ https://www.ncbi.nlm.nih.gov/pubmed/36825843 http://dx.doi.org/10.1093/bioinformatics/btad101 |
_version_ | 1784902693645451264 |
---|---|
author | Srikakulam, Sanjay K Keller, Sebastian Dabbaghie, Fawaz Bals, Robert Kalinina, Olga V |
author_sort | Srikakulam, Sanjay K |
collection | PubMed |
description | MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS: We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION: Source code in Python is available at https://github.com/kalininalab/metaprofi. |
format | Online Article Text |
id | pubmed-9994790 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9994790 2023-03-09 MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants Srikakulam, Sanjay K Keller, Sebastian Dabbaghie, Fawaz Bals, Robert Kalinina, Olga V Bioinformatics Original Paper MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS: We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION: Source code in Python is available at https://github.com/kalininalab/metaprofi. Oxford University Press 2023-02-24 /pmc/articles/PMC9994790/ /pubmed/36825843 http://dx.doi.org/10.1093/bioinformatics/btad101 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994790/ https://www.ncbi.nlm.nih.gov/pubmed/36825843 http://dx.doi.org/10.1093/bioinformatics/btad101 |