MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants

MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional...
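
As a rough illustration of the data structure the abstract refers to, the sketch below builds a small Bloom filter over amino acid k-mers and reports what fraction of a query's k-mers it contains. It is not MetaProFi's implementation: the class `BloomFilter`, the helpers `kmers` and `fraction_present`, the filter size, the number of hash functions, the value of k and the example sequence are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fraction_present(bloom: BloomFilter, query: str, k: int) -> float:
    """Fraction of the query's k-mers that the filter reports as present."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return sum(kmer in bloom for kmer in query_kmers) / len(query_kmers)


# Hypothetical protein sequence and parameters, for illustration only.
K = 4
indexed_protein = "MKTAYIAKQRQISFVKSHFSRQ"
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in kmers(indexed_protein, K):
    bloom.add(kmer)

hit_fraction = fraction_present(bloom, indexed_protein, K)
```

Because a Bloom filter never reports an inserted item as absent, querying the indexed sequence itself always yields a hit fraction of 1.0. An unrelated query can still score above zero through false positives, which is why tools of this kind typically apply a threshold to the fraction of matching k-mers.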

Bibliographic Details
Main authors: Srikakulam, Sanjay K; Keller, Sebastian; Dabbaghie, Fawaz; Bals, Robert; Kalinina, Olga V
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994790/
https://www.ncbi.nlm.nih.gov/pubmed/36825843
http://dx.doi.org/10.1093/bioinformatics/btad101
_version_ 1784902693645451264
author Srikakulam, Sanjay K
Keller, Sebastian
Dabbaghie, Fawaz
Bals, Robert
Kalinina, Olga V
author_facet Srikakulam, Sanjay K
Keller, Sebastian
Dabbaghie, Fawaz
Bals, Robert
Kalinina, Olga V
author_sort Srikakulam, Sanjay K
collection PubMed
description MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS: We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION: Source code in Python is available at https://github.com/kalininalab/metaprofi.
format Online
Article
Text
id pubmed-9994790
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9994790 2023-03-09 MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants Srikakulam, Sanjay K Keller, Sebastian Dabbaghie, Fawaz Bals, Robert Kalinina, Olga V Bioinformatics Original Paper MOTIVATION: Bloom filters are a popular data structure that allows rapid searches in large sequence datasets. So far, all tools work with nucleotide sequences; however, protein sequences are conserved over longer evolutionary distances, and only mutations on the protein level may have any functional significance. RESULTS: We present MetaProFi, a Bloom filter-based tool that, for the first time, offers the functionality to build indexes of amino acid sequences and query them with both amino acid and nucleotide sequences, thus bringing sequence comparison to the biologically relevant protein level. MetaProFi implements additional efficient engineering solutions, such as a shared memory system, chunked data storage and efficient compression. In addition to its conceptual novelty, MetaProFi demonstrates state-of-the-art performance and excellent memory consumption-to-speed ratio when applied to various large datasets. AVAILABILITY AND IMPLEMENTATION: Source code in Python is available at https://github.com/kalininalab/metaprofi. Oxford University Press 2023-02-24 /pmc/articles/PMC9994790/ /pubmed/36825843 http://dx.doi.org/10.1093/bioinformatics/btad101 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Paper
Srikakulam, Sanjay K
Keller, Sebastian
Dabbaghie, Fawaz
Bals, Robert
Kalinina, Olga V
MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants
title MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9994790/
https://www.ncbi.nlm.nih.gov/pubmed/36825843
http://dx.doi.org/10.1093/bioinformatics/btad101