Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) a...

Bibliographic Details
Main Authors: Seiler, Enrico; Mehringer, Svenja; Darvish, Mitra; Turc, Etienne; Reinert, Knut
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8313605/
https://www.ncbi.nlm.nih.gov/pubmed/34337360
http://dx.doi.org/10.1016/j.isci.2021.102782
_version_ 1783729385251012608
author Seiler, Enrico
Mehringer, Svenja
Darvish, Mitra
Turc, Etienne
Reinert, Knut
author_facet Seiler, Enrico
Mehringer, Svenja
Darvish, Mitra
Turc, Etienne
Reinert, Knut
author_sort Seiler, Enrico
collection PubMed
description We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
format Online
Article
Text
id pubmed-8313605
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8313605 2021-07-31 Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences Seiler, Enrico Mehringer, Svenja Darvish, Mitra Turc, Etienne Reinert, Knut iScience Article We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara. Elsevier 2021-06-24 /pmc/articles/PMC8313605/ /pubmed/34337360 http://dx.doi.org/10.1016/j.isci.2021.102782 Text en © 2021 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
title_full Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
title_fullStr Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
title_full_unstemmed Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
title_short Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
title_sort raptor: a fast and space-efficient pre-filter for querying very large collections of nucleotide sequences
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8313605/
https://www.ncbi.nlm.nih.gov/pubmed/34337360
http://dx.doi.org/10.1016/j.isci.2021.102782