
kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections

SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers w...


Bibliographic Details
Main authors: Lemane, Téo, Medvedev, Paul, Chikhi, Rayan, Peterlongo, Pierre
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710589/
https://www.ncbi.nlm.nih.gov/pubmed/36699393
http://dx.doi.org/10.1093/bioadv/vbac029
author Lemane, Téo
Medvedev, Paul
Chikhi, Rayan
Peterlongo, Pierre
collection PubMed
description SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmtricks. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
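The two ideas in the summary, counting hashes instead of k-mers and using joint counts across samples to keep low-abundance k-mers, can be illustrated with a minimal sketch. This is not the kmtricks implementation: the function names, the single BLAKE2b hash, the in-memory counter (instead of partitioning and sorting hashes on disk), and the rescue rule with its `solid_min` and `share_min` parameters are all assumptions made for illustration. Canonical k-mers and multiple hash functions are also left out.

```python
import hashlib
from collections import Counter


def kmer_hash(kmer, num_bits):
    """Map a k-mer to one bit position (a single hash function, for brevity)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_bits


def count_hashes(reads, k, num_bits):
    """Count hash values rather than k-mers; colliding k-mers share a count."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[kmer_hash(read[i:i + k], num_bits)] += 1
    return counts


def build_filters(samples, k, num_bits, solid_min=2, share_min=2):
    """Build one Bloom filter (bit array) per sample from joint hash counts.

    Assumed rule (hypothetical, for illustration): a hash is kept in a sample
    if its count there is >= solid_min, or if it is present with a lower count
    but reaches solid_min in at least share_min other samples.
    """
    per_sample = [count_hashes(reads, k, num_bits) for reads in samples]

    # Joint view: in how many samples does each hash reach solid_min?
    solid_in = Counter()
    for counts in per_sample:
        solid_in.update(h for h, c in counts.items() if c >= solid_min)

    filters = []
    for counts in per_sample:
        bits = bytearray((num_bits + 7) // 8)
        for h, c in counts.items():
            # If c < solid_min, this sample did not contribute to solid_in[h],
            # so solid_in[h] counts only the other samples.
            if c >= solid_min or solid_in[h] >= share_min:
                bits[h // 8] |= 1 << (h % 8)
        filters.append(bits)
    return filters


def contains(bits, kmer, num_bits):
    """Query a filter; may return false positives, never false negatives."""
    h = kmer_hash(kmer, num_bits)
    return bool(bits[h // 8] & (1 << (h % 8)))


# Toy input: "AAAAC" is seen once in the third sample but twice in two others.
samples = [["AAAAC", "AAAAC"], ["AAAAC", "AAAAC"], ["AAAAC", "GGGGT"]]
filters = build_filters(samples, k=5, num_bits=1 << 20)
strict = build_filters(samples, k=5, num_bits=1 << 20, share_min=3)
```

In this toy input, the single occurrence of `AAAAC` in the third sample is kept because two other samples support it, while a plain per-sample threshold would discard it. Raising `share_min` to 3 removes that support and the third filter stays empty.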
format Online
Article
Text
id pubmed-9710589
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9710589 2023-01-24 kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections Lemane, Téo Medvedev, Paul Chikhi, Rayan Peterlongo, Pierre Bioinform Adv Original Paper Oxford University Press 2022-04-29 /pmc/articles/PMC9710589/ /pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029 Text en © The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections
topic Original Paper