kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections
SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers w...
Main authors: | Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2022 |
Subjects: | Original Paper |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710589/ https://www.ncbi.nlm.nih.gov/pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029 |
_version_ | 1784841399229743104 |
---|---|
author | Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre |
author_sort | Lemane, Téo |
collection | PubMed |
description | SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmtricks. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online. |
format | Online Article Text |
id | pubmed-9710589 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9710589 2023-01-24 kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre Bioinform Adv Original Paper Oxford University Press 2022-04-29 /pmc/articles/PMC9710589/ /pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710589/ https://www.ncbi.nlm.nih.gov/pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029 |
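
The two ideas named in the description (counting hashes rather than k-mers, and using joint counts across samples to keep low-abundance k-mers that recur in several samples) can be illustrated with a minimal Python sketch. This is not the kmtricks implementation: the function names, the single-hash Bloom filter, and the thresholds `hard_min`, `soft_min` and `share_min` are all assumptions made for illustration, and the partitioning step mentioned in the description is omitted.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_hash(kmer, n_bits):
    """Map a k-mer to a bit position in [0, n_bits) with a stable hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_bits


def count_hashes(reads, k, n_bits):
    """Count hash values directly, so k-mers themselves are never stored."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[kmer_hash(km, n_bits)] += 1
    return counts


def build_filters(samples, k, n_bits, hard_min=2, soft_min=1, share_min=2):
    """Build one single-hash Bloom filter per sample (hypothetical thresholds).

    A hash is kept in a sample if its count reaches hard_min, or if it
    reaches soft_min there and reaches soft_min in at least share_min samples.
    """
    counts = [count_hashes(reads, k, n_bits) for reads in samples]

    # Joint view: in how many samples does each hash reach soft_min?
    support = Counter()
    for sample_counts in counts:
        for h, n in sample_counts.items():
            if n >= soft_min:
                support[h] += 1

    filters = []
    for sample_counts in counts:
        bits = bytearray(n_bits)  # one byte per bit, for readability only
        for h, n in sample_counts.items():
            solid = n >= hard_min
            rescued = n >= soft_min and support[h] >= share_min
            if solid or rescued:
                bits[h] = 1
        filters.append(bits)
    return filters


def contains(bits, kmer):
    """Bloom filter membership query (false positives are possible)."""
    return bits[kmer_hash(kmer, len(bits))] == 1


if __name__ == "__main__":
    sample_a = ["ACGTA"]
    sample_b = ["ACGTT", "GTTGTT"]
    filter_a, filter_b = build_filters([sample_a, sample_b], k=3, n_bits=1 << 20)
    # ACG occurs once in each sample, so it is rescued in both filters.
    print(contains(filter_a, "ACG"), contains(filter_b, "GTT"))
```

In this toy input, `ACG` has abundance 1 in each sample and would be dropped by a plain per-sample threshold of 2, but it is kept because it appears in both samples. Raising `share_min` above the number of samples turns the rescue off and reproduces the plain per-sample filter.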