kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections

SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers w...


Bibliographic Details
Main Authors:	Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710589/ https://www.ncbi.nlm.nih.gov/pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029

author	Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre
collection	PubMed
description	SUMMARY: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundant ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (i) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; (ii) a novel technique that takes advantage of joint counting to preserve low-abundant k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8× more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmtricks. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
format	Online Article Text
id	pubmed-9710589
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9710589 2023-01-24 kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Lemane, Téo; Medvedev, Paul; Chikhi, Rayan; Peterlongo, Pierre. Bioinform Adv. Original Paper. Oxford University Press 2022-04-29 /pmc/articles/PMC9710589/ /pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029 Text en © The Author(s) 2022. Published by Oxford University Press.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9710589/ https://www.ncbi.nlm.nih.gov/pubmed/36699393 http://dx.doi.org/10.1093/bioadv/vbac029
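
The description above mentions two ideas: building each sample's Bloom filter from k-mer hashes rather than from the k-mers themselves, and using joint counts across samples to keep low-abundance k-mers that recur in several samples. The following is a minimal Python sketch of that general idea, not the kmtricks implementation. The function names, the threshold names (min_solid, min_rescue, min_shared), the single BLAKE2b-based hash and the one-byte-per-bit filter layout are all assumptions made for illustration; partitioning, sorting and canonical k-mers are left out.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no canonical form, for simplicity)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_hash(kmer, n_bits):
    """Map a k-mer to one filter position (a single hash function, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_bits


def count_hashes(reads, k, n_bits):
    """Count hash values instead of k-mers for one sample."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[kmer_hash(km, n_bits)] += 1
    return counts


def build_filters(samples, k, n_bits, min_solid=3, min_rescue=1, min_shared=2):
    """Build one filter per sample from joint counts.

    A hash is kept in a sample if its count there is >= min_solid, or if it is
    >= min_rescue and the hash is solid (>= min_solid) in at least min_shared
    samples. These threshold names are hypothetical.
    """
    per_sample = [count_hashes(reads, k, n_bits) for reads in samples]

    # For each hash, count the samples in which it is solid.
    solid_in = Counter()
    for counts in per_sample:
        for h, c in counts.items():
            if c >= min_solid:
                solid_in[h] += 1

    filters = []
    for counts in per_sample:
        bits = bytearray(n_bits)  # one byte per bit, for readability only
        for h, c in counts.items():
            rescued = c >= min_rescue and solid_in[h] >= min_shared
            if c >= min_solid or rescued:
                bits[h] = 1
        filters.append(bits)
    return filters


def contains(bits, kmer):
    """Bloom-filter membership query (false positives are possible)."""
    return bits[kmer_hash(kmer, len(bits))] == 1


# Toy collection: two samples where ACGT is abundant, one where it is rare.
samples = [["ACGTT"] * 3, ["ACGTT"] * 3, ["ACGTT", "GGGCC"]]
filters = build_filters(samples, k=4, n_bits=1 << 20)
print(contains(filters[2], "ACGT"), contains(filters[2], "GGGC"))
```

In this toy collection, ACGT occurs only once in the third sample but is solid in the other two, so it is kept there, whereas GGGC occurs once in a single sample and is filtered out. A plain per-sample abundance cutoff would discard both.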
