Cargando…

Space-efficient representation of genomic k-mer count tables

MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries...

Descripción completa

Detalles Bibliográficos
Autores principales: Shibuya, Yoshihiro, Belazzougui, Djamal, Kucherov, Gregory
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939220/
https://www.ncbi.nlm.nih.gov/pubmed/35317833
http://dx.doi.org/10.1186/s13015-022-00212-0
_version_ 1784672699032797184
author Shibuya, Yoshihiro
Belazzougui, Djamal
Kucherov, Gregory
author_facet Shibuya, Yoshihiro
Belazzougui, Djamal
Kucherov, Gregory
author_sort Shibuya, Yoshihiro
collection PubMed
description MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s.
format Online
Article
Text
id pubmed-8939220
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-89392202022-03-23 Space-efficient representation of genomic k-mer count tables Shibuya, Yoshihiro Belazzougui, Djamal Kucherov, Gregory Algorithms Mol Biol Research MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s. BioMed Central 2022-03-21 /pmc/articles/PMC8939220/ /pubmed/35317833 http://dx.doi.org/10.1186/s13015-022-00212-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Shibuya, Yoshihiro
Belazzougui, Djamal
Kucherov, Gregory
Space-efficient representation of genomic k-mer count tables
title Space-efficient representation of genomic k-mer count tables
title_full Space-efficient representation of genomic k-mer count tables
title_fullStr Space-efficient representation of genomic k-mer count tables
title_full_unstemmed Space-efficient representation of genomic k-mer count tables
title_short Space-efficient representation of genomic k-mer count tables
title_sort space-efficient representation of genomic k-mer count tables
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939220/
https://www.ncbi.nlm.nih.gov/pubmed/35317833
http://dx.doi.org/10.1186/s13015-022-00212-0
work_keys_str_mv AT shibuyayoshihiro spaceefficientrepresentationofgenomickmercounttables
AT belazzouguidjamal spaceefficientrepresentationofgenomickmercounttables
AT kucherovgregory spaceefficientrepresentationofgenomickmercounttables