Cargando…
Space-efficient representation of genomic k-mer count tables
MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2022
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939220/ https://www.ncbi.nlm.nih.gov/pubmed/35317833 http://dx.doi.org/10.1186/s13015-022-00212-0 |
_version_ | 1784672699032797184 |
---|---|
author | Shibuya, Yoshihiro Belazzougui, Djamal Kucherov, Gregory |
author_facet | Shibuya, Yoshihiro Belazzougui, Djamal Kucherov, Gregory |
author_sort | Shibuya, Yoshihiro |
collection | PubMed |
description | MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s. |
format | Online Article Text |
id | pubmed-8939220 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-89392202022-03-23 Space-efficient representation of genomic k-mer count tables Shibuya, Yoshihiro Belazzougui, Djamal Kucherov, Gregory Algorithms Mol Biol Research MOTIVATION: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools produce in output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. RESULTS: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. Coli and C. Elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s. BioMed Central 2022-03-21 /pmc/articles/PMC8939220/ /pubmed/35317833 http://dx.doi.org/10.1186/s13015-022-00212-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Shibuya, Yoshihiro Belazzougui, Djamal Kucherov, Gregory Space-efficient representation of genomic k-mer count tables |
title | Space-efficient representation of genomic k-mer count tables |
title_full | Space-efficient representation of genomic k-mer count tables |
title_fullStr | Space-efficient representation of genomic k-mer count tables |
title_full_unstemmed | Space-efficient representation of genomic k-mer count tables |
title_short | Space-efficient representation of genomic k-mer count tables |
title_sort | space-efficient representation of genomic k-mer count tables |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8939220/ https://www.ncbi.nlm.nih.gov/pubmed/35317833 http://dx.doi.org/10.1186/s13015-022-00212-0 |
work_keys_str_mv | AT shibuyayoshihiro spaceefficientrepresentationofgenomickmercounttables AT belazzouguidjamal spaceefficientrepresentationofgenomickmercounttables AT kucherovgregory spaceefficientrepresentationofgenomickmercounttables |