Cargando…

Compact and evenly distributed k-mer binning for genomic sequences

MOTIVATION: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers—ordered m-mers where m < k...

Descripción completa

Detalles Bibliográficos
Autores principales:	Nyström-Persson, Johan, Keeble-Gagnère, Gabriel, Zawad, Niamat
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2021
Materias:	Original Papers
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8428581/ https://www.ncbi.nlm.nih.gov/pubmed/33693556 http://dx.doi.org/10.1093/bioinformatics/btab156

_version_	1783750406504972288
author	Nyström-Persson, Johan Keeble-Gagnère, Gabriel Zawad, Niamat
author_facet	Nyström-Persson, Johan Keeble-Gagnère, Gabriel Zawad, Niamat
author_sort	Nyström-Persson, Johan
collection	PubMed
description	MOTIVATION: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers—ordered m-mers where m < k—are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored. RESULTS: We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available. AVAILABILITY AND IMPLEMENTATION: Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-8428581
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-84285812021-09-10 Compact and evenly distributed k-mer binning for genomic sequences Nyström-Persson, Johan Keeble-Gagnère, Gabriel Zawad, Niamat Bioinformatics Original Papers MOTIVATION: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers—ordered m-mers where m < k—are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored. RESULTS: We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available. AVAILABILITY AND IMPLEMENTATION: Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-03-08 /pmc/articles/PMC8428581/ /pubmed/33693556 http://dx.doi.org/10.1093/bioinformatics/btab156 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Nyström-Persson, Johan Keeble-Gagnère, Gabriel Zawad, Niamat Compact and evenly distributed k-mer binning for genomic sequences
title	Compact and evenly distributed k-mer binning for genomic sequences
title_full	Compact and evenly distributed k-mer binning for genomic sequences
title_fullStr	Compact and evenly distributed k-mer binning for genomic sequences
title_full_unstemmed	Compact and evenly distributed k-mer binning for genomic sequences
title_short	Compact and evenly distributed k-mer binning for genomic sequences
title_sort	compact and evenly distributed k-mer binning for genomic sequences
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8428581/ https://www.ncbi.nlm.nih.gov/pubmed/33693556 http://dx.doi.org/10.1093/bioinformatics/btab156
work_keys_str_mv	AT nystromperssonjohan compactandevenlydistributedkmerbinningforgenomicsequences AT keeblegagneregabriel compactandevenlydistributedkmerbinningforgenomicsequences AT zawadniamat compactandevenlydistributedkmerbinningforgenomicsequences

Compact and evenly distributed k-mer binning for genomic sequences

Ejemplares similares