Lossless indexing with counting de Bruijn graphs

Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored)...

Bibliographic Details
Main Authors:	Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory Press 2022
Subjects:	RECOMB 2022 Special/Methods
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9528980/ https://www.ncbi.nlm.nih.gov/pubmed/35609994 http://dx.doi.org/10.1101/gr.276607.122

author	Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André
author_sort	Karasikov, Mikhail
collection	PubMed
description	Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node–label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
format	Online Article Text
id	pubmed-9528980
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Cold Spring Harbor Laboratory Press
record_format	MEDLINE/PubMed
spelling	pubmed-9528980 2023-03-01 Lossless indexing with counting de Bruijn graphs Karasikov, Mikhail; Mustafa, Harun; Rätsch, Gunnar; Kahles, André Genome Res RECOMB 2022 Special/Methods Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node–label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and are faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes. Cold Spring Harbor Laboratory Press 2022-09 /pmc/articles/PMC9528980/ /pubmed/35609994 http://dx.doi.org/10.1101/gr.276607.122 Text en © 2022 Karasikov et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title	Lossless indexing with counting de Bruijn graphs
topic	RECOMB 2022 Special/Methods
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9528980/ https://www.ncbi.nlm.nih.gov/pubmed/35609994 http://dx.doi.org/10.1101/gr.276607.122
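
The description above says that a counting de Bruijn graph supplements each node–label relation with attributes such as a k-mer count or its positions. The toy sketch below illustrates that idea only; it is not the authors' implementation and uses plain dictionaries rather than any compressed representation. All names (`kmers`, `build_index`, `counts`, `reconstruct`, the sample labels) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, k-mer) pairs for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(labeled_seqs, k):
    """Map each k-mer (node) to {label: [positions]}.

    The count of a k-mer under a label is the length of its position list.
    """
    index = defaultdict(lambda: defaultdict(list))
    for label, seq in labeled_seqs.items():
        for pos, kmer in kmers(seq, k):
            index[kmer][label].append(pos)
    return index


def counts(index, kmer):
    """Return {label: count} for one k-mer (empty if the k-mer is absent)."""
    return {label: len(pos) for label, pos in index.get(kmer, {}).items()}


def reconstruct(index, label):
    """Rebuild the sequence stored under label from positional annotations."""
    by_pos = {
        pos: kmer
        for kmer, labels in index.items()
        for pos in labels.get(label, [])
    }
    if not by_pos:
        return ""
    seq = by_pos[0]
    # Each subsequent k-mer overlaps the previous one by k-1 characters,
    # so it contributes only its last character.
    for pos in range(1, len(by_pos)):
        seq += by_pos[pos][-1]
    return seq


index = build_index({"sample_A": "ACGTACGT", "sample_B": "CGTAC"}, k=3)
print(counts(index, "CGT"))
print(reconstruct(index, "sample_A"))
```

Storing positions rather than only counts is what makes the toy index lossless: every input sequence of length at least k can be read back from the annotations, whereas counts alone cannot recover the order of the k-mers.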
