Cargando…

Compact representation of k-mer de Bruijn graphs for genome read assembly

BACKGROUND: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated b...

Descripción completa

Detalles Bibliográficos
Autor principal:	Rødland, Einar Andreas
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2013
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015147/ https://www.ncbi.nlm.nih.gov/pubmed/24152242 http://dx.doi.org/10.1186/1471-2105-14-313

_version_	1782315287894294528
author	Rødland, Einar Andreas
author_facet	Rødland, Einar Andreas
author_sort	Rødland, Einar Andreas
collection	PubMed
description	BACKGROUND: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory efficient data structure. RESULTS: The FM-index, which is based on the Burrows–Wheeler transform, provides an efficient data structure providing a searchable index of all substrings from a set of strings, and is used to compactly represent full genomes for use in mapping reads to a genome: the memory required to store this is in the same order of magnitude as the strings themselves. However, reads from high throughput sequences mostly have high coverage and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bit of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each different k−1-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently. CONCLUSIONS: The kFM-index could replace more memory demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).
format	Online Article Text
id	pubmed-4015147
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-40151472014-05-23 Compact representation of k-mer de Bruijn graphs for genome read assembly Rødland, Einar Andreas BMC Bioinformatics Methodology Article BACKGROUND: Processing of reads from high throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory efficient data structure. RESULTS: The FM-index, which is based on the Burrows–Wheeler transform, provides an efficient data structure providing a searchable index of all substrings from a set of strings, and is used to compactly represent full genomes for use in mapping reads to a genome: the memory required to store this is in the same order of magnitude as the strings themselves. However, reads from high throughput sequences mostly have high coverage and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bit of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each different k−1-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently. CONCLUSIONS: The kFM-index could replace more memory demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/). BioMed Central 2013-10-23 /pmc/articles/PMC4015147/ /pubmed/24152242 http://dx.doi.org/10.1186/1471-2105-14-313 Text en Copyright © 2013 Rødland; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Rødland, Einar Andreas Compact representation of k-mer de Bruijn graphs for genome read assembly
title	Compact representation of k-mer de Bruijn graphs for genome read assembly
title_full	Compact representation of k-mer de Bruijn graphs for genome read assembly
title_fullStr	Compact representation of k-mer de Bruijn graphs for genome read assembly
title_full_unstemmed	Compact representation of k-mer de Bruijn graphs for genome read assembly
title_short	Compact representation of k-mer de Bruijn graphs for genome read assembly
title_sort	compact representation of k-mer de bruijn graphs for genome read assembly
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015147/ https://www.ncbi.nlm.nih.gov/pubmed/24152242 http://dx.doi.org/10.1186/1471-2105-14-313
work_keys_str_mv	AT rødlandeinarandreas compactrepresentationofkmerdebruijngraphsforgenomereadassembly

Compact representation of k-mer de Bruijn graphs for genome read assembly

Ejemplares similares