Cargando…

Spaced Seed Data Structures for De Novo Assembly

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are...

Descripción completa

Detalles Bibliográficos
Autores principales:	Birol, Inanç, Chu, Justin, Mohamadi, Hamid, Jackman, Shaun D., Raghavan, Karthika, Vandervalk, Benjamin P., Raymond, Anthony, Warren, René L.
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Hindawi Publishing Corporation 2015
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619942/ https://www.ncbi.nlm.nih.gov/pubmed/26539459 http://dx.doi.org/10.1155/2015/196591

_version_	1782397218136784896
author	Birol, Inanç Chu, Justin Mohamadi, Hamid Jackman, Shaun D. Raghavan, Karthika Vandervalk, Benjamin P. Raymond, Anthony Warren, René L.
author_facet	Birol, Inanç Chu, Justin Mohamadi, Hamid Jackman, Shaun D. Raghavan, Karthika Vandervalk, Benjamin P. Raymond, Anthony Warren, René L.
author_sort	Birol, Inanç
collection	PubMed
description	De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
format	Online Article Text
id	pubmed-4619942
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Hindawi Publishing Corporation
record_format	MEDLINE/PubMed
spelling	pubmed-46199422015-11-04 Spaced Seed Data Structures for De Novo Assembly Birol, Inanç Chu, Justin Mohamadi, Hamid Jackman, Shaun D. Raghavan, Karthika Vandervalk, Benjamin P. Raymond, Anthony Warren, René L. Int J Genomics Research Article De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction. Hindawi Publishing Corporation 2015 2015-10-11 /pmc/articles/PMC4619942/ /pubmed/26539459 http://dx.doi.org/10.1155/2015/196591 Text en Copyright © 2015 Inanç Birol et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Birol, Inanç Chu, Justin Mohamadi, Hamid Jackman, Shaun D. Raghavan, Karthika Vandervalk, Benjamin P. Raymond, Anthony Warren, René L. Spaced Seed Data Structures for De Novo Assembly
title	Spaced Seed Data Structures for De Novo Assembly
title_full	Spaced Seed Data Structures for De Novo Assembly
title_fullStr	Spaced Seed Data Structures for De Novo Assembly
title_full_unstemmed	Spaced Seed Data Structures for De Novo Assembly
title_short	Spaced Seed Data Structures for De Novo Assembly
title_sort	spaced seed data structures for de novo assembly
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4619942/ https://www.ncbi.nlm.nih.gov/pubmed/26539459 http://dx.doi.org/10.1155/2015/196591
work_keys_str_mv	AT birolinanc spacedseeddatastructuresfordenovoassembly AT chujustin spacedseeddatastructuresfordenovoassembly AT mohamadihamid spacedseeddatastructuresfordenovoassembly AT jackmanshaund spacedseeddatastructuresfordenovoassembly AT raghavankarthika spacedseeddatastructuresfordenovoassembly AT vandervalkbenjaminp spacedseeddatastructuresfordenovoassembly AT raymondanthony spacedseeddatastructuresfordenovoassembly AT warrenrenel spacedseeddatastructuresfordenovoassembly

Spaced Seed Data Structures for De Novo Assembly

Ejemplares similares