Parallelized short read assembly of large genomes using de Bruijn graphs

BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to...

Bibliographic Details
Main Authors:	Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/ https://www.ncbi.nlm.nih.gov/pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354

author	Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L
collection	PubMed
description	BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. RESULTS: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources. CONCLUSIONS: Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
format	Online Article Text
id	pubmed-3167803
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3167803 2011-09-07 Parallelized short read assembly of large genomes using de Bruijn graphs Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L BMC Bioinformatics Research Article
BioMed Central 2011-08-25 /pmc/articles/PMC3167803/ /pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354 Text en Copyright ©2011 Liu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Parallelized short read assembly of large genomes using de Bruijn graphs
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/ https://www.ncbi.nlm.nih.gov/pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354