
Parallelized short read assembly of large genomes using de Bruijn graphs

BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to...


Bibliographic Details
Main Authors: Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/
https://www.ncbi.nlm.nih.gov/pubmed/21867511
http://dx.doi.org/10.1186/1471-2105-12-354
author Liu, Yongchao
Schmidt, Bertil
Maskell, Douglas L
collection PubMed
description BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. RESULTS: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources. CONCLUSIONS: Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
format Online
Article
Text
id pubmed-3167803
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-31678032011-09-07 Parallelized short read assembly of large genomes using de Bruijn graphs Liu, Yongchao Schmidt, Bertil Maskell, Douglas L BMC Bioinformatics Research Article
BioMed Central 2011-08-25 /pmc/articles/PMC3167803/ /pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354 Text en Copyright ©2011 Liu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Parallelized short read assembly of large genomes using de Bruijn graphs
topic Research Article