Parallelized short read assembly of large genomes using de Bruijn graphs
BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to...
Main Authors: | Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/ https://www.ncbi.nlm.nih.gov/pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354 |
_version_ | 1782211289312919552 |
---|---|
author | Liu, Yongchao Schmidt, Bertil Maskell, Douglas L |
author_facet | Liu, Yongchao Schmidt, Bertil Maskell, Douglas L |
author_sort | Liu, Yongchao |
collection | PubMed |
description | BACKGROUND: Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads. RESULTS: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources. CONCLUSIONS: Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources. |
format | Online Article Text |
id | pubmed-3167803 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3167803 2011-09-07 Parallelized short read assembly of large genomes using de Bruijn graphs Liu, Yongchao Schmidt, Bertil Maskell, Douglas L BMC Bioinformatics Research Article BioMed Central 2011-08-25 /pmc/articles/PMC3167803/ /pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354 Text en Copyright ©2011 Liu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Liu, Yongchao Schmidt, Bertil Maskell, Douglas L Parallelized short read assembly of large genomes using de Bruijn graphs |
title | Parallelized short read assembly of large genomes using de Bruijn graphs |
title_full | Parallelized short read assembly of large genomes using de Bruijn graphs |
title_fullStr | Parallelized short read assembly of large genomes using de Bruijn graphs |
title_full_unstemmed | Parallelized short read assembly of large genomes using de Bruijn graphs |
title_short | Parallelized short read assembly of large genomes using de Bruijn graphs |
title_sort | parallelized short read assembly of large genomes using de bruijn graphs |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3167803/ https://www.ncbi.nlm.nih.gov/pubmed/21867511 http://dx.doi.org/10.1186/1471-2105-12-354 |
work_keys_str_mv | AT liuyongchao parallelizedshortreadassemblyoflargegenomesusingdebruijngraphs AT schmidtbertil parallelizedshortreadassemblyoflargegenomesusingdebruijngraphs AT maskelldouglasl parallelizedshortreadassemblyoflargegenomesusingdebruijngraphs |
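
The description above reports assembly quality as NG50 contig and scaffold sizes. As a reading aid, here is a minimal sketch of how an NG50 value is commonly computed: the largest length L such that contigs of length at least L together cover at least half of the reference genome size. The function name `ng50` and the toy numbers are illustrative assumptions; this is not code from PASHA and does not reproduce the evaluation behind the figures in the record.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of a set of contigs (hypothetical helper, not from PASHA).

    NG50 is the length of the contig at which the running total of contig
    lengths, taken from longest to shortest, first reaches half of the
    genome size. Returns 0 if the contigs cover less than half the genome.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Toy example: five contigs against an assumed 400 bp genome.
print(ng50([100, 80, 60, 40, 20], genome_size=400))  # prints 60
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so values from different assemblers on the same genome can be compared directly.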