
DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existin...


Bibliographic Details
Main authors: Ye, Chengxi, Hill, Christopher M., Wu, Shigang, Ruan, Jue, Ma, Zhanshan (Sam)
Format: Online Article Text
Language: English
Published: Nature Publishing Group 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5004134/
https://www.ncbi.nlm.nih.gov/pubmed/27573208
http://dx.doi.org/10.1038/srep31900
collection PubMed
description The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
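The first design principle in the description (a compact representation of the long reads, derived from preassembled NGS contigs) can be illustrated with a small sketch. This is not the DBG2OLC implementation: the function names, the unique k-mer index, the `min_hits` threshold, and the error-free toy reads are all assumptions made for illustration. Each read is reduced to the ordered list of contig identifiers it touches, and overlaps are then sought between these short lists rather than between raw base sequences.

```python
from collections import Counter


def build_kmer_index(contigs, k):
    """Map each k-mer to its contig id, dropping k-mers shared by several contigs."""
    index = {}
    ambiguous = set()
    for cid, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index and index[kmer] != cid:
                ambiguous.add(kmer)
            index[kmer] = cid
    for kmer in ambiguous:
        del index[kmer]
    return index


def compress_read(read, index, k, min_hits=2):
    """Represent a read as the ordered list of contig ids it touches.

    Contigs supported by fewer than min_hits k-mers are ignored, and
    consecutive repeats of the same contig id are collapsed.
    """
    hits = []
    for i in range(len(read) - k + 1):
        cid = index.get(read[i:i + k])
        if cid is not None:
            hits.append(cid)
    counts = Counter(hits)
    compressed = []
    for cid in hits:
        if counts[cid] >= min_hits and (not compressed or compressed[-1] != cid):
            compressed.append(cid)
    return compressed


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of list a that equals a prefix of list b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


# Toy data (hypothetical): three short contigs and two reads spanning them.
contigs = {"c1": "ACGTACGA", "c2": "TTGGCCAA", "c3": "GAGAGTCT"}
index = build_kmer_index(contigs, k=4)
read_a = compress_read(contigs["c1"] + contigs["c2"], index, k=4)
read_b = compress_read(contigs["c2"] + contigs["c3"], index, k=4)
print(read_a, read_b, suffix_prefix_overlap(read_a, read_b))
```

In this toy case the two reads share one contig identifier at their ends, so the overlap is found by comparing two-element lists instead of aligning 16-base sequences. Real long reads contain errors, so a practical version would need tolerant matching and the structural-error checks the description mentions.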
id pubmed-5004134
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5004134 2016-09-07 Sci Rep Article Nature Publishing Group 2016-08-30 /pmc/articles/PMC5004134/ /pubmed/27573208 http://dx.doi.org/10.1038/srep31900 Text en
Copyright © 2016, The Author(s)
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
topic Article