
HINGE: long-read assembly achieves optimal repeat resolution

Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecess...


Bibliographic Details
Main authors: Kamath, Govinda M.; Shomorony, Ilan; Xia, Fei; Courtade, Thomas A.; Tse, David N.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2017
Subjects: Method
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411769/
https://www.ncbi.nlm.nih.gov/pubmed/28320918
http://dx.doi.org/10.1101/gr.216465.116
author Kamath, Govinda M.
Shomorony, Ilan
Xia, Fei
Courtade, Thomas A.
Tse, David N.
collection PubMed
description Long-read sequencing technologies have the potential to produce gold-standard de novo genome assemblies, but fully exploiting error-prone reads to resolve repeats remains a challenge. Aggressive approaches to repeat resolution often produce misassemblies, and conservative approaches lead to unnecessary fragmentation. We present HINGE, an assembler that seeks to achieve optimal repeat resolution by distinguishing repeats that can be resolved given the data from those that cannot. This is accomplished by adding “hinges” to reads for constructing an overlap graph where only unresolvable repeats are merged. As a result, HINGE combines the error resilience of overlap-based assemblers with repeat-resolution capabilities of de Bruijn graph assemblers. HINGE was evaluated on the long-read bacterial data sets from the NCTC project. HINGE produces more finished assemblies than Miniasm and the manual pipeline of NCTC based on the HGAP assembler and Circlator. HINGE also allows us to identify 40 data sets where unresolvable repeats prevent the reliable construction of a unique finished assembly. In these cases, HINGE outputs a visually interpretable assembly graph that encodes all possible finished assemblies consistent with the reads, while other approaches such as the NCTC pipeline and FALCON either fragment the assembly or resolve the ambiguity arbitrarily.
format Online
Article
Text
id pubmed-5411769
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-5411769 2017-05-16 Genome Res Method Cold Spring Harbor Laboratory Press 2017-05 /pmc/articles/PMC5411769/ /pubmed/28320918 http://dx.doi.org/10.1101/gr.216465.116 Text en © 2017 Kamath et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
title HINGE: long-read assembly achieves optimal repeat resolution
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411769/
https://www.ncbi.nlm.nih.gov/pubmed/28320918
http://dx.doi.org/10.1101/gr.216465.116