Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult...

Bibliographic Details
Main authors: Zimin, Aleksey V., Puiu, Daniela, Luo, Ming-Cheng, Zhu, Tingting, Koren, Sergey, Marçais, Guillaume, Yorke, James A., Dvořák, Jan, Salzberg, Steven L.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411773/
https://www.ncbi.nlm.nih.gov/pubmed/28130360
http://dx.doi.org/10.1101/gr.213405.116
author Zimin, Aleksey V.
Puiu, Daniela
Luo, Ming-Cheng
Zhu, Tingting
Koren, Sergey
Marçais, Guillaume
Yorke, James A.
Dvořák, Jan
Salzberg, Steven L.
collection PubMed
description Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
format Online
Article
Text
id pubmed-5411773
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-5411773 2017-11-01 Genome Res Method
Cold Spring Harbor Laboratory Press 2017-05 /pmc/articles/PMC5411773/ /pubmed/28130360 http://dx.doi.org/10.1101/gr.213405.116 Text en © 2017 Zimin et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm
topic Method