De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads

Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising s...

Bibliographic Details
Main Authors: Korlach, Jonas; Gedman, Gregory; Kingan, Sarah B.; Chin, Chen-Shan; Howard, Jason T.; Audet, Jean-Nicolas; Cantin, Lindsey; Jarvis, Erich D.
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5632298/
https://www.ncbi.nlm.nih.gov/pubmed/29020750
http://dx.doi.org/10.1093/gigascience/gix085
author Korlach, Jonas
Gedman, Gregory
Kingan, Sarah B.
Chin, Chen-Shan
Howard, Jason T.
Audet, Jean-Nicolas
Cantin, Lindsey
Jarvis, Erich D.
collection PubMed
description Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution.
format Online
Article
Text
id pubmed-5632298
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5632298 2017-10-12 Gigascience Research Oxford University Press 2017-08-28 /pmc/articles/PMC5632298/ /pubmed/29020750 http://dx.doi.org/10.1093/gigascience/gix085 Text en © The Authors 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5632298/
https://www.ncbi.nlm.nih.gov/pubmed/29020750
http://dx.doi.org/10.1093/gigascience/gix085