
Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

BACKGROUND: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in chloroplast genomes is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome co...


Bibliographic Details
Main authors: Wang, Weiwen; Schalamun, Miriam; Morales-Suarez, Alejandro; Kainer, David; Schwessinger, Benjamin; Lanfear, Robert
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311037/
https://www.ncbi.nlm.nih.gov/pubmed/30594129
http://dx.doi.org/10.1186/s12864-018-5348-8
author Wang, Weiwen
Schalamun, Miriam
Morales-Suarez, Alejandro
Kainer, David
Schwessinger, Benjamin
Lanfear, Robert
collection PubMed
description BACKGROUND: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in chloroplast genomes is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure of two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need to make this assumption by providing sufficient information to completely span the inverted repeat regions. Yet long reads tend to have higher error rates than short reads, and relatively little is known about the best way to combine long and short reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, including different coverage of long (Oxford Nanopore) and short (Illumina) reads, different long-read lengths, and different assembly pipelines, with a view to determining the most accurate and efficient approach to chloroplast genome assembly.
RESULTS: Hybrid assemblies combining at least 20x coverage of both long reads and short reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs of the chloroplast genome (the long single-copy, short single-copy and inverted repeat regions). These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function.
CONCLUSIONS: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using a combination of at least 20x coverage each of long and short reads, provided that the long reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-018-5348-8) contains supplementary material, which is available to authorized users.
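The coverage thresholds stated in the conclusions can be checked against a read set with simple arithmetic: mean coverage is the total number of sequenced bases divided by the genome size. The following is a minimal sketch of that check, not the authors' pipeline. The function names, the example read lengths, and the 26,000 bp inverted repeat length are hypothetical illustrations; only the 159,942 bp genome size and the 20x and ~5x thresholds come from the abstract.

```python
from typing import Iterable, Sequence

# Genome size of the Eucalyptus pauciflora chloroplast, as given in the abstract.
GENOME_SIZE_BP = 159_942


def coverage(read_lengths: Iterable[int], genome_size: int = GENOME_SIZE_BP) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


def spanning_coverage(
    read_lengths: Iterable[int],
    repeat_length: int,
    genome_size: int = GENOME_SIZE_BP,
) -> float:
    """Coverage contributed only by reads longer than the inverted repeat."""
    return coverage((n for n in read_lengths if n > repeat_length), genome_size)


def meets_thresholds(
    long_reads: Sequence[int],
    short_reads: Sequence[int],
    repeat_length: int,
    min_cov: float = 20.0,       # "at least 20x coverage" of each read type
    min_span_cov: float = 5.0,   # "at least ~5x coverage" of repeat-spanning reads
    genome_size: int = GENOME_SIZE_BP,
) -> bool:
    """True if the read set satisfies the thresholds suggested in the abstract."""
    return (
        coverage(long_reads, genome_size) >= min_cov
        and coverage(short_reads, genome_size) >= min_cov
        and spanning_coverage(long_reads, repeat_length, genome_size) >= min_span_cov
    )


if __name__ == "__main__":
    # Hypothetical read sets, chosen only to illustrate the arithmetic.
    assumed_repeat_bp = 26_000                    # assumption, within the 10-30 kb range
    long_reads = [30_000] * 40 + [8_000] * 300    # 3.6 Mb of long reads
    short_reads = [150] * 25_000                  # 3.75 Mb of short reads

    print(f"long-read coverage:  {coverage(long_reads):.1f}x")
    print(f"short-read coverage: {coverage(short_reads):.1f}x")
    print(f"spanning coverage:   {spanning_coverage(long_reads, assumed_repeat_bp):.1f}x")
    print("meets thresholds:", meets_thresholds(long_reads, short_reads, assumed_repeat_bp))
```

In practice the inverted repeat length would have to be estimated for the species at hand, for example from a related reference genome, before the spanning-coverage check is meaningful.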
format Online
Article
Text
id pubmed-6311037
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6311037 2019-01-07 Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case Wang, Weiwen Schalamun, Miriam Morales-Suarez, Alejandro Kainer, David Schwessinger, Benjamin Lanfear, Robert BMC Genomics Methodology Article BioMed Central 2018-12-29 /pmc/articles/PMC6311037/ /pubmed/30594129 http://dx.doi.org/10.1186/s12864-018-5348-8 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311037/
https://www.ncbi.nlm.nih.gov/pubmed/30594129
http://dx.doi.org/10.1186/s12864-018-5348-8