
An exploration of assembly strategies and quality metrics on the accuracy of the rewarewa (Knightia excelsa) genome

We used long read sequencing data generated from Knightia excelsa, a nectar‐producing Proteaceae tree endemic to Aotearoa (New Zealand), to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome reconstruction. Establishing a high‐quality genome for...

Bibliographic Details
Main authors: McCartney, Ann M., Hilario, Elena, Choi, Seung‐Sub, Guhlin, Joseph, Prebble, Jessica M., Houliston, Gary, Buckley, Thomas R., Chagné, David
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8362059/
https://www.ncbi.nlm.nih.gov/pubmed/33955186
http://dx.doi.org/10.1111/1755-0998.13406
author McCartney, Ann M.
Hilario, Elena
Choi, Seung‐Sub
Guhlin, Joseph
Prebble, Jessica M.
Houliston, Gary
Buckley, Thomas R.
Chagné, David
collection PubMed
description We used long‐read sequencing data generated from Knightia excelsa, a nectar‐producing Proteaceae tree endemic to Aotearoa (New Zealand), to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome reconstruction. Establishing a high‐quality genome for this species has specific cultural importance to Māori and commercial importance to honey producers in Aotearoa. Assemblies were produced by five long‐read assemblers using data subsampled based on read lengths, two polishing strategies and two Hi‐C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Subsampling highlighted that input data with longer read lengths but perhaps lower coverage produced more contiguous assemblies with higher k‐mer and gene completeness than input data with shorter read lengths and higher coverage. The final genome assembly was constructed into 14 pseudochromosomes using an initial Flye long‐read assembly, a combined Racon/Medaka/Pilon polishing strategy, SALSA2 and ALLHiC scaffolding, Juicebox curation, and Macadamia linkage map validation. We highlighted the importance of developing assembly workflows based on the volume and read length of sequencing data and established a robust set of quality metrics for generating high‐quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by Hi‐C data and that assembly scaffolding was more successful when the underlying contig assembly was of higher accuracy. These findings provide insight into how quality assessment tools can be implemented throughout genome assembly pipelines to inform the de novo reconstruction of a high‐quality genome assembly for nonmodel organisms.
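The trade‐off the description draws between read length and coverage can be illustrated with a minimal Python sketch. It is not the authors' workflow: the function names, the toy read lengths and the toy genome size are all hypothetical, and real pipelines would operate on FASTQ files rather than on lists of lengths. The sketch filters reads by a minimum length, then reports the remaining depth of coverage and the read N50.

```python
from typing import Iterable, List


def subsample_by_length(read_lengths: Iterable[int], min_length: int) -> List[int]:
    """Keep only reads that are at least min_length bases long."""
    return [n for n in read_lengths if n >= min_length]


def coverage(read_lengths: Iterable[int], genome_size: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return sum(read_lengths) / genome_size


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half:
            return n
    return 0  # empty input


# Hypothetical toy data, not values from the study.
reads = [2_000, 5_000, 8_000, 12_000, 20_000, 30_000]
genome_size = 10_000  # toy genome size in bases

for cutoff in (0, 10_000):
    subset = subsample_by_length(reads, cutoff)
    print(
        f"min length {cutoff}: {len(subset)} reads, "
        f"coverage {coverage(subset, genome_size):.1f}x, N50 {n50(subset)}"
    )
```

Raising the length cutoff discards short reads, so coverage drops while the retained reads are longer on average; the same `n50` helper could be applied to contig lengths to compare the contiguity of the resulting assemblies.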
format Online
Article
Text
id pubmed-8362059
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher John Wiley and Sons Inc.
record_format MEDLINE/PubMed
spelling pubmed-8362059 2021-08-17 Mol Ecol Resour RESOURCE ARTICLES John Wiley and Sons Inc. 2021-06-19 2021-08 /pmc/articles/PMC8362059/ /pubmed/33955186 http://dx.doi.org/10.1111/1755-0998.13406 Text en © 2021 The Authors. Molecular Ecology Resources published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
title An exploration of assembly strategies and quality metrics on the accuracy of the rewarewa (Knightia excelsa) genome
topic RESOURCE ARTICLES