HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

BACKGROUND: Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly id...

Bibliographic Details
Main authors: Solares, Edwin A., Tao, Yuan, Long, Anthony D., Gaut, Brandon S.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7788845/
https://www.ncbi.nlm.nih.gov/pubmed/33407090
http://dx.doi.org/10.1186/s12859-020-03939-y
author Solares, Edwin A.
Tao, Yuan
Long, Anthony D.
Gaut, Brandon S.
author_facet Solares, Edwin A.
Tao, Yuan
Long, Anthony D.
Gaut, Brandon S.
author_sort Solares, Edwin A.
collection PubMed
description BACKGROUND: Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analyses such as scaffolding.
RESULTS: Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single-copy BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera; 490 Mb), a mosquito (Anopheles funestus; 200 Mb) and the thorny skate (Amblyraja radiata; 2650 Mb).
CONCLUSIONS: HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for the mosquito's largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo.
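The optimization loop described in the abstract (score a candidate assembly from its BUSCO counts, then hill climb to lower the cost) can be illustrated with a minimal sketch. This is not HapSolo's code. The abstract does not give the cost formula, so the ratio used here is an assumption; the two threshold parameters and `toy_evaluate`, which stands in for purging contigs and running BUSCO, are hypothetical and produce invented counts for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    single: int
    duplicated: int
    missing: int


def cost(counts: BuscoCounts) -> float:
    """Assumed cost: missing and duplicated genes raise it, single-copy genes lower it."""
    return (counts.missing + counts.duplicated) / max(counts.single, 1)


def toy_evaluate(thresholds):
    """Hypothetical stand-in for 'purge contigs, then run BUSCO'. Not real data.

    Lenient purging (high thresholds) leaves duplicated genes; aggressive
    purging (low thresholds) removes needed contigs and loses genes.
    """
    strictness = sum(thresholds) / len(thresholds)
    duplicated = round(300 * strictness)
    missing = round(300 * (1 - strictness) ** 2)
    return BuscoCounts(single=1000 - duplicated - missing,
                       duplicated=duplicated, missing=missing)


def hill_climb(evaluate, start, step=0.05, iterations=1000, seed=0):
    """Random-neighbor hill climbing: keep a perturbed candidate only if it costs less."""
    rng = random.Random(seed)
    best = tuple(start)
    best_cost = cost(evaluate(best))
    for _ in range(iterations):
        # Perturb each threshold slightly and clamp it to [0, 1].
        candidate = tuple(min(1.0, max(0.0, x + rng.uniform(-step, step)))
                          for x in best)
        candidate_cost = cost(evaluate(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


start = (1.0, 1.0)  # hypothetical starting thresholds (no purging)
best_thresholds, best_cost = hill_climb(toy_evaluate, start)
print(best_thresholds, best_cost)
```

Because only strict improvements are accepted, the search can stall in a local minimum; rerunning from several random starting points is a common way to reduce that risk.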
format Online
Article
Text
id pubmed-7788845
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7788845 2021-01-07 HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding Solares, Edwin A. Tao, Yuan Long, Anthony D. Gaut, Brandon S. BMC Bioinformatics Methodology Article BACKGROUND: Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding. RESULTS: Here we illustrate a new method, which we call HapSolo, that identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies. We illustrate the performance of HapSolo on genome data from three species: the Chardonnay grape (Vitis vinifera), with a genome of 490 Mb, a mosquito (Anopheles funestus; 200 Mb) and the Thorny Skate (Amblyraja radiata; 2650 Mb). CONCLUSIONS: HapSolo rapidly identified candidate assemblies that yield improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified by downstream analyses, which we illustrated by scaffolding with Hi-C data. We found, for example, that prior to the application of HapSolo, only 52% of the Chardonnay genome was captured in the largest 19 scaffolds, corresponding to the number of chromosomes. After the application of HapSolo, this value increased to ~84%. The improvements for the mosquito’s largest three scaffolds, representing the number of chromosomes, were from 61 to 86%, and the improvement was even more pronounced for thorny skate. We compared the scaffolding results to assemblies that were based on PurgeDups for identifying secondary contigs, with generally superior results for HapSolo. BioMed Central 2021-01-06 /pmc/articles/PMC7788845/ /pubmed/33407090 http://dx.doi.org/10.1186/s12859-020-03939-y Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Methodology Article
Solares, Edwin A.
Tao, Yuan
Long, Anthony D.
Gaut, Brandon S.
HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title_full HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title_fullStr HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title_full_unstemmed HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title_short HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
title_sort hapsolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7788845/
https://www.ncbi.nlm.nih.gov/pubmed/33407090
http://dx.doi.org/10.1186/s12859-020-03939-y