Identifying and removing haplotypic duplication in primary genome assemblies

MOTIVATION: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than...

Bibliographic Details
Main authors: Guan, Dengfeng; McCarthy, Shane A; Wood, Jonathan; Howe, Kerstin; Wang, Yadong; Durbin, Richard
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7203741/
https://www.ncbi.nlm.nih.gov/pubmed/31971576
http://dx.doi.org/10.1093/bioinformatics/btaa025
author Guan, Dengfeng
McCarthy, Shane A
Wood, Jonathan
Howe, Kerstin
Wang, Yadong
Durbin, Richard
collection PubMed
description MOTIVATION: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. RESULTS: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. AVAILABILITY AND IMPLEMENTATION: The source code is written in C and is available at https://github.com/dfguan/purge_dups. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7203741
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7203741 2020-05-11 Identifying and removing haplotypic duplication in primary genome assemblies Guan, Dengfeng; McCarthy, Shane A; Wood, Jonathan; Howe, Kerstin; Wang, Yadong; Durbin, Richard Bioinformatics Applications Notes Oxford University Press 2020-05-01 2020-01-23 /pmc/articles/PMC7203741/ /pubmed/31971576 http://dx.doi.org/10.1093/bioinformatics/btaa025 Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Identifying and removing haplotypic duplication in primary genome assemblies
topic Applications Notes