Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies

BACKGROUND: Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high...

Bibliographic Details
Main authors: Roach, Michael J., Schmidt, Simon A., Borneman, Anthony R.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267036/
https://www.ncbi.nlm.nih.gov/pubmed/30497373
http://dx.doi.org/10.1186/s12859-018-2485-7
_version_ 1783375974453215232
author Roach, Michael J.
Schmidt, Simon A.
Borneman, Anthony R.
author_facet Roach, Michael J.
Schmidt, Simon A.
Borneman, Anthony R.
author_sort Roach, Michael J.
collection PubMed
description BACKGROUND: Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs. RESULTS: A new pipeline—Purge Haplotigs—was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs. CONCLUSIONS: Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2485-7) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6267036
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6267036 2018-12-05 Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies Roach, Michael J. Schmidt, Simon A. Borneman, Anthony R. BMC Bioinformatics Software BioMed Central 2018-11-29 /pmc/articles/PMC6267036/ /pubmed/30497373 http://dx.doi.org/10.1186/s12859-018-2485-7 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Roach, Michael J.
Schmidt, Simon A.
Borneman, Anthony R.
Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_full Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_fullStr Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_full_unstemmed Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_short Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_sort purge haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6267036/
https://www.ncbi.nlm.nih.gov/pubmed/30497373
http://dx.doi.org/10.1186/s12859-018-2485-7
work_keys_str_mv AT roachmichaelj purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies
AT schmidtsimona purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies
AT bornemananthonyr purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies