
DENTIST—using long reads for closing assembly gaps at high accuracy

BACKGROUND: Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read–based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that doe...


Bibliographic Details
Main authors: Ludwig, Arne; Pippel, Martin; Myers, Gene; Hiller, Michael
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8848313/
https://www.ncbi.nlm.nih.gov/pubmed/35077539
http://dx.doi.org/10.1093/gigascience/giab100
author Ludwig, Arne
Pippel, Martin
Myers, Gene
Hiller, Michael
collection PubMed
description BACKGROUND: Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read–based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence. FINDINGS: Here, we present DENTIST, a sensitive, highly accurate, and automated pipeline method to close gaps in short-read assemblies with long error-prone reads. DENTIST comprehensively determines repetitive assembly regions to identify reliable and unambiguous alignments of long reads to the correct loci, integrates a consensus sequence computation step to obtain a high base accuracy for the inserted sequence, and validates the accuracy of closed gaps. Unlike previous benchmarks, we generated test assemblies that have gaps at the exact positions where real short-read assemblies have gaps. Generating such realistic benchmarks for Drosophila (134 Mb genome), Arabidopsis (119 Mb), hummingbird (1 Gb), and human (3 Gb) and using simulated or real PacBio continuous long reads, we show that DENTIST consistently achieves a substantially higher accuracy compared to previous methods, while having a similar sensitivity. CONCLUSION: DENTIST provides an accurate approach to improve the contiguity and completeness of fragmented assemblies with long reads. DENTIST's source code including a Snakemake workflow, conda package, and Docker container is available at https://github.com/a-ludi/dentist. All test assemblies as a resource for future benchmarking are at https://bds.mpi-cbg.de/hillerlab/DENTIST/.
format Online
Article
Text
id pubmed-8848313
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8848313 2022-02-17 DENTIST—using long reads for closing assembly gaps at high accuracy. Ludwig, Arne; Pippel, Martin; Myers, Gene; Hiller, Michael. Gigascience. Technical Note. Oxford University Press 2022-01-25. /pmc/articles/PMC8848313/ /pubmed/35077539 http://dx.doi.org/10.1093/gigascience/giab100 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title DENTIST—using long reads for closing assembly gaps at high accuracy
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8848313/
https://www.ncbi.nlm.nih.gov/pubmed/35077539
http://dx.doi.org/10.1093/gigascience/giab100