LongStitch: high-quality genome assembly correction and scaffolding using long reads

BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through t...

Bibliographic Details
Main Authors: Coombe, Lauren; Li, Janet X.; Lo, Theodora; Wong, Johnathan; Nikolic, Vladimir; Warren, René L.; Birol, Inanc
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8557608/
https://www.ncbi.nlm.nih.gov/pubmed/34717540
http://dx.doi.org/10.1186/s12859-021-04451-7
author Coombe, Lauren
Li, Janet X.
Lo, Theodora
Wong, Johnathan
Nikolic, Vladimir
Warren, René L.
Birol, Inanc
collection PubMed
description BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads.
RESULTS: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM.
CONCLUSIONS: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04451-7.
format Online
Article
Text
id pubmed-8557608
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8557608 2021-11-03 BMC Bioinformatics Software BioMed Central 2021-10-30 /pmc/articles/PMC8557608/ /pubmed/34717540 http://dx.doi.org/10.1186/s12859-021-04451-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title LongStitch: high-quality genome assembly correction and scaffolding using long reads
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8557608/
https://www.ncbi.nlm.nih.gov/pubmed/34717540
http://dx.doi.org/10.1186/s12859-021-04451-7