LongStitch: high-quality genome assembly correction and scaffolding using long reads
BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through t...
Main Authors: | Coombe, Lauren; Li, Janet X.; Lo, Theodora; Wong, Johnathan; Nikolic, Vladimir; Warren, René L.; Birol, Inanc |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8557608/ https://www.ncbi.nlm.nih.gov/pubmed/34717540 http://dx.doi.org/10.1186/s12859-021-04451-7 |
_version_ | 1784592407284678656 |
---|---|
author | Coombe, Lauren; Li, Janet X.; Lo, Theodora; Wong, Johnathan; Nikolic, Vladimir; Warren, René L.; Birol, Inanc |
author_facet | Coombe, Lauren; Li, Janet X.; Lo, Theodora; Wong, Johnathan; Nikolic, Vladimir; Warren, René L.; Birol, Inanc |
author_sort | Coombe, Lauren |
collection | PubMed |
description | BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. RESULTS: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. CONCLUSIONS: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04451-7. |
format | Online Article Text |
id | pubmed-8557608 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8557608 2021-11-03 LongStitch: high-quality genome assembly correction and scaffolding using long reads Coombe, Lauren; Li, Janet X.; Lo, Theodora; Wong, Johnathan; Nikolic, Vladimir; Warren, René L.; Birol, Inanc BMC Bioinformatics Software BioMed Central 2021-10-30 /pmc/articles/PMC8557608/ /pubmed/34717540 http://dx.doi.org/10.1186/s12859-021-04451-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Software Coombe, Lauren Li, Janet X. Lo, Theodora Wong, Johnathan Nikolic, Vladimir Warren, René L. Birol, Inanc LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title | LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title_full | LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title_fullStr | LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title_full_unstemmed | LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title_short | LongStitch: high-quality genome assembly correction and scaffolding using long reads |
title_sort | longstitch: high-quality genome assembly correction and scaffolding using long reads |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8557608/ https://www.ncbi.nlm.nih.gov/pubmed/34717540 http://dx.doi.org/10.1186/s12859-021-04451-7 |
work_keys_str_mv | AT coombelauren longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT lijanetx longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT lotheodora longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT wongjohnathan longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT nikolicvladimir longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT warrenrenel longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads AT birolinanc longstitchhighqualitygenomeassemblycorrectionandscaffoldingusinglongreads |
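
The description above says that ntLink joins contigs using lightweight minimizer mappings. As a rough illustration of the general idea of minimizers only, the sketch below picks, for every window of `w` consecutive k-mers, the smallest k-mer. It is not ntLink's implementation: the function names `minimizers` and `shared_minimizers` are hypothetical, and plain lexicographic ordering is assumed here for readability, whereas practical tools typically order k-mers by hash value and handle both strands.

```python
from typing import List, Set, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) pairs, one per distinct window minimizer.

    Assumption for illustration: k-mers are ranked lexicographically,
    and ties are broken by taking the leftmost k-mer in the window.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost on ties.
        offset = min(range(w), key=lambda j: window[j])
        hit = (start + offset, window[offset])
        # Adjacent windows often select the same k-mer; record it once.
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


def shared_minimizers(a: str, b: str, k: int, w: int) -> Set[str]:
    """Minimizer k-mers present in both sequences (hypothetical helper)."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b


if __name__ == "__main__":
    read = "ACGTACGTGA"
    contig = "TTACGTACGTGACC"
    print(minimizers(read, k=3, w=3))
    print(shared_minimizers(read, contig, k=3, w=3))
```

Because only a subset of k-mers is kept per sequence, comparing minimizer sets is much cheaper than comparing every k-mer, which is the general reason minimizer-based mapping is described as lightweight.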