The SAMBA tool uses long reads to improve the contiguity of genome assemblies

Bibliographic Details
Main authors: Zimin, Aleksey V., Salzberg, Steven L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8849508/
https://www.ncbi.nlm.nih.gov/pubmed/35120119
http://dx.doi.org/10.1371/journal.pcbi.1009860
collection PubMed
description Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.
id pubmed-8849508
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
Record: pubmed-8849508, 2022-02-17
Journal: PLoS Comput Biol
Article type: Research Article
Publication date: 2022-02-04
Copyright: © 2022 Zimin, Salzberg
License: This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article