
LazyB: fast and cheap genome assembly

BACKGROUND: Advances in genome sequencing in recent years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assem...


Bibliographic Details
Main authors: Gatter, Thomas; von Löhneysen, Sarah; Fallmann, Jörg; Drozdova, Polina; Hartmann, Tom; Stadler, Peter F.
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8168326/
https://www.ncbi.nlm.nih.gov/pubmed/34074310
http://dx.doi.org/10.1186/s13015-021-00186-5
author Gatter, Thomas
von Löhneysen, Sarah
Fallmann, Jörg
Drozdova, Polina
Hartmann, Tom
Stadler, Peter F.
collection PubMed
description BACKGROUND: Advances in genome sequencing in recent years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. RESULTS: LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. CONCLUSIONS: LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. AVAILABILITY: The LazyB prototype is available at https://github.com/TGatter/LazyB.
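The abstract mentions extracting contigs as maximum weight paths once the overlap graph has been reduced to a directed acyclic graph. The following is only a minimal sketch of that generic idea (a heaviest path in a weighted DAG, computed by dynamic programming over a topological order). It is not taken from the LazyB code base; the function name `max_weight_path`, the `(u, v, weight)` edge format, the read labels, and the assumption of non-negative weights are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for a heaviest path in a weighted DAG.

    `edges` is an iterable of (u, v, weight) triples with non-negative
    weights (an assumption of this sketch). Raises ValueError on a cycle.
    """
    succ = defaultdict(list)   # node -> list of (successor, weight)
    indeg = defaultdict(int)   # node -> number of incoming edges
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes with no remaining predecessors.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # best[n] is the weight of the heaviest path ending at n.
    best = {n: 0 for n in nodes}
    pred = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                pred[v] = u

    # Walk back from the best end node to recover the path.
    node = max(order, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return total, path[::-1]


# Hypothetical overlap graph: nodes are long reads, weights are overlap scores.
overlaps = [("r1", "r2", 5), ("r2", "r3", 4), ("r1", "r3", 6), ("r3", "r4", 2)]
print(max_weight_path(overlaps))  # (11, ['r1', 'r2', 'r3', 'r4'])
```

In this toy graph the direct edge from r1 to r3 is heavier than either single overlap on the detour through r2, but the detour accumulates more total weight, so the heaviest path visits all four reads.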
format Online
Article
Text
id pubmed-8168326
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8168326 2021-06-02 LazyB: fast and cheap genome assembly Gatter, Thomas von Löhneysen, Sarah Fallmann, Jörg Drozdova, Polina Hartmann, Tom Stadler, Peter F. Algorithms Mol Biol Research BioMed Central 2021-06-01 /pmc/articles/PMC8168326/ /pubmed/34074310 http://dx.doi.org/10.1186/s13015-021-00186-5 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title LazyB: fast and cheap genome assembly
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8168326/
https://www.ncbi.nlm.nih.gov/pubmed/34074310
http://dx.doi.org/10.1186/s13015-021-00186-5