
Genome-scale de novo assembly using ALGA

Bibliographic Details
Main authors: Swat, Sylwester; Laskowski, Artur; Badura, Jan; Frohmberg, Wojciech; Wojciechowski, Pawel; Swiercz, Aleksandra; Kasprzak, Marta; Blazewicz, Jacek
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289375/
https://www.ncbi.nlm.nih.gov/pubmed/33471088
http://dx.doi.org/10.1093/bioinformatics/btab005
author Swat, Sylwester
Laskowski, Artur
Badura, Jan
Frohmberg, Wojciech
Wojciechowski, Pawel
Swiercz, Aleksandra
Kasprzak, Marta
Blazewicz, Jacek
collection PubMed
description MOTIVATION: Very few methods for de novo genome assembly are based on the overlap graph approach. This approach is considered to give more exact results than the de Bruijn graph approach, but at the cost of much longer running time and much higher memory usage. Assembly methods built on the overlap graph model often cannot process larger datasets, mainly because of the memory limits of the computer. For this reason, the assembly methods developed in recent decades have been mainly de Bruijn-based, which are fast and fairly accurate. However, these methods can fail for longer or more repetitive genomes, because they decompose reads into shorter fragments and lose part of the information. An efficient assembler that processes big datasets using the overlap graph model is still being sought.
RESULTS: We propose a new genome-scale de novo assembler based on the overlap graph approach, designed for short-read sequencing data. The method, ALGA, incorporates several new ideas that result in more exact contigs produced in a short time. These ideas include the creation of a sparse but quite informative graph, a reduction of the graph that includes a procedure related to the minimum spanning tree problem on a local subgraph, and a graph traversal combined with simultaneous analysis of the contigs stored so far. Unusually for genome assembly, the algorithm is almost parameter-free, with only one optional parameter to be set by the user. ALGA was compared with nine state-of-the-art assemblers in tests on genome-scale sequencing data obtained from real experiments on six organisms that differ in size, coverage, GC content and repetition rate. ALGA produced the best results in terms of overall quality of genome reconstruction, understood as a good balance between genome coverage, accuracy and length of the resulting sequences. The algorithm is one of the tools used to process data in the ongoing national project Genomic Map of Poland.
AVAILABILITY AND IMPLEMENTATION: ALGA is available at http://alga.put.poznan.pl.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8289375
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8289375 2021-07-20 Genome-scale de novo assembly using ALGA Swat, Sylwester Laskowski, Artur Badura, Jan Frohmberg, Wojciech Wojciechowski, Pawel Swiercz, Aleksandra Kasprzak, Marta Blazewicz, Jacek Bioinformatics Original Papers Oxford University Press 2021-01-20 /pmc/articles/PMC8289375/ /pubmed/33471088 http://dx.doi.org/10.1093/bioinformatics/btab005 Text en © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Genome-scale de novo assembly using ALGA
topic Original Papers