SAGE: String-overlap Assembly of GEnomes

BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS: We present a n...

Bibliographic Details
Main Authors: Ilie, Lucian; Haider, Bahlul; Molnar, Michael; Solis-Oba, Roberto
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4174676/
https://www.ncbi.nlm.nih.gov/pubmed/25225118
http://dx.doi.org/10.1186/1471-2105-15-302
_version_ 1782336377392726016
author Ilie, Lucian
Haider, Bahlul
Molnar, Michael
Solis-Oba, Roberto
author_facet Ilie, Lucian
Haider, Bahlul
Molnar, Michael
Solis-Oba, Roberto
author_sort Ilie, Lucian
collection PubMed
description BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS: We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS: SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-302) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4174676
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-41746762014-09-26 SAGE: String-overlap Assembly of GEnomes Ilie, Lucian Haider, Bahlul Molnar, Michael Solis-Oba, Roberto BMC Bioinformatics Software BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS: We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS: SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-302) contains supplementary material, which is available to authorized users. BioMed Central 2014-09-15 /pmc/articles/PMC4174676/ /pubmed/25225118 http://dx.doi.org/10.1186/1471-2105-15-302 Text en © Ilie et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Ilie, Lucian
Haider, Bahlul
Molnar, Michael
Solis-Oba, Roberto
SAGE: String-overlap Assembly of GEnomes
title SAGE: String-overlap Assembly of GEnomes
title_full SAGE: String-overlap Assembly of GEnomes
title_fullStr SAGE: String-overlap Assembly of GEnomes
title_full_unstemmed SAGE: String-overlap Assembly of GEnomes
title_short SAGE: String-overlap Assembly of GEnomes
title_sort sage: string-overlap assembly of genomes
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4174676/
https://www.ncbi.nlm.nih.gov/pubmed/25225118
http://dx.doi.org/10.1186/1471-2105-15-302
work_keys_str_mv AT ilielucian sagestringoverlapassemblyofgenomes
AT haiderbahlul sagestringoverlapassemblyofgenomes
AT molnarmichael sagestringoverlapassemblyofgenomes
AT solisobaroberto sagestringoverlapassemblyofgenomes