
A comprehensive evaluation of assembly scaffolding tools

BACKGROUND: Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream...


Bibliographic Details
Main authors: Hunt, Martin; Newbold, Chris; Berriman, Matthew; Otto, Thomas D
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053845/
https://www.ncbi.nlm.nih.gov/pubmed/24581555
http://dx.doi.org/10.1186/gb-2014-15-3-r42
author Hunt, Martin
Newbold, Chris
Berriman, Matthew
Otto, Thomas D
collection PubMed
description BACKGROUND: Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics. RESULTS: Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behaviour of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remain unidentified when using real data. CONCLUSIONS: The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity.
format Online
Article
Text
id pubmed-4053845
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4053845 2014-06-12 A comprehensive evaluation of assembly scaffolding tools Hunt, Martin Newbold, Chris Berriman, Matthew Otto, Thomas D Genome Biol Research
BioMed Central 2014 2014-03-03 /pmc/articles/PMC4053845/ /pubmed/24581555 http://dx.doi.org/10.1186/gb-2014-15-3-r42 Text en Copyright © 2014 Hunt et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A comprehensive evaluation of assembly scaffolding tools
topic Research