
Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes

BACKGROUND: Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe...


Bibliographic Details
Main authors: Sahli, Mohammed; Shibuya, Tetsuo
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3441218/
https://www.ncbi.nlm.nih.gov/pubmed/22591859
http://dx.doi.org/10.1186/1756-0500-5-243
author Sahli, Mohammed
Shibuya, Tetsuo
collection PubMed
description BACKGROUND: Genome assembly is considered to be a challenging problem in computational biology, and has been studied extensively by many researchers. It is extremely difficult to build a general assembler that is able to reconstruct the original sequence instead of many contigs. However, we believe that creating specific assemblers, for solving specific cases, will be much more fruitful than creating general assemblers. FINDINGS: In this paper, we present Arapan-S, a whole-genome assembly program dedicated to handling small genomes. It provides only one contig (along with the reverse complement of this contig) in many cases. Although genomes consist of a number of segments, the implemented algorithm can detect all the segments, as we demonstrate for Influenza Virus A. The Arapan-S program is based on the de Bruijn graph. We have implemented a very sophisticated and fast method to reconstruct the original sequence and neglect erroneous k-mers. The method explores the graph by using neither the shortest nor the longest path, but rather a specific and reliable path based on the coverage level or k-mers’ lengths. Arapan-S uses short reads, and it was tested on raw data downloaded from the NCBI Trace Archive. CONCLUSIONS: Our findings show that the accuracy of the assembly was very high; the result was checked against the European Bioinformatics Institute (EBI) database using the NCBI BLAST Sequence Similarity Search. The identity and the genome coverage were both more than 99%. We also compared the efficiency of Arapan-S with other well-known assemblers. In dealing with small genomes, the accuracy of Arapan-S is significantly higher than the accuracy of other assemblers. The assembly process is very fast and requires only a few seconds. Arapan-S is available for free to the public. The binary files for Arapan-S are available through http://sourceforge.net/projects/dnascissor/files/.
format Online
Article
Text
id pubmed-3441218
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3441218 2012-09-18 Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes. Sahli, Mohammed; Shibuya, Tetsuo. BMC Res Notes. Technical Note. BioMed Central 2012-05-16 /pmc/articles/PMC3441218/ /pubmed/22591859 http://dx.doi.org/10.1186/1756-0500-5-243 Text en Copyright ©2012 Sahli and Shibuya; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes
topic Technical Note