GBParsy: A GenBank flatfile parser library with high speed

BACKGROUND: The GenBank flatfile (GBF) format is one of the most popular sequence file formats because of its detailed sequence features and ease of readability. For a computer to use the data in such a file, a parsing process is required and is performed according to a given grammar for the sequence and th...

Bibliographic Details
Main Authors: Lee, Tae-Ho; Kim, Yeon-Ki; Nahm, Baek Hie
Format: Text
Language: English
Published: BioMed Central 2008
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2516526/
https://www.ncbi.nlm.nih.gov/pubmed/18652706
http://dx.doi.org/10.1186/1471-2105-9-321
author Lee, Tae-Ho
Kim, Yeon-Ki
Nahm, Baek Hie
collection PubMed
description BACKGROUND: The GenBank flatfile (GBF) format is one of the most popular sequence file formats because of its detailed sequence features and ease of readability. For a computer to use the data in such a file, the file must be parsed according to the grammar that governs the sequence and its description in a GBF. Several parser libraries for the GBF have been developed. However, as DNA sequence information from eukaryotic chromosomes accumulates, parsing a eukaryotic genome sequence with these libraries inevitably takes a long time, because the GBF file is large, as are its genomic nucleotide sequence and related feature information. Thus, there is a significant need for a parsing program that is fast and uses system memory efficiently. RESULTS: We developed GBParsy, a library written in C that parses GBF files. Parsing speed was maximized by using content-specific functions in place of regular expressions, which are flexible but slow. In addition, we optimized a memory-related algorithm, which improved both parsing performance and memory efficiency. In benchmark tests, GBParsy was at least 5 times, and up to 100 times, faster than current parsers. CONCLUSION: GBParsy is estimated to extract annotated information from almost 100 Mb of a GenBank flatfile for chromosomal sequence information within a second. Thus, it is suited to a variety of applications, such as on-time visualization of a genome at a web site.
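The description contrasts regular expressions with functions written for specific content. As a rough illustration of that contrast only, the sketch below parses a GenBank LOCUS line both ways. It is written in Python for brevity, whereas GBParsy itself is a C library; the function names `parse_locus_regex` and `parse_locus_direct`, the assumed field layout, and the choice of the LOCUS line are all hypothetical and are not taken from GBParsy. The sketch shows the two styles producing the same result and makes no claim about relative speed.

```python
# Hypothetical illustration; not GBParsy's API (GBParsy is written in C).
import re

# General-purpose approach: one regular expression describes the line.
LOCUS_RE = re.compile(r"^LOCUS\s+(\S+)\s+(\d+)\s+bp\s+(\S+)")


def parse_locus_regex(line):
    """Return (name, length, molecule type) using a regular expression."""
    match = LOCUS_RE.match(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def parse_locus_direct(line):
    """Return the same tuple using string operations written for this line type."""
    if not line.startswith("LOCUS"):
        return None
    fields = line.split()
    # Assumed layout: LOCUS <name> <length> bp <molecule type> ...
    if len(fields) < 5 or fields[3] != "bp" or not fields[2].isdigit():
        return None
    return fields[1], int(fields[2]), fields[4]


# LOCUS line from NCBI's published sample GenBank record.
sample = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"

print(parse_locus_regex(sample))
print(parse_locus_direct(sample))
```

The direct version hard-codes what a LOCUS line looks like, so it is less flexible than the pattern, which is the trade-off the description refers to.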
format Text
id pubmed-2516526
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2516526 2008-08-15 GBParsy: A GenBank flatfile parser library with high speed Lee, Tae-Ho Kim, Yeon-Ki Nahm, Baek Hie BMC Bioinformatics Software BioMed Central 2008-07-25 /pmc/articles/PMC2516526/ /pubmed/18652706 http://dx.doi.org/10.1186/1471-2105-9-321 Text en Copyright © 2008 Lee et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title GBParsy: A GenBank flatfile parser library with high speed
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2516526/
https://www.ncbi.nlm.nih.gov/pubmed/18652706
http://dx.doi.org/10.1186/1471-2105-9-321