Processing genome scale tabular data with wormtable

BACKGROUND: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermo...

Bibliographic Details
Main Authors: Kelleher, Jerome; Ness, Rob W; Halligan, Daniel L
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4234461/
https://www.ncbi.nlm.nih.gov/pubmed/24308302
http://dx.doi.org/10.1186/1471-2105-14-356
author Kelleher, Jerome
Ness, Rob W
Halligan, Daniel L
collection PubMed
description BACKGROUND: Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found. RESULTS: We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats. CONCLUSIONS: Wormtable’s simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.
format Online
Article
Text
id pubmed-4234461
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4234461 2014-11-18 Processing genome scale tabular data with wormtable Kelleher, Jerome Ness, Rob W Halligan, Daniel L BMC Bioinformatics Software BioMed Central 2013-12-05 /pmc/articles/PMC4234461/ /pubmed/24308302 http://dx.doi.org/10.1186/1471-2105-14-356 Text en Copyright © 2013 Kelleher et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Processing genome scale tabular data with wormtable
topic Software
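
The description in this record contrasts line-by-line parsing of text tables with a compact binary format that offers random access to rows and indexing on columns. The following is a minimal sketch of that general idea using only the Python standard library. It is not wormtable's API or file format: the row layout, the function names, and the demo data are all hypothetical, chosen only to illustrate fixed-width records, seeking to a row by number, and a sorted column index.

```python
import bisect
import os
import struct
import tempfile

# Hypothetical fixed-width row layout (an assumption for illustration only):
# chromosome id (uint8), position (uint32), quality (float32), little-endian.
ROW = struct.Struct("<BIf")


def write_table(path, rows):
    """Write each row as one fixed-width binary record."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(ROW.pack(*row))


def read_row(f, row_id):
    """Random access: seek directly to a row by its number, no text parsing."""
    f.seek(row_id * ROW.size)
    return ROW.unpack(f.read(ROW.size))


def build_index(path, column):
    """Return a sorted list of (value, row_id) pairs for one column."""
    entries = []
    with open(path, "rb") as f:
        row_id = 0
        while True:
            chunk = f.read(ROW.size)
            if len(chunk) < ROW.size:
                break
            entries.append((ROW.unpack(chunk)[column], row_id))
            row_id += 1
    entries.sort()
    return entries


def rows_in_range(path, index, low, high):
    """Fetch rows whose indexed value v satisfies low <= v < high."""
    # A one-element tuple sorts before any (value, row_id) with the same value,
    # so bisect_left finds the first entry with value >= the bound.
    start = bisect.bisect_left(index, (low,))
    stop = bisect.bisect_left(index, (high,))
    with open(path, "rb") as f:
        return [read_row(f, row_id) for _, row_id in index[start:stop]]


# Demo with made-up rows: (chromosome, position, quality).
demo_rows = [(1, 500, 30.0), (1, 100, 12.5), (2, 300, 45.0), (1, 200, 60.0)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    write_table(path, demo_rows)
    pos_index = build_index(path, column=1)  # index on the position column
    hits = rows_in_range(path, pos_index, 150, 400)
print(hits)
```

Because every record has the same size, finding row n is a single seek, and the sorted index lets a range query touch only the matching rows instead of scanning the whole file. A real implementation would also need to handle variable-length values and persist the index on disk.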