
Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, their format is often the first obstacle. Lack of standardized ways of exploring different data layouts requires an effort each time to solve the problem from scratch. Possibility...


Bibliographic Details
Main Author:	Adaszewski, Stanisław
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4113380/ https://www.ncbi.nlm.nih.gov/pubmed/25068261 http://dx.doi.org/10.1371/journal.pone.0103319

_version_	1782328286045536256
author	Adaszewski, Stanisław
collection	PubMed
description	Volumes of data used in science and industry are growing rapidly. When researchers face the challenge of analyzing them, the data format is often the first obstacle. The lack of standardized ways of exploring different data layouts means the problem must be solved from scratch each time. The ability to access data in a rich, uniform manner, e.g. using Structured Query Language (SQL), would offer expressiveness and user-friendliness. Comma-separated values (CSV) is one of the most common data storage formats. Despite its simplicity, handling it becomes non-trivial as file size grows. Importing CSVs into existing databases is time-consuming and troublesome, or even impossible if the horizontal dimension reaches thousands of columns. Most databases are optimized for handling large numbers of rows rather than columns; therefore, performance for datasets with non-typical layouts is often unacceptable. Other challenges include schema creation, updates and repeated data imports. To address the above-mentioned problems, I present a system for accessing very large CSV-based datasets by means of SQL. It is characterized by: a “no copy” approach, in which data stay mostly in the CSV files; “zero configuration”, with no need to specify a database schema; implementation in C++ with boost [1], SQLite [2] and Qt [3], requiring no installation and having a very small size; query rewriting, dynamic creation of indices for appropriate columns and static data retrieval directly from CSV files, which ensure efficient plan execution; effortless support for millions of columns; per-value typing, which makes mixed text/number data easy to use; and a very simple network protocol that provides an efficient interface for MATLAB and reduces implementation time for other languages. The software is available as freeware along with educational videos on its website [4]. It needs no prerequisites to run, as all of the libraries are included in the distribution package. I test it against existing database solutions using a battery of benchmarks and discuss the results.
format	Online Article Text
id	pubmed-4113380
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4113380 2014-08-04 Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files Adaszewski, Stanisław PLoS One Research Article Public Library of Science 2014-07-28 /pmc/articles/PMC4113380/ /pubmed/25068261 http://dx.doi.org/10.1371/journal.pone.0103319 Text en © 2014 Stanislaw Adaszewski http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4113380/ https://www.ncbi.nlm.nih.gov/pubmed/25068261 http://dx.doi.org/10.1371/journal.pone.0103319


Similar Items