
Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods...


Bibliographic Details
Main authors: Bhattacherjee, Souvik; Chavan, Amit; Huang, Silu; Deshpande, Amol; Parameswaran, Aditya
Format: Online Article Text
Language: English
Published: 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5526644/
https://www.ncbi.nlm.nih.gov/pubmed/28752014
http://dx.doi.org/10.14778/2824032.2824035
author Bhattacherjee, Souvik
Chavan, Amit
Huang, Silu
Deshpande, Amol
Parameswaran, Aditya
collection PubMed
description The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
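The storage-recreation trade-off in the description can be illustrated with a toy version graph. The sketch below is only an illustration under stated assumptions, not the authors' system or heuristics: the graph, every cost figure, and the names `EDGES`, `build_tree`, and `evaluate` are hypothetical, and deltas are assumed to be symmetric so the graph can be treated as undirected. Node 0 is a dummy root; an edge from 0 to a version means "store that version in full", and any other edge means "store one version as a delta from the other". A minimum spanning tree on storage cost gives the smallest total storage, while a shortest-path tree on recreation cost gives the fastest recreation of each version.

```python
import heapq

# Hypothetical costs: (storage_cost, recreation_cost) per edge.
# Node 0 is a dummy root; edge (0, v) stores version v in full.
EDGES = {
    (0, 1): (100, 100),
    (0, 2): (110, 110),
    (0, 3): (120, 120),
    (1, 2): (10, 15),   # version 2 stored as a delta from version 1
    (2, 3): (12, 18),
    (1, 3): (30, 35),
}


def build_tree(n, edges, root, mode):
    """Grow a tree over nodes 0..n-1 starting at root.

    mode="storage":    Prim's algorithm (minimum spanning tree on storage cost).
    mode="recreation": Dijkstra's algorithm (shortest-path tree on recreation cost).
    Returns a dict mapping each non-root node to its parent.
    """
    adj = {v: [] for v in range(n)}
    for (u, v), (s, r) in edges.items():
        adj[u].append((v, s, r))
        adj[v].append((u, s, r))  # assumption: deltas are symmetric
    parent, done = {}, set()
    heap = [(0, root, None)]
    while heap:
        key, v, p = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        if p is not None:
            parent[v] = p
        for w, s, r in adj[v]:
            if w not in done:
                # Prim keys on the single edge; Dijkstra keys on the whole path.
                new_key = s if mode == "storage" else key + r
                heapq.heappush(heap, (new_key, w, v))
    return parent


def evaluate(parent, edges, root):
    """Return (total storage, {version: recreation cost}) for a storage tree."""
    def cost(u, v):
        return edges.get((u, v)) or edges[(v, u)]

    total_storage = sum(cost(p, v)[0] for v, p in parent.items())
    recreation = {}
    for v in parent:
        c, x = 0, v
        while x != root:  # walk the delta chain back to a fully stored version
            c += cost(parent[x], x)[1]
            x = parent[x]
        recreation[v] = c
    return total_storage, recreation


for mode in ("storage", "recreation"):
    tree = build_tree(4, EDGES, 0, mode)
    print(mode, evaluate(tree, EDGES, 0))
```

With these made-up numbers, the storage-optimal tree stores one full version plus two deltas (total storage 122, worst recreation cost 133), while the recreation-optimal tree stores all three versions in full (total storage 330, worst recreation cost 120). The problems described in the abstract lie between these two extremes.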
format Online
Article
Text
id pubmed-5526644
institution National Center for Biotechnology Information
language English
publishDate 2015
record_format MEDLINE/PubMed
spelling pubmed-5526644 2017-07-25 Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff. Bhattacherjee, Souvik; Chavan, Amit; Huang, Silu; Deshpande, Amol; Parameswaran, Aditya. Proceedings VLDB Endowment. Article. 2015-08. /pmc/articles/PMC5526644/ /pubmed/28752014 http://dx.doi.org/10.14778/2824032.2824035 Text en http://creativecommons.org/licenses/byncnd/3.0/ This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/byncnd/3.0/.
title Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
topic Article