Dataset Reuse: Toward Translating Principles to Practice

The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper,...

Bibliographic Details
Main authors: Koesten, Laura; Vougiouklis, Pavlos; Simperl, Elena; Groth, Paul
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7691392/
https://www.ncbi.nlm.nih.gov/pubmed/33294873
http://dx.doi.org/10.1016/j.patter.2020.100136
_version_ 1783614280517550080
author Koesten, Laura
Vougiouklis, Pavlos
Simperl, Elena
Groth, Paul
author_facet Koesten, Laura
Vougiouklis, Pavlos
Simperl, Elena
Groth, Paul
author_sort Koesten, Laura
collection PubMed
description The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
format Online
Article
Text
id pubmed-7691392
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-76913922020-12-07 Dataset Reuse: Toward Translating Principles to Practice Koesten, Laura Vougiouklis, Pavlos Simperl, Elena Groth, Paul Patterns (N Y) Article The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse. Elsevier 2020-11-04 /pmc/articles/PMC7691392/ /pubmed/33294873 http://dx.doi.org/10.1016/j.patter.2020.100136 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Koesten, Laura
Vougiouklis, Pavlos
Simperl, Elena
Groth, Paul
Dataset Reuse: Toward Translating Principles to Practice
title Dataset Reuse: Toward Translating Principles to Practice
title_full Dataset Reuse: Toward Translating Principles to Practice
title_fullStr Dataset Reuse: Toward Translating Principles to Practice
title_full_unstemmed Dataset Reuse: Toward Translating Principles to Practice
title_short Dataset Reuse: Toward Translating Principles to Practice
title_sort dataset reuse: toward translating principles to practice
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7691392/
https://www.ncbi.nlm.nih.gov/pubmed/33294873
http://dx.doi.org/10.1016/j.patter.2020.100136
work_keys_str_mv AT koestenlaura datasetreusetowardtranslatingprinciplestopractice
AT vougiouklispavlos datasetreusetowardtranslatingprinciplestopractice
AT simperlelena datasetreusetowardtranslatingprinciplestopractice
AT grothpaul datasetreusetowardtranslatingprinciplestopractice