
Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend signific...

Bibliographic Details
Main authors:	Schubotz, Moritz; Satpute, Ankit; Greiner-Petter, André; Aizawa, Akiko; Gipp, Bela
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Research Metrics and Analytics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9075102/ https://www.ncbi.nlm.nih.gov/pubmed/35531060 http://dx.doi.org/10.3389/frma.2022.861944

author	Schubotz, Moritz; Satpute, Ankit; Greiner-Petter, André; Aizawa, Akiko; Gipp, Bela
collection	PubMed
description	Small to medium-scale data science experiments often rely on research software developed ad hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, suppose the ad hoc research software fails during long-running, computationally expensive experiments. In that case, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as proprietary dependence and slow execution speed. At the same time, caching contributes to the reproducibility of experiments in the open science workflow. Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans. We demonstrate the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
format	Online Article Text
id	pubmed-9075102
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-9075102 2022-05-07 Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer Schubotz, Moritz; Satpute, Ankit; Greiner-Petter, André; Aizawa, Akiko; Gipp, Bela Front Res Metr Anal Research Metrics and Analytics Frontiers Media S.A. 2022-04-22 /pmc/articles/PMC9075102/ /pubmed/35531060 http://dx.doi.org/10.3389/frma.2022.861944 Text en Copyright © 2022 Schubotz, Satpute, Greiner-Petter, Aizawa and Gipp. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer
topic	Research Metrics and Analytics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9075102/ https://www.ncbi.nlm.nih.gov/pubmed/35531060 http://dx.doi.org/10.3389/frma.2022.861944
