
Workflow analysis of data science code in public GitHub repositories

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and...


Bibliographic Details
Main authors:	Ramasamy, Dhivyabharathi; Sarasua, Cristina; Bacchelli, Alberto; Bernstein, Abraham
Format:	Online Article Text
Language:	English
Published:	Springer US 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9675706/ https://www.ncbi.nlm.nih.gov/pubmed/36420321 http://dx.doi.org/10.1007/s10664-022-10229-z

author	Ramasamy, Dhivyabharathi; Sarasua, Cristina; Bacchelli, Alberto; Bernstein, Abraham
author_sort	Ramasamy, Dhivyabharathi
collection	PubMed
description	Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
format	Online Article Text
id	pubmed-9675706
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-9675706 2022-11-21 Workflow analysis of data science code in public GitHub repositories Ramasamy, Dhivyabharathi; Sarasua, Cristina; Bacchelli, Alberto; Bernstein, Abraham Empir Softw Eng Article Springer US 2022-11-19 2023 /pmc/articles/PMC9675706/ /pubmed/36420321 http://dx.doi.org/10.1007/s10664-022-10229-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Workflow analysis of data science code in public GitHub repositories
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9675706/ https://www.ncbi.nlm.nih.gov/pubmed/36420321 http://dx.doi.org/10.1007/s10664-022-10229-z
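
The description above mentions extracting transitions between data science steps with a first-order Markov chain model. The following is a minimal sketch of that general technique, not the authors' implementation. It assumes each notebook has already been reduced to an ordered list with exactly one step label per code cell (the record notes that a cell may carry more than one step, which this sketch ignores). The function name and the labels `load_data`, `data_preprocessing` and `modelling` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    sequences: iterable of lists of step labels, one list per notebook,
    in cell order. Returns a dict {src: {dst: P(dst | src)}}.
    """
    counts = defaultdict(Counter)
    for steps in sequences:
        # Pair every cell with its immediate successor in the same notebook.
        for src, dst in zip(steps, steps[1:]):
            counts[src][dst] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        # Normalise outgoing counts so each row sums to 1.
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Hypothetical annotated notebooks (illustrative labels, not the paper's data).
notebooks = [
    ["load_data", "data_preprocessing", "modelling",
     "data_preprocessing", "modelling"],
    ["load_data", "data_preprocessing", "data_preprocessing", "modelling"],
]

probs = transition_probabilities(notebooks)
for src, row in probs.items():
    print(src, row)
```

A non-zero probability for a backward transition such as `modelling` to `data_preprocessing` is the kind of signal that would indicate an iterative workflow in this toy data.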