
Code4ML: a large-scale dataset of annotated Machine Learning code

The use of program code as a data source is increasingly expanding among data scientists. The purpose of the usage varies from the semantic classification of code to the automatic generation of programs. However, the machine learning model application is somewhat limited without annotating the code...


Bibliographic Details
Main authors:	Drozdova, Anastasia; Trofimova, Ekaterina; Guseva, Polina; Scherbakova, Anna; Ustyuzhanin, Andrey
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Data Mining and Machine Learning
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280557/ https://www.ncbi.nlm.nih.gov/pubmed/37346615 http://dx.doi.org/10.7717/peerj-cs.1230

_version_	1785060821784592384
author	Drozdova, Anastasia; Trofimova, Ekaterina; Guseva, Polina; Scherbakova, Anna; Ustyuzhanin, Andrey
author_facet	Drozdova, Anastasia; Trofimova, Ekaterina; Guseva, Polina; Scherbakova, Anna; Ustyuzhanin, Andrey
author_sort	Drozdova, Anastasia
collection	PubMed
description	The use of program code as a data source is increasingly expanding among data scientists. The purpose of the usage varies from the semantic classification of code to the automatic generation of programs. However, the machine learning model application is somewhat limited without annotating the code snippets. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle—the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. Code4ML dataset can help address a number of software engineering or data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.
format	Online Article Text
id	pubmed-10280557
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-102805572023-06-21 Code4ML: a large-scale dataset of annotated Machine Learning code Drozdova, Anastasia Trofimova, Ekaterina Guseva, Polina Scherbakova, Anna Ustyuzhanin, Andrey PeerJ Comput Sci Data Mining and Machine Learning The use of program code as a data source is increasingly expanding among data scientists. The purpose of the usage varies from the semantic classification of code to the automatic generation of programs. However, the machine learning model application is somewhat limited without annotating the code snippets. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle—the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. Code4ML dataset can help address a number of software engineering or data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language. PeerJ Inc. 2023-02-23 /pmc/articles/PMC10280557/ /pubmed/37346615 http://dx.doi.org/10.7717/peerj-cs.1230 Text en © 2023 Drozdova et al. https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle	Data Mining and Machine Learning Drozdova, Anastasia Trofimova, Ekaterina Guseva, Polina Scherbakova, Anna Ustyuzhanin, Andrey Code4ML: a large-scale dataset of annotated Machine Learning code
title	Code4ML: a large-scale dataset of annotated Machine Learning code
title_sort	code4ml: a large-scale dataset of annotated machine learning code
topic	Data Mining and Machine Learning
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280557/ https://www.ncbi.nlm.nih.gov/pubmed/37346615 http://dx.doi.org/10.7717/peerj-cs.1230

