
Source code analysis dataset

The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extr...

Bibliographic Details
Main authors:	Gelman, Ben; Obayomi, Banjo; Moore, Jessica; Slater, David
Format:	Online Article Text
Language:	English
Published:	Elsevier 2019
Subjects:	Computer Science
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6859235/ https://www.ncbi.nlm.nih.gov/pubmed/31763386 http://dx.doi.org/10.1016/j.dib.2019.104712

_version_	1783471089246011392
author	Gelman, Ben; Obayomi, Banjo; Moore, Jessica; Slater, David
author_facet	Gelman, Ben; Obayomi, Banjo; Moore, Jessica; Slater, David
author_sort	Gelman, Ben
collection	PubMed
description	The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw C and C++ source code repositories with potential code vulnerabilities, which are determined by running the Infer static analyzer. The code and comment pairs can be used for tasks such as predicting comments or creating natural language descriptions of code. The code and build artifact pairs can be used for tasks such as reverse engineering or improving intermediate representations of code from decompiled binaries. The code and static analyzer pairs can be used for tasks such as machine learning approaches to vulnerability discovery.
format	Online Article Text
id	pubmed-6859235
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-6859235 2019-11-22 Source code analysis dataset Gelman, Ben; Obayomi, Banjo; Moore, Jessica; Slater, David Data Brief Computer Science Elsevier 2019-10-24 /pmc/articles/PMC6859235/ /pubmed/31763386 http://dx.doi.org/10.1016/j.dib.2019.104712 Text en © 2019 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Computer Science Gelman, Ben Obayomi, Banjo Moore, Jessica Slater, David Source code analysis dataset
title	Source code analysis dataset
topic	Computer Science
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6859235/ https://www.ncbi.nlm.nih.gov/pubmed/31763386 http://dx.doi.org/10.1016/j.dib.2019.104712
