Detecting non-natural language artifacts for de-noising bug reports

Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate th...

Bibliographic Details
Main authors:	Hirsch, Thomas; Hofer, Birgit
Format:	Online Article Text
Language:	English
Published:	Springer US 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9439617/ https://www.ncbi.nlm.nih.gov/pubmed/36065351 http://dx.doi.org/10.1007/s10515-022-00350-0
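
The abstract of this record describes deriving line-level training labels automatically from Markdown-annotated issue tickets. As a rough illustration of that idea only, the Python sketch below labels each line of a ticket as artifact or natural language depending on whether it sits inside a triple-backtick fence. The function name `label_lines`, the label strings, and the choice to drop fence markers and blank lines are assumptions made for this example; it is not the authors' implementation.

```python
def label_lines(markdown_text):
    """Return (line, label) pairs for a Markdown-annotated ticket.

    Assumption: lines inside ``` fences are artifacts, all others are
    natural language. Fence marker lines and blank lines are dropped so
    the resulting training data carries no Markdown annotation.
    """
    labeled = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence  # toggle on opening/closing fence
            continue
        if not line.strip():
            continue
        labeled.append((line, "artifact" if in_fence else "natural_language"))
    return labeled


# Hypothetical issue ticket used only for demonstration.
issue = (
    "App crashes on start.\n"
    "```\n"
    "Traceback (most recent call last):\n"
    '  File "app.py", line 3\n'
    "```\n"
    "Please advise."
)

for line, label in label_lines(issue):
    print(label, "|", line)
```

Pairs produced this way could serve as input to any line-level text classifier; the record itself does not say which features or model the authors use.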

collection	PubMed
description	Textual documents produced in the software engineering process are a popular target for natural language processing (NLP) and information retrieval (IR) approaches. However, issue tickets often contain artifacts such as code snippets, log outputs and stack traces. These artifacts not only inflate issue ticket sizes; the noise they introduce can also be a real problem for some NLP approaches and therefore has to be removed during pre-processing. In this paper, we present a machine-learning-based approach to classify textual content into natural language and non-natural language artifacts at line level. We show how data from GitHub issue trackers can be used for automated training set generation, and present a custom preprocessing approach for the task of artifact removal. The training sets are automatically created from Markdown-annotated issue tickets and project documentation files. We use these generated training sets to train a Markdown-agnostic model that is able to classify un-annotated content. We evaluate our approach on issue tickets from projects written in C++, Java, JavaScript, PHP, and Python. Our approach achieves ROC-AUC scores between 0.92 and 0.96 for language-specific models. A multi-language model trained on the issue tickets of all languages achieves ROC-AUC scores between 0.92 and 0.95. The provided models are intended to be used as noise reduction pre-processing steps for NLP and IR approaches working on issue tickets.
format	Online Article Text
id	pubmed-9439617
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-9439617 2022-09-03 Detecting non-natural language artifacts for de-noising bug reports Hirsch, Thomas Hofer, Birgit Autom Softw Eng Article
Springer US 2022-08-24 2022 /pmc/articles/PMC9439617/ /pubmed/36065351 http://dx.doi.org/10.1007/s10515-022-00350-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Detecting non-natural language artifacts for de-noising bug reports
topic	Article
