
Two datasets of defect reports labeled by a crowd of annotators of unknown reliability

Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Due to the lack of a domai...


Bibliographic Details
Main authors: Hernández-González, Jerónimo, Rodriguez, Daniel, Inza, Iñaki, Harrison, Rachel, Lozano, Jose A.
Format: Online Article Text
Language: English
Published: Elsevier 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5996400/
https://www.ncbi.nlm.nih.gov/pubmed/29900248
http://dx.doi.org/10.1016/j.dib.2018.03.109
_version_ 1783330843992784896
author Hernández-González, Jerónimo
Rodriguez, Daniel
Inza, Iñaki
Harrison, Rachel
Lozano, Jose A.
author_facet Hernández-González, Jerónimo
Rodriguez, Daniel
Inza, Iñaki
Harrison, Rachel
Lozano, Jose A.
author_sort Hernández-González, Jerónimo
collection PubMed
description Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems in two different real domains. Because no domain expert was available, the collected defects were categorized by a set of annotators of unknown reliability according to their impact, as defined in IBM's Orthogonal Defect Classification taxonomy. Both datasets are prepared for solving the defect classification problem with techniques from the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. The first version gives the raw data: the text description of each defect together with the category assigned by each annotator. In the second version, the text of each defect has been transformed into a descriptive vector using text-mining techniques.
format Online
Article
Text
id pubmed-5996400
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-59964002018-06-13 Two datasets of defect reports labeled by a crowd of annotators of unknown reliability Hernández-González, Jerónimo Rodriguez, Daniel Inza, Iñaki Harrison, Rachel Lozano, Jose A. Data Brief Computer Sciences    Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques. Elsevier 2018-03-28 /pmc/articles/PMC5996400/ /pubmed/29900248 http://dx.doi.org/10.1016/j.dib.2018.03.109 Text en © 2018 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Computer Sciences   
Hernández-González, Jerónimo
Rodriguez, Daniel
Inza, Iñaki
Harrison, Rachel
Lozano, Jose A.
Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_full Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_fullStr Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_full_unstemmed Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_short Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
title_sort two datasets of defect reports labeled by a crowd of annotators of unknown reliability
topic Computer Sciences   
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5996400/
https://www.ncbi.nlm.nih.gov/pubmed/29900248
http://dx.doi.org/10.1016/j.dib.2018.03.109
work_keys_str_mv AT hernandezgonzalezjeronimo twodatasetsofdefectreportslabeledbyacrowdofannotatorsofunknownreliability
AT rodriguezdaniel twodatasetsofdefectreportslabeledbyacrowdofannotatorsofunknownreliability
AT inzainaki twodatasetsofdefectreportslabeledbyacrowdofannotatorsofunknownreliability
AT harrisonrachel twodatasetsofdefectreportslabeledbyacrowdofannotatorsofunknownreliability
AT lozanojosea twodatasetsofdefectreportslabeledbyacrowdofannotatorsofunknownreliability