Two datasets of defect reports labeled by a crowd of annotators of unknown reliability
Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domai...
Main authors: | Hernández-González, Jerónimo; Rodriguez, Daniel; Inza, Iñaki; Harrison, Rachel; Lozano, Jose A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2018 |
Subjects: | Computer Sciences |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5996400/ https://www.ncbi.nlm.nih.gov/pubmed/29900248 http://dx.doi.org/10.1016/j.dib.2018.03.109 |
_version_ | 1783330843992784896 |
---|---|
author | Hernández-González, Jerónimo Rodriguez, Daniel Inza, Iñaki Harrison, Rachel Lozano, Jose A. |
collection | PubMed |
description | Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques. |
format | Online Article Text |
id | pubmed-5996400 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-5996400; 2018-06-13; Two datasets of defect reports labeled by a crowd of annotators of unknown reliability; Hernández-González, Jerónimo; Rodriguez, Daniel; Inza, Iñaki; Harrison, Rachel; Lozano, Jose A.; Data Brief; Computer Sciences; Classifying software defects according to any defined taxonomy is not straightforward. In order to be used for automatizing the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques.; Elsevier; 2018-03-28; /pmc/articles/PMC5996400/; /pubmed/29900248; http://dx.doi.org/10.1016/j.dib.2018.03.109; Text; en; © 2018 The Authors; http://creativecommons.org/licenses/by/4.0/; This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Two datasets of defect reports labeled by a crowd of annotators of unknown reliability |
topic | Computer Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5996400/ https://www.ncbi.nlm.nih.gov/pubmed/29900248 http://dx.doi.org/10.1016/j.dib.2018.03.109 |