Predicting defects in imbalanced data using resampling methods: an empirical investigation
The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics of many defect-related open-source data sets depict the class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction...
Main Authors: | Malhotra, Ruchika; Jain, Juhi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2022 |
Subjects: | Data Mining and Machine Learning |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9137963/ https://www.ncbi.nlm.nih.gov/pubmed/35634102 http://dx.doi.org/10.7717/peerj-cs.573 |
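The abstract above singles out AUC, GMean, and Balance as stable evaluators for imbalanced defect data. The record does not reproduce the paper's exact formulas, so the sketch below uses the definitions commonly found in the defect-prediction literature, with `pd` the sensitivity of the defective class and `pf` the false positive rate; treat it as an illustrative assumption rather than the authors' implementation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Sensitivity (pd), GMean, and Balance as commonly defined in
    software defect prediction studies; AUC from predicted scores."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd_ = tp / (tp + fn)           # sensitivity / recall on the defective class
    pf = fp / (fp + tn)            # false positive rate
    specificity = 1.0 - pf
    gmean = np.sqrt(pd_ * specificity)
    # Balance: normalized distance from the ideal ROC point (pf = 0, pd = 1)
    balance = 1.0 - np.sqrt((0.0 - pf) ** 2 + (1.0 - pd_) ** 2) / np.sqrt(2.0)
    auc = roc_auc_score(y_true, y_score)
    return {"sensitivity": pd_, "gmean": gmean, "balance": balance, "auc": auc}
```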
_version_ | 1784714509289521152 |
---|---|
author | Malhotra, Ruchika Jain, Juhi |
author_facet | Malhotra, Ruchika Jain, Juhi |
author_sort | Malhotra, Ruchika |
collection | PubMed |
description | The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics of many defect-related open-source data sets depict the class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition to this, a large number of software metrics degrades the model performance. This study aims at (1) identification of useful metrics in the software using correlation-based feature selection, (2) extensive comparative analysis of 10 resampling methods to generate effective machine learning models for imbalanced data, (3) inclusion of stable performance evaluators (AUC, GMean, and Balance) and (4) integration of statistical validation of results. The impact of 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is analyzed using AUC, GMean, Balance, and sensitivity. Statistical results advocate the use of resampling methods to improve SDP. Random oversampling yields the best predictive capability among the developed defect prediction models. The study provides a guideline for identifying metrics that are influential for SDP. The performance of oversampling methods is superior to that of undersampling methods. |
format | Online Article Text |
id | pubmed-9137963 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9137963 2022-05-28 Predicting defects in imbalanced data using resampling methods: an empirical investigation Malhotra, Ruchika Jain, Juhi PeerJ Comput Sci Data Mining and Machine Learning The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics of many defect-related open-source data sets depict the class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition to this, a large number of software metrics degrades the model performance. This study aims at (1) identification of useful metrics in the software using correlation-based feature selection, (2) extensive comparative analysis of 10 resampling methods to generate effective machine learning models for imbalanced data, (3) inclusion of stable performance evaluators (AUC, GMean, and Balance) and (4) integration of statistical validation of results. The impact of 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is analyzed using AUC, GMean, Balance, and sensitivity. Statistical results advocate the use of resampling methods to improve SDP. Random oversampling yields the best predictive capability among the developed defect prediction models. The study provides a guideline for identifying metrics that are influential for SDP. The performance of oversampling methods is superior to that of undersampling methods. PeerJ Inc. 2022-04-29 /pmc/articles/PMC9137963/ /pubmed/35634102 http://dx.doi.org/10.7717/peerj-cs.573 Text en © 2022 Malhotra and Jain https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Data Mining and Machine Learning Malhotra, Ruchika Jain, Juhi Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title | Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title_full | Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title_fullStr | Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title_full_unstemmed | Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title_short | Predicting defects in imbalanced data using resampling methods: an empirical investigation |
title_sort | predicting defects in imbalanced data using resampling methods: an empirical investigation |
topic | Data Mining and Machine Learning |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9137963/ https://www.ncbi.nlm.nih.gov/pubmed/35634102 http://dx.doi.org/10.7717/peerj-cs.573 |
work_keys_str_mv | AT malhotraruchika predictingdefectsinimbalanceddatausingresamplingmethodsanempiricalinvestigation AT jainjuhi predictingdefectsinimbalanceddatausingresamplingmethodsanempiricalinvestigation |
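The record reports random oversampling as giving the best predictive capability among the ten resampling methods studied. Below is a minimal, hypothetical sketch of that kind of setup using imbalanced-learn's RandomOverSampler with a scikit-learn classifier; the paper's actual Apache datasets, selected metrics, and 15 learners are not part of this record, so synthetic data and a random forest stand in for them.

```python
# A minimal sketch: random oversampling of the training split only, then
# evaluation on untouched test data. Library choices (scikit-learn,
# imbalanced-learn) are assumptions; the record does not state the tooling.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for an object-oriented defect dataset
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Balance the classes by randomly duplicating minority (defective) instances
X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]
# Evaluate with the metrics sketched earlier, e.g.
# imbalance_metrics(y_test, y_pred, y_score)
```

Keeping the test split unresampled is the usual design choice here: it preserves the original class distribution so AUC, GMean, and Balance reflect performance on realistically imbalanced data.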