
Imbalanced data preprocessing techniques for machine learning: a systematic mapping study

Machine Learning (ML) algorithms have been increasingly replacing people in several application domains—in which the majority suffer from data imbalance. In order to solve this problem, published studies implement data preprocessing techniques, cost-sensitive and ensemble learning. These solutions r...


Bibliographic Details
Main Authors:	Werner de Vargas, Vitor; Schneider Aranda, Jorge Arthur; dos Santos Costa, Ricardo; da Silva Pereira, Paulo Ricardo; Victória Barbosa, Jorge Luis
Format:	Online Article Text
Language:	English
Published:	Springer London 2022
Subjects:	Survey Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645765/ https://www.ncbi.nlm.nih.gov/pubmed/36405957 http://dx.doi.org/10.1007/s10115-022-01772-8

_version_	1784827025462132736
author	Werner de Vargas, Vitor; Schneider Aranda, Jorge Arthur; dos Santos Costa, Ricardo; da Silva Pereira, Paulo Ricardo; Victória Barbosa, Jorge Luis
author_facet	Werner de Vargas, Vitor; Schneider Aranda, Jorge Arthur; dos Santos Costa, Ricardo; da Silva Pereira, Paulo Ricardo; Victória Barbosa, Jorge Luis
author_sort	Werner de Vargas, Vitor
collection	PubMed
description	Machine Learning (ML) algorithms have been increasingly replacing people in several application domains—in which the majority suffer from data imbalance. In order to solve this problem, published studies implement data preprocessing techniques, cost-sensitive and ensemble learning. These solutions reduce the naturally occurring bias towards the majority sample through ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies—illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance—with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.
format	Online Article Text
id	pubmed-9645765
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer London
record_format	MEDLINE/PubMed
spelling	pubmed-96457652022-11-14 Imbalanced data preprocessing techniques for machine learning: a systematic mapping study Werner de Vargas, Vitor Schneider Aranda, Jorge Arthur dos Santos Costa, Ricardo da Silva Pereira, Paulo Ricardo Victória Barbosa, Jorge Luis Knowl Inf Syst Survey Paper Machine Learning (ML) algorithms have been increasingly replacing people in several application domains—in which the majority suffer from data imbalance. In order to solve this problem, published studies implement data preprocessing techniques, cost-sensitive and ensemble learning. These solutions reduce the naturally occurring bias towards the majority sample through ML. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies—illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance—with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions. Springer London 2022-11-09 2023 /pmc/articles/PMC9645765/ /pubmed/36405957 http://dx.doi.org/10.1007/s10115-022-01772-8 Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022, Springer Nature or its licensor (e.g. 
a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Imbalanced data preprocessing techniques for machine learning: a systematic mapping study
topic	Survey Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9645765/ https://www.ncbi.nlm.nih.gov/pubmed/36405957 http://dx.doi.org/10.1007/s10115-022-01772-8


Similar Items