
Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study

High dimensionality and class imbalance have been widely recognized as important issues in machine learning. A vast amount of literature has investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem in...

Bibliographic Details
Main authors: Pes, Barbara; Lai, Giuseppina
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8725666/
https://www.ncbi.nlm.nih.gov/pubmed/35036539
http://dx.doi.org/10.7717/peerj-cs.832
author Pes, Barbara
Lai, Giuseppina
author_facet Pes, Barbara
Lai, Giuseppina
author_sort Pes, Barbara
collection PubMed
description High dimensionality and class imbalance have been widely recognized as important issues in machine learning. A vast amount of literature has investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impair the generalization ability of the induced models. Nevertheless, although both issues have been studied extensively for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far been conducted to investigate which approaches might be best suited to datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study of different learning strategies that leverage both feature selection, to cope with high dimensionality, and cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Different feature selection heuristics, both univariate and multivariate, have also been considered, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, yielding interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
format Online
Article
Text
id pubmed-8725666
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8725666 2022-01-14 Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study Pes, Barbara Lai, Giuseppina PeerJ Comput Sci Bioinformatics PeerJ Inc. 2021-12-24 /pmc/articles/PMC8725666/ /pubmed/35036539 http://dx.doi.org/10.7717/peerj-cs.832 Text en ©2021 Pes and Lai https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle Bioinformatics
Pes, Barbara
Lai, Giuseppina
Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study
title Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study
title_sort cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8725666/
https://www.ncbi.nlm.nih.gov/pubmed/35036539
http://dx.doi.org/10.7717/peerj-cs.832