
SMOTE-CD: SMOTE for compositional data

Compositional data are a special kind of data, represented as a proportion carrying relative information. Although this type of data is widely spread, no solution exists to deal with the cases where the classes are not well balanced. After describing compositional data imbalance, this paper proposes...


Bibliographic Details
Main authors: Nguyen, Teo, Mengersen, Kerrie, Sous, Damien, Liquet, Benoit
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10309641/
https://www.ncbi.nlm.nih.gov/pubmed/37384667
http://dx.doi.org/10.1371/journal.pone.0287705
author Nguyen, Teo
Mengersen, Kerrie
Sous, Damien
Liquet, Benoit
collection PubMed
description Compositional data are a special kind of data, represented as a proportion carrying relative information. Although this type of data is widely spread, no solution exists to deal with the cases where the classes are not well balanced. After describing compositional data imbalance, this paper proposes an adaptation of the original Synthetic Minority Oversampling TEchnique (SMOTE) to deal with compositional data imbalance. The new approach, called SMOTE for Compositional Data (SMOTE-CD), generates synthetic examples by computing a linear combination of selected existing data points, using compositional data operations. The performance of the SMOTE-CD is tested with three different regressors (Gradient Boosting tree, Neural Networks, Dirichlet regressor) applied to two real datasets and to synthetic generated data, and the performance is evaluated using accuracy, cross-entropy, F1-score, R2 score and RMSE. The results show improvements across all metrics, but the impact of oversampling on performance varies depending on the model and the data. In some cases, oversampling may lead to a decrease in performance for the majority class. However, for the real data, the best performance across all models is achieved when oversampling is used. Notably, the F1-score is consistently increased with oversampling. Unlike the original technique, the performance is not improved when combining oversampling of the minority classes and undersampling of the majority class. The Python package smote-cd implements the method and is available online.
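The abstract describes synthetic examples as a linear combination of existing points under compositional data operations. A minimal sketch of that idea follows, assuming the operations meant are the standard Aitchison perturbation and powering on the simplex, and that all components are strictly positive. The function names (closure, perturb, power, aitchison_interpolate) are hypothetical and chosen for illustration; they are not the API of the smote-cd package, and the record does not confirm that the package computes the combination in exactly this way.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its components sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Aitchison perturbation: the simplex analogue of vector addition."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(x, a):
    """Aitchison powering: the simplex analogue of scalar multiplication."""
    return closure(np.asarray(x, dtype=float) ** a)

def aitchison_interpolate(x, y, lam):
    """Point at fraction lam on the Aitchison line from x to y.

    Computes x (+) lam (.) (y (-) x), which simplifies to
    closure(x**(1 - lam) * y**lam). Assumes strictly positive parts.
    """
    # y (-) x is perturbation of y by the component-wise inverse of x.
    diff = perturb(y, 1.0 / np.asarray(x, dtype=float))
    return perturb(x, power(diff, lam))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([0.7, 0.2, 0.1])   # an existing minority composition
    y = np.array([0.1, 0.3, 0.6])   # one of its selected neighbours
    lam = rng.uniform(0.0, 1.0)     # random weight, as in the original SMOTE
    print(aitchison_interpolate(x, y, lam))
```

With lam = 0 the function returns x and with lam = 1 it returns y; every intermediate value stays inside the simplex, which a plain Euclidean interpolation followed by ad hoc renormalisation would not treat consistently with the relative information the components carry.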
format Online
Article
Text
id pubmed-10309641
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10309641 2023-06-30 SMOTE-CD: SMOTE for compositional data. Nguyen, Teo; Mengersen, Kerrie; Sous, Damien; Liquet, Benoit. PLoS One. Research Article.
Public Library of Science 2023-06-29 /pmc/articles/PMC10309641/ /pubmed/37384667 http://dx.doi.org/10.1371/journal.pone.0287705 Text en © 2023 Nguyen et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title SMOTE-CD: SMOTE for compositional data
topic Research Article