An oversampling method for imbalanced data based on spatial distribution of minority samples SD-KMSMOTE
With the rapid expansion of data, data imbalance has become increasingly prominent in fields such as medical treatment, finance, and networking, and it is typically addressed by oversampling. However, most existing oversampling methods sample randomly or sample only for a part...
Main authors: | Yang, Wensheng; Pan, Chengsheng; Zhang, Yanyan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9546831/ https://www.ncbi.nlm.nih.gov/pubmed/36207460 http://dx.doi.org/10.1038/s41598-022-21046-1 |
_version_ | 1784805131963858944 |
---|---|
author | Yang, Wensheng; Pan, Chengsheng; Zhang, Yanyan |
author_facet | Yang, Wensheng; Pan, Chengsheng; Zhang, Yanyan |
author_sort | Yang, Wensheng |
collection | PubMed |
description | With the rapid expansion of data, data imbalance has become increasingly prominent in fields such as medical treatment, finance, and networking, and it is typically addressed by oversampling. However, most existing oversampling methods either sample randomly or sample only within a particular area, which degrades classification results. To address these limitations, this study proposes SD-KMSMOTE, an oversampling method for imbalanced data based on the spatial distribution of minority samples. A noise-filtering pre-treatment step is added that considers the class information of near-neighbouring samples and removes noise among the existing minority class samples. These steps inform the design of a new sample synthesis method, and the rules for calculating the weight values are constructed on this basis. The spatial distribution of the minority class samples is considered comprehensively: the samples are clustered, and sub-clusters that contain useful information are assigned larger weight values and more synthetic samples. Experimental results show that, when the proposed method is used to expand imbalanced datasets in medicine and other fields, it outperforms existing methods in terms of precision, recall, F1 score, G-mean, and area under the curve. |
format | Online Article Text |
id | pubmed-9546831 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9546831 2022-10-09 An oversampling method for imbalanced data based on spatial distribution of minority samples SD-KMSMOTE. Yang, Wensheng; Pan, Chengsheng; Zhang, Yanyan. Sci Rep. Article. Nature Publishing Group UK 2022-10-07 /pmc/articles/PMC9546831/ /pubmed/36207460 http://dx.doi.org/10.1038/s41598-022-21046-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
title | An oversampling method for imbalanced data based on spatial distribution of minority samples SD-KMSMOTE |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9546831/ https://www.ncbi.nlm.nih.gov/pubmed/36207460 http://dx.doi.org/10.1038/s41598-022-21046-1 |
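
The pipeline outlined in the description field (filter noisy minority samples, cluster the rest, weight the sub-clusters, then synthesize more samples in the higher-weighted ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the actual weighting or synthesis rules, so the function names, the "all k nearest neighbours are majority" noise test, the plain k-means step, the spread-based cluster weight, and the SMOTE-style linear interpolation are all assumptions chosen for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_noise(minority, majority, k=3):
    """ASSUMED noise rule: drop a minority point if all of its k nearest
    neighbours (over both classes) belong to the majority class."""
    kept = []
    for i, p in enumerate(minority):
        neighbours = [(dist(p, q), 0) for j, q in enumerate(minority) if j != i]
        neighbours += [(dist(p, q), 1) for q in majority]
        nearest = sorted(neighbours)[:k]
        if any(label == 0 for _, label in nearest):
            kept.append(p)
    return kept


def kmeans(points, n_clusters, rng, n_iter=20):
    """Plain k-means; returns the non-empty clusters as lists of points."""
    centroids = rng.sample(points, n_clusters)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            idx = min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def cluster_weights(clusters):
    """ASSUMED weighting rule (not taken from the paper): clusters with a
    larger mean distance to their centroid receive a larger weight."""
    spreads = []
    for c in clusters:
        centre = tuple(sum(col) / len(c) for col in zip(*c))
        spreads.append(sum(dist(p, centre) for p in c) / len(c) + 1e-9)
    total = sum(spreads)
    return [s / total for s in spreads]


def oversample(minority, majority, n_new, n_clusters=2, k=3, seed=0):
    """Return (filtered minority points, synthetic minority points)."""
    rng = random.Random(seed)
    clean = filter_noise(minority, majority, k)
    if not clean:
        return clean, []
    clusters = kmeans(clean, min(n_clusters, len(clean)), rng)
    weights = cluster_weights(clusters)
    synthetic = []
    for _ in range(n_new):
        # Pick a cluster in proportion to its weight, then interpolate
        # between two of its members (SMOTE-style).
        c = rng.choices(clusters, weights=weights, k=1)[0]
        a, b = rng.choice(c), rng.choice(c)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return clean, synthetic
```

Calling `oversample(minority, majority, n_new=10)` with lists of coordinate tuples returns the filtered minority points and ten synthetic ones. Because each synthetic point is interpolated between two members of the same cluster, it stays inside the region already occupied by the filtered minority samples.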