
Research on expansion and classification of imbalanced data based on SMOTE algorithm

With the development of artificial intelligence, big data classification technology provides valuable support for research on auxiliary medical diagnosis. However, because samples are collected under different conditions, medical big data is often imbalanced. The class-imbalance pr...


Bibliographic Details
Main authors: Wang, Shujuan; Dai, Yuntao; Shen, Jihong; Xuan, Jingxue
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8674253/
https://www.ncbi.nlm.nih.gov/pubmed/34912009
http://dx.doi.org/10.1038/s41598-021-03430-5
author Wang, Shujuan
Dai, Yuntao
Shen, Jihong
Xuan, Jingxue
collection PubMed
description With the development of artificial intelligence, big data classification technology provides valuable support for research on auxiliary medical diagnosis. However, because samples are collected under different conditions, medical big data is often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can be used to generate sample points randomly to improve the imbalance rate, but its application is affected by the marginalization of generated samples and by blindness in parameter selection. To address this problem, this paper proposes an improved SMOTE algorithm based on the Normal distribution, so that new sample points are distributed closer to the center of the minority sample with higher probability, which avoids marginalization of the expanded data. Experiments show that classification performance is better when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin datasets. In addition, the parameter selection of the proposed algorithm is analyzed, and it is found that, in the designed experiments, classification performance is best when the selected parameters best preserve the distribution characteristics of the original data.
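The contrast drawn in the description can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the formula, so the second function below is only one possible reading, in which a minority point is pulled toward the minority centroid by a factor drawn from a Normal distribution. The function names, the parameters `mu` and `sigma`, and the clipping of the factor to [0, 1] are all assumptions made for illustration.

```python
import numpy as np


def smote_uniform(minority, k=5, n_new=100, rng=None):
    """Classic SMOTE step: interpolate between a minority point and one of
    its k nearest minority neighbours with a factor drawn from U(0, 1)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[chosen] - X[base])


def smote_normal_center(minority, n_new=100, mu=1.0, sigma=0.3, rng=None):
    """Hypothetical Normal-distribution variant (an assumption, not the
    paper's formula): move a minority point toward the minority centroid by
    a factor drawn from N(mu, sigma^2), clipped to [0, 1]. With mu near 1,
    new points concentrate around the centroid."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    center = X.mean(axis=0)
    base = rng.integers(0, len(X), size=n_new)
    gap = np.clip(rng.normal(mu, sigma, size=(n_new, 1)), 0.0, 1.0)
    return X[base] + gap * (center - X[base])
```

Because the factor is clipped to [0, 1], every point produced by `smote_normal_center` lies on the segment between an original minority point and the centroid, so it can never be farther from the centroid than the original point was. The choice of `mu` and `sigma` controls how tightly the new points cluster, which is the kind of parameter selection the description refers to.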
format Online
Article
Text
id pubmed-8674253
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8674253 2021-12-16 Research on expansion and classification of imbalanced data based on SMOTE algorithm Wang, Shujuan; Dai, Yuntao; Shen, Jihong; Xuan, Jingxue Sci Rep Article
Nature Publishing Group UK 2021-12-15 /pmc/articles/PMC8674253/ /pubmed/34912009 http://dx.doi.org/10.1038/s41598-021-03430-5 Text en © The Author(s) 2021. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title Research on expansion and classification of imbalanced data based on SMOTE algorithm
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8674253/
https://www.ncbi.nlm.nih.gov/pubmed/34912009
http://dx.doi.org/10.1038/s41598-021-03430-5