Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis

A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, it is not always easy to analyze HR data, particularly when the number of persons with a target disease is too small in comparison with the population. This situ...

Bibliographic Details
Main authors: Fujiwara, Koichi; Huang, Yukun; Hori, Kentaro; Nishioji, Kenichi; Kobayashi, Masao; Kamaguchi, Mai; Kano, Manabu
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Public Health
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7248318/
https://www.ncbi.nlm.nih.gov/pubmed/32509717
http://dx.doi.org/10.3389/fpubh.2020.00178
_version_ 1783538345288138752
author Fujiwara, Koichi
Huang, Yukun
Hori, Kentaro
Nishioji, Kenichi
Kobayashi, Masao
Kamaguchi, Mai
Kano, Manabu
author_sort Fujiwara, Koichi
collection PubMed
description A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, it is not always easy to analyze HR data, particularly when the number of persons with a target disease is too small in comparison with the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, which can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, which is called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm called boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost) to solve the EISM data problem. To make an artificially balanced dataset from the original imbalanced datasets, HUSDOS-Boost uses both under-sampling and over-sampling to eliminate redundant majority examples based on prior boosting results and to generate artificial minority examples by following the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. These results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology of HR data analysis.
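As a rough illustration of the balancing idea in the description above, the following Python sketch under-samples the majority class and generates artificial minority examples from a fitted distribution. It is not the authors' HUSDOS-Boost implementation: the function names, the per-example `weights` (standing in for "prior boosting results"), the rule of keeping the highest-weighted majority rows, and the independent per-feature Gaussian model of the minority class are all assumptions made for this example.

```python
import random
import statistics


def undersample_majority(majority, weights, n_keep):
    """Keep the n_keep majority rows with the largest weights.

    Assumption: `weights` are hypothetical per-example scores from an
    earlier boosting round; low-weight rows are treated as redundant.
    """
    order = sorted(range(len(majority)), key=lambda i: weights[i], reverse=True)
    return [majority[i] for i in order[:n_keep]]


def oversample_minority(minority, n_new, rng):
    """Draw n_new artificial rows from per-feature Gaussians.

    Assumption: each feature is modeled independently with the sample
    mean and standard deviation of the minority rows.
    """
    columns = list(zip(*minority))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.stdev(col) if len(col) > 1 else 0.0 for col in columns]
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(n_new)]


def make_balanced(majority, minority, weights, n_per_class, seed=0):
    """Build a balanced dataset with labels 0 (majority) and 1 (minority)."""
    rng = random.Random(seed)
    kept = undersample_majority(majority, weights, n_per_class)
    # Only generate artificial rows if the minority class is short of the target.
    n_new = max(0, n_per_class - len(minority))
    synthetic = oversample_minority(minority, n_new, rng)
    X = kept + minority + synthetic
    y = [0] * len(kept) + [1] * (len(minority) + len(synthetic))
    return X, y


if __name__ == "__main__":
    # Toy data: 100 majority rows, 3 minority rows, 2 features each.
    majority = [[float(i), float(i % 7)] for i in range(100)]
    weights = [1.0 + (i % 5) for i in range(100)]
    minority = [[50.0, 1.0], [52.0, 2.0], [54.0, 3.0]]
    X, y = make_balanced(majority, minority, weights, n_per_class=10)
    print(len(X), y.count(0), y.count(1))
```

In an ensemble setting, a step like `make_balanced` would be repeated before training each weak learner, with the weights updated between rounds; how the published algorithm does this is described in the full article, not in this record.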
format Online
Article
Text
id pubmed-7248318
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7248318 2020-06-05 Front Public Health Public Health Frontiers Media S.A. 2020-05-19 /pmc/articles/PMC7248318/ /pubmed/32509717 http://dx.doi.org/10.3389/fpubh.2020.00178 Text en Copyright © 2020 Fujiwara, Huang, Hori, Nishioji, Kobayashi, Kamaguchi and Kano. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis
topic Public Health