Optimization of Skewed Data Using Sampling-Based Preprocessing Approach
In the past few years, classification has undergone some major evolution. With a constant surge of the amount of data gathered from different sources, efficient processing and analysis of data is becoming difficult. Due to the uneven distribution of data among classes, data classification with machi...
Main authors: | Mishra, Sushruta; Mallick, Pradeep Kumar; Jena, Lambodar; Chae, Gyoo-Soo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2020 |
Subjects: | Public Health |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7378392/ https://www.ncbi.nlm.nih.gov/pubmed/32766193 http://dx.doi.org/10.3389/fpubh.2020.00274 |
_version_ | 1783562409890283520 |
---|---|
author | Mishra, Sushruta Mallick, Pradeep Kumar Jena, Lambodar Chae, Gyoo-Soo |
author_facet | Mishra, Sushruta Mallick, Pradeep Kumar Jena, Lambodar Chae, Gyoo-Soo |
author_sort | Mishra, Sushruta |
collection | PubMed |
description | In the past few years, classification has undergone some major evolution. With a constant surge of the amount of data gathered from different sources, efficient processing and analysis of data is becoming difficult. Due to the uneven distribution of data among classes, data classification with machine-learning techniques has become more tedious. While most algorithms focus on major data samples, they ignore the minor class data. Thus, the data-skewing issue is one of the critical problems that need attention of researchers. The paper stresses upon data preprocessing using sampling techniques to overcome the data-skewing problem. Here, three different sampling techniques such as Resampling, SpreadSubSampling, and SMOTE are implemented to reduce this uneven data distribution issue and classified with the K-nearest neighbor algorithm. The performance of classification is evaluated with various performance metrics to determine the efficiency of classification. |
format | Online Article Text |
id | pubmed-7378392 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-73783922020-08-05 Optimization of Skewed Data Using Sampling-Based Preprocessing Approach Mishra, Sushruta Mallick, Pradeep Kumar Jena, Lambodar Chae, Gyoo-Soo Front Public Health Public Health In the past few years, classification has undergone some major evolution. With a constant surge of the amount of data gathered from different sources, efficient processing and analysis of data is becoming difficult. Due to the uneven distribution of data among classes, data classification with machine-learning techniques has become more tedious. While most algorithms focus on major data samples, they ignore the minor class data. Thus, the data-skewing issue is one of the critical problems that need attention of researchers. The paper stresses upon data preprocessing using sampling techniques to overcome the data-skewing problem. Here, three different sampling techniques such as Resampling, SpreadSubSampling, and SMOTE are implemented to reduce this uneven data distribution issue and classified with the K-nearest neighbor algorithm. The performance of classification is evaluated with various performance metrics to determine the efficiency of classification. Frontiers Media S.A. 2020-07-16 /pmc/articles/PMC7378392/ /pubmed/32766193 http://dx.doi.org/10.3389/fpubh.2020.00274 Text en Copyright © 2020 Mishra, Mallick, Jena and Chae. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Public Health Mishra, Sushruta Mallick, Pradeep Kumar Jena, Lambodar Chae, Gyoo-Soo Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title | Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title_full | Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title_fullStr | Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title_full_unstemmed | Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title_short | Optimization of Skewed Data Using Sampling-Based Preprocessing Approach |
title_sort | optimization of skewed data using sampling-based preprocessing approach |
topic | Public Health |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7378392/ https://www.ncbi.nlm.nih.gov/pubmed/32766193 http://dx.doi.org/10.3389/fpubh.2020.00274 |
work_keys_str_mv | AT mishrasushruta optimizationofskeweddatausingsamplingbasedpreprocessingapproach AT mallickpradeepkumar optimizationofskeweddatausingsamplingbasedpreprocessingapproach AT jenalambodar optimizationofskeweddatausingsamplingbasedpreprocessingapproach AT chaegyoosoo optimizationofskeweddatausingsamplingbasedpreprocessingapproach |
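
The description field above outlines a pipeline: rebalance a skewed training set by sampling, then classify with the K-nearest neighbor algorithm. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. `smote_like` and `undersample` are hypothetical helpers that only approximate SMOTE-style interpolation and random majority subsampling, and the toy points and class labels are invented for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def smote_like(minority, n_new, k=2, rng=None):
    """Oversample: create n_new synthetic points, each placed on the segment
    between a random minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def undersample(majority, n_keep, rng=None):
    """Undersample: keep a random subset of the majority class."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)


# Invented toy data: 8 majority points near the origin, 3 minority points near (5, 5).
majority = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 0.9),
            (0.8, 0.7), (1.2, 1.1), (0.3, 0.4), (0.9, 0.3)]
minority = [(5.0, 5.0), (5.5, 4.8), (4.7, 5.3)]

# Oversample the minority class until both classes have the same size.
synthetic = smote_like(minority, len(majority) - len(minority))
balanced_X = majority + minority + synthetic
balanced_y = (["majority"] * len(majority)
              + ["minority"] * (len(minority) + len(synthetic)))

print(Counter(balanced_y))
print(knn_predict(balanced_X, balanced_y, (5.1, 5.0)))
```

Calling `undersample(majority, len(minority))` instead would balance the classes by shrinking the majority class rather than growing the minority class. Which direction works better depends on the data set, so this sketch makes no claim about the results reported in the article.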