Optimization of Skewed Data Using Sampling-Based Preprocessing Approach

In the past few years, classification has undergone some major evolution. With a constant surge of the amount of data gathered from different sources, efficient processing and analysis of data is becoming difficult. Due to the uneven distribution of data among classes, data classification with machi...

Bibliographic Details
Main Authors: Mishra, Sushruta, Mallick, Pradeep Kumar, Jena, Lambodar, Chae, Gyoo-Soo
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2020
Subjects: Public Health
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7378392/
https://www.ncbi.nlm.nih.gov/pubmed/32766193
http://dx.doi.org/10.3389/fpubh.2020.00274
author Mishra, Sushruta
Mallick, Pradeep Kumar
Jena, Lambodar
Chae, Gyoo-Soo
collection PubMed
description Classification has evolved considerably in the past few years. As the amount of data gathered from different sources keeps growing, processing and analyzing it efficiently is becoming difficult. Because data are unevenly distributed among classes, classification with machine-learning techniques has become more difficult: most algorithms focus on the majority-class samples and ignore the minority-class data. Data skew is therefore a critical problem that needs the attention of researchers. The paper emphasizes data preprocessing with sampling techniques to overcome the data-skewing problem. Three sampling techniques (Resampling, SpreadSubSampling, and SMOTE) are applied to reduce the uneven data distribution, and the resulting data are classified with the K-nearest neighbor algorithm. Classification performance is evaluated with several performance metrics to determine the efficiency of classification.
format Online
Article
Text
id pubmed-7378392
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7378392 2020-08-05
Optimization of Skewed Data Using Sampling-Based Preprocessing Approach
Mishra, Sushruta; Mallick, Pradeep Kumar; Jena, Lambodar; Chae, Gyoo-Soo
Front Public Health (Public Health)
Frontiers Media S.A. 2020-07-16
/pmc/articles/PMC7378392/ /pubmed/32766193 http://dx.doi.org/10.3389/fpubh.2020.00274
Text en
Copyright © 2020 Mishra, Mallick, Jena and Chae. http://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Optimization of Skewed Data Using Sampling-Based Preprocessing Approach
topic Public Health
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7378392/
https://www.ncbi.nlm.nih.gov/pubmed/32766193
http://dx.doi.org/10.3389/fpubh.2020.00274