Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

The filtering feature-selection algorithm is an important approach to dimensionality reduction in the field of text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category based on a balanced dataset and do not consider the imbalanc...

Bibliographic Details
Main authors: Yang, Jieming, Qu, Zhaoyang, Liu, Zhiying
Format: Online Article Text
Language: English
Published: Hindawi Publishing Corporation 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4058251/
https://www.ncbi.nlm.nih.gov/pubmed/24971386
http://dx.doi.org/10.1155/2014/625342
_version_ 1782321105162207232
author Yang, Jieming
Qu, Zhaoyang
Liu, Zhiying
author_facet Yang, Jieming
Qu, Zhaoyang
Liu, Zhiying
author_sort Yang, Jieming
collection PubMed
description The filtering feature-selection algorithm is an important approach to dimensionality reduction in the field of text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category based on a balanced dataset and do not consider the imbalance factor of the dataset. This paper proposes a new scheme that can weaken the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme can significantly enhance the performance of the feature-selection methods.
format Online
Article
Text
id pubmed-4058251
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-40582512014-06-26 Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization Yang, Jieming Qu, Zhaoyang Liu, Zhiying ScientificWorldJournal Research Article The filtering feature-selection algorithm is an important approach to dimensionality reduction in the field of text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category based on a balanced dataset and do not consider the imbalance factor of the dataset. This paper proposes a new scheme that can weaken the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme can significantly enhance the performance of the feature-selection methods. Hindawi Publishing Corporation 2014 2014-05-26 /pmc/articles/PMC4058251/ /pubmed/24971386 http://dx.doi.org/10.1155/2014/625342 Text en Copyright © 2014 Jieming Yang et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Yang, Jieming
Qu, Zhaoyang
Liu, Zhiying
Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title_full Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title_fullStr Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title_full_unstemmed Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title_short Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization
title_sort improved feature-selection method considering the imbalance problem in text categorization
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4058251/
https://www.ncbi.nlm.nih.gov/pubmed/24971386
http://dx.doi.org/10.1155/2014/625342
work_keys_str_mv AT yangjieming improvedfeatureselectionmethodconsideringtheimbalanceproblemintextcategorization
AT quzhaoyang improvedfeatureselectionmethodconsideringtheimbalanceproblemintextcategorization
AT liuzhiying improvedfeatureselectionmethodconsideringtheimbalanceproblemintextcategorization