Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming

The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage big data for unstructured documents effectively and efficiently, text categorization has been employed in recent decades. To conduct text categorizat...

Bibliographic Details
Main authors: Lim, Hyunki; Kim, Dae-Won
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516869/
https://www.ncbi.nlm.nih.gov/pubmed/33286170
http://dx.doi.org/10.3390/e22040395
author Lim, Hyunki
Kim, Dae-Won
collection PubMed
description The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage big data for unstructured documents effectively and efficiently, text categorization has been employed in recent decades. To conduct text categorization tasks, documents are usually represented using the bag-of-words model, owing to its simplicity. In this representation, feature selection becomes essential because the terms in the vocabulary induce an enormous feature space for the documents. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured using a general method such as mutual information, and serves as a second measure in feature selection in addition to term ranking. To balance term ranking and term similarity during feature selection, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
format Online
Article
Text
id pubmed-7516869
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7516869 2020-11-09 Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming Lim, Hyunki Kim, Dae-Won Entropy (Basel) Article MDPI 2020-03-30 /pmc/articles/PMC7516869/ /pubmed/33286170 http://dx.doi.org/10.3390/e22040395 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516869/
https://www.ncbi.nlm.nih.gov/pubmed/33286170
http://dx.doi.org/10.3390/e22040395
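
The description above states that the method balances term ranking against term similarity with quadratic programming, but this record does not give the formulation. The sketch below is therefore only an illustration under stated assumptions, not the authors' algorithm: it assumes a QPFS-style objective, minimizing (1 - alpha)/2 * x^T Q x - alpha * f^T x over non-negative weights x that sum to 1, where f holds per-term ranking scores and Q holds pairwise term similarities. The names `qp_term_weights`, `project_to_simplex`, `alpha`, and the toy numbers are all hypothetical, and Q is assumed symmetric positive semidefinite so that the problem is convex.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    n = v.size
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]  # last index that stays positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def qp_term_weights(relevance, similarity, alpha=0.5, steps=2000):
    """Projected gradient descent on the assumed objective
    (1 - alpha)/2 * x^T Q x - alpha * f^T x over the probability simplex."""
    f = np.asarray(relevance, dtype=float)
    Q = np.asarray(similarity, dtype=float)
    Q = (Q + Q.T) / 2.0                       # enforce symmetry
    x = np.full(f.size, 1.0 / f.size)         # start from uniform weights
    # Step size 1/L, with L the Lipschitz constant of the gradient.
    lipschitz = (1 - alpha) * np.abs(np.linalg.eigvalsh(Q)).max()
    lr = 1.0 / max(lipschitz, 1e-12)
    for _ in range(steps):
        grad = (1 - alpha) * Q @ x - alpha * f
        x = project_to_simplex(x - lr * grad)
    return x


# Hypothetical toy data: terms 0 and 1 are nearly redundant, term 2 is distinct.
relevance = np.array([0.90, 0.85, 0.60])
similarity = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

weights = qp_term_weights(relevance, similarity, alpha=0.5)
ranking = np.argsort(-weights)                # term indices, highest weight first
print(weights, ranking)
```

In this toy setting, term 1 has a higher ranking score than term 2 but ends up with a lower weight, because it is highly similar to the already favored term 0. In practice the similarity matrix could be filled with pairwise mutual information, as the description suggests, and the top-weighted terms kept as the selected features.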