
Relevance popularity: A term event model based feature selection scheme for text classification

Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e...


Bibliographic Details
Main Authors: Feng, Guozhong, An, Baiguo, Yang, Fengqin, Wang, Han, Zhang, Libiao
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5381872/
https://www.ncbi.nlm.nih.gov/pubmed/28379986
http://dx.doi.org/10.1371/journal.pone.0174341
_version_ 1782520005925011456
author Feng, Guozhong
An, Baiguo
Yang, Fengqin
Wang, Han
Zhang, Libiao
collection PubMed
description Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained from using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods.
format Online
Article
Text
id pubmed-5381872
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5381872 2017-04-19 Relevance popularity: A term event model based feature selection scheme for text classification Feng, Guozhong An, Baiguo Yang, Fengqin Wang, Han Zhang, Libiao PLoS One Research Article Public Library of Science 2017-04-05 /pmc/articles/PMC5381872/ /pubmed/28379986 http://dx.doi.org/10.1371/journal.pone.0174341 Text en © 2017 Feng et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Relevance popularity: A term event model based feature selection scheme for text classification
topic Research Article