Relevance popularity: A term event model based feature selection scheme for text classification
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e...
Main Authors: | Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5381872/ https://www.ncbi.nlm.nih.gov/pubmed/28379986 http://dx.doi.org/10.1371/journal.pone.0174341 |
_version_ | 1782520005925011456 |
---|---|
author | Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao |
author_facet | Feng, Guozhong; An, Baiguo; Yang, Fengqin; Wang, Han; Zhang, Libiao |
author_sort | Feng, Guozhong |
collection | PubMed |
description | Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods. |
format | Online Article Text |
id | pubmed-5381872 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-53818722017-04-19 Relevance popularity: A term event model based feature selection scheme for text classification Feng, Guozhong An, Baiguo Yang, Fengqin Wang, Han Zhang, Libiao PLoS One Research Article Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text dataset (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods. Public Library of Science 2017-04-05 /pmc/articles/PMC5381872/ /pubmed/28379986 http://dx.doi.org/10.1371/journal.pone.0174341 Text en © 2017 Feng et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Feng, Guozhong An, Baiguo Yang, Fengqin Wang, Han Zhang, Libiao Relevance popularity: A term event model based feature selection scheme for text classification |
title | Relevance popularity: A term event model based feature selection scheme for text classification |
title_full | Relevance popularity: A term event model based feature selection scheme for text classification |
title_fullStr | Relevance popularity: A term event model based feature selection scheme for text classification |
title_full_unstemmed | Relevance popularity: A term event model based feature selection scheme for text classification |
title_short | Relevance popularity: A term event model based feature selection scheme for text classification |
title_sort | relevance popularity: a term event model based feature selection scheme for text classification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5381872/ https://www.ncbi.nlm.nih.gov/pubmed/28379986 http://dx.doi.org/10.1371/journal.pone.0174341 |
work_keys_str_mv | AT fengguozhong relevancepopularityatermeventmodelbasedfeatureselectionschemefortextclassification AT anbaiguo relevancepopularityatermeventmodelbasedfeatureselectionschemefortextclassification AT yangfengqin relevancepopularityatermeventmodelbasedfeatureselectionschemefortextclassification AT wanghan relevancepopularityatermeventmodelbasedfeatureselectionschemefortextclassification AT zhanglibiao relevancepopularityatermeventmodelbasedfeatureselectionschemefortextclassification |