Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents
Contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution makes it uncertain which class an N-gram belongs to. In this paper, we focus on selecting the most discriminating...
Main authors: | Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2016 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4929121/ https://www.ncbi.nlm.nih.gov/pubmed/27386386 http://dx.doi.org/10.1186/s40064-016-2573-y |
_version_ | 1782440556609142784 |
---|---|
author | Agnihotri, Deepak Verma, Kesari Tripathi, Priyanka |
author_sort | Agnihotri, Deepak |
collection | PubMed |
description | Contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution makes it uncertain which class an N-gram belongs to. In this paper, we focus on selecting the most discriminating N-grams by reducing the effects of symmetrical distribution. To this end, we propose a new text feature selection method, the symmetrical strength of N-grams (SSNG), within a two-pass filtering (TPF) feature selection approach. In the first pass of the TPF, the SSNG method chooses informative N-grams from all N-grams extracted from the corpus. In the second pass, the well-known chi-square (χ²) method selects the few most informative N-grams. To classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, are applied to ten standard text data sets. On most of the data sets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection, and χ². |
format | Online Article Text |
id | pubmed-4929121 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-4929121 2016-07-06 Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents Agnihotri, Deepak Verma, Kesari Tripathi, Priyanka Springerplus Research Springer International Publishing 2016-06-30 /pmc/articles/PMC4929121/ /pubmed/27386386 http://dx.doi.org/10.1186/s40064-016-2573-y Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. |
title | Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4929121/ https://www.ncbi.nlm.nih.gov/pubmed/27386386 http://dx.doi.org/10.1186/s40064-016-2573-y |
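
The two-pass filtering idea summarized in the description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record does not give the SSNG formula, so the first pass below uses a hypothetical stand-in score (`asymmetry`, the spread of per-class document rates) that merely drops symmetrically distributed N-grams. The second pass ranks the survivors with a standard Pearson chi-square statistic. All function names, the threshold, and the toy corpus are invented for the example and do not come from the paper.

```python
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return the set of contiguous word N-grams (N = 1..n_max) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)}


def doc_freq(docs, labels):
    """Count, per class, how many documents contain each N-gram."""
    df = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for gram in ngrams(text):
            df[gram][label] += 1
    return df


def asymmetry(counts, class_sizes):
    """Hypothetical first-pass score (NOT the SSNG formula from the paper).

    Spread between the highest and lowest per-class document rate;
    0.0 means the N-gram is distributed perfectly symmetrically.
    """
    rates = [counts[c] / size for c, size in class_sizes.items()]
    return max(rates) - min(rates)


def chi_square(counts, class_sizes):
    """Pearson chi-square of the (N-gram present/absent) x class table."""
    total = sum(class_sizes.values())
    present = sum(counts.values())
    score = 0.0
    for c, size in class_sizes.items():
        cells = ((counts[c], present), (size - counts[c], total - present))
        for observed, row_total in cells:
            expected = row_total * size / total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score


def two_pass_select(docs, labels, min_asymmetry=0.5, top_k=3):
    """Pass 1 filters symmetric N-grams; pass 2 keeps the top_k by chi-square."""
    class_sizes = Counter(labels)
    df = doc_freq(docs, labels)
    survivors = {g: c for g, c in df.items()
                 if asymmetry(c, class_sizes) >= min_asymmetry}
    # Ties in chi-square are broken alphabetically to keep the result stable.
    ranked = sorted(survivors,
                    key=lambda g: (-chi_square(survivors[g], class_sizes), g))
    return ranked[:top_k]


# Toy corpus, invented for illustration only.
docs = ["cheap pills online", "cheap pills now",
        "meeting agenda online", "meeting notes now"]
labels = ["spam", "spam", "ham", "ham"]
selected = two_pass_select(docs, labels)
print(selected)
```

In this toy corpus, "online" and "now" occur equally often in both classes, so the first pass discards them before chi-square is computed. The selected N-grams would then be fed to a classifier such as Multinomial Naive Bayes or a linear SVM, which this sketch leaves out.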