
Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents

Contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution raises uncertainty about which class an N-gram belongs to. In this paper, we focused on the selection of most discriminating...


Bibliographic Details
Main authors: Agnihotri, Deepak, Verma, Kesari, Tripathi, Priyanka
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4929121/
https://www.ncbi.nlm.nih.gov/pubmed/27386386
http://dx.doi.org/10.1186/s40064-016-2573-y
author Agnihotri, Deepak
Verma, Kesari
Tripathi, Priyanka
collection PubMed
description Contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution raises uncertainty about which class an N-gram belongs to. In this paper, we focus on selecting the most discriminating N-grams by reducing the effects of symmetrical distribution. To this end, a new text feature selection method, the symmetrical strength of N-grams (SSNG), is proposed within a two-pass filtering (TPF) feature selection approach. In the first pass of TPF, the SSNG method chooses informative N-grams from all N-grams extracted from the corpus. In the second pass, the well-known chi-square (χ²) method selects the few most informative N-grams. To classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, are applied to ten standard text data sets. On most of the data sets, the experimental results indicate that the performance and success rate of the SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ².
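
The two-pass idea in the description above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record does not give the SSNG formula, so the first pass uses a stand-in score (`asymmetry`, a hypothetical helper) that simply drops N-grams spread evenly across classes. The second pass ranks the survivors with a standard 2x2 chi-square statistic. All function names, the threshold `first_pass_min`, and the toy corpus are assumptions made for the example; the classification step (Multinomial Naive Bayes, Linear SVM) is not shown.

```python
from collections import Counter, defaultdict

def ngrams(text, n_max=2):
    """Return the set of contiguous word N-grams (N = 1..n_max) in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def doc_freq(docs, labels, n_max=2):
    """Count, per N-gram, how many documents of each class contain it."""
    df = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for gram in ngrams(text, n_max):
            df[gram][label] += 1
    return df

def asymmetry(counts, class_sizes):
    """Stand-in first-pass score (an assumption, NOT the paper's SSNG formula):
    the largest share of the class-normalised document rate held by one class.
    An N-gram spread evenly over k classes scores 1/k; a one-class N-gram scores 1."""
    rates = [counts[cls] / class_sizes[cls] for cls in class_sizes]
    return max(rates) / sum(rates)

def chi_square(counts, class_sizes):
    """Largest 2x2 chi-square statistic of (N-gram present, class) over all classes."""
    n = sum(class_sizes.values())
    present = sum(counts.values())
    best = 0.0
    for cls, size in class_sizes.items():
        a = counts[cls]        # documents of this class containing the N-gram
        b = present - a        # documents of other classes containing it
        c = size - a           # documents of this class lacking it
        d = n - size - b       # documents of other classes lacking it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom:
            best = max(best, n * (a * d - b * c) ** 2 / denom)
    return best

def two_pass_filter(docs, labels, first_pass_min=0.75, top_k=4, n_max=2):
    """Pass 1 drops symmetrically distributed N-grams; pass 2 keeps the top_k by chi-square."""
    class_sizes = Counter(labels)
    df = doc_freq(docs, labels, n_max)
    survivors = {g: c for g, c in df.items()
                 if asymmetry(c, class_sizes) >= first_pass_min}
    # Sort by descending chi-square, breaking ties alphabetically for determinism.
    ranked = sorted(survivors,
                    key=lambda g: (-chi_square(survivors[g], class_sizes), g))
    return ranked[:top_k]

# Toy corpus, invented for illustration only.
docs = ["the match ended in a draw", "the team won the match",
        "the market fell sharply", "the market rallied today"]
labels = ["sport", "sport", "finance", "finance"]
selected = two_pass_filter(docs, labels)
print(selected)
```

In this toy corpus the word "the" occurs equally often in both classes, so the first pass removes it before chi-square is computed at all. That is the intent of a two-pass design: the second, more selective ranking only sees candidates that already lean toward some class.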
format Online
Article
Text
id pubmed-4929121
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4929121 2016-07-06 Springerplus Research
Springer International Publishing 2016-06-30 /pmc/articles/PMC4929121/ /pubmed/27386386 http://dx.doi.org/10.1186/s40064-016-2573-y Text en © The Author(s) 2016. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents
topic Research