The ineffectiveness of within-document term frequency in text classification

For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is common...

Bibliographic Details
Main Authors: Wilbur, W. John; Kim, Won
Format: Text
Language: English
Published: Springer Netherlands 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2744136/
https://www.ncbi.nlm.nih.gov/pubmed/19802376
http://dx.doi.org/10.1007/s10791-008-9069-5
author Wilbur, W. John
Kim, Won
collection PubMed
description For the purposes of classification it is common to represent a document as a bag of words. Such a representation consists of the individual terms making up the document together with the number of times each term appears in the document. All classification methods make use of the terms. It is also common to make use of the local term frequencies, at the price of some added complication in the model. Examples are the naïve Bayes multinomial model (MM), the Dirichlet compound multinomial model (DCM) and the exponential-family approximation of the DCM (EDCM), as well as support vector machines (SVM). Although it is usually claimed that incorporating local word frequency in a document improves text classification performance, we test here whether such claims hold. In this paper we show experimentally that simplified forms of the MM, EDCM, and SVM models which ignore the frequency of each word in a document perform at about the same level as the MM, DCM, EDCM and SVM models which incorporate local term frequency. We also present a new form of the naïve Bayes multivariate Bernoulli model (MBM) which is able to make use of local term frequency, and show again that it offers no significant advantage over the plain MBM. We conclude that word burstiness is so strong that additional occurrences of a word add essentially no useful information to a classifier.
format Text
id pubmed-2744136
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-2744136 2009-10-01 The ineffectiveness of within-document term frequency in text classification Wilbur, W. John Kim, Won Inf Retr Boston Article Springer Netherlands 2008-09-21 2009 /pmc/articles/PMC2744136/ /pubmed/19802376 http://dx.doi.org/10.1007/s10791-008-9069-5 Text en © The Author(s) 2008 https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
title The ineffectiveness of within-document term frequency in text classification
topic Article
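
The abstract contrasts bag-of-words representations that keep within-document term counts with simplified ones that record only whether a term is present. The following is a minimal sketch of that contrast, not the authors' implementation or experimental setup: the function names (`bag_of_words`, `binarize`, `train_multinomial_nb`, `classify`), the whitespace tokenizer, the Laplace smoothing constant and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map each lowercased whitespace token to its within-document count."""
    return Counter(text.lower().split())


def binarize(bag):
    """Drop local term frequency: every term that occurs gets a count of 1."""
    return Counter({term: 1 for term in bag})


def train_multinomial_nb(labeled_bags, alpha=1.0):
    """Estimate log P(class) and Laplace-smoothed log P(term | class).

    Hypothetical helper: a plain multinomial naive Bayes fit, usable with
    either frequency-aware or binarized bags.
    """
    vocab = {term for bag, _ in labeled_bags for term in bag}
    docs_per_class = Counter(label for _, label in labeled_bags)
    term_counts = {label: Counter() for label in docs_per_class}
    for bag, label in labeled_bags:
        term_counts[label].update(bag)

    model = {}
    for label, n_docs in docs_per_class.items():
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labeled_bags))
        log_probs = {
            term: math.log((term_counts[label][term] + alpha) / total)
            for term in vocab
        }
        model[label] = (log_prior, log_probs)
    return model


def classify(model, bag):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(
            count * log_probs[term]
            for term, count in bag.items()
            if term in log_probs
        )
    return max(model, key=score)


# Toy training data, invented purely for illustration.
docs = [
    ("gene protein protein binding", "bio"),
    ("protein gene expression gene", "bio"),
    ("market stock stock price", "fin"),
    ("price market trade", "fin"),
]
count_bags = [(bag_of_words(text), label) for text, label in docs]
binary_bags = [(binarize(bag), label) for bag, label in count_bags]

tf_model = train_multinomial_nb(count_bags)       # uses local term frequency
binary_model = train_multinomial_nb(binary_bags)  # ignores local term frequency

query = bag_of_words("gene protein gene")
print(classify(tf_model, query), classify(binary_model, binarize(query)))
```

Running both models over a real labelled collection and comparing their accuracy would be one way to probe the abstract's claim on other data; the toy set here is far too small to support any conclusion of its own.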