LDA filter: A Latent Dirichlet Allocation preprocess method for Weka

This work presents an alternative method of representing documents based on Latent Dirichlet Allocation (LDA) and examines how it affects classification algorithms compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over...

Bibliographic Details
Main authors: Celard, P., Vieira, A. Seara, Iglesias, E. L., Borrajo, L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652301/
https://www.ncbi.nlm.nih.gov/pubmed/33166342
http://dx.doi.org/10.1371/journal.pone.0241701
collection PubMed
description This work presents an alternative method of representing documents based on Latent Dirichlet Allocation (LDA) and examines how it affects classification algorithms compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as a new filter that extends the Weka software. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that the proposed filter achieves accuracy similar to that of BoW while greatly reducing classification processing time.
id pubmed-7652301
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7652301 2020-11-18 LDA filter: A Latent Dirichlet Allocation preprocess method for Weka Celard, P. Vieira, A. Seara Iglesias, E. L. Borrajo, L. PLoS One Research Article Public Library of Science 2020-11-09 /pmc/articles/PMC7652301/ /pubmed/33166342 http://dx.doi.org/10.1371/journal.pone.0241701 Text en © 2020 Celard et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article
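
The abstract describes representing each document by its probability of belonging to each LDA topic, rather than by word counts. The following is a minimal Python sketch of that idea only; it is not the authors' Weka filter. The function name `lda_topic_features`, the use of collapsed Gibbs sampling, and the hyperparameter values (`alpha`, `beta`, `iters`) are all assumptions chosen for illustration, since the record does not say how the filter performs inference.

```python
import random


def lda_topic_features(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Return one topic-probability vector per tokenized document.

    Hypothetical helper: fits LDA with collapsed Gibbs sampling (an assumed
    inference method) and returns the smoothed document-topic proportions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions: the feature vector for each document.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        features.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return features


# Toy corpus (invented for illustration, not taken from the paper's datasets).
docs = [
    "gene protein cell gene dna".split(),
    "protein dna cell gene".split(),
    "market stock price trade".split(),
    "stock trade market price stock".split(),
]
features = lda_topic_features(docs)
for vec in features:
    print([round(p, 3) for p in vec])
```

Each document becomes a vector with `num_topics` entries instead of one entry per vocabulary word, so a downstream classifier works with far fewer attributes than under a Bag of Words representation. That reduction in dimensionality is a plausible reason for the shorter processing times the abstract reports, although the record itself does not state the cause.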