LDA filter: A Latent Dirichlet Allocation preprocess method for Weka
This work presents an alternative method to represent documents based on LDA (Latent Dirichlet Allocation) and how it affects classification algorithms, compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over...
Main Authors: | Celard, P.; Vieira, A. Seara; Iglesias, E. L.; Borrajo, L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652301/ https://www.ncbi.nlm.nih.gov/pubmed/33166342 http://dx.doi.org/10.1371/journal.pone.0241701 |
_version_ | 1783607683707830272 |
---|---|
author | Celard, P. Vieira, A. Seara Iglesias, E. L. Borrajo, L. |
author_facet | Celard, P. Vieira, A. Seara Iglesias, E. L. Borrajo, L. |
author_sort | Celard, P. |
collection | PubMed |
description | This work presents an alternative method to represent documents based on LDA (Latent Dirichlet Allocation) and how it affects classification algorithms, compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as an extension of the Weka software, in the form of a new filter. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW but greatly improves classification processing times. |
format | Online Article Text |
id | pubmed-7652301 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-7652301 2020-11-18 LDA filter: A Latent Dirichlet Allocation preprocess method for Weka Celard, P. Vieira, A. Seara Iglesias, E. L. Borrajo, L. PLoS One Research Article This work presents an alternative method to represent documents based on LDA (Latent Dirichlet Allocation) and how it affects classification algorithms, compared with common text representations. LDA assumes that each document deals with a set of predefined topics, which are distributions over an entire vocabulary. Our main objective is to use the probability of a document belonging to each topic to implement a new text representation model. The proposed technique is deployed as an extension of the Weka software, in the form of a new filter. To demonstrate its performance, the filter is tested with different classifiers, such as a Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes, on different document corpora (OHSUMED, Reuters-21578, 20Newsgroup, Yahoo! Answers, YELP Polarity, and TREC Genomics 2015). It is then compared with the Bag of Words (BoW) representation technique. Results suggest that applying the proposed filter achieves accuracy similar to BoW but greatly improves classification processing times. Public Library of Science 2020-11-09 /pmc/articles/PMC7652301/ /pubmed/33166342 http://dx.doi.org/10.1371/journal.pone.0241701 Text en © 2020 Celard et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Celard, P. Vieira, A. Seara Iglesias, E. L. Borrajo, L. LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title | LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title_full | LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title_fullStr | LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title_full_unstemmed | LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title_short | LDA filter: A Latent Dirichlet Allocation preprocess method for Weka |
title_sort | lda filter: a latent dirichlet allocation preprocess method for weka |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7652301/ https://www.ncbi.nlm.nih.gov/pubmed/33166342 http://dx.doi.org/10.1371/journal.pone.0241701 |
work_keys_str_mv | AT celardp ldafilteralatentdirichletallocationpreprocessmethodforweka AT vieiraaseara ldafilteralatentdirichletallocationpreprocessmethodforweka AT iglesiasel ldafilteralatentdirichletallocationpreprocessmethodforweka AT borrajol ldafilteralatentdirichletallocationpreprocessmethodforweka |
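
The description above contrasts a Bag of Words vector with a representation built from the probability of a document belonging to each topic. The sketch below illustrates that contrast only; it is not the authors' Weka filter. The vocabulary, the two topic-word distributions, and the function names are hypothetical, and the topic proportions are estimated with a simple EM "fold-in" over fixed topics, swapped in here for full LDA inference.

```python
from collections import Counter

# Hypothetical toy vocabulary and topic-word distributions (each row sums to 1).
# In a real setting these would be learned from a corpus by LDA.
VOCAB = ["gene", "protein", "cell", "market", "stock", "price"]
TOPICS = [
    [0.30, 0.30, 0.30, 0.04, 0.03, 0.03],  # biology-like topic
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],  # finance-like topic
]


def bag_of_words(tokens):
    """Return one count per vocabulary word (length == len(VOCAB))."""
    counts = Counter(tokens)
    return [counts[word] for word in VOCAB]


def topic_proportions(tokens, topics=TOPICS, iterations=50):
    """Estimate one weight per topic (length == len(topics)).

    Assumption: topics are fixed and only the document's mixture is
    re-estimated by EM. This is a simplification, not full LDA inference.
    """
    counts = bag_of_words(tokens)
    k = len(topics)
    theta = [1.0 / k] * k  # start from a uniform mixture
    total = sum(counts)
    if total == 0:
        return theta  # no in-vocabulary words: keep the uniform mixture
    for _ in range(iterations):
        new_theta = [0.0] * k
        for w, c in enumerate(counts):
            if c == 0:
                continue
            # E-step: responsibility of each topic for word w
            weights = [theta[t] * topics[t][w] for t in range(k)]
            norm = sum(weights)
            for t in range(k):
                new_theta[t] += c * weights[t] / norm
        # M-step: renormalize so the proportions sum to 1
        theta = [value / total for value in new_theta]
    return theta


doc = "gene protein cell cell stock".split()
bow_vector = bag_of_words(doc)         # one feature per vocabulary word
topic_vector = topic_proportions(doc)  # one feature per topic
```

The Bag of Words vector grows with the vocabulary, while the topic vector has one entry per topic, so a classifier receives far fewer features per document. That smaller feature count is consistent with the shorter classification times the description reports, although this toy example does not measure them.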