Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory...
Main authors: | Luo, Le; Li, Li |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886981/ https://www.ncbi.nlm.nih.gov/pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119 |
_version_ | 1782478948697899008 |
---|---|
author | Luo, Le Li, Li |
author_facet | Luo, Le Li, Li |
author_sort | Luo, Le |
collection | PubMed |
description | Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. |
format | Online Article Text |
id | pubmed-3886981 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-38869812014-01-10 Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics Luo, Le Li, Li PLoS One Research Article Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. Public Library of Science 2014-01-09 /pmc/articles/PMC3886981/ /pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119 Text en © 2014 Luo, Li http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Luo, Le Li, Li Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title | Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title_full | Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title_fullStr | Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title_full_unstemmed | Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title_short | Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics |
title_sort | defining and evaluating classification algorithm for high-dimensional data based on latent topics |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886981/ https://www.ncbi.nlm.nih.gov/pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119 |
work_keys_str_mv | AT luole definingandevaluatingclassificationalgorithmforhighdimensionaldatabasedonlatenttopics AT lili definingandevaluatingclassificationalgorithmforhighdimensionaldatabasedonlatenttopics |
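
The abstract in this record describes a two-stage pipeline: LDA reduces each document to a short vector of topic proportions, and an SVM classifies documents from those vectors, with precision, recall and F1 as the evaluation measures. The sketch below illustrates that idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the toy corpus, the function names (`lda_gibbs`, `train_linear_svm`, `predict`, `precision_recall_f1`), the hyperparameters, the use of collapsed Gibbs sampling for LDA, and a linear SVM trained by subgradient descent on the hinge loss. For brevity, topics are inferred over the whole toy corpus before the train/test split, which a real evaluation on 20 Newsgroups or Reuters-21578 would likely avoid.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                               # total words per topic
    assign = []
    for d, doc in enumerate(docs):                             # random initial assignment
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                               # remove current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                               # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions: the low-dimensional feature vector of each document.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Linear SVM (labels +1/-1) via subgradient descent on the regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # inside the margin: hinge term contributes to the gradient
                w = [wi - lr * (lam * wi - t * xi) for wi, xi in zip(w, x)]
                b += lr * t
            else:            # outside the margin: only the regularizer acts
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, X):
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1 for x in X]


def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy corpus: label +1 = sports, -1 = finance.
texts = [
    "team wins match goal", "player scores goal match", "coach team player season",
    "season match team wins", "bank raises interest rate", "stock market bank profit",
    "profit rate stock investor", "investor market interest bank",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

tokens = [t.split() for t in texts]
vocab = {w: i for i, w in enumerate(sorted({w for doc in tokens for w in doc}))}
docs = [[vocab[w] for w in doc] for doc in tokens]

theta = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))   # 16 word features -> 2 topics
train_idx, test_idx = [0, 1, 2, 4, 5, 6], [3, 7]
w, b = train_linear_svm([theta[i] for i in train_idx], [labels[i] for i in train_idx])
y_pred = predict(w, b, [theta[i] for i in test_idx])
metrics = precision_recall_f1([labels[i] for i in test_idx], y_pred)
print("precision, recall, F1:", metrics)
```

The point of the sketch is the shape of the data flow: the classifier never sees the raw word counts, only the `n_topics`-dimensional rows of `theta`, which is where the reduction in training time described in the abstract would come from.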