Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics

Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory...

Bibliographic Details
Main authors:	Luo, Le; Li, Li
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886981/ https://www.ncbi.nlm.nih.gov/pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119

_version_	1782478948697899008
author	Luo, Le Li, Li
author_facet	Luo, Le Li, Li
author_sort	Luo, Le
collection	PubMed
description	Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications.
format	Online Article Text
id	pubmed-3886981
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-38869812014-01-10 Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics Luo, Le Li, Li PLoS One Research Article Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. Public Library of Science 2014-01-09 /pmc/articles/PMC3886981/ /pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119 Text en © 2014 Luo, Li http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Luo, Le Li, Li Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title	Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title_full	Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title_fullStr	Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title_full_unstemmed	Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title_short	Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
title_sort	defining and evaluating classification algorithm for high-dimensional data based on latent topics
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3886981/ https://www.ncbi.nlm.nih.gov/pubmed/24416136 http://dx.doi.org/10.1371/journal.pone.0082119
work_keys_str_mv	AT luole definingandevaluatingclassificationalgorithmforhighdimensionaldatabasedonlatenttopics AT lili definingandevaluatingclassificationalgorithmforhighdimensionaldatabasedonlatenttopics
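
The description field in this record reports results in terms of precision, recall and F1 measure. As a minimal sketch of how those three scores relate, the following pure-Python example computes them per class from true and predicted labels. The label names, the helper names `per_class_scores` and `macro_f1`, and the choice of macro averaging are assumptions made for illustration; the record does not say how the article averages its scores.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # documents assigned to the class
        actual = tp[label] + fn[label]      # documents that belong to the class
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (an assumed averaging scheme)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical labels, not taken from 20 Newsgroups or Reuters-21578.
y_true = ["sport", "sport", "politics", "politics", "tech", "tech"]
y_pred = ["sport", "politics", "politics", "politics", "tech", "sport"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro F1={macro_f1(scores):.2f}")
```

Macro averaging weights every class equally; micro averaging, which pools the counts across classes before dividing, would weight every document equally instead.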
