Thematic clustering of text documents using an EM-based approach

Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to e...

Bibliographic Details
Main Authors:	Kim, Sun, Wilbur, W John
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3465205/ https://www.ncbi.nlm.nih.gov/pubmed/23046528 http://dx.doi.org/10.1186/2041-1480-3-S3-S6

_version_	1782245527228776448
author	Kim, Sun Wilbur, W John
author_facet	Kim, Sun Wilbur, W John
author_sort	Kim, Sun
collection	PubMed
description	Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well.
format	Online Article Text
id	pubmed-3465205
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3465205 2012-10-18 Thematic clustering of text documents using an EM-based approach Kim, Sun Wilbur, W John J Biomed Semantics Research Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. BioMed Central 2012-10-05 /pmc/articles/PMC3465205/ /pubmed/23046528 http://dx.doi.org/10.1186/2041-1480-3-S3-S6 Text en Copyright ©2012 The article is a work of the United States Government; Title 17 U.S.C. § 105 provides that copyright protection is not available for any work of the United States government in the United States; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Kim, Sun Wilbur, W John Thematic clustering of text documents using an EM-based approach
title	Thematic clustering of text documents using an EM-based approach
title_sort	thematic clustering of text documents using an em-based approach
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3465205/ https://www.ncbi.nlm.nih.gov/pubmed/23046528 http://dx.doi.org/10.1186/2041-1480-3-S3-S6
work_keys_str_mv	AT kimsun thematicclusteringoftextdocumentsusinganembasedapproach AT wilburwjohn thematicclusteringoftextdocumentsusinganembasedapproach
