
WEClustering: word embeddings based text clustering technique for large datasets

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, and books, etc. Text clustering is a fundamental data mining technique to perform categorization, topic extraction, and information retrieval. Textual dat...


Bibliographic Details
Main Authors:	Mehta, Vivek; Bawa, Seema; Singh, Jasmeet
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2021
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8421191/ https://www.ncbi.nlm.nih.gov/pubmed/34777978 http://dx.doi.org/10.1007/s40747-021-00512-9

author	Mehta, Vivek; Bawa, Seema; Singh, Jasmeet
collection	PubMed
description	A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and so on. Text clustering is a fundamental data mining technique for categorization, topic extraction, and information retrieval. Textual datasets, especially those that contain a large number of documents, are sparse and have high dimensionality. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN cannot perform well on them. In this paper, a clustering technique especially suited to large text datasets is proposed that overcomes these limitations. The proposed technique, named WEClustering, is based on word embeddings derived from a recent deep learning model, “Bidirectional Encoder Representations from Transformers” (BERT). It deals effectively with the problem of high dimensionality and therefore forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed clustering technique gives a significant improvement over other techniques as measured by metrics such as Purity and Adjusted Rand Index.
format	Online Article Text
id	pubmed-8421191
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8421191 2021-09-07 WEClustering: word embeddings based text clustering technique for large datasets Mehta, Vivek; Bawa, Seema; Singh, Jasmeet Complex Intell Systems Original Article
Springer International Publishing 2021-09-07 2021 /pmc/articles/PMC8421191/ /pubmed/34777978 http://dx.doi.org/10.1007/s40747-021-00512-9 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title	WEClustering: word embeddings based text clustering technique for large datasets
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8421191/ https://www.ncbi.nlm.nih.gov/pubmed/34777978 http://dx.doi.org/10.1007/s40747-021-00512-9
