Cargando…

An index-based algorithm for fast on-line query processing of latent semantic analysis

Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line qu...

Descripción completa

Detalles Bibliográficos
Autores principales:	Zhang, Mingxi, Li, Pohan, Wang, Wei
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2017
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5433746/ https://www.ncbi.nlm.nih.gov/pubmed/28520747 http://dx.doi.org/10.1371/journal.pone.0177523

_version_	1783236914031099904
author	Zhang, Mingxi Li, Pohan Wang, Wei
author_facet	Zhang, Mingxi Li, Pohan Wang, Wei
author_sort	Zhang, Mingxi
collection	PubMed
description	Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line query processing, which is expensive in terms of time cost and cannot efficiently response the query request especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA towards efficiently searching the similar documents to a given query. We rewrite the similarity equation of LSA combined with an intermediate value called partial similarity that is stored in a designed index called partial index. For reducing the searching space, we give an approximate form of similarity equation, and then develop an efficient algorithm for building partial index, which skips the partial similarities lower than a given threshold θ. Based on partial index, we develop an efficient algorithm called ILSA for supporting fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponds to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning the candidate documents that are not promising and skipping the operations that make little contribution to similarity scores. Extensive experiments through comparison with LSA have been done, which demonstrate the efficiency and effectiveness of our proposed algorithm.
format	Online Article Text
id	pubmed-5433746
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-54337462017-05-26 An index-based algorithm for fast on-line query processing of latent semantic analysis Zhang, Mingxi Li, Pohan Wang, Wei PLoS One Research Article Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line query processing, which is expensive in terms of time cost and cannot efficiently response the query request especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA towards efficiently searching the similar documents to a given query. We rewrite the similarity equation of LSA combined with an intermediate value called partial similarity that is stored in a designed index called partial index. For reducing the searching space, we give an approximate form of similarity equation, and then develop an efficient algorithm for building partial index, which skips the partial similarities lower than a given threshold θ. Based on partial index, we develop an efficient algorithm called ILSA for supporting fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponds to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning the candidate documents that are not promising and skipping the operations that make little contribution to similarity scores. Extensive experiments through comparison with LSA have been done, which demonstrate the efficiency and effectiveness of our proposed algorithm. Public Library of Science 2017-05-16 /pmc/articles/PMC5433746/ /pubmed/28520747 http://dx.doi.org/10.1371/journal.pone.0177523 Text en © 2017 Zhang et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Zhang, Mingxi Li, Pohan Wang, Wei An index-based algorithm for fast on-line query processing of latent semantic analysis
title	An index-based algorithm for fast on-line query processing of latent semantic analysis
title_full	An index-based algorithm for fast on-line query processing of latent semantic analysis
title_fullStr	An index-based algorithm for fast on-line query processing of latent semantic analysis
title_full_unstemmed	An index-based algorithm for fast on-line query processing of latent semantic analysis
title_short	An index-based algorithm for fast on-line query processing of latent semantic analysis
title_sort	index-based algorithm for fast on-line query processing of latent semantic analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5433746/ https://www.ncbi.nlm.nih.gov/pubmed/28520747 http://dx.doi.org/10.1371/journal.pone.0177523
work_keys_str_mv	AT zhangmingxi anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis AT lipohan anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis AT wangwei anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis AT zhangmingxi indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis AT lipohan indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis AT wangwei indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis

An index-based algorithm for fast on-line query processing of latent semantic analysis

Ejemplares similares