Cargando…

An index-based algorithm for fast on-line query processing of latent semantic analysis

Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line qu...

Descripción completa

Detalles Bibliográficos
Autores principales: Zhang, Mingxi, Li, Pohan, Wang, Wei
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5433746/
https://www.ncbi.nlm.nih.gov/pubmed/28520747
http://dx.doi.org/10.1371/journal.pone.0177523
_version_ 1783236914031099904
author Zhang, Mingxi
Li, Pohan
Wang, Wei
author_facet Zhang, Mingxi
Li, Pohan
Wang, Wei
author_sort Zhang, Mingxi
collection PubMed
description Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line query processing, which is expensive in terms of time cost and cannot efficiently response the query request especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA towards efficiently searching the similar documents to a given query. We rewrite the similarity equation of LSA combined with an intermediate value called partial similarity that is stored in a designed index called partial index. For reducing the searching space, we give an approximate form of similarity equation, and then develop an efficient algorithm for building partial index, which skips the partial similarities lower than a given threshold θ. Based on partial index, we develop an efficient algorithm called ILSA for supporting fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponds to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning the candidate documents that are not promising and skipping the operations that make little contribution to similarity scores. Extensive experiments through comparison with LSA have been done, which demonstrate the efficiency and effectiveness of our proposed algorithm.
format Online
Article
Text
id pubmed-5433746
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-54337462017-05-26 An index-based algorithm for fast on-line query processing of latent semantic analysis Zhang, Mingxi Li, Pohan Wang, Wei PLoS One Research Article Latent Semantic Analysis (LSA) is widely used for finding the documents whose semantic is similar to the query of keywords. Although LSA yield promising similar results, the existing LSA algorithms involve lots of unnecessary operations in similarity computation and candidate check during on-line query processing, which is expensive in terms of time cost and cannot efficiently response the query request especially when the dataset becomes large. In this paper, we study the efficiency problem of on-line query processing for LSA towards efficiently searching the similar documents to a given query. We rewrite the similarity equation of LSA combined with an intermediate value called partial similarity that is stored in a designed index called partial index. For reducing the searching space, we give an approximate form of similarity equation, and then develop an efficient algorithm for building partial index, which skips the partial similarities lower than a given threshold θ. Based on partial index, we develop an efficient algorithm called ILSA for supporting fast on-line query processing. The given query is transformed into a pseudo document vector, and the similarities between query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes corresponds to non-zero entries in the pseudo document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning the candidate documents that are not promising and skipping the operations that make little contribution to similarity scores. Extensive experiments through comparison with LSA have been done, which demonstrate the efficiency and effectiveness of our proposed algorithm. Public Library of Science 2017-05-16 /pmc/articles/PMC5433746/ /pubmed/28520747 http://dx.doi.org/10.1371/journal.pone.0177523 Text en © 2017 Zhang et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Zhang, Mingxi
Li, Pohan
Wang, Wei
An index-based algorithm for fast on-line query processing of latent semantic analysis
title An index-based algorithm for fast on-line query processing of latent semantic analysis
title_full An index-based algorithm for fast on-line query processing of latent semantic analysis
title_fullStr An index-based algorithm for fast on-line query processing of latent semantic analysis
title_full_unstemmed An index-based algorithm for fast on-line query processing of latent semantic analysis
title_short An index-based algorithm for fast on-line query processing of latent semantic analysis
title_sort index-based algorithm for fast on-line query processing of latent semantic analysis
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5433746/
https://www.ncbi.nlm.nih.gov/pubmed/28520747
http://dx.doi.org/10.1371/journal.pone.0177523
work_keys_str_mv AT zhangmingxi anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis
AT lipohan anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis
AT wangwei anindexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis
AT zhangmingxi indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis
AT lipohan indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis
AT wangwei indexbasedalgorithmforfastonlinequeryprocessingoflatentsemanticanalysis