Cargando…

Distributed query-aware quantization for high-dimensional similarity searches

The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between mult...

Descripción completa

Detalles Bibliográficos
Autores principales:	Guzun, Gheorghi, Canahuate, Guadalupe
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	2018
Materias:	Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946695/ https://www.ncbi.nlm.nih.gov/pubmed/29756125 http://dx.doi.org/10.5441/002/edbt.2018.33

_version_	1783322252149784576
author	Guzun, Gheorghi Canahuate, Guadalupe
author_facet	Guzun, Gheorghi Canahuate, Guadalupe
author_sort	Guzun, Gheorghi
collection	PubMed
description	The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic. There is potential to improve accuracy when a query-dependent quantization is used. In this paper we propose a Query dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions.
format	Online Article Text
id	pubmed-5946695
institution	National Center for Biotechnology Information
language	English
publishDate	2018
record_format	MEDLINE/PubMed
spelling	pubmed-59466952018-05-11 Distributed query-aware quantization for high-dimensional similarity searches Guzun, Gheorghi Canahuate, Guadalupe Adv Database Technol Article The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest Neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic. There is potential to improve accuracy when a query-dependent quantization is used. In this paper we propose a Query dependent Equi-Depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. 2018-03 /pmc/articles/PMC5946695/ /pubmed/29756125 http://dx.doi.org/10.5441/002/edbt.2018.33 Text en Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/) .
spellingShingle	Article Guzun, Gheorghi Canahuate, Guadalupe Distributed query-aware quantization for high-dimensional similarity searches
title	Distributed query-aware quantization for high-dimensional similarity searches
title_full	Distributed query-aware quantization for high-dimensional similarity searches
title_fullStr	Distributed query-aware quantization for high-dimensional similarity searches
title_full_unstemmed	Distributed query-aware quantization for high-dimensional similarity searches
title_short	Distributed query-aware quantization for high-dimensional similarity searches
title_sort	distributed query-aware quantization for high-dimensional similarity searches
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946695/ https://www.ncbi.nlm.nih.gov/pubmed/29756125 http://dx.doi.org/10.5441/002/edbt.2018.33
work_keys_str_mv	AT guzungheorghi distributedqueryawarequantizationforhighdimensionalsimilaritysearches AT canahuateguadalupe distributedqueryawarequantizationforhighdimensionalsimilaritysearches

Distributed query-aware quantization for high-dimensional similarity searches

Ejemplares similares