
Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing

Motivation: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's larg...


Bibliographic Details
Main Authors:	Cao, Yiqun; Jiang, Tao; Girke, Thomas
Format:	Text
Language:	English
Published:	Oxford University Press 2010
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2844998/ https://www.ncbi.nlm.nih.gov/pubmed/20179075 http://dx.doi.org/10.1093/bioinformatics/btq067

_version_	1782179353471221760
author	Cao, Yiqun Jiang, Tao Girke, Thomas
author_facet	Cao, Yiqun Jiang, Tao Girke, Thomas
author_sort	Cao, Yiqun
collection	PubMed
description	Motivation: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries. Results: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general purpose similarity search method for finding objects with similar features in large databases and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm that combines the EI-Search method with Jarvis–Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40–200 times faster, while maintaining comparable recall rates. The EI-Clustering method allowed us to significantly reduce the CPU time required to cluster these large compound libraries from several months to only a few days. Availability: Software implementations and online services have been developed based on the methods introduced in this study. The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time. Contact: thomas.girke@ucr.edu Supplementary information: Supplementary data are available at Bioinformatics online.
format	Text
id	pubmed-2844998
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-28449982010-03-29 Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing Cao, Yiqun Jiang, Tao Girke, Thomas Bioinformatics Original Papers Motivation: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries. Results: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general purpose similarity search method for finding objects with similar features in large databases and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm that combines the EI-Search method with Jarvis–Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40–200 times faster, while maintaining comparable recall rates. The EI-Clustering method allowed us to significantly reduce the CPU time required to cluster these large compound libraries from several months to only a few days. Availability: Software implementations and online services have been developed based on the methods introduced in this study. 
The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time. Contact: thomas.girke@ucr.edu Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2010-04-01 2010-02-23 /pmc/articles/PMC2844998/ /pubmed/20179075 http://dx.doi.org/10.1093/bioinformatics/btq067 Text en © The Author(s) 2010. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Cao, Yiqun Jiang, Tao Girke, Thomas Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing
title	Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2844998/ https://www.ncbi.nlm.nih.gov/pubmed/20179075 http://dx.doi.org/10.1093/bioinformatics/btq067
work_keys_str_mv	AT caoyiqun acceleratedsimilaritysearchingandclusteringoflargecompoundsetsbygeometricembeddingandlocalitysensitivehashing AT jiangtao acceleratedsimilaritysearchingandclusteringoflargecompoundsetsbygeometricembeddingandlocalitysensitivehashing AT girkethomas acceleratedsimilaritysearchingandclusteringoflargecompoundsetsbygeometricembeddingandlocalitysensitivehashing
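
The abstract in the record above describes two steps: a nearest-neighbor search over compounds embedded in Euclidean space, driven by locality sensitive hashing (LSH), and Jarvis–Patrick clustering built on the resulting neighbor lists. The following is a minimal Python sketch of that general idea, not the authors' EI-Search or EI-Clustering implementation. It assumes the compounds have already been embedded as numeric vectors (the embedding step is not shown), uses a textbook random-projection hash for Euclidean distance, and applies one common variant of the Jarvis–Patrick rule (mutual neighbors sharing at least `min_shared` neighbors). The class name `EuclideanLSH`, the function `jarvis_patrick`, and all parameter values are hypothetical choices for illustration.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index for Euclidean distance (illustrative, not EI-Search).

    Each hash is h(v) = floor((a . v + b) / width) with Gaussian a and
    uniform b; a table key concatenates several such hashes.
    """

    def __init__(self, dim, n_tables=8, n_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One list of (a, b) hash functions per table.
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
             for _ in range(n_hashes)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.points = []

    def _key(self, table_funcs, v):
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.width)
            for a, b in table_funcs
        )

    def add(self, v):
        """Store a vector and register it in every hash table."""
        idx = len(self.points)
        self.points.append(tuple(v))
        for funcs, table in zip(self.funcs, self.tables):
            table[self._key(funcs, v)].append(idx)
        return idx

    def query(self, q, k=1):
        """Return up to k (distance, index) pairs among colliding points."""
        candidates = set()
        for funcs, table in zip(self.funcs, self.tables):
            candidates.update(table.get(self._key(funcs, q), ()))
        # Exact distances are computed only for the candidate set.
        return sorted((math.dist(q, self.points[i]), i) for i in candidates)[:k]


def jarvis_patrick(neighbors, min_shared):
    """Label items so that mutual neighbors sharing >= min_shared
    neighbors end up in the same cluster (union-find over such pairs)."""
    parent = list(range(len(neighbors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sets = [set(n) for n in neighbors]
    for i, nbrs in enumerate(sets):
        for j in nbrs:
            if j > i and i in sets[j] and len(nbrs & sets[j]) >= min_shared:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(neighbors))]


# Usage: index already-embedded points, collect neighbor lists, cluster.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.0), (9.2, 9.1), (9.1, 8.8)]
index = EuclideanLSH(dim=2)
for p in points:
    index.add(p)
# Ask for k=3 because every indexed point retrieves itself at distance 0.
neighbors = [[i for _, i in index.query(p, k=3) if i != n]
             for n, p in enumerate(points)]
labels = jarvis_patrick(neighbors, min_shared=1)
print(labels)
```

Because LSH only inspects points that collide with the query in at least one table, the neighbor lists are approximate: raising `n_tables` improves recall at the cost of memory, while raising `n_hashes` or lowering `width` makes buckets smaller and queries faster but less complete.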


Similar Items