
VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder


Bibliographic Details
Main Authors: Samanta, Soumitra, O’Hagan, Steve, Swainston, Neil, Roberts, Timothy J., Kell, Douglas B.
Format: Online Article Text
Language: English
Published: MDPI 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7435890/
https://www.ncbi.nlm.nih.gov/pubmed/32751155
http://dx.doi.org/10.3390/molecules25153446
author Samanta, Soumitra
O’Hagan, Steve
Swainston, Neil
Roberts, Timothy J.
Kell, Douglas B.
collection PubMed
description Molecular similarity is an elusive but core “unsupervised” cheminformatics concept, yet different “fingerprint” encodings of molecular structures return very different similarity values, even when using the same similarity metric. Each encoding may be of value when applied to other problems with objective or target functions, implying that a priori none are “better” than the others, nor than encoding-free metrics such as maximum common substructure (MCSS). We here introduce a novel approach to molecular similarity, in the form of a variational autoencoder (VAE). This learns the joint distribution p(z|x) where z is a latent vector and x are the (same) input/output data. It takes the form of a “bowtie”-shaped artificial neural network. In the middle is a “bottleneck layer” or latent vector in which inputs are transformed into, and represented as, a vector of numbers (encoding), with a reverse process (decoding) seeking to return the SMILES string that was the input. We train a VAE on over six million druglike molecules and natural products (including over one million in the final holdout set). The VAE vector distances provide a rapid and novel metric for molecular similarity that is both easily and rapidly calculated. We describe the method and its application to a typical similarity problem in cheminformatics.
format Online
Article
Text
id pubmed-7435890
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7435890 2020-08-24 VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder. Samanta, Soumitra; O’Hagan, Steve; Swainston, Neil; Roberts, Timothy J.; Kell, Douglas B. Molecules. Article. MDPI 2020-07-29. /pmc/articles/PMC7435890/ /pubmed/32751155 http://dx.doi.org/10.3390/molecules25153446 Text en. © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title VAE-Sim: A Novel Molecular Similarity Measure Based on a Variational Autoencoder
topic Article