Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints

Methods for dimensionality reduction are showing significant contributions to knowledge generation in high-dimensional modeling scenarios throughout many disciplines. By achieving a lower dimensional representation (also called embedding), fewer computing resources are needed in downstream machine l...

Bibliographic Details
Main Authors: Lovrić, Mario; Đuričić, Tomislav; Tran, Han T. N.; Hussain, Hussain; Lacić, Emanuel; Rasmussen, Morten A.; Kern, Roman
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8400160/
https://www.ncbi.nlm.nih.gov/pubmed/34451855
http://dx.doi.org/10.3390/ph14080758
_version_ 1783745249146830848
author Lovrić, Mario
Đuričić, Tomislav
Tran, Han T. N.
Hussain, Hussain
Lacić, Emanuel
Rasmussen, Morten A.
Kern, Roman
author_sort Lovrić, Mario
collection PubMed
description Methods for dimensionality reduction are showing significant contributions to knowledge generation in high-dimensional modeling scenarios throughout many disciplines. By achieving a lower dimensional representation (also called embedding), fewer computing resources are needed in downstream machine learning tasks, thus leading to a faster training time, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques (principal component analysis—PCA, uniform manifold approximation and projection—UMAP, and variational autoencoders—VAEs) for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not utilize additional pre-preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities, and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be utilized alongside known techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative model of VAE shows an advantage in pre-compressing the data with respect to classification accuracy.
format Online
Article
Text
id pubmed-8400160
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-84001602021-08-29 Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints Lovrić, Mario Đuričić, Tomislav Tran, Han T. N. Hussain, Hussain Lacić, Emanuel Rasmussen, Morten A. Kern, Roman Pharmaceuticals (Basel) Article Methods for dimensionality reduction are showing significant contributions to knowledge generation in high-dimensional modeling scenarios throughout many disciplines. By achieving a lower dimensional representation (also called embedding), fewer computing resources are needed in downstream machine learning tasks, thus leading to a faster training time, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques (principal component analysis—PCA, uniform manifold approximation and projection—UMAP, and variational autoencoders—VAEs) for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not utilize additional pre-preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities, and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be utilized alongside known techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative model of VAE shows an advantage in pre-compressing the data with respect to classification accuracy. MDPI 2021-08-02 /pmc/articles/PMC8400160/ /pubmed/34451855 http://dx.doi.org/10.3390/ph14080758 Text en © 2021 by the authors. 
https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8400160/
https://www.ncbi.nlm.nih.gov/pubmed/34451855
http://dx.doi.org/10.3390/ph14080758