Unsupervised Representation Learning for Proteochemometric Modeling

In silico protein–ligand binding prediction is an ongoing area of research in computational chemistry and machine-learning-based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Pro...

Bibliographic Details
Main authors:	Kim, Paul T.; Winter, Robin; Clevert, Djork-Arné
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8657702/ https://www.ncbi.nlm.nih.gov/pubmed/34884688 http://dx.doi.org/10.3390/ijms222312882

author	Kim, Paul T.; Winter, Robin; Clevert, Djork-Arné
collection	PubMed
description	In silico protein–ligand binding prediction is an ongoing area of research in computational chemistry and machine-learning-based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to create an accurate model of the protein–ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform, and computer-interpretable representations of proteins and ligands. Previous PCM studies rely on pre-defined, handcrafted feature extraction methods, and many use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings that outperform complex, human-engineered representations. Several embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein–ligand binding activities, and find that representations learned without supervision significantly outperform handcrafted representations.
format	Online Article Text
id	pubmed-8657702
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8657702 2021-12-10 Unsupervised Representation Learning for Proteochemometric Modeling. Kim, Paul T.; Winter, Robin; Clevert, Djork-Arné. Int J Mol Sci. Article. MDPI 2021-11-28. /pmc/articles/PMC8657702/ /pubmed/34884688 http://dx.doi.org/10.3390/ijms222312882 Text en. © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Unsupervised Representation Learning for Proteochemometric Modeling
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8657702/ https://www.ncbi.nlm.nih.gov/pubmed/34884688 http://dx.doi.org/10.3390/ijms222312882
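
The description above says that PCM combines explicit protein and ligand descriptors into one model of the interaction space. A minimal sketch of that idea follows. It rests on several assumptions that are not in this record: the two descriptors are joined by simple concatenation, a k-nearest-neighbour regressor stands in for whatever model the authors trained, and every vector and activity value is a made-up placeholder rather than data from the article.

```python
import math

def pair_features(protein_vec, compound_vec):
    """Join one protein vector and one compound vector into a single input.

    Concatenation is an assumption; the record does not say how the
    descriptors are combined.
    """
    return list(protein_vec) + list(compound_vec)

def knn_predict(train_X, train_y, query, k=3):
    """Average the labels of the k training pairs closest to the query."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical fixed-length embeddings (placeholders, not real embeddings).
proteins = {"P1": [0.0, 1.0], "P2": [1.0, 0.0]}
compounds = {"C1": [0.2, 0.1, 0.9], "C2": [0.8, 0.7, 0.1]}

# Hypothetical measured activities for three protein-compound pairs.
measured = [("P1", "C1", 7.5), ("P1", "C2", 5.0), ("P2", "C1", 6.0)]

train_X = [pair_features(proteins[p], compounds[c]) for p, c, _ in measured]
train_y = [activity for _, _, activity in measured]

# Predict the one pair that was never measured.
query = pair_features(proteins["P2"], compounds["C2"])
prediction = knn_predict(train_X, train_y, query, k=1)
print(len(query), prediction)
```

Because each input row describes a pair rather than a compound alone, the same model can be asked about a protein–compound combination that was never measured, which is the property the description attributes to PCM.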
