
Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification

Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide st...


Bibliographic Details
Main Authors: Hastings, Janna; Glauer, Martin; Memariani, Adel; Neuhaus, Fabian; Mossakowski, Till
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7962259/
https://www.ncbi.nlm.nih.gov/pubmed/33726837
http://dx.doi.org/10.1186/s13321-021-00500-8
author Hastings, Janna
Glauer, Martin
Memariani, Adel
Neuhaus, Fabian
Mossakowski, Till
collection PubMed
description Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide structured classifications of chemical entities that can be used for navigation and filtering of the large chemical space. ChEBI is a prominent example of a chemical ontology, widely used in life science contexts. However, ChEBI is manually maintained and as such cannot easily scale to the full scope of public chemical data. There is a need for tools that are able to automatically classify chemical data into chemical ontologies, which can be framed as a hierarchical multi-class classification problem. In this paper we evaluate machine learning approaches for this task, comparing different learning frameworks including logistic regression, decision trees and long short-term memory artificial neural networks, and different encoding approaches for the chemical structures, including cheminformatics fingerprints and character-based encoding from chemical line notation representations. We find that classical learning approaches such as logistic regression perform well with sets of relatively specific, disjoint chemical classes, while the neural network is able to handle larger sets of overlapping classes but needs more examples per class to learn from, and is not able to make a class prediction for every molecule. Future work will explore hybrid and ensemble approaches, as well as alternative network architectures including neuro-symbolic approaches.
format Online
Article
Text
id pubmed-7962259
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-7962259 2021-03-16 Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification Hastings, Janna Glauer, Martin Memariani, Adel Neuhaus, Fabian Mossakowski, Till J Cheminform Research Article
Springer International Publishing 2021-03-16 /pmc/articles/PMC7962259/ /pubmed/33726837 http://dx.doi.org/10.1186/s13321-021-00500-8 Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification
topic Research Article