Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier
We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-...
Main Authors: | Handsel, Jennifer; Matthews, Brian; Knight, Nicola J.; Coles, Simon J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2021 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8496104/ https://www.ncbi.nlm.nih.gov/pubmed/34620215 http://dx.doi.org/10.1186/s13321-021-00535-x |
_version_ | 1784579691036803072 |
---|---|
author | Handsel, Jennifer Matthews, Brian Knight, Nicola J. Coles, Simon J. |
author_sort | Handsel, Jennifer |
collection | PubMed |
description | We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-021-00535-x. |
format | Online Article Text |
id | pubmed-8496104 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-8496104 2021-10-07 Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier Handsel, Jennifer Matthews, Brian Knight, Nicola J. Coles, Simon J. J Cheminform Research Article We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-021-00535-x. Springer International Publishing 2021-10-07 /pmc/articles/PMC8496104/ /pubmed/34620215 http://dx.doi.org/10.1186/s13321-021-00535-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Handsel, Jennifer Matthews, Brian Knight, Nicola J. Coles, Simon J. Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier |
title | Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8496104/ https://www.ncbi.nlm.nih.gov/pubmed/34620215 http://dx.doi.org/10.1186/s13321-021-00535-x |
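
The description in this record states that the model reads the InChI and predicts the IUPAC name character by character rather than by words or sub-words. The following is a minimal sketch of what character-level tokenization could look like, not the authors' implementation: the special tokens `<pad>`, `<s>`, `</s>` and the helper names `build_vocab`, `encode`, and `decode` are assumptions made for illustration only.

```python
# Hypothetical sketch of character-level tokenization for InChI/IUPAC pairs.
# Token names and helper functions are illustrative assumptions; they are
# not taken from the article described in this record.

PAD, BOS, EOS = "<pad>", "<s>", "</s>"
SPECIALS = (PAD, BOS, EOS)


def build_vocab(strings):
    """Map every distinct character (plus special tokens) to an integer id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate(list(SPECIALS) + chars)}


def encode(text, vocab):
    """Turn a string into ids, one per character, wrapped in BOS/EOS."""
    return [vocab[BOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids, vocab):
    """Invert encode(), dropping the special tokens."""
    itos = {i: tok for tok, i in vocab.items()}
    return "".join(itos[i] for i in ids if itos[i] not in SPECIALS)


# One source/target pair: the standard InChI of methane and its name.
inchi = "InChI=1S/CH4/h1H4"
name = "methane"

vocab = build_vocab([inchi, name])
source_ids = encode(inchi, vocab)
target_ids = encode(name, vocab)

print(len(source_ids), len(target_ids))
print(decode(source_ids, vocab))
```

With this scheme the vocabulary stays small (one entry per distinct character), at the cost of longer sequences than word or sub-word tokenization would produce.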