MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra

The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the n...

Bibliographic Details
Main authors:	Shrivastava, Aditya Divyakant; Swainston, Neil; Samanta, Soumitra; Roberts, Ivayla; Wright Muelas, Marina; Kell, Douglas B.
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8699281/ https://www.ncbi.nlm.nih.gov/pubmed/34944436 http://dx.doi.org/10.3390/biom11121793

_version_	1784620476791783424
author	Shrivastava, Aditya Divyakant; Swainston, Neil; Samanta, Soumitra; Roberts, Ivayla; Wright Muelas, Marina; Kell, Douglas B.
collection	PubMed
description	The ‘inverse problem’ of mass spectrometric molecular identification (‘given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came’) is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem (‘calculate a small molecule’s likely fragmentation and hence at least some of its mass spectrum from its structure alone’) is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the ‘translation’ a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the ‘true’ molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are ‘similar’ to the top hit. 
In addition to using the ‘top hits’ directly, we can produce a rank order of these by ‘round-tripping’ candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500 Da or lower, including those in the last CASMI challenge (for which the results are known), getting 49/93 (53%) precisely correct. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. It seems to act as a Las Vegas algorithm, in that it either gives the correct answer or simply states that it cannot find one. The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.
format	Online Article Text
id	pubmed-8699281
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8699281 2021-12-24 MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra Shrivastava, Aditya Divyakant; Swainston, Neil; Samanta, Soumitra; Roberts, Ivayla; Wright Muelas, Marina; Kell, Douglas B. Biomolecules Article MDPI 2021-11-30 /pmc/articles/PMC8699281/ /pubmed/34944436 http://dx.doi.org/10.3390/biom11121793 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8699281/ https://www.ncbi.nlm.nih.gov/pubmed/34944436 http://dx.doi.org/10.3390/biom11121793