Machine learning for identification of silylated derivatives from mass spectra
MOTIVATION: Compound structure identification is using increasingly more sophisticated computational tools, among which machine learning tools are a recent addition that quickly gains in importance. These tools, of which the method titled Compound Structure Identification:Input Output Kernel Regress...
Main authors: | Ljoncheva, Milka, Stepišnik, Tomaž, Kosjek, Tina, Džeroski, Sašo |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9476372/ https://www.ncbi.nlm.nih.gov/pubmed/36109826 http://dx.doi.org/10.1186/s13321-022-00636-1 |
_version_ | 1784790121787162624 |
---|---|
author | Ljoncheva, Milka Stepišnik, Tomaž Kosjek, Tina Džeroski, Sašo |
author_facet | Ljoncheva, Milka Stepišnik, Tomaž Kosjek, Tina Džeroski, Sašo |
author_sort | Ljoncheva, Milka |
collection | PubMed |
description | MOTIVATION: Compound structure identification is using increasingly more sophisticated computational tools, among which machine learning tools are a recent addition that quickly gains in importance. These tools, of which the method titled Compound Structure Identification:Input Output Kernel Regression (CSI:IOKR) is an excellent example, have been used to elucidate compound structure from mass spectral (MS) data with significant accuracy, confidence and speed. They have, however, largely focused on data coming from liquid chromatography coupled to tandem mass spectrometry (LC–MS). Gas chromatography coupled to mass spectrometry (GC–MS) is an alternative which offers several advantages as compared to LC–MS, including higher data reproducibility. Of special importance is the substantial compound coverage offered by GC–MS, further expanded by derivatization procedures, such as silylation, which can improve the volatility, thermal stability and chromatographic peak shape of semi-volatile analytes. Despite these advantages and the increasing size of compound databases and MS libraries, GC–MS data have not yet been used by machine learning approaches to compound structure identification. RESULTS: This study presents a successful application of the CSI:IOKR machine learning method for the identification of environmental contaminants from GC–MS spectra. We use CSI:IOKR as an alternative to exhaustive search of MS libraries, independent of instrumental platform and data processing software. We use a comprehensive dataset of GC–MS spectra of trimethylsilyl derivatives and their molecular structures, derived from a large commercially available MS library, to train a model that maps between spectra and molecular structures. We test the learned model on a different dataset of GC–MS spectra of trimethylsilyl derivatives of environmental contaminants, generated in-house and made publicly available. The results show that 37% (resp. 50%) of the tested compounds are correctly ranked among the top 10 (resp. 20) candidate compounds suggested by the model. Even though spectral comparisons with reference standards or de novo structural elucidations are necessary to validate the predictions, machine learning provides efficient candidate prioritization and reduction of the time spent for compound annotation. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00636-1. |
format | Online Article Text |
id | pubmed-9476372 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-9476372 2022-09-16 Machine learning for identification of silylated derivatives from mass spectra Ljoncheva, Milka Stepišnik, Tomaž Kosjek, Tina Džeroski, Sašo J Cheminform Research Springer International Publishing 2022-09-15 /pmc/articles/PMC9476372/ /pubmed/36109826 http://dx.doi.org/10.1186/s13321-022-00636-1 Text en © The Author(s) 2022. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Ljoncheva, Milka Stepišnik, Tomaž Kosjek, Tina Džeroski, Sašo Machine learning for identification of silylated derivatives from mass spectra |
title | Machine learning for identification of silylated derivatives from mass spectra |
title_full | Machine learning for identification of silylated derivatives from mass spectra |
title_fullStr | Machine learning for identification of silylated derivatives from mass spectra |
title_full_unstemmed | Machine learning for identification of silylated derivatives from mass spectra |
title_short | Machine learning for identification of silylated derivatives from mass spectra |
title_sort | machine learning for identification of silylated derivatives from mass spectra |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9476372/ https://www.ncbi.nlm.nih.gov/pubmed/36109826 http://dx.doi.org/10.1186/s13321-022-00636-1 |
work_keys_str_mv | AT ljonchevamilka machinelearningforidentificationofsilylatedderivativesfrommassspectra AT stepisniktomaz machinelearningforidentificationofsilylatedderivativesfrommassspectra AT kosjektina machinelearningforidentificationofsilylatedderivativesfrommassspectra AT dzeroskisaso machinelearningforidentificationofsilylatedderivativesfrommassspectra |