
Large-Scale Distributed Training of Transformers for Chemical Fingerprinting

Bibliographic Details
Main Authors: Abdel-Aty, Hisham; Gould, Ian R.
Format: Online Article Text
Language: English
Published: American Chemical Society, 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9597661/
https://www.ncbi.nlm.nih.gov/pubmed/36195574
http://dx.doi.org/10.1021/acs.jcim.2c00715
author Abdel-Aty, Hisham
Gould, Ian R.
collection PubMed
description Transformer models have become a popular choice for various machine learning tasks due to their often outstanding performance. Recently, transformers have been used in chemistry for classifying reactions, reaction prediction, physicochemical property prediction, and more. These models require huge amounts of data and localized compute to train effectively. In this work, we demonstrate that these models can successfully be trained for chemical problems in a distributed manner across many computers, a more common scenario for chemistry institutions. We introduce MFBERT: Molecular Fingerprints through Bidirectional Encoder Representations from Transformers. We use distributed computing to pre-train a transformer model on one of the largest aggregate datasets in the chemical literature and achieve state-of-the-art scores on a virtual screening benchmark for molecular fingerprints. We then fine-tune our model on smaller, more specific datasets to generate more targeted fingerprints and assess their quality. We utilize a SentencePiece tokenization model, where the whole procedure from raw molecular representation to molecular fingerprints becomes data-driven, with no explicit tokenization rules.
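The abstract's central pipeline, a tokenizer learned from raw SMILES with no hand-written chemistry rules, followed by pooling per-token vectors into a fixed-length fingerprint, can be illustrated with a toy sketch. This is not the authors' MFBERT code: the BPE-style merging below is a simplified stand-in for SentencePiece, and the hash-based embedding is a stand-in for the transformer encoder; all function names and the corpus are illustrative.

```python
# Toy sketch of a data-driven SMILES tokenizer and pooled fingerprint.
# NOT the MFBERT implementation: BPE-style merges stand in for
# SentencePiece, and hashed pseudo-embeddings stand in for the
# transformer encoder. Names and data are illustrative.
from collections import Counter

def learn_vocab(corpus, max_merges=10):
    """Learn subword merges from raw SMILES strings: repeatedly merge
    the most frequent adjacent symbol pair (no chemistry rules)."""
    tokenized = [list(s) for s in corpus]
    merges = []
    for _ in range(max_merges):
        pairs = Counter()
        for toks in tokenized:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # stop when no pair repeats
            break
        merges.append((a, b))
        tokenized = [_apply(toks, a, b) for toks in tokenized]
    return merges

def _apply(toks, a, b):
    """Merge every adjacent (a, b) pair in a token list, left to right."""
    out, j = [], 0
    while j < len(toks):
        if j + 1 < len(toks) and (toks[j], toks[j + 1]) == (a, b):
            out.append(a + b)
            j += 2
        else:
            out.append(toks[j])
            j += 1
    return out

def tokenize(smiles, merges):
    toks = list(smiles)
    for a, b in merges:
        toks = _apply(toks, a, b)
    return toks

def fingerprint(smiles, merges, dim=8):
    """Stand-in for the encoder: hash each token into a pseudo-embedding,
    then mean-pool into a fixed-length vector (the 'fingerprint')."""
    toks = tokenize(smiles, merges)
    vecs = [[(hash((t, d)) % 1000) / 1000.0 for d in range(dim)]
            for t in toks]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

corpus = ["CCO", "CCN", "c1ccccc1O", "CC(=O)O"]
merges = learn_vocab(corpus)
fp = fingerprint("CCO", merges)
print(len(fp))  # fixed-length regardless of input SMILES length
```

The point of the sketch is the shape of the pipeline: vocabulary and tokenization fall out of the data, and every molecule maps to a vector of the same dimension, which is what makes the output usable as a fingerprint for virtual screening.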
format Online
Article
Text
id pubmed-9597661
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-9597661 2022-10-27 Large-Scale Distributed Training of Transformers for Chemical Fingerprinting. Abdel-Aty, Hisham; Gould, Ian R. J Chem Inf Model. American Chemical Society 2022-10-04 2022-10-24 /pmc/articles/PMC9597661/ /pubmed/36195574 http://dx.doi.org/10.1021/acs.jcim.2c00715 Text en © 2022 The Authors. Published by American Chemical Society. Licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/), which permits the broadest form of re-use, including for commercial purposes, provided that author attribution and integrity are maintained.
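The distributed training the article describes (many machines rather than one localized cluster) is, at its core, data-parallel gradient averaging. The following is a conceptual sketch only, assuming a plain data-parallel scheme: each worker computes gradients on its own data shard, gradients are averaged as an all-reduce would do, and all workers apply the same update. The toy scalar model, shard layout, and function names are illustrative, not the authors' setup.

```python
# Conceptual sketch of data-parallel distributed training (assumed
# plain gradient averaging; not the authors' actual configuration).

def local_gradient(params, shard):
    """One worker's gradient on its own shard.
    Toy model: scalar linear fit y = w*x with squared-error loss."""
    w = params["w"]
    g = sum(2 * (w * x - y) * x for x, y in shard)
    return {"w": g / len(shard)}

def all_reduce_mean(grads):
    """Average gradients across workers, as an NCCL/MPI all-reduce would."""
    return {"w": sum(g["w"] for g in grads) / len(grads)}

def train_step(params, shards, lr=0.01):
    grads = [local_gradient(params, s) for s in shards]  # runs in parallel
    g = all_reduce_mean(grads)                           # synchronization point
    return {"w": params["w"] - lr * g["w"]}              # identical update on all workers

# Two "machines", each holding its own shard of (x, y) pairs for y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
params = {"w": 0.0}
for _ in range(200):
    params = train_step(params, shards)
print(round(params["w"], 2))  # converges toward the true slope 2.0
```

Because every worker applies the same averaged gradient, the replicas stay in lockstep, which is what lets training scale across ordinary networked computers rather than requiring one large localized machine.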
title Large-Scale Distributed Training of Transformers for Chemical Fingerprinting
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9597661/
https://www.ncbi.nlm.nih.gov/pubmed/36195574
http://dx.doi.org/10.1021/acs.jcim.2c00715