
ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning

Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition too...


Bibliographic Details
Main authors: Weir, Hayley; Thompson, Keiran; Woodward, Amelia; Choi, Benjamin; Braun, Augustin; Martínez, Todd J.
Format: Online Article Text
Language: English
Published: The Royal Society of Chemistry 2021
Subjects: Chemistry
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8365825/
https://www.ncbi.nlm.nih.gov/pubmed/34447555
http://dx.doi.org/10.1039/d1sc02957f
_version_ 1783738786268577792
author Weir, Hayley
Thompson, Keiran
Woodward, Amelia
Choi, Benjamin
Braun, Augustin
Martínez, Todd J.
author_facet Weir, Hayley
Thompson, Keiran
Woodward, Amelia
Choi, Benjamin
Braun, Augustin
Martínez, Todd J.
author_sort Weir, Hayley
collection PubMed
description Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition tool designed to remove these barriers. A neural image captioning approach consisting of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder learned a mapping from photographs of hand-drawn hydrocarbon structures to machine-readable SMILES representations. We generated a large auxiliary training dataset, based on RDKit molecular images, by combining image augmentation, image degradation and background addition. Additionally, a small dataset of ∼600 hand-drawn hydrocarbon chemical structures was crowd-sourced using a phone web application. These datasets were used to train the image-to-SMILES neural network with the goal of maximizing the hand-drawn hydrocarbon recognition accuracy. By forming a committee of the trained neural networks where each network casts one vote for the predicted molecule, we achieved a nearly 10 percentage point improvement of the molecule recognition accuracy and were able to assign a confidence value for the prediction based on the number of agreeing votes. The ensemble model achieved an accuracy of 76% on hand-drawn hydrocarbons, increasing to 86% if the top 3 predictions were considered.
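The committee vote described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `committee_vote`, the use of the vote fraction as the confidence value, and the example SMILES strings are all assumptions made for illustration. The sketch also assumes each network's output has already been reduced to a comparable SMILES string.

```python
from collections import Counter


def committee_vote(predictions, top_k=3):
    """Rank SMILES strings by the number of committee members that predicted them.

    predictions: one SMILES string per trained network (hypothetical input format).
    Returns up to top_k (smiles, confidence) pairs, where confidence is the
    fraction of agreeing votes (an assumed definition, not taken from the paper).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    total = len(predictions)
    # most_common orders by vote count, highest first; ties keep first-seen order.
    return [(smiles, votes / total) for smiles, votes in counts.most_common(top_k)]


# Hypothetical outputs from a committee of five networks for one image.
votes = ["C1CCCCC1", "C1CCCCC1", "C1CCCCC1", "C1CCCC1", "CCCCCC"]
ranked = committee_vote(votes)
print(ranked[0])  # the majority prediction and its vote fraction
```

Returning the top few candidates rather than only the winner mirrors the abstract's distinction between top-1 and top-3 accuracy.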
format Online
Article
Text
id pubmed-8365825
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher The Royal Society of Chemistry
record_format MEDLINE/PubMed
spelling pubmed-83658252021-08-25 ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning Weir, Hayley Thompson, Keiran Woodward, Amelia Choi, Benjamin Braun, Augustin Martínez, Todd J. Chem Sci Chemistry Inputting molecules into chemistry software, such as quantum chemistry packages, currently requires domain expertise, expensive software and/or cumbersome procedures. Leveraging recent breakthroughs in machine learning, we develop ChemPix: an offline, hand-drawn hydrocarbon structure recognition tool designed to remove these barriers. A neural image captioning approach consisting of a convolutional neural network (CNN) encoder and a long short-term memory (LSTM) decoder learned a mapping from photographs of hand-drawn hydrocarbon structures to machine-readable SMILES representations. We generated a large auxiliary training dataset, based on RDKit molecular images, by combining image augmentation, image degradation and background addition. Additionally, a small dataset of ∼600 hand-drawn hydrocarbon chemical structures was crowd-sourced using a phone web application. These datasets were used to train the image-to-SMILES neural network with the goal of maximizing the hand-drawn hydrocarbon recognition accuracy. By forming a committee of the trained neural networks where each network casts one vote for the predicted molecule, we achieved a nearly 10 percentage point improvement of the molecule recognition accuracy and were able to assign a confidence value for the prediction based on the number of agreeing votes. The ensemble model achieved an accuracy of 76% on hand-drawn hydrocarbons, increasing to 86% if the top 3 predictions were considered. The Royal Society of Chemistry 2021-07-03 /pmc/articles/PMC8365825/ /pubmed/34447555 http://dx.doi.org/10.1039/d1sc02957f Text en This journal is © The Royal Society of Chemistry https://creativecommons.org/licenses/by-nc/3.0/
spellingShingle Chemistry
Weir, Hayley
Thompson, Keiran
Woodward, Amelia
Choi, Benjamin
Braun, Augustin
Martínez, Todd J.
ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title_full ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title_fullStr ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title_full_unstemmed ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title_short ChemPix: automated recognition of hand-drawn hydrocarbon structures using deep learning
title_sort chempix: automated recognition of hand-drawn hydrocarbon structures using deep learning
topic Chemistry
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8365825/
https://www.ncbi.nlm.nih.gov/pubmed/34447555
http://dx.doi.org/10.1039/d1sc02957f
work_keys_str_mv AT weirhayley chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning
AT thompsonkeiran chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning
AT woodwardamelia chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning
AT choibenjamin chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning
AT braunaugustin chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning
AT martineztoddj chempixautomatedrecognitionofhanddrawnhydrocarbonstructuresusingdeeplearning