Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space

Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of ‘near-drug’ properties is still a challenging task. In this work, we proposed a multimodal c...

Bibliographic Details
Main authors: Wang, Jie, Shen, Zihao, Liao, Yichen, Yuan, Zhen, Li, Shiliang, He, Gaoqi, Lan, Man, Qian, Xuhong, Zhang, Kai, Li, Honglin
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9677486/
https://www.ncbi.nlm.nih.gov/pubmed/36252922
http://dx.doi.org/10.1093/bib/bbac461
_version_ 1784833821868294144
author Wang, Jie
Shen, Zihao
Liao, Yichen
Yuan, Zhen
Li, Shiliang
He, Gaoqi
Lan, Man
Qian, Xuhong
Zhang, Kai
Li, Honglin
collection PubMed
description Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of ‘near-drug’ properties is still a challenging task. In this work, we proposed a multimodal chemical information reconstruction system to automatically process, extract and align heterogeneous information from the text descriptions and structural images of chemical patents. Our key innovation lies in a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units can learn both to recognize heterogeneous chemical entities and to capture their correspondence. In particular, we collected chemical structures from the ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office using the keywords ‘A61P, compound, structure’ for the years 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. Based on the reconstructed results and substituent replacement rules, structural libraries containing a huge number of near-drug compounds can be generated automatically. In quantitative evaluations, our model correctly reconstructs 97% of the molecular images into structured format and achieves an F1-score of around 97–98% in the recognition of chemical entities. These results demonstrate the effectiveness of our model in automatic information extraction from chemical patents, which can hopefully be transformed into a user-friendly, structured molecular database that enriches the near-drug space and supports intelligent retrieval of chemical knowledge.
format Online
Article
Text
id pubmed-9677486
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9677486 2022-11-21 Brief Bioinform Problem Solving Protocol Oxford University Press 2022-10-17 /pmc/articles/PMC9677486/ /pubmed/36252922 http://dx.doi.org/10.1093/bib/bbac461 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space
topic Problem Solving Protocol