
YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications

In chemistry-related disciplines, a vast repository of molecular structural data has been documented in scientific publications but remains inaccessible to computational analyses owing to its non-machine-readable format. Optical chemical structure recognition (OCSR) addresses this gap by converting...


Bibliographic Details
Main authors: Zhou, Chong; Liu, Wei; Song, Xiyue; Yang, Mengling; Peng, Xiaowang
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10662772/
https://www.ncbi.nlm.nih.gov/pubmed/37986007
http://dx.doi.org/10.1186/s13321-023-00783-z
_version_ 1785148601520881664
author Zhou, Chong
Liu, Wei
Song, Xiyue
Yang, Mengling
Peng, Xiaowang
collection PubMed
description In chemistry-related disciplines, a vast repository of molecular structural data has been documented in scientific publications but remains inaccessible to computational analyses owing to its non-machine-readable format. Optical chemical structure recognition (OCSR) addresses this gap by converting images of chemical molecular structures into a format accessible to computers and convenient for storage, paving the way for further analyses and studies on chemical information. A pivotal initial step in OCSR is automating the noise-free extraction of molecular descriptions from literature. Despite efforts utilising rule-based and deep learning approaches for the extraction process, the accuracy achieved to date is unsatisfactory. To address this issue, we introduce a deep learning model named YoDe-Segmentation in this study, engineered for the automated retrieval of molecular structures from scientific documents. This model operates via a three-stage process encompassing detection, mask generation, and calculation. Initially, it identifies and isolates molecular structures during the detection phase. Subsequently, mask maps are created based on these isolated structures in the mask generation stage. In the final calculation stage, refined and separated mask maps are combined with the isolated molecular structure images, resulting in the acquisition of pure molecular structures. Our model underwent rigorous testing using texts from multiple chemistry-centric journals, with the outcomes subjected to manual validation. The results revealed the superior performance of YoDe-Segmentation compared to alternative algorithms, documenting an average extraction efficiency of 97.62%. This outcome not only highlights the robustness and reliability of the model but also suggests its applicability on a broad scale.
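The final "calculation" stage described above, in which a mask map is combined with an isolated structure image, can be illustrated with a minimal sketch. This is not the authors' implementation: the detection and mask-generation models are not reproduced, and the bounding box and binary mask are simply passed in as inputs. The names `apply_mask` and `extract_structure`, the grayscale list-of-lists image representation, and the white background value are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: a grayscale image is a list of rows, 0 = black ink, 255 = white.
Image = List[List[int]]
WHITE = 255


def apply_mask(crop: Image, mask: Image, background: int = WHITE) -> Image:
    """Keep pixels where the mask is truthy; paint everything else as background."""
    same_shape = len(crop) == len(mask) and all(
        len(row) == len(mask_row) for row, mask_row in zip(crop, mask)
    )
    if not same_shape:
        raise ValueError("crop and mask must have the same shape")
    return [
        [pixel if keep else background for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(crop, mask)
    ]


def extract_structure(page: Image, box: Tuple[int, int, int, int], mask: Image) -> Image:
    """Crop the detected box (top, left, bottom, right) and apply its mask.

    In the paper, the box would come from the detection stage and the mask
    from the mask-generation stage; here both are hypothetical inputs.
    """
    top, left, bottom, right = box
    crop = [row[left:right] for row in page[top:bottom]]
    return apply_mask(crop, mask)


# Toy page: a 2x2 "structure" next to a column of unrelated grey noise (90).
page = [
    [255, 255, 255, 255],
    [255, 0, 0, 90],
    [255, 0, 255, 90],
    [255, 255, 255, 255],
]
box = (1, 1, 3, 4)            # rows 1-2, columns 1-3
mask = [[1, 1, 0], [1, 1, 0]]  # last column is flagged as noise
clean = extract_structure(page, box, mask)
```

In this toy case the grey noise column inside the detected box is replaced by white background, while pixels covered by the mask are left untouched.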
format Online
Article
Text
id pubmed-10662772
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10662772 2023-11-20 YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications Zhou, Chong Liu, Wei Song, Xiyue Yang, Mengling Peng, Xiaowang J Cheminform Research Springer International Publishing 2023-11-20 /pmc/articles/PMC10662772/ /pubmed/37986007 http://dx.doi.org/10.1186/s13321-023-00783-z Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10662772/
https://www.ncbi.nlm.nih.gov/pubmed/37986007
http://dx.doi.org/10.1186/s13321-023-00783-z