
DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual informat...


Bibliographic Details
Main authors: Rajan, Kohulan; Brinkhaus, Henning Otto; Sorokina, Maria; Zielesny, Achim; Steinbeck, Christoph
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7941967/
https://www.ncbi.nlm.nih.gov/pubmed/33685498
http://dx.doi.org/10.1186/s13321-021-00496-1
author Rajan, Kohulan
Brinkhaus, Henning Otto
Sorokina, Maria
Zielesny, Achim
Steinbeck, Christoph
collection PubMed
description Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages so that it is also applicable to older articles published before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
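The second stage in the description, expanding potentially incomplete masks, can be illustrated with a small sketch. It is not the authors' implementation: it assumes the page is already binarized (1 = ink, 0 = background) and swaps in a plain 8-connected flood fill as the expansion rule. The function name `expand_mask` and the toy data are hypothetical.

```python
from collections import deque


def expand_mask(page, mask):
    """Grow `mask` over every ink pixel 8-connected to ink it already covers.

    page: list of rows, 1 = ink (dark pixel), 0 = background (assumed binarized).
    mask: list of rows of the same shape, 1 = pixel flagged by the detector.
    Returns a new mask; the inputs are not modified.
    """
    height, width = len(page), len(page[0])
    expanded = [row[:] for row in mask]
    # Seed the search with ink pixels that the detector already covered.
    queue = deque(
        (r, c)
        for r in range(height)
        for c in range(width)
        if mask[r][c] and page[r][c]
    )
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and page[nr][nc] and not expanded[nr][nc]:
                    expanded[nr][nc] = 1
                    queue.append((nr, nc))
    return expanded


# Toy page: a ring-shaped "structure" on the left, an unrelated stroke on the right.
page = [
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 1],
]
# Hypothetical detector output that covers only part of the ring.
mask = [
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for row in expand_mask(page, mask):
    print(row)
```

In this toy case the expanded mask covers the whole ring, while the disconnected stroke on the right stays outside because no ink path links it to the seed pixels.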
format Online
Article
Text
id pubmed-7941967
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-7941967 2021-03-10 DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature Rajan, Kohulan Brinkhaus, Henning Otto Sorokina, Maria Zielesny, Achim Steinbeck, Christoph J Cheminform Software Springer International Publishing 2021-03-08 /pmc/articles/PMC7941967/ /pubmed/33685498 http://dx.doi.org/10.1186/s13321-021-00496-1 Text en © The Author(s) 2021 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature
topic Software