ChemTables: a dataset for semantic classification on tables in chemical patents

Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent d...

Bibliographic Details
Main authors: Zhai, Zenan; Druckenbrodt, Christian; Thorne, Camilo; Akhondi, Saber A.; Nguyen, Dat Quoc; Cohn, Trevor; Verspoor, Karin
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8665561/
https://www.ncbi.nlm.nih.gov/pubmed/34895295
http://dx.doi.org/10.1186/s13321-021-00568-2
author Zhai, Zenan
Druckenbrodt, Christian
Thorne, Camilo
Akhondi, Saber A.
Nguyen, Dat Quoc
Cohn, Trevor
Verspoor, Karin
collection PubMed
description Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to ChemTables. The best performing model, Table-BERT, achieves a micro-averaged F1 score of 88.66 on the table classification task. The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. The code and models evaluated in this work are available in a GitHub repository at https://github.com/zenanz/ChemTables.
format Online
Article
Text
id pubmed-8665561
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-8665561 2021-12-13 ChemTables: a dataset for semantic classification on tables in chemical patents Zhai, Zenan Druckenbrodt, Christian Thorne, Camilo Akhondi, Saber A. Nguyen, Dat Quoc Cohn, Trevor Verspoor, Karin J Cheminform Research Article Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT, to ChemTables. The best performing model, Table-BERT, achieves a micro-averaged F1 score of 88.66 on the table classification task.
The ChemTables dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. The code and models evaluated in this work are available in a GitHub repository at https://github.com/zenanz/ChemTables. Springer International Publishing 2021-12-11 /pmc/articles/PMC8665561/ /pubmed/34895295 http://dx.doi.org/10.1186/s13321-021-00568-2 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title ChemTables: a dataset for semantic classification on tables in chemical patents
topic Research Article