
From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents

Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information on chemical reactions is usually embedded in the free text of patents. The rapidly accumulating body of chemical patents calls for automatic tools base...


Bibliographic Details
Main Authors: Wang, Jingqi, Ren, Yuankai, Zhang, Zhi, Xu, Hua, Zhang, Yaoyun
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8727901/
https://www.ncbi.nlm.nih.gov/pubmed/35005421
http://dx.doi.org/10.3389/frma.2021.691105
author Wang, Jingqi
Ren, Yuankai
Zhang, Zhi
Xu, Hua
Zhang, Yaoyun
collection PubMed
description Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information on chemical reactions is usually embedded in the free text of patents. The rapidly accumulating body of chemical patents calls for automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020 ChEMU task on chemical reaction extraction from patents. The task consisted of two subtasks: (1) named entity recognition, to identify compounds and their semantic roles in a chemical reaction, and (2) event extraction, to identify the event triggers of a chemical reaction and their relations to the semantic roles recognized in subtask 1. To build a high-performance end-to-end system, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimized tokenization and self-supervised pre-training of patent language models to rules based on domain knowledge. Our hybrid approaches, which combine these strategies, achieved state-of-the-art results in both subtasks, with a top-ranked F1 of 0.957 for entity recognition and a top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
format Online
Article
Text
id pubmed-8727901
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8727901 2022-01-06 From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Wang, Jingqi; Ren, Yuankai; Zhang, Zhi; Xu, Hua; Zhang, Yaoyun. Front Res Metr Anal (Research Metrics and Analytics). Frontiers Media S.A. 2021-12-22 /pmc/articles/PMC8727901/ /pubmed/35005421 http://dx.doi.org/10.3389/frma.2021.691105 Text en
Copyright © 2021 Wang, Ren, Zhang, Xu and Zhang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents
topic Research Metrics and Analytics