ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic pro...

Bibliographic Details
Main Authors: He, Jiayuan, Nguyen, Dat Quoc, Akhondi, Saber A., Druckenbrodt, Christian, Thorne, Camilo, Hoessel, Ralph, Afzal, Zubair, Zhai, Zenan, Fang, Biaoyan, Yoshikawa, Hiyori, Albahem, Ameer, Cavedon, Lawrence, Cohn, Trevor, Baldwin, Timothy, Verspoor, Karin
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8028406/
https://www.ncbi.nlm.nih.gov/pubmed/33870071
http://dx.doi.org/10.3389/frma.2021.654438
_version_ 1783675967333466112
author He, Jiayuan
Nguyen, Dat Quoc
Akhondi, Saber A.
Druckenbrodt, Christian
Thorne, Camilo
Hoessel, Ralph
Afzal, Zubair
Zhai, Zenan
Fang, Biaoyan
Yoshikawa, Hiyori
Albahem, Ameer
Cavedon, Lawrence
Cohn, Trevor
Baldwin, Timothy
Verspoor, Karin
author_sort He, Jiayuan
collection PubMed
description Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
format Online
Article
Text
id pubmed-8028406
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8028406 2021-04-15 ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents He, Jiayuan Nguyen, Dat Quoc Akhondi, Saber A. Druckenbrodt, Christian Thorne, Camilo Hoessel, Ralph Afzal, Zubair Zhai, Zenan Fang, Biaoyan Yoshikawa, Hiyori Albahem, Ameer Cavedon, Lawrence Cohn, Trevor Baldwin, Timothy Verspoor, Karin Front Res Metr Anal Research Metrics and Analytics Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents. Frontiers Media S.A. 2021-03-25 /pmc/articles/PMC8028406/ /pubmed/33870071 http://dx.doi.org/10.3389/frma.2021.654438 Text en Copyright © 2021 He, Nguyen, Akhondi, Druckenbrodt, Thorne, Hoessel, Afzal, Zhai, Fang, Yoshikawa, Albahem, Cavedon, Cohn, Baldwin and Verspoor. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents
topic Research Metrics and Analytics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8028406/
https://www.ncbi.nlm.nih.gov/pubmed/33870071
http://dx.doi.org/10.3389/frma.2021.654438