BioRED: a rich biomedical relation extraction dataset

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein–protein interactions)...

Bibliographic Details
Main Authors:	Luo, Ling; Lai, Po-Ting; Wei, Chih-Hsuan; Arighi, Cecilia N; Lu, Zhiyong
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Review
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9487702/ https://www.ncbi.nlm.nih.gov/pubmed/35849818 http://dx.doi.org/10.1093/bib/bbac282

_version_	1784792509868670976
author	Luo, Ling Lai, Po-Ting Wei, Chih-Hsuan Arighi, Cecilia N Lu, Zhiyong
collection	PubMed
description	Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein–protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
format	Online Article Text
id	pubmed-9487702
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9487702 2022-09-21 BioRED: a rich biomedical relation extraction dataset Luo, Ling Lai, Po-Ting Wei, Chih-Hsuan Arighi, Cecilia N Lu, Zhiyong Brief Bioinform Review
Oxford University Press 2022-07-19 /pmc/articles/PMC9487702/ /pubmed/35849818 http://dx.doi.org/10.1093/bib/bbac282 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	BioRED: a rich biomedical relation extraction dataset
topic	Review
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9487702/ https://www.ncbi.nlm.nih.gov/pubmed/35849818 http://dx.doi.org/10.1093/bib/bbac282
