
Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms

Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancem...


Bibliographic Details
Main authors: Yang, Xiao; Saha, Shyamasree; Venkatesan, Aravind; Tirunagari, Santosh; Vartak, Vid; McEntyre, Johanna
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587067/
https://www.ncbi.nlm.nih.gov/pubmed/37857688
http://dx.doi.org/10.1038/s41597-023-02617-x
author Yang, Xiao
Saha, Shyamasree
Venkatesan, Aravind
Tirunagari, Santosh
Vartak, Vid
McEntyre, Johanna
collection PubMed
description Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
format Online
Article
Text
id pubmed-10587067
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10587067 2023-10-21
Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms
Yang, Xiao
Saha, Shyamasree
Venkatesan, Aravind
Tirunagari, Santosh
Vartak, Vid
McEntyre, Johanna
Sci Data
Data Descriptor
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Nature Publishing Group UK
2023-10-19
/pmc/articles/PMC10587067/
/pubmed/37857688
http://dx.doi.org/10.1038/s41597-023-02617-x
Text
en
© The Author(s) 2023
https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587067/
https://www.ncbi.nlm.nih.gov/pubmed/37857688
http://dx.doi.org/10.1038/s41597-023-02617-x