Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancem...
Main authors: | Yang, Xiao; Saha, Shyamasree; Venkatesan, Aravind; Tirunagari, Santosh; Vartak, Vid; McEntyre, Johanna |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2023 |
Subjects: | Data Descriptor |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587067/ https://www.ncbi.nlm.nih.gov/pubmed/37857688 http://dx.doi.org/10.1038/s41597-023-02617-x |
_version_ | 1785123277482491904 |
---|---|
author | Yang, Xiao Saha, Shyamasree Venkatesan, Aravind Tirunagari, Santosh Vartak, Vid McEntyre, Johanna |
author_facet | Yang, Xiao Saha, Shyamasree Venkatesan, Aravind Tirunagari, Santosh Vartak, Vid McEntyre, Johanna |
author_sort | Yang, Xiao |
collection | PubMed |
description | Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource. |
format | Online Article Text |
id | pubmed-10587067 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-10587067 2023-10-21 Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms Yang, Xiao Saha, Shyamasree Venkatesan, Aravind Tirunagari, Santosh Vartak, Vid McEntyre, Johanna Sci Data Data Descriptor Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource. Nature Publishing Group UK 2023-10-19 /pmc/articles/PMC10587067/ /pubmed/37857688 http://dx.doi.org/10.1038/s41597-023-02617-x Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Data Descriptor Yang, Xiao Saha, Shyamasree Venkatesan, Aravind Tirunagari, Santosh Vartak, Vid McEntyre, Johanna Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title | Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title_full | Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title_fullStr | Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title_full_unstemmed | Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title_short | Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms |
title_sort | europe pmc annotated full-text corpus for gene/proteins, diseases and organisms |
topic | Data Descriptor |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10587067/ https://www.ncbi.nlm.nih.gov/pubmed/37857688 http://dx.doi.org/10.1038/s41597-023-02617-x |
work_keys_str_mv | AT yangxiao europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms AT sahashyamasree europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms AT venkatesanaravind europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms AT tirunagarisantosh europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms AT vartakvid europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms AT mcentyrejohanna europepmcannotatedfulltextcorpusforgeneproteinsdiseasesandorganisms |