OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition

Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (...

Bibliographic Details
Main authors: Larmande, Pierre; Liu, Yusha; Yao, Xinzhi; Xia, Jingbo
Format: Online Article Text
Language: English
Published: Korea Genome Organization 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510865/
https://www.ncbi.nlm.nih.gov/pubmed/34638174
http://dx.doi.org/10.5808/gi.21015
_version_ 1784582663432044544
author Larmande, Pierre
Liu, Yusha
Yao, Xinzhi
Xia, Jingbo
author_facet Larmande, Pierre
Liu, Yusha
Yao, Xinzhi
Xia, Jingbo
author_sort Larmande, Pierre
collection PubMed
description Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
format Online
Article
Text
id pubmed-8510865
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Korea Genome Organization
record_format MEDLINE/PubMed
spelling pubmed-85108652021-10-22 OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition Larmande, Pierre Liu, Yusha Yao, Xinzhi Xia, Jingbo Genomics Inform Blah7 Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP. Korea Genome Organization 2021-09-30 /pmc/articles/PMC8510865/ /pubmed/34638174 http://dx.doi.org/10.5808/gi.21015 Text en (c) 2021, Korea Genome Organization https://creativecommons.org/licenses/by/4.0/(CC) This is an open-access article distributed under the terms of the Creative Commons Attribution license(https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Blah7
Larmande, Pierre
Liu, Yusha
Yao, Xinzhi
Xia, Jingbo
OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title_full OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title_fullStr OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title_full_unstemmed OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title_short OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition
title_sort oryzagp 2021 update: a rice gene and protein dataset for named-entity recognition
topic Blah7
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8510865/
https://www.ncbi.nlm.nih.gov/pubmed/34638174
http://dx.doi.org/10.5808/gi.21015
work_keys_str_mv AT larmandepierre oryzagp2021updatearicegeneandproteindatasetfornamedentityrecognition
AT liuyusha oryzagp2021updatearicegeneandproteindatasetfornamedentityrecognition
AT yaoxinzhi oryzagp2021updatearicegeneandproteindatasetfornamedentityrecognition
AT xiajingbo oryzagp2021updatearicegeneandproteindatasetfornamedentityrecognition