EduNER: a Chinese named entity recognition dataset for education research

A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including te...

Bibliographic Details
Main authors: Li, Xu; Wei, Chengkun; Jiang, Zhuoren; Meng, Wenlong; Ouyang, Fan; Zhang, Zihui; Chen, Wenzhi
Format: Online Article Text
Language: English
Published: Springer London 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10199663/
https://www.ncbi.nlm.nih.gov/pubmed/37362570
http://dx.doi.org/10.1007/s00521-023-08635-5
_version_ 1785044979179061248
author Li, Xu
Wei, Chengkun
Jiang, Zhuoren
Meng, Wenlong
Ouyang, Fan
Zhang, Zihui
Chen, Wenzhi
author_facet Li, Xu
Wei, Chengkun
Jiang, Zhuoren
Meng, Wenlong
Ouyang, Fan
Zhang, Zihui
Chen, Wenzhi
author_sort Li, Xu
collection PubMed
description A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012–2021). A team of domain experts is invited to accomplish the education NER schema definition, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built for accelerating human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further utilized for NER tasks validation. The experimental results can enlighten further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for NER task in the education domain, which may promote the development of education-oriented NER models.
format Online
Article
Text
id pubmed-10199663
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer London
record_format MEDLINE/PubMed
spelling pubmed-101996632023-05-23 EduNER: a Chinese named entity recognition dataset for education research Li, Xu Wei, Chengkun Jiang, Zhuoren Meng, Wenlong Ouyang, Fan Zhang, Zihui Chen, Wenzhi Neural Comput Appl Original Article A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012–2021). A team of domain experts is invited to accomplish the education NER schema definition, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built for accelerating human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further utilized for NER tasks validation. The experimental results can enlighten further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for NER task in the education domain, which may promote the development of education-oriented NER models. Springer London 2023-05-20 /pmc/articles/PMC10199663/ /pubmed/37362570 http://dx.doi.org/10.1007/s00521-023-08635-5 Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Original Article
Li, Xu
Wei, Chengkun
Jiang, Zhuoren
Meng, Wenlong
Ouyang, Fan
Zhang, Zihui
Chen, Wenzhi
EduNER: a Chinese named entity recognition dataset for education research
title EduNER: a Chinese named entity recognition dataset for education research
title_full EduNER: a Chinese named entity recognition dataset for education research
title_fullStr EduNER: a Chinese named entity recognition dataset for education research
title_full_unstemmed EduNER: a Chinese named entity recognition dataset for education research
title_short EduNER: a Chinese named entity recognition dataset for education research
title_sort eduner: a chinese named entity recognition dataset for education research
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10199663/
https://www.ncbi.nlm.nih.gov/pubmed/37362570
http://dx.doi.org/10.1007/s00521-023-08635-5
work_keys_str_mv AT lixu edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT weichengkun edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT jiangzhuoren edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT mengwenlong edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT ouyangfan edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT zhangzihui edunerachinesenamedentityrecognitiondatasetforeducationresearch
AT chenwenzhi edunerachinesenamedentityrecognitiondatasetforeducationresearch