EduNER: a Chinese named entity recognition dataset for education research
A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including te...
Main Authors: | Li, Xu; Wei, Chengkun; Jiang, Zhuoren; Meng, Wenlong; Ouyang, Fan; Zhang, Zihui; Chen, Wenzhi |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer London 2023 |
Subjects: | Original Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10199663/ https://www.ncbi.nlm.nih.gov/pubmed/37362570 http://dx.doi.org/10.1007/s00521-023-08635-5 |
_version_ | 1785044979179061248 |
---|---|
author | Li, Xu Wei, Chengkun Jiang, Zhuoren Meng, Wenlong Ouyang, Fan Zhang, Zihui Chen, Wenzhi |
author_facet | Li, Xu Wei, Chengkun Jiang, Zhuoren Meng, Wenlong Ouyang, Fan Zhang, Zihui Chen, Wenzhi |
author_sort | Li, Xu |
collection | PubMed |
description | A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012–2021). A team of domain experts is invited to accomplish the education NER schema definition, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built for accelerating human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further utilized for NER tasks validation. The experimental results can enlighten further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for NER task in the education domain, which may promote the development of education-oriented NER models. |
format | Online Article Text |
id | pubmed-10199663 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer London |
record_format | MEDLINE/PubMed |
spelling | pubmed-10199663 2023-05-23 EduNER: a Chinese named entity recognition dataset for education research Li, Xu Wei, Chengkun Jiang, Zhuoren Meng, Wenlong Ouyang, Fan Zhang, Zihui Chen, Wenzhi Neural Comput Appl Original Article A high-quality domain-oriented dataset is crucial for the domain-specific named entity recognition (NER) task. In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012–2021). A team of domain experts is invited to accomplish the education NER schema definition, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built for accelerating human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further utilized for NER tasks validation. The experimental results can enlighten further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for NER task in the education domain, which may promote the development of education-oriented NER models. Springer London 2023-05-20 /pmc/articles/PMC10199663/ /pubmed/37362570 http://dx.doi.org/10.1007/s00521-023-08635-5 Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Original Article Li, Xu Wei, Chengkun Jiang, Zhuoren Meng, Wenlong Ouyang, Fan Zhang, Zihui Chen, Wenzhi EduNER: a Chinese named entity recognition dataset for education research |
title | EduNER: a Chinese named entity recognition dataset for education research |
title_full | EduNER: a Chinese named entity recognition dataset for education research |
title_fullStr | EduNER: a Chinese named entity recognition dataset for education research |
title_full_unstemmed | EduNER: a Chinese named entity recognition dataset for education research |
title_short | EduNER: a Chinese named entity recognition dataset for education research |
title_sort | eduner: a chinese named entity recognition dataset for education research |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10199663/ https://www.ncbi.nlm.nih.gov/pubmed/37362570 http://dx.doi.org/10.1007/s00521-023-08635-5 |
work_keys_str_mv | AT lixu edunerachinesenamedentityrecognitiondatasetforeducationresearch AT weichengkun edunerachinesenamedentityrecognitiondatasetforeducationresearch AT jiangzhuoren edunerachinesenamedentityrecognitiondatasetforeducationresearch AT mengwenlong edunerachinesenamedentityrecognitiondatasetforeducationresearch AT ouyangfan edunerachinesenamedentityrecognitiondatasetforeducationresearch AT zhangzihui edunerachinesenamedentityrecognitiondatasetforeducationresearch AT chenwenzhi edunerachinesenamedentityrecognitiondatasetforeducationresearch |