
DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect

DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. 13.8% of the tokens are named entities spanning four categories: person, location, organi...
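
As a rough illustration of the BIO scheme mentioned in the summary, the Python sketch below groups BIO tags into entity spans and computes the share of tokens that belong to an entity (the kind of figure reported as 13.8% for this dataset). Everything in it is an assumption made for illustration: the tag labels (PER, LOC), the English toy sentence, and the helper names bio_to_spans and entity_token_ratio are hypothetical and are not taken from the dataset or its documentation.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        # "B-" always opens an entity; a stray "I-" is leniently treated
        # as an opening tag as well (an assumption, not a dataset rule).
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


def entity_token_ratio(tags: List[str]) -> float:
    """Fraction of tokens whose tag is not "O"."""
    return sum(t != "O" for t in tags) / len(tags) if tags else 0.0


# Hypothetical toy sentence (English, not drawn from DarNERcorp).
tokens = ["Fatima", "Zahra", "visited", "Rabat", "yesterday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

spans = bio_to_spans(tags)
ratio = entity_token_ratio(tags)
for span_label, span_start, span_end in spans:
    print(span_label, " ".join(tokens[span_start:span_end]))
print(f"entity tokens: {ratio:.1%}")
```

In this toy input, the two-token person name forms a single span because its second token carries an I- tag, while the location is a one-token span opened by a B- tag.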


Bibliographic Details
Main authors: Moussa, Hanane Nour, Mourhir, Asmaa
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293988/
https://www.ncbi.nlm.nih.gov/pubmed/37383818
http://dx.doi.org/10.1016/j.dib.2023.109234
author Moussa, Hanane Nour
Mourhir, Asmaa
collection PubMed
description DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic.
format Online
Article
Text
id pubmed-10293988
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10293988 2023-06-28 DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect Moussa, Hanane Nour Mourhir, Asmaa Data Brief Data Article DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic. Elsevier 2023-05-12 /pmc/articles/PMC10293988/ /pubmed/37383818 http://dx.doi.org/10.1016/j.dib.2023.109234 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293988/
https://www.ncbi.nlm.nih.gov/pubmed/37383818
http://dx.doi.org/10.1016/j.dib.2023.109234