DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organi...
Main authors: | Moussa, Hanane Nour; Mourhir, Asmaa |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293988/ https://www.ncbi.nlm.nih.gov/pubmed/37383818 http://dx.doi.org/10.1016/j.dib.2023.109234 |
_version_ | 1785063102684856320 |
---|---|
author | Moussa, Hanane Nour Mourhir, Asmaa |
author_facet | Moussa, Hanane Nour Mourhir, Asmaa |
author_sort | Moussa, Hanane Nour |
collection | PubMed |
description | DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia and processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community as they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic. |
format | Online Article Text |
id | pubmed-10293988 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-10293988 2023-06-28 DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect Moussa, Hanane Nour Mourhir, Asmaa Data Brief Data Article DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia and processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community as they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic. Elsevier 2023-05-12 /pmc/articles/PMC10293988/ /pubmed/37383818 http://dx.doi.org/10.1016/j.dib.2023.109234 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Moussa, Hanane Nour Mourhir, Asmaa DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_full | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_fullStr | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_full_unstemmed | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_short | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_sort | darnercorp: an annotated named entity recognition dataset in the moroccan dialect |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293988/ https://www.ncbi.nlm.nih.gov/pubmed/37383818 http://dx.doi.org/10.1016/j.dib.2023.109234 |
work_keys_str_mv | AT moussahananenour darnercorpanannotatednamedentityrecognitiondatasetinthemoroccandialect AT mourhirasmaa darnercorpanannotatednamedentityrecognitiondatasetinthemoroccandialect |
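
The description above says each token carries a tag under the BIO scheme (B = beginning of an entity, I = inside an entity, O = outside any entity). The following is a minimal sketch of how such tags might be read, not the authors' own tooling. The tag labels `PER` and `LOC`, the function names, and the toy English sentence are all assumptions made for illustration; the record does not state the dataset's file format or exact label strings.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same category continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes any open span.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


def entity_token_ratio(tags: List[str]) -> float:
    """Fraction of tokens whose tag is not "O"."""
    if not tags:
        return 0.0
    return sum(tag != "O" for tag in tags) / len(tags)


# Toy example (hypothetical data, not taken from DarNERcorp).
tokens = ["Fatima", "Zahra", "visited", "Rabat", "yesterday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]

print(extract_entities(tokens, tags))
print(entity_token_ratio(tags))
```

Applied to the full corpus, a ratio computed this way would correspond to the 13.8% share of named-entity tokens quoted in the description, assuming that figure counts every non-`O` token.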