Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati
This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language d...
Main authors: | Gaustad, Tanja; Puttkammer, Martin J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914324/ https://www.ncbi.nlm.nih.gov/pubmed/35282178 http://dx.doi.org/10.1016/j.dib.2022.107994 |
_version_ | 1784667677455810560 |
---|---|
author | Gaustad, Tanja Puttkammer, Martin J. |
author_facet | Gaustad, Tanja Puttkammer, Martin J. |
author_sort | Gaustad, Tanja |
collection | PubMed |
description | This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies. The article describes how the data was collected, what type of texts it contains and it provides some details on the three different types of linguistic annotation added (morphology, part-of-speech and lemmas), including an example. |
format | Online Article Text |
id | pubmed-8914324 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-8914324 2022-03-12 Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati Gaustad, Tanja Puttkammer, Martin J. Data Brief Data Article This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies. The article describes how the data was collected, what type of texts it contains and it provides some details on the three different types of linguistic annotation added (morphology, part-of-speech and lemmas), including an example. Elsevier 2022-02-25 /pmc/articles/PMC8914324/ /pubmed/35282178 http://dx.doi.org/10.1016/j.dib.2022.107994 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Gaustad, Tanja Puttkammer, Martin J. Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title | Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title_full | Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title_fullStr | Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title_full_unstemmed | Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title_short | Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati |
title_sort | linguistically annotated dataset for four official south african languages with a conjunctive orthography: isindebele, isixhosa, isizulu, and siswati |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914324/ https://www.ncbi.nlm.nih.gov/pubmed/35282178 http://dx.doi.org/10.1016/j.dib.2022.107994 |
work_keys_str_mv | AT gaustadtanja linguisticallyannotateddatasetforfourofficialsouthafricanlanguageswithaconjunctiveorthographyisindebeleisixhosaisizuluandsiswati AT puttkammermartinj linguisticallyannotateddatasetforfourofficialsouthafricanlanguageswithaconjunctiveorthographyisindebeleisixhosaisizuluandsiswati |