Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati

This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language d...

Bibliographic Details
Main authors: Gaustad, Tanja; Puttkammer, Martin J.
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914324/
https://www.ncbi.nlm.nih.gov/pubmed/35282178
http://dx.doi.org/10.1016/j.dib.2022.107994
_version_ 1784667677455810560
author Gaustad, Tanja
Puttkammer, Martin J.
author_facet Gaustad, Tanja
Puttkammer, Martin J.
author_sort Gaustad, Tanja
collection PubMed
description This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies. The article describes how the data was collected, what type of texts it contains and it provides some details on the three different types of linguistic annotation added (morphology, part-of-speech and lemmas), including an example.
format Online
Article
Text
id pubmed-8914324
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8914324 2022-03-12 Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati Gaustad, Tanja Puttkammer, Martin J. Data Brief Data Article This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies. The article describes how the data was collected, what type of texts it contains and it provides some details on the three different types of linguistic annotation added (morphology, part-of-speech and lemmas), including an example. Elsevier 2022-02-25 /pmc/articles/PMC8914324/ /pubmed/35282178 http://dx.doi.org/10.1016/j.dib.2022.107994 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Data Article
Gaustad, Tanja
Puttkammer, Martin J.
Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati
title Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati
topic Data Article