Enhancing African low-resource languages: Swahili data for language modelling

Language modelling using neural networks requires adequate data to guarantee quality word representation which is important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged and most of them are classified as low resource language...

Bibliographic Details
Main Authors: Shikali, Casper S.; Mokhosi, Refuoe
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects: Computer Science
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7339006/
https://www.ncbi.nlm.nih.gov/pubmed/32671155
http://dx.doi.org/10.1016/j.dib.2020.105951
author Shikali, Casper S.
Mokhosi, Refuoe
collection PubMed
description Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also other downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis.
format Online
Article
Text
id pubmed-7339006
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-7339006 2020-07-14 Enhancing African low-resource languages: Swahili data for language modelling Shikali, Casper S. Mokhosi, Refuoe Data Brief Computer Science Elsevier 2020-06-30 /pmc/articles/PMC7339006/ /pubmed/32671155 http://dx.doi.org/10.1016/j.dib.2020.105951 Text en © 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Enhancing African low-resource languages: Swahili data for language modelling
topic Computer Science