Enhancing African low-resource languages: Swahili data for language modelling
Language modelling using neural networks requires adequate data to guarantee quality word representation which is important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged and most of them are classified as low resource language...
Main authors: | Shikali, Casper S.; Mokhosi, Refuoe |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2020 |
Subjects: | Computer Science |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7339006/ https://www.ncbi.nlm.nih.gov/pubmed/32671155 http://dx.doi.org/10.1016/j.dib.2020.105951 |
_version_ | 1783554804573798400 |
---|---|
author | Shikali, Casper S. Mokhosi, Refuoe |
author_facet | Shikali, Casper S. Mokhosi, Refuoe |
author_sort | Shikali, Casper S. |
collection | PubMed |
description | Language modelling using neural networks requires adequate data to guarantee quality word representation, which is important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also other downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis. |
format | Online Article Text |
id | pubmed-7339006 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-73390062020-07-14 Enhancing African low-resource languages: Swahili data for language modelling Shikali, Casper S. Mokhosi, Refuoe Data Brief Computer Science Language modelling using neural networks requires adequate data to guarantee quality word representation which is important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged and most of them are classified as low resource languages because of inadequate data for NLP. In this article, we derive and contribute unannotated Swahili dataset, Swahili syllabic alphabet and Swahili word analogy dataset to address the need for language processing resources especially for low resource languages. Therefore, we derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will not only support language models but also other NLP downstream tasks such as part-of-speech tagging, machine translation and sentiment analysis. Elsevier 2020-06-30 /pmc/articles/PMC7339006/ /pubmed/32671155 http://dx.doi.org/10.1016/j.dib.2020.105951 Text en © 2020 The Authors. Published by Elsevier Inc. http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Computer Science Shikali, Casper S. Mokhosi, Refuoe Enhancing African low-resource languages: Swahili data for language modelling |
title | Enhancing African low-resource languages: Swahili data for language modelling |
topic | Computer Science |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7339006/ https://www.ncbi.nlm.nih.gov/pubmed/32671155 http://dx.doi.org/10.1016/j.dib.2020.105951 |