Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are us...

Bibliographic Details
Main authors: Masua, Bernard; Masasi, Noel
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/
https://www.ncbi.nlm.nih.gov/pubmed/33294515
http://dx.doi.org/10.1016/j.dib.2020.106517
author Masua, Bernard
Masasi, Noel
collection PubMed
description Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus by a Python script. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data.
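The frequency-based workflow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' script: the function names (`tokenize`, `word_frequencies`, `stop_word_candidates`, `typo_candidates`, `normalize`), the tokenization rule, and the cut-off parameters are assumptions made for illustration only. The candidate lists it produces would still need manual review, as the abstract describes.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase, keep runs of letters, and allow an
# internal apostrophe so forms such as "ng'ombe" stay in one piece.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_frequencies(messages):
    """Count how often each token occurs across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def stop_word_candidates(counts, top_n):
    """Return the top_n most frequent words for manual review."""
    return [word for word, _ in counts.most_common(top_n)]


def typo_candidates(counts, max_count=1):
    """Return words seen at most max_count times, sorted alphabetically."""
    return sorted(word for word, n in counts.items() if n <= max_count)


def normalize(text, replacements, stop_words):
    """Map typos or slang to proper words, then drop stop-words."""
    tokens = [replacements.get(token, token) for token in tokenize(text)]
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    # Toy messages invented for this sketch; "shuel" is a made-up typo.
    messages = [
        "habari ya shule na maji",
        "maji ya shule ni safi",
        "shuel ya kijiji na maji",
    ]
    counts = word_frequencies(messages)
    print(stop_word_candidates(counts, 2))
    print(typo_candidates(counts))
    print(normalize("shuel ya kijiji", {"shuel": "shule"}, {"ya"}))
```

The same lookup table used in `normalize` could hold slang-to-proper-word pairs as well as typo corrections, since both are described as word-level substitutions.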
format Online
Article
Text
id pubmed-7689026
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-7689026 2020-12-07 Data Brief Data Article Elsevier 2020-11-10 /pmc/articles/PMC7689026/ /pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517 Text en © 2020 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/
https://www.ncbi.nlm.nih.gov/pubmed/33294515
http://dx.doi.org/10.1016/j.dib.2020.106517