Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words
Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are us...
Main authors: | Masua, Bernard; Masasi, Noel |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/ https://www.ncbi.nlm.nih.gov/pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517 |
_version_ | 1783613776625401856 |
---|---|
author | Masua, Bernard Masasi, Noel |
author_facet | Masua, Bernard Masasi, Noel |
author_sort | Masua, Bernard |
collection | PubMed |
description | Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus by a Python script. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data. |
format | Online Article Text |
id | pubmed-7689026 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-7689026 2020-12-07 Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words Masua, Bernard Masasi, Noel Data Brief Data Article Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus by a Python script. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data. Elsevier 2020-11-10 /pmc/articles/PMC7689026/ /pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Data Article Masua, Bernard Masasi, Noel Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title | Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title_full | Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title_fullStr | Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title_full_unstemmed | Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title_short | Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words |
title_sort | enhancing text pre-processing for swahili language: datasets for common swahili stop-words, slangs and typos with equivalent proper words |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/ https://www.ncbi.nlm.nih.gov/pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517 |
work_keys_str_mv | AT masuabernard enhancingtextpreprocessingforswahililanguagedatasetsforcommonswahilistopwordsslangsandtyposwithequivalentproperwords AT masasinoel enhancingtextpreprocessingforswahililanguagedatasetsforcommonswahilistopwordsslangsandtyposwithequivalentproperwords |
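
The description field above mentions that stop-word candidates came from the most frequent words in the corpus, typo candidates from the least frequent words, and that slang and typos are paired with proper words. The record does not include the authors' script, so the following is only a minimal sketch of that idea under stated assumptions: the function names (`tokenize`, `frequency_candidates`, `normalize`), the thresholds, the sample messages, and the slang/typo pairs are all hypothetical and are not taken from the published datasets. In the article, the candidate lists were reviewed by people before being accepted; this sketch covers only the counting and look-up steps.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes.

    This is a deliberate simplification; a real pipeline may need
    more careful handling of punctuation and digits.
    """
    return re.findall(r"[a-z']+", text.lower())


def frequency_candidates(messages, top_n=2, max_rare_count=1):
    """Return (most frequent words, rare words) from a list of messages.

    The most frequent words are *candidates* for stop-words and the
    rare words are *candidates* for typos; both lists still need
    manual review.
    """
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    most_frequent = [word for word, _ in counts.most_common(top_n)]
    rare = sorted(word for word, c in counts.items() if c <= max_rare_count)
    return most_frequent, rare


def normalize(text, stop_words, replacements):
    """Replace slang/typos with proper words, then drop stop-words."""
    tokens = (replacements.get(tok, tok) for tok in tokenize(text))
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical sample corpus (not from the published data).
SAMPLE_MESSAGES = [
    "habari ya leo na kesho",
    "shule ya msingi na sekondari",
    "mimi na wewe",
]

# Hypothetical slang/typo -> proper-word pairs, for illustration only.
SAMPLE_REPLACEMENTS = {"msosi": "chakula", "shuel": "shule"}

stop_candidates, typo_candidates = frequency_candidates(SAMPLE_MESSAGES)
cleaned = normalize("msosi na shuel ya leo", set(stop_candidates), SAMPLE_REPLACEMENTS)
```

Applying replacements before stop-word removal is a design choice in this sketch: it lets a slang or misspelled form of a stop-word be corrected first and then filtered out.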