
Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are us...


Bibliographic Details
Main authors:	Masua, Bernard; Masasi, Noel
Format:	Online Article Text
Language:	English
Published:	Elsevier 2020
Subjects:	Data Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/ https://www.ncbi.nlm.nih.gov/pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517

_version_	1783613776625401856
author	Masua, Bernard Masasi, Noel
author_facet	Masua, Bernard Masasi, Noel
author_sort	Masua, Bernard
collection	PubMed
description	Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that is used by young people to convey their opinions on things that matter to them. We derive a list of common Swahili stop-words by reviewing the most frequent words generated by a Python script from our corpus, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated by a Python script from the corpus. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data.
format	Online Article Text
id	pubmed-7689026
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-76890262020-12-07 Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words Masua, Bernard Masasi, Noel Data Brief Data Article Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that is used by young people to convey their opinions on things that matter to them. We derive a list of common Swahili stop-words by reviewing the most frequent words generated by a Python script from our corpus, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated by a Python script from the corpus. The datasets were exported into files for easy access and reuse. These datasets can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data. Elsevier 2020-11-10 /pmc/articles/PMC7689026/ /pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Data Article Masua, Bernard Masasi, Noel Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words
title	Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7689026/ https://www.ncbi.nlm.nih.gov/pubmed/33294515 http://dx.doi.org/10.1016/j.dib.2020.106517
work_keys_str_mv	AT masuabernard enhancingtextpreprocessingforswahililanguagedatasetsforcommonswahilistopwordsslangsandtyposwithequivalentproperwords AT masasinoel enhancingtextpreprocessingforswahililanguagedatasetsforcommonswahilistopwordsslangsandtyposwithequivalentproperwords
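
The description above says that stop-word candidates came from the most frequent words in the corpus and typo candidates from the least frequent words, both produced by a Python script and then reviewed by hand. The authors' script is not part of this record, so the following is only a minimal sketch of that idea. The function names, the tokenization rule, the count threshold and the five sample messages are all assumptions made for illustration; they are not taken from the published datasets.

```python
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and return its runs of letters as tokens.

    Assumption: a simple letters-only rule. It splits on apostrophes,
    so a word such as "ng'ombe" would need a more careful tokenizer.
    """
    return re.findall(r"[^\W\d_]+", message.lower())


def word_frequencies(messages):
    """Count how often each token occurs across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def most_frequent(counts, n):
    """Return the n most frequent words: candidates for a stop-word list."""
    return [word for word, _ in counts.most_common(n)]


def least_frequent(counts, max_count=1):
    """Return words seen at most max_count times: candidates for typos."""
    return sorted(word for word, count in counts.items() if count <= max_count)


# Hypothetical sample messages, made up for this sketch.
sample_messages = [
    "habari ya leo",
    "habari ya asubuhi na mchana",
    "mimi na wewe na yeye",
    "chakula cha leo na kesho",
    "habri ya jioni",  # "habri" stands in for a misspelling of "habari"
]

counts = word_frequencies(sample_messages)
stopword_candidates = most_frequent(counts, 2)
typo_candidates = least_frequent(counts)

print("Stop-word candidates:", stopword_candidates)
print("Typo candidates:", typo_candidates)
```

Both lists are only candidates. Rare but correctly spelled words land in the low-frequency list alongside real typos, which is why a manual review step, as the description mentions, is still needed before anything is added to a dataset.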
