Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages

Bibliographic Details
Main authors: Nchabeleng, Mathibele; Byamugisha, Joan
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148238/
http://dx.doi.org/10.1007/978-3-030-45439-5_11
Collection: PubMed
Description: Extracting insights from data obtained from the web in order to identify people’s views and opinions on various topics is a growing practice. The standard insights extraction pipeline is typically an unsupervised machine learning task composed of processes that preprocess the text, visualize it, cluster and identify the topics and sentiment in each cluster, and then graph the network. Given the increasing amount of data being generated on the internet in Africa today, and the multilingual state of African countries, we evaluated how well the standard pipeline works when applied to text wholly or partially written in indigenous African languages, specifically Bantu languages. We carried out an exploratory investigation using Twitter data and compared the outputs from each step of the pipeline for an English dataset and a mixed Bantu language dataset. We found that for Bantu languages, due to their complex grammatical structure, extra preprocessing steps such as part-of-speech tagging and morphological analysis are required during data cleaning, threshold values should be adjusted during topic modeling, and semantic analysis should be performed before completing text preprocessing.
Record ID: pubmed-7148238
Institution: National Center for Biotechnology Information
Record format: MEDLINE/PubMed
Record: pubmed-7148238, 2020-04-13
Published in: Advances in Information Retrieval
Publication date: 2020-03-17
License: © Springer Nature Switzerland AG 2020. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.