Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages
Extracting insights from data obtained from the web in order to identify people’s views and opinions on various topics is a growing practice. The standard insights extraction pipeline is typically an unsupervised machine learning task composed of processes that preprocess the text, visualize it, clu...
Main authors: | Nchabeleng, Mathibele; Byamugisha, Joan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148238/ http://dx.doi.org/10.1007/978-3-030-45439-5_11 |
_version_ | 1783520550737412096 |
---|---|
author | Nchabeleng, Mathibele Byamugisha, Joan |
author_facet | Nchabeleng, Mathibele Byamugisha, Joan |
author_sort | Nchabeleng, Mathibele |
collection | PubMed |
description | Extracting insights from data obtained from the web in order to identify people’s views and opinions on various topics is a growing practice. The standard insights extraction pipeline is typically an unsupervised machine learning task composed of processes that preprocess the text, visualize it, cluster and identify the topics and sentiment in each cluster, and then graph the network. Given the increasing amount of data being generated on the internet in Africa today, and the multilingual state of African countries, we evaluated how well the standard pipeline works when applied to text wholly or partially written in indigenous African languages, specifically Bantu languages. We carried out an exploratory investigation using Twitter data and compared the outputs from each step of the pipeline for an English dataset and a mixed Bantu language dataset. We found that for Bantu languages, due to their complex grammatical structure, extra preprocessing steps such as part-of-speech tagging and morphological analysis are required during data cleaning, threshold values should be adjusted during topic modeling, and semantic analysis should be performed before completing text preprocessing. |
format | Online Article Text |
id | pubmed-7148238 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-7148238 2020-04-13 Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages Nchabeleng, Mathibele Byamugisha, Joan Advances in Information Retrieval Article Extracting insights from data obtained from the web in order to identify people’s views and opinions on various topics is a growing practice. The standard insights extraction pipeline is typically an unsupervised machine learning task composed of processes that preprocess the text, visualize it, cluster and identify the topics and sentiment in each cluster, and then graph the network. Given the increasing amount of data being generated on the internet in Africa today, and the multilingual state of African countries, we evaluated how well the standard pipeline works when applied to text wholly or partially written in indigenous African languages, specifically Bantu languages. We carried out an exploratory investigation using Twitter data and compared the outputs from each step of the pipeline for an English dataset and a mixed Bantu language dataset. We found that for Bantu languages, due to their complex grammatical structure, extra preprocessing steps such as part-of-speech tagging and morphological analysis are required during data cleaning, threshold values should be adjusted during topic modeling, and semantic analysis should be performed before completing text preprocessing. 2020-03-17 /pmc/articles/PMC7148238/ http://dx.doi.org/10.1007/978-3-030-45439-5_11 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Nchabeleng, Mathibele Byamugisha, Joan Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages |
title | Evaluating the Effectiveness of the Standard Insights Extraction Pipeline for Bantu Languages |
title_sort | evaluating the effectiveness of the standard insights extraction pipeline for bantu languages |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148238/ http://dx.doi.org/10.1007/978-3-030-45439-5_11 |
work_keys_str_mv | AT nchabelengmathibele evaluatingtheeffectivenessofthestandardinsightsextractionpipelineforbantulanguages AT byamugishajoan evaluatingtheeffectivenessofthestandardinsightsextractionpipelineforbantulanguages |