Dataset of Karakalpak language stop words

Bibliographic Details
Main Authors: Madatov, Khabibulla; Bekchanov, Shukurla; Vičič, Jernej
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126844/
https://www.ncbi.nlm.nih.gov/pubmed/37113499
http://dx.doi.org/10.1016/j.dib.2023.109111
_version_ 1785030347000381440
author Madatov, Khabibulla
Bekchanov, Shukurla
Vičič, Jernej
author_facet Madatov, Khabibulla
Bekchanov, Shukurla
Vičič, Jernej
author_sort Madatov, Khabibulla
collection PubMed
description The dataset presented in this paper aims to address the challenge of automatic extraction of stop words in Natural Language Processing (NLP) for the low-resource Karakalpak language spoken by approximately two million people in Uzbekistan. To accomplish this, we have created a corpus of 23 Karakalpak language school textbooks, which we have named the Karakalpak Language School Corpus (KAASC). Using the KAASC corpus, we have constructed lists of stop words using three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): unigram, bigram, and collocation methods, respectively. The resulting lists of stop words, along with a list of URLs used to construct the corpus, make up the described dataset in this paper.
format Online
Article
Text
id pubmed-10126844
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10126844 2023-04-26 Dataset of Karakalpak language stop words Madatov, Khabibulla Bekchanov, Shukurla Vičič, Jernej Data Brief Data Article The dataset presented in this paper aims to address the challenge of automatic extraction of stop words in Natural Language Processing (NLP) for the low-resource Karakalpak language spoken by approximately two million people in Uzbekistan. To accomplish this, we have created a corpus of 23 Karakalpak language school textbooks, which we have named the Karakalpak Language School Corpus (KAASC). Using the KAASC corpus, we have constructed lists of stop words using three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): unigram, bigram, and collocation methods, respectively. The resulting lists of stop words, along with a list of URLs used to construct the corpus, make up the described dataset in this paper. Elsevier 2023-04-05 /pmc/articles/PMC10126844/ /pubmed/37113499 http://dx.doi.org/10.1016/j.dib.2023.109111 Text en © 2023 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Data Article
Madatov, Khabibulla
Bekchanov, Shukurla
Vičič, Jernej
Dataset of Karakalpak language stop words
title Dataset of Karakalpak language stop words
title_full Dataset of Karakalpak language stop words
title_fullStr Dataset of Karakalpak language stop words
title_full_unstemmed Dataset of Karakalpak language stop words
title_short Dataset of Karakalpak language stop words
title_sort dataset of karakalpak language stop words
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126844/
https://www.ncbi.nlm.nih.gov/pubmed/37113499
http://dx.doi.org/10.1016/j.dib.2023.109111
work_keys_str_mv AT madatovkhabibulla datasetofkarakalpaklanguagestopwords
AT bekchanovshukurla datasetofkarakalpaklanguagestopwords
AT vicicjernej datasetofkarakalpaklanguagestopwords
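
Illustrative note: the description field in this record says the stop word lists were built with TF-IDF-based methods, the first of which works on unigrams. The sketch below shows one common way such a unigram ranking can be set up. It is an assumption for illustration, not the authors' procedure: the function names, the unsmoothed IDF formula, the choice to average TF-IDF over documents, and the English toy documents are all hypothetical and are not taken from the KAASC corpus.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {word: mean TF-IDF over all documents} for tokenized documents.

    Assumption: TF is the relative frequency within a document and
    IDF is log(N / df) without smoothing.
    """
    docs = [doc for doc in docs if doc]  # skip empty documents
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    scores = {}
    for word, d in df.items():
        idf = math.log(n_docs / d)
        total = 0.0
        for c, doc in zip(counts, docs):
            total += (c[word] / len(doc)) * idf  # contributes 0 if word is absent
        scores[word] = total / n_docs
    return scores


def stop_word_candidates(docs, k):
    """Return the k words with the lowest mean TF-IDF (ties broken alphabetically)."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda w: (scores[w], w))[:k]


# Hypothetical toy documents standing in for tokenized textbook texts.
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the bird saw the cat".split(),
]
print(stop_word_candidates(docs, 1))
```

In this toy input the word "the" occurs in every document, so its IDF is log(1) = 0 and it ranks first. With this plain averaging, very rare words can also receive low scores, so a practical pipeline would likely add a minimum document-frequency filter; that threshold is left out here because the record does not state one.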