Dataset of Karakalpak language stop words
The dataset presented in this paper aims to address the challenge of automatic extraction of stop words in Natural Language Processing (NLP) for the low-resource Karakalpak language spoken by approximately two million people in Uzbekistan. To accomplish this, we have created a corpus of 23 Karakalpa...
Main Authors: | Madatov, Khabibulla; Bekchanov, Shukurla; Vičič, Jernej |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2023 |
Subjects: | Data Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126844/ https://www.ncbi.nlm.nih.gov/pubmed/37113499 http://dx.doi.org/10.1016/j.dib.2023.109111 |
_version_ | 1785030347000381440 |
---|---|
author | Madatov, Khabibulla Bekchanov, Shukurla Vičič, Jernej |
author_facet | Madatov, Khabibulla Bekchanov, Shukurla Vičič, Jernej |
author_sort | Madatov, Khabibulla |
collection | PubMed |
description | The dataset presented in this paper aims to address the challenge of automatic extraction of stop words in Natural Language Processing (NLP) for the low-resource Karakalpak language spoken by approximately two million people in Uzbekistan. To accomplish this, we have created a corpus of 23 Karakalpak language school textbooks, which we have named the Karakalpak Language School Corpus (KAASC). Using the KAASC corpus, we have constructed lists of stop words using three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): unigram, bigram, and collocation methods, respectively. The resulting lists of stop words, along with a list of URLs used to construct the corpus, make up the described dataset in this paper. |
format | Online Article Text |
id | pubmed-10126844 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-10126844 2023-04-26 Dataset of Karakalpak language stop words Madatov, Khabibulla Bekchanov, Shukurla Vičič, Jernej Data Brief Data Article The dataset presented in this paper aims to address the challenge of automatic extraction of stop words in Natural Language Processing (NLP) for the low-resource Karakalpak language spoken by approximately two million people in Uzbekistan. To accomplish this, we have created a corpus of 23 Karakalpak language school textbooks, which we have named the Karakalpak Language School Corpus (KAASC). Using the KAASC corpus, we have constructed lists of stop words using three methods based on Term Frequency-Inverse Document Frequency (TF-IDF): unigram, bigram, and collocation methods, respectively. The resulting lists of stop words, along with a list of URLs used to construct the corpus, make up the described dataset in this paper. Elsevier 2023-04-05 /pmc/articles/PMC10126844/ /pubmed/37113499 http://dx.doi.org/10.1016/j.dib.2023.109111 Text en © 2023 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Data Article Madatov, Khabibulla Bekchanov, Shukurla Vičič, Jernej Dataset of Karakalpak language stop words |
title | Dataset of Karakalpak language stop words |
title_full | Dataset of Karakalpak language stop words |
title_fullStr | Dataset of Karakalpak language stop words |
title_full_unstemmed | Dataset of Karakalpak language stop words |
title_short | Dataset of Karakalpak language stop words |
title_sort | dataset of karakalpak language stop words |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126844/ https://www.ncbi.nlm.nih.gov/pubmed/37113499 http://dx.doi.org/10.1016/j.dib.2023.109111 |
work_keys_str_mv | AT madatovkhabibulla datasetofkarakalpaklanguagestopwords AT bekchanovshukurla datasetofkarakalpaklanguagestopwords AT vicicjernej datasetofkarakalpaklanguagestopwords |
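
The description in this record says the stop word lists were built with TF-IDF-based unigram and bigram methods, but it does not give the formula or cut-off the authors used. The following is only a minimal sketch of the general idea, under stated assumptions: whitespace tokenization, `idf = log(N / df)`, and a hypothetical rule that treats n-grams whose average TF-IDF is at or below a threshold as stop word candidates. The function names, the threshold, and the English toy documents are illustrative and are not taken from the dataset; the collocation method is not covered.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real corpus text needs more care).
    return text.lower().split()


def ngrams(tokens, n):
    # Contiguous n-grams joined by a space: n=1 gives unigrams, n=2 gives bigrams.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mean_tfidf(docs, n=1):
    """Return {term: TF-IDF averaged over all documents} for the n-grams in docs."""
    term_lists = [ngrams(tokenize(d), n) for d in docs]

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))

    # Sum each term's TF-IDF over the documents that contain it.
    totals = Counter()
    for terms in term_lists:
        for term, count in Counter(terms).items():
            tf = count / len(terms)
            idf = math.log(len(docs) / doc_freq[term])
            totals[term] += tf * idf

    return {term: totals[term] / len(docs) for term in doc_freq}


def stop_word_candidates(docs, n=1, threshold=0.0):
    # Hypothetical selection rule: keep terms whose mean TF-IDF is <= threshold.
    # With threshold 0.0 this keeps exactly the terms present in every document.
    scores = mean_tfidf(docs, n)
    return sorted(t for t, s in scores.items() if s <= threshold)


# Toy English documents standing in for corpus texts (not real data).
docs = [
    "the cat and the dog",
    "the bird and the fish",
    "the tree and the river",
]
print(stop_word_candidates(docs, n=1))  # ['and', 'the']
print(stop_word_candidates(docs, n=2))  # ['and the']
```

Averaging over documents and thresholding is one of several plausible ways to turn TF-IDF scores into a stop word list; a real run on a corpus would likely need a positive, tuned threshold, since few terms occur in every document.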