Kurdish News Dataset Headlines (KNDH) through multiclass classification

The rapid growth of technology has massively increased the amount of text data. The data can be mined and utilized for numerous natural language processing (NLP) tasks, particularly text classification. The core part of text classification is collecting the data for predicting a good model. This pap...

Bibliographic Details
Main Authors: Badawi, Soran; Saeed, Ari M.; Ahmed, Sara A.; Abdalla, Peshraw Ahmed; Hassan, Diyari A.
Format: Online Article Text
Language: English
Published: Elsevier 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10147969/
https://www.ncbi.nlm.nih.gov/pubmed/37128583
http://dx.doi.org/10.1016/j.dib.2023.109120
author Badawi, Soran
Saeed, Ari M.
Ahmed, Sara A.
Abdalla, Peshraw Ahmed
Hassan, Diyari A.
collection PubMed
description The rapid growth of technology has massively increased the amount of available text data. This data can be mined and used for numerous natural language processing (NLP) tasks, particularly text classification. A core part of text classification is collecting the data needed to train a good predictive model. This paper presents the Kurdish News Dataset Headlines (KNDH) for text classification. The dataset consists of 50,000 news headlines distributed equally among five classes (Social, Sport, Health, Economic, and Technology), with 10,000 headlines per class. The share of headlines contributed by each channel varies, while the number of samples is the same for every category. The headlines were collected from 34 distinct channels: 8 channels for economics, 14 for health, 18 for science, 15 for social, and 5 for sport. The dataset was processed with the Kurdish Language Processing Toolkit (KLPT), which was used for tokenizing, spell-checking, stemming, and preprocessing.
format Online
Article
Text
id pubmed-10147969
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10147969 2023-04-30 Data Brief Data Article Elsevier 2023-04-13 Text en © 2023 The Author(s). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Kurdish News Dataset Headlines (KNDH) through multiclass classification
topic Data Article
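The composition figures quoted in the description field can be sanity-checked with a few lines of Python. This is only a sketch: the label list is a hypothetical stand-in built from the stated sizes, not the real KNDH data, and the helper name `is_balanced` is an assumption introduced here for illustration.

```python
from collections import Counter

# Figures quoted in the record's description field.
CLASSES = ["Social", "Sport", "Health", "Economic", "Technology"]
HEADLINES_PER_CLASS = 10_000
TOTAL_HEADLINES = 50_000
DISTINCT_CHANNELS = 34
# Per-category channel counts as listed in the description (it says
# "science" where the class list says "Technology").
CHANNELS_PER_CATEGORY = {
    "economics": 8,
    "health": 14,
    "science": 18,
    "social": 15,
    "sport": 5,
}


def is_balanced(labels):
    """Return True when every label occurs the same number of times."""
    counts = Counter(labels)
    return len(set(counts.values())) <= 1


# Hypothetical label column built from the stated sizes, not the real data.
labels = [name for name in CLASSES for _ in range(HEADLINES_PER_CLASS)]

total_matches = len(labels) == TOTAL_HEADLINES
balanced = is_balanced(labels)

# The per-category counts add up to more than the 34 distinct channels,
# so, if both figures are right, some channels supply several categories.
channel_slots = sum(CHANNELS_PER_CATEGORY.values())
channels_overlap = channel_slots > DISTINCT_CHANNELS

print(total_matches, balanced, channel_slots, channels_overlap)
```

With the real dataset, the same `is_balanced` check could be run on the actual label column once it has been loaded.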