MadureseSet: Madurese-Indonesian Dataset

MadureseSet is a digitized version of the physical document of Kamus Lengkap Bahasa Madura-Indonesia (The Complete Dictionary of Madurese-Indonesian). It stores the list of lemmata in Madurese, i.e., 17809 basic lemmata and 53722 substitution lemmata, and their translation in Indonesian. The details...

Bibliographic Details
Main Authors:	Ifada, Noor, Rachman, Fika Hastarita, Syauqy, M Wildan Mubarok Asy, Wahyuni, Sri, Pawitra, Adrian
Format:	Online Article Text
Language:	English
Published:	Elsevier 2023
Subjects:	Data Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10040506/ https://www.ncbi.nlm.nih.gov/pubmed/36994143 http://dx.doi.org/10.1016/j.dib.2023.109035

_version_	1784912486795837440
author	Ifada, Noor Rachman, Fika Hastarita Syauqy, M Wildan Mubarok Asy Wahyuni, Sri Pawitra, Adrian
author_facet	Ifada, Noor Rachman, Fika Hastarita Syauqy, M Wildan Mubarok Asy Wahyuni, Sri Pawitra, Adrian
author_sort	Ifada, Noor
collection	PubMed
description	MadureseSet is a digitized version of the physical document of Kamus Lengkap Bahasa Madura-Indonesia (The Complete Dictionary of Madurese-Indonesian). It stores the list of lemmata in Madurese, i.e., 17809 basic lemmata and 53722 substitution lemmata, and their translation in Indonesian. The details of each lemma may include its pronunciation, part of speech, synonym and homonym relations, speech level, dialect, and loanword. The framework of dataset creation consists of three stages. First, the data extraction stage processes the scanned results of the physical document to produce corrected data in a text file. Second, the data structural review stage processes the text file in terms of the paragraph, homonym, synonym, linguistic, poem, short poem, proverb, and metaphor structures to create the data structure that best represents the information in the dictionary. Finally, the database construction stage builds the physical data model and populates the MadureseSet database. MadureseSet is validated by a Madurese language expert who is also the author of the physical document source of this dataset. Thus, this dataset can be a primary source for Natural Language Processing (NLP) research, especially for the Madurese language.
format	Online Article Text
id	pubmed-10040506
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-10040506 2023-03-28 MadureseSet: Madurese-Indonesian Dataset Ifada, Noor Rachman, Fika Hastarita Syauqy, M Wildan Mubarok Asy Wahyuni, Sri Pawitra, Adrian Data Brief Data Article MadureseSet is a digitized version of the physical document of Kamus Lengkap Bahasa Madura-Indonesia (The Complete Dictionary of Madurese-Indonesian). It stores the list of lemmata in Madurese, i.e., 17809 basic lemmata and 53722 substitution lemmata, and their translation in Indonesian. The details of each lemma may include its pronunciation, part of speech, synonym and homonym relations, speech level, dialect, and loanword. The framework of dataset creation consists of three stages. First, the data extraction stage processes the scanned results of the physical document to produce corrected data in a text file. Second, the data structural review stage processes the text file in terms of the paragraph, homonym, synonym, linguistic, poem, short poem, proverb, and metaphor structures to create the data structure that best represents the information in the dictionary. Finally, the database construction stage builds the physical data model and populates the MadureseSet database. MadureseSet is validated by a Madurese language expert who is also the author of the physical document source of this dataset. Thus, this dataset can be a primary source for Natural Language Processing (NLP) research, especially for the Madurese language. Elsevier 2023-03-07 /pmc/articles/PMC10040506/ /pubmed/36994143 http://dx.doi.org/10.1016/j.dib.2023.109035 Text en © 2023 The Authors. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Data Article Ifada, Noor Rachman, Fika Hastarita Syauqy, M Wildan Mubarok Asy Wahyuni, Sri Pawitra, Adrian MadureseSet: Madurese-Indonesian Dataset
title	MadureseSet: Madurese-Indonesian Dataset
title_full	MadureseSet: Madurese-Indonesian Dataset
title_fullStr	MadureseSet: Madurese-Indonesian Dataset
title_full_unstemmed	MadureseSet: Madurese-Indonesian Dataset
title_short	MadureseSet: Madurese-Indonesian Dataset
title_sort	madureseset: madurese-indonesian dataset
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10040506/ https://www.ncbi.nlm.nih.gov/pubmed/36994143 http://dx.doi.org/10.1016/j.dib.2023.109035
work_keys_str_mv	AT ifadanoor maduresesetmadureseindonesiandataset AT rachmanfikahastarita maduresesetmadureseindonesiandataset AT syauqymwildanmubarokasy maduresesetmadureseindonesiandataset AT wahyunisri maduresesetmadureseindonesiandataset AT pawitraadrian maduresesetmadureseindonesiandataset
