Grammar-aware phrase dataset generated using a novel python package

Manual dataset preparation has been time-consuming and labor-intensive. Web scraping has been tried as an alternative data acquisition method, but scraping tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...

Bibliographic Details
Main Authors:	Gemechu, Ebisa A.; Kanagachidambaresan, G.R.
Format:	Online Article Text
Language:	English
Published:	Elsevier 2023
Subjects:	Data Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991/ https://www.ncbi.nlm.nih.gov/pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237

_version_	1785063103361187840
author	Gemechu, Ebisa A. Kanagachidambaresan, G.R.
author_facet	Gemechu, Ebisa A. Kanagachidambaresan, G.R.
author_sort	Gemechu, Ebisa A.
collection	PubMed
description	Manual dataset preparation has been time-consuming and labor-intensive. Web scraping has been tried as an alternative data acquisition method, but scraping tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical features such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach grammatical structures. The method can readily be reproduced for any other language after a systematic analysis and slight modifications to the affix structures in the algorithm.
format	Online Article Text
id	pubmed-10293991
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-10293991 2023-06-28 Grammar-aware phrase dataset generated using a novel python package Gemechu, Ebisa A. Kanagachidambaresan, G.R. Data Brief Data Article Elsevier 2023-05-19 /pmc/articles/PMC10293991/ /pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Data Article Gemechu, Ebisa A. Kanagachidambaresan, G.R. Grammar-aware phrase dataset generated using a novel python package
title	Grammar-aware phrase dataset generated using a novel python package
title_sort	grammar-aware phrase dataset generated using a novel python package
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991/ https://www.ncbi.nlm.nih.gov/pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237
work_keys_str_mv	AT gemechuebisaa grammarawarephrasedatasetgeneratedusinganovelpythonpackage AT kanagachidambaresangr grammarawarephrasedatasetgeneratedusinganovelpythonpackage
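
The pipeline outlined in the description (extract root verbs from raw text, derive stems, then attach affixes and personal pronouns) can be illustrated with a minimal sketch. This is not the Oromo-grammar package's code or API. The infinitive marker, pronouns, and suffixes below belong to an invented toy language and are placeholders only, and all function names are hypothetical. The package reads a raw text file; the sketch takes a string to stay self-contained.

```python
import re

# ASSUMPTION: rule set for an invented toy language. These are placeholders,
# not real Afaan Oromo morphology and not the Oromo-grammar package's rules.
INFINITIVE_MARKER = "el"
PRONOUN_SUFFIXES = [
    # (personal pronoun, verb suffix, grammatical label)
    ("mi", "a", "1st person singular"),
    ("vi", "ta", "2nd person singular"),
    ("ni", "na", "1st person plural"),
]


def extract_root_verbs(text):
    """Collect unique tokens that end in the infinitive marker, in first-seen order."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    verbs = []
    for token in tokens:
        is_verb = token.endswith(INFINITIVE_MARKER) and len(token) > len(INFINITIVE_MARKER)
        if is_verb and token not in verbs:
            verbs.append(token)
    return verbs


def to_stem(verb):
    """Strip the infinitive marker to obtain the stem."""
    return verb[: -len(INFINITIVE_MARKER)]


def synthesize_phrases(verbs):
    """Pair every stem with every pronoun and suffix, keeping the grammar label."""
    rows = []
    for verb in verbs:
        stem = to_stem(verb)
        for pronoun, suffix, label in PRONOUN_SUFFIXES:
            rows.append({
                "phrase": f"{pronoun} {stem}{suffix}",
                "root": verb,
                "stem": stem,
                "grammar": label,
            })
    return rows


# Hypothetical input text in the toy language.
raw_text = "Mi volas kantel kaj dancel. Kantel estas bona."
verbs = extract_root_verbs(raw_text)
phrases = synthesize_phrases(verbs)

for row in phrases:
    print(row["phrase"], "|", row["grammar"])
```

Keeping the affix rules in a data table separate from the functions reflects the description's remark that adapting the method to another language mainly requires changing the affix structures.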
