
Grammar-aware phrase dataset generated using a novel python package

Manual dataset preparation was time-consuming and labor-intensive. Web scraping was another approach to data acquisition, but such tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...


Bibliographic Details
Main Authors: Gemechu, Ebisa A., Kanagachidambaresan, G.R.
Format: Online Article Text
Language: English
Published: Elsevier 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991/
https://www.ncbi.nlm.nih.gov/pubmed/37383805
http://dx.doi.org/10.1016/j.dib.2023.109237
_version_ 1785063103361187840
author Gemechu, Ebisa A.
Kanagachidambaresan, G.R.
author_facet Gemechu, Ebisa A.
Kanagachidambaresan, G.R.
author_sort Gemechu, Ebisa A.
collection PubMed
description Manual dataset preparation was time-consuming and labor-intensive. Web scraping was another approach to data acquisition, but such tools also introduce many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, it synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset marks grammatical features such as number, gender, and case. The output is a grammar-rich dataset that is applicable to modern NLP tasks such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach the grammatical structures of the language. The method can readily be reproduced for any other language after a systematic analysis and slight modifications to the affix structures in the algorithm.
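The pipeline described above (root verbs, then stems, then phrases built from pronouns and affixes) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions and not the Oromo-grammar package's API or rule set: the function names `extract_root_verbs`, `to_stem`, and `synthesize_phrases` are hypothetical, the "token ends in -uu" test is a toy stand-in for real verb extraction, and the suffix table is a simplified past-tense paradigm for regular verbs.

```python
import re

# Assumption: simplified past-tense suffixes for regular verbs, keyed by
# personal pronoun. Illustrative only; not the package's actual affix rules.
PAST_SUFFIXES = {
    "ani": "e",
    "ati": "te",
    "inni": "e",
    "isheen": "te",
    "nuti": "ne",
    "isin": "tan",
    "isaan": "an",
}


def extract_root_verbs(text):
    """Toy heuristic (assumption): treat tokens ending in 'uu' as infinitives."""
    tokens = re.findall(r"[a-z']+", text.lower())
    roots = []
    for token in tokens:
        # Keep first-seen order and skip repeated verbs.
        if token.endswith("uu") and len(token) > 3 and token not in roots:
            roots.append(token)
    return roots


def to_stem(root_verb):
    """Strip the final 'uu' to obtain the stem."""
    return root_verb[:-2]


def synthesize_phrases(stems, suffixes=PAST_SUFFIXES):
    """Combine every stem with every pronoun and its matching suffix."""
    return [
        f"{pronoun} {stem}{suffix}"
        for stem in stems
        for pronoun, suffix in suffixes.items()
    ]


raw_text = "Deemuu fi dhufuu"  # hypothetical input text
roots = extract_root_verbs(raw_text)
stems = [to_stem(verb) for verb in roots]
phrases = synthesize_phrases(stems)

for phrase in phrases:
    print(phrase)
```

Keeping the affixes in a plain dictionary mirrors the abstract's remark about portability: adapting the sketch to another language would mostly mean swapping the pronoun and suffix table and the stem rule.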
format Online
Article
Text
id pubmed-10293991
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-10293991 2023-06-28 Grammar-aware phrase dataset generated using a novel python package Gemechu, Ebisa A. Kanagachidambaresan, G.R. Data Brief Data Article Elsevier 2023-05-19 /pmc/articles/PMC10293991/ /pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Grammar-aware phrase dataset generated using a novel python package
topic Data Article