Grammar-aware phrase dataset generated using a novel python package
Manual dataset preparation was time-consuming and labor-intensive. Web scraping, another approach to data acquisition, also introduces many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that ac...
Main Authors: | Gemechu, Ebisa A., Kanagachidambaresan, G.R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2023 |
Subjects: | Data Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991/ https://www.ncbi.nlm.nih.gov/pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237 |
_version_ | 1785063103361187840 |
---|---|
author | Gemechu, Ebisa A. Kanagachidambaresan, G.R. |
author_facet | Gemechu, Ebisa A. Kanagachidambaresan, G.R. |
author_sort | Gemechu, Ebisa A. |
collection | PubMed |
description | Manual dataset preparation was time-consuming and labor-intensive. Web scraping, another approach to data acquisition, also introduces many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, the algorithm synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical elements such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach the grammatical structures of the language. The method can easily be reproduced for any other language with a systematic analysis and slight modifications to the affix structures in the algorithm. |
format | Online Article Text |
id | pubmed-10293991 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-10293991 2023-06-28 Grammar-aware phrase dataset generated using a novel python package Gemechu, Ebisa A. Kanagachidambaresan, G.R. Data Brief Data Article Manual dataset preparation was time-consuming and labor-intensive. Web scraping, another approach to data acquisition, also introduces many data errors. For this reason, we developed “Oromo-grammar”, a novel Python package that accepts a raw text file from the user, extracts every possible root verb from the text, and stores the verbs in a Python list. Our algorithm then iterates over the list of root verbs to form the corresponding list of stems. Finally, the algorithm synthesizes grammatical phrases using the appropriate affixes and personal pronouns. The generated phrase dataset can indicate grammatical elements such as number, gender, and case. The output is a grammar-rich dataset applicable to modern NLP applications such as machine translation, sentence completion, and grammar and spell checking. The dataset also helps linguists and academics teach the grammatical structures of the language. The method can easily be reproduced for any other language with a systematic analysis and slight modifications to the affix structures in the algorithm. Elsevier 2023-05-19 /pmc/articles/PMC10293991/ /pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237 Text en © 2023 The Author(s) https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | Grammar-aware phrase dataset generated using a novel python package |
topic | Data Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10293991/ https://www.ncbi.nlm.nih.gov/pubmed/37383805 http://dx.doi.org/10.1016/j.dib.2023.109237 |
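
The description in this record outlines a three-step pipeline: extract root verbs from raw text, derive a stem for each root verb, then synthesize phrases by combining stems with affixes and personal pronouns. The following is a minimal sketch of that flow, not the Oromo-grammar package's actual code or API. Every name in it (`extract_root_verbs`, `to_stems`, `synthesize_phrases`), the marker suffix, the pronoun placeholders, and the affix table are hypothetical stand-ins chosen for illustration; the real affix rules would come from a systematic analysis of the target language.

```python
import re

# Hypothetical toy grammar. None of these values are taken from the
# Oromo-grammar package; they only illustrate the shape of the pipeline.
INFINITIVE_SUFFIX = "uu"      # assumed marker of a dictionary-form (root) verb
PERSON_AFFIXES = {            # assumed pronoun placeholder -> verb suffix
    "PRON_1SG": "a",
    "PRON_2SG": "ta",
    "PRON_3SG_F": "ti",
}


def extract_root_verbs(text, suffix=INFINITIVE_SUFFIX):
    """Step 1: collect unique tokens ending in the marker suffix, in first-seen order."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    roots = []
    for token in tokens:
        if token.endswith(suffix) and len(token) > len(suffix) and token not in roots:
            roots.append(token)
    return roots


def to_stems(root_verbs, suffix=INFINITIVE_SUFFIX):
    """Step 2: strip the marker suffix from each root verb to obtain its stem."""
    return [verb[: -len(suffix)] for verb in root_verbs]


def synthesize_phrases(stems, affixes=PERSON_AFFIXES):
    """Step 3: pair every stem with every pronoun and attach the matching suffix."""
    return [
        (pronoun, f"{pronoun} {stem}{affix}")
        for stem in stems
        for pronoun, affix in affixes.items()
    ]


if __name__ == "__main__":
    sample = "Deemuu fi dhufuu; deemuu."  # sample tokens that match the assumed marker
    roots = extract_root_verbs(sample)
    for pronoun, phrase in synthesize_phrases(to_stems(roots)):
        print(pronoun, "->", phrase)
```

Keeping the affix table as plain data separate from the three functions reflects the record's remark that the method could be adapted to another language by modifying the affix structures; under that assumption, only `INFINITIVE_SUFFIX` and `PERSON_AFFIXES` would need to change.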