
Augmented words to improve a deep learning-based Indonesian syllabification

Recent deep learning-based syllabification models generally achieve low error rates for high-resource languages with large datasets but sometimes produce high error rates for low-resource ones. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep lea...


Bibliographic Details
Main authors: Suyanto, Suyanto; Romadhony, Ade; Sthevanie, Febryanti; Ismail, Rezza Nafi
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511842/
https://www.ncbi.nlm.nih.gov/pubmed/34693050
http://dx.doi.org/10.1016/j.heliyon.2021.e08115
author Suyanto, Suyanto
Romadhony, Ade
Sthevanie, Febryanti
Ismail, Rezza Nafi
collection PubMed
description Recent deep learning-based syllabification models generally achieve low error rates for high-resource languages with large datasets but sometimes produce high error rates for low-resource ones. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep learning-based syllabification model that combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF) for Indonesian, a low-resource language. The massive data augmentation comprises four methods: transposing nuclei, swapping consonant-graphemes, flipping onsets, and creating acronyms. The validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods significantly enlarge the dataset, by 12.8M words that are valid according to the phonotactic rules. An evaluation using 5-fold cross-validation shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on both the 50k formal-word and the 100k named-entity datasets. A detailed investigation shows that augmenting the training set can reduce the word error rate (WER) arising from long formal words and named entities.
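The description reports results as word error rate (WER). As a minimal sketch, assuming WER here means the fraction of words whose predicted syllabification differs from the reference one (the record itself does not give the formula), it could be computed as below. The function name `word_error_rate`, the "." boundary marker, and the toy word lists are illustrative assumptions, not taken from the article.

```python
from typing import Sequence


def word_error_rate(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of words whose predicted syllabification differs from the gold one.

    Assumption: a word counts as an error if its whole syllabified form
    does not match the reference exactly.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute WER on an empty list")
    errors = sum(1 for g, p in zip(gold, predicted) if g != p)
    return errors / len(gold)


# Hypothetical toy data: syllable boundaries are marked with "."
gold = ["ma.kan", "mi.num", "ber.ja.lan", "se.ko.lah"]
predicted = ["ma.kan", "min.um", "ber.ja.lan", "sek.o.lah"]

wer = word_error_rate(gold, predicted)
print(f"WER = {wer:.2%}")  # two of the four toy words differ
```

In a 5-fold cross-validation setting such as the one the description mentions, a value like this would typically be computed on each held-out fold and then averaged.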
format Online
Article
Text
id pubmed-8511842
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8511842 2021-10-21 Heliyon Research Article Elsevier 2021-10-05 /pmc/articles/PMC8511842/ /pubmed/34693050 http://dx.doi.org/10.1016/j.heliyon.2021.e08115 Text en © 2021 The Author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Augmented words to improve a deep learning-based Indonesian syllabification
topic Research Article