
Augmented words to improve a deep learning-based Indonesian syllabification


Bibliographic Details
Main authors:	Suyanto, Suyanto; Romadhony, Ade; Sthevanie, Febryanti; Ismail, Rezza Nafi
Format:	Online Article Text
Language:	English
Published:	Elsevier 2021
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511842/ https://www.ncbi.nlm.nih.gov/pubmed/34693050 http://dx.doi.org/10.1016/j.heliyon.2021.e08115

author	Suyanto, Suyanto; Romadhony, Ade; Sthevanie, Febryanti; Ismail, Rezza Nafi
collection	PubMed
description	Recent deep learning-based syllabification models generally achieve low error rates for high-resource languages with large datasets but sometimes produce high error rates for low-resource ones. In this paper, two procedures, massive data augmentation and validation, are proposed to improve a deep learning-based syllabification model that combines bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNN), and conditional random fields (CRF) for Indonesian, a low-resource language. The massive data augmentation comprises four methods: transposing nuclei, swapping consonant-graphemes, flipping onsets, and creating acronyms. The validation is implemented using a phonotactic-based scheme. A preliminary investigation on 50k Indonesian words shows that these augmentation methods significantly enlarge the dataset, by 12.8M words that are valid according to the phonotactic rules. An evaluation using 5-fold cross-validation shows that the augmentation methods significantly improve the BiLSTM-CNN-CRF model on both the 50k formal-word and 100k named-entity datasets. A detailed investigation shows that augmenting the training set can reduce the word error rate (WER) arising from long formal words and named entities.
format	Online Article Text
id	pubmed-8511842
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-8511842 2021-10-21 Heliyon Research Article Elsevier 2021-10-05 /pmc/articles/PMC8511842/ /pubmed/34693050 http://dx.doi.org/10.1016/j.heliyon.2021.e08115 Text en © 2021 The Author(s). This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Augmented words to improve a deep learning-based Indonesian syllabification
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8511842/ https://www.ncbi.nlm.nih.gov/pubmed/34693050 http://dx.doi.org/10.1016/j.heliyon.2021.e08115
