Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Bibliographic Details
Main Authors: Ren, Zeyu, Yolwas, Nurmemet, Slamu, Wushour, Cao, Ronghe, Wang, Huiru
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9571619/
https://www.ncbi.nlm.nih.gov/pubmed/36236419
http://dx.doi.org/10.3390/s22197319
_version_ 1784810407573061632
author Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
author_facet Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
author_sort Ren, Zeyu
collection PubMed
description Unlike traditional models, the end-to-end (E2E) ASR model does not require expert linguistic resources such as a pronunciation dictionary; the system is built from a single neural network and achieves performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely applied to Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and by applying speed perturbation. To improve the performance of an E2E agglutinative-language speech recognition system, we propose a new feature extractor, MSPC, which uses convolution kernels of different sizes to extract and fuse features at different scales. The experimental results show that this structure is superior to VGGnet. In addition, the attention module is improved. By using the CTC objective function in training and a BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset decrease by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to state-of-the-art E2E systems.
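The description combines three ingredients: a multi-scale convolutional front end (MSPC), a hybrid CTC/attention training objective, and a BERT-initialized language model for decoding. The snippet below is a minimal sketch of the first two ideas, assuming a PyTorch implementation; the kernel sizes, the fusion-by-summation rule, the CTC weight of 0.3, and all module names are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of a multi-scale conv front end and the interpolated
# hybrid objective L = lambda * L_CTC + (1 - lambda) * L_att. All parameter
# choices below are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleConvBlock(nn.Module):
    """Parallel convolution branches with different kernel sizes, fused by
    summation, in the spirit of the MSPC front end described above."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq) filter-bank features.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)


class HybridCTCAttentionLoss(nn.Module):
    """Interpolated objective: lambda * L_CTC + (1 - lambda) * L_att."""

    def __init__(self, blank_id: int, pad_id: int, ctc_weight: float = 0.3):
        super().__init__()
        self.ctc_weight = ctc_weight  # lambda; 0.3 is an assumed value
        self.pad_id = pad_id
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, encoder_logits, encoder_lengths,
                decoder_logits, targets, target_lengths):
        # CTC branch: nn.CTCLoss expects (time, batch, vocab) log-probs.
        log_probs = encoder_logits.log_softmax(dim=-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, targets, encoder_lengths, target_lengths)

        # Attention branch: token-level cross-entropy over decoder outputs,
        # ignoring padded positions.
        att_loss = F.cross_entropy(
            decoder_logits.reshape(-1, decoder_logits.size(-1)),
            targets.reshape(-1),
            ignore_index=self.pad_id,
        )
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * att_loss
```

Interpolating the two losses lets the monotonic frame-level CTC alignment regularize the more flexible attention decoder, which is consistent with the abstract's claim that the joint objective accelerates convergence under low-resource training.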
format Online
Article
Text
id pubmed-9571619
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9571619 2022-10-17 Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition Ren, Zeyu Yolwas, Nurmemet Slamu, Wushour Cao, Ronghe Wang, Huiru Sensors (Basel) Article MDPI 2022-09-27 /pmc/articles/PMC9571619/ /pubmed/36236419 http://dx.doi.org/10.3390/s22197319 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_full Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_fullStr Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_full_unstemmed Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_short Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_sort improving hybrid ctc/attention architecture for agglutinative language speech recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9571619/
https://www.ncbi.nlm.nih.gov/pubmed/36236419
http://dx.doi.org/10.3390/s22197319
work_keys_str_mv AT renzeyu improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT yolwasnurmemet improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT slamuwushour improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT caoronghe improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT wanghuiru improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition