Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition

Bibliographic Details
Main Authors: Ren, Zeyu, Yolwas, Nurmemet, Slamu, Wushour, Cao, Ronghe, Wang, Huiru
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9571619/
https://www.ncbi.nlm.nih.gov/pubmed/36236419
http://dx.doi.org/10.3390/s22197319
_version_ 1784810407573061632
author Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
author_facet Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
author_sort Ren, Zeyu
collection PubMed
description Unlike traditional models, the end-to-end (E2E) ASR model does not require expert linguistic resources such as a pronunciation dictionary; the system is built from a single neural network and achieves performance comparable to that of traditional methods. However, the model requires massive amounts of training data. Recently, hybrid CTC/attention ASR systems have become more popular and have achieved good performance even under low-resource conditions, but they are rarely applied to Central Asian languages such as Turkish and Uzbek. We extend the dataset by adding noise to the original audio and by applying speed perturbation. To improve the performance of an E2E agglutinative-language speech recognition system, we propose a new feature extractor, MSPC, which uses convolution kernels of different sizes to extract and fuse features at different scales. The experimental results show that this structure is superior to VGGnet. In addition, the attention module is improved. By using the CTC objective function in training and a BERT model to initialize the language model in the decoding stage, the proposed method accelerates the convergence of the model and improves the accuracy of speech recognition. Compared with the baseline model, the character error rate (CER) and word error rate (WER) on the LibriSpeech test-other dataset decrease by 2.42% and 2.96%, respectively. We apply the model structure to the Common Voice Turkish (35 h) and Uzbek (78 h) datasets, and the WER is reduced by 7.07% and 7.08%, respectively. The results show that our method is close to state-of-the-art E2E systems.
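The description combines three ingredients: a multi-scale convolutional front end (MSPC), a hybrid CTC/attention training objective, and a BERT-initialized language model for decoding. The snippet below is a minimal sketch of the first two ideas, assuming a PyTorch implementation; the kernel sizes, the fusion-by-summation rule, the CTC weight of 0.3, and all module names are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of a multi-scale conv front end and the interpolated
# hybrid objective L = lambda * L_CTC + (1 - lambda) * L_att. All parameter
# choices below are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleConvBlock(nn.Module):
    """Parallel convolution branches with different kernel sizes, fused by
    summation, in the spirit of the MSPC front end described above."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq) filter-bank features.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)


class HybridCTCAttentionLoss(nn.Module):
    """Interpolated objective: lambda * L_CTC + (1 - lambda) * L_att."""

    def __init__(self, blank_id: int, pad_id: int, ctc_weight: float = 0.3):
        super().__init__()
        self.ctc_weight = ctc_weight  # lambda; 0.3 is an assumed value
        self.pad_id = pad_id
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, encoder_logits, encoder_lengths,
                decoder_logits, targets, target_lengths):
        # CTC branch: nn.CTCLoss expects (time, batch, vocab) log-probs.
        log_probs = encoder_logits.log_softmax(dim=-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, targets, encoder_lengths, target_lengths)

        # Attention branch: token-level cross-entropy over decoder outputs,
        # ignoring padded positions.
        att_loss = F.cross_entropy(
            decoder_logits.reshape(-1, decoder_logits.size(-1)),
            targets.reshape(-1),
            ignore_index=self.pad_id,
        )
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * att_loss
```

Interpolating the two losses lets the monotonic frame-level CTC alignment regularize the more flexible attention decoder, which is consistent with the abstract's claim that the joint objective accelerates convergence under low-resource training.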
format Online
Article
Text
id pubmed-9571619
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9571619 2022-10-17 Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition Ren, Zeyu Yolwas, Nurmemet Slamu, Wushour Cao, Ronghe Wang, Huiru Sensors (Basel) Article MDPI 2022-09-27 /pmc/articles/PMC9571619/ /pubmed/36236419 http://dx.doi.org/10.3390/s22197319 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Ren, Zeyu
Yolwas, Nurmemet
Slamu, Wushour
Cao, Ronghe
Wang, Huiru
Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_full Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_fullStr Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_full_unstemmed Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_short Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition
title_sort improving hybrid ctc/attention architecture for agglutinative language speech recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9571619/
https://www.ncbi.nlm.nih.gov/pubmed/36236419
http://dx.doi.org/10.3390/s22197319
work_keys_str_mv AT renzeyu improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT yolwasnurmemet improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT slamuwushour improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT caoronghe improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition
AT wanghuiru improvinghybridctcattentionarchitectureforagglutinativelanguagespeechrecognition