Speech Recognition of Accented Mandarin Based on Improved Conformer

The convolution module in Conformer is capable of providing translationally invariant convolution in time and space. This is often used in Mandarin recognition tasks to address the diversity of speech signals by treating the time-frequency maps of speech signals as images. However, convolutional net...

Bibliographic Details
Main Authors:	Yang, Xing-Yao; Zhang, Shao-Dong; Xiao, Rui; Yu, Jiong; Li, Zi-Yang
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10143886/ https://www.ncbi.nlm.nih.gov/pubmed/37112366 http://dx.doi.org/10.3390/s23084025

_version_	1785033967029714944
author	Yang, Xing-Yao Zhang, Shao-Dong Xiao, Rui Yu, Jiong Li, Zi-Yang
author_facet	Yang, Xing-Yao Zhang, Shao-Dong Xiao, Rui Yu, Jiong Li, Zi-Yang
author_sort	Yang, Xing-Yao
collection	PubMed
description	The convolution module in Conformer is capable of providing translationally invariant convolution in time and space. This is often used in Mandarin recognition tasks to address the diversity of speech signals by treating the time-frequency maps of speech signals as images. However, convolutional networks are more effective in local feature modeling, while dialect recognition tasks require the extraction of a long sequence of contextual information features; therefore, the SE-Conformer-TCN is proposed in this paper. By embedding the squeeze-excitation block into the Conformer, the interdependence between the features of channels can be explicitly modeled to enhance the model’s ability to select interrelated channels, thus increasing the weight of effective speech spectrogram features and decreasing the weight of ineffective or less effective feature maps. The multi-head self-attention and temporal convolutional network is built in parallel, in which the dilated causal convolutions module can cover the input time series by increasing the expansion factor and convolutional kernel to capture the location information implied between the sequences and enhance the model’s access to location information. Experiments on four public datasets demonstrate that the proposed model has a higher performance for the recognition of Mandarin with an accent, and the sentence error rate is reduced by 2.1% compared to the Conformer, with only 4.9% character error rate.
format	Online Article Text
id	pubmed-10143886
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10143886 2023-04-29 Speech Recognition of Accented Mandarin Based on Improved Conformer Yang, Xing-Yao Zhang, Shao-Dong Xiao, Rui Yu, Jiong Li, Zi-Yang Sensors (Basel) Article MDPI 2023-04-16 /pmc/articles/PMC10143886/ /pubmed/37112366 http://dx.doi.org/10.3390/s23084025 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Yang, Xing-Yao Zhang, Shao-Dong Xiao, Rui Yu, Jiong Li, Zi-Yang Speech Recognition of Accented Mandarin Based on Improved Conformer
title	Speech Recognition of Accented Mandarin Based on Improved Conformer
title_full	Speech Recognition of Accented Mandarin Based on Improved Conformer
title_fullStr	Speech Recognition of Accented Mandarin Based on Improved Conformer
title_full_unstemmed	Speech Recognition of Accented Mandarin Based on Improved Conformer
title_short	Speech Recognition of Accented Mandarin Based on Improved Conformer
title_sort	speech recognition of accented mandarin based on improved conformer
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10143886/ https://www.ncbi.nlm.nih.gov/pubmed/37112366 http://dx.doi.org/10.3390/s23084025
work_keys_str_mv	AT yangxingyao speechrecognitionofaccentedmandarinbasedonimprovedconformer AT zhangshaodong speechrecognitionofaccentedmandarinbasedonimprovedconformer AT xiaorui speechrecognitionofaccentedmandarinbasedonimprovedconformer AT yujiong speechrecognitionofaccentedmandarinbasedonimprovedconformer AT liziyang speechrecognitionofaccentedmandarinbasedonimprovedconformer

Similar Items