A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Building a good speech recognition system usually requires a large amount of paired data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has performed well in low-resource speech recognition, but it is rarely applied to Kazakh and o...

Bibliographic Details
Main authors: Meng, Weijing; Yolwas, Nurmemet
Format: Online Article Text
Language: English
Published: MDPI 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9863384/
https://www.ncbi.nlm.nih.gov/pubmed/36679666
http://dx.doi.org/10.3390/s23020870
_version_ 1784875320767152128
author Meng, Weijing
Yolwas, Nurmemet
author_facet Meng, Weijing
Yolwas, Nurmemet
author_sort Meng, Weijing
collection PubMed
description Building a good speech recognition system usually requires a large amount of paired data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has performed well in low-resource speech recognition, but it is rarely applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec2.0 is improved by integrating a Factorized TDNN layer to better preserve the relationship between the voice and the time step before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy was used to learn latent speech representations from a large amount of unlabeled audio data and was applied to the cross-language ASR task, which was optimized using a noise-contrastive binary classification task. In addition, speech synthesis is used to improve speech recognition performance. The experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages, and that multilingual pre-training clearly outperforms single-language pre-training. Data augmentation using speech synthesis also brings substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by 1.9% on average. On the Kazakh KSC test set, pre-training using only Kazakh reduced the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech data achieved a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, which is comparable to the results of the previous end-to-end model while using 30 times less labeled data.
format Online
Article
Text
id pubmed-9863384
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9863384 2023-01-22 A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training Meng, Weijing Yolwas, Nurmemet Sensors (Basel) Article
MDPI 2023-01-12 /pmc/articles/PMC9863384/ /pubmed/36679666 http://dx.doi.org/10.3390/s23020870 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Meng, Weijing
Yolwas, Nurmemet
A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title_full A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title_fullStr A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title_full_unstemmed A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title_short A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
title_sort study of speech recognition for kazakh based on unsupervised pre-training
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9863384/
https://www.ncbi.nlm.nih.gov/pubmed/36679666
http://dx.doi.org/10.3390/s23020870