
End-to-end emotional speech recognition using acoustic model adaptation based on knowledge distillation


Bibliographic Details
Main Authors: Yun, Hong-In; Park, Jeong-Sik
Format: Online Article, Text
Language: English
Published: Springer US, 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9923643/
https://www.ncbi.nlm.nih.gov/pubmed/36817556
http://dx.doi.org/10.1007/s11042-023-14680-y
Abstract: The end-to-end approach provides better performance in speech recognition than the traditional hidden Markov model-deep neural network (HMM-DNN) approach, but it still performs poorly on atypical speech, especially emotional speech. The optimal solution would be to build an acoustic model suited to emotional speech recognition using only emotional speech data for each emotion, but this is impractical because it is difficult to collect a sufficient amount of emotional speech data per emotion. In this study, we propose a method to improve emotional speech recognition performance using knowledge distillation, a technique originally introduced to decrease the computational intensity of deep learning-based approaches by reducing the number of model parameters. Beyond model compression, we employ this technique for model adaptation to emotional speech. The proposed method builds a basic model (referred to as a teacher model) with a large number of parameters from a large amount of normal speech data, and then constructs a target model (referred to as a student model) with fewer parameters from a small amount of emotional speech data (i.e., adaptation data). Since the student model is built with emotional speech data, it is expected to reflect the emotional characteristics of each emotion well. In the emotional speech recognition experiments, the student model maintained recognition performance regardless of the number of model parameters, whereas the teacher model degraded significantly as the number of parameters decreased, showing a degradation of about 10% in word error rate. This result demonstrates that the student model serves as an acoustic model suitable for emotional speech recognition even though it does not require much emotional speech data.

Journal: Multimed Tools Appl
Collection: PubMed
Published online: 2023-02-13

License: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
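The teacher/student scheme described in the abstract can be sketched as a standard knowledge-distillation loss. This is a minimal NumPy illustration, not the authors' implementation: the temperature `T`, the mixing weight `alpha`, and the combination of a softened KL term with a hard-label cross-entropy term are common distillation conventions assumed here, not details taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields softer distributions."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Weighted sum of (a) the KL divergence between the temperature-softened
    teacher and student output distributions and (b) the cross-entropy of the
    student against the hard label from the (emotional) adaptation data."""
    p_teacher = softmax(teacher_logits, T)   # soft targets from the teacher
    p_student = softmax(student_logits, T)   # student's softened prediction
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)))
    hard = softmax(student_logits)           # T = 1 for the hard-label term
    ce = -np.log(hard[label] + 1e-12)
    # T**2 rescales the gradient magnitude of the softened KL term
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce
```

With `alpha = 1.0` only the teacher's soft targets drive the student; with `alpha = 0.0` the student trains on the hard emotional-speech labels alone. The paper's adaptation setting corresponds to training the smaller student on a small amount of emotional speech while the KL term transfers knowledge from the large teacher trained on normal speech.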