End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network
Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal...
Main Authors: | Tang, Duowei; Kuppens, Peter; Geurts, Luc; van Waterschoot, Toon |
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2021 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8550764/ https://www.ncbi.nlm.nih.gov/pubmed/34721556 http://dx.doi.org/10.1186/s13636-021-00208-5 |
Field | Value
---|---
_version_ | 1784591025374756864
author | Tang, Duowei Kuppens, Peter Geurts, Luc van Waterschoot, Toon |
author_facet | Tang, Duowei Kuppens, Peter Geurts, Luc van Waterschoot, Toon |
author_sort | Tang, Duowei |
collection | PubMed |
description | Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information. |
format | Online Article Text |
id | pubmed-8550764 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-8550764 2021-10-29 End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network Tang, Duowei; Kuppens, Peter; Geurts, Luc; van Waterschoot, Toon. EURASIP J Audio Speech Music Process, Research. Springer International Publishing 2021-05-12 2021 /pmc/articles/PMC8550764/ /pubmed/34721556 http://dx.doi.org/10.1186/s13636-021-00208-5 Text en © The Author(s) 2021. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Research Tang, Duowei Kuppens, Peter Geurts, Luc van Waterschoot, Toon End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title | End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title_full | End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title_fullStr | End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title_full_unstemmed | End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title_short | End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
title_sort | end-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8550764/ https://www.ncbi.nlm.nih.gov/pubmed/34721556 http://dx.doi.org/10.1186/s13636-021-00208-5 |
work_keys_str_mv | AT tangduowei endtoendspeechemotionrecognitionusinganovelcontextstackingdilatedconvolutionneuralnetwork AT kuppenspeter endtoendspeechemotionrecognitionusinganovelcontextstackingdilatedconvolutionneuralnetwork AT geurtsluc endtoendspeechemotionrecognitionusinganovelcontextstackingdilatedconvolutionneuralnetwork AT vanwaterschoottoon endtoendspeechemotionrecognitionusinganovelcontextstackingdilatedconvolutionneuralnetwork |
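The architecture outlined in the abstract (a stack of dilated causal convolutions whose receptive field covers the whole input, combined with a context-stacking structure that replaces RNN layers) can be illustrated with a short sketch. The sketch below assumes PyTorch; the channel counts, kernel width, number of layers, residual connections, and the pooling/upsampling used to build and stack the long-term context are all assumptions made for illustration, since the record does not give the paper's exact configuration. It is a sketch of the general technique, not the authors' model.

```python
# Illustrative sketch only (not the authors' code): dilated causal 1-D
# convolutions whose dilation doubles per layer, so the receptive field grows
# exponentially with depth, plus a simple "context stacking" step that
# concatenates a coarser long-term context with the frame-level features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedCausalConvBlock(nn.Module):
    """Stack of dilated causal convolutions with dilations 1, 2, 4, ..."""

    def __init__(self, channels: int, kernel_size: int = 2, num_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            # Left-only padding keeps the convolution causal: the output at
            # time t depends only on inputs at times <= t.
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.layers):
            residual = x
            x = F.pad(x, (pad, 0))      # pad on the left only (causal)
            x = torch.relu(conv(x))
            x = x + residual            # residual connection (assumed)
        return x


class ContextStackingSER(nn.Module):
    """Frame-level features stacked with a coarser long-term context."""

    def __init__(self, in_channels: int = 128, channels: int = 64, num_outputs: int = 2):
        super().__init__()
        self.frontend = nn.Conv1d(in_channels, channels, kernel_size=1)
        self.local = DilatedCausalConvBlock(channels)
        self.context = DilatedCausalConvBlock(channels)
        self.head = nn.Conv1d(2 * channels, num_outputs, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, time), e.g. a log mel-spectrogram
        x = self.frontend(x)
        local = self.local(x)
        # Long-term context: average-pool in time, process with another
        # dilated block, upsample back, then stack (concatenate) with the
        # frame-level features.
        ctx = F.avg_pool1d(x, kernel_size=8, stride=8)
        ctx = self.context(ctx)
        ctx = F.interpolate(ctx, size=local.shape[-1], mode="nearest")
        stacked = torch.cat([local, ctx], dim=1)
        return self.head(stacked)       # per-frame emotion predictions


if __name__ == "__main__":
    model = ContextStackingSER()
    dummy = torch.randn(1, 128, 400)    # 400 frames of 128 mel bands
    print(model(dummy).shape)           # -> torch.Size([1, 2, 400])
```

With kernel size 2 and dilations 1, 2, ..., 2^(L-1), the receptive field of one block spans 2^L frames, so temporal coverage comes from depth rather than recurrence; every layer is an ordinary convolution, which is what makes the model fully parallelisable in the sense described in the abstract.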