
Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance across different speakers. The performance of SER depends heavily on the features extracted from the speech signal, and building an effective feature-extraction and classification model remains a challenge. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. A DCNN model pre-trained on the ImageNet dataset is then applied to generate segment-level features, which are stacked into utterance-level features for each sentence. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on the emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, outperforming most popular SER methods and demonstrating the effectiveness of the proposed method.
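The three-channel input described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the 16 kHz sample rate, the 400/160-sample window and hop, and the 64 Mel bands are assumptions chosen for illustration.

```python
# Minimal sketch of a three-channel log-Mel input (static, delta, delta-delta),
# using librosa. Sample rate, window/hop sizes, and Mel-band count are
# illustrative assumptions, not values taken from the paper.
import librosa
import numpy as np

def three_channel_log_mel(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels
    )
    log_mel = librosa.power_to_db(mel)                 # static channel
    delta = librosa.feature.delta(log_mel)             # first temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second temporal derivative
    # Stack into an image-like (3, n_mels, frames) array as the DCNN input.
    return np.stack([log_mel, delta, delta2], axis=0)
```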


Bibliographic Details
Main Authors: Zhang, Hua, Gou, Ruoyun, Shang, Jili, Shen, Fangyao, Wu, Yifan, Dai, Guojun
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Physiology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7962985/
https://www.ncbi.nlm.nih.gov/pubmed/33737889
http://dx.doi.org/10.3389/fphys.2021.643202
_version_ 1783665555461373952
author Zhang, Hua
Gou, Ruoyun
Shang, Jili
Shen, Fangyao
Wu, Yifan
Dai, Guojun
author_facet Zhang, Hua
Gou, Ruoyun
Shang, Jili
Shen, Fangyao
Wu, Yifan
Dai, Guojun
author_sort Zhang, Hua
collection PubMed
description Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance across different speakers. The performance of SER depends heavily on the features extracted from the speech signal, and building an effective feature-extraction and classification model remains a challenge. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. A DCNN model pre-trained on the ImageNet dataset is then applied to generate segment-level features, which are stacked into utterance-level features for each sentence. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on the emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, outperforming most popular SER methods and demonstrating the effectiveness of the proposed method.
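The pipeline in this description (pretrained DCNN for segment-level features, BLSTM for temporal summarization, attention pooling, DNN classifier) could look roughly like the PyTorch sketch below. The abstract does not name the backbone or the layer sizes; AlexNet and the hidden dimensions here are illustrative assumptions.

```python
# Hedged sketch of a DCNN-BLSTMwA-style model: an ImageNet-pretrained CNN
# encodes each segment, a bidirectional LSTM summarizes the segment sequence,
# additive attention pools it into an utterance vector, and a small DNN
# predicts the emotion. Backbone and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class DCNNBLSTMwA(nn.Module):
    def __init__(self, num_emotions: int = 7, hidden: int = 128):
        super().__init__()
        # AlexNet stands in for the unspecified ImageNet-pretrained DCNN.
        backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.cnn = nn.Sequential(
            backbone.features,        # convolutional feature extractor
            nn.AdaptiveAvgPool2d(1),  # pool each segment to one vector
            nn.Flatten(),             # -> (batch * segments, 256)
        )
        self.blstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # additive attention scores
        self.dnn = nn.Sequential(             # final DNN classifier
            nn.Linear(2 * hidden, 64), nn.ReLU(),
            nn.Linear(64, num_emotions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, segments, 3, mel_bands, frames) -- three-channel log-Mels
        b, s = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, s, -1)  # segment-level features
        h, _ = self.blstm(feats)                          # (b, s, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)            # attention weights over time
        utterance = (w * h).sum(dim=1)                    # attention-pooled utterance vector
        return self.dnn(utterance)                        # emotion logits
```

Called on a batch of shape (batch, segments, 3, 64, 64), the model returns one logit vector per utterance; the softmax over attention scores lets the utterance vector weight emotionally salient segments more heavily than neutral ones.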
format Online
Article
Text
id pubmed-7962985
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-7962985 2021-03-17 Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition Zhang, Hua Gou, Ruoyun Shang, Jili Shen, Fangyao Wu, Yifan Dai, Guojun Front Physiol Physiology Speech emotion recognition (SER) is a difficult and challenging task because of the affective variance across different speakers. The performance of SER depends heavily on the features extracted from the speech signal, and building an effective feature-extraction and classification model remains a challenge. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. A DCNN model pre-trained on the ImageNet dataset is then applied to generate segment-level features, which are stacked into utterance-level features for each sentence. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on the emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, outperforming most popular SER methods and demonstrating the effectiveness of the proposed method. Frontiers Media S.A. 2021-03-02 /pmc/articles/PMC7962985/ /pubmed/33737889 http://dx.doi.org/10.3389/fphys.2021.643202 Text en Copyright © 2021 Zhang, Gou, Shang, Shen, Wu and Dai. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
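The unweighted average recall (UAR) reported in this record is the macro-average of per-class recall, which makes it insensitive to class imbalance across emotions. A minimal check with scikit-learn, using toy labels for illustration only:

```python
# UAR = mean of per-class recalls; scikit-learn's macro-averaged recall.
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 2, 2, 2]   # toy ground-truth emotion labels
y_pred = [0, 1, 1, 2, 2, 0]   # toy predictions
uar = recall_score(y_true, y_pred, average="macro")  # mean per-class recall
print(f"UAR = {uar:.4f}")     # 0.7222 for these toy labels
```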
spellingShingle Physiology
Zhang, Hua
Gou, Ruoyun
Shang, Jili
Shen, Fangyao
Wu, Yifan
Dai, Guojun
Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title_full Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title_fullStr Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title_full_unstemmed Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title_short Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
title_sort pre-trained deep convolution neural network model with attention for speech emotion recognition
topic Physiology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7962985/
https://www.ncbi.nlm.nih.gov/pubmed/33737889
http://dx.doi.org/10.3389/fphys.2021.643202
work_keys_str_mv AT zhanghua pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition
AT gouruoyun pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition
AT shangjili pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition
AT shenfangyao pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition
AT wuyifan pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition
AT daiguojun pretraineddeepconvolutionneuralnetworkmodelwithattentionforspeechemotionrecognition