Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions

Bibliographic Details
Main authors:	Nam, Youngja; Lee, Chankyu
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8271804/ https://www.ncbi.nlm.nih.gov/pubmed/34199027 http://dx.doi.org/10.3390/s21134399

collection	PubMed
description	Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
id	pubmed-8271804
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	Sensors (Basel)
date	2021-06-27
license	© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Similar items