A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition
Main Authors: | Mustaqeem, Kwon, Soonil |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2019 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6982825/ https://www.ncbi.nlm.nih.gov/pubmed/31905692 http://dx.doi.org/10.3390/s20010183 |
_version_ | 1783491378032934912 |
---|---|
author | Mustaqeem, Kwon, Soonil |
author_facet | Mustaqeem, Kwon, Soonil |
author_sort | Mustaqeem, |
collection | PubMed |
description | Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker’s emotional state from an individual’s speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art, and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to improve performance. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than with pooling layers, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, while reducing the model size by 34.5 MB. These results demonstrate the effectiveness and significance of the proposed SER technique and its applicability in real-world applications. (An illustrative code sketch of the strided-convolution pipeline described here follows the record fields below.) |
format | Online Article Text |
id | pubmed-6982825 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-6982825 2020-02-06 A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition Mustaqeem, Kwon, Soonil Sensors (Basel) Article Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker’s emotional state from an individual’s speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art, and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to improve performance. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than with pooling layers, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, while reducing the model size by 34.5 MB. These results demonstrate the effectiveness and significance of the proposed SER technique and its applicability in real-world applications. MDPI 2019-12-28 /pmc/articles/PMC6982825/ /pubmed/31905692 http://dx.doi.org/10.3390/s20010183 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Mustaqeem, Kwon, Soonil A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title | A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title_full | A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title_fullStr | A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title_full_unstemmed | A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title_short | A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition |
title_sort | cnn-assisted enhanced audio signal processing for speech emotion recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6982825/ https://www.ncbi.nlm.nih.gov/pubmed/31905692 http://dx.doi.org/10.3390/s20010183 |
work_keys_str_mv | AT mustaqeem acnnassistedenhancedaudiosignalprocessingforspeechemotionrecognition AT kwonsoonil acnnassistedenhancedaudiosignalprocessingforspeechemotionrecognition AT mustaqeem cnnassistedenhancedaudiosignalprocessingforspeechemotionrecognition AT kwonsoonil cnnassistedenhancedaudiosignalprocessingforspeechemotionrecognition |
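The abstract describes the model only at a high level: spectrogram inputs, stacked convolutional layers that use strides instead of pooling to down-sample feature maps (the "plain nets" style), fully connected layers, and a SoftMax output over emotion classes. The snippet below is a minimal, hypothetical PyTorch sketch of that kind of pipeline, not the authors' implementation; the mel-spectrogram settings, layer counts, channel widths, and the four-emotion output are assumptions made purely for illustration.

```python
# Hypothetical sketch of a "deep stride CNN" for speech emotion recognition,
# in the spirit of the abstract: strided convolutions replace pooling layers,
# fully connected layers learn global features, and a softmax scores emotions.
# All layer sizes and signal-processing parameters are illustrative assumptions.

import torch
import torch.nn as nn
import torchaudio


class StrideCNN(nn.Module):
    def __init__(self, num_emotions: int = 4):
        super().__init__()
        # Convolutional blocks; stride=2 down-samples the feature maps
        # in place of max/average pooling.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Fully connected layers learn global discriminative features;
        # the final layer produces one logit per emotion class.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_emotions),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, time) log-mel spectrogram.
        logits = self.classifier(self.features(spectrogram))
        # For training one would return raw logits and use nn.CrossEntropyLoss
        # (which applies softmax internally); softmax here yields per-emotion
        # probabilities for illustration.
        return torch.softmax(logits, dim=1)


def to_log_mel(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # Spectrogram front end; parameters are placeholders, not the paper's settings.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=512, hop_length=256, n_mels=64
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)


if __name__ == "__main__":
    wav = torch.randn(1, 16000)          # 1 s of dummy audio at 16 kHz
    spec = to_log_mel(wav).unsqueeze(0)  # (batch=1, channel=1, n_mels, time)
    probs = StrideCNN(num_emotions=4)(spec)
    print(probs.shape)                   # torch.Size([1, 4])
```

Replacing pooling with stride-2 convolutions lets the network learn its own down-sampling while removing the pooling layers entirely, which is one plausible way the reported reductions in computational complexity and model size can be realized.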