Cargando…

Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention

Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction...

Descripción completa

Detalles Bibliográficos
Autor principal: Alsabhan, Waleed
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9921095/
https://www.ncbi.nlm.nih.gov/pubmed/36772427
http://dx.doi.org/10.3390/s23031386
_version_ 1784887229701685248
author Alsabhan, Waleed
author_facet Alsabhan, Waleed
author_sort Alsabhan, Waleed
collection PubMed
description Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction, and it has recently attracted more significant interest. This is because there are not so many universal emotions; therefore, any intelligent system with enough computational capacity can educate itself to recognise them. However, the issue is that human speech is immensely diverse, making it difficult to create a single, standardised recipe for detecting hidden emotions. This work attempted to solve this research difficulty by combining a multilingual emotional dataset with building a more generalised and effective model for recognising human emotions. A two-step process was used to develop the model. The first stage involved the extraction of features, and the second stage involved the classification of the features that were extracted. ZCR, RMSE, and the renowned MFC coefficients were retrieved as features. Two proposed models, 1D CNN combined with LSTM and attention and a proprietary 2D CNN architecture, were used for classification. The outcomes demonstrated that the suggested 1D CNN with LSTM and attention performed better than the 2D CNN. For the EMO-DB, SAVEE, ANAD, and BAVED datasets, the model’s accuracy was 96.72%, 97.13%, 96.72%, and 88.39%, respectively. The model beat several earlier efforts on the same datasets, demonstrating the generality and efficacy of recognising multiple emotions from various languages.
format Online
Article
Text
id pubmed-9921095
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-99210952023-02-12 Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention Alsabhan, Waleed Sensors (Basel) Article Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction, and it has recently attracted more significant interest. This is because there are not so many universal emotions; therefore, any intelligent system with enough computational capacity can educate itself to recognise them. However, the issue is that human speech is immensely diverse, making it difficult to create a single, standardised recipe for detecting hidden emotions. This work attempted to solve this research difficulty by combining a multilingual emotional dataset with building a more generalised and effective model for recognising human emotions. A two-step process was used to develop the model. The first stage involved the extraction of features, and the second stage involved the classification of the features that were extracted. ZCR, RMSE, and the renowned MFC coefficients were retrieved as features. Two proposed models, 1D CNN combined with LSTM and attention and a proprietary 2D CNN architecture, were used for classification. The outcomes demonstrated that the suggested 1D CNN with LSTM and attention performed better than the 2D CNN. For the EMO-DB, SAVEE, ANAD, and BAVED datasets, the model’s accuracy was 96.72%, 97.13%, 96.72%, and 88.39%, respectively. The model beat several earlier efforts on the same datasets, demonstrating the generality and efficacy of recognising multiple emotions from various languages. MDPI 2023-01-26 /pmc/articles/PMC9921095/ /pubmed/36772427 http://dx.doi.org/10.3390/s23031386 Text en © 2023 by the author. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Alsabhan, Waleed
Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title_full Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title_fullStr Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title_full_unstemmed Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title_short Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
title_sort human–computer interaction with a real-time speech emotion recognition with ensembling techniques 1d convolution neural network and attention
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9921095/
https://www.ncbi.nlm.nih.gov/pubmed/36772427
http://dx.doi.org/10.3390/s23031386
work_keys_str_mv AT alsabhanwaleed humancomputerinteractionwitharealtimespeechemotionrecognitionwithensemblingtechniques1dconvolutionneuralnetworkandattention