
3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. Firstly, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering on the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
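The preprocessing pipeline in the abstract (overlapping frames, per-frame acoustic features, k-means keyframe selection, spectrogram tensor) can be sketched compactly. The following is a minimal illustration, not the authors' code: it assumes librosa and scikit-learn, substitutes a reduced feature set (13 MFCC means, a YIN pitch estimate, RMS intensity) for the paper's 88-dimensional vector, and the frame length, overlap, and k are placeholder values.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def speech_to_tensor(path, k=9, frame_sec=0.5, hop_sec=0.25):
    """Summarize one utterance as a (k, freq, time) spectrogram tensor.

    frame_sec, hop_sec, k, and the reduced feature set are illustrative
    stand-ins for the paper's settings, which the record does not give.
    """
    y, sr = librosa.load(path, sr=None)
    flen, hop = int(frame_sec * sr), int(hop_sec * sr)

    # 1. Split the signal into fixed-length overlapping frames.
    frames = librosa.util.frame(y, frame_length=flen, hop_length=hop).T

    feats, specs = [], []
    for f in frames:
        # 2. Per-frame features: MFCC means, mean pitch, mean intensity.
        mfcc = librosa.feature.mfcc(y=f, sr=sr, n_mfcc=13).mean(axis=1)
        f0 = librosa.yin(f, fmin=60, fmax=400, sr=sr).mean()
        rms = librosa.feature.rms(y=f).mean()
        feats.append(np.concatenate([mfcc, [f0, rms]]))

        # 3. In parallel, keep the frame's log-magnitude spectrogram.
        spec = np.abs(librosa.stft(f, n_fft=512, hop_length=128))
        specs.append(librosa.amplitude_to_db(spec, ref=np.max))
    feats = np.array(feats)

    # 4. Cluster the frame features; the frame nearest each centroid
    #    is kept as a keyframe.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    idx = sorted(int(np.argmin(np.linalg.norm(feats - c, axis=1)))
                 for c in km.cluster_centers_)

    # 5. Stack the k keyframe spectrograms into one 3D tensor.
    return np.stack([specs[i] for i in idx])
```

Stacked over a dataset and given a trailing channel axis, these (k, freq, time) tensors form the input volumes for the 3D CNN.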

Bibliographic Details
Main Authors: Hajarolasvadi, Noushin; Demirel, Hasan
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514968/
https://www.ncbi.nlm.nih.gov/pubmed/33267193
http://dx.doi.org/10.3390/e21050479
_version_ 1783586710157787136
author Hajarolasvadi, Noushin
Demirel, Hasan
author_facet Hajarolasvadi, Noushin
Demirel, Hasan
author_sort Hajarolasvadi, Noushin
collection PubMed
description Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. Firstly, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering on the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
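The record specifies the classifier only at the level of layer counts: two 3D convolutional layers and one fully connected layer, trained and tested with 10-fold cross-validation. A hedged Keras sketch consistent with that description follows; filter counts, kernel sizes, pooling, and the optimizer are assumptions, not values from the paper.

```python
from tensorflow.keras import layers, models

def build_3d_cnn(input_shape, n_classes):
    """Two 3D conv layers plus one fully connected layer, matching the
    abstract's layer counts; all hyperparameters are illustrative guesses."""
    model = models.Sequential([
        layers.Input(shape=input_shape),  # e.g., (k, freq, time, 1)
        layers.Conv3D(32, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),
        layers.Conv3D(64, (3, 3, 3), activation="relu", padding="same"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),  # the FC layer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

For the 10-fold protocol, each fold would rebuild this model and fit it on the training split of the keyframe tensors, e.g., via sklearn.model_selection.StratifiedKFold(n_splits=10).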
format Online
Article
Text
id pubmed-7514968
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7514968 2020-11-09
3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
Hajarolasvadi, Noushin; Demirel, Hasan
Entropy (Basel) Article
MDPI 2019-05-08 /pmc/articles/PMC7514968/ /pubmed/33267193 http://dx.doi.org/10.3390/e21050479
Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Hajarolasvadi, Noushin
Demirel, Hasan
3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title_full 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title_fullStr 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title_full_unstemmed 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title_short 3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms
title_sort 3d cnn-based speech emotion recognition using k-means clustering and spectrograms
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514968/
https://www.ncbi.nlm.nih.gov/pubmed/33267193
http://dx.doi.org/10.3390/e21050479
work_keys_str_mv AT hajarolasvadinoushin 3dcnnbasedspeechemotionrecognitionusingkmeansclusteringandspectrograms
AT demirelhasan 3dcnnbasedspeechemotionrecognitionusingkmeansclusteringandspectrograms