Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition
Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, they show reduced performance. Cross-corpus SER research identifies speech emotion across different corpora and languages, and recent work has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), a 2D CNN base model with channel- and spatial-wise visual attention modules, on the log-mel spectrograms of the source dataset. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in an image, the BOVW helps the VACNN learn global and local features of the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves overall accuracies of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively, improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
Main Authors: | Seo, Minji; Kim, Myungho |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2020 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7583996/ https://www.ncbi.nlm.nih.gov/pubmed/32998382 http://dx.doi.org/10.3390/s20195559 |
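The pipeline described in the abstract combines three ingredients: a log-mel spectrogram of the utterance, a 2D CNN with channel- and spatial-wise visual attention, and a bag-of-visual-words (BOVW) frequency histogram that supplies local features. The sketch below (Python with librosa, PyTorch, and scikit-learn) only illustrates those ingredients under assumed settings; the mel resolution, the CBAM-style attention design, the patch size, and the codebook size are illustrative assumptions, not the authors' released VACNN implementation.

```python
# Minimal sketch of the abstract's three ingredients; hyperparameters are
# illustrative assumptions, not the authors' VACNN configuration.
import numpy as np
import librosa
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


def log_mel(path, sr=16000, n_mels=64):
    """Log-mel spectrogram of one utterance (sr and n_mels are assumptions)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)            # shape: (n_mels, frames)


class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel-wise then spatial-wise attention over a CNN feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled channel descriptors.
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * gate.view(b, c, 1, 1)
        # Spatial attention from per-position average and max across channels.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))


def bovw_histogram(spec, kmeans, patch=8):
    """Normalized frequency histogram of visual words over spectrogram patches.

    `kmeans` is assumed to be fit beforehand on patches pooled from the source corpus.
    """
    patches = [spec[i:i + patch, j:j + patch].ravel()
               for i in range(0, spec.shape[0] - patch + 1, patch)
               for j in range(0, spec.shape[1] - patch + 1, patch)]
    words = kmeans.predict(np.asarray(patches))
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```

In the paper's setup the BOVW histogram assists the fine-tuned CNN on the target corpus; one straightforward way to realize that under these assumptions is to concatenate the normalized histogram with the attention-CNN embedding before the emotion classifier.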
_version_ | 1783599507268698112 |
---|---|
author | Seo, Minji; Kim, Myungho |
author_facet | Seo, Minji; Kim, Myungho |
author_sort | Seo, Minji |
collection | PubMed |
description | Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested using different datasets, they have shown performance reduction. Cross-corpus SER research identifies speech emotion using different corpora and languages. Recent cross-corpus SER research has been conducted to improve generalization. To improve the cross-corpus SER performance, we pretrained the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train the target dataset, we extracted the feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% compared to existing state-of-the-art cross-corpus SER approaches. |
format | Online Article Text |
id | pubmed-7583996 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-7583996 2020-10-29 Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition Seo, Minji Kim, Myungho Sensors (Basel) Article Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested using different datasets, they have shown performance reduction. Cross-corpus SER research identifies speech emotion using different corpora and languages. Recent cross-corpus SER research has been conducted to improve generalization. To improve the cross-corpus SER performance, we pretrained the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train the target dataset, we extracted the feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% compared to existing state-of-the-art cross-corpus SER approaches. MDPI 2020-09-28 /pmc/articles/PMC7583996/ /pubmed/32998382 http://dx.doi.org/10.3390/s20195559 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Seo, Minji Kim, Myungho Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title | Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title_full | Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title_fullStr | Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title_full_unstemmed | Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title_short | Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition |
title_sort | fusing visual attention cnn and bag of visual words for cross-corpus speech emotion recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7583996/ https://www.ncbi.nlm.nih.gov/pubmed/32998382 http://dx.doi.org/10.3390/s20195559 |
work_keys_str_mv | AT seominji fusingvisualattentioncnnandbagofvisualwordsforcrosscorpusspeechemotionrecognition AT kimmyungho fusingvisualattentioncnnandbagofvisualwordsforcrosscorpusspeechemotionrecognition |