Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.
Main Authors: | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/ https://www.ncbi.nlm.nih.gov/pubmed/32932939 http://dx.doi.org/10.3390/s20185184 |
_version_ | 1783597085552017408 |
---|---|
author | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
author_facet | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
author_sort | Lee, Min Kyu |
collection | PubMed |
description | Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks. |
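The description above names two mechanisms that can be made concrete: a frame substitution module that overwrites the latent features of less important frames using inter-frame correlation, and feature-level multi-modal fusion via per-modality attention. The sketch below is a minimal NumPy illustration, not the authors' implementation; the correlation-based importance score, the `keep_ratio` parameter, and the mean-based attention scoring are all assumptions standing in for the paper's learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def substitute_frames(frame_feats, keep_ratio=0.5):
    """Replace the latent features of the least important frames with those
    of the most important frame, scoring importance by mean inter-frame
    correlation (an assumed stand-in for a learned criterion)."""
    corr = np.corrcoef(frame_feats)             # (T, T) frame-to-frame correlation
    importance = corr.mean(axis=1)              # higher = more correlated with the clip
    order = np.argsort(importance)              # frame indices, ascending importance
    n_replace = int(len(frame_feats) * (1 - keep_ratio))
    out = frame_feats.copy()
    out[order[:n_replace]] = frame_feats[order[-1]]  # copy in the top frame's features
    return out

def attention_fuse(video_feat, landmark_feat):
    """Feature-level fusion: each modality gets a scalar attention score
    (here simply the feature mean, a placeholder for a scoring subnetwork),
    and a softmax over the two scores weights the feature vectors."""
    scores = np.array([video_feat.mean(), landmark_feat.mean()])
    w = softmax(scores)
    return w[0] * video_feat + w[1] * landmark_feat
```

With `keep_ratio=0.5`, half of a clip's per-frame feature vectors are overwritten by the highest-scoring frame before the sequence is passed on (to the RNN, in the paper's pipeline); `attention_fuse` then combines the video and landmark representations into a single feature vector.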
format | Online Article Text |
id | pubmed-7571042 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-75710422020-10-28 Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol Sensors (Basel) Article MDPI 2020-09-11 /pmc/articles/PMC7571042/ /pubmed/32932939 http://dx.doi.org/10.3390/s20185184 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article; Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol; Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_full | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_fullStr | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_full_unstemmed | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_short | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_sort | visual scene-aware hybrid and multi-modal feature aggregation for facial expression recognition † |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/ https://www.ncbi.nlm.nih.gov/pubmed/32932939 http://dx.doi.org/10.3390/s20185184 |
work_keys_str_mv | AT leeminkyu visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition AT kimdaeha visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition AT songbyungcheol visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition |