
Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †

Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.
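The abstract describes two concrete mechanisms: substituting the latent features of weakly correlated frames with those of a representative frame, and fusing video and landmark features at the feature level with per-modality attention. The following is a minimal, hypothetical PyTorch sketch of what such components could look like; it is not the authors' implementation, and every name, dimension, the cosine-similarity substitution criterion, and the sigmoid gating used for attention are assumptions made only for illustration.

```python
# Hypothetical sketch, NOT the paper's released code. Assumed shapes: per-clip
# latent features (T, D) from a CNN/RNN backbone, plus a pooled landmark feature.
import torch
import torch.nn as nn
import torch.nn.functional as F


def substitute_frames(latent, threshold=0.5):
    """Replace latent features of weakly correlated frames (assumed criterion).

    latent: (T, D) per-frame latent features for one clip.
    Frames whose cosine similarity to the clip mean falls below `threshold`
    are replaced by the most correlated (most representative) frame.
    """
    clip_mean = latent.mean(dim=0, keepdim=True)           # (1, D)
    corr = F.cosine_similarity(latent, clip_mean, dim=1)   # (T,) inter-frame correlation proxy
    key_idx = corr.argmax()                                 # most representative frame
    weak = corr < threshold                                 # frames to substitute
    out = latent.clone()
    out[weak] = latent[key_idx]
    return out


class AttentionFusion(nn.Module):
    """Feature-level fusion of video and landmark features with per-modality attention."""

    def __init__(self, video_dim=512, landmark_dim=128, fused_dim=256, num_classes=7):
        super().__init__()
        # Each modality is gated by attention computed from its own characteristics.
        self.video_gate = nn.Sequential(nn.Linear(video_dim, video_dim), nn.Sigmoid())
        self.landmark_gate = nn.Sequential(nn.Linear(landmark_dim, landmark_dim), nn.Sigmoid())
        self.proj = nn.Linear(video_dim + landmark_dim, fused_dim)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, video_feat, landmark_feat):
        v = video_feat * self.video_gate(video_feat)
        l = landmark_feat * self.landmark_gate(landmark_feat)
        fused = torch.relu(self.proj(torch.cat([v, l], dim=-1)))
        return self.classifier(fused)


if __name__ == "__main__":
    frames = torch.randn(16, 512)                  # 16 frames, 512-D latent each
    frames = substitute_frames(frames)
    video_feat = frames.mean(dim=0, keepdim=True)  # stand-in for an RNN clip summary
    landmark_feat = torch.randn(1, 128)            # stand-in landmark feature
    logits = AttentionFusion()(video_feat, landmark_feat)
    print(logits.shape)                            # torch.Size([1, 7])
```

The sketch only shows the data flow the abstract implies (substitute, gate each modality, concatenate, classify); the actual correlation measure, attention form, and dimensions used in the paper may differ.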


Bibliographic Details
Main Authors: Lee, Min Kyu, Kim, Dae Ha, Song, Byung Cheol
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/
https://www.ncbi.nlm.nih.gov/pubmed/32932939
http://dx.doi.org/10.3390/s20185184
_version_ 1783597085552017408
author Lee, Min Kyu
Kim, Dae Ha
Song, Byung Cheol
author_facet Lee, Min Kyu
Kim, Dae Ha
Song, Byung Cheol
author_sort Lee, Min Kyu
collection PubMed
description Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.
format Online
Article
Text
id pubmed-7571042
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-75710422020-10-28 Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † Lee, Min Kyu Kim, Dae Ha Song, Byung Cheol Sensors (Basel) Article Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks. MDPI 2020-09-11 /pmc/articles/PMC7571042/ /pubmed/32932939 http://dx.doi.org/10.3390/s20185184 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Lee, Min Kyu
Kim, Dae Ha
Song, Byung Cheol
Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title_full Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title_fullStr Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title_full_unstemmed Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title_short Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
title_sort visual scene-aware hybrid and multi-modal feature aggregation for facial expression recognition †
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/
https://www.ncbi.nlm.nih.gov/pubmed/32932939
http://dx.doi.org/10.3390/s20185184
work_keys_str_mv AT leeminkyu visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition
AT kimdaeha visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition
AT songbyungcheol visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition