Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition †
Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.
Main Authors: | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/ https://www.ncbi.nlm.nih.gov/pubmed/32932939 http://dx.doi.org/10.3390/s20185184 |
_version_ | 1783597085552017408 |
---|---|
author | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
author_facet | Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol |
author_sort | Lee, Min Kyu |
collection | PubMed |
description | Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks. |
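The description above names two mechanisms that can be made concrete: a frame substitution module that overwrites the latent features of less important frames using inter-frame correlation, and feature-level multi-modal fusion via per-modality attention. The sketch below is a minimal NumPy illustration, not the authors' implementation; the correlation-based importance score, the `keep_ratio` parameter, and the mean-based attention scoring are all assumptions standing in for the paper's learned components.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def substitute_frames(frame_feats, keep_ratio=0.5):
    """Replace the latent features of the least important frames with those
    of the most important frame, scoring importance by mean inter-frame
    correlation (an assumed stand-in for a learned criterion)."""
    corr = np.corrcoef(frame_feats)             # (T, T) frame-to-frame correlation
    importance = corr.mean(axis=1)              # higher = more correlated with the clip
    order = np.argsort(importance)              # frame indices, ascending importance
    n_replace = int(len(frame_feats) * (1 - keep_ratio))
    out = frame_feats.copy()
    out[order[:n_replace]] = frame_feats[order[-1]]  # copy in the top frame's features
    return out

def attention_fuse(video_feat, landmark_feat):
    """Feature-level fusion: each modality gets a scalar attention score
    (here simply the feature mean, a placeholder for a scoring subnetwork),
    and a softmax over the two scores weights the feature vectors."""
    scores = np.array([video_feat.mean(), landmark_feat.mean()])
    w = softmax(scores)
    return w[0] * video_feat + w[1] * landmark_feat
```

With `keep_ratio=0.5`, half of a clip's per-frame feature vectors are overwritten by the highest-scoring frame before the sequence is passed on (to the RNN, in the paper's pipeline); `attention_fuse` then combines the video and landmark representations into a single feature vector.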
format | Online Article Text |
id | pubmed-7571042 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-75710422020-10-28 Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol Sensors (Basel) Article MDPI 2020-09-11 /pmc/articles/PMC7571042/ /pubmed/32932939 http://dx.doi.org/10.3390/s20185184 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article; Lee, Min Kyu; Kim, Dae Ha; Song, Byung Cheol; Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_full | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_fullStr | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_full_unstemmed | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_short | Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition † |
title_sort | visual scene-aware hybrid and multi-modal feature aggregation for facial expression recognition † |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571042/ https://www.ncbi.nlm.nih.gov/pubmed/32932939 http://dx.doi.org/10.3390/s20185184 |
work_keys_str_mv | AT leeminkyu visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition AT kimdaeha visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition AT songbyungcheol visualsceneawarehybridandmultimodalfeatureaggregationforfacialexpressionrecognition |