SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition
Despite advanced machine learning methods, the implementation of emotion recognition systems based on real-world video content remains challenging. Videos may contain data such as images, audio, and text. However, applying multimodal models that use two or more types of data to real-world video media (CCTV, illegally filmed content, etc.) lacking sound or subtitles is difficult. Although facial expressions in image sequences can be utilized for emotion recognition, the diverse identities of individuals in real-world content limit the computational modeling of relationships between facial expressions. This study proposed a transformation model that employed a video vision transformer to focus on facial expression sequences in videos. It effectively understood and extracted facial expression information regardless of the individual's identity, instead of fusing multimodal models. The design captured higher-quality facial expression information through mixed-token embedding, which embeds facial expression sequences augmented via various methods into a single data representation, and comprised two modules: a spatial encoder and a temporal encoder. Further, a temporal position embedding focusing on relationships between video frames was proposed and applied to the temporal encoder module. The performance of the proposed algorithm was compared with that of conventional methods on two video-content emotion recognition datasets, with results demonstrating its superiority.
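For orientation, the sketch below illustrates the two-module structure the abstract describes: a spatial encoder attending over facial patch tokens within each frame, followed by a temporal encoder over per-frame features with a learnable temporal position embedding added. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the class name SpatioTemporalFER, the layer counts, dimensions, and mean-pooling choices are all hypothetical, and the paper's mixed-token embedding of augmented sequences is abstracted into the precomputed token input.

```python
# Minimal sketch (not the authors' code): a spatial encoder over per-frame
# facial patch tokens, then a temporal encoder with a learnable temporal
# position embedding, following the two-module design the abstract outlines.
import torch
import torch.nn as nn

class SpatioTemporalFER(nn.Module):  # hypothetical name for illustration
    def __init__(self, num_frames=16, embed_dim=256, num_heads=8, num_classes=7):
        super().__init__()
        # Spatial encoder: self-attention among the patch tokens of one frame.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       batch_first=True),
            num_layers=2,
        )
        # Temporal position embedding: one learned vector per frame index,
        # modeling relationships between video frames.
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames, embed_dim))
        # Temporal encoder: self-attention among per-frame summary features.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                       batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, frames, patches, dim) facial patch embeddings; the
        # paper's mixed-token embedding of augmented sequences would produce
        # this input and is omitted here.
        b, t, p, d = tokens.shape
        x = self.spatial(tokens.reshape(b * t, p, d))  # per-frame attention
        frames = x.mean(dim=1).reshape(b, t, d)        # one feature per frame
        frames = frames + self.temporal_pos            # add temporal positions
        video = self.temporal(frames).mean(dim=1)      # video-level feature
        return self.head(video)                        # emotion logits
```

With a dummy input torch.randn(2, 16, 49, 256) (two clips, 16 frames, 49 patch tokens of dimension 256), the forward pass returns logits of shape (2, 7).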
| Main Authors: | Kim, Nayeon; Cho, Sukhee; Bae, Byungjun |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | MDPI 2022 |
| Subjects: | Article |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9371125/ https://www.ncbi.nlm.nih.gov/pubmed/35957313 http://dx.doi.org/10.3390/s22155753 |
_version_ | 1784767039202656256 |
---|---|
author | Kim, Nayeon; Cho, Sukhee; Bae, Byungjun
author_facet | Kim, Nayeon; Cho, Sukhee; Bae, Byungjun
author_sort | Kim, Nayeon |
collection | PubMed |
description | Despite advanced machine learning methods, the implementation of emotion recognition systems based on real-world video content remains challenging. Videos may contain data such as images, audio, and text. However, applying multimodal models that use two or more types of data to real-world video media (CCTV, illegally filmed content, etc.) lacking sound or subtitles is difficult. Although facial expressions in image sequences can be utilized for emotion recognition, the diverse identities of individuals in real-world content limit the computational modeling of relationships between facial expressions. This study proposed a transformation model that employed a video vision transformer to focus on facial expression sequences in videos. It effectively understood and extracted facial expression information regardless of the individual's identity, instead of fusing multimodal models. The design captured higher-quality facial expression information through mixed-token embedding, which embeds facial expression sequences augmented via various methods into a single data representation, and comprised two modules: a spatial encoder and a temporal encoder. Further, a temporal position embedding focusing on relationships between video frames was proposed and applied to the temporal encoder module. The performance of the proposed algorithm was compared with that of conventional methods on two video-content emotion recognition datasets, with results demonstrating its superiority.
format | Online Article Text |
id | pubmed-9371125 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9371125 2022-08-12 SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition Kim, Nayeon Cho, Sukhee Bae, Byungjun Sensors (Basel) Article Despite advanced machine learning methods, the implementation of emotion recognition systems based on real-world video content remains challenging. Videos may contain data such as images, audio, and text. However, applying multimodal models that use two or more types of data to real-world video media (CCTV, illegally filmed content, etc.) lacking sound or subtitles is difficult. Although facial expressions in image sequences can be utilized for emotion recognition, the diverse identities of individuals in real-world content limit the computational modeling of relationships between facial expressions. This study proposed a transformation model that employed a video vision transformer to focus on facial expression sequences in videos. It effectively understood and extracted facial expression information regardless of the individual's identity, instead of fusing multimodal models. The design captured higher-quality facial expression information through mixed-token embedding, which embeds facial expression sequences augmented via various methods into a single data representation, and comprised two modules: a spatial encoder and a temporal encoder. Further, a temporal position embedding focusing on relationships between video frames was proposed and applied to the temporal encoder module. The performance of the proposed algorithm was compared with that of conventional methods on two video-content emotion recognition datasets, with results demonstrating its superiority. MDPI 2022-08-01 /pmc/articles/PMC9371125/ /pubmed/35957313 http://dx.doi.org/10.3390/s22155753 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle | Article Kim, Nayeon Cho, Sukhee Bae, Byungjun SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title | SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title_full | SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title_fullStr | SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title_full_unstemmed | SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title_short | SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition |
title_sort | smate: a segment-level feature mixing and temporal encoding framework for facial expression recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9371125/ https://www.ncbi.nlm.nih.gov/pubmed/35957313 http://dx.doi.org/10.3390/s22155753 |
work_keys_str_mv | AT kimnayeon smateasegmentlevelfeaturemixingandtemporalencodingframeworkforfacialexpressionrecognition AT chosukhee smateasegmentlevelfeaturemixingandtemporalencodingframeworkforfacialexpressionrecognition AT baebyungjun smateasegmentlevelfeaturemixingandtemporalencodingframeworkforfacialexpressionrecognition |