
Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation

Spatiotemporal and motion feature representations are the key to video action recognition. Typical previous approaches utilize 3D CNNs to cope with both spatial and temporal features, but they suffer from huge computational costs. Other approaches utilize (1+2)D CNNs to learn spatial and temporal features efficiently, but they neglect the importance of motion representations. To overcome the problems with previous approaches, we propose a novel block that captures spatial and temporal features more faithfully and learns motion features efficiently. The proposed block comprises Motion Excitation (ME), Multi-view Excitation (MvE), and Densely Connected Temporal Aggregation (DCTA). ME encodes feature-level frame differences; MvE adaptively enriches spatiotemporal features with multiple view representations; and DCTA models long-range temporal dependencies. We inject the proposed building block, which we refer to as the META block (or simply "META"), into 2D ResNet-50. Through extensive experiments, we demonstrate that our proposed architecture outperforms previous CNN-based methods in terms of the "Val Top-1 %" measure on the Something-Something v1 and Jester datasets, while META yields competitive results on the Moment-in-Time Mini dataset.
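The record contains no implementation details. As a rough illustration of the Motion Excitation idea described in the abstract, where feature-level frame differences are turned into a channel-wise attention gate, here is a minimal PyTorch-style sketch. It assumes a TEA-style ME design; the class name, reduction ratio, and (N*T, C, H, W) tensor layout are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a Motion Excitation (ME) style module, assuming a
# TEA-like design: feature-level frame differences drive channel attention.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segments: int, reduction: int = 16):
        super().__init__()
        self.n_segments = n_segments
        reduced = channels // reduction  # illustrative reduction ratio
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3,
                                   padding=1, groups=reduced, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W), frames of each clip stacked along the batch axis
        nt, c, h, w = x.shape
        t = self.n_segments
        f = self.squeeze(x).view(nt // t, t, -1, h, w)  # (N, T, C/r, H, W)
        # feature-level frame difference: transform(f_{t+1}) - f_t
        nxt = self.transform(f[:, 1:].reshape(-1, f.size(2), h, w))
        diff = nxt.view(nt // t, t - 1, -1, h, w) - f[:, :-1]
        # pad the last time step with zeros to restore length T
        diff = torch.cat([diff, diff.new_zeros(nt // t, 1, f.size(2), h, w)], dim=1)
        # spatial pooling + channel expansion -> motion-attentive gate
        gate = self.sigmoid(self.expand(self.pool(diff.view(nt, -1, h, w))))
        return x + x * gate  # residual excitation of motion-sensitive channels
```

Per the abstract, the full META block combines ME with MvE and DCTA inside the residual blocks of a 2D ResNet-50; the sketch above covers only a plausible ME component.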

Bibliographic Details
Main Authors: Joefrie, Yuri Yudhaswana; Aono, Masaki
Format: Online Article Text
Language: English
Published in: Entropy (Basel)
Published: MDPI, 2022-11-15
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9689149/
https://www.ncbi.nlm.nih.gov/pubmed/36421524
http://dx.doi.org/10.3390/e24111663
License: © 2022 by the authors. Open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).