
Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation

Spatiotemporal and motion feature representations are the key to video action recognition. Typical previous approaches utilize 3D CNNs to cope with both spatial and temporal features, but they suffer from huge computational costs. Other approaches utilize (1+2)D CNNs to learn spatial and temporal features efficiently, but they neglect the importance of motion representations. To overcome the problems with previous approaches, we propose a novel block that captures spatial and temporal features more faithfully and learns motion features efficiently. The proposed block comprises Motion Excitation (ME), Multi-view Excitation (MvE), and Densely Connected Temporal Aggregation (DCTA). ME encodes feature-level frame differences; MvE adaptively enriches spatiotemporal features with multiple view representations; and DCTA models long-range temporal dependencies. We inject the proposed building block, which we refer to as the META block (or simply "META"), into 2D ResNet-50. Through extensive experiments, we demonstrate that our proposed architecture outperforms previous CNN-based methods in terms of the "Val Top-1 %" measure on the Something-Something v1 and Jester datasets, while META yields competitive results on the Moment-in-Time Mini dataset.
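The record contains no implementation details. As a rough illustration of the Motion Excitation idea described in the abstract, where feature-level frame differences are turned into a channel-wise attention gate, here is a minimal PyTorch-style sketch. It assumes a TEA-style ME design; the class name, reduction ratio, and (N*T, C, H, W) tensor layout are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a Motion Excitation (ME) style module, assuming a
# TEA-like design: feature-level frame differences drive channel attention.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segments: int, reduction: int = 16):
        super().__init__()
        self.n_segments = n_segments
        reduced = channels // reduction  # illustrative reduction ratio
        self.squeeze = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.transform = nn.Conv2d(reduced, reduced, kernel_size=3,
                                   padding=1, groups=reduced, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W), frames of each clip stacked along the batch axis
        nt, c, h, w = x.shape
        t = self.n_segments
        f = self.squeeze(x).view(nt // t, t, -1, h, w)  # (N, T, C/r, H, W)
        # feature-level frame difference: transform(f_{t+1}) - f_t
        nxt = self.transform(f[:, 1:].reshape(-1, f.size(2), h, w))
        diff = nxt.view(nt // t, t - 1, -1, h, w) - f[:, :-1]
        # pad the last time step with zeros to restore length T
        diff = torch.cat([diff, diff.new_zeros(nt // t, 1, f.size(2), h, w)], dim=1)
        # spatial pooling + channel expansion -> motion-attentive gate
        gate = self.sigmoid(self.expand(self.pool(diff.view(nt, -1, h, w))))
        return x + x * gate  # residual excitation of motion-sensitive channels
```

Per the abstract, the full META block combines ME with MvE and DCTA inside the residual blocks of a 2D ResNet-50; the sketch above covers only a plausible ME component.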

Bibliographic Details
Main Authors: Joefrie, Yuri Yudhaswana; Aono, Masaki
Format: Online Article Text
Language: English
Published in: Entropy (Basel)
Published: MDPI, 2022-11-15
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9689149/
https://www.ncbi.nlm.nih.gov/pubmed/36421524
http://dx.doi.org/10.3390/e24111663
License: © 2022 by the authors. Open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).