Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation
Spatiotemporal and motion feature representations are the key to video action recognition. Typical previous approaches utilize 3D CNNs to cope with both spatial and temporal features, but they suffer from huge computational costs. Other approaches utilize (1+2)D CNNs to learn spatial and temporal features efficiently, but they neglect the importance of motion representations. To overcome the problems with previous approaches, we propose a novel block that alleviates them, since it captures spatial and temporal features more faithfully and learns motion features efficiently. The proposed block comprises Motion Excitation (ME), Multi-view Excitation (MvE), and Densely Connected Temporal Aggregation (DCTA). ME encodes feature-level frame differences; MvE adaptively enriches spatiotemporal features with multiple view representations; and DCTA models long-range temporal dependencies. We inject the proposed building block, which we refer to as the META block (or simply “META”), into 2D ResNet-50. Through extensive experiments, we demonstrate that the proposed architecture outperforms previous CNN-based methods in terms of the “Val Top-1 %” measure on the Something-Something v1 and Jester datasets, while META yields competitive results on the Moments-in-Time Mini dataset.
Main Authors: | Joefrie, Yuri Yudhaswana; Aono, Masaki |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2022 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9689149/ https://www.ncbi.nlm.nih.gov/pubmed/36421524 http://dx.doi.org/10.3390/e24111663 |
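The abstract describes ME only at a high level: encode feature-level frame differences and use them to excite motion-relevant features. The following is a minimal PyTorch sketch of that idea under stated assumptions; the module name, reduction ratio, depthwise transform, zero-padding of the last frame, and residual modulation are illustrative guesses, not details drawn from the paper.

```python
# Hypothetical sketch of Motion Excitation (ME): differences between the
# transformed features of adjacent frames are pooled into a channel-wise
# attention signal. All structural details here are assumptions.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels: int, n_segments: int, reduction: int = 16):
        super().__init__()
        self.n_segments = n_segments
        mid = channels // reduction
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        # per-channel spatial transform applied to the "next" frame
        self.transform = nn.Conv2d(mid, mid, kernel_size=3, padding=1,
                                   groups=mid, bias=False)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W), frames of each clip stacked along the batch axis
        nt, _, h, w = x.shape
        t = self.n_segments
        n = nt // t
        feat = self.squeeze(x).view(n, t, -1, h, w)
        # feature-level frame difference: transform(frame_{t+1}) - frame_t
        nxt = self.transform(feat[:, 1:].reshape(n * (t - 1), -1, h, w))
        diff = nxt.view(n, t - 1, -1, h, w) - feat[:, :-1]
        # zero-pad the last step so the temporal length matches the input
        diff = torch.cat([diff, diff.new_zeros(n, 1, diff.size(2), h, w)], dim=1)
        attn = torch.sigmoid(self.expand(self.pool(diff.view(nt, -1, h, w))))
        # residual excitation: modulate channels but keep the original features
        return x + x * attn
```

Shape-wise the module is a drop-in: `MotionExcitation(64, n_segments=8)` maps a `(16, 64, 56, 56)` tensor (2 clips of 8 frames) to the same shape, which is what lets a block like this slot into a 2D backbone without changing the surrounding layers.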
_version_ | 1784836456888401920 |
---|---|
author | Joefrie, Yuri Yudhaswana; Aono, Masaki |
author_facet | Joefrie, Yuri Yudhaswana; Aono, Masaki |
author_sort | Joefrie, Yuri Yudhaswana |
collection | PubMed |
description | Spatiotemporal and motion feature representations are the key to video action recognition. Typical previous approaches utilize 3D CNNs to cope with both spatial and temporal features, but they suffer from huge computational costs. Other approaches utilize (1+2)D CNNs to learn spatial and temporal features efficiently, but they neglect the importance of motion representations. To overcome the problems with previous approaches, we propose a novel block that alleviates them, since it captures spatial and temporal features more faithfully and learns motion features efficiently. The proposed block comprises Motion Excitation (ME), Multi-view Excitation (MvE), and Densely Connected Temporal Aggregation (DCTA). ME encodes feature-level frame differences; MvE adaptively enriches spatiotemporal features with multiple view representations; and DCTA models long-range temporal dependencies. We inject the proposed building block, which we refer to as the META block (or simply “META”), into 2D ResNet-50. Through extensive experiments, we demonstrate that the proposed architecture outperforms previous CNN-based methods in terms of the “Val Top-1 %” measure on the Something-Something v1 and Jester datasets, while META yields competitive results on the Moments-in-Time Mini dataset. |
format | Online Article Text |
id | pubmed-9689149 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9689149 2022-11-25 Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation Joefrie, Yuri Yudhaswana; Aono, Masaki Entropy (Basel) Article MDPI 2022-11-15 /pmc/articles/PMC9689149/ /pubmed/36421524 http://dx.doi.org/10.3390/e24111663 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article; Joefrie, Yuri Yudhaswana; Aono, Masaki; Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title | Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title_full | Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title_fullStr | Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title_full_unstemmed | Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title_short | Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation |
title_sort | video action recognition using motion and multi-view excitation with temporal aggregation |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9689149/ https://www.ncbi.nlm.nih.gov/pubmed/36421524 http://dx.doi.org/10.3390/e24111663 |
work_keys_str_mv | AT joefrieyuriyudhaswana videoactionrecognitionusingmotionandmultiviewexcitationwithtemporalaggregation AT aonomasaki videoactionrecognitionusingmotionandmultiviewexcitationwithtemporalaggregation |
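The record says only that the META block is injected into 2D ResNet-50, not where it sits inside each residual unit. The sketch below shows one plausible (assumed) injection point, wrapping each torchvision bottleneck so the excitation runs between `conv1` and `conv2`; `MotionExcitation` from the sketch above stands in for the full ME + MvE + DCTA composition.

```python
# Hypothetical injection of a META-style block into torchvision's ResNet-50.
# The placement before conv2 is an assumption for illustration.
import torch.nn as nn
from torchvision.models import resnet50

class BottleneckWithMeta(nn.Module):
    def __init__(self, bottleneck: nn.Module, n_segments: int):
        super().__init__()
        self.block = bottleneck
        # stand-in for the full META block (ME + MvE + DCTA);
        # MotionExcitation is defined in the earlier sketch
        self.meta = MotionExcitation(bottleneck.conv1.out_channels, n_segments)

    def forward(self, x):
        identity = x
        out = self.block.relu(self.block.bn1(self.block.conv1(x)))
        out = self.meta(out)  # excite motion-sensitive channels
        out = self.block.relu(self.block.bn2(self.block.conv2(out)))
        out = self.block.bn3(self.block.conv3(out))
        if self.block.downsample is not None:
            identity = self.block.downsample(x)
        return self.block.relu(out + identity)

def build_backbone(n_segments: int = 8) -> nn.Module:
    net = resnet50(weights=None)
    for name in ("layer1", "layer2", "layer3", "layer4"):
        stage = getattr(net, name)
        for i in range(len(stage)):
            stage[i] = BottleneckWithMeta(stage[i], n_segments)
    return net
```

Input to the whole backbone would then be the usual `(N*T, 3, H, W)` frame batch, with clip-level predictions obtained by averaging the per-frame outputs over `T`, mirroring common 2D-CNN video pipelines such as TSN.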