Non-Local Temporal Difference Network for Temporal Action Detection

As an important part of video understanding, temporal action detection (TAD) has wide application scenarios. It aims to simultaneously predict the boundary position and class label of every action instance in an untrimmed video. Most of the existing temporal action detection methods adopt a stacked...

Bibliographic Details
Main authors: He, Yilong; Han, Xiao; Zhong, Yong; Wang, Lishun
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9655564/
https://www.ncbi.nlm.nih.gov/pubmed/36366106
http://dx.doi.org/10.3390/s22218396
author He, Yilong
Han, Xiao
Zhong, Yong
Wang, Lishun
collection PubMed
description As an important part of video understanding, temporal action detection (TAD) has a wide range of applications. It aims to predict both the boundary positions and the class label of every action instance in an untrimmed video. Most existing TAD methods stack convolutional blocks to model long temporal structures. However, most of the information in adjacent frames is redundant, and information from distant frames is weakened after multiple convolution operations. In addition, the durations of action instances vary widely, making it difficult for single-scale modeling to fit complex video structures. To address these issues, we propose a non-local temporal difference network (NTD), which comprises a chunk convolution (CC) module, a multiple temporal coordination (MTC) module, and a temporal difference (TD) module. The TD module adaptively enhances motion information and boundary features with temporal attention weights. The CC module evenly divides the input sequence into N chunks and uses multiple independent convolution blocks to simultaneously extract features from neighboring chunks. It thereby delivers information from distant frames without being confined to local convolution. The MTC module uses a cascaded residual architecture that aggregates multiscale temporal features without introducing additional parameters. NTD achieves state-of-the-art performance on two large-scale datasets: 36.2% mAP@avg on ActivityNet-v1.3 and 71.6% mAP@0.5 on THUMOS-14.
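The mAP@0.5 figure in the description relies on temporal intersection-over-union (tIoU): under the usual TAD evaluation convention, a predicted segment counts as a match when its tIoU with a ground-truth instance of the same class is at least 0.5, and mAP@avg averages mAP over several such thresholds. The following is a minimal sketch of that matching test only, not the authors' evaluation code; the function names `temporal_iou` and `is_match` and the example segments are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True when the prediction overlaps the ground truth enough (assumed rule)."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Hypothetical prediction and ground-truth boundaries.
    print(temporal_iou((2.0, 8.0), (3.0, 9.0)))
    print(is_match((2.0, 8.0), (3.0, 9.0)))
```

A full mAP computation would additionally rank predictions by confidence and match each ground-truth instance at most once; that part is omitted here.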
format Online
Article
Text
id pubmed-9655564
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9655564 2022-11-15 Sensors (Basel) Article MDPI 2022-11-01 /pmc/articles/PMC9655564/ /pubmed/36366106 http://dx.doi.org/10.3390/s22218396 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Non-Local Temporal Difference Network for Temporal Action Detection
topic Article