WLiT: Windows and Linear Transformer for Video Action Recognition

The emergence of the Transformer has driven rapid progress in video understanding, but it also brings the problem of high computational complexity. Previous methods divide the feature maps into windows along the spatiotemporal dimensions and then compute attention within each window; others down-sample the features during attention computation to reduce their spatiotemporal resolution. Although these approaches effectively reduce complexity, there is still room for further optimization. We therefore present the Windows and Linear Transformer (WLiT) for efficient video action recognition, which combines Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions only and compute attention separately inside each window, so our model further reduces computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small and cannot capture global spatiotemporal information. To address this, we additionally compute Linear attention along the channel dimension, allowing the model to capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with lower computational complexity. We conduct extensive experiments on four public datasets: Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On SSV2, our method reduces computational complexity by 28% and improves recognition accuracy by 1.6% compared with the state-of-the-art (SOTA) method. On K400 and the two other datasets, our method achieves SOTA-level accuracy while reducing complexity by about 49%.
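
To make the mechanism concrete, the following is a minimal PyTorch sketch of the two attention patterns the abstract describes, assuming a (B, T, H, W, C) feature layout; the function names, window size, head count, and the omitted q/k/v projections are illustrative assumptions, not the authors' released code. Restricting attention to w×w spatial windows reduces the O((T·H·W)²·C) cost of full spatiotemporal attention to O(T·H·W·w²·C), and the channel-wise Linear attention restores global spatiotemporal mixing at O(T·H·W·C²), which is linear in the number of tokens.

```python
# Minimal sketch of the two attention patterns from the abstract. This is
# NOT the authors' released implementation: tensor layout, window size,
# head count, and the shared q/k/v (projections omitted) are assumptions.
import torch

def spatial_window_attention(x, window, heads):
    """Self-attention restricted to non-overlapping window x window patches
    of each frame. x: (B, T, H, W, C); H and W must be divisible by window."""
    B, T, H, W, C = x.shape
    hd = C // heads
    # Partition every frame into windows -> (B*T*num_windows, window*window, C).
    x = x.view(B * T, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    # Learned q/k/v projections are omitted for brevity.
    q = k = v = x.view(-1, window * window, heads, hd).transpose(1, 2)
    attn = (q @ k.transpose(-2, -1)) / hd ** 0.5        # (.., N, N), N = window^2
    out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, window * window, C)
    # Undo the window partition back to (B, T, H, W, C).
    out = out.view(B * T, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)

def channel_linear_attention(x):
    """Linear attention over the channel axis. x: (B, N, C) with N = T*H*W
    flattened spatiotemporal tokens; cost is O(N*C^2), linear in N."""
    q = k = v = x                      # projections again omitted for brevity
    q = q.softmax(dim=-1)              # normalize each token over channels
    k = k.softmax(dim=1)               # normalize each channel over all tokens
    context = k.transpose(-2, -1) @ v  # (B, C, C) channel-to-channel mixing
    return q @ context                 # (B, N, C), carries global information

# Toy usage with assumed feature sizes.
feats = torch.randn(2, 8, 56, 56, 96)                   # (B, T, H, W, C)
local_out = spatial_window_attention(feats, window=7, heads=3)
global_out = channel_linear_attention(feats.flatten(1, 3))
print(local_out.shape, global_out.shape)                # (2,8,56,56,96), (2,25088,96)
```

WLiT combines the two so that the cheap windowed attention captures local spatial detail while the channel attention supplies the global spatiotemporal context the windows lack; the sketch shows each mechanism in isolation.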

Bibliographic Details
Main Authors: Sun, Ruoxi, Zhang, Tianzhao, Wan, Yong, Zhang, Fuping, Wei, Jianming
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9919352/
https://www.ncbi.nlm.nih.gov/pubmed/36772658
http://dx.doi.org/10.3390/s23031616
collection PubMed
id pubmed-9919352
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling Sensors (Basel), Article. Published by MDPI 2023-02-02. © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).