
Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection

Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods considering small motions and residuals that are...


Bibliographic Details
Main Authors: Suzuki, Tomoyuki, Aoki, Yoshimitsu
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9823838/
https://www.ncbi.nlm.nih.gov/pubmed/36616842
http://dx.doi.org/10.3390/s23010244
author Suzuki, Tomoyuki
Aoki, Yoshimitsu
collection PubMed
description Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods considering small motions and residuals that are less informative and assigning short code lengths to them (e.g., MPEG4) have successfully reduced the redundancy of videos. Inspired by this idea, we propose Informative Patch Selection (IPS), which efficiently reduces the inference cost by excluding redundant patches from the input of the Transformer-based video model. The redundancy of each patch is calculated from motions and residuals obtained while decoding a compressed video. The proposed method is simple and effective in that it can dynamically reduce the inference cost depending on the input without any policy model or additional loss term. Extensive experiments on action recognition demonstrated that our method could significantly improve the trade-off between the accuracy and inference cost of the Transformer-based video model. Although the method does not require any policy model or additional loss term, its performance approaches that of existing methods that do require them.
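The patch-selection idea in the description can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: each patch is summarized by a hypothetical pair (mean motion magnitude, mean absolute residual), the score is a weighted sum of the two, and patches below a fixed threshold are treated as redundant and dropped. The names `patch_scores`, `select_informative`, `residual_weight`, and `threshold` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-patch summary: (mean motion magnitude, mean absolute residual).
# In a real pipeline these values would come from the compressed-video decoder.
Patch = Tuple[float, float]


def patch_scores(patches: Sequence[Patch], residual_weight: float = 1.0) -> List[float]:
    """Score each patch; a higher score means more informative (less redundant).

    The weighted sum is an assumption, not the formula used in the paper.
    """
    return [motion + residual_weight * residual for motion, residual in patches]


def select_informative(patches: Sequence[Patch], threshold: float) -> List[int]:
    """Return the indices of patches whose score reaches the threshold."""
    scores = patch_scores(patches)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Four toy patches: two nearly static, two with noticeable motion or residual.
patches = [(0.0, 0.1), (2.5, 0.4), (0.2, 0.0), (1.0, 1.5)]
kept = select_informative(patches, threshold=1.0)
print(kept)  # indices of the patches that would be passed to the model
```

Because the number of kept patches depends on the content of each input, the cost of the downstream model varies per video without a learned policy, which is the property the description emphasizes.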
format Online
Article
Text
id pubmed-9823838
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9823838 2023-01-08 Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection Suzuki, Tomoyuki Aoki, Yoshimitsu Sensors (Basel) Article MDPI 2022-12-26 /pmc/articles/PMC9823838/ /pubmed/36616842 http://dx.doi.org/10.3390/s23010244 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9823838/
https://www.ncbi.nlm.nih.gov/pubmed/36616842
http://dx.doi.org/10.3390/s23010244