
Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the last decades. Convolutional neural networks (CNNs) are a major component of this field and have played a crucial role in the rise of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have addressed a wide range of computer vision and video/image analysis tasks, including action recognition (AR). More recently, however, following the success of the Transformer in natural language processing (NLP), Transformer-based models have begun to set new trends in vision tasks, raising the question of whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this trending topic in detail: it studies CNNs and Transformers for action recognition separately and presents a comparative study of their accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, it discusses whether CNNs or Vision Transformers will win the race.

Bibliographic Details
Main Authors: Moutik, Oumaima; Sekkat, Hiba; Tigani, Smail; Chehri, Abdellah; Saadane, Rachid; Tchakoucht, Taha Ait; Paul, Anand
Format: Online Article Text
Language: English
Published: Sensors (Basel), MDPI, 2023-01-09
Subjects: Review
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9862752/
https://www.ncbi.nlm.nih.gov/pubmed/36679530
http://dx.doi.org/10.3390/s23020734
License: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).