
Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Human Activity Recognition (HAR) is an active research area, with several Convolutional Neural Network (CNN) based feature extraction and classification methods employed for surveillance and other applications. However, accurate recognition of human activities from a sequence of frames is a challenging task due to cluttered backgrounds, different viewpoints, low resolution, and partial occlusion. Current CNN-based techniques use computationally large classifiers along with convolutional operators whose local receptive fields limit their ability to capture long-range temporal information. Therefore, in this work, we introduce a convolution-free approach for accurate HAR, which overcomes the above-mentioned problems and accurately encodes relative spatial information. In the proposed framework, frame-level features are extracted via a pretrained Vision Transformer; these features are then passed to a multilayer long short-term memory (LSTM) network to capture the long-range dependencies of the actions in surveillance videos. To validate the performance of the proposed framework, we carried out extensive experiments on the UCF50 and HMDB51 benchmark HAR datasets and improved accuracy by 0.944% and 1.414%, respectively, compared to state-of-the-art deep models.

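The two-stage pipeline described in the abstract can be illustrated compactly. The PyTorch sketch below shows the general idea: a pretrained Vision Transformer produces one feature vector per frame, and a multilayer LSTM aggregates those features over time before classification. The backbone choice (vit_b_16), hidden size, number of LSTM layers, and 16-frame clip length are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a ViT -> multilayer LSTM action recognizer, assuming
# torchvision's vit_b_16 as the frame-level feature extractor. Hyperparameters
# are placeholders, not the values used in the paper.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class ViTLSTMActionRecognizer(nn.Module):
    def __init__(self, num_classes, hidden_size=512, num_layers=2):
        super().__init__()
        # Pretrained ViT used as a frozen frame-level feature extractor.
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()  # expose the 768-d CLS embedding
        for p in self.vit.parameters():
            p.requires_grad = False
        # Multilayer LSTM captures long-range temporal dependencies across frames.
        self.lstm = nn.LSTM(768, hidden_size, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):  # clips: (B, T, 3, 224, 224)
        b, t = clips.shape[:2]
        with torch.no_grad():
            feats = self.vit(clips.flatten(0, 1))  # (B*T, 768)
        feats = feats.view(b, t, -1)               # (B, T, 768)
        out, _ = self.lstm(feats)                  # (B, T, hidden_size)
        return self.classifier(out[:, -1])         # classify from the last step

# Example: 50 action classes (UCF50), 16 sampled frames per clip.
model = ViTLSTMActionRecognizer(num_classes=50)
logits = model(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 50])
```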
Bibliographic Details
Main Authors: Hussain, Altaf; Hussain, Tanveer; Ullah, Waseem; Baik, Sung Wook
Format: Online Article Text
Language: English
Journal: Comput Intell Neurosci
Published: Hindawi, 2022-04-04
Subjects: Research Article
Collection: PubMed (PMC9001125)
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9001125/
https://www.ncbi.nlm.nih.gov/pubmed/35419045
http://dx.doi.org/10.1155/2022/3454167
License: Copyright © 2022 Altaf Hussain et al. Open access article distributed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.