Attention-Guided Disentangled Feature Aggregation for Video Object Detection
Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data implicitly introduces several challenges, such as blur, occlusion and defocus, making video object detection more challenging in comparison to still image object detection, which is performed on individual and independent images.
Main Authors: | Muralidhara, Shishir; Hashmi, Khurram Azeem; Pagani, Alain; Liwicki, Marcus; Stricker, Didier; Afzal, Muhammad Zeshan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2022 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9658927/ https://www.ncbi.nlm.nih.gov/pubmed/36366281 http://dx.doi.org/10.3390/s22218583 |
_version_ | 1784830074819706880 |
---|---|
author | Muralidhara, Shishir Hashmi, Khurram Azeem Pagani, Alain Liwicki, Marcus Stricker, Didier Afzal, Muhammad Zeshan |
author_facet | Muralidhara, Shishir Hashmi, Khurram Azeem Pagani, Alain Liwicki, Marcus Stricker, Didier Afzal, Muhammad Zeshan |
author_sort | Muralidhara, Shishir |
collection | PubMed |
description | Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data implicitly introduces several challenges, such as blur, occlusion and defocus, making video object detection more challenging in comparison to still image object detection, which is performed on individual and independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale, spatial and task-aware attention and applies it to the features extracted by the backbone network across all the frames. Subsequently, the aggregation head incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames. These include the features extracted from the disentanglement network along with the temporal features. We evaluate the proposed framework using the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 using the backbones of ResNet-50 and ResNet-101, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach. |
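The aggregation head described above weights the features of the support frames by their similarity to the target frame before summing them. The following is a minimal, self-contained sketch of that temporal-attention aggregation step, not the authors' implementation: the function names (`softmax`, `aggregate`) and the plain scaled dot-product similarity are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate(target_feat, support_feats):
    """Temporal attention sketch: score each support-frame feature by its
    scaled dot-product similarity to the target-frame feature, normalise the
    scores with softmax, and return the weighted sum of support features."""
    d = len(target_feat)
    scores = [dot(target_feat, f) / math.sqrt(d) for f in support_feats]
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, support_feats))
            for i in range(d)]
```

In the full framework these feature vectors would be the disentangled per-frame features from the backbone, and the aggregated result would feed the downstream detection head; here they are just plain lists so the weighting logic is easy to follow.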
format | Online Article Text |
id | pubmed-9658927 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9658927 2022-11-15 Attention-Guided Disentangled Feature Aggregation for Video Object Detection Muralidhara, Shishir Hashmi, Khurram Azeem Pagani, Alain Liwicki, Marcus Stricker, Didier Afzal, Muhammad Zeshan Sensors (Basel) Article Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data implicitly introduces several challenges, such as blur, occlusion and defocus, making video object detection more challenging in comparison to still image object detection, which is performed on individual and independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale, spatial and task-aware attention and applies it to the features extracted by the backbone network across all the frames. Subsequently, the aggregation head incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames. These include the features extracted from the disentanglement network along with the temporal features. We evaluate the proposed framework using the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 using the backbones of ResNet-50 and ResNet-101, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach. MDPI 2022-11-07 /pmc/articles/PMC9658927/ /pubmed/36366281 http://dx.doi.org/10.3390/s22218583 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Muralidhara, Shishir Hashmi, Khurram Azeem Pagani, Alain Liwicki, Marcus Stricker, Didier Afzal, Muhammad Zeshan Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title | Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title_full | Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title_fullStr | Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title_full_unstemmed | Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title_short | Attention-Guided Disentangled Feature Aggregation for Video Object Detection |
title_sort | attention-guided disentangled feature aggregation for video object detection |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9658927/ https://www.ncbi.nlm.nih.gov/pubmed/36366281 http://dx.doi.org/10.3390/s22218583 |
work_keys_str_mv | AT muralidharashishir attentionguideddisentangledfeatureaggregationforvideoobjectdetection AT hashmikhurramazeem attentionguideddisentangledfeatureaggregationforvideoobjectdetection AT paganialain attentionguideddisentangledfeatureaggregationforvideoobjectdetection AT liwickimarcus attentionguideddisentangledfeatureaggregationforvideoobjectdetection AT strickerdidier attentionguideddisentangledfeatureaggregationforvideoobjectdetection AT afzalmuhammadzeshan attentionguideddisentangledfeatureaggregationforvideoobjectdetection |