Cargando…

HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting

Crowd counting is a challenging task due to large perspective, density, and scale variations. CNN-based crowd counting techniques have achieved significant performance in sparse to dense environments. However, crowd counting in high perspective-varying scenes (images) is getting harder due to differ...

Descripción completa

Detalles Bibliográficos
Autores principales:	Ilyas, Naveed, Lee, Boreom, Kim, Kiseon
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	MDPI 2021
Materias:	Communication
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156381/ https://www.ncbi.nlm.nih.gov/pubmed/34067707 http://dx.doi.org/10.3390/s21103483

_version_	1783699431812497408
author	Ilyas, Naveed Lee, Boreom Kim, Kiseon
author_facet	Ilyas, Naveed Lee, Boreom Kim, Kiseon
author_sort	Ilyas, Naveed
collection	PubMed
description	Crowd counting is a challenging task due to large perspective, density, and scale variations. CNN-based crowd counting techniques have achieved significant performance in sparse to dense environments. However, crowd counting in high perspective-varying scenes (images) is getting harder due to different density levels occupied by the same number of pixels. In this way large variations for objects in the same spatial area make it difficult to count accurately. Further, existing CNN-based crowd counting methods are used to extract rich deep features; however, these features are used locally and disseminated while propagating through intermediate layers. This results in high counting errors, especially in dense and high perspective-variation scenes. Further, class-specific responses along channel dimensions are underestimated. To address these above mentioned issues, we therefore propose a CNN-based dense feature extraction network for accurate crowd counting. Our proposed model comprises three main modules: (1) backbone network, (2) dense feature extraction modules (DFEMs), and (3) channel attention module (CAM). The backbone network is used to obtain general features with strong transfer learning ability. The DFEM is composed of multiple sub-modules called dense stacked convolution modules (DSCMs), densely connected with each other. In this way features extracted from lower and middle-lower layers are propagated to higher layers through dense connections. In addition, combinations of task independent general features obtained by the former modules and task-specific features obtained by later ones are incorporated to obtain high counting accuracy in large perspective-varying scenes. Further, to exploit the class-specific response between background and foreground, CAM is incorporated at the end to obtain high-level features along channel dimensions for better counting accuracy. Moreover, we have evaluated the proposed method on three well known datasets: Shanghaitech (Part-A), Shanghaitech (Part-B), and Venice. The performance of the proposed technique justifies its relative effectiveness in terms of selected performance compared to state-of-the-art techniques.
format	Online Article Text
id	pubmed-8156381
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-81563812021-05-28 HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting Ilyas, Naveed Lee, Boreom Kim, Kiseon Sensors (Basel) Communication Crowd counting is a challenging task due to large perspective, density, and scale variations. CNN-based crowd counting techniques have achieved significant performance in sparse to dense environments. However, crowd counting in high perspective-varying scenes (images) is getting harder due to different density levels occupied by the same number of pixels. In this way large variations for objects in the same spatial area make it difficult to count accurately. Further, existing CNN-based crowd counting methods are used to extract rich deep features; however, these features are used locally and disseminated while propagating through intermediate layers. This results in high counting errors, especially in dense and high perspective-variation scenes. Further, class-specific responses along channel dimensions are underestimated. To address these above mentioned issues, we therefore propose a CNN-based dense feature extraction network for accurate crowd counting. Our proposed model comprises three main modules: (1) backbone network, (2) dense feature extraction modules (DFEMs), and (3) channel attention module (CAM). The backbone network is used to obtain general features with strong transfer learning ability. The DFEM is composed of multiple sub-modules called dense stacked convolution modules (DSCMs), densely connected with each other. In this way features extracted from lower and middle-lower layers are propagated to higher layers through dense connections. In addition, combinations of task independent general features obtained by the former modules and task-specific features obtained by later ones are incorporated to obtain high counting accuracy in large perspective-varying scenes. Further, to exploit the class-specific response between background and foreground, CAM is incorporated at the end to obtain high-level features along channel dimensions for better counting accuracy. Moreover, we have evaluated the proposed method on three well known datasets: Shanghaitech (Part-A), Shanghaitech (Part-B), and Venice. The performance of the proposed technique justifies its relative effectiveness in terms of selected performance compared to state-of-the-art techniques. MDPI 2021-05-17 /pmc/articles/PMC8156381/ /pubmed/34067707 http://dx.doi.org/10.3390/s21103483 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Communication Ilyas, Naveed Lee, Boreom Kim, Kiseon HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title	HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title_full	HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title_fullStr	HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title_full_unstemmed	HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title_short	HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting
title_sort	hadf-crowd: a hierarchical attention-based dense feature extraction network for single-image crowd counting
topic	Communication
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8156381/ https://www.ncbi.nlm.nih.gov/pubmed/34067707 http://dx.doi.org/10.3390/s21103483
work_keys_str_mv	AT ilyasnaveed hadfcrowdahierarchicalattentionbaseddensefeatureextractionnetworkforsingleimagecrowdcounting AT leeboreom hadfcrowdahierarchicalattentionbaseddensefeatureextractionnetworkforsingleimagecrowdcounting AT kimkiseon hadfcrowdahierarchicalattentionbaseddensefeatureextractionnetworkforsingleimagecrowdcounting

HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting

Ejemplares similares