
Swin-MFA: A Multi-Modal Fusion Attention Network Based on Swin-Transformer for Low-Light Image Human Segmentation



Bibliographic Details
Main Authors: Yi, Xunpeng, Zhang, Haonan, Wang, Yibo, Guo, Shujiang, Wu, Jingyi, Fan, Cien
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9413725/
https://www.ncbi.nlm.nih.gov/pubmed/36015990
http://dx.doi.org/10.3390/s22166229
author Yi, Xunpeng
Zhang, Haonan
Wang, Yibo
Guo, Shujiang
Wu, Jingyi
Fan, Cien
collection PubMed
description In recent years, image segmentation based on deep learning has been widely used in medical imaging, autonomous driving, monitoring, and security. In monitoring and security, image segmentation detects the specific location of a person and separates the person from the background so that their actions can be analyzed. Low-illumination conditions, however, pose a great challenge to traditional image-segmentation algorithms, and scenes with low light, or even no light at night, are common in monitoring and security. Against this background, this paper proposes a multi-modal fusion network based on an encoder-decoder structure. The encoder, which contains a two-branch Swin-Transformer backbone instead of a traditional convolutional neural network, fuses RGB and depth features with a multiscale fusion attention block. The decoder is also built on the Swin-Transformer backbone and is connected to the encoder through several residual connections, which are shown to improve the accuracy of the network. Furthermore, this paper is the first to propose a low-light human segmentation (LLHS) dataset for portrait segmentation, consisting of aligned depth and RGB images with fine annotations under low illuminance, captured by combining a traditional monocular camera with a depth camera based on active structured light. The network is also tested at different illumination levels. Experimental results show that the proposed network is robust for human segmentation in low-light environments with varying illumination. The mean Intersection over Union (mIoU), a metric commonly used to evaluate image-segmentation models, of Swin-MFA on the LLHS dataset is 81.0, better than those of ACNet, 3DGNN, ESANet, RedNet, and RFNet given the same depth modality in a mixed multi-modal network, and far ahead of segmentation algorithms that use only RGB features, so the method has important practical significance.
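The abstract describes two components concrete enough to illustrate. First, the multiscale fusion attention block: the two encoder branches produce RGB and depth feature maps, and an attention mechanism decides how to mix them. Below is a minimal sketch in PyTorch of one such channel-attention fusion step; the module name, channel sizes, and gating design are assumptions for illustration, not the authors' implementation (see the linked article for that).

import torch
import torch.nn as nn

class FusionAttentionBlock(nn.Module):
    """Hypothetical sketch: fuse RGB and depth feature maps with a channel-attention gate."""

    def __init__(self, channels: int):
        super().__init__()
        # Globally pool both modalities, then predict per-channel fusion weights.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Attention weights from the concatenated, globally pooled features.
        w = self.gate(self.pool(torch.cat([rgb, depth], dim=1)))
        # Weighted sum: w gates the RGB stream, (1 - w) the depth stream.
        return w * rgb + (1.0 - w) * depth

# Example: fuse 96-channel feature maps from the two encoder branches.
block = FusionAttentionBlock(96)
fused = block(torch.randn(1, 96, 56, 56), torch.randn(1, 96, 56, 56))
print(fused.shape)  # torch.Size([1, 96, 56, 56])

Second, the mIoU score of 81.0 quoted above follows the standard mean Intersection over Union definition: per class, the intersection of the predicted and ground-truth masks divided by their union, averaged over classes. The helper below computes that conventional metric (it is not code from the paper).

import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example with two classes (background = 0, person = 1).
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(round(mean_iou(pred, target, num_classes=2), 4))  # 0.5833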
format Online
Article
Text
id pubmed-9413725
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9413725 2022-08-27 Swin-MFA: A Multi-Modal Fusion Attention Network Based on Swin-Transformer for Low-Light Image Human Segmentation Yi, Xunpeng; Zhang, Haonan; Wang, Yibo; Guo, Shujiang; Wu, Jingyi; Fan, Cien Sensors (Basel) Article MDPI 2022-08-19 /pmc/articles/PMC9413725/ /pubmed/36015990 http://dx.doi.org/10.3390/s22166229 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Swin-MFA: A Multi-Modal Fusion Attention Network Based on Swin-Transformer for Low-Light Image Human Segmentation
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9413725/
https://www.ncbi.nlm.nih.gov/pubmed/36015990
http://dx.doi.org/10.3390/s22166229