
RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers

Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) in several computer vision tasks: unlike convolution, which processes the image locally, they can efficiently understand and process the image globally. ViTs outperform the convolutiona...


Bibliographic Details
Main Authors:	Ibrahem, Hatem; Salem, Ahmed; Kang, Hyun-Soo
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9143167/ https://www.ncbi.nlm.nih.gov/pubmed/35632271 http://dx.doi.org/10.3390/s22103849

author	Ibrahem, Hatem; Salem, Ahmed; Kang, Hyun-Soo
collection	PubMed
description	Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) in several computer vision tasks: unlike convolution, which processes the image locally, they can efficiently understand and process the image globally. ViTs outperform convolutional neural networks (CNNs) in accuracy on many computer vision tasks, but their speed remains an issue because of the heavy use of transformer layers, which include many fully connected layers. We therefore propose a real-time ViT-based method for monocular depth estimation (depth estimation from a single RGB image) with encoder-decoder architectures for indoor and outdoor scenes. The main architecture of the proposed method consists of a vision transformer encoder and a convolutional neural network decoder. We started by training the base vision transformer (ViT-b16) with 12 transformer layers, then reduced the number of transformer layers to six (ViT-s16, the Small ViT) and to four (ViT-t16, the Tiny ViT) to obtain real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures can learn the task of depth estimation efficiently and, by taking advantage of the multi-head self-attention module, can produce more accurate depth predictions than fully convolutional methods. We train the proposed encoder-decoder architecture end-to-end on the challenging NYU-depthV2 and CITYSCAPES benchmarks, then evaluate the trained models on the validation and test sets of the same benchmarks, showing that the method outperforms many state-of-the-art depth estimation methods while running in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, as a real-world application.
format	Online Article Text
id	pubmed-9143167
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9143167 2022-05-29 RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers Ibrahem, Hatem; Salem, Ahmed; Kang, Hyun-Soo Sensors (Basel) Article MDPI 2022-05-19 /pmc/articles/PMC9143167/ /pubmed/35632271 http://dx.doi.org/10.3390/s22103849 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9143167/ https://www.ncbi.nlm.nih.gov/pubmed/35632271 http://dx.doi.org/10.3390/s22103849
