
EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing


Bibliographic Details
Main Authors: Wang, Zeji, He, Xiaowei, Li, Yi, Chuai, Qinliang
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782848/
https://www.ncbi.nlm.nih.gov/pubmed/36560222
http://dx.doi.org/10.3390/s22249854
_version_ 1784857436363948032
author Wang, Zeji
He, Xiaowei
Li, Yi
Chuai, Qinliang
author_facet Wang, Zeji
He, Xiaowei
Li, Yi
Chuai, Qinliang
author_sort Wang, Zeji
collection PubMed
description Visual Transformers (ViTs) have shown impressive performance thanks to their powerful ability to encode spatial and channel information. MetaFormer provides a general transformer architecture consisting of a token mixer and a channel mixer, through which we can broadly understand how transformers work. It has been shown that this general architecture contributes more to ViTs' performance than the self-attention mechanism itself. Consequently, the depth-wise convolution layer (DwConv) is widely used to replace local self-attention in transformers. In this work, a pure convolutional "transformer" is designed. We rethink the difference between the operations of self-attention and DwConv and find that the self-attention layer, together with an embedding layer, unavoidably mixes channel information, whereas DwConv only mixes token information within each channel. To address this difference, we place an embedding layer before DwConv and use the combination as the token mixer to instantiate a MetaFormer block, yielding a model named EmbedFormer. Meanwhile, SEBlock is applied in the channel mixer to improve performance. On the ImageNet-1K classification task, EmbedFormer achieves a top-1 accuracy of 81.7% without additional training images, surpassing the Swin transformer by +0.4% at similar complexity. In addition, EmbedFormer is evaluated on downstream tasks, where its results consistently surpass those of PoolFormer, ResNet and DeiT. Compared with PoolFormer-S24, another instance of MetaFormer, EmbedFormer improves the scores by +3.0% box AP / +2.3% mask AP on the COCO dataset and +1.3% mIoU on ADE20K.
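
The description above specifies the block design: an embedding layer placed before a depth-wise convolution serves as the token mixer, and an SEBlock is added to the channel mixer of a MetaFormer block. The following is a minimal PyTorch sketch of how such a block might be assembled, not the authors' implementation; the 1x1-convolution embedding, the 7x7 DwConv kernel, BatchNorm as the normalization, the MLP expansion ratio of 4 and the SE reduction ratio of 4 are illustrative assumptions that do not appear in this record.

# Minimal sketch of a MetaFormer-style block in the spirit of EmbedFormer.
# Assumptions (not taken from the abstract): the "embedding layer" before
# DwConv is modeled as a 1x1 point-wise convolution, the DwConv kernel size
# is 7, BatchNorm is used for normalization, and SEBlock uses reduction 4.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise re-weighting."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))        # squeeze over spatial dims
        return x * w[:, :, None, None]         # excite: rescale channels

class EmbedFormerBlock(nn.Module):
    """MetaFormer block: embedded DwConv token mixer + SE-augmented channel MLP."""
    def __init__(self, dim, kernel_size=7, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        # Token mixer: embedding (1x1 conv) followed by a depth-wise conv.
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),                      # embedding layer
            nn.Conv2d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),         # DwConv, per-channel mixing
        )
        self.norm2 = nn.BatchNorm2d(dim)
        hidden = dim * mlp_ratio
        # Channel mixer: point-wise MLP with an SEBlock appended.
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1), nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
            SEBlock(dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))    # residual token mixing
        x = x + self.channel_mixer(self.norm2(x))  # residual channel mixing
        return x

# Example: one block on a 56x56 feature map with 64 channels.
if __name__ == "__main__":
    block = EmbedFormerBlock(dim=64)
    y = block(torch.randn(2, 64, 56, 56))
    print(y.shape)  # torch.Size([2, 64, 56, 56])

The sketch only illustrates the structural point the abstract makes: DwConv mixes tokens within each channel, while the preceding 1x1 embedding lets channels interact before spatial mixing, loosely mirroring the embedding-plus-attention pairing in ViTs.
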
format Online
Article
Text
id pubmed-9782848
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9782848 2022-12-24 EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing Wang, Zeji; He, Xiaowei; Li, Yi; Chuai, Qinliang. Sensors (Basel). Article (abstract as given in the description field above). MDPI 2022-12-15 /pmc/articles/PMC9782848/ /pubmed/36560222 http://dx.doi.org/10.3390/s22249854 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Zeji
He, Xiaowei
Li, Yi
Chuai, Qinliang
EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title_full EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title_fullStr EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title_full_unstemmed EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title_short EmbedFormer: Embedded Depth-Wise Convolution Layer for Token Mixing
title_sort embedformer: embedded depth-wise convolution layer for token mixing
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782848/
https://www.ncbi.nlm.nih.gov/pubmed/36560222
http://dx.doi.org/10.3390/s22249854
work_keys_str_mv AT wangzeji embedformerembeddeddepthwiseconvolutionlayerfortokenmixing
AT hexiaowei embedformerembeddeddepthwiseconvolutionlayerfortokenmixing
AT liyi embedformerembeddeddepthwiseconvolutionlayerfortokenmixing
AT chuaiqinliang embedformerembeddeddepthwiseconvolutionlayerfortokenmixing